starred/mumble-mumble-voip

Fork 0

mirror of https://github.com/mumble-voip/mumble.git synced 2026-03-03 00:46:56 -05:00

Change UDP protocol #1904

New issue

Closed

opened 2026-02-20 21:18:15 -05:00 by deekerman · 8 comments

deekerman commented

2026-02-20 21:18:15 -05:00

Owner

Originally created by @Krzmbrzl on GitHub (Jun 11, 2020).

Background

Our current UDP protocol is defined here: https://mumble-protocol.readthedocs.io/en/latest/voice_data.html#packet-format

The gist is that each UDP packet contains a header-byte that defines its type and then it contains the actual payload of the message.

Due to the implementation we have only 3bits for the message type (allowing for values 0-7) of which 0-4 are already in use.

The problem arises when one thinks about attempting to extend the protocol without breaking backwards compatibility.

In the particular example I wanted to add 1 byte of meta-information to a voice packet. However due to the way the protocol works, I can't just append it to the payload if the payload doesn't contain positional data (3 float values (= 12 bytes) at the end of the payload). Instead I'd have to add dummy positional data and then append my value.

This introduces needless overhead and furthermore it also doesn't solve the problem of future extensibility.

We could solve this problem by introducing a new type but that trick will only work 3 times before we exceed the 3bits limit we have.

Suggestion

Thus my suggestion is to make a breaking change to the UDP packet format in such a way that future extensions (should they happen) won't have to break backwards compatibility and won't have to introduce needless overhead.

I think we basically have 2 options for this:

Use ProtoBuf for UDP as well
Design our own message format that contains a TOC. In there we'd have to define the contents of this packet by using an ID (the type of that content) as well as an offset (where is the content inside the binary data of this message). Both could be implemented using variable integer encoding in order to not put fixed size limits on what values these can have yet again.

In order to extend the protocol, we'd then simply have to define a new content type. Older clients that don't know that type, will simply ignore it.

I'm not sure how feasible it is to wrap our audio data into a protocol buffer and how much overhead that'd introduce compared to a self-baked version... 🤔

Backwards compatibility

In order to not break older clients, we'd have to implement some logic on murmur's side that'll send old-style UDP packets to clients that don't know the new version yet.
Likewise it has to be able to understand old UDP packets as well.

Goals

With this proposal I'd like to make the UDP packet definition more flexible and thus future-proof. The cost here is of course that we'd have a breaking change now, were we might get away without having one.

What do you think about this?

/cc @davidebeatrici @Kissaki
/cc @TerryGeng (as a 3rdparty software developer)

Originally created by @Krzmbrzl on GitHub (Jun 11, 2020). ## Background Our current UDP protocol is defined here: https://mumble-protocol.readthedocs.io/en/latest/voice_data.html#packet-format The gist is that each UDP packet contains a header-byte that defines its type and then it contains the actual payload of the message. Due to the implementation we have only 3bits for the message type (allowing for values 0-7) of which 0-4 are already in use. The problem arises when one thinks about attempting to extend the protocol without breaking backwards compatibility. In the particular example I wanted to add 1 byte of meta-information to a voice packet. However due to the way the protocol works, I can't just append it to the payload if the payload doesn't contain positional data (3 float values (= 12 bytes) at the end of the payload). Instead I'd have to add dummy positional data and *then* append my value. This introduces needless overhead and furthermore it also doesn't solve the problem of future extensibility. We could solve this problem by introducing a new `type` but that trick will only work 3 times before we exceed the 3bits limit we have. ## Suggestion Thus my suggestion is to make a breaking change to the UDP packet format in such a way that future extensions (should they happen) won't have to break backwards compatibility and won't have to introduce needless overhead. I think we basically have 2 options for this: 1. Use ProtoBuf for UDP as well 2. Design our own message format that contains a TOC. In there we'd have to define the contents of this packet by using an ID (the type of that content) as well as an offset (where is the content inside the binary data of this message). Both could be implemented using variable integer encoding in order to not put fixed size limits on what values these can have yet again. In order to extend the protocol, we'd then simply have to define a new content type. Older clients that don't know that type, will simply ignore it. I'm not sure how feasible it is to wrap our audio data into a protocol buffer and how much overhead that'd introduce compared to a self-baked version... :thinking: ## Backwards compatibility In order to not break older clients, we'd have to implement some logic on murmur's side that'll send old-style UDP packets to clients that don't know the new version yet. Likewise it has to be able to understand old UDP packets as well. ## Goals With this proposal I'd like to make the UDP packet definition more flexible and thus future-proof. The cost here is of course that we'd have a breaking change now, were we might get away without having one. ---- What do you think about this? /cc @davidebeatrici @Kissaki /cc @TerryGeng (as a 3rdparty software developer)

deekerman closed this issue

2026-02-20 21:18:16 -05:00

deekerman commented

2026-02-20 21:18:19 -05:00

Author

Owner

@Johni0702 commented on GitHub (Jun 11, 2020):

However due to the way the protocol works, I can't just append it to the payload if the payload doesn't contain positional data (3 float values (= 12 bytes) at the end of the payload). Instead I'd have to add dummy positional data and then append my value.

Yes, you could. It wouldn't be particularly nice but should certainly be possible.
Given we always know the receiver of a particular UDP packet before we transmit it and always know their sender when we receive them (i.e. we never broadcast UDP packets, cause we always need to en-/decrypt them), with https://github.com/mumble-voip/mumble/issues/4224 (or just by looking at the Version message), we should always be able to know whether to expect/include your new value or not (similar to how we only send positional audio to those clients that have the same positional audio context set).

If we repeat that procedure a few times, the respective code might get a bit messy though. Using protobuf for UDP would prevent that.
Similarly it'd make sense to use protobuf if we want to add many optional (i.e. may be in one packet but not necessarily the next one) fields.

The cost here is of course that we'd have a breaking change now

As explained above, this would not have to be a breaking change (well, it would be breaking for the UDP protocol itself but not for Mumble as a whole). IMO the cost here is "merely" implementation of the new protocol and maintenance of both versions (at least until the old one has practically died out, then we could make a breaking change by dropping it).

Design our own message format that contains a TOC. In there we'd have to define the contents of this packet by using an ID (the type of that content) as well as an offset (where is the content inside the binary data of this message). Both could be implemented using variable integer encoding in order to not put fixed size limits on what values these can have yet again.

That sounds pretty much like DIY protobuf, so I'd instead just go with the official one.
Protobuf's wire format is approximately:

<field id + type><length><data><field id + type><length><data>…

Both field id + type and length are varint-encoded (iirc not the same varint as mumble uses though!) and the latter may be omitted depending on the type (e.g. if the type of data is varint, its length is obvious).
See https://developers.google.com/protocol-buffers/docs/encoding

I'm not sure how feasible it is to wrap our audio data into a protocol buffer and how much overhead that'd introduce compared to a self-baked version... thinking

The issue I see with protobuf is that it adds a byte of overhead for each field even if we know it to always be present. That'd be 5 bytes of useless overhead (target, codec, session id, sequence number, payload).
One way to get rid of most of them is to just keep the current header, session id, sequence number block and only after that use protobuf for the remainder of the message:

message UDPPacket {
    bytes positional_data = 1;
    bytes opus = 2;
    uint64 your_new_thing = 3;

    // intentionally put these at later field indices to keep fields with single-byte varint encoding available for frequently used fields
    uint64 ping = 50;
    // repeated might make more sense for some of these (haven't checked their current encoding)
    bytes speex = 51;
    bytes celt_alpha = 52;
    bytes celt_beta = 53;
}

(from memory, didn't try to compile)
This way, there's one byte overhead for a simple opus packet (two bytes for older codec types) and an additional two bytes if it includes positional audio. That would seem acceptable to me.
We could also keep the encoded audio data (here opus, speex, etc. fields) in the fixed header block but given there's only a single byte of overhead and we'll run out of bits in the header soon-ish anyway, I've included them in the protobuf message, not sure if that byte's worth the additional special-casing for old codecs.

@Johni0702 commented on GitHub (Jun 11, 2020): > However due to the way the protocol works, I can't just append it to the payload if the payload doesn't contain positional data (3 float values (= 12 bytes) at the end of the payload). Instead I'd have to add dummy positional data and then append my value. Yes, you could. It wouldn't be particularly nice but should certainly be possible. Given we always know the receiver of a particular UDP packet before we transmit it and always know their sender when we receive them (i.e. we never broadcast UDP packets, cause we always need to en-/decrypt them), with https://github.com/mumble-voip/mumble/issues/4224 (or just by looking at the Version message), we should always be able to know whether to expect/include your new value or not (similar to how we only send positional audio to those clients that have the same positional audio context set). If we repeat that procedure a few times, the respective code might get a bit messy though. Using protobuf for UDP would prevent that. Similarly it'd make sense to use protobuf if we want to add many optional (i.e. may be in one packet but not necessarily the next one) fields. > The cost here is of course that we'd have a breaking change now As explained above, this would not have to be a breaking change (well, it would be breaking for the UDP protocol itself but not for Mumble as a whole). IMO the cost here is "merely" implementation of the new protocol and maintenance of both versions (at least until the old one has practically died out, then we could make a breaking change by dropping it). > Design our own message format that contains a TOC. In there we'd have to define the contents of this packet by using an ID (the type of that content) as well as an offset (where is the content inside the binary data of this message). Both could be implemented using variable integer encoding in order to not put fixed size limits on what values these can have yet again. That sounds pretty much like DIY protobuf, so I'd instead just go with the official one. Protobuf's wire format is approximately: ``` <field id + type><length><data><field id + type><length><data>… ``` Both `field id + type` and `length` are varint-encoded (iirc not the same varint as mumble uses though!) and the latter may be omitted depending on the type (e.g. if the type of data is varint, its length is obvious). See https://developers.google.com/protocol-buffers/docs/encoding > I'm not sure how feasible it is to wrap our audio data into a protocol buffer and how much overhead that'd introduce compared to a self-baked version... thinking The issue I see with protobuf is that it adds a byte of overhead for each field even if we know it to always be present. That'd be 5 bytes of useless overhead (target, codec, session id, sequence number, payload). One way to get rid of most of them is to just keep the current header, session id, sequence number block and only after that use protobuf for the remainder of the message: ```protobuf message UDPPacket { bytes positional_data = 1; bytes opus = 2; uint64 your_new_thing = 3; // intentionally put these at later field indices to keep fields with single-byte varint encoding available for frequently used fields uint64 ping = 50; // repeated might make more sense for some of these (haven't checked their current encoding) bytes speex = 51; bytes celt_alpha = 52; bytes celt_beta = 53; } ``` (from memory, didn't try to compile) This way, there's one byte overhead for a simple opus packet (two bytes for older codec types) and an additional two bytes if it includes positional audio. That would seem acceptable to me. We could also keep the encoded audio data (here `opus`, `speex`, etc. fields) in the fixed header block but given there's only a single byte of overhead and we'll run out of bits in the header soon-ish anyway, I've included them in the protobuf message, not sure if that byte's worth the additional special-casing for old codecs.

deekerman commented

2026-02-20 21:18:19 -05:00

Author

Owner

@Johni0702 commented on GitHub (Jun 11, 2020):

Updated my comment about overhead (I forgot that there are actually quite a few header fields in the packets).

@Johni0702 commented on GitHub (Jun 11, 2020): Updated my comment about overhead (I forgot that there are actually quite a few header fields in the packets).

deekerman commented

2026-02-20 21:18:19 -05:00

Author

Owner

@TerryGeng commented on GitHub (Jun 11, 2020):

Thanks for cc-ing me :). I think @Johni0702 had made a good point. I'm kind of suspicious about using protobuf for the audio data. The current design of this UDP protocol has lasted for years. There must be a reason why people use this DIY structure instead of protobuf itself.

It is true that using protobuf will grant us more flexibility over the structure of the packet. But at the end of the day, the question is do we really need this flexibility? I think the content of a packet shouldn't change in the middle of a session, so is there a way we can negotiate the structure of the packet first? E.g. a client tell the server it doesn't use the positional audio field, but it will use the other fields. Then the server won't expect positional audio data at that position inside the packet.

Another question is, do all audio packets need to contain these metadata? What exactly does the metadata contain? I remember @Krzmbrzl would like to append a field of "voice target". I think that is something that can be cached on the server. Assume the client has told the server the last field is the voice target. In that case, only the first packet needs to contain this field. Packets come after it can just truncate that field and the server immediately knows this field is the same as the last packet with that field.

In this way, we can have almost no overhead at all.

@TerryGeng commented on GitHub (Jun 11, 2020): Thanks for cc-ing me :). I think @Johni0702 had made a good point. I'm kind of suspicious about using protobuf for the audio data. The current design of this UDP protocol has lasted for years. There must be a reason why people use this DIY structure instead of protobuf itself. It is true that using protobuf will grant us more flexibility over the structure of the packet. But at the end of the day, the question is do we really need this flexibility? I think the content of a packet shouldn't change in the middle of a session, so is there a way we can negotiate the structure of the packet first? E.g. a client tell the server it doesn't use the positional audio field, but it will use the other fields. Then the server won't expect positional audio data at that position inside the packet. Another question is, do all audio packets need to contain these metadata? What exactly does the metadata contain? I remember @Krzmbrzl would like to append a field of "voice target". I think that is something that can be cached on the server. Assume the client has told the server the last field is the voice target. In that case, only the first packet needs to contain this field. Packets come after it can just truncate that field and the server immediately knows this field is the same as the last packet with that field. In this way, we can have almost no overhead at all.

deekerman commented

2026-02-20 21:18:19 -05:00

Author

Owner

@Krzmbrzl commented on GitHub (Jun 12, 2020):

I think the content of a packet shouldn't change in the middle of a session, so is there a way we can negotiate the structure of the packet first?

Absolutely, but that'd create a lot of maintenance overhead and would also be very painful for other server implementations to implement. Effectively they'd have to explicitly support each and every packet variation that exists in order to not refuse the connection for some clients.

E.g. a client tell the server it doesn't use the positional audio field, but it will use the other fields. Then the server won't expect positional audio data at that position inside the packet.

Well this is indeed something that can change mid-session. It could of course be communicated to the server first, but this also involves overhead in the implementation.

Assume the client has told the server the last field is the voice target. In that case, only the first packet needs to contain this field. Packets come after it can just truncate that field and the server immediately knows this field is the same as the last packet with that field.

IN my specific application the problem is not to get the voice target to the server (this is already done) but the other way around.

Another question is, do all audio packets need to contain these metadata? What exactly does the metadata contain? I remember @Krzmbrzl would like to append a field of "voice target".

Well the problem is that you can't be sure that the "first" packet is even received by the client (remember that UDP is a lossy protocol).
Besides: The type of data that could be appended shouldn't necessarily matter for this discussion here as the point is that if we use a more flexible approach, it doesn't matter (on the protocol side at least - neglecting overhead induced by additional packet size, etc.).

I'm kind of suspicious about using protobuf for the audio data. The current design of this UDP protocol has lasted for years. There must be a reason why people use this DIY structure instead of protobuf itself.

I guess by using a custom format you save a few bytes of overhead. However you gain a stable protocol for that that has been tested and used for years and that can be very easily adapted by 3rdParty software.
If you're using a DIY solution, you can potentially save a few bytes per packet, but have to write the implementation all by yourself which might be buggy. Furthermore any 3rdParty that wants to use another programming language would have to re-implement the same thing again.
Plus there might be edge cases you didn't think of when designing the DIY version that make it lex flexible than originally intended.

Furthermore I think bandwidth is not that much of a problem anymore than it used to be. A few bytes of overhead compared to the few hundred bytes of audio data that is transmitted, seems actually somewhat negligible 🤔

@Krzmbrzl commented on GitHub (Jun 12, 2020): > I think the content of a packet shouldn't change in the middle of a session, so is there a way we can negotiate the structure of the packet first? Absolutely, but that'd create a lot of maintenance overhead and would also be very painful for other server implementations to implement. Effectively they'd have to explicitly support each and every packet variation that exists in order to not refuse the connection for some clients. > E.g. a client tell the server it doesn't use the positional audio field, but it will use the other fields. Then the server won't expect positional audio data at that position inside the packet. Well this is indeed something that can change mid-session. It could of course be communicated to the server first, but this also involves overhead in the implementation. > Assume the client has told the server the last field is the voice target. In that case, only the first packet needs to contain this field. Packets come after it can just truncate that field and the server immediately knows this field is the same as the last packet with that field. IN my specific application the problem is not to get the voice target to the server (this is already done) but the other way around. > Another question is, do all audio packets need to contain these metadata? What exactly does the metadata contain? I remember @Krzmbrzl would like to append a field of "voice target". Well the problem is that you can't be sure that the "first" packet is even received by the client (remember that UDP is a lossy protocol). Besides: The type of data that could be appended shouldn't necessarily matter for this discussion here as the point is that if we use a more flexible approach, it doesn't matter (on the protocol side at least - neglecting overhead induced by additional packet size, etc.). > I'm kind of suspicious about using protobuf for the audio data. The current design of this UDP protocol has lasted for years. There must be a reason why people use this DIY structure instead of protobuf itself. I guess by using a custom format you save a few bytes of overhead. However you gain a stable protocol for that that has been tested and used for years and that can be very easily adapted by 3rdParty software. If you're using a DIY solution, you can potentially save a few bytes per packet, but have to write the implementation all by yourself which might be buggy. Furthermore any 3rdParty that wants to use another programming language would have to re-implement the same thing again. Plus there might be edge cases you didn't think of when designing the DIY version that make it lex flexible than originally intended. Furthermore I think bandwidth is not that much of a problem anymore than it used to be. A few bytes of overhead compared to the few hundred bytes of audio data that is transmitted, seems actually somewhat negligible :thinking:

deekerman commented

2026-02-20 21:18:19 -05:00

Author

Owner

@TerryGeng commented on GitHub (Jun 12, 2020):

I don't really think it is a neglectable overhead... I did a little test just now. Currently, my bot sends music packet with size varies from 200 to 500 bytes, and mumble's voice packet has a fixed length of 160 bytes, while Plumble on Android's packet is 100 bytes. Adding a few more bytes can increase network usage by 1% or more in the worst case. And, the maximum timespan of an opus packet is 60ms. Do we have to repeatedly send the metadata every 60ms? It's about repeating something 30 times per second. For a typical packet of 20ms, that means we repeat it 100 times per second.

But I also acknowledge that the missing packet could be a problem. Anyway, I think we need to discuss types of metadata to send with the audio packet first. The rule of thumb would be, we attach as least data as possible to the voice packet. Attaching a protobuf section that is optional (can be truncated) after the current packet is a way out (and merged the positional information into it as well).

We can even design another message type to handle these requirements.
But if that section comes with all packets, I think we need to be vigilant, since we don't really want the packet grows bigger and bigger in the future. :/

Edit: And another way is we repeat the metadata section once per 100ms, or even at s larger interval. Anyway, if some metadata is really urgent and needs to be sent as frequently as possible, we can always go back to one per packet.

@TerryGeng commented on GitHub (Jun 12, 2020): I don't really think it is a neglectable overhead... I did a little test just now. Currently, my bot sends music packet with size varies from 200 to 500 bytes, and mumble's voice packet has a fixed length of 160 bytes, while Plumble on Android's packet is 100 bytes. Adding a few more bytes can increase network usage by 1% or more in the worst case. And, the maximum timespan of an opus packet is 60ms. Do we have to repeatedly send the metadata every 60ms? It's about repeating something 30 times per second. For a typical packet of 20ms, that means we repeat it 100 times per second. But I also acknowledge that the missing packet could be a problem. Anyway, I think we need to discuss types of metadata to send with the audio packet first. The rule of thumb would be, we attach as least data as possible to the voice packet. Attaching a protobuf section that is optional (can be truncated) after the current packet is a way out (and merged the positional information into it as well). We can even design another message type to handle these requirements. But if that section comes with all packets, I think we need to be vigilant, since we don't really want the packet grows bigger and bigger in the future. :/ Edit: And another way is we repeat the metadata section once per 100ms, or even at s larger interval. Anyway, if some metadata is really urgent and needs to be sent as frequently as possible, we can always go back to one per packet.

deekerman commented

2026-02-20 21:18:19 -05:00

Author

Owner

@Krzmbrzl commented on GitHub (Jun 12, 2020):

The rule of thumb would be, we attach as least data as possible to the voice packet.

Of course

Attaching a protobuf section that is optional (can be truncated) after the current packet is a way out.

But leaves us with a potential breaking change in the future once we exceed the capabilities of the current package format.

Attaching a protobuf section that is optional (can be truncated) after the current packet is a way out.

Not really. For messages that do not contain positional data, you'd still have to append 12 bytes (3 floats) to the data before the ProtoBuf section as there is no other way of knowing whether the 12 bytes following the audio data is the ProtoBuf message or whether that should be the positional audio.

That's the original problem I was trying to frame in my initial post.

Therefore if the usage of ProtoBuf introduces, let's say 5 bytes of overhead, we'd still save 7 bytes compared to the "append it to the current packet format" approach.

But if that section comes with all packets, I think we need to be vigilant, since we don't really want the packet grows bigger and bigger in the future. :/

Well of course we should be careful with adding stuff to the UDP packets and what is added should always be as little as possible.

And, the maximum timespan of an opus packet is 60ms. Do we have to repeatedly send the metadata every 60ms? It's about repeating something 30 times per second. For a typical packet of 20ms, that means we repeat it 100 times per second.

Probably not but as I said before: It's the only way to make sure that the data arrives with the audio as you can't rely on single UDP packets to get to their destination

@Krzmbrzl commented on GitHub (Jun 12, 2020): > The rule of thumb would be, we attach as least data as possible to the voice packet. Of course > Attaching a protobuf section that is optional (can be truncated) after the current packet is a way out. But leaves us with a potential breaking change in the future once we exceed the capabilities of the current package format. > Attaching a protobuf section that is optional (can be truncated) after the current packet is a way out. Not really. For messages that do not contain positional data, you'd still have to append 12 bytes (3 floats) to the data before the ProtoBuf section as there is no other way of knowing whether the 12 bytes following the audio data is the ProtoBuf message or whether that should be the positional audio. That's the original problem I was trying to frame in my initial post. Therefore if the usage of ProtoBuf introduces, let's say 5 bytes of overhead, we'd still save 7 bytes compared to the "append it to the current packet format" approach. > But if that section comes with all packets, I think we need to be vigilant, since we don't really want the packet grows bigger and bigger in the future. :/ Well of course we should be careful with adding stuff to the UDP packets and what is added should always be as little as possible. > And, the maximum timespan of an opus packet is 60ms. Do we have to repeatedly send the metadata every 60ms? It's about repeating something 30 times per second. For a typical packet of 20ms, that means we repeat it 100 times per second. Probably not but as I said before: It's the only way to make sure that the data arrives with the audio as you can't rely on single UDP packets to get to their destination

deekerman commented

2026-02-20 21:18:19 -05:00

Author

Owner

@Krzmbrzl commented on GitHub (Jun 12, 2020):

Having looked into ProtocolBuffers a bit more, I'm also not really convinced that a custom implementation would actually gain a lot in terms of size-savings. The protocol is pretty mature and compresses data into a single byte were possible.

It is of course more overhead than not having any flexibility at all and instead using a fixed packet format that the client implicitly has to know. But I don't think that you can have a flexible approach without a little overhead in the message size

@Krzmbrzl commented on GitHub (Jun 12, 2020): Having looked into ProtocolBuffers a bit more, I'm also not really convinced that a custom implementation would actually gain a lot in terms of size-savings. The protocol is pretty mature and compresses data into a single byte were possible. It is of course more overhead than not having any flexibility at all and instead using a fixed packet format that the client implicitly has to know. But I don't think that you can have a flexible approach without a little overhead in the message size

deekerman commented

2026-02-20 21:18:20 -05:00

Author

Owner

@Krzmbrzl commented on GitHub (Jul 8, 2020):

In our last Team-meeting we have discussed this issue and concluded that moving the UDP channel to ProtoBuf messages as well has several advantages while the drawbacks (additional overhead of a few bytes) are comparatively small.

Thus I'll close this discussion issue and create a proper issue for it instead.

@Krzmbrzl commented on GitHub (Jul 8, 2020): In our last Team-meeting we have discussed this issue and concluded that moving the UDP channel to ProtoBuf messages as well has several advantages while the drawbacks (additional overhead of a few bytes) are comparatively small. Thus I'll close this discussion issue and create a proper issue for it instead.