mirror of
https://github.com/mumble-voip/mumble.git
synced 2026-03-03 00:46:56 -05:00
Change UDP protocol #1904
Labels
No labels
GlobalShortcuts
Hacktoberfest
accessibility
acl
asio
audio
bonjour
bsd
bug
build
certificate
ci
client
code
documentation
external-bug
feature-request
gRPC
github
good first issue
help wanted
help-needed
ice
installer
linux
macOS
needs-ckeck-with-latest-version
needs-more-input
overlay
positional audio
priority/P0 - Blocker
priority/P1 - Critical
priority/P2 - Important
priority/P3 - Somewhat important
priority/P4 - Low
public-server-registration
qt
recording
release-management
server
stale-no-response
stale-support
support
task
test
theme
translation
triage
ui
windows
wontfix
x64
No milestone
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
starred/mumble-mumble-voip#1904
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Originally created by @Krzmbrzl on GitHub (Jun 11, 2020).
Background
Our current UDP protocol is defined here: https://mumble-protocol.readthedocs.io/en/latest/voice_data.html#packet-format
The gist is that each UDP packet contains a header-byte that defines its type and then it contains the actual payload of the message.
Due to the implementation we have only 3bits for the message type (allowing for values 0-7) of which 0-4 are already in use.
The problem arises when one thinks about attempting to extend the protocol without breaking backwards compatibility.
In the particular example I wanted to add 1 byte of meta-information to a voice packet. However due to the way the protocol works, I can't just append it to the payload if the payload doesn't contain positional data (3 float values (= 12 bytes) at the end of the payload). Instead I'd have to add dummy positional data and then append my value.
This introduces needless overhead and furthermore it also doesn't solve the problem of future extensibility.
We could solve this problem by introducing a new
typebut that trick will only work 3 times before we exceed the 3bits limit we have.Suggestion
Thus my suggestion is to make a breaking change to the UDP packet format in such a way that future extensions (should they happen) won't have to break backwards compatibility and won't have to introduce needless overhead.
I think we basically have 2 options for this:
In order to extend the protocol, we'd then simply have to define a new content type. Older clients that don't know that type, will simply ignore it.
I'm not sure how feasible it is to wrap our audio data into a protocol buffer and how much overhead that'd introduce compared to a self-baked version... 🤔
Backwards compatibility
In order to not break older clients, we'd have to implement some logic on murmur's side that'll send old-style UDP packets to clients that don't know the new version yet.
Likewise it has to be able to understand old UDP packets as well.
Goals
With this proposal I'd like to make the UDP packet definition more flexible and thus future-proof. The cost here is of course that we'd have a breaking change now, were we might get away without having one.
What do you think about this?
/cc @davidebeatrici @Kissaki
/cc @TerryGeng (as a 3rdparty software developer)
@Johni0702 commented on GitHub (Jun 11, 2020):
Yes, you could. It wouldn't be particularly nice but should certainly be possible.
Given we always know the receiver of a particular UDP packet before we transmit it and always know their sender when we receive them (i.e. we never broadcast UDP packets, cause we always need to en-/decrypt them), with https://github.com/mumble-voip/mumble/issues/4224 (or just by looking at the Version message), we should always be able to know whether to expect/include your new value or not (similar to how we only send positional audio to those clients that have the same positional audio context set).
If we repeat that procedure a few times, the respective code might get a bit messy though. Using protobuf for UDP would prevent that.
Similarly it'd make sense to use protobuf if we want to add many optional (i.e. may be in one packet but not necessarily the next one) fields.
As explained above, this would not have to be a breaking change (well, it would be breaking for the UDP protocol itself but not for Mumble as a whole). IMO the cost here is "merely" implementation of the new protocol and maintenance of both versions (at least until the old one has practically died out, then we could make a breaking change by dropping it).
That sounds pretty much like DIY protobuf, so I'd instead just go with the official one.
Protobuf's wire format is approximately:
Both
field id + typeandlengthare varint-encoded (iirc not the same varint as mumble uses though!) and the latter may be omitted depending on the type (e.g. if the type of data is varint, its length is obvious).See https://developers.google.com/protocol-buffers/docs/encoding
The issue I see with protobuf is that it adds a byte of overhead for each field even if we know it to always be present. That'd be 5 bytes of useless overhead (target, codec, session id, sequence number, payload).
One way to get rid of most of them is to just keep the current header, session id, sequence number block and only after that use protobuf for the remainder of the message:
(from memory, didn't try to compile)
This way, there's one byte overhead for a simple opus packet (two bytes for older codec types) and an additional two bytes if it includes positional audio. That would seem acceptable to me.
We could also keep the encoded audio data (here
opus,speex, etc. fields) in the fixed header block but given there's only a single byte of overhead and we'll run out of bits in the header soon-ish anyway, I've included them in the protobuf message, not sure if that byte's worth the additional special-casing for old codecs.@Johni0702 commented on GitHub (Jun 11, 2020):
Updated my comment about overhead (I forgot that there are actually quite a few header fields in the packets).
@TerryGeng commented on GitHub (Jun 11, 2020):
Thanks for cc-ing me :). I think @Johni0702 had made a good point. I'm kind of suspicious about using protobuf for the audio data. The current design of this UDP protocol has lasted for years. There must be a reason why people use this DIY structure instead of protobuf itself.
It is true that using protobuf will grant us more flexibility over the structure of the packet. But at the end of the day, the question is do we really need this flexibility? I think the content of a packet shouldn't change in the middle of a session, so is there a way we can negotiate the structure of the packet first? E.g. a client tell the server it doesn't use the positional audio field, but it will use the other fields. Then the server won't expect positional audio data at that position inside the packet.
Another question is, do all audio packets need to contain these metadata? What exactly does the metadata contain? I remember @Krzmbrzl would like to append a field of "voice target". I think that is something that can be cached on the server. Assume the client has told the server the last field is the voice target. In that case, only the first packet needs to contain this field. Packets come after it can just truncate that field and the server immediately knows this field is the same as the last packet with that field.
In this way, we can have almost no overhead at all.
@Krzmbrzl commented on GitHub (Jun 12, 2020):
Absolutely, but that'd create a lot of maintenance overhead and would also be very painful for other server implementations to implement. Effectively they'd have to explicitly support each and every packet variation that exists in order to not refuse the connection for some clients.
Well this is indeed something that can change mid-session. It could of course be communicated to the server first, but this also involves overhead in the implementation.
IN my specific application the problem is not to get the voice target to the server (this is already done) but the other way around.
Well the problem is that you can't be sure that the "first" packet is even received by the client (remember that UDP is a lossy protocol).
Besides: The type of data that could be appended shouldn't necessarily matter for this discussion here as the point is that if we use a more flexible approach, it doesn't matter (on the protocol side at least - neglecting overhead induced by additional packet size, etc.).
I guess by using a custom format you save a few bytes of overhead. However you gain a stable protocol for that that has been tested and used for years and that can be very easily adapted by 3rdParty software.
If you're using a DIY solution, you can potentially save a few bytes per packet, but have to write the implementation all by yourself which might be buggy. Furthermore any 3rdParty that wants to use another programming language would have to re-implement the same thing again.
Plus there might be edge cases you didn't think of when designing the DIY version that make it lex flexible than originally intended.
Furthermore I think bandwidth is not that much of a problem anymore than it used to be. A few bytes of overhead compared to the few hundred bytes of audio data that is transmitted, seems actually somewhat negligible 🤔
@TerryGeng commented on GitHub (Jun 12, 2020):
I don't really think it is a neglectable overhead... I did a little test just now. Currently, my bot sends music packet with size varies from 200 to 500 bytes, and mumble's voice packet has a fixed length of 160 bytes, while Plumble on Android's packet is 100 bytes. Adding a few more bytes can increase network usage by 1% or more in the worst case. And, the maximum timespan of an opus packet is 60ms. Do we have to repeatedly send the metadata every 60ms? It's about repeating something 30 times per second. For a typical packet of 20ms, that means we repeat it 100 times per second.
But I also acknowledge that the missing packet could be a problem. Anyway, I think we need to discuss types of metadata to send with the audio packet first. The rule of thumb would be, we attach as least data as possible to the voice packet. Attaching a protobuf section that is optional (can be truncated) after the current packet is a way out (and merged the positional information into it as well).
We can even design another message type to handle these requirements.
But if that section comes with all packets, I think we need to be vigilant, since we don't really want the packet grows bigger and bigger in the future. :/
Edit: And another way is we repeat the metadata section once per 100ms, or even at s larger interval. Anyway, if some metadata is really urgent and needs to be sent as frequently as possible, we can always go back to one per packet.
@Krzmbrzl commented on GitHub (Jun 12, 2020):
Of course
But leaves us with a potential breaking change in the future once we exceed the capabilities of the current package format.
Not really. For messages that do not contain positional data, you'd still have to append 12 bytes (3 floats) to the data before the ProtoBuf section as there is no other way of knowing whether the 12 bytes following the audio data is the ProtoBuf message or whether that should be the positional audio.
That's the original problem I was trying to frame in my initial post.
Therefore if the usage of ProtoBuf introduces, let's say 5 bytes of overhead, we'd still save 7 bytes compared to the "append it to the current packet format" approach.
Well of course we should be careful with adding stuff to the UDP packets and what is added should always be as little as possible.
Probably not but as I said before: It's the only way to make sure that the data arrives with the audio as you can't rely on single UDP packets to get to their destination
@Krzmbrzl commented on GitHub (Jun 12, 2020):
Having looked into ProtocolBuffers a bit more, I'm also not really convinced that a custom implementation would actually gain a lot in terms of size-savings. The protocol is pretty mature and compresses data into a single byte were possible.
It is of course more overhead than not having any flexibility at all and instead using a fixed packet format that the client implicitly has to know. But I don't think that you can have a flexible approach without a little overhead in the message size
@Krzmbrzl commented on GitHub (Jul 8, 2020):
In our last Team-meeting we have discussed this issue and concluded that moving the UDP channel to ProtoBuf messages as well has several advantages while the drawbacks (additional overhead of a few bytes) are comparatively small.
Thus I'll close this discussion issue and create a proper issue for it instead.