This is a workaround for #19178 - but it does not fix the issue. The
caption recording events with invalid empty `<locale/>` are simply
dropped with a warning message, to allow the recording and any valid
caption streams present to be processed.
The indexes returned in recording events from BBB refer to positions
within a UTF-16 encoded string. Rather than attempt to untangle this in
the server (which might have a performance cost), it's easier to switch
the caption processing code to operate in UTF-16 encoding as well to
make it work consistently.
The PyICU library provides a UnicodeString type which is a UTF-16 string
similar to Java and JavaScript, but which supports all the python
indexing methods. It's fairly straightforwards to swap it in in place of
the types used previously, and works natively as an input to the ICU
line break iterator too.
Fixes#10531
Write a tool that generates the poll svg images directly from the
BBB poll description. This avoids the issues with special characters
in the gnuplot labels, and gives us a lot more flexibility in how
the polls are formatted and styled.
If you're inserting at position 0 (and there was no previous deleted text
from that position), you can't use the timestamp from the previous character
position, since there's no previous character. Use the timestamp of the
following character instead.
Current BBB client is generating invalid indexes when characters are
inserted with an IME - the edit indexes are as if the preedit text was
being removed, but the preedit text was never sent in the first place.
For now, just don't crash if there's an edit that would remove text
which is past the end of known text. The result might be broken, but
it won't prevent the rest of the recording from working.
I've had a chance to see how this script behaves with actual live
caption tracks now, and there's room for improvement. In particular,
it often generates cues that overlap - the next one appears before the
previous one disappears. The browser player position handles this
really poorly, and it's nearly unreadable.
The solution is to, if two cues would overlap, merge them into a
single multiline cue that displays for the full time. This is a lot
easier to read.
Some extra code is added to de-overlap any remaining cues (e.g. if
there's a third cue that would also overlap). This will reduce the
time that the earlier cue gets shown below my preferred minimum, but
not really much we can do about that if people are talking/typing
quickly.
The code can easily be tweaked to set a different number of maximum
lines per cue if desired.
When the next line break after a wrap induced due to line length was a
hard line break, the hard line break would override the soft line break,
resulting in over-long lines, and the last word would be repeated on the
last line.
Move the hard break code into an else branch so it's only applied if we
haven't already done a word wrap on this line.