VLC Output Thread Performance Profiling

Over the last couple of years I’ve done a number of projects where the client approached me with a question along the lines of:

“Why do I get severe video stuttering and dropped frames in VLC when trying to play stream XYZ?”

In some cases it was an issue specific to a particular feature (such as closed captions or subtitles); in other cases the issue was specific to a particular platform (e.g. VDPAU playback, VA-API, low-end platforms with slow CPUs, etc.).

Over time I developed a bit of methodology for debugging such issues, which I will share here.

First off, how do we objectively measure the performance? At a high level, we need to be able to reliably determine how bad the performance is, and be able to repeat those same benchmarks under different conditions. This boils down to two challenges:

  1. How do we consistently reproduce the behavior in a manner that allows for comparison? This usually involves deciding on a single test clip that can be used over and over. Running the same clip multiple times in an identical environment should produce consistent results, and running the same clip under different conditions will allow you to assess the effects of those conditions. In some cases this isn’t too hard: you can simply use a transport stream and the file input. However, there are cases where it can be much more difficult (for example, when performance is fine reading from a file but poor when reading from some other source, such as a DVB tuner or an RTP stream).
  2. What unit of measure do we use to assess performance? Are we measuring frames dropped per second? The total number of frames dropped? CPU load? Having a concrete, objective measure will allow you to experiment with different approaches and assess whether they are an improvement or a regression, and by how much (see the sketch right after this list for one way to extract such a measure).
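
As an illustration of the second point: one simple, objective measure is to count how many late-picture warnings VLC emits while playing the reference clip (the warnings themselves are shown a bit further below). Here is a minimal sketch of that idea; the program, its name and the exact invocation are just an illustration of the technique, not something that ships with VLC.

/*
 * count_late.c - hypothetical helper that counts the "picture is too late
 * to be displayed" warnings in a VLC log, giving an objective
 * late/dropped-picture count for a single run of a reference clip.
 *
 * Build: cc -o count_late count_late.c
 * Usage: vlc -vv your-reference-clip.ts 2>&1 | ./count_late
 */
#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[4096];
    unsigned long late = 0;

    while (fgets(line, sizeof(line), stdin) != NULL) {
        if (strstr(line, "picture is too late to be displayed") != NULL)
            late++;
    }

    printf("late pictures: %lu\n", late);
    return 0;
}

Comparing that single number across runs of the same clip, under different conditions or after code changes, gives you the improvement-or-regression answer the second question is after.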

Now, because of VLC’s realtime nature, performance profiling can be unexpectedly difficult. Applications that play audio/video make an active effort to keep up with realtime, which can mean dropping frames or, in some cases, speeding up or slowing down the audio/video. As a result, simply measuring CPU usage is inadequate, and profiling tools that cause the application to run slower than realtime can skew the results. If VLC is dropping 50% of its frames to keep playing audio/video in realtime, the CPU usage is naturally not going to reflect the actual load needed to play back the full stream.

Early on I had a situation where I was thrilled to see CPU usage drop from 100% to 60% after making a code change, only to find that in reality the CPU performance improvement was because VLC started to drop 40% of the video frames rather than processing/rendering them.

In determining the source of stuttering video, an end user is likely to only ever see one of a few errors/warnings in the log:

[b1513a60] core video output warning: picture is too late to be displayed (missing 89 ms)
[b1513a60] core video output warning: picture is too late to be displayed (missing 56 ms)
[b1513a60] core video output warning: picture is too late to be displayed (missing 23 ms)

Without adding additional instrumentation to the source code, there is no real way to determine *why* video frames are being dropped. Even worse, in some cases only a small percentage of frames are being dropped, and hence it’s difficult to tell what went wrong while processing the frame immediately prior to the one being dropped. Why did *that* frame take as long as it did to process (causing the next frame to get dropped) when the other 90% of the frames were processed with plenty of time to spare?

As a result, I cooked up some patches that do exactly that. The patch in question records the time at a series of key points in the video output thread while processing frames. This output gets logged to a CSV file that can then be processed by external tools. In my case I wrote some Perl scripts which munge the data, perform some additional calculations, and use GNUPlot to visualize the results. This allows me to see how long it took to process every single frame of video, and, for each frame, where the time was spent.
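
To give a flavor of what such instrumentation looks like, here is a minimal sketch of the general technique, assuming nothing about VLC’s internals: timestamp a handful of key points as each frame is processed and emit one CSV row per frame. The function names (profile_begin, profile_mark, profile_flush) and the structure are hypothetical illustrations, not the actual patch, which hooks the equivalent points inside VLC’s video output thread.

/*
 * Hypothetical per-frame timing instrumentation (a sketch of the idea,
 * not the actual VLC patch).  Each key point in the frame-processing
 * path records a timestamp; one CSV row per frame is written with the
 * time spent in each stage.
 */
#include <stdio.h>
#include <time.h>

#define MAX_MARKS 8

struct frame_profile {
    double start_ms;
    double mark_ms[MAX_MARKS];
    int    nb_marks;
};

/* Monotonic clock time in milliseconds. */
static double now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000.0 + ts.tv_nsec / 1e6;
}

/* Call once at the start of processing a frame. */
static void profile_begin(struct frame_profile *p)
{
    p->start_ms = now_ms();
    p->nb_marks = 0;
}

/* Call after each stage of interest. */
static void profile_mark(struct frame_profile *p)
{
    if (p->nb_marks < MAX_MARKS)
        p->mark_ms[p->nb_marks++] = now_ms();
}

/* Write one CSV row: frame number, then per-stage elapsed times in ms. */
static void profile_flush(const struct frame_profile *p, FILE *csv, unsigned frame)
{
    double prev = p->start_ms;
    fprintf(csv, "%u", frame);
    for (int i = 0; i < p->nb_marks; i++) {
        fprintf(csv, ",%.3f", p->mark_ms[i] - prev);
        prev = p->mark_ms[i];
    }
    fputc('\n', csv);
}

In the real output thread the marks correspond to stages such as SPU construction and the render phases, which is what produces the per-stage breakdown in the charts below.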

Let’s look at an example (click to enlarge, or right-click to open it in a new window so you can follow along with the text below):

[Chart: baseline per-frame processing times, before the closed caption improvements]

The above chart shows a baseline for a reference video that renders EIA-608 closed captions, on a platform with an AMD graphics chipset that uses VDPAU for hardware-accelerated video decoding. It’s a stacked bar chart, and each of the colors represents a particular stage of the output pipeline for a single frame of video.

For NTSC video (29.97 frames per second), we expect to put out a frame of video roughly every 33 ms, so whenever a frame exceeds that threshold there is a real risk that frames will be dropped. Using the above chart as an example, many frames show an excessive amount of time spent in the SPU Construct and Render 1 phases. From this we can see pretty clearly that generating the graphics that make up the NTSC closed captions is pushing us over the edge, resulting in dropped video frames when closed captions are enabled.

It’s worth noting that if you look at the top left, you’ll see the average processing time is 18.86 ms per frame, well within the 33 ms maximum. This really demonstrates the value of the tool: a traditional CPU profiler would likely report average times suggesting that processing time was perfectly normal. The key here is being able to look at the outliers (cases where frame processing time exceeded 30 ms), which is where the frame dropping occurs, and then to examine each individual case where a frame got dropped, looking at the previous frame to see how much time it took to process and where that time was spent.
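
My post-processing was done with the Perl scripts and GNUPlot mentioned above, but as a rough, hypothetical illustration of the outlier analysis, here is a small C filter that reads a per-frame CSV of the kind sketched earlier (frame number followed by per-stage times in milliseconds), reports the average, and flags the frames that exceed the ~33 ms NTSC budget.

/*
 * outliers.c - hypothetical illustration of the outlier analysis (the
 * real post-processing used Perl and GNUPlot).  Reads per-frame CSV rows
 * on stdin, sums the per-stage columns, prints the average and flags any
 * frame whose total exceeds the NTSC frame budget.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define FRAME_BUDGET_MS 33.0

int main(void)
{
    char line[1024];
    unsigned long frames = 0, over = 0;
    double total_ms = 0.0;

    while (fgets(line, sizeof(line), stdin) != NULL) {
        char *frame_id = strtok(line, ",");   /* first column: frame number */
        if (frame_id == NULL)
            continue;

        double frame_ms = 0.0;
        char *tok;
        while ((tok = strtok(NULL, ",")) != NULL)
            frame_ms += atof(tok);            /* remaining columns: stage times */

        frames++;
        total_ms += frame_ms;
        if (frame_ms > FRAME_BUDGET_MS) {
            over++;
            printf("frame %s over budget: %.2f ms\n", frame_id, frame_ms);
        }
    }

    if (frames > 0)
        printf("average %.2f ms/frame, %lu of %lu frames over %.0f ms\n",
               total_ms / frames, over, frames, FRAME_BUDGET_MS);
    return 0;
}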

Now let’s look at the same reference clip, after a series of fixes:

[Chart: per-frame processing times for the same reference clip, after the closed caption fixes]

You can argue it’s cheating to skip all the way to the end of the process, since chasing down the underlying causes of those performance problems and coming up with fixes took several weeks’ worth of work (multiple optimizations were required in different areas of the stack). The key point I’m trying to make, though, is that as I incrementally made code changes, I had a process where I could easily re-run the reference clip through VLC and assess the effects: Did that last patch make things better? Worse? By how much?

Comments and feedback are welcome in the comments section. If there’s sufficient interest, I can drill down in greater detail on additional methodology for debugging these sorts of issues.
