TECHNOLOGY: Hearing Dialogue Is a Problem in the Streaming Age. Luckily, There Are Apps for That.

A scene from “For All Mankind.” PHOTO: APPLE TV

 

By Jennifer Walden

Have you heard? Audiences are complaining about unintelligible dialogue in non-theatrical content on their TVs, about having to use subtitles, and about having to “remix” content live with the remote, turning down loud scenes and turning up quiet ones.

It’s time to seriously address this issue. Where did it all go wrong? Why is this happening? And most important, how can this be fixed? Here to weigh in on the issue are some of the sound industry’s leading re-recording mixers: Karol Urban, CAS, MPSE, Lora Hirschberg, CAS, and Onnalee Blank, CAS; supervising sound editor Wylie Stateman, MPSE; and former sound editor/re-recording mixer Eric Hoehn, CAS, who is now Manager of the Creative Sound Pipeline for Netflix. They break down this multi-pronged thorn in everyone’s side and offer up practical and big-idea solutions for making non-theatrical mixes more enjoyable.

ASSET OUT OF CONTAINMENT

Karol Urban.

Outside of the quiet and calibrated environment of a movie theater, there is no control over how a film is played back. iPhone? Soundbar? Earbuds? TV speakers? Fully calibrated Dolby Home Atmos setup? PA speaker in the backyard for outdoor movie night? A viewer might even start a film on one device and finish it on another. All options are valid, and one mix won’t work for all situations. Even with three typical nearfield mixes (5.1, Home Atmos, and stereo), it’s like throwing darts in the dark and hoping that one hits the mark. “We spend a lot of time, money, energy, thought, and skill on making the theatrical mix the best it can be. And frankly, a lot of people will never hear it. It seems a little bit inverted right now, the way people consume the content we create,” said Oscar and BAFTA Award-winning re-recording mixer Hirschberg, at Skywalker Sound in Marin County, Calif., who has mixed huge Hollywood films for studios like Disney and Marvel and for top directors like Christopher Nolan and Guillermo del Toro.

When creating nearfield mixes from theatrical ones, Hirschberg moves to a smaller dub stage with smaller speakers. “But for speakers, what do we pick?” she asked. “We can’t dumb down the mix to the least common denominator, yet we can’t assume that everyone has the highest-end gear, so that’s a huge challenge.”

Re-recording mixer Urban, currently the President of the Cinema Audio Society, is known for her mixes on series such as “For All Mankind,” “She-Hulk: Attorney at Law,” “Outlander,” “Project Blue Book,” and the newly released Disney+ series “National Treasure: Edge of History.” She said, “There are so many playback devices, and when you play a mix on these different devices, you hear different flavors of sound reproduction immediately. How does one combat that? That’s a big problem. We are now mixing for a cornucopia of devices and acoustics.”

Benedict Cumberbatch in “Dr. Strange.” PHOTO: DISNEY+

Creating nearfield mixes from theatrical ones requires rebalancing the theatrical mix stems. Hirschberg starts with the dialogue track, making sure it’s understandable and plays consistently before adding in the other stems around it. All the while, she’s thinking about the director’s intent and how to make that translate at a lower volume with less dynamic range. “We have to consider the content itself. For instance, action movies are tricky because the experience is much bigger in the theater but when you dial it down for home theater, you’re really relying only on the dialogue track. That skews the mix in a way that wasn’t necessarily the filmmaker’s intention, so you’re not getting the full theatrical experience at home. You have to go back and forth between hitting the expectations of the home audience member who’s watching it for the first time and somebody who has this knowledge of what the intention was when we mixed it theatrically,” Hirschberg said.

Lora Hirschberg.

Rebalancing stems is an art that requires the hands (and ears) of a skilled mixer to make sure the integrity of the original mix isn’t lost. It’s not just about hitting different loudness specs, so it’s not a task that should be left to an algorithm. But according to Hirschberg, streaming services are now asking for only one nearfield mix. She said, “They want the Atmos mix and they fold it down into whatever. That’s a huge problem because that is way out of our control. A lot of people are hearing the worst possible mix because they’re getting some kind of downmix that we can’t control and didn’t create. If the mixer who mixed the Atmos version does the nearfield versions, they know where the attention should be sonically, and what the feel was supposed to be. A good mixer can recreate the experience of the original mix so the listener will be hearing what they should be at any given time in the home theater Atmos, 5.1, or stereo mix. But that also requires the streaming service to actually provide the end user their choice of the mix – so they can choose the right one for their listening environment – and they don’t routinely do that.”

Hirschberg feels that “it’s super-important to spend time making a stereo mix. We used to make that as an afterthought and let a formula crash it down, but now I spend three days making that stereo mix, trying my best to make the dynamic range a lot smaller but still punchy and also make sure the backgrounds and low-level sounds are apparent in the mix. But again, my work is lost if the streaming service doesn’t provide users access to that specific mix.”

THE TEST OF TIME

Anya Taylor-Joy in “The Queen’s Gambit.” PHOTO: NETFLIX

Dialogue editing and mixing, rebalancing theatrical mixes into nearfield mixes, and creating different nearfield mixes that hit specific loudness specs for different digital distribution platforms all happen at the end of the filmmaking process when time and funds are in short supply. Big-budget projects allot time for making the required nearfield mixes. But is it enough? “Maybe not,” said Hirschberg. “We’re asked to do three different mixes in three days that have to hit specs that are sometimes really rigid, and they’re all different. Disney has one spec, and Netflix has another. The number of mixes that we’re asked to create in a very short amount of time (that need to play on a wide variety of devices) is kind of ridiculous. Making nearfield mixes definitely takes time and money, and most often, we don’t get that, so it gets the least care that can be given in order to get it out the door as fast as possible. So what you might get is hard-to-hear dialogue because no one had the resources to spend time listening to it and trying to figure out how to make it work for the home theater.”

Eric Hoehn.

Simply adding time to the end of a post-production schedule isn’t feasible. So what about changing how dialogue is handled along the chain of custody from production through post? That’s the proposal from multi-award-winning supervising sound editor Stateman, at 247SND in Topanga, Calif., and his long-time collaborator, sound editor Eric Hoehn, CAS. Both won Emmy, AMPS, and MPSE Golden Reel awards for their sound editing on Netflix’s “The Queen’s Gambit.” (Hoehn also won an Emmy and a CAS Award for mixing that show.)

Before discussing this solution, let’s dissect the problem, starting with what Stateman calls the “dialogue chain of custody.” He said, “Dialogue intelligibility is a very solvable problem that requires an acute focus on the various issues affecting clarity, and, of course, the allocation of time deemed necessary to work the material appropriately. Everybody in the chain participates at some level. The production mixer often tries every trick to get the take, the dialogue editor tries their best to take out clicks and bumps and smooth out the ambiances, and an ADR editor tries to add to or substitute new recordings. In the end, it’s left to the re-recording mixer to bring the mix to completion at a time when the filmmakers really want to focus on music, story, and final delivery.”

A scene from “Game of Thrones.” PHOTO: HBO

A typical studio mix has two re-recording mixers. One mixer handles effects, Foley, and backgrounds; the other mixer handles dialogue and music, and that’s problematic because, as Urban explained, “During the final mix, we add in the music, which is extremely expensive. At this point, the directors and producers have heard the dialogue a million times. They feel absolutely positive that every word is super-clear because they have a preexisting knowledge of the script. As mixers, we’re the last people to hear the dialogue, so we’re more objectively able to recognize dialogue issues. That’s so important at the end of the process, but we have to prioritize that. We, as mixers, have to say, ‘Are we losing dialogue here? Let’s just take a listen and focus on diction.’”

Onnalee Blank.

But as Stateman pointed out, “Clients don’t have the patience to go syllable by syllable through the dialogue on the final re-recording stage. They’re really focused on how the overall balances with music are working – shortchanging the stage time necessary to go through the dialogue appropriately. Sadly, time on a re-recording stage is often less than ideal, driven by scheduling and budgetary limitations.”

Time is the issue. Time spent on the re-recording stage is more expensive than time spent in the editing suite. According to Hoehn, the average mix is afforded one week per hour of content. “You’re talking about 10 hours of content mixed in 10 weeks. The industry has taken on this survivalist sound mentality of ‘we have a week to get through this.’ And so we ask ourselves, ‘what’s high-yield and what’s low-yield?’ That’s how we prioritize elements in the mix. The filmmakers care about music and sound effects, so we’re balancing what they want to focus on with trying to encourage them to focus on dialogue for a while.”

A dialogue editor’s responsibilities already include cleaning up the dialogue, but what if they were expanded to include improving intelligibility as well – so that the editor takes a pass at the dialogue ahead of the mixer, perhaps while listening to it on TV speakers? “If we’re not going to get more time on the mixing stage to address these things, whose responsibilities can we rethink to help with this problem?” asked Hoehn. “We think the dialogue editor is an obvious next choice for that, considering these are people who aren’t going to have the clients sitting behind their shoulders. And a lot of times, dialogue editors are already on headphones. So why not have a system where they can play the dialogue through a TV and check the intelligibility? And if the dialogue editor hears any problems, then they fix it there.”

Stateman clarified, “The separation between editor and mixer is something well worth discussing in the coming years. The creative issues are solvable by examining and refining the process. In terms of the union classifications, we really have to think about the future of editor-mixer. So rather than saying ‘the dialogue editor is now responsible for clarity,’ it’s really, ‘the dialogue editor keeps the chain of custody through the pre-mixing process’ and either becomes a mixer or works with a more qualified mixer.”

Hoehn pointed out that current Pro Tools workflows and technologies support this approach. The dialogue editor and the mixer are oftentimes dealing with the same Pro Tools session. “Gone are the days when we have a playback session that goes into an outboard console, like a DFC or System 5. It’s really a question of whether the union can rally behind this issue. These are problems and challenges that are show-specific or studio-specific, and there’s a lot of nuance to navigating them. But what we can do within the union and the Editors Guild is start to look at those roles and responsibilities in more of a classification kind of way,” he said.

NOT SO SPECTACULAR
Creating an assortment of mixes that hit different specs for different digital distribution platforms is another huge problem. Re-recording mixer Onnalee Blank, CAS, at WB Post Production Creative Services in Burbank, Calif., has won five Emmys and six CAS Awards for her work on HBO’s “Game of Thrones.” She said, “The reason that content sounds so bad on TV has nothing to do with mixing. It’s not at all about the mix. The mix sounds good when it leaves the stage. It has to do with the distribution companies.”

Blank explained that since different platforms have different specs, once a company acquires another and brings that content (which had a different audio spec) onto their platform, “they compress it and normalize it, and that’s what you hear. They come up with some spec that all the content needs to fit. They have engineers who don’t necessarily know what it is that we do and how things should sound, and they don’t talk to us. They use an overall loudness spec, so if one scene has really loud music, then the dialogue in the next scene is quiet in order to hit this spec, and you’re reaching for the remote to turn down the music and turn up the dialogue. Or, you’re watching with subtitles on. It wasn’t a huge deal when streaming was new, but now it is,” she said.

Hitting an overall loudness spec is particularly frustrating when mixing action-packed content, as Urban experienced on shows like “For All Mankind.” She said, “Having the earth implode in an apocalypse (or whatever needs to sound super-big and massive) uses a lot of energy. That is going to cause your overall loudness to increase. You may need to average the show as a whole in order to hit a spec, and still express the filmmaker’s desire to have that massive sound moment, so your dialogue ends up being pushed down.”

Urban is hopeful that more platforms will introduce Loudness Range (LRA) requirements in addition to a loudness spec. According to the European Broadcasting Union, “Loudness Range quantifies the variation in a time-varying loudness measurement.” Typically, a Loudness Range meter will ignore short but very loud events and short but very quiet events, and calculate the range between the quiet and loud portions of the entire program. For reference, Netflix recommends an LRA between 4 and 18 LU (loudness units) for the full 5.1 program mix, an LRA between 4 and 18 LU for the full stereo program mix, and a dialogue LRA of 10 LU or less.

“With a Loudness Range, in addition to having a specified anchor dialogue level (dialogue is typically the anchor element), you will also have a set dynamic range to play with for the program. The end result is that the level of the dialogue can be closer to the other elements in the mix,” Urban said.
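
For readers curious how a Loudness Range figure like that 4–18 LU target is actually derived, below is a minimal sketch of an EBU Tech 3342-style LRA calculation in Python. It assumes you already have a series of short-term loudness readings (3-second window, in LUFS) exported from a loudness meter – producing those readings is outside the sketch – and it uses a simplified dB-domain average for the relative gate rather than the spec’s energy-domain gating, so treat it as an illustration of the idea, not a compliance tool.

```python
# Sketch of an EBU Tech 3342-style Loudness Range (LRA) calculation.
# Input: short-term loudness readings (3 s window, in LUFS) from a loudness meter.
import numpy as np

def loudness_range(short_term_lufs):
    """Return an approximate LRA in LU from short-term loudness values (LUFS)."""
    levels = np.asarray(short_term_lufs, dtype=float)

    # 1. Absolute gate: discard blocks quieter than -70 LUFS (silence, room tone).
    levels = levels[levels > -70.0]
    if levels.size == 0:
        return 0.0

    # 2. Relative gate: discard blocks more than 20 LU below the gated average,
    #    so long fade-outs and near-silence don't drag the bottom of the range down.
    #    (Simplification: the spec averages in the energy domain, not in dB.)
    levels = levels[levels >= levels.mean() - 20.0]

    # 3. LRA = spread between the 10th and 95th percentiles of what remains,
    #    which ignores short but very loud events (e.g., a single explosion).
    return float(np.percentile(levels, 95) - np.percentile(levels, 10))


if __name__ == "__main__":
    # Hypothetical readings: quiet dialogue scenes around -30 LUFS and a loud
    # action sequence around -16 LUFS.
    readings = [-30.0, -29.5, -31.0, -28.0, -17.0, -16.0, -15.5, -30.5]
    lra = loudness_range(readings)
    print(f"LRA: {lra:.1f} LU")
    print("within a 4-18 LU target" if 4.0 <= lra <= 18.0 else "outside a 4-18 LU target")
```

The percentile step is what lets an isolated explosion or a brief silence pass without blowing out the measured range, which is the behavior Urban describes; the gates keep silence and room tone from counting as “quiet program” at all.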

Adding an intelligibility meter to a post sound workflow can also help to improve dialogue clarity. There are several options available, like iZotope’s Insight 2 and Nuendo 11’s built-in intelligibility meter. “If you’re coming from the theatrical mix and doing the nearfield mix for streaming, or you’ve been listening to the content over and over, this meter may help to indicate that a bit more intelligibility is needed in this section or that sequence,” Urban said.

Blank also suggests standardizing the loudness spec across all streaming services. But which spec should set the standard? “Netflix spec is the best. They have a lot of really talented people working to get that right. It would be great if all the streaming services could use that spec, or something similar. It would be great if there could be a once-a-year meeting with re-recording mixers and all these different engineers to agree on the spec. It would be interesting to hear it from their point of view, to hear what they’re doing to our mixes and why,” said Blank.

Urban agreed that a group discussion would be a major benefit for raising the standard of non-theatrical mixes. “We also need to include the production mixers, to back up their needs. They’ve got a tough job of battling obscene amounts of noise on set, and more productions are using the ‘wide and tight’ multi-camera approach, which can make it impossible to get a boom mic on the actors in the scene. They’ve got quite the enigma to solve; we should make sure that we’re supporting our sound team from the very beginning to the very end,” said Urban. “There are multiple people in the process, and we all need to be aware. From the director all the way down to the people deciding on the encoding parameters used by digital distribution platforms, if we’re all aware that this is a problem and focus on it, we’ll change our priorities to make it better.”

THE RISE OF SUBTITLES
The issues mentioned above aren’t the only reasons why subtitle usage is on the rise. Factors like increased hearing loss in young adults and adolescents; viewing content while engaged in other tasks like cooking dinner, folding laundry, or commuting; watching TV in a busy environment; or trying not to wake others all prompt people to turn on subtitles. “I was on a dub stage working with a mix tech who is younger than me, who said he used subtitles all the time at home – even while watching content we had mixed here in the studio. His reason was that his home is louder than the studio and there is usually a lot going on. Audiences are watching content while they’re doing other things and use subtitling to literally stick their focus to the screen,” said Urban.

She also explained that creative trends in filmmaking, such as using “realism” for dialogue performance instead of the classic stage yell or stage whisper, make the dialogue harder to understand. “‘Realism’ is a problem for Western languages like English because they’re extremely dependent on the pronunciation and definition of consonants in order to have intelligibility. A stage yell or a stage whisper doesn’t have as much dynamic range; it keeps the treatment of consonants and vowels closer in volume, increasing audibility. There’s a movement away from that in realism. Directors and producers accept dialogue that is harder to hear but more realistic in its performance,” explained Urban.

Productions are also embracing darker imagery and using heavier shadowing to help blend in CG elements. Urban noted, “We’re crushing blacks a lot more. This more dramatic style and darker imagery make it more difficult to see mouth movements. Viewers don’t have as much visual information to help them understand what’s being said, or they misunderstand what’s being said because it doesn’t match what they see; this is known as the McGurk effect. If viewers can’t understand the dialogue, they’ll turn the subtitles on.”

Hoehn feels that directors should think about how end users are going to be watching their content. “It might sound cool on the mixing stage with 20 speakers in the room, but does it help the success of their project? I think that’s where, as sound people, we need to shift the conversations with producers and directors to focus more on dialogue intelligibility as a means of helping the audience enjoy and engage with the content more. Maybe it requires asking the director to take the two-track mix home and listen to it in their home environment, or listen to it on an iPad or a laptop. They may find it is a very different experience,” he concluded.

Jennifer Walden is a freelance writer who specializes in post-production technology.