What if you could use AI to fix the lips in dubbed movies? This Hollywood director is doing it
Director Scott Mann talks about using the revolutionary new tech in his latest film, Fall
From bringing Transformers to life to illustrating the world of Pandora, CGI has been creating movie magic for years now. But the familiar tech can't change an actor's performance in post-production quite as seamlessly as AI can — and that can make a huge difference for filmmakers like Scott Mann.
The director and tech entrepreneur is on a mission to show others in the industry how generative AI can revolutionize their work through his company, Flawless. For instance, the tech can analyze and recreate an actor's performance with such detail that it can accurately replicate how their face would look even if they delivered different lines entirely.
When used responsibly, this capability could change how international language dubbing is done in the industry at large, and save production teams time and money in post while rendering virtually imperceptible changes.
He joined host Elamin Abdelmahmoud from Los Angeles to talk about founding his company and using the technology in his latest movie, Fall.
We've included some highlights below, edited for length and clarity. For the full discussion, listen and follow the Commotion with Elamin Abdelmahmoud podcast on your favourite podcast player.
Elamin: I'm really excited to talk to you about this. You've been directing movies for years. Tell me, when did this journey with AI begin for you?
Scott: It was really on the back of finishing that film Heist. I'd gone to extraordinary lengths to try and make the film I wanted to make, and had the opportunity to work with these great actors giving great performances. And it was really when I saw a foreign dub of the movie that I realized the damage that gets done in that process, and how different that was from what I'd originally made.
The dubbing process changes the dialogue to try and fit the existing mouth movements, and the performances are completely different. Really, it fundamentally destroys a film. So realizing how limited films were by their language, and that I was really only making a film for an English-speaking audience, was the driving factor. Coming across a technology that could change all that was really where my journey with Flawless started.
Elamin: It reminds me of that thing that the director of Parasite, Bong Joon-ho, said once he won the Golden Globe: "Once you overcome the one-inch-tall barrier of subtitles, you'll be introduced to so many more amazing films." And in many ways, this could kind of do away with that barrier.
Scott: 100 per cent right. There are so many great movies around the world that we have no exposure to whatsoever, because language is a barrier to those experiences. People just don't enjoy watching dubbed content. I think subtitled content can work really well for a certain part of the audience and for certain types of films, but en masse it doesn't really solve it, and that's why you see so few films traveling.
Elamin: I've got to say, I've watched plenty of dubbed films before, and every time I watch them, in my mind the first four minutes are really jarring because I've got to get used to the different mouth movements and the things the actors are actually saying. But then eventually I think, "I guess that's the reality of dubbed films." But for you, you kind of had this moment where you go, "I will not accept this. This is the thing I have to fix." Take me back to that moment.
Scott: For me, filmmaking is about immersion. It's about combining all the sound and visual elements and really taking an audience through an experience and a feeling — and to do that you have to empathize and connect. I looked at all the different ways that this could possibly work with all the technology. This is going back about six years ago now, when I tested out the very best of the old versions of the effects — that's kind of full head-scanning, and all the different very expensive, cumbersome ways that at that time we were essentially doing human renderings — and none of it worked. Because what really happens is, as humans we've looked at faces all our lives, so we're very conscious of lots of subliminal cues that tell us whether something is real or not.
And for me, it was when I came across this paper by scientists Hyeongwoo Kim and Pablo Garrido, led by Christian Theobalt over at the Max Planck Institute. They basically cracked this new field of technology, really, that had AI rendering human images. To me, seeing that paper was a complete game changer because it introduced a new tool with the potential to change how we make films. I was just blown away. The guys very graciously invited me over to Germany and we worked on a collaboration together to really build out a fix for film dubbing. From that, the company was founded.
Elamin: The idea of literally manipulating the speaker's mouth is so fascinating to me. You wanted people to have this deeply immersive, authentic experience of watching a film. Can you explain in layman's terms, if that's possible: how does the technology that you developed actually work?
Scott: I think in the simplest terms, what it's doing is it's able to capture a scene in a much greater level of detail just by looking at a normal finished film frame without any additional hardware or any additional cameras. It's able to kind of look at that and capture much more information than we typically can see. And from that it's able to build, essentially, a 3D representation of the scene.
And when you're in that [3D representation], you're able to change things around and you're able to take a performance from, say, one shot, and put it onto another shot, or you're able to generate mouth movements that are the actor's mouth movements with all the nuance of how they would have performed it, and essentially re-render out a new version that is as authentic as the original.
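To make that flow a little more concrete, here is a minimal sketch in Python of what such a dialogue-replacement pipeline might look like. To be clear, this is illustrative only: Flawless hasn't published its implementation, and every function and data structure below is a hypothetical stand-in. Only the three stages mirror Mann's description — recover a 3D face representation from ordinary finished frames, predict new mouth movements from replacement audio in the actor's own style, and re-render so everything else is untouched.

```python
# Illustrative sketch only: every function here is a hypothetical stand-in,
# not Flawless's actual pipeline. The stages mirror Mann's description.
from dataclasses import dataclass

import numpy as np


@dataclass
class FaceParams:
    """A per-frame 3D face representation (pose, expression, lighting)."""
    pose: np.ndarray        # head position and rotation
    expression: np.ndarray  # expression coefficients, mouth included
    lighting: np.ndarray    # estimated scene illumination


def extract_face_params(frame: np.ndarray) -> FaceParams:
    """Stand-in for monocular 3D face reconstruction from a finished film
    frame, with no extra hardware or cameras."""
    return FaceParams(pose=np.zeros(6),
                      expression=np.zeros(64),
                      lighting=np.zeros(9))


def predict_mouth_from_audio(audio_window: np.ndarray) -> np.ndarray:
    """Stand-in for an audio-to-expression model trained on the actor, so
    the predicted mouth shapes carry that actor's own articulation style."""
    return np.tanh(np.random.default_rng(0).normal(size=16))


def rerender(frame: np.ndarray, params: FaceParams) -> np.ndarray:
    """Stand-in for the neural renderer that outputs a photoreal frame."""
    return frame  # a real system would synthesize the modified face here


def replace_dialogue(frames, audio_windows):
    """Swap mouth movements to match new audio; leave the rest of the
    performance intact."""
    out = []
    for frame, audio in zip(frames, audio_windows):
        params = extract_face_params(frame)
        # Overwrite only the mouth-related expression coefficients.
        params.expression[:16] = predict_mouth_from_audio(audio)
        out.append(rerender(frame, params))
    return out
```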
Elamin: I think that's an important point to stress because, that is literally how that actor would move their mouth if they were trying to say that word; they just are not actually doing it. And yes, we're talking about you developing this for dubbing, but then you used it in post-production for your most recent film, Fall, to completely change your actors' performances in ways that are, I think, imperceptible if you're watching it. For people who have not seen this movie — which is totally gripping, by the way — your two characters climb this massive telecommunications tower. But you needed to clean up the dialogue?
Scott: Well, when we came to distribution time, the ratings board here rated it an R, because it turned out that we had 36 F-bombs in it. That was the only thing making the film an R, and Lionsgate, the studio distributing the movie in the U.S., wanted a PG-13 for delivery. Typically, to fix these problems, you would go back to set and reshoot the scenes where you see them speaking like that with new lines. We looked at doing that, but that process literally costs millions — you're building the set again, you're bringing everyone back, you're flying everyone in. It's actually a very common practice in films, but for a film like this, on this time scale, it just wasn't practical or realistic.
So we used the technology to essentially record new lines with the actors in a sound booth that were F-bomb alternatives, let's call them, and then used that audio information to generate the new shots. What was really great, to be honest, is that you're not touching the actual performance; you're retaining it, with all the emotion of the characters, but you're able to change out the dialogue to the non-swearing version. That's vitally important, because the performance is what conveys the experience to the audience, and that's, for me, what needs to be untouched. So having the ability to change out dialogue without touching the performance itself is really the key to that working.
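Continuing the hypothetical sketch above, the PG-13 fix Mann describes would amount to running the original frames back through the same pipeline, with only the driving audio swapped for the booth-recorded alternative lines:

```python
# Hypothetical usage of the sketch above: the frames stay exactly as shot,
# and only the audio driving the mouth prediction changes.
frames = [np.zeros((270, 480, 3), dtype=np.uint8) for _ in range(48)]
booth_audio = [np.zeros(1600) for _ in range(48)]  # replacement line, windowed per frame
pg13_frames = replace_dialogue(frames, booth_audio)
print(len(pg13_frames))  # 48 re-rendered frames, performance otherwise intact
```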
Elamin: So we should also note you didn't actually shoot this movie thousands of feet up on top of a telecommunications tower — thank God, because I would be so stressed out if you were going to tell me you did. But it really looks like you did it. How do you understand the difference between computer generated imagery and AI? How are those two things different?
Scott: Well, computer generated imagery typically is trying to recreate something and kind of fake every layer — like a rendering, essentially, of a 3D model. AI uses deep learning, and that deep learning is where all the magic happens. Having a system that's able to understand all the nuances of how someone speaks and articulates is as important as understanding where the light is in the rendering and all these other details. And so, it works very differently in terms of a system. You've essentially got a brain going on and it's creating neural networks, as opposed to rendering things out in the old way.
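As a rough conceptual contrast, and not a description of any particular renderer or model, the difference Mann is drawing looks something like this: CGI computes an image from explicit, hand-modelled rules, while a neural renderer maps inputs to pixels through learned weights. Both functions below are toy stand-ins:

```python
import numpy as np


def cgi_render(normals: np.ndarray, light_dir: np.ndarray) -> np.ndarray:
    """Traditional CGI: shading is computed from explicitly modelled scene
    data (here, a toy Lambertian term: brightness = max(0, normal . light))."""
    return np.clip(normals @ light_dir, 0.0, 1.0)


def neural_render(inputs: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """AI rendering: a learned function (here, a toy one-layer network) maps
    inputs straight to pixel values; the "rules" live in the trained weights."""
    return 1.0 / (1.0 + np.exp(-(inputs @ weights)))


rng = np.random.default_rng(0)
surface_normals = rng.normal(size=(4, 3))
print(cgi_render(surface_normals, np.array([0.0, 0.0, 1.0])))
print(neural_render(surface_normals, rng.normal(size=3)))
```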
But for Flawless, it was about simplifying all that for filmmakers so that the tools that we've built can really make films and content better, and then have it reach larger audiences by traveling globally. That's what Flawless has been about — harnessing the power of AI and having these tools so we can use it responsibly in filmmaking.
You can listen to the full discussion from today's show on CBC Listen or on our podcast, Commotion with Elamin Abdelmahmoud, available wherever you get your podcasts.