Hey, welcome to a presentation on Emotia, where we are making audiobook creation cheap, automated, and extremely easy.
00:11
Imagine you have a chapter of a book that you would like to convert into an audiobook.
00:15
You have the text available.
00:18
If you want to do it cheaply by yourself, you'll go to a tool like this which will produce an output like that.
00:40
The same narrator throughout, in a robotic voice.
00:45
Not a lot of emotion, not a lot of engagement with the person who's listening to it.
00:52
What we have created is an agentic pipeline that goes from this to something significantly better than that.
01:02
The first version that we are making available produces audio chapters that are ready to be uploaded to an audiobook platform like Audible.
01:12
Let's listen to an example of what this output sounds like.
02:00
This is a different tool.
02:01
It's a little bit... well, it's very different.
02:04
You get accents that match the characters based on the story, and sound effects that are contextual and help transport the listener into the scene where the story is being told.
02:23
The third thing: it's just very immersive,
02:28
like, this entire experience.
02:31
We could have stopped here.
02:33
But we didn't; we went one step further.
02:36
Let's look at what this experience would look like if we added another modality: images, or a scene, to accompany the audio.
03:00
Notice the background music and the café noises in the background.
03:58
So what are we trying to solve?
04:01
Most books are not available as audiobooks because it is extremely expensive to convert one into an audiobook.
04:08
You can see the numbers based on our industry research.
04:12
We think that with Boson AI, and our application layer built on top of Boson AI, we can get the cost down to less than $50 per chapter, which is incredibly cheap and will make generating audiobooks and getting them to listeners much cheaper and easier for indie publishers, indie authors, and eventually for publishing houses as well.
04:38
Why should Boson AI care about this?
04:41
Well, it's a massive market, and the impact can grow even larger by branching out to other languages all over the world.
04:50
There is a group of customers who will be very interested in trying out this product because of the cost savings it brings and the kind of output we can create with barely any input from their end, if that's what they want.
05:06
And yeah, there are a lot of audiobook listeners.
05:10
Everybody likes to switch one on when they wash their utensils or do their gardening.
05:16
I would know.
05:16
I work for a company which does audiobooks at scale, amongst other things.
05:20
So, how do we go about doing this?
05:24
Well a few things.
05:26
We first take the text of the chapter and run it through a few agentic workflows to extract important information about the characters and the scene, convert the text into a multi-person dialogue so that we can vocalize it correctly, and do some other processing to make the downstream audio generation a little better and easier for us.
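To make that concrete, here is a minimal sketch of what such an extraction step could look like. The function, prompt, and model name are illustrative assumptions rather than our exact implementation, and it assumes an OpenAI-compatible chat-completions endpoint.

```python
# Minimal sketch of the chapter pre-processing step (names and prompts are
# illustrative, not production code). Assumes an OpenAI-compatible endpoint
# with credentials provided via environment variables.
import json
from openai import OpenAI

client = OpenAI()

def preprocess_chapter(chapter_text: str) -> dict:
    """Extract character/scene info and split the text into speaker-tagged dialogue."""
    prompt = (
        "Read the chapter below. Return JSON with two keys:\n"
        "  characters: list of {name, age, accent, personality, voice_description}\n"
        "  dialogue: list of {speaker, line, emotion} covering the whole chapter,\n"
        "            using speaker 'Narrator' for non-dialogue prose.\n\n"
        f"CHAPTER:\n{chapter_text}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",                      # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # ask for structured JSON output
    )
    return json.loads(response.choices[0].message.content)
```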
05:52
The next step is, of course, generating the audio.
05:55
In this case, we leverage ElevenLabs voice IDs to generate reference voices that are associated with different characters based on their characteristics.
06:05
We use these reference voices to ground the output from Boson AI for the dialogues of these different characters in a given chapter.
06:13
We also use the ElevenLabs sound effect generation API to generate very specific sound effects that help set the scene for a given chapter.
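As a rough sketch of how the reference voices and sound effects can be produced with the ElevenLabs Python SDK; exact method signatures may differ by SDK version, the voice ID, texts, and file names are placeholders, and the Boson AI grounding call is only mentioned in a comment because its API details are specific to their service.

```python
from elevenlabs import save
from elevenlabs.client import ElevenLabs

eleven = ElevenLabs(api_key="ELEVENLABS_API_KEY")  # placeholder credential

# 1. Reference voice clip for one character; the voice_id is chosen to match
#    the character description extracted earlier.
reference_audio = eleven.text_to_speech.convert(
    voice_id="EXAVITQu4vr4xnSDxMaL",  # placeholder voice ID
    text="A longer reference passage written in this character's voice...",
    model_id="eleven_multilingual_v2",
)
save(reference_audio, "refs/old_innkeeper.mp3")

# 2. Contextual sound effect for the scene.
sfx = eleven.text_to_sound_effects.convert(
    text="rain on a tin roof, distant thunder, creaking wooden floor",
    duration_seconds=8,
)
save(sfx, "sfx/storm_inn.mp3")

# 3. The saved reference clip is then used as the grounding voice when we call
#    Boson AI to generate that character's dialogue lines (call omitted here;
#    it depends on Boson AI's specific API).
```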
06:24
And the output is a full chapter's audio, stitched together, like the one we heard before.
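A minimal sketch of that stitching step, using pydub as one possible tool; the file names, pause lengths, and gain values are made up for illustration.

```python
from pydub import AudioSegment

# Dialogue clips in story order (placeholder file names).
clips = [AudioSegment.from_file(p) for p in
         ["out/narrator_001.mp3", "out/hero_002.mp3", "out/villain_003.mp3"]]

chapter = AudioSegment.silent(duration=500)  # short lead-in of silence
for clip in clips:
    chapter += clip + AudioSegment.silent(duration=300)  # small pause between lines

# Lay a quieter ambience/sound-effect track underneath, starting 2 seconds in.
ambience = AudioSegment.from_file("sfx/storm_inn.mp3") - 18  # reduce by 18 dB
chapter = chapter.overlay(ambience, position=2000)

chapter.export("chapter_01.mp3", format="mp3")
```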
06:31
The third stage, of course, is the multi-agent director, where we take the text dialogues, the generated audio files, and the entire story, and pass them to an LLM director whose job is to suggest more audio effects to be added to the background, suggest how things should be stitched together, and suggest the timestamps at which one AI-generated image should switch to the next based on the scene.
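To make the director stage concrete, here is a sketch of the kind of structured plan such an LLM director can be asked to return; the field names and example values are illustrative, not a fixed spec.

```python
from dataclasses import dataclass

@dataclass
class DirectorCue:
    kind: str             # "sound_effect" | "music" | "image_switch"
    start_seconds: float  # where in the stitched chapter audio the cue begins
    description: str      # e.g. "soft cafe chatter under the dialogue"
    asset_hint: str = ""  # optional: which generated image or effect to use

# The LLM director receives the dialogue script, the audio timeline, and the
# full story text, and is prompted to emit a JSON list that we parse into
# DirectorCue objects and apply during the final mix.
```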
07:00
So we basically have a workflow that takes the kind of output you heard at the start as input and turns it into the kind of emotional experience that you just saw.
07:10
So yeah, that is what we have built so far.
07:16
The code has been provided in the GitHub repo that we have attached.
07:19
Some of the things that we are quite proud of started out as blockers that we stumbled over a little bit.
07:27
But then we solved them.
07:28
The first one: we found it a little hard to get consistent accents and voice types from Boson AI, and we realized a few things about it.
07:37
The first was that having longer reference text usually led to more consistent results.
07:44
The second was that avoiding very short dialogues for a given character led to more consistent results.
07:49
We do some processing for this upstream in the pipeline; a sketch of that follows after the third point.
07:53
And the third was that generating reference text related to the chapter we want vocalized usually also produced better outputs in our experimentation.
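Here is the promised sketch of the upstream pass that avoids very short dialogue chunks; the threshold and field names are illustrative, not our exact values.

```python
# Merge consecutive lines by the same speaker until each chunk is long enough
# to give the voice model stable context (threshold is illustrative).
MIN_CHARS = 60

def merge_short_lines(dialogue: list[dict]) -> list[dict]:
    merged: list[dict] = []
    for line in dialogue:
        if (merged
                and merged[-1]["speaker"] == line["speaker"]
                and len(merged[-1]["line"]) < MIN_CHARS):
            merged[-1]["line"] += " " + line["line"]
        else:
            merged.append(dict(line))
    return merged
```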
08:02
The last thing I also want to mention here is that we are extensively using AI evals throughout this entire pipeline: every single time we generate a dialogue or a sound effect, we send it to the Audio Understanding API from Boson AI and ask it very specific questions about the quality of the output.
08:26
If the quality is unsatisfactory, we do up to three retries to produce a better output.
08:32
And the next step here would be to do this agentically, where we change the voice generation prompt to address the specific issue we are seeing, based on what the Audio Understanding API is telling us.
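A rough sketch of that eval-and-retry loop; generate_dialogue_audio and audio_quality_check are hypothetical helpers standing in for the Boson AI generation call and the Audio Understanding eval, and the acceptance threshold is illustrative.

```python
MAX_RETRIES = 3

def generate_with_evals(line: dict, prompt: str) -> bytes:
    """Generate one dialogue clip, score it with the audio-understanding model,
    and retry (up to three times) when the quality is unsatisfactory."""
    best_audio, best_score = None, -1.0
    for attempt in range(MAX_RETRIES):
        audio = generate_dialogue_audio(line, prompt)      # hypothetical: Boson AI generation
        score, issues = audio_quality_check(audio, line)   # hypothetical: Audio Understanding eval
        if score > best_score:
            best_audio, best_score = audio, score
        if score >= 0.8:                                   # illustrative acceptance threshold
            break
        # Next step (future work): feed `issues` back into an agent that rewrites
        # `prompt` before the retry, instead of retrying blindly.
    return best_audio
```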
08:45
Some other things that we dealt with:
08:47
It's quite hard to generate the same character voice consistently across a chapter and across different dialogue groups.
08:55
We found it quite hard to do this consistently, but realized that providing very expressive and rich character descriptions helped a lot. It's still not completely solved, but we're 90% there.
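For illustration, this is roughly the level of detail we mean by a rich character description; the fields and example values are made up, not taken from a real book.

```python
# Illustrative character profile; every field is injected into each generation
# request for this character, which is what keeps the voice recognizable
# across different dialogue groups.
character = {
    "name": "Mrs. Halloway",
    "age": "late sixties",
    "accent": "soft Irish lilt, slower cadence",
    "timbre": "warm, slightly raspy alto",
    "personality": "wry, patient, speaks in short reassuring sentences",
    "speaking_quirks": "draws out vowels when amused, sighs before bad news",
}
```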
09:10
And the next thing, of course: layering different audio tracks, layering audio with images, and doing the transitions, all using large language models without any manual intervention, was very hard.
09:24
But my teammates are amazing and we figured it out.
09:28
Where do we go from here?
09:30
Multi-language support, of course.
09:31
That would be a stretch goal for tomorrow, along with personalization at scale.
09:36
So we want to give more power to our end users to define the narration type, the voices they want to use, and so on.
09:43
So that seems like a logical next step and where we would like to take this project.
09:48
Thank you for your time and attention.
09:50
Please do listen to the audio
09:52
samples we have shared with you, for different types of text and books.
09:58
We are very excited and looking forward to being included in the top 16, to discuss this idea more and flesh it out a little bit more.