
Voicery makes synthesized voices sound like humans

Advances in A.I. technology have paved the way for breakthroughs in speech recognition, natural language processing and machine translation. A new startup called Voicery now wants to take advantage of these same advances to improve speech synthesis, too. The result is a fast, flexible voice engine that sounds more human – and less like a robot. Its machine voices can then be used wherever a synthesized voice is needed – including in new applications such as automatically generated audiobooks or podcasts, voiceovers, TV dubbing and elsewhere.

Before creating Voicery, co-founder Andrew Gibiansky worked at Baidu Research, where he led the deep learning voice synthesis team.

There, the team developed advanced machine learning techniques, published papers on speech synthesis with deep neural networks, and deployed its technology in Baidu's production systems.

Now, Gibiansky brings the same skill set to Voicery, where he is joined by co-founder Bobby Ullman, who previously worked at Palantir on databases and scalable systems.

"When I was at Baidu, it became very obvious that the revolution in deep learning and machine learning was about to arrive in speech synthesis," says Gibiansky. "Over the last five years, we've seen these new techniques bring astonishing gains in computer vision, speech recognition, and other fields – but not yet in synthesizing human speech. We saw that if we could use this new technology to build speech synthesis engines, we could do much better than anything that exists today."

Specifically, the company is exploiting new deep learning technologies to create better synthesized voices faster than before.

In fact, the founders built Voicery's speech synthesis engine in just two and a half months.

Unlike traditional speech synthesis solutions, where one person records hours and hours of speech that are then used to create a new voice, Voicery trains its system on hundreds of voices at a time.

It can also use varying amounts of voice input from the same person. Because of the amount of data it receives, the system sounds more human, as it learns correct pronunciations, inflections and accents from a wider variety of source voices.
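The core idea behind training one engine on many voices is usually speaker conditioning: a shared text encoder learns pronunciation from everyone's data, while a small per-speaker embedding carries each voice's identity. The sketch below is purely illustrative – Voicery's actual architecture is not public, and every name and dimension here is an assumption for demonstration.

```python
import random

# Illustrative sketch of multi-speaker conditioning (NOT Voicery's actual
# system): one shared text encoder plus a learned vector per speaker.
random.seed(0)

EMB_DIM = 4
SPEAKERS = ["alice", "bob", "carol"]  # stand-ins for hundreds of training voices

# One embedding per speaker; character embeddings are shared by all voices,
# so every speaker's data improves the same pronunciation model.
speaker_emb = {s: [random.gauss(0, 1) for _ in range(EMB_DIM)] for s in SPEAKERS}
char_emb = {c: [random.gauss(0, 1) for _ in range(EMB_DIM)]
            for c in "abcdefghijklmnopqrstuvwxyz "}

def encode(text, speaker):
    """Return one 'frame' per character, offset by the speaker's vector.

    The linguistic content comes from the shared character embeddings;
    the speaker embedding adds a constant per-voice offset standing in
    for accent and timbre.
    """
    spk = speaker_emb[speaker]
    frames = []
    for ch in text.lower():
        vec = char_emb.get(ch, [0.0] * EMB_DIM)
        frames.append([v + s for v, s in zip(vec, spk)])
    return frames

# Same text rendered for two speakers: identical linguistic content,
# but each voice shifts the frames by its own embedding.
a = encode("hello", "alice")
b = encode("hello", "bob")
```

In a real neural system the embeddings and encoder weights would be learned jointly by gradient descent, but the structure – shared text model, per-speaker vector – is the same.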

The company claims that its voices are almost indistinguishable from humans – it even posted a quiz on its website asking visitors whether they can identify which voices are synthesized and which are real. I found that you can still pick out the machine voices, but they are far better than the machine-reader voices you're accustomed to.

Of course, given the rapid pace of technological development in this field – not to mention the fact that the team built its system in a few months – one has to wonder why the major players in voice couldn't do something similar with their own in-house engineering teams.

However, Gibiansky says that Voicery has the advantage of being first out of the gate with technology that capitalizes on these advances in machine learning.

"None of the currently published research was good enough for what we wanted to do, so we had to extend it a bit," he notes. "Now we have many voices ready, and we're starting to find clients to work with."

Voicery already has some customers trialing the technology, but nothing to announce at the moment, as those discussions are at various stages.

The company charges an upfront fee to develop a new voice for a customer, then charges a usage fee.

The technology can be used wherever voice systems exist today – in translation applications, GPS navigation, voice assistants or screen readers, for example. But the team also sees the potential to open new markets, given how easy it is to create synthesized voices that really sound like people. This includes things like podcast synthesis, news reading (think: Alexa's "Flash Briefing"), TV dubbing, voices for video game characters, and more.

"We can move into spaces that fundamentally don't use this technology today because it hasn't been of sufficient quality, and we have some interest from companies trying to do that," says Gibiansky.

Voicery, based in San Francisco, is bootstrapped apart from the funding it received by participating in Y Combinator's Winter 2018 class. It is looking to raise additional funding after YC's Demo Day.
