Achievements of SteosVoice speech synthesis technology in 2022

The SteosVoice team achieved amazing results in speech synthesis in 2022. While many developers who position themselves as text-to-speech services use publicly available solutions "as is", our team both improves on these solutions and builds its own.

What do we rely on when we create these solutions?

When we analyze synthesized speech, we judge how natural it sounds by factors such as intonation, pronunciation of individual sounds, and audio defects. It is important to us that our users get quality results from SteosVoice, so we work on improving each of these criteria.

What steps have already been taken?

1) To improve pronunciation and make it easier to control, it is common to train the model on a phoneme representation of the text. Phonemes are the sounds a person pronounces when reading a text. Since existing phoneme dictionaries of Russian words are not adapted to the task of speech synthesis, we conducted research and developed our own phoneme set, which effectively describes the sounds our models need to produce. The result was more flexible pronunciation control and a reduction in pronunciation errors from 6% to 0.4%, a 15-fold decrease.

2) Sound quality is one of the most important factors in how synthesized speech is perceived. By modifying the vocoder model, we were able to reduce the number of artifacts in the synthesized speech while maintaining quality with a critically small amount of training data: up to 10 minutes of original recording. Such a small data requirement, together with a new approach to creating voices, allowed us to produce voices about 7 times faster.

3) Besides improving the SteosVoice engine, we also develop features that can be used in a variety of areas. Our most interesting developments are changing intonation, generating voices of non-existent people, and transferring a voice from one language to another. You can already try some of them on the SteosVoice platform or in the Telegram chatbot. For example, you can try cross-language voice transfer by voicing Russian text with one of the English voices. To hear the generated voices of non-existent people, synthesize text with the voice of Jack or Arthur.

4) We have tested many existing solutions for controlling synthesized speech and are now combining the best practices into a single model, so that we can demonstrate the full potential of SteosVoice technology very soon.
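
To illustrate the phoneme representation mentioned in step 1, here is a minimal sketch of how text can be turned into a phoneme sequence before it is passed to a synthesis model. It is not the SteosVoice implementation: the lexicon, the phoneme symbols, the word-boundary marker, and the function name `to_phonemes` are all hypothetical and chosen only for illustration.

```python
# Toy pronunciation lexicon. The symbols are illustrative only and are
# NOT the SteosVoice phoneme set, which is not described in this article.
LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

WORD_BOUNDARY = "|"  # hypothetical marker separating words


def to_phonemes(text, lexicon=LEXICON):
    """Convert text to a flat list of phoneme symbols.

    Words missing from the lexicon fall back to bracketed letters so that
    they are easy to spot and add to the lexicon later.
    """
    phonemes = []
    for raw_word in text.lower().split():
        word = raw_word.strip(".,!?;:")
        if not word:
            continue
        if word in lexicon:
            phonemes.extend(lexicon[word])
        else:
            # Out-of-vocabulary fallback: one bracketed symbol per letter.
            phonemes.extend(f"<{ch}>" for ch in word)
        phonemes.append(WORD_BOUNDARY)
    # Drop the trailing word boundary, if any.
    return phonemes[:-1] if phonemes else phonemes


if __name__ == "__main__":
    print(to_phonemes("Hello, world!"))
```

Because the model sees explicit sound symbols rather than letters, a mispronounced word can in principle be corrected by editing its lexicon entry instead of retraining, which is one way to read the "more flexible pronunciation control" described above.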