It no longer sounds like a robot: CyberVoice technology has learned to clone any voice

Speech synthesis is the process of converting text into speech: using machine learning methods, a computer programme simulates human speech from a textual representation. There are various ways to synthesize speech, but one of the most effective is the use of neural networks.

The synthesized speech is shaped by three components:

  1. Textual representation;
  2. Cloned voice;
  3. Style of reading (intonation, timbre, emotional coloring, etc.).

CyberVoice is built on its own neural network approach. The key feature and advantage of neural networks for this kind of problem is their ability to approximate arbitrarily complex functions. However, converting written text directly into raw audio is very time-consuming even for neural networks, so modern speech synthesis works with a compressed representation of sound, the spectrogram. Audio is converted into spectrograms using the Fourier transform.
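For illustration, here is a minimal Python sketch of how audio can be turned into a mel spectrogram with a short-time Fourier transform. The library (librosa), the mel scale, and all parameters such as window size, hop length and number of mel bands are assumptions chosen for the example, not CyberVoice's actual settings.

```python
import numpy as np
import librosa

# Illustrative parameters only; the real system's settings are not public.
SAMPLE_RATE = 22050
N_FFT = 1024        # window size of the short-time Fourier transform
HOP_LENGTH = 256    # step between successive analysis windows
N_MELS = 80         # number of mel bands, a common choice in TTS systems

def audio_to_mel_spectrogram(path: str) -> np.ndarray:
    """Load an audio file and convert it to a log-mel spectrogram."""
    wave, _ = librosa.load(path, sr=SAMPLE_RATE)
    # Short-time Fourier transform: magnitude spectrum per frame.
    spectrum = np.abs(librosa.stft(wave, n_fft=N_FFT, hop_length=HOP_LENGTH))
    # Project linear frequencies onto the perceptual mel scale.
    mel = librosa.feature.melspectrogram(S=spectrum ** 2, sr=SAMPLE_RATE, n_mels=N_MELS)
    # Log compression keeps the dynamic range manageable for a neural network.
    return np.log(np.clip(mel, a_min=1e-5, a_max=None))
```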

During training, the neural network learns to match text with the corresponding spectrograms. Once trained, such a network can generate (predict) spectrograms from a given text. The spectrograms are then converted back into audio, for instance with the Griffin-Lim algorithm, although this introduces some distortion.
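A similarly minimal sketch of the reverse step, assuming the same illustrative parameters as above: Griffin-Lim iteratively estimates the phase information that the spectrogram has discarded, which is exactly where the distortions come from.

```python
import numpy as np
import librosa

def mel_spectrogram_to_audio(log_mel: np.ndarray) -> np.ndarray:
    """Rough waveform reconstruction from a predicted log-mel spectrogram."""
    mel_power = np.exp(log_mel)  # undo the log compression applied during analysis
    # mel_to_audio inverts the mel projection and then runs Griffin-Lim,
    # which iteratively estimates the missing phase of each frame.
    return librosa.feature.inverse.mel_to_audio(
        mel_power,
        sr=22050,
        n_fft=1024,
        hop_length=256,
        n_iter=60,  # more iterations give a slightly better phase estimate
    )
```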

The first iteration used only one neural network, operating on the principle described above. The disadvantages of such a system are easy to spot: word stress cannot be controlled, the pronunciation style is fixed by the training data, and the sound quality makes it obvious that the text was read by a computer.

The next iteration added a vocoder. In general, a vocoder is a system that translates the acoustic features of a signal into speech. In our case the acoustic features are spectrograms, and the vocoder is a neural network that learns to translate spectrograms into audio. This gives a much more natural sound than the Griffin-Lim algorithm.
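The article does not describe the vocoder's architecture, so the following is only a toy PyTorch sketch of the general idea: a stack of transposed convolutions that upsamples mel frames back to waveform samples. Real neural vocoders (WaveNet, HiFi-GAN and similar) are far more elaborate.

```python
import torch
import torch.nn as nn

class ToyVocoder(nn.Module):
    """Toy illustration of a neural vocoder: upsample 80-band mel frames
    back to a raw waveform. Only the data flow is realistic here."""

    def __init__(self, n_mels: int = 80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, 256, kernel_size=7, padding=3),
            nn.LeakyReLU(0.1),
            # Each transposed convolution increases time resolution 4x;
            # 4 * 4 * 4 * 4 = 256 samples per frame matches a hop length of 256.
            nn.ConvTranspose1d(256, 128, kernel_size=8, stride=4, padding=2),
            nn.LeakyReLU(0.1),
            nn.ConvTranspose1d(128, 64, kernel_size=8, stride=4, padding=2),
            nn.LeakyReLU(0.1),
            nn.ConvTranspose1d(64, 32, kernel_size=8, stride=4, padding=2),
            nn.LeakyReLU(0.1),
            nn.ConvTranspose1d(32, 1, kernel_size=8, stride=4, padding=2),
            nn.Tanh(),  # waveform samples in [-1, 1]
        )

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, n_mels, frames) -> waveform: (batch, 1, frames * 256)
        return self.net(mel)

# Example: 100 spectrogram frames become 25 600 waveform samples.
print(ToyVocoder()(torch.randn(1, 80, 100)).shape)  # torch.Size([1, 1, 25600])
```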

CyberVoice does not work with the textual representation of a word but with its phonetic representation, that is, a description of how the word sounds. This decision lets us flexibly adjust the pronunciation of any word or even of an individual sound. Since people do not pronounce words the way they are written, translating text into a phonetic representation ensures that spectrograms are correctly matched to sounds, which speeds up training and increases model accuracy.

To describe words with sounds, we use a dictionary of our own, in which each word has its phonetic representation. For words that are not in the dictionary, a separate neural network translates arbitrary text into a phonetic representation. We also use a special symbol to control word stress: it is placed before the stressed phoneme both during training and during speech synthesis.
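A hypothetical sketch of this front end: the dictionary entries, the phoneme notation and the "+" stress marker below are invented for illustration, and the grapheme-to-phoneme network is replaced by a trivial stub.

```python
# Illustrative only: the article states that a special symbol is placed
# before the stressed phoneme, but the actual notation is not public.
PHONETIC_DICT = {
    "hello": ["HH", "AH", "+L", "OW"],   # "+" marks the stressed phoneme
    "world": ["+W", "ER", "L", "D"],
}

def g2p_fallback(word: str) -> list[str]:
    """Stand-in for the grapheme-to-phoneme neural network that handles
    words missing from the dictionary (a real model would be trained)."""
    return list(word.upper())  # naive letter-by-letter placeholder

def phonetize(text: str) -> list[str]:
    """Convert text into the phoneme sequence fed to the synthesis model."""
    phonemes: list[str] = []
    for word in text.lower().split():
        phonemes.extend(PHONETIC_DICT.get(word, g2p_fallback(word)))
        phonemes.append(" ")  # word boundary token
    return phonemes

print(phonetize("hello world"))
# ['HH', 'AH', '+L', 'OW', ' ', '+W', 'ER', 'L', 'D', ' ']
```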

Thanks to training techniques such as transfer learning and fine-tuning, as well as our own developments, we have reduced the minimum amount of data needed to train our models to one minute of audio, while preserving the sound quality of the original voice at a high sampling rate.
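As a rough illustration of the transfer learning and fine-tuning idea (not CyberVoice's actual procedure): start from a model pretrained on many hours of speech, freeze most of its weights, and update only a small part of the network on the minute of target-voice data. The layer-name prefixes below are placeholders.

```python
import torch

def prepare_for_fine_tuning(pretrained_model: torch.nn.Module,
                            trainable_prefixes=("decoder", "postnet")):
    """Freeze a pretrained synthesis model except for a few named parts,
    then return an optimizer over the remaining trainable parameters."""
    for name, param in pretrained_model.named_parameters():
        # Keep the general text-to-spectrogram knowledge frozen;
        # update only the parts that shape the target speaker's voice.
        param.requires_grad = any(name.startswith(p) for p in trainable_prefixes)
    trainable = [p for p in pretrained_model.parameters() if p.requires_grad]
    # A small learning rate avoids destroying the pretrained weights.
    return torch.optim.Adam(trainable, lr=1e-5)
```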

Today we are officially launching the CyberVoice project. It will act as AI vocal cords for live NPCs in games, provide a new generation of creative tools for content creators, game and mod developers, and for voicing all kinds of content, and establish a unified and transparent voice licensing environment.