Instead of stitching together short human sound bites to create words and sentences, Google’s latest system generates vocals from the text alone.
When machines speak, they sound stilted, robotic and mechanical – but they’re getting better. Google’s latest text-to-speech system, called Tacotron 2, generates sounds entirely from scratch, and the search giant claims the results are as good as those built using professional voice artists.
Earlier systems typically produce speech by assembling recorded snippets of human speech into words and sentences. By contrast, Tacotron 2 was trained on more than 24 hours of human speech and the corresponding transcripts, and could then generate entirely new audio from a given text, even when the text contained words it had never seen before. You can listen to the results here.
Stephen Cox at the University of East Anglia in the UK says the Google system is impressive because it learns all aspects of speech – including punctuation, prosody (the “tune” of the voice) and intonation – without expert intervention.