Amazon Polly

Amazon Polly is a Text-to-Speech (TTS) cloud service that converts text into human-sounding speech. Amazon Polly supports multiple languages and includes a variety of human voices, so you can build speech-enabled applications that a wide range of users can be comfortable listening to.

Amazon Polly offers both neural TTS (NTTS) and standard TTS technology. Both synthesize natural-sounding speech with high pronunciation accuracy, including abbreviation and acronym expansion, date and time interpretation, and homograph disambiguation.

Amazon Polly is designed to respond quickly, which makes it a viable option for low-latency use cases such as dialog systems.

Amazon Polly converts input text into life-like speech. You call one of the speech synthesis methods, provide the text that you want to synthesize, choose one of the Neural Text-to-Speech (NTTS) or Standard Text-to-Speech (TTS) voices, and specify an audio output format. Amazon Polly then synthesizes the provided text into a high-quality speech audio stream.
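
The call described above can be sketched in Python. This is a minimal sketch, assuming the boto3 SDK and its Polly client method synthesize_speech; the helper names (build_synthesis_request, synthesize_to_file, make_polly_client), the default voice "Joanna", and the default region are illustrative assumptions, not part of the original text.

```python
from typing import Any, Dict


def build_synthesis_request(text: str, voice_id: str = "Joanna",
                            engine: str = "neural",
                            output_format: str = "mp3") -> Dict[str, Any]:
    """Assemble keyword arguments for the SynthesizeSpeech operation."""
    return {
        "Text": text,                   # the text to synthesize
        "VoiceId": voice_id,            # which voice speaks the text
        "Engine": engine,               # "neural" or "standard"
        "OutputFormat": output_format,  # for example "mp3" or "ogg_vorbis"
    }


def synthesize_to_file(polly_client: Any, text: str, path: str,
                       **options: Any) -> int:
    """Synthesize text, write the audio to path, and return bytes written."""
    request = build_synthesis_request(text, **options)
    response = polly_client.synthesize_speech(**request)
    audio = response["AudioStream"].read()  # the response carries a byte stream
    with open(path, "wb") as handle:
        handle.write(audio)
    return len(audio)


def make_polly_client(region_name: str = "us-east-1") -> Any:
    """Create a real client; needs boto3 installed and AWS credentials."""
    import boto3  # imported lazily so the helpers above work without the SDK
    return boto3.client("polly", region_name=region_name)
```

With credentials configured, synthesize_to_file(make_polly_client(), "Hello from Amazon Polly.", "hello.mp3") would be expected to write an MP3 file. Passing the client in as an argument keeps the request-building logic testable without network access.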

Common use cases for Amazon Polly include mobile applications such as newsreaders, games, eLearning platforms, and accessibility applications for visually impaired people.

On-device TTS solutions require significant computing resources, including CPU power, RAM, and disk space. This can result in higher development costs and higher power consumption on battery-powered devices such as tablets and smartphones. In contrast, TTS conversion done in the AWS Cloud dramatically reduces local resource requirements. This enables support of all the available languages and voices at the best possible quality. Moreover, speech improvements are instantly available to all end users and do not require additional updates for devices.

The input to Amazon Polly can be plain text or formatted in Speech Synthesis Markup Language (SSML) format. With SSML you can control various aspects of speech, such as pronunciation, volume, pitch, and speech rate.
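
As an illustration of SSML input, the sketch below wraps text in a prosody element that sets rate, pitch, and volume, then marks the request as SSML. It is a sketch under assumptions: the helper names are hypothetical, the request keys follow the boto3 synthesize_speech parameters (TextType set to "ssml"), and support for individual SSML attributes may vary by voice engine.

```python
from typing import Any, Dict
from xml.sax.saxutils import escape, quoteattr


def prosody_ssml(text: str, rate: str = "medium", pitch: str = "medium",
                 volume: str = "medium") -> str:
    """Wrap text in <speak><prosody ...> with XML-escaped content."""
    attrs = (f"rate={quoteattr(rate)} pitch={quoteattr(pitch)} "
             f"volume={quoteattr(volume)}")
    # escape() protects characters such as & and < that would break the markup
    return f"<speak><prosody {attrs}>{escape(text)}</prosody></speak>"


def build_ssml_request(ssml: str, voice_id: str = "Joanna",
                       output_format: str = "mp3") -> Dict[str, Any]:
    """Keyword arguments for synthesize_speech when the input is SSML."""
    return {
        "Text": ssml,
        "TextType": "ssml",  # tells the service to parse Text as SSML
        "VoiceId": voice_id,
        "OutputFormat": output_format,
    }
```

For example, prosody_ssml("Please hold.", rate="slow", volume="loud") produces a document that asks for slower, louder speech. Escaping the text first matters because unescaped markup characters in user-supplied text would make the SSML invalid.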

Amazon Polly provides a portfolio of languages and a variety of voices, including a bilingual voice (for both English and Hindi). For most languages, you can choose from several voices, both male and female. When launching a speech synthesis task, you specify the voice ID, and then Amazon Polly uses this voice to convert the text to speech.
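
To find a voice ID for a given language, the available voices can be listed first. The sketch below assumes the boto3 Polly client method describe_voices and its LanguageCode, IncludeAdditionalLanguageCodes, Engine, and NextToken parameters; the function names list_voices and voice_ids are hypothetical.

```python
from typing import Any, Dict, List, Optional


def list_voices(polly_client: Any, language_code: str,
                engine: Optional[str] = None) -> List[Dict[str, Any]]:
    """Collect every voice for a language, following pagination tokens."""
    params: Dict[str, Any] = {
        "LanguageCode": language_code,
        # also return bilingual voices that list this language as additional
        "IncludeAdditionalLanguageCodes": True,
    }
    if engine is not None:
        params["Engine"] = engine
    voices: List[Dict[str, Any]] = []
    while True:
        page = polly_client.describe_voices(**params)
        voices.extend(page.get("Voices", []))
        token = page.get("NextToken")
        if not token:
            return voices
        params["NextToken"] = token  # request the next page of results


def voice_ids(voices: List[Dict[str, Any]]) -> List[str]:
    """Extract the IDs that can be passed as VoiceId when synthesizing."""
    return [voice["Id"] for voice in voices]
```

With a real client, voice_ids(list_voices(client, "hi-IN")) would return the IDs usable for Hindi text, including any bilingual voice that lists Hindi as an additional language.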

Amazon Polly is not a translation service: the synthesized speech is in the same language as the text. However, if the text is in a different language from the one designated for the voice, numbers represented as digits (for example, 53, not fifty-three) are synthesized in the language of the voice and not the language of the text.

Amazon Polly can deliver synthesized speech in multiple formats, including MP3 and Ogg Vorbis, for consumption by web and mobile applications.
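
When saving the audio, the requested format determines the file type. The following is a small sketch that maps the two formats named above to a file extension and content type; the OutputFormat values "mp3" and "ogg_vorbis" follow the boto3 synthesize_speech parameters, while the table and the helper audio_path are hypothetical and cover only these two formats.

```python
from pathlib import Path

# OutputFormat value -> (file extension, content type reported by the service)
AUDIO_FORMATS = {
    "mp3": (".mp3", "audio/mpeg"),
    "ogg_vorbis": (".ogg", "audio/ogg"),
}


def audio_path(stem: str, output_format: str) -> Path:
    """Return a file path whose extension matches the requested format."""
    if output_format not in AUDIO_FORMATS:
        raise ValueError(f"format not covered by this sketch: {output_format!r}")
    extension, _content_type = AUDIO_FORMATS[output_format]
    return Path(stem + extension)
```

For instance, audio_path("briefing", "ogg_vorbis") yields a path named briefing.ogg, which a web application could serve with the matching content type from the table.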