How does SSML work?

We show you how to use SSML to customize your voices

This article explains what you can do with Speech Synthesis Markup Language (SSML). SSML lets you customize the generated speech: for example, you can insert pauses and control how acronyms, dates, times, abbreviations, or text to be censored are spoken. To try this with an example, open VoiceOverMaker and go to the audio editor:

[Screenshot: SSML VoiceOverMaker]

The <break> element

In the audio editor, enter the following text, as shown in the screenshot:

This is a pause <break time="3s"/> and now I'll continue.

As you can see, the <break> element inserts a pause of 3 seconds. You can also specify the pause in milliseconds, e.g. 500ms. Normally, SSML input is wrapped in a <speak> element; this is not necessary in VoiceOverMaker.
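
A minimal sketch of both points, for illustration only: pauses given in milliseconds and in seconds, wrapped in the <speak> root element that most other SSML tools expect. Whether VoiceOverMaker also accepts the wrapper when it is present is an assumption; as noted above, it is not required there.

```xml
<speak>
  Here is a short pause <break time="500ms"/> and a longer one <break time="2s"/> before I continue.
</speak>
```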

The <say-as> element

Use this element to indicate what kind of text construct the element contains. It also lets you set the level of detail with which the contained text is rendered. The <say-as> element has the required attribute interpret-as, which determines how the value is spoken. Depending on the value of interpret-as, you can also use the optional attributes format and detail.

The following example is spoken as a cardinal number ("twelve thousand three hundred forty-five"):

<say-as interpret-as="cardinal">12345</say-as>

The following example is spoken as "First":

<say-as interpret-as="ordinal">1</say-as>

The following example is spoken as "C A N" (English):

<say-as interpret-as="characters">can</say-as>

In the following example, the text is replaced by a beep, as if it had been censored:

<say-as interpret-as="expletive">censor this</say-as>

The value unit converts the unit to singular or plural depending on the number. The following example is spoken as "20 feet":

<say-as interpret-as="unit">20 foot</say-as>

The following example is spoken letter by letter (in English):

<say-as interpret-as="verbatim">abcdefg</say-as>

The following example is spoken as "The tenth of September, nineteen sixty":

<say-as interpret-as="date" format="yyyymmdd" detail="1"> 1960-09-10 </say-as>

The following example is spoken as "The tenth of September":

<say-as interpret-as="date" format="dm">10-9</say-as>

The following example is spoken as "Two thirty P.M.":

<say-as interpret-as="time" format="hms12">2:30pm</say-as>

These were examples of how numbers and text can be pronounced differently. The following values are available for the attribute interpret-as:

  • cardinal
  • ordinal
  • characters
  • fraction
  • expletive / bleep
  • unit
  • verbatim / spell-out
  • date
  • time
  • telephone
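
The values fraction and telephone do not appear in the examples above. The following sketch shows how they might be used. The sample numbers are made up, and the exact wording that is spoken depends on the selected voice and language. The last two lines assume that spell-out and bleep behave like verbatim and expletive, as the slashes in the list suggest.

```xml
<!-- Read as a fraction rather than as "three slash four" -->
<say-as interpret-as="fraction">3/4</say-as>

<!-- Read as a telephone number rather than as one large number -->
<say-as interpret-as="telephone">1800-202-1212</say-as>

<!-- Assumed to be equivalent to verbatim and expletive -->
<say-as interpret-as="spell-out">SSML</say-as>
<say-as interpret-as="bleep">censor this</say-as>
```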

The <audio> element

The <audio> element supports the insertion of recorded audio files and other audio formats in conjunction with synthesized voice output. It accepts the following attributes:


  • src
  • clipBegin
  • clipEnd
  • speed
  • repeatCount
  • repeatDur
  • soundLevel
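
As a rough sketch of how these attributes fit together, the following fragment plays three seconds of a clip twice, slightly louder than the original. The URL is a placeholder, not a real file, and the value formats (seconds for the clip points, decibels for soundLevel) follow common SSML conventions; it is an assumption that VoiceOverMaker accepts them in exactly this form. In standard SSML, the text inside the element is spoken as a fallback if the file cannot be loaded.

```xml
<audio src="https://example.com/intro.ogg"
       clipBegin="1s" clipEnd="4s"
       repeatCount="2"
       soundLevel="+3dB">
  The intro sound could not be loaded.
</audio>
```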

The <p> and <s> elements

These elements mark paragraphs (<p>) and sentences (<s>):


<p><s>This is sentence one.</s><s>This is sentence two.</s></p>

If a break in speech is meant to be long enough to be audible, wrap the sentences in <s></s> tags and place the break between them.
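
A short sketch of this advice, reusing the two sentences from above with an audible pause between them. The one-second duration is an arbitrary choice for illustration.

```xml
<p>
  <s>This is sentence one.</s>
  <break time="1s"/>
  <s>This is sentence two.</s>
</p>
```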

The <sub> element

<sub alias="World Wide Web Consortium">W3C</sub>

This specifies that the text in the alias attribute is spoken in place of the contained text.

The <prosody> element

The <prosody> element adjusts the pitch, speaking rate, and volume of the text it contains. The attributes rate, pitch, and volume are currently supported.
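
For illustration, a fragment that sets all three attributes at once. The values shown (a named rate, a pitch shift in semitones, and a named volume) are common SSML forms; which values VoiceOverMaker and the selected voice honor is an assumption to verify by listening.

```xml
<prosody rate="slow" pitch="-2st" volume="loud">
  This sentence is spoken slowly, slightly lower, and louder.
</prosody>
```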

The <emphasis> element

Use this element to add emphasis to the contained text or to remove it. Like <prosody>, <emphasis> changes how the text is spoken, but without requiring you to set individual speech attributes.

The level attribute can have the following values:

  • strong
  • moderate
  • none
  • reduced
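
A brief sketch using two of these levels. The sentences are invented for illustration, and how pronounced the effect is depends on the selected voice.

```xml
I <emphasis level="strong">really</emphasis> mean it.
This part is <emphasis level="reduced">less important</emphasis>.
```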

These were some of the most common SSML elements. Try them out now in VoiceOverMaker.