A small experiment, to demonstrate and pin down a workflow for a state-of-the-art ‘expressive audiobook’ reading in 2018, made with affordable consumer text-to-speech software and a purchased voice.
Result: The final audio file (42 minutes).
Input text: a difficult one, the complex essay “Cats and Dogs” (1926) by H.P. Lovecraft. Pulp fiction, with simple sentences and obvious words, might work far better. But this was a stress-test.
Voice used: Ivona ‘Brian’ (British English, 22kHz, about $50). ‘Brian’ does not flow across words as smoothly and blandly as the default Windows 8 voice, Microsoft Zira, does. As a result Brian occasionally mis-emphasises words and slurs slightly, yet is far more expressive in an audiobook than Zira.
1. The text was read by ‘Brian’ in the text-to-speech software TextAloud 4, with the text read out to a standard MP3 file.
* Speed: Normal.
* Pitch: -5 (to deepen the voice slightly).
* Volume: 100% (perhaps too high, you might also try 70%).
* Pauses between sentences: 0.7 seconds (default in TextAloud is 0.5).
* Pauses between paragraphs: two seconds.
(Why not use the free Balabolka reader? Because it doesn’t offer pause adjustment. Update: it now offers markup to add pauses and pitch shifts. Further update: you can now also set universal pauses.)
2. I loaded the resulting MP3 output file into the free audio editor Audacity. An Equalisation filter was run to try to cut the 5kHz – 7kHz sibilance. The same preset tried to slightly boost 1kHz – 5kHz, for overall speech intelligibility.
3. The simple free Spitfish De-esser was then run inside Audacity, to further reduce sibilance. (Select All | Effect | Spitfish | Apply | Close). This runs far more quickly than Audacity’s native de-essing filter, as well as being simpler to control. You may have problems seeing the download button so here is a direct .ZIP download.
4. Ran Effect | Limiter, using its default ‘Soft Limit’ preset.
5. Added the Reverb filter, with its ‘Vocal I’ preset.
6. Ran the Spitfish De-esser again, to make a final attempt to reduce the remaining sibilance. Same settings as before.
7. Saved as an MP3, 320kb/s quality, resulting in a 50MB file for a 42-minute reading.
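As a rough check on the figures in step 7, file size for a constant-bitrate MP3 is just bitrate multiplied by duration. The sketch below is only that arithmetic; the function names are my own and it ignores tags and encoder overhead.

```python
def cbr_size_megabytes(bitrate_kbps, minutes):
    """Approximate size of a constant-bitrate audio file, in decimal megabytes."""
    bits = bitrate_kbps * 1000 * minutes * 60  # kilobits/s -> total bits
    return bits / 8 / 1_000_000                # bits -> bytes -> megabytes


def average_bitrate_kbps(size_megabytes, minutes):
    """Average bitrate implied by a file of a given size and length."""
    bits = size_megabytes * 1_000_000 * 8      # megabytes -> bits
    return bits / (minutes * 60) / 1000        # bits/s -> kilobits/s


if __name__ == "__main__":
    print(cbr_size_megabytes(320, 42))   # size of 42 minutes at a constant 320 kb/s
    print(average_bitrate_kbps(50, 42))  # average rate implied by 50MB over 42 minutes
```

By this arithmetic a constant 320kb/s over 42 minutes would come to roughly 100MB, so a 50MB file implies an average nearer 160kb/s. The write-up does not say which export setting accounts for the difference, so treat the quoted bitrate as approximate.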
Incidentally, it’s apparently possible to “chain” these steps (like a Photoshop Action) in Audacity, as a preset, and then play them back automatically. I couldn’t find that option in my Audacity, but that’s perhaps because I have an older version.
Results:
The results were fairly listenable, and (once the raspy ‘synthetic voice sibilance’ was reduced) definitely seem like an advance on previous robo-voices. But the test result was certainly not ideal, due to the unnatural and unexpected stresses that the ‘Brian’ voice places on certain words, and its slurring of others. It’s rather like listening to a ‘sticky’ or ‘wobbly’ old cassette tape from the 1980s, and becomes rather wearing after a while. It can result in an aural equivalent of the motion-sickness that one encounters in many videogames.
Perhaps there may be some search-and-replace script that automatically tweaks a text so that ‘Brian’ reads it better, but I couldn’t find one. Simple and immediate global fixes are:
* Change Mr. and Mrs. to Mister and Misses.
* Change capitalised acronyms such as NASA to Nasser, or they will be said ‘En-Ay-Ess-Ay’.
* Change crunched-up hyphenation, such as “then-as you all know-he did something”, to “then – as you all know – he did something”.
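The three fixes above are easy to script. What follows is a minimal sketch in Python, not a tested tool for ‘Brian’ specifically: the function name, the `single_hyphens` switch and the respelling table are my own inventions, and only the NASA respelling comes from the list above. Double hyphens and unspaced dashes are opened up by default. Single hyphens are only touched if you ask for it, because a script cannot tell “then-as” from a genuine compound such as “text-to-speech”.

```python
import re

# Acronyms that should be read as words. NASA -> Nasser is the respelling
# suggested in the list above; any further entries are up to you.
ACRONYM_RESPELLINGS = {"NASA": "Nasser"}

SPACED_DASH = " \u2013 "  # en dash with a space either side


def prepare_for_tts(text, respellings=ACRONYM_RESPELLINGS, single_hyphens=False):
    """Apply the three global fixes to a text before text-to-speech."""
    # 1. Expand the honorifics so they are read as whole words.
    text = re.sub(r"\bMrs\.", "Misses", text)
    text = re.sub(r"\bMr\.", "Mister", text)

    # 2. Respell acronyms that would otherwise be read letter by letter.
    for acronym, spoken in respellings.items():
        text = re.sub(r"\b" + re.escape(acronym) + r"\b", spoken, text)

    # 3. Open up crunched dashes: double hyphens, en dashes and em dashes
    #    jammed between two words with no spaces.
    text = re.sub(r"(?<=\w)(?:--+|[\u2013\u2014])(?=\w)", SPACED_DASH, text)

    # Optional and risky: treat every single hyphen between words as a dash.
    # This also splits real compounds, so check the result by eye.
    if single_hyphens:
        text = re.sub(r"(?<=\w)-(?=\w)", SPACED_DASH, text)

    return text


if __name__ == "__main__":
    print(prepare_for_tts("Mr. and Mrs. Smith visited NASA."))
    print(prepare_for_tts("then-as you all know-he did something",
                          single_hyphens=True))
```

Run it on a copy of the text and skim the result before sending it to the reader, especially if `single_hyphens` is switched on.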
It also helps to have good text-cleaner software running when you copy-paste your text into TextAloud, to fix line-wrapping and other problems.
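If no text cleaner is to hand, the line-wrapping part at least can be approximated with a few lines of Python. This is a sketch under one assumption, that paragraphs in the pasted text are separated by blank lines; the function name is hypothetical, and words hyphenated across a line break are deliberately left alone, since rejoining them automatically would also damage genuine hyphenated words.

```python
import re


def unwrap_paragraphs(text):
    """Join hard-wrapped lines into single-line paragraphs."""
    # Normalise Windows and old Mac line endings first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    # Assumption: one or more blank lines mark a paragraph break.
    paragraphs = re.split(r"\n\s*\n", text.strip())

    # Inside each paragraph, collapse line breaks and runs of spaces
    # into single spaces.
    cleaned = [" ".join(paragraph.split()) for paragraph in paragraphs]

    # Put one blank line back between paragraphs, so that the reader's
    # paragraph pause still has something to trigger on.
    return "\n\n".join(cleaned)


if __name__ == "__main__":
    sample = "The cat\nsat on\nthe mat.\n\nThe dog\r\nbarked."
    print(unwrap_paragraphs(sample))
```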
There are of course various machine-learning services, such as Amazon Polly, which claim to offer smoother reading voices for text-to-speech. But they appear to be aimed at big-budget developers, are Cloud-based, and it seems unlikely that owners such as Amazon will ever allow them to be unleashed on the making of long audiobooks (which would compete with Audible). What’s being tested above are the tools available to consumers for less than $100 total.