Will the real Siri please stand up?

James Hodgens
5 min read · Jun 29, 2021

An exploration of Speech-to-Text technology and how to start implementing it in Python.

In Apple’s iOS 14.5 update, Siri received some enhancements, including several new voice options (https://www.apple.com/newsroom/2021/04/ios-14-5-offers-unlock-iphone-with-apple-watch-diverse-siri-voices-and-more/). Siri also now leverages Neural Text to Speech technology, which Apple claims results in a more natural sound (the original Siri used voice recordings as the basis for its responses).

Apple’s continued investment in Siri (as well as similar investments by other companies) demonstrates how important virtual assistants will be in the future. Since it was first introduced on the iPhone almost a decade ago, Siri’s ability to understand and respond to users has come a long way. Furthermore, virtual assistants have permeated our homes via smart speakers and other smart devices. They may even replace customer service representatives in the future. Imagine never waiting on hold again because you’re speaking to an algorithm instead of waiting for a real person to look up the answer to your question!

Virtual assistants like Siri, Amazon Alexa, and Google Assistant rely on several technologies, such as speech-to-text, natural language processing, and text-to-speech. The rest of this blog post will look at the first step in that pipeline (speech-to-text) and how we can begin to build our own virtual assistants in Python.

There are several Python packages available for speech recognition. In this blog post, we will explore one of them (appropriately named SpeechRecognition — https://github.com/Uberi/speech_recognition#readme) to start playing around with this technology.

SpeechRecognition’s functionality centers on the Recognizer class and several recognition methods, most of which rely on external APIs. Here’s how to create a Recognizer instance:
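```python
import speech_recognition as sr

# Everything starts with a Recognizer instance
r = sr.Recognizer()
```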

And here are a few of the API methods:
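```python
# Each recognize_*() method sends an AudioData object to a different engine,
# and most of them require API credentials:
#
#   r.recognize_google(audio)   # Google Web Speech API (a default key is built in)
#   r.recognize_sphinx(audio)   # CMU Sphinx (runs offline, via pocketsphinx)
#   r.recognize_wit(audio)      # Wit.ai (requires an API key)
#   r.recognize_ibm(audio)      # IBM Speech to Text (requires account credentials)
```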

Let’s try out the Google method with a recording of me saying the word ‘test’:
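A minimal sketch of that call, assuming the recording is saved as test.wav (the file name here is just a placeholder):

```python
import speech_recognition as sr

r = sr.Recognizer()

# Load the audio file and read it into an AudioData object
with sr.AudioFile('test.wav') as source:
    audio = r.record(source)

r.recognize_google(audio)
# 'test'
```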

Wow! That was pretty easy! Let’s try it with something a little longer:
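The pattern is exactly the same for a longer recording (again, the file name is a placeholder):

```python
# Continuing with the Recognizer instance r from above
with sr.AudioFile('longer_clip.wav') as source:
    audio = r.record(source)

# recognize_google() returns the full transcription as a single string
r.recognize_google(audio)
```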

Let’s take it a step further…

[Waveform image courtesy of soundviz.com]

This is actually the waveform of the beginning of Eminem’s ‘Lose Yourself’. We can easily get the text lyrics for this song in our Jupyter notebook:
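Something along these lines, assuming the clip has been saved locally as a WAV file (the file name is a placeholder, and the lyrics themselves are omitted here):

```python
with sr.AudioFile('lose_yourself_intro.wav') as source:
    audio = r.record(source)

# Prints the transcribed lyrics of the clip
print(r.recognize_google(audio))
```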

Everyone was pretty happy with that result.

Let’s keep going:
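Same idea with the next part of the verse (placeholder file name again):

```python
with sr.AudioFile('lose_yourself_verse.wav') as source:
    audio = r.record(source)

print(r.recognize_google(audio))
```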

That was supposed to say ‘mom’s spaghetti’, not ‘Monster Jam’…

Marshall’s perplexed…

Let’s try a different method:
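One option is CMU Sphinx, which runs entirely offline through the pocketsphinx package:

```python
# recognize_sphinx() performs the transcription locally, with no web API,
# reusing the audio from the clip above
print(r.recognize_sphinx(audio))
```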

The crowd approves of this method (even if there still wasn’t any mom’s spaghetti).

So that was fun, but let’s take a look at the high-level process for what’s going on under the hood of this package. First, the audio file is broken down into very small segments, each represented by a vector. The next step is taking those vectors and identifying the phonemes in the audio clip. Phonemes are the basic sound components of speech; in our original example of “test”, there are four phonemes.
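For “test”, those four phonemes look like this (written here in ARPAbet notation, purely for illustration):

```python
# 'test' decomposes into four phonemes: T, EH, S, T
phonemes = ['T', 'EH', 'S', 'T']
print(len(phonemes))  # 4
```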

Next, the speech recognition system classifies each phoneme and assigns a confidence level to the word (or words) it believes the audio clip contains.

If we pass the parameter show_all=True to recognize_google, the output includes the confidence level (as well as several other candidate transcriptions):
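A sketch with the ‘test’ clip from earlier (the exact values in the output will vary):

```python
result = r.recognize_google(audio, show_all=True)
print(result)

# Illustrative output (show_all=True returns the raw response dictionary):
# {'alternative': [{'transcript': 'test', 'confidence': 0.97},
#                  {'transcript': 'tests'}],
#  'final': True}
```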

After my exploration of speech-to-text and the rap game, I wondered if we might one day see AI break into freestyle rap battles. AI has already beaten the best chess players in the world. Would it be able to beat B-Rabbit (Eminem’s character in 8 Mile)? Turns out, Apple beat me to it…

I guess Siri’s more of a Snoop Dogg fan than an Eminem fan…

