An In-Depth Look at the Speech Recognition Engine
When you speak, the speech signal is analyzed by a front end that takes care of the acoustic modeling: it compares the speech spectrum with stored statistical models for each phoneme it hears. These statistical models tell Speech how much variation it can expect for each sound.
Thanks to this comparison, Speech then determines which sequence of phonemes is most likely to have produced the sound and what the next-most-likely sequence will be. It assigns a score to each sequence, and each sequence represents a hypothesis about what has been spoken. Since this system relies on (statistically trained) guesses, hundreds or thousands of hypotheses can be generated for even simple strings.
This data is then passed to a second component of Speech that takes care of language modeling. This component holds language models that represent what is possible and (even trickier) likely for you to say.
Speech uses what is called a "Finite State Grammar" (FSG). In less technical terms, this can be compared to the list of all spoken commands that Speech can expect at a given time. For example, the items stored in your "Speakable items folder" are part of this grammar. On a technical note, let's add that the set of phrases Apple's FSG accepts can be infinite, thanks to recursion.
The hypotheses generated at the first stage are then compared against this grammar. The one with the best overall score -- grammar combined with acoustics -- is considered the recognized string.
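To make the two-stage idea more concrete, here is a minimal Python sketch of how acoustic hypotheses might be filtered through a grammar. It is only an illustration and not Apple's implementation: the command list, the scores, and the names `GRAMMAR`, `grammar_score`, and `recognize` are all made up, and the grammar is reduced to a plain set of allowed phrases.

```python
import math

# Hypothetical grammar: the commands Speech could expect right now.
GRAMMAR = {"what time is it", "get my mail", "open my browser"}

# Hypothetical output of the acoustic stage: (phrase, acoustic log-score).
# Higher (closer to zero) means a better acoustic match.
hypotheses = [
    ("what time is it", -12.0),
    ("what dime is it", -11.5),  # slightly better acoustics, not in the grammar
    ("get my mail", -20.0),
]

def grammar_score(text):
    """Return 0.0 for a phrase the grammar allows and -inf for anything else."""
    return 0.0 if text in GRAMMAR else -math.inf

def recognize(hyps):
    """Pick the hypothesis with the best combined acoustic + grammar score."""
    scored = [(acoustic + grammar_score(text), text) for text, acoustic in hyps]
    best_score, best_text = max(scored)
    # If nothing fits the grammar, report that nothing was recognized.
    return best_text if best_score > -math.inf else None

print(recognize(hypotheses))
```

Notice that "what dime is it" loses even though its acoustic score is slightly better: the grammar rules it out, which is the behavior described above.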
Although this "guessing" system may seem primitive, it is in fact the most robust of all. It is truly speaker independent, and as long as you pronounce a command that is part of the grammar, your chances of being recognized correctly are extremely high.
The main drawback is that you cannot say something that isn't part of the grammar and expect it to be recognized reliably. This is why, for example, Speech does not feature dictation -- although the Speech group at Apple has dictation technology running in their research lab.
Facing the Competition
Some competitors use another system called a Statistical Language Model (SLM). To understand how this works, you have to imagine a big, two-dimensional table where every row and every column represents a word in the dictionary. Then, imagine that every cell contains the probability of seeing the row-word preceded by the column-word. I agree, it's not the kind of table we could draw ourselves, and even for scientists, these probabilities are quite complex to determine.
For every string of words you pronounce, an SLM system has to assign a probability that the string occurs. We Mac owners are very lucky since the already amazingly powerful G4 and G5 processors are specially designed to optimize these kinds of calculations.
The system will then pick the hypothesis that has the best chance to occur and doesn't care about grammar.
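The two-dimensional table can be sketched in a few lines of Python. This is a toy illustration under loud assumptions: the training text is invented, the probabilities come from simple counting with no smoothing (which a real product would not rely on), and the names `counts`, `bigram_prob`, and `string_prob` are mine. The table is stored as nested dictionaries keyed by the preceding word first.

```python
from collections import defaultdict

# Tiny, made-up training text standing in for a real corpus.
corpus = "get my mail . open my browser . get my mail .".split()

# counts[prev][word] = how often `word` was seen right after `prev`.
counts = defaultdict(lambda: defaultdict(int))
for prev, word in zip(corpus, corpus[1:]):
    counts[prev][word] += 1

def bigram_prob(prev, word):
    """Estimate the probability of seeing `word` preceded by `prev`."""
    total = sum(counts[prev].values())
    return counts[prev][word] / total if total else 0.0

def string_prob(words):
    """Multiply the word-pair probabilities along the whole string."""
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        prob *= bigram_prob(prev, word)
    return prob

for phrase in ("get my mail", "get my browser", "mail my get"):
    print(phrase, string_prob(phrase.split()))
```

In this toy corpus, "get my mail" scores higher than "get my browser" simply because it was seen more often, and "mail my get" scores zero because those word pairs never occurred. No grammar is involved at any point.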
As you can see, this system has the advantage of allowing dictation. However, it is very limited since the structure of language is extremely difficult to systematize. This forces the developers of such systems to add checks. In fact, the table I described at the beginning of this section has more than two dimensions: the fewer dimensions there are, the less complex the dependencies that the program can correctly take into account. Plus, training such a model takes, according to experts, the equivalent of a 15-year subscription to the Wall Street Journal!
That's why today such systems are stuck with "trigrams" (tables that track sequences of three words), which are far from perfect but can be managed both by the developers and by the computers on which the program runs.
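A quick back-of-the-envelope calculation shows why adding dimensions is so costly. The sketch below assumes a purely hypothetical dictionary of 20,000 words and a table that has one cell per possible word sequence; real systems store these tables far more cleverly, so treat the numbers as an order of magnitude only.

```python
def table_cells(vocabulary_size, order):
    """Number of cells in a full table covering word sequences of length `order`."""
    return vocabulary_size ** order

VOCABULARY = 20_000  # assumed dictionary size, for illustration only

for order, label in ((2, "two words"), (3, "three words (trigrams)"), (4, "four words")):
    print(f"{label}: {table_cells(VOCABULARY, order):,} cells")
```

With these assumptions, the two-word table has 400 million cells, the trigram table has 8 trillion, and a four-word table would need 160 quadrillion.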
Can you imagine how big it gets? The dictionary also limits the recognition power. That's why software manufacturers have to sell you special "expansion packs" for it to work in specialized environments.
But this technology is also sensitive to background noise and requires training not only on the environmental noise but also on your voice and on the way you speak, which must be calm, smooth, and regular.
On a side note, even if this technology is able to adapt to a wider range of situations and, therefore, allows dictation, the original texts used to train the software will affect how well it will understand you. For example, a system trained to help doctors at a hospital write their reports (a common application) won't perform well in another hospital since the doctors there won't have the same style.
Getting to Know Speech
Now that we have discussed what Speech can do for you, it's time to take a hands-on approach and actually try it. In this section, we will go step-by-step so that even beginners can use Speech right now. If you already feel comfortable with it, feel free to skim through the reading, although you may actually find some interesting tips.
Turning on Speech
The Speech preferences can be accessed through the "System Preferences" application. To turn Speech recognition on, follow these steps:
- Open System Preferences
- Click on Speech to open the Speech Preferences pane
- If it is not already selected, click on the Speech Recognition tab
- Then click on the On/Off tab to reveal the switch we are looking for
- Click on On (nothing new ;-) to turn the Speech recognition on and make sure that you check "Turn on Speakable Items at login" so that Speech starts up automatically at each login. This is especially important if you are setting up a computer for a user with disabilities who may not be able to turn it on by using the mouse.
As soon as you turn Speech on, you should see a small, round palette appear somewhere on the screen. This palette will stay there for as long as the Speech Recognition engine is running -- whether the computer is actually listening or not.
Its contents are simple but provide you with very valuable feedback about what the engine does. Here is what you will see, from top to bottom:
- The icon can represent either a microphone or a speaker. When you see the microphone, this means that your Mac is ready to listen to you. As soon as you start to speak -- or your microphone hears a sound -- two arrows will point toward the microphone. In human terms, this is the equivalent of scratching one's head. Your Mac is trying to figure out whether what it heard made sense. Sometimes, you may even see a "???" appear at the top of the window, meaning that your Mac realized that you asked for something but couldn't understand it. Seeing a speaker simply means that your Mac is talking to you. This catches your attention in case your speakers are turned off.
- Under the microphone, you may see text on the light blue bar. Whenever text is present, this means that you need to perform an action before talking. It can be a button you need to press or a name you need to pronounce to catch your computer's attention. If there is nothing to see, this means that your Mac listens continuously and processes every single thing it hears as a potential command. We will talk later about the pros and cons of each method.
- The few bars that continuously dance under the message represent your microphone's volume. Ideally, whenever you speak, they should stay in the green area. We will see later why this is important and how to fine-tune this setting.
- The button underneath them allows you to open Speech Preferences and a list of the commands that you can say. These two functions can be replaced by the spoken commands "Open the Speech Preferences window" and "Show me what to say."
The window is as unobtrusive as possible and really shouldn't interfere with your workflow. However, keep in mind that you can minimize it to the Dock by double-clicking on it. Isn't the "Genie effect" super-cool with this round window? In a typical Mac OS X way, it will continue to interact and behave normally, including displaying the animations we talked about previously.
To test Speech, tell your Mac to listen and say in a clear voice "What time is it?" Your computer will then answer, probably using Vicki's charming voice. If this is the very first time you try Speech, you may need to repeat this command a few times to let your Mac adapt to your voice and the background noise.
Setting Speech for Optimum Results
Now that you have turned the Speech recognition engine on, it's time to set it up for optimum results. This can be done through the Listening tab.
Listen Continuously or Not?
Listening continuously is a good idea if you use Speech frequently and work in a relatively quiet environment where the risk of a false positive is low (although Speech is extremely reliable). But since in this mode your Mac listens to everything and tries to understand it, it may mix up "Don't Worry, Be Happy" on iTunes with a "Get my mail" command.
In most cases, however, you should not encounter any issues, especially since you can fine-tune this behavior with the computer name. You can ask your Mac to wait until you say its name to listen to you -- like your pet knows you are talking to him when you say his name.
The "Name is:" pop-up menu allows you to fine-tune your Mac's behavior. Since I live dangerously, I have set it to "Optional before commands," allowing me to indulge in talking impulsively with my Mac.
If you want to give a name to your computer, make sure you give it a name that you can easily pronounce but are unlikely to include in a normal conversation. Calling it "Coffee" or "Good morning" is likely to wake it up more than you would like. Good names are iMac, Zarvox, etc. For security reasons, never use your password, hostname, or IP address. (If you can pronounce your password, it is not secure anyway.)
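The difference between an optional and a required name can be pictured with a small Python sketch. This is only my mental model of the behavior, not how Speech is actually written: the function `extract_command` and its `name_required` flag are hypothetical, and the real engine works on sound rather than on text.

```python
def extract_command(utterance, name, name_required):
    """Return the command part of an utterance, or None if it should be ignored."""
    words = utterance.lower().split()
    name_words = name.lower().split()

    # The utterance starts with the computer's name: strip it and keep the rest.
    if words[:len(name_words)] == name_words:
        rest = words[len(name_words):]
        return " ".join(rest) if rest else None

    # No name was spoken: ignore the utterance if the name is required,
    # otherwise treat everything that was heard as a potential command.
    if name_required:
        return None
    return " ".join(words) if words else None

print(extract_command("Zarvox get my mail", "Zarvox", name_required=True))
print(extract_command("get my mail", "Zarvox", name_required=True))
print(extract_command("get my mail", "Zarvox", name_required=False))
```

With the name required, a stray "get my mail" from the radio is simply ignored. With the name optional, it is treated as a command, which is the "living dangerously" setting.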
Picking a Microphone
Speech is able to work with a wide range of microphones, even the one built into your Mac. I'm particularly impressed with the one built into the G4 iMac screen, far away from the fan.
I wouldn't recommend trying to use your iSight as your microphone since you probably have better uses for the camera than controlling your computer. Anyway, Speech v. 3.3 does not support the iSight as an input device.
Would you want to use a headset with noise cancellation? If you work in a hostile environment, sure, as long as it is natively recognized by Mac OS X. Just make sure you select it in the Microphone pop-up menu. Keep in mind, however, that such devices may need time to process the signal and forward it to your Mac, which can make speech recognition less efficient overall than a setup with a slightly higher error rate but better speed.
I discovered with surprise that an old USB headset that came in an IBM ViaVoice box works on Mac OS X without drivers. For Mac OS 9 you have to install IBM's software first.
Of course, make sure that you pick the right microphone in the Microphone pop-up menu. Keep in mind that all the microphones located near you will hear your voice. Seeing the Speech window flash is in no way a guarantee that the right one is selected.
Fine Tuning Your Microphone Settings
Have you noticed how you cannot understand someone who yells in your ear? The volume is too loud and you are unable to understand the meaning of the sentences. The same applies to your Mac.
Ensuring that your microphone operates at an optimum volume can make the Speech recognition system even more accurate. In order to set it up, click on the Volume button located next to the Microphone pop-up menu.
This will display a small window with a volume slider and commands on the right. Talk normally into your microphone and look at the volume indicator. If the red parts light up, the volume is too loud. If only the blue parts light up, it is too low. Ideally, the volume should consistently stay in the upper part of the green area.
Once you have found the optimum volume setting, make sure that your headset -- if applicable -- is locked into place and that the microphone cannot move too easily, which would effectively lower or raise the volume.
Then, to make sure that Speech understands you, read all the commands listed on the left part of the window until they blink. This is not "training" per se since Speech doesn't require training, but it will allow you to adjust your voice volume, pitch, and speed to the best levels. It will also allow Speech to get used to the background noise in your environment to ensure optimum accuracy.
Practicing a Bit
Now that the speech-recognition system is turned on, let's try a few things. Get ready to talk and read the following commands in a clear, normal voice.
- Quit this application.
- Get my mail.
- Hide this application.
- Open my browser.
Impressive, huh? Without touching your keyboard or mouse, you have quit System Preferences, checked your email, hidden Mail, and visited your browser's home page.
To learn more about the basic commands you could use, simply instruct your Mac, "Show me what to say." This will open a small palette with the list of all the commands you can use. To get rid of it, say "Close the Speech commands window."
Remember, Speech has a learning curve and you may need to practice a bit to reach 99% accuracy. It needs to learn to distinguish between the commands that are really useful to you and the ones that you'll only use to entertain your friends at your next dinner party.
One of my personal favorites is "Tell me a joke." You didn't know your Mac knew Knock-Knock jokes, did you?
Adapting to a New Environment
If you change the environment you're in, keep in mind that Speech may need to adapt to the different background noise to provide you with perfect accuracy.
The easiest way to help the Mac adjust is to let it listen for some background noise for a few seconds and then to speak a command that it cannot misunderstand. One of them is "What time is it?" Wait until the reply arrives and you're ready to go.
The recommended way is, of course, to read the commands listed in the "Volume" sheet again. This will greatly speed up the adaptation process and is worth doing if you plan to stay where you have moved. If the speech recognition engine suddenly seems slower, give it a try.