Editor's Note -- Apple's recent announcement of Spoken Interface has moved speech recognition to the forefront. However, Mac OS X has included speech recognition and synthesis technologies for quite some time, and in this article we delve into the often misunderstood world of talking to your Mac.
The documentation provided by Apple states that the Speech Manager -- the component that takes care of piping the text into the Speech Synthesizer -- was first introduced in 1993. Once again, this shows how innovative Apple can be. Computers of that time were very different from what we know today, and adding speech capabilities to a consumer product -- even thinking about it -- was a real breakthrough. If not somewhat crazy.
However, Speech really was born again with the introduction of Mac OS X and especially in the two latest releases, Jaguar and Panther. The new audio capabilities of Mac OS X, along with the renewed commitment from Apple to this amazing technology, have combined to produce what is widely considered to be the most convenient and advanced speech technology available in this field.
Therefore, if you have tried and abandoned Speech during the last century -- the Mac OS 9 days, in other words -- give it another try.
Users who got used to the voice verification feature in Mac OS 9 (vocal password) should not despair. It is not currently built into Mac OS X, but theoretically nothing prevents Apple from adding it again if it is widely requested. This feature actually worked OK and can be considered to be very secure since, even if an attacker knows your pass phrase, he cannot "borrow" your voice. And no, recordings of your voice won't fool the system.
When Apple began to build speech into the Mac OS, they formed a team composed of some of the world's leading speech and language scientists, aiming to bring the user-computer interaction mechanisms to a whole other level.
The Speech technology is in fact built in two parts: a speech synthesizer that your Mac can use to communicate with you -- to read text on demand but also to keep you informed about the status of a process -- and a speech-recognition technology that allows you to talk to your Mac to send it commands, as you usually do with a keyboard and mouse.
Since Speech is built-in right at the core of Mac OS X, there is no need to install a special application or devices to make it work. Although the way a specific application reacts to spoken commands in detail is up to the developer, any Mac OS X application can, to a certain extent, be controlled by voice.
This amazing integration is mainly due to the development tools and elements that Apple provides to developers. For example, once Apple builds speech-controlling capabilities into the standard elements produced by Interface Builder, the tool it hands out to developers, all the applications built with that tool can be controlled using standard commands.
Of course, since there is always customization, you can, at any time, add your very own commands to the speech recognition engine -- more on that later. There is, however, no need to worry: Apple ships Mac OS X with a predefined set that will allow you to perform the most common tasks -- browse the Web, check your emails, etc. -- right out of the box.
With a hint of practice, you will be able to forget about your keyboard and mouse and do much of what you already do while leaning back in your chair, thereby diminishing the risk of physical injury. In fact, several HR departments are now encouraging people to use these features whenever possible for exactly this reason, i.e., to decrease the incidence of workplace injury through repetitive strain.
Speech can also, when used along with more traditional input devices, make your computing experience more productive and enjoyable. If you want to check your mail while working on an important report for your boss, you do not need to stop what you're doing. Simply say "Get my mail" and let your Mac do the work for you.
Of course, Speech is also very handy for users with disabilities since it allows them to interact with their computer without having to ask for assistance. Thanks to Speech, the Mac has become the computer of choice for visually impaired users who can enjoy quality voices and excellent voice recognition. Indeed, the feedback provided by Speech can allow a user who does not see the screen to determine whether the command he gave to the computer took effect or not and what the status of the request is.
There is also, let's face it, a "coolness" factor that will convince many users to turn Speech on. But before doing so, you should be warned that Speech is highly addictive!
When asked, most Mac users will tell you that they have tried Speech, asked the computer to give them the time, then turned it off because they did not see it as valuable. Usually they thought the voice recognition was unreliable or the voices used by the computer weren't pleasant.
As with any technology, there is a short learning curve before you can really master it and feel comfortable speaking with your computer. After all, this is a brand new way of interacting with a machine and you may need a few hours to feel relaxed and speak normally again.
Voices are also computationally expensive and, up until recently, many computers couldn't deal with extremely complex, natural-sounding voices. The good news is that the incredible computing power packed in the latest Macs allows the Speech team to release increasingly natural-sounding voices and speech synthesizers, making the interaction with a computer even more pleasant. This will be especially noticeable for Panther users.
No synthetic voice sounds perfectly natural. Keep in mind that the specialized speech-synthesis technologies on which some phone systems rely are heavily trained and are "specialized." Ask your virtual reception desk to pronounce the word "asteroids" and you will probably hear the most unnatural voice ever. Your Mac is able to pronounce any word you give it in a natural way. Developers can go even further and use the various tools Apple puts at their disposal to fine-tune the speech synthesis in their applications. Few take the time to do that right now, but when they do the results are striking.
As you can see, the quality of voices has increased over time. For example, Vicki, the new default Panther voice -- and last in this demo -- weighs in at 27.6 MB instead of the more traditional 1.5 MB that older voices used to take up.
The Speech Synthesizer has also evolved a lot and is now able to distinguish common abbreviations and to add emphasis to long sentences and paragraphs automatically, making speech sound much more natural. This is especially noticeable when you read long text documents. The voice is now much livelier and lifelike since it better duplicates the emphasis a real-life speaker would put on different parts of the text.
Understanding how Speech works can provide you with valuable information to better take advantage of this technology. In this part, I will try to provide you with an in-depth look at the speech recognition engine as well as answer a few basic structural questions.
The process that converts the text string that must be read into the sound that goes out of your speakers can be roughly divided in four steps:
As a general rule, the more RAM and processing power your computer has, the better the voice will sound. This is because a Speech Synthesizer heavily relies on your computer's resources to perform its calculations. Of course, any modern Mac is able to speak perfectly, but do not expect your old Performa to do as well as a PowerMac G5.
A voice is a set of characteristics defined in parameters that specify a particular quality of speech. They are like natural voices -- all of them are different and, from their characteristics, you can guess the age and sex of the speaker. Voices can talk slower or faster, but in the end you cannot change their base characteristics -- just like you can alter your voice but never entirely change it. Panther comes with 22 voices, but you can theoretically add more if you like.
Indeed, the Speech architecture is very flexible and, as years go by, more and more developers have created add-ons for it, to extend its capabilities and provide even more natural-sounding voices.
Do you remember when, at the beginning of this article, I told you that you may have in fact used Speech for months without knowing it? That's because the work going on with the Speech group at Apple doesn't stop at making voices and recognizing what you say. Far from it!
Indeed, for the Apple team speech cannot be distinguished from language and there can be no reliable speech technology without a good understanding of the language that is spoken. That's why Apple developed a complex set of rules that allows your Mac to truly analyze the text before it is spoken. Of course, Speech relies on a 121,000-word dictionary that tells it how the most common words are pronounced ... but what about the others? What about the context in which these words are placed? While some other technologies don't care, your Mac does.
This very same technology allows the Mail Junk Mail feature to reach 98% accuracy when it is properly trained and serves as the basis for the Japanese input method. If, like millions of Mac users, you have wondered how Mail does its magic, keep in mind the phrase "adaptive latent semantic analysis."
However, although this attention to the context in which a word or phrase is spoken is essential, the end user is more likely to notice something even more appealing: the speech recognition is speaker independent and does not require any training.
In other words, you do not need to read some predefined text for hours to allow Speech to get used to you or your environment. This means that you do not need to worry about switching between Macs only because you would need to retrain the system.
Much in the same way, Speech can adapt itself to very diverse environments and is able to cancel out the background noise. Therefore, there is no need to pay much attention to your environment as long as the background noise stays constant -- think a restaurant where all the background conversations mix to create a relatively constant noise.
Thanks to this flexibility, Speech does not require any additional hardware. Of course, Speech addicts may wish to purchase a headset to further increase the accuracy of the speech recognition in hostile environments, but this really isn't needed as long as you plan to use it in regular conditions, such as a room of reasonable size with no strong echo -- an office, your living room, or the company's cafeteria, as opposed to an empty lecture hall, an underground cave, or an acoustic rock concert. This also means you do not have to wear the special noise-cancellation headphones provided by some other manufacturers. These are nice, but in many cases they do not provide any real help and are impractical.
However, to truly understand what makes the difference, we need to get a bit geeky and see in-depth how Speech works.
When you speak, the speech signal is analyzed by the front end, which takes care of the acoustic modeling by comparing the speech spectrum with stored statistical models for each phoneme it hears. These statistical models tell Speech how much variation it can expect for each sound.
Thanks to this comparison, Speech then determines which sequence of phonemes is most likely to have produced the sound and what the next-most-likely sequence will be. It assigns a score to each sequence -- a sequence representing a hypothesis about what has been spoken. Since this system relies on (statistically trained) guesses, hundreds or thousands of hypotheses can be generated for even simple strings.
This data is then passed to a second component of Speech that takes care of language modeling. This component knows some language models that represent what is possible and (even trickier) likely for you to say.
Speech uses what is called a "Finite State Grammar." In less technical terms this can be compared to the list of all spoken commands that Speech can expect at a given time. For example, the items stored in your "Speakable items folder" are part of this grammar. On a technical note, let's add that Apple's FSG can be infinite, thanks to recursion.
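To make the idea of a grammar that is finite in description but unbounded in what it accepts more concrete, here is a minimal sketch in Python. The "call" command, the state names, and the `accepts` function are all invented for illustration and say nothing about how Apple actually stores its grammars. A single loop in the state table (the simplest form of the recursion mentioned above) is enough to accept digit strings of any length.

```python
# Hypothetical grammar: the word "call" followed by one or more digits.
# States and transitions are invented for illustration only.
DIGITS = {"zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"}

TRANSITIONS = {
    ("start", "call"): "need_digit",
    # The first digit moves us into an accepting state...
    **{("need_digit", d): "digits" for d in DIGITS},
    # ...and this loop lets any number of further digits follow.
    **{("digits", d): "digits" for d in DIGITS},
}
ACCEPTING = {"digits"}


def accepts(phrase):
    """Return True if `phrase` is a sentence of the toy grammar."""
    state = "start"
    for word in phrase.lower().split():
        state = TRANSITIONS.get((state, word))
        if state is None:          # no transition: the phrase is out of grammar
            return False
    return state in ACCEPTING


print(accepts("call one two three"))   # in the grammar
print(accepts("call me maybe"))        # not in the grammar
```

The table itself has only a couple of dozen entries, yet the set of phrases it accepts has no upper limit on length.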
The hypotheses generated at the first stage are then confronted with this grammar. The one with the best overall score -- grammar combined with acoustics -- is considered the recognized string.
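The selection step can be sketched roughly as follows. This is a toy model, not Apple's implementation: the command list, the hypotheses, and their acoustic scores are made up, and the grammar is simplified to a hard yes/no filter rather than a score of its own.

```python
# Toy illustration only: commands and scores are invented.
GRAMMAR = {"get my mail", "what time is it", "tell me a joke"}


def recognize(hypotheses, grammar=GRAMMAR):
    """Pick the in-grammar hypothesis with the best acoustic score.

    `hypotheses` is a list of (text, acoustic_score) pairs; a higher
    score means a closer match to the sound that was heard.
    Returns None when no hypothesis belongs to the grammar.
    """
    allowed = [(text, score) for text, score in hypotheses if text in grammar]
    if not allowed:
        return None
    best_text, _ = max(allowed, key=lambda pair: pair[1])
    return best_text


heard = [
    ("get my male", 0.62),     # best acoustic match, but not a known command
    ("get my mail", 0.58),
    ("tell me a joke", 0.05),
]
print(recognize(heard))        # prints: get my mail
```

Note how the acoustically best guess loses because it is not something the grammar expects, which is exactly why a command-based system can be so robust.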
Although this "guessing" system may seem primitive, it is in fact the most robust of all. It is truly speaker independent, and as long as you pronounce a command that is part of the grammar, your chances of being recognized correctly are extremely high.
The main drawback is that you cannot say something that isn't part of the grammar with good chances of it being recognized. This is why, for example, Speech does not feature dictation -- although the Speech group at Apple has dictation technology running in their research lab.
Some competitors use another system called Statistical Language Model. To understand how this works, you have to imagine a big, two-dimensional table where every row and every column represents a word in the dictionary. Then, imagine that every cell contains the probability of seeing the row-word preceded by the column-word. I agree, it's not the kind of table we could draw ourselves, and even for scientists, these probabilities are quite complex to determine.
For every string of words you pronounce, an SLM system has to assign a probability for the string to occur. We Mac owners are very lucky since the already amazingly powerful G4 and G5 processors are specially designed to optimize these kinds of calculations.
The system will then pick the hypothesis that has the best chance to occur and doesn't care about grammar.
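For the curious, the two-dimensional table described above can be sketched in a few lines of Python. This is a deliberately tiny model under stated assumptions: the three training sentences are invented, the `<s>` start marker is a common convention rather than anything the article specifies, and real systems add smoothing so that unseen word pairs do not get a probability of exactly zero.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Estimate P(word | previous word) from example sentences."""
    pair_counts = defaultdict(Counter)
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split()   # <s> marks the start
        for prev, word in zip(words, words[1:]):
            pair_counts[prev][word] += 1
    table = {}
    for prev, counts in pair_counts.items():
        total = sum(counts.values())
        table[prev] = {word: n / total for word, n in counts.items()}
    return table


def string_probability(table, sentence):
    """Multiply the table entries along the sentence (no smoothing)."""
    words = ["<s>"] + sentence.lower().split()
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        prob *= table.get(prev, {}).get(word, 0.0)
    return prob


# Invented training text; a real model needs vastly more.
corpus = ["get my mail", "get my coat", "check my mail"]
table = train_bigrams(corpus)

hypotheses = ["get my mail", "get my male"]
best = max(hypotheses, key=lambda h: string_probability(table, h))
print(best)
```

Here the model prefers "get my mail" simply because it has seen "my mail" before and has never seen "my male"; no grammar is consulted at any point.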
As you can see, this system has the advantage of allowing dictation. However, it is very limited since the structure of language is extremely difficult to systematize. This forces the developers of such systems to add checks. In fact, the table I talked about at the beginning of this paragraph has more than two dimensions. Indeed, the fewer dimensions there are, the less complex are the dependencies that the program can correctly take into account. Plus, training such a model takes, according to experts, the equivalent of a 15-year subscription to the Wall Street Journal!
That's why today such systems are stuck with "trigrams" that are far from perfect but can be managed both by the developers and the computers on which the program runs.
Can you imagine how big it gets? The dictionary also limits the recognition power. That's why software manufacturers have to sell you special "expansion packs" for it to work in specialized environments.
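To get a feel for the sizes involved, a quick back-of-the-envelope calculation helps. The 20,000-word vocabulary below is an assumption picked for illustration, not a figure from any particular product, and `table_cells` is a hypothetical helper.

```python
def table_cells(vocabulary_size, context_words):
    """Cells in a table that conditions each word on `context_words` predecessors."""
    return vocabulary_size ** (context_words + 1)


VOCABULARY = 20_000   # assumed size, for illustration only

print(f"{table_cells(VOCABULARY, 1):,}")   # one word of context: 400,000,000
print(f"{table_cells(VOCABULARY, 2):,}")   # trigrams: 8,000,000,000,000
```

Every extra word of context multiplies the table by the size of the vocabulary, which is why stopping at trigrams is a practical compromise.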
But this technology is also sensitive to background noise and requires training not only to the environmental noises but also to your voice and to the way you speak, which must be calm, smooth, and regular.
On a side note, even if this technology is able to adapt to a wider range of situations and, therefore, allows dictation, the original texts used to train the software will affect how well it will understand you. For example, a system trained to help doctors at a hospital write their reports (a common application) won't perform well in another hospital since the doctors there won't have the same style.
Now that we have discussed what Speech can do for you, it's time to take a hands-on approach and actually try it. In this section, we will go step-by-step so that even beginners can use Speech right now. If you already feel comfortable with it, feel free to skim through the reading, although you may actually find some interesting tips.
The Speech preferences can be accessed through the "System Preferences" application. In order to turn the Speech recognition on, follow these steps:
As soon as you turn Speech on, you should see a small, round palette appear somewhere on the screen. This palette will be here for as long as the Speech Recognition engine is running -- whether the computer is actually listening or not.
Its contents are actually simple but provide you with very valuable feedback about what the engine does. Here is, from top to bottom what you will see:
The window is as unobtrusive as possible and really shouldn't interfere with your workflow. However, keep in mind that you can minimize it in the Dock by double-clicking on it. Isn't the "Genie effect" super-cool with this round window? In a typical Mac OS X way, it will continue to interact and behave normally, including displaying the animations we talked about previously.
To test Speech, tell your Mac to listen and say in a clear voice "What time is it?" Your computer will then answer, probably using Vicki's charming voice. If this is the very first time you try Speech, you may need to repeat this command a few times to let your Mac adapt to your voice and the background noise.
Now that you have turned the Speech recognition engine on, it's time to set it up for optimum results. This can be done through the Listening tab.
Listening continuously is a good idea if you use Speech frequently and work in a relatively quiet environment where the risk of a false positive is low (although Speech is extremely reliable). But since in this mode your Mac listens to everything and tries to understand it, it may mix up "Don't Worry, Be Happy" on iTunes with a "Get my mail" command.
In most cases, however, you should not encounter any issues, especially since you can fine-tune this command with the computer name. You can ask your Mac to wait until you say its name to listen to you -- just as your pet knows you are talking to him when you say his name.
The Name is: pop-up menu allows you to fine-tune your Mac's behavior. Since I live dangerously, I have set it to "Optional before commands," allowing me to indulge in talking impulsively with my Mac.
If you want to give a name to your computer, make sure you give it a name that you can easily pronounce but are unlikely to include in a normal conversation. Calling it "Coffee" or "Good morning" is likely to wake it up more than you would like. Good names are iMac, Zarvox, etc. For security reasons, never use your password, hostname, or IP address. (If you can pronounce your password, it is not secure anyway.)
Speech is able to work with a wide range of microphones, even the one built into your Mac. I'm particularly impressed with the one built into the G4 iMac screen, far away from the fan.
I wouldn't recommend trying to use your iSight as your microphone since you probably have better uses of the camera than controlling your computer. Anyway, Speech v. 3.3 does not support the iSight as an input device.
Would you want to use a headset with noise cancellation? If you work in a hostile environment, sure, as long as it is natively recognized by Mac OS X. Just make sure you select it in the Microphone pop-up menu. Keep in mind, however, that such devices may need time to process the signal and forward it to your Mac, making the speech-recognition process less efficient than it would be with a slightly higher error-rate but better speed.
I discovered with surprise that an old USB headset that came in an IBM ViaVoice box works on Mac OS X without drivers. For Mac OS 9 you have to install IBM's software first.
Of course, make sure that you pick the right microphone in the Microphone pop-up menu. Keep in mind that all the microphones located near you will hear your voice. Seeing the Speech window flash is in no way insurance that the right one is selected.
Have you noticed how you cannot understand someone who yells in your ear? The volume is too loud and you are unable to understand the meaning of the sentences. The same applies to your Mac.
Ensuring that your microphone operates at an optimum volume can make the Speech recognition system even more accurate. In order to set it up, click on the Volume button located next to the Microphone pop-up menu.
This will display a small window with a volume slider and commands on the right. Talk normally in your microphone and look at the volume indicator. If the red parts light up, the volume is too loud. If only the blue parts light up, it is too low. Ideally, the volume should consistently stay in the upper part of the green area.
Once you have found the optimum volume setting, make sure that your headset -- if applicable -- is locked into place and that the microphone cannot move too easily, which would effectively lower or raise the volume.
Then, to make sure that Speech understands you, read all the commands listed on the left part of the window until they blink. This is not "training" per se since Speech doesn't require training, but will allow you to adjust your voice volume, pitch, and speed to the best levels. It will also allow Speech to get used to the background noise in your environment to ensure optimum accuracy.
Now that the speech-recognition system is turned on, let's try a few things. Get ready to talk and read the following commands in a clear, normal voice.
Impressive, huh? Without touching your keyboard or mouse, you have quit System Preferences, checked your email, hidden Mail, and visited your browser's home page.
To learn more about the basic commands you could use, simply instruct your Mac, "Show me what to say." This will open a small palette with the list of all the commands you can use. To get rid of it, say "Close the Speech commands window."
Remember, Speech has a learning curve and you may need to practice a bit to reach 99% accuracy. It needs to learn to distinguish between the commands that are really useful to you and the ones that you'll only use to entertain your friends at your next dinner party.
One of my personal favorites is "Tell me a joke." You didn't know your Mac knew Knock-Knock jokes, did you?
If you change the environment you're in, keep in mind that Speech may need to adapt to the different background noise to provide you with perfect accuracy.
The easiest way to help the Mac adjust is to let it listen for some background noise for a few seconds and then to speak a command that it cannot misunderstand. One of them is "What time is it?" Wait until the reply arrives and you're ready to go.
The recommended way is, of course, to read again the commands listed in the "Volume" sheet. This will greatly speed up the adaptation process and is worth doing if you plan to stay where you have moved. If the speech recognition engine seems suddenly slower, give it a try.
Surprisingly, yes. Indeed, to a certain extent, Speech will try to understand your command even if you do not get it right immediately. For example, "Get my mails" and "Get my mail" will work the same way.
However, you should not expect Speech to understand sentences that are too different from what the developer intended. If you think that a command is so unnatural that you won't be able to learn it, you may want to create a custom command that will be more natural to you.
If you're ready to explore the latest developments of the Speech technology, you can turn on Panther's Semantic Inference feature. Under this strange-sounding name hides a technology that allows Speech to understand what you say, even if you do not speak the predefined command.
When this is turned on, you can replace "What time is it?" with "What is the time?", "Tell me the time," or even "How late is it?"
Since this technology is still at its early stages of development, Apple chose to turn it off by default. Its accuracy may not be perfect (yet) and it may slow the speech-recognition engine down a bit. In my experience however, all worked perfectly well, so I would encourage you to give it a try.
To do so, follow these steps:
To test it, read the sentences suggested by the activation sheet and be amazed.
Now that you have discovered the joy of Speech, it's time to go one step further and learn how to almost completely get rid of your keyboard and mouse.
For now, you may have noticed that many commands are still out of your reach, including menu items, toolbar buttons, etc. The good news is that you can control them with Speech too, making your keyboard and mouse almost obsolete.
In order to turn this option on, follow these steps:
Now a whole new world is open to you. Try to say the following commands to show or hide the volume in the menu bar:
This gives you a lot of power over your applications and dialog boxes. Unfortunately, some nonstandard controls will not work with this method. Also, you probably will not be able to pick items in complex lists by using Speech. However, most of the functionality of most applications will be available via voice commands.
Even more powerful and more universal is the menu bar. Indeed, you can control it by voice. Since almost all menus are standard, you can access most of the menu commands in your applications without any issue.
To shut your Mac down, you would say:
This is all very nice but, sometimes, giving a menu and a menu-item name to perform a simple action can be a bit bothersome. That's why the Speech development team introduced a very nifty command that allows you to enter any keyboard shortcut simply by saying "Define new keyboard shortcut."
A palette will then pop up, allowing you to enter the keyboard shortcut and the voice command you wish to associate with it. You can use such a command to, for example, create a "Close tab" command in Safari or a "New chat with" feature in iChat. Users with disabilities could create a custom command for "Zoom in" and "Zoom out."
Of course, since Panther allows you to define custom shortcuts through the Keyboard preferences pane, this feature is even more powerful than one could think at first sight.
Spending your day in front of your screen isn't always fun, as enjoyable as using a Mac can be. Therefore, you may, from time to time, wish to be able to step away from your computer -- when a long task is running, for example -- but without losing contact with your Mac in case something important happens.
That's pretty simple. Indeed, Mac OS X now features "talking alerts" -- this feature will cause your Mac to read the alert messages that may pop up on your screen if you do not reply to them after a predefined delay.
This feature can also be very useful in an environment where multiple computers run at the same time -- a print shop or a computer lab in a school. Wouldn't it be nice to hear in a clear, distinctive voice "The PowerMac G5 next to the window needs your attention. The printer is out of paper," instead of a "Bong!" that you would need to track down?
In order to benefit from this feature, use the "Spoken User Interface" tab of the "Speech" preferences.
You can then define what the computer will do and after how long it will talk. I wouldn't recommend that you set a short delay since having the Mac read the alert while you are already reading and reacting to it may be annoying. Setting it to 10 seconds gives you the time to react if you are already in front of the screen.
Your Mac can also read alert windows that, for any reason, would pop up behind your current application or working document.
The "Announce when an application requires your attention" option can also be a time saver. Indeed, while you are working, you may not notice the icons furiously bouncing in your Dock but will certainly hear "Safari needs your attention."
Like many users, your workflow may require you to access documents that are buried in your folder hierarchy. Luckily, you can easily create a "command" that tells Speech to open them in the blink of an eye.
In order to do that, simply create an alias of the folders that you commonly use in the following folder:
[Home] -> Library -> Speech -> Speakable Items
Now, wherever you are, you simply need to say the name of the folder to open it. To make the alias creation process easier, remember that holding the Option and Apple keys while dragging an icon creates an alias.
Making your own items able to be invoked by speech can itself be achieved by speech. Merely click on the item in the Finder and say, "Make this speakable." Speech will take care of making the alias, putting it in the Speakable Items folder, and removing the word "alias" from the alias.
Of course, you have to be careful not to drop any alias with a name that would match the one of an existing command too closely. Otherwise, you may end up opening this folder unwillingly. To avoid this, simply change the name of the alias and all will be well again.
Even cooler, you can put in there aliases to documents that you open often or the HTTP files that Mac OS X creates when you drag a URL from a browser's address bar onto the desktop. Just make sure that you give these files a name that will be relatively easy to pronounce -- for example, remove the extensions if possible or you will have to pronounce "filename dot extension."
When adding aliases and interacting with buttons or menu items simply is not enough, keep in mind that both AppleScript and the Terminal can work closely with the Speech technology.
For example, here is how to write a script that will read a string of text ...
... in AppleScript:
say "This is something very cool very cool very cool this is something very cool that every Mac can do!" using "Cellos"
... in the Panther Terminal:
say -v Cellos "This is something very cool very cool very cool this is something very cool that every Mac can do"
Note that the voice you pick will be ignored by AppleScript if Voice Recognition is turned on. This is a feature that allows users to enjoy consistency in the dialog they have with their computer.
When using the "Saving to file" option, however, the voice you pick is used, since the consistency of the interaction with the user is no longer a concern.
The ability to interact with the Speech Synthesizer even if you are not a developer will allow you to add speech capabilities to the Terminal scripts or AppleScripts that you already use in your daily workflow without having to learn a whole new set of commands or language.
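The same trick carries over to scripts written in other languages, since the Terminal's `say` tool can be launched like any other program. Below is a minimal sketch in Python; `build_say_command` and `announce` are hypothetical helper names, only the `-v` option shown earlier is used, and the helper quietly does nothing on machines where `say` is not installed.

```python
import shutil
import subprocess


def build_say_command(text, voice=None):
    """Return the argument list for the `say` tool."""
    command = ["say"]
    if voice:
        command += ["-v", voice]      # same -v option as in the Terminal example
    command.append(text)
    return command


def announce(text, voice=None):
    """Speak `text` if `say` is available; return True if it was spoken."""
    if shutil.which("say") is None:
        return False                  # not on a Mac: stay silent rather than fail
    subprocess.run(build_say_command(text, voice), check=True)
    return True


# Example: report the end of a long-running step in an existing script.
# announce("The backup has finished", voice="Vicki")
```

Passing the text as a separate list element, rather than building one long command string, avoids quoting problems when the message contains apostrophes or quotation marks.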
Now that your existing scripts have gained the ability to speak to interact with you, wouldn't it be even better if they could listen? Well, Apple already thought of it and all the information that you need to create complex listen-and-tell scripts can be found on this page.
That way, you can create even more complex speakable items that will start a true dialog with you and react depending on your needs and answers.
Indeed, I do! The first thing to do is to over-use the "Show me what to say" command and to try to do as much as you can with Speech. At first, it may look like you are actually losing time since you need to learn the commands and sometimes learn to speak into the microphone.
However, very quickly, you will see that you can do almost everything with Speech and get completely rid of meaningless alert sounds, creating a true dialog with your computer.
Many applications are speech-ready -- iChat, for example, can read aloud the names of the people who invite you to a chat, but this option is turned off by default. It is worth taking the time to learn what each one can -- and cannot -- do.
After a few days of practice, I am glad to say that I now can use my Mac without a keyboard or mouse for most of the day, except when typing, of course.
On some occasions, you may want to create a sound file from the text generated by the speech engine. The easiest way to do so is to use an AppleScript command like this:
say "This is something very cool very cool very cool this is something very cool that every Mac can do!" using "Cellos" saving to "Cool.aiff"
When you run this script, it creates a file at the root level of your hard drive, containing the sound that you would hear if the synthesis had happened on-the-fly.
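If you would rather choose where the file ends up, the Terminal's `say` tool can also write its output to a file through its `-o` option. The sketch below wraps that in Python; it assumes a version of `say` that supports `-o`, and `build_save_command` and the output path are hypothetical names used for illustration.

```python
import os


def build_save_command(text, output_path, voice=None):
    """Return the `say` argument list that writes speech to a sound file."""
    command = ["say"]
    if voice:
        command += ["-v", voice]
    # -o tells `say` to write the audio to a file instead of the speakers.
    command += ["-o", os.path.expanduser(output_path), text]
    return command


command = build_save_command(
    "This is something very cool that every Mac can do!",
    "/tmp/Cool.aiff",              # assumed location; pick any writable folder
    voice="Cellos",
)
print(command)
# On a Mac, the file could then be produced with:
#   import subprocess; subprocess.run(command, check=True)
```

Writing to an explicit path keeps generated sound files out of the root level of your hard drive.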
To achieve the same effect, you can also use the demo pages of the AT&T "Natural voices" technologies. Indeed, to demonstrate their system, AT&T allows you to type text into a web form and to download the resulting file. The main advantage of it is that it allows you to read text in many languages.
Here is the demo page. Of course, since there are certain limitations and copyrights that apply, I encourage you to read the Terms and conditions first. You should also keep in mind that this system is targeted at professional frameworks and that it runs on powerful servers.
During the preparation of this article, I had the opportunity to talk with Kim Silverman, principal research scientist, manager, spoken language technologies at Apple. May he find here the expression of my gratitude for the information he so kindly provided.
Needless to say, any errors or inaccuracies in the preceding pages remain entirely my responsibility.
FJ de Kermadec is an author, stylist and entrepreneur in Paris, France.
Copyright © 2009 O'Reilly Media, Inc.