Assistive Technology Research Institute
College Misericordia - Dallas, PA 18612
Founded and Sponsored by the Sisters of Mercy of Dallas


Does Speech Recognition Deserve Recognition? A Study Comparing the Efficacy of Speech Recognition vs. the Standard Keyboard

Denis Anson, MS, OTR; Luann Daveski, OTS, Patrick Chavannes, OTS, and Karen Shaughnessy, OTS

Please address all correspondence related to this article to the first author. Email:


Voice recognition has been in development for over 40 years, and has been entertained as an alternative access method for individuals with disabilities for almost as long. With each generation of voice recognition software, recognition speed and accuracy have improved. Each release has been accompanied by advertising suggesting that, at last, voice recognition is a viable replacement for the keyboard.

Studies of voice recognition by its advocates tend to use formats where the system excels, while those by detractors use formats where the method is weakest. Neither of these approaches gives good information about the performance of voice recognition in the production of real-world documents.

This study used a repeated measures design to compare the speed of a current speech recognition product (Dragon Naturally Speaking 7) with that of the keyboard in producing documents that include headings, tables, and special symbols as well as prose.

The results indicate that, with current technology, speech recognition times for complex documents are approximately twice those for keyboard use, and accuracy is substantially lower. While recognition was good, mouse navigation and error correction remain difficult for the typical user.

These results suggest that speech recognition technology does not yet provide an alternative access method to allow equal performance for the individual with a disability who must work hands-free, but may be acceptable for the person who is able to use the mouse and keyboard intermittently for those occasions where speech is not a strong input method.

Key Words: Computer Access, Voice Recognition, Disability, Assistive Technology


Today, computers are a very important part of our lives; indeed, for many people they are a necessity. Computers are now a part of the curriculum at schools, are instrumental in most places of business, and are found in most people's homes. In the school system, students use the computer at every grade level. In kindergarten, students are primarily learning how to use the keyboard and the computer for basic activities. As students move up in grade level, they use the computer for more complex tasks and learn how to access the internet (US Department of Labor, 2004). In the workplace, computers are no longer a luxury. Modern computer software is essential in order for businesses to thrive. As the Nobel prize-winning economist Robert Solow explains, "[The] application of information technology does improve productivity. Since the 1970s, productivity has grown about 1.1 percent per year for sectors that have invested heavily in computers and approximately 0.35 percent for sectors that have invested less heavily. Research by MIT economists shows that in the 1990s, computers contribute significantly to firm-level output and productivity. But the effects have been concentrated in a limited number of firms and industries" (The Progressive Policy Institute, 2002).

Word processors are used in place of typewriters and have a variety of tools, such as grammar and spell check, to improve the written text. Electronic databases allow the organization and manipulation of vast stores of information. Data mining is a relatively new field which "uses statistical analysis and modeling techniques to uncover predictive patterns and relationships hidden in organizational databases-patterns that ordinary methods might miss" (Two Crows Corporation, 2004). Worldwide, organizations of all types are benefiting from this technology because data mining helps to increase revenues, through improved marketing, and helps to reduce costs, through detecting and preventing waste and fraud (Two Crows Corporation, 2004). Spreadsheets are used not only for tracking this year's budget, but also for predicting the consequences of suggested changes to next year's budget. Graphic programs allow the user to create or manipulate illustrations. Communication tools include e-mail, bulletin boards, and chat rooms, which enable users to communicate with people from all over the world (Anson, 1997). In addition to these, instant messaging has become a popular communication tool within the office environment (Granite State Internet, 2002).

Computers are used in the home for various purposes or tasks. An individual may use the computer to complete a research project for school, to write a report for work, or to pursue a hobby or other leisure activity. In addition to this, shopping can be done via the internet for those who lead busy lives. "Total e-commerce sales for 2002 were estimated at $45.6 billion" (Scheleur, 2002). This number does not reflect the number of people who research products on-line before making a purchase from a traditional store. Some family members may use the computer to play games, while others may use it as a means of communicating with others. By the end of the 20th century, the majority of households had at least one computer (Newburger, 2001).

The computer can give individuals access to meaningful activities. An individual who has moved away from family and friends might use a computer to remain connected to local happenings. A student may be studying computer programming. An individual may use the computer to shop for clothing via the internet at a time and place of his or her choosing rather than during store hours or at the physical location of the retailer. These goal-directed activities can be meaningful to individuals.

Although it was originally marketed as an alternative access system for individuals with disabilities, speech input is now becoming a mainstream access method. Several factors, which vary among computer users, motivate able-bodied individuals to use a speech recognition system. For instance, "Some executives, in particular, may regard using the keyboard as demeaning" (Hawkins, 1994). Some people may see voice input as a replacement for poor typing skills. Finally, some settings demand that the individual record thoughts while their hands are busy with another task (Karampahtsis, 2001). For the person who cannot use their hands to manipulate the keyboard or mouse, either because of job demands or physical limitations, speech is often seen as a means of controlling the computer.

Occupational therapy is the art and science of helping people do the day-to-day activities that are important to them despite impairment, disability, or handicap (Neistadt & Crepeau, 1998). "'Occupation' in occupational therapy does not simply refer to jobs or job training; occupation in occupational therapy refers to all of the activities that occupy people's time and give meaning to their lives" (Neistadt & Crepeau, 1998). Since computer access has become a "day-to-day" and "meaningful" activity for many, providing the means to participate in this activity has become part of occupational therapy.

The clients for whom computer access may be an issue are varied. For some, computer access is already a meaningful activity, but an acquired disability interferes with computer access. Others with no history of computer access wish to participate in an activity that, in light of functional limitations, can best be enabled through computer technology. In these cases, the occupational therapist will need to be knowledgeable about the different options available for computer access. In order to provide optimal access to activities enabled by the computer, the occupational therapist must understand both the demands of various alternative access devices and the potential productivity offered by those devices (Pedretti & Early, 2001).

Difficulty in accessing a computer may stem from a sensory loss such that a person cannot easily get information from a computer or may be caused by a loss in movement so that a person has difficulty getting information into a computer. In those cases where the limitations are produced by difficulty in getting information into the computer, a wide range of alternative input devices are available. Perhaps no input device since the mouse has stirred as much interest and controversy as voice input. In order to recommend this type of device, the occupational therapist must be sure that the demands of this device do not exceed the client's capabilities.

While many see computer input by voice as equivalent to talking to a person, experience has shown that the demands are much greater. For instance, the client must have the cognitive ability to use this system. "Voice recognition ... requires a heavy cognitive load, including memorization of multiple commands and the ability to differentiate when to use each command. In addition to this, an individual must have the ability to track multiple sequences of events, be able to complete multiple step commands as well as have the ability to generate spontaneous text. A strong respiratory support is also needed when using this system" (Johnson & Morris, 1999). The client's voice quality and ability to enunciate clearly are important aspects to consider as well. In addition to this, the occupational therapist must consider the environment in which the voice input system will be used. Because voice input is inherently noisy, the client may disturb others while talking into the computer (Pedretti & Early, 2001).

While speech recognition has been recommended for individuals with deficits ranging from repetitive stress injury to high-level spinal cord injury, its utility has not been clearly demonstrated. In the absence of controlled research on the usefulness of speech recognition, it is being recommended based on trial and error (Hubbard & Spaeth, 2003). Many people think that they want speech recognition, but in the real world, most individuals with disabilities abandon this technology rather quickly (Koester, 2003a).

History of Speech Recognition

Research on speech recognition systems began some forty years ago. Early versions of speech input were plagued by slow performance, high error rates and exorbitant costs. Early speech recognition systems required discrete speech, in which the user paused briefly after each word. While advertisements claimed very high rates of input, practice did not always bear out these claims. "Text generations speeds of 50 to 100 wpm for products such as Dragon Dictate have been reported ... but our clinical experience indicates an average speed of 25 to 30 w.p.m" (Kotler & Tam, 2002). Today Dragon Systems, the producer of Naturally Speaking, claims that a person can dictate up to 160 w.p.m., although this rate would not be attainable by the average user (Scansoft, 2004b). In 2003, Horstmann-Koester explored the long-term experience of individuals with disabilities using automatic speech recognition (ASR). In this study, she found that "Text entry rate with speech ranged from 3.5 to 31.7 w.p.m., with an average of 16.7 w.p.m." (Koester, 2003b).

In addition to the issues of input rate, accuracy has also been lower than claimed by the vendors of speech input technology. Alwang claims, "IBM ViaVoice for windows Pro USB Edition Release 9 ... transcribes your spoken words more quickly than previous versions. As with past versions ViaVoice returns high recognition accuracy from the start. On our tests, initial accuracy was just above 92 percent. After a couple hours of correction and use of the Analyze Documents feature, it climbed to 98.5 percent" (Alwang, 2002). When used with individuals with disabilities, the outcomes are less rosy. Kotler and Tam (2002) reported initial speech recognition accuracy of 74% for people without speech difficulties and average accuracy of 57% for people with speech difficulties.

The cost of speech recognition has also changed significantly. The cost of Dragon Dictate version 1.0 in 1995 was $395.00. Currently, "Computer users can upgrade to Dragon Naturally Speaking Preferred 7.0 for $99.00" (Scansoft, 2004a). In Office XP, speech recognition is included at no additional charge (Microsoft, 2004).

While developers claim to have resolved many of the initial problems of speech recognition, there remain other, more persistent problems. For instance, the use of speech recognition technology may produce voice strain. "There have been many reports of people contracting voice strain from the use of both discrete and continuous speech. Depending on your predisposition and usage of the speech recognition system, voice strain can happen as quickly as within one month of use, or it may never happen at all" (Fox, 1998).

While much has been claimed for speech recognition as a method of input, clinical experience does not always bear out these claims. The outcomes reported from studies of speech recognition vary widely, possibly because of the changing nature of the technology, and possibly because of differing methods of assessment. It is important, therefore, to continually monitor the state of the technology. This information is essential to making informed decisions.

Utilization of Speech Recognition

While early speech recognition systems required discrete speech, and therefore performed well for the entry of single words or numbers, current speech recognition engines are much more accurate in recognizing long phrases. In a study by Mitchard and Winkels, the speed of entering numbers, lists, and phrases was compared between speech and keyboard. While the results varied, the researchers concluded, "If you type faster than 45 WPM, you can enter data more quickly by typing than by error-free speech entry. A slower typist would be quicker by speech entry if it could be done without errors (Italics ours)" (Mitchard & Winkels, 2002). Hartley, Sotto & Pennebaker (2003) explored changes in writing style following the introduction of speech recognition technology. While there were no differences in typographical or grammatical errors based on input method, the dictated documents generally used shorter sentences and showed an increase in the use of the first-person pronoun.

In general, the positive claims for speech recognition technology rely on extreme cases and on conditions where the technology is strongest. The studies of rate generally use methodologies where speech input is weakest and use contrived language samples that do not reflect real-world usage. Neither of these approaches gives significant information about the use of speech recognition in everyday typing. In order to explore the utility of voice input for daily typing, it should be evaluated for a combination of prose and single utterance inputs.

Because of the attractiveness of speech input for people with conditions that limit use of the keyboard, and because of the confusing mix of claims and evidence, it is important for an occupational therapist to have evidence on which to base recommendations about input method.

This need leads us to our research questions:

  • Does a modern voice input system allow creation of documents including prose, tables, and formatting at speeds that are comparable to the rates for keyboard entry?
  • Will a modern speech recognition system allow the user to produce fewer errors than the standard keyboard for similar documents?


Research Design

This study used a single subject, successive intervention design.


This study used a convenience sample of 9 able-bodied individuals with ages between 27 and 49 years. The sample included 7 females and 2 males. The participants had fairly unaccented speech except for one Chinese and one Haitian participant, both of whom had significant accents. The participants presented with no apparent physical or cognitive disability. All participants in the study were able to read printed English fluently from samples in 12-point Times Roman font, and none used speech recognition for more than 50% of their typing.


Computer System/Software

The computers used in the study had a minimum 700 MHz Pentium III processor, and more than 190 MB of RAM. The headset microphones that the participants used for this study included two Logitech Premium Stereo Headsets and a USB Digital Stereo Computer Headset, Telex Model H-551. These microphones were designed for use with speech recognition. The speech recognition system used was Dragon Naturally Speaking Preferred, Version 7.

Each researcher used a Sportsline, model 226 stopwatch to time the keyboard and speech recognition entries.

The text entered in this study consisted of eight versions of an extract taken from Visual Frequency Feedback and its Effect on Wheelchair Propulsion Kinetics. Each document had the same content: prose, headings, and a table which included special characters (e.g. "±"). In each of the eight versions of the document, the content of the paper was rearranged (at the paragraph level), so that the user could not dictate the paper from memory. The similarity between versions of the document reflected the similarity of the work of a typist who writes using a specialized and job-specific vocabulary.

Operational Definitions

Error - An error was defined as any deviation in the user-entered text from the source document as detected by the Microsoft Word "Compare Documents" feature. Deviations included, among others, incorrect words or words in incorrect order. Each identified deviation from the original text, although it may have included several word differences, was considered a single error. For the data tables, all differences within a single cell were counted as a single error. While this method carried the potential to undercount differences between the source and subject-generated documents, it allowed for unambiguous scoring when complex patterns of deviation occurred.
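
The counting rule above (each contiguous deviation is one error, however many words it spans) can be illustrated with a short sketch. This is not the procedure the study used: it swaps in Python's standard difflib module for Microsoft Word's "Compare Documents" feature, compares whitespace-separated words only, and does not model the per-cell rule for tables. The function name and sample sentences are hypothetical.

```python
import difflib


def count_errors(source_text: str, typed_text: str) -> int:
    """Count contiguous runs of deviation between two texts.

    Each run of inserted, deleted, or replaced words counts as one
    error, no matter how many words it contains.
    """
    source_words = source_text.split()
    typed_words = typed_text.split()
    matcher = difflib.SequenceMatcher(a=source_words, b=typed_words, autojunk=False)
    # get_opcodes() yields (tag, i1, i2, j1, j2); every tag other than
    # "equal" marks one contiguous block of deviation.
    return sum(1 for tag, *_ in matcher.get_opcodes() if tag != "equal")


if __name__ == "__main__":
    source = "mean push frequency was higher with feedback"
    typed = "mean bush frequently was higher with feed back"
    # Two separate runs of deviation, although four typed words differ.
    print(count_errors(source, typed))
```

As the definition notes, grouping adjacent wrong words into a single error can undercount differences, but it keeps the scoring unambiguous.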

Plateau - A participant was considered to have achieved a plateau when three consecutive trials had entry times within 7% of each other. The 7% standard has been derived from past studies, in which it was found that many individuals have speed variations of greater than 5% in normal typing, while a 10% standard can be achieved while a participant continues to make significant gains in performance. Using a standard of 7% seems to balance these two difficulties.
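
The plateau rule can be sketched as a small check over a participant's trial times. The text does not say how "within 7% of each other" was computed; the sketch below assumes the spread between the slowest and fastest of three consecutive trials is compared with 7% of the fastest time. The function names and trial times are hypothetical.

```python
from typing import Optional, Sequence


def within_tolerance(times: Sequence[float], tolerance: float = 0.07) -> bool:
    """True when slowest minus fastest is at most `tolerance` of the fastest.

    Assumption: the spread is measured relative to the fastest time.
    """
    fastest, slowest = min(times), max(times)
    return (slowest - fastest) <= tolerance * fastest


def first_plateau(trial_times: Sequence[float], window: int = 3,
                  tolerance: float = 0.07) -> Optional[int]:
    """Return the index of the first trial of the first plateau, or None."""
    for start in range(len(trial_times) - window + 1):
        if within_tolerance(trial_times[start:start + window], tolerance):
            return start
    return None


if __name__ == "__main__":
    # Hypothetical entry times in seconds for five successive trials.
    times = [600, 540, 500, 490, 480]
    print(first_plateau(times))  # index of the first trial in the plateau window
```

Raising or lowering `tolerance` shows the trade-off described above: a looser standard is met while times are still improving, and a tighter one may never be met because of normal variation.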


Work Station Setup

The printed text for each session was attached to a vertical paper holder and then placed on the participant's preferred side of the computer monitor. Prior to the beginning of each voice input session, a guard was placed over the keyboard to keep the participants from using the keyboard when they became frustrated or had difficulty. The headset microphone was placed on the right side of the desk within the participant's reach.

Prior to the beginning of data collection, each participant was given a standardized training in the word processor and speech input technologies. First, each participant was asked to type, using the keyboard, a document that was similar to the test document in complexity, but which was derived from a different source. During this typing, each participant was given verbal assistance as needed in the entry of headings, tables, and special characters.

Following the keyboard training, the participants were introduced to speech recognition. The participants trained the speech recognition program using the standard enrollment procedure. Each participant trained the system using the "Talking to a computer" and "3001" training samples to control for possible differences in efficiency of the different training options. Following this training, the participants were asked to enter the training document using only voice input. Again, verbal assistance was provided as needed for formatting, tables, and special characters, as well as mouse movement. In addition, the Dragon Naturally Speaking program imported the source documents to learn the word order and frequency of the task.

After the training was completed, the participant began the testing phase of the study. This study used a balanced order of initial devices to control for possible effects of learning to use the input devices. Half of the participants began the trials using the keyboard and finished with speech recognition. The remaining half started with speech recognition, and then used the keyboard for the second half.
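
A balanced order of this kind can be sketched by alternating the starting device across participants. The text does not describe how participants were actually assigned, so the alternation below is only an assumption, and the function name and participant numbers are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def assign_device_order(participant_ids: Iterable[int]) -> Dict[int, Tuple[str, str]]:
    """Alternate which input method each participant uses first."""
    devices = ("keyboard", "speech recognition")
    return {
        pid: (devices[index % 2], devices[(index + 1) % 2])
        for index, pid in enumerate(participant_ids)
    }


if __name__ == "__main__":
    for pid, order in assign_device_order(range(1, 10)).items():
        print(pid, "->", " then ".join(order))
```

With an odd number of participants the two groups cannot be exactly equal; alternation leaves one group larger by one.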

For each trial, the participants were seated at a table or desk in a comfortable position (determined by the participant). The computer was set up on the table or desk with the screen from 18-36 inches from the participant, with the precise distance determined by the participant. A "cuing sheet" was provided to each participant on how to accomplish the entry of non-keyboard characters and formatting tasks, to assist with performance during the session.

Preparation of the Participant:

Once the participant and computer were prepared, the researcher instructed the participant, "When I say go, I would like you to use (the keyboard, the speech recognition system) to reproduce this document as quickly and as accurately as you can. Are you ready? Go." The researcher started the stopwatch, and the participant began (typing or using speech recognition). Once the participant finished duplicating the document, the participant said "print" to indicate that he or she was finished. At this time, the researcher stopped the timer and saved the participant's document along with the trial number and trial time.

This procedure was repeated by the participant using different versions of the document until reaching plateau. If a participant began the trials using the keyboard, after reaching a plateau, he or she then began the trials using speech recognition and continued until a plateau had again been reached.

Data Analysis

The participants' typed and speech recognition texts were compared to the original document using Microsoft Word's "Compare Documents" feature. The amount of time necessary to reproduce the documents was recorded for each trial. The percent accuracy of the document was calculated using the formula:

Percent Accuracy = ((Total words - number of errors) / Total words) * 100
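
As a worked illustration of the formula, the following sketch computes percent accuracy for made-up counts. The function name and the numbers are hypothetical and are not taken from the study data.

```python
def percent_accuracy(total_words: int, errors: int) -> float:
    """Percent Accuracy = ((Total words - number of errors) / Total words) * 100."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    # Multiplying before dividing is algebraically the same as the formula
    # and avoids a small rounding artifact for simple ratios.
    return 100 * (total_words - errors) / total_words


if __name__ == "__main__":
    # Hypothetical document of 400 words with 20 scored errors.
    print(percent_accuracy(400, 20))
```

Because a single scored error may cover several wrong words, a figure computed this way is an upper bound on word-level accuracy.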

The speed and accuracy of voice versus typing input were plotted on a line graph. Fluency was considered to have been reached when the celeration lines displayed a plateau (when the entry times of three consecutive trials with an input method were within seven percent of each other).


Table 1. Mean times and errors at plateau for keyboard and voice.

Subject No. | Keyboard Time (Seconds) | Voice Time (Seconds) | Keyboard Percent Accuracy | Voice Percent Accuracy

Text entry by voice, once the system had accommodated to each subject's voice profile, was relatively fast and simple. Most of the problems and delays occurred when a word was misrecognized, or the document required additional formatting.

With speech recognition, recognition of the error correction commands depended highly on the voice quality of the speaker. What worked for one person did not necessarily work for another. To delete a misrecognized word, some of the command options included "scratch that", "delete that", "delete next word", "delete previous word", and "backspace". (Note: these commands do not all have the same effect on subsequent recognition accuracy.) Participants with accented speech had marked difficulty with error corrections. For example, while the command "scratch that" worked for most of the participants, it was not recognized for one Chinese participant. Instead, she used "undo that" to delete erroneous words or characters.

Mouse navigation also presented significant problems when using speech recognition. The insertion point would often jump around while entering text, and words would be inserted in the wrong place. At times, it would seem to disappear altogether. At these times, manual placement of the cursor was necessary in order to proceed. For example, in order to merge cells in a Microsoft Word table, the user must select the cells to be merged, then issue the "merge cells" command. Using voice, it was necessary to manually insert the cursor in a cell before using the command "drag mouse left" to select the desired cells. On many occasions, when the command to stop dragging the mouse was given, it would not stop. On other occasions, the command "drag mouse left" would erroneously produce text instead of dragging the mouse. At times the cursor would freeze and not respond to any commands. After inserting symbols, some participants had difficulty closing the option dialog. At these times, the researcher would close the window using the physical mouse. Once the document table was completed, most participants were unable to position the cursor to continue with text.


The results of this study indicate that, for the typical individual, speech recognition does not yet offer a reasonable alternative to the keyboard and mouse. For the typical keyboard typist, one word in 100 included an error. For the same individuals using voice input, one word in 20 was incorrect. A typical typist had about three times as many errors in the finished document using voice as using the keyboard.

Similarly, the time to produce the document by voice was typically about twice as long as with the keyboard. (Note: while these were experienced typists, none of the participants was a professional, or even a very accomplished, typist.) The primary limitation in processing speed appeared to be the effort required to correct misspoken or misrecognized words. The participants often expressed frustration when attempting to correct an error because the commands did not respond consistently. As a result, the process was particularly time consuming. Mitchard and Winkels (2002) noted that "A slower typist would be quicker by speech entry if it could be done without errors (Italics ours)." Our results indicate that entry by voice is unlikely to be done without errors, and that correcting the errors poses a substantial barrier to productivity.

Mouse movement by voice was another area of inefficiency and delay. The mouse commands appeared to be unreliable, even for an individual user. A command might work at one time, and not at another. At times the mouse would seem to move spontaneously through the document, or to freeze, and refuse to move at all.

While simple prose insertion by voice is fairly successful, any additional formatting seems to be a substantial burden. With state-of-the-art speech recognition, recognition accuracy of text is less of an issue than are error correction and navigation commands. The focus of claims for each new version of speech recognition software has been improved recognition accuracy. It would appear, from the findings of this study, that developers would be well served by investing additional effort in finding better ways of correcting the errors that inevitably occur, and in the non-keyboard aspects of document generation.

Current speech recognition systems do not appear to provide a viable input system for the individual who is not able to use the keyboard and mouse at all. Rather, it might be useful for the person with limited endurance who can use the keyboard and mouse for short periods, but not for continuous text generation. Such an individual might insert text into the computer by voice, and then use the keyboard and mouse to correct errors and insert formatting as needed. (Microsoft, in the Speech Recognition help files of Office XP, does not even suggest attempting to use voice to correct errors.) For the individual who must work hands-free, hybrid input systems might be useful.


Current speech recognition systems do not appear to present viable alternatives to conventional input systems, either for able-bodied typists or those with disabilities. While text entry was relatively easy by voice, error correction and formatting presented substantial barriers. We found that a typical individual required approximately twice the time to complete a document containing a mixture of elements by voice as with the keyboard. The accuracy of the resulting document was also lower than for one produced by the keyboard. A typical individual could reproduce a complex document with 97% to 99% accuracy using the keyboard, but with only 90% to 95% accuracy by voice. Further, while it is possible to begin using speech recognition with only 20 minutes of training, as proclaimed by the manufacturers of speech systems, we found that over two hours of training were necessary to learn the navigation and correction commands to a level of beginning fluency.

It may be that voice does not provide a viable model for mouse navigation, one of the largest problems discovered in this study. Further study might show that hybrid voice/mouse emulation systems would provide a stronger alternative input system than either system alone. Such combinations might include the use of speech recognition and head-pointing for example, as an input system for the individual who is not able to use the keyboard and mouse at all.


Alwang, G. (2002, Feb. 26). Voice recognition: Getting better . Retrieved May 10, 2004, from,1759,54634,00.asp

Anson, D. (1997). Alternative Computer Access: A Guide to Selection . Philadelphia: F. A. Davis.

Fox, D. (1998). Avoiding Voice Strain . Retrieved May 10, 2004, from

Granite State Internet. (2002). Services for Business: Instant Messaging . Retrieved May 10, 2004, from

Hartley, J., Sotto, E., & Pennebaker, J. (2003). Speaking versus typing: a case-study of the effects of using voice recognition software on academic correspondence. British Journal of Educational Technology, 34 (1).

Hawkins, D. T. (1994). Breaking the keyboard barrier: Voice input to Information retrieval systems. Online, 18 (6), 66-71.

Hubbard, S. L., & Spaeth, D. M. (2003). Rate, Accuracy and Efficiency of Text Entry As a Function of Different Computer Access Methods. Paper presented at the RESNA 26th International Annual Conference, Atlanta, Georgia.

Johnson, K. L., & Morris, S. (1999, November). Update on Voice Recognition: Will it work for you? Retrieved May 10, 2004, from

Karampahtsis, A. (2001). Early Targets for Speech Recognition Technology Are Becoming Defined. Speech Technology Magazine, 6 (5).

Koester, H. H. (2003a). Abandonment of Speech Recognition by New Users. Paper presented at the RESNA 26th International Annual Conference, Atlanta, Georgia.

Koester, H. H. (2003b). Performance of Experienced Speech Recognition Users. Paper presented at the RESNA 26th International Annual Conference, Atlanta, Georgia.

Kotler, A., & Tam, C. (2002). The Effectiveness of Using Discrete Utterance Speech Recognition Software. AAC Augmentative and Alternative Communication, 18, 137-145.

Microsoft. (2004). Speech Recognition Frequently Asked Questions . Retrieved May 10, 2004, from;en-us;283159

Mitchard, H., & Winkels, J. (2002). Experimental comparisons of data entry by automated speech recognition, keyboard and mouse. Human Factors, 42 (2), 198-209.

Neistadt, M. E., & Crepeau, E. B. (Eds.). (1998). Willard and Spackman's Occupational Therapy (9th ed.). Philadelphia: Lippincott-Raven.

Newburger, E. C. (2001, September). Home Computers and Internet Use in the United States: August 2000. Retrieved May 10, 2004, from

Pedretti, L. W., & Early, M. B. (Eds.). (2001). Occupational therapy: Practice skills for physical dysfunction. St. Louis: Mosby.

Scansoft. (2004a). Dragon NaturallySpeaking 7 Preferred Upgrade. Retrieved May 10, 2004, from

Scansoft. (2004b). Scansoft - Dragon NaturallySpeaking 7. Retrieved May 10, 2004, from

Scheleur, S. (2002, August 22). Retail E-Commerce Sales in Second Quarter 2002 Were $10.2 Billion, Up 24.2 Percent from Second Quarter 2001, Census Bureau Reports. Retrieved May 10, 2004, from

The Progressive Policy Institute. (2002). The 2002 State New Economy Index. Retrieved May 10, 2004, from

Two Crows Corporation. (2004, May 12, 2003). Two Crows: About data mining. Retrieved May 10, 2004, from

US Department of Labor. (2004). Teachers-Preschool, Kindergarten, Elementary, Middle, and Secondary . Retrieved May 10, 2004, from

Visual Frequency Feedback and its Effect on Wheelchair Propulsion Kinetics, Jeff D. Collins, Michael L. Boninger, Alicia M. Koontz, Rory A. Cooper, Guo Songfeng,