WO1992005517A1 - Audio-augmented handwriting recognition - Google Patents

Audio-augmented handwriting recognition

Info

Publication number
WO1992005517A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
speech portion
word
spoken
speech
Prior art date
Application number
PCT/US1991/006874
Other languages
French (fr)
Inventor
Richard G. Roth
Original Assignee
Roth Richard G
Priority date
Filing date
Publication date
Application filed by Roth Richard G filed Critical Roth Richard G
Publication of WO1992005517A1 publication Critical patent/WO1992005517A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16Sound input; Sound output
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/32Digital ink
    • G06V30/36Matching; Classification
    • G06V30/373Matching; Classification using a special pattern or subpattern alphabet

Definitions

  • This invention relates generally to apparatus for receiving handwritten (calligraphic) input, and relates more particularly to a novel technique for augmenting the handwritten input with audio (spoken) input.
  • an input device such as a touchpad
  • information indicative of the position, generally in X-Y coordinates, of the stylus and whether or not the stylus is touching the touchpad.
  • the recognition system receives no signal unless the stylus is touching the touchpad; when the stylus is touching, signals provide the X and Y coordinates.
  • the information is stored for later analysis, in which case the task becomes closely analogous to ordinary optical character recognition.
  • the information from the touchpad is analyzed at or near real time, and the information analyzed may include not only the X and Y coordinates of the stylus but also temporal information regarding the number of strokes making up the character, the order of the strokes, the direction of the writing for each stroke, and even the speed of the writing within each stroke.
  • the stylus is able to resolve several degrees of contact pressure. Few if any of the known on-line character recognition systems are really quite satisfactory. A few systems are fairly successful but require computing power far in excess of that available in hardware on the scale of a personal computer (PC).
  • an improved data-entry apparatus for entering handwritten data.
  • the user enters a speech portion (a word, a part of a word, or multi-word phrase) by writing it on a tablet and speaking it into a microphone.
  • Features are extracted from the spoken information.
  • the touchpad data are indicative of any of a number of candidate words, and a multiplicity of speech portion templates are generated, each template indicative of one of the candidate words.
  • the apparatus evaluates the correlation between the extracted features and the features of each generated speech portion template, and the speech portion template having the highest correlation with the extracted features determines the recognized word.
  • a speech synthesizer speaks the word, or the recognition result is provided to the user on a visual display, and an opportunity is provided so the user may confirm the correctness of the recognition.
  • the recognition algorithm need not satisfy the standard of yielding a single recognition result to the exclusion of all other results. Instead, the algorithm is permitted to return (upon analysis of the touchpad input) with a list of candidates. Pronunciation-by-rule algorithms are employed to arrive at a pronunciation (or in the case of allophones, a plurality of pronunciations) for each candidate. Speech recognition features are extracted from the voice input of the user, who will have spoken the word as well as having entered it at the touchpad. The features are compared with the pronunciations, and the nearest match is used as the result of the recognition. Optionally, the system may synthesize and replay for the user the result, giving the user a chance to indicate whether or not the recognition was unsuccessful, either by spoken response or by touchpad entry.
  • Fig. 1 is a diagram, in functional-block form, of the audio-augmented handwriting recognition device of the invention
  • Fig. 2 is a system configuration diagram, showing the system according to the invention.
  • Fig. 1 shows in functional block diagram form the system according to an embodiment of the invention.
  • Handwritten input is provided by the user at touchpad 1, which preferably has resolution in each of two axes of at least 200 divisions per inch.
  • Data regarding stylus position is collected as indicated by block 2 as a function of time, preferably at least as often as 100 samples per second.
  • the user entry is typically less than a second, and only a few kilobytes (K) of data are collected.
  • the invention shows promise for recognizing Chinese characters, in connection with which the data storage may be tens or hundreds of K bytes. Also discussed below is the case where the invention is applied to alphabetic entry of entire words or phrases, in which case the data storage may be at least several tens of K bytes.
  • the data points are configured as indicative of the strokes making up the character to be recognized.
  • Several known categories of preprocessing are performed, symbolized by block 3. These preferably include smoothing of strokes, filtering of stroke data, correcting wild points such as outlying points in the data, dehooking (removing idiosyncratic movements at ends of strokes), reducing multi-point dots of stylus movement to a single dot, and stroke correction in cases where a spurious indication of, say, lifting the stylus from the pad wrongly results in a single stroke being recorded as two strokes.
  • a number of types of normalization are preferably also performed as part of the pre-recognition processing. These include the known types of normalization such as deskewing, baseline drift correction, size normalization, and stroke length normalization, all known to those skilled in the art.
  • recognition In prior art handwriting recognition systems the next step, recognition, is quite difficult.
  • Known recognition methodologies which have been employed include feature matching, in which features of the to- be-recognized character are tested for match with characters in a feature database; time sequence of zones, directions, or extremes, where a simple database match, or a binary tree analysis, may be employed regarding relatively discrete information about the character-formation temporal data; curve matching, in which the curves making up the character are compared with standard curves in a database; or stroke codes, where parts of a character are classified and enumerated, then compared with a database.
  • the next stage of processing is the employment of pronunciation-by-rule methods to arrive at a pronunciation for each of the candidates.
  • Some letters or characters may have more than one pronunciation, and if so both pronunciations are arrived at. Such letters (or other speech portions) are called allophones, and do not cause a problem for the apparatus of the invention.
  • Feature templates for each of the pronunciations are stored; the number of feature templates to be compared with the spoken features may thus exceed the number of candidate speech portions.
  • the user is asked to speak the character into a microphone.
  • Well-known speech recognition steps are taken, starting with storage of the spoken information, preferably in RAM. A few megabytes suffice for storage of several seconds of speech, collected by microphone 6 and audio storage means 7.
  • Features are extracted by known means, such as determination of the amount of speech energy within various frequency ranges as a function of time.
  • the novel step of correlation is performed.
  • the extracted features of the spoken information are compared one by one with the features of the candidates mentioned before.
  • This step, shown as functional block 9, assumes a meaningful correlation measure which, in the general case, can be expected to permit selection of one of the candidates (call it candidate #1) as being a closer feature match than the others.
  • two requirements must be met:
  • the extracted features of the spoken information must match the features of candidate #1 with a correlation value exceeding some predetermined threshold (i.e. it must be a "sufficiently close" match);
  • the extracted features of the spoken information must match the features of candidate #1 with a correlation value exceeding the correlation values for the other candidates by some predetermined margin (i.e. it must be a "better" match than all the runners-up).
  • the system may:
  • one example of the last option is having the user enter the information by kana.
  • one example of the last option is having the user enter the word letter by letter.
  • each feature match is followed as in block 10 with an audio synthesis of the match character.
  • the synthesized speech is replayed for the user, and assuming the user agrees with the match, the process continues with the next handwriting to be recognized.
  • the match was incorrect, the user will so indicate, either by touchpad input or by speech input to the system.
  • a visual display of the match character as indicated by block 28. Assuming the user agrees with the match, the process continues with the next handwriting to be recognized. As with the audio synthesis technique just described, in the (hopefully rare) case that the match was incorrect, the user will so indicate, either by touchpad input or by speech input to the system.
  • Fig. 2 shows a block diagram of a system according to the invention based on a known personal computer.
  • CPU 12, preferably a fast processor such as a high-speed 80386 microprocessor, forms an AT-compatible or EISA-compatible system with BIOS 13, fixed disk 16 (and disk controller 15), optional floppy disk 17, RAM 14, and I/O interface hardware 18 to drive keyboard 20, display 29, and printer 19. All the above-mentioned are in communication by means of bus 27.
  • peripherals to the microcomputer are touchpad tablet 1 with interface 22 plugged into an expansion slot of bus 27, microphone 6 with speech recognition (feature extraction) interface 24, and an optional speaker with speech synthesis card 26. All three expansion peripherals comprise known hardware, available off-the-shelf.
  • the steps depicted functionally in Fig. 1, blocks 2 through 5 and 7 through 9 (and optionally 10 and 28) are all performed essentially by the CPU 12 under program control. That is, nearly all the steps are performed in software.
  • a variant of the preferred embodiment is possible in which, instead of pronunciation-by-rule, a template store is employed.
  • the candidates obtained by handwriting recognition are pointers into a template store (preferably in fixed disk 16) containing speech templates, one (or more, in the case of allophones) for each candidate.
  • the correlation step is performed, to find out which entry (and thus which handwriting candidate) is the closest match to the spoken input.
  • the embodiment just described has some elements in common with known speech recognition systems. But the embodiment offers many advantages over such known systems. For example, it will be appreciated that some prior art speech recognition systems try to match the received features with every entry in a template store.
  • a hybrid embodiment is also possible, in which the candidate words (or other speech portions) are used as pointers into the template store for those words which happen to be in the template store; for other words (namely for words not having entries in the template store) then pronunciation-by-rule is used to derive pronunciation templates.
  • the features extracted from the spoken input are compared to template candidates from either or both of the sources, namely from the template store or from the pronunciation-by-rule activity.
  • a "best match" is found which preferably matches more closely by a predetermined margin than any of the other matches. It will be appreciated that it is not strictly necessary first to store the correlation coefficients for all of the candidates, and only later to determine which candidate had the highest coefficient (and thus the best match).
  • one approach is to establish and maintain only two coefficients, each initialized to zero. One is the "best" coefficient found thus far in the comparison process, and the other is the "second best" coefficient found thus far. Only if a particular comparison yields a coefficient better than either the "best" or "second best" found thus far will it be necessary to update one or the other, or both. It will be noted that the above-described embodiment is disclosed with respect to one particular sequence, namely (1) analyzing the handwritten input to derive a number of candidate writing portions (e.g. words) and (2) receiving spoken input to disambiguate (e.g. narrow down the candidates to one result).
  • the invention may be optionally embodied with the opposite sequence, namely (1) analyzing the spoken input to derive a number of candidate speech portions (e.g. words) and (2) receiving written input to disambiguate (e.g. narrow down the candidates to one result) .
  • the relative merits of the two approaches depend on the relative effectiveness of the two component technologies as applied to a particular pair of inputs. For example, if the user entered a relatively high-quality (in terms of the system's ability to recognize it) graph and a low-quality speech-bite, the preferred approach would be more effective.
  • a system could, in fact, apply both approaches to a set of inputs, and choose the result with the higher degree of confidence.
  • the system according to the invention would receive spoken words from the user by means of the data collection 7 and feature extraction 8 processes of Fig. 1. With reference to Fig. 2, the extracted features would be made available to CPU 12 by data channel 27. As mentioned above in connection with the preferred embodiment, it is known in the speech-recognition art to extract these features, which are indicative typically of the intensity of sound energy in each of a number of preselected frequency bands, indicative of the overall sound intensity, and so on.
  • CPU 12 has access to a template store located in fixed disk 16 (or alternatively, in floppy disk 17 or elsewhere) which contains all the words (or other speech portions, such as syllables, phrases, and the like) to be potentially recognized, which may range from a few hundred to as many as a hundred thousand.
  • the candidates in the template store are winnowed down to arrive at a relatively small number of candidates from the template store.
  • An attempt is then made to correlate, on some appropriate measure, the data received at the touchpad with one or another of the candidates from the template store.
  • Measures of correlation would, in the simple case, take into account number of strokes and other relatively unambiguous aspects of the data received at the touchpad, to find a closest match with one or another of the candidates from the template store.
  • the "closest match" will generally be defined as a match that is closer by a predetermined amount than any of the other candidates, and that objectively betters some predetermined standard. In the absence of a close enough match (either because the best match was not very good, or because it was not much better than the second-place match), remedial steps are taken as described above in connection with the first embodiment.
  • a transcript is derived from the spoken input, where the transcript is a series of characters or symbols capable of being stored in the computer which approximate the spoken input. The transcript of the spoken input is then compared with transcripts associated with each of the candidates.
  • the apparatus of the invention may also be set up to recognize katakana or hiragana characters, or alphabets other than the Roman alphabet, without departing from the spirit of the invention.
  • These include Chinese, Japanese, Korean, and Vietnamese, as well as editing marks, mathematical and scientific symbols, and any other graphical shape having an associated sound.
  • any of the above-mentioned embodiments may be improved upon by employing ranking methodologies.
  • one of two recognition inputs is used to derive a set of candidates and the other recognition input is used to disambiguate from among the candidates.
  • all the candidates on the list are treated as equally plausible in light of the recognition input that was used to generate the list; any information as to whether one of the candidates was better in light of that input is discarded.
  • the other of the two inputs gives rise to the correlation coefficients that permit disambiguation and selection of the "best match".
  • the apparatus stores not only the list of five words resulting from the handwriting recognition, but also stores for each candidate a corresponding confidence level weighting value. Then, when the speech recognition features are compared for a match with each of the candidates, the closeness of the match is factored in together with the confidence level weighting value.
  • the results of the recognition process take into account not only how closely the spoken input agrees with one of several candidates resulting from the handwritten input, but also how "close” any given candidate is to being the right one for the handwritten input.
  • let s(i) be the correlation coefficients (where i ranges from 1 to n) between the features received from the speech input and the n candidates.
  • another set of coefficients h(i) represent the higher or lower confidence levels representing the relative likelihood that a given candidate is the correct one relative to the handwriting input.
  • the overall conclusion as to which of the candidates is correct results from a combination of the coefficients s(i) and h(i) . If the s(i) and h(i) are scaled, for example, to fall between 0 and 1, then the sum or the product of the coefficients may be used.
  • weighting the two inputs may be static or dynamic.
  • static weighting: one of the two inputs is given a greater weight than the other; for example, if the ambient noise level is always high, the handwriting coefficients may be scaled to fall between 0 and 2 and then summed with the speech coefficients.
  • dynamic weighting: the relative weight given to the s(i) and the h(i) may vary from speech portion to speech portion; if ambient noise made one particular speech portion harder to extract features from, then the s(i) could be given slightly less weight relative to the h(i).
  • the ranking methodology may be applied to any of the above embodiments — where either of the two recognition inputs is used to generate a list of candidates and where the other of the two is used to disambiguate.
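The ranking combination described in the bullets above can be illustrated with a short sketch (not part of the patent; the function name, the [0, 1] scaling, and the additive combination with a speech weight are illustrative assumptions — the patent also permits a product of the coefficients):

```python
def ranked_choice(candidates, speech_weight=1.0):
    """Pick the best candidate by combining handwriting confidence h(i)
    and speech correlation s(i), both assumed scaled to [0, 1].

    candidates: list of (word, h, s) tuples.
    speech_weight: the static or dynamic weight on the speech input;
    it would be lowered for a speech portion degraded by ambient noise.
    """
    def score(entry):
        word, h, s = entry
        # Weighted sum of the two coefficients, as in the static/dynamic
        # weighting schemes described above.
        return h + speech_weight * s
    return max(candidates, key=score)[0]
```

With equal weighting a strong handwriting match can win despite a weaker speech match; raising `speech_weight` (e.g. when the writing was sloppy but the audio was clean) can reverse the outcome.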

Abstract

An improved data-entry apparatus for entering handwritten data is disclosed. The user enters a word (or portion thereof, depending on the language and notation used) into a touchpad and speaks the word. Features are extracted from the spoken information. The touchpad data are indicative of any of a number of candidate words, and a multiplicity of speech portion templates are generated, each template indicative of one of the candidate words. The apparatus evaluates the correlation between the extracted features and the features of each generated speech portion template, and the speech portion template having the highest correlation with the extracted features determines the recognized word. Optionally a speech synthesizer speaks the word and an opportunity is provided so the user may confirm the correctness of the recognition.

Description

Description
Audio-Augmented Handwriting Recognition
Background of the Invention
This invention relates generally to apparatus for receiving handwritten (calligraphic) input, and relates more particularly to a novel technique for augmenting the handwritten input with audio (spoken) input. Much attention has been paid in recent decades to the prospect of computerized recognition of handwritten input. Typically an input device such as a touchpad is employed, which receives user input by a stylus (or finger) and provides information about the position of the stylus in an orthogonal reference frame. Provided to the recognition system is information indicative of the position, generally in X-Y coordinates, of the stylus and whether or not the stylus is touching the touchpad. With other touchpad designs the recognition system receives no signal unless the stylus is touching the touchpad; when the stylus is touching, signals provide the X and Y coordinates.
In some systems the information is stored for later analysis, in which case the task becomes closely analogous to ordinary optical character recognition. In other systems the information from the touchpad is analyzed at or near real time, and the information analyzed may include not only the X and Y coordinates of the stylus but also temporal information regarding the number of strokes making up the character, the order of the strokes, the direction of the writing for each stroke, and even the speed of the writing within each stroke. In a few systems the stylus is able to resolve several degrees of contact pressure. Few if any of the known on-line character recognition systems are really quite satisfactory. A few systems are fairly successful but require computing power far in excess of that available in hardware on the scale of a personal computer (PC) . To achieve high accuracy of recognition it is generally necessary to constrain closely the permitted range of inputs, such as to the numerical digits 0 and 1 through 9. Where the range of possible inputs is intended to include, say, a Roman alphabet, most systems have poor recognition accuracy or constrain closely the type of letters (e.g. capital, block letters) that are entered. Other systems can only achieve acceptable accuracy after having the opportunity to "learn" the handwriting of a particular user, and do not perform well with arbitrary users whose handwriting has not been "learned".
One particularly large factor in the limited success of most handwriting recognition systems is that the system either recognizes a character or it does not. That is, the algorithm that accepts the touchpad input information and yields a recognized character faces the design constraint of coming up with a single character result, or none at all. And the result has to be correct most of the time or the overall system will be of little utility. The requirement that the single result must be right most of the time dictates, in most systems, one or another of the compromises mentioned above: giving up on reading everyone's handwriting, giving up on reading a wide range of characters, giving up on reading both upper- and lower-case letters, or giving up on obtaining the results in real time.
Summary of the Invention
In accordance with the invention there is provided an improved data-entry apparatus for entering handwritten data. The user enters a speech portion (a word, a part of a word, or multi-word phrase) by writing it on a tablet and speaking it into a microphone. Features are extracted from the spoken information. The touchpad data are indicative of any of a number of candidate words, and a multiplicity of speech portion templates are generated, each template indicative of one of the candidate words. The apparatus evaluates the correlation between the extracted features and the features of each generated speech portion template, and the speech portion template having the highest correlation with the extracted features determines the recognized word. Optionally a speech synthesizer speaks the word, or the recognition result is provided to the user on a visual display, and an opportunity is provided so the user may confirm the correctness of the recognition.
Among the salient features of the apparatus of the invention is that the recognition algorithm need not satisfy the standard of yielding a single recognition result to the exclusion of all other results. Instead, the algorithm is permitted to return (upon analysis of the touchpad input) with a list of candidates. Pronunciation-by-rule algorithms are employed to arrive at a pronunciation (or in the case of allophones, a plurality of pronunciations) for each candidate. Speech recognition features are extracted from the voice input of the user, who will have spoken the word as well as having entered it at the touchpad. The features are compared with the pronunciations, and the nearest match is used as the result of the recognition. Optionally, the system may synthesize and replay for the user the result, giving the user a chance to indicate whether or not the recognition was unsuccessful, either by spoken response or by touchpad entry.
Brief Description of the Figures
The invention will be described in more detail by a drawing, of which:
Fig. 1 is a diagram, in functional-block form, of the audio-augmented handwriting recognition device of the invention;
Fig. 2 is a system configuration diagram, showing the system according to the invention.
Detailed Description
Fig. 1 shows in functional block diagram form the system according to an embodiment of the invention.
Handwritten input is provided by the user at touchpad 1, which preferably has resolution in each of two axes of at least 200 divisions per inch. Data regarding stylus position is collected as indicated by block 2 as a function of time, preferably at least as often as 100 samples per second. Where the invention is applied to recognition of, say, individual letters in a Roman alphabet, the user entry is typically less than a second, and only a few kilobytes (K) of data are collected. As will be discussed further below, the invention shows promise for recognizing Chinese characters, in connection with which the data storage may be tens or hundreds of K bytes. Also discussed below is the case where the invention is applied to alphabetic entry of entire words or phrases, in which case the data storage may be at least several tens of K bytes.
Once the data are collected, the data points are configured as indicative of the strokes making up the character to be recognized. Several known categories of preprocessing are performed, symbolized by block 3. These preferably include smoothing of strokes, filtering of stroke data, correcting wild points such as outlying points in the data, dehooking (removing idiosyncratic movements at ends of strokes), reducing multi-point dots of stylus movement to a single dot, and stroke correction in cases where a spurious indication of, say, lifting the stylus from the pad wrongly results in a single stroke being recorded as two strokes.
Prior to the attempt to recognize the character, a number of types of normalization are preferably also performed as part of the pre-recognition processing. These include the known types of normalization such as deskewing, baseline drift correction, size normalization, and stroke length normalization, all known to those skilled in the art.
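Two of the pre-recognition steps named above can be sketched as follows. This is an illustrative modern rendering, not the patent's implementation; the function names, the moving-average smoother, and the unit-height normalization target are assumptions. Strokes are lists of (x, y) sample points from the touchpad:

```python
def smooth_stroke(points, window=3):
    """Smooth one stroke with a simple moving average over neighbors,
    a basic form of the stroke smoothing described for block 3."""
    n = len(points)
    out = []
    for i in range(n):
        lo, hi = max(0, i - window // 2), min(n, i + window // 2 + 1)
        xs = [p[0] for p in points[lo:hi]]
        ys = [p[1] for p in points[lo:hi]]
        out.append((sum(xs) / len(xs), sum(ys) / len(ys)))
    return out

def normalize_size(strokes, target=1.0):
    """Size normalization: translate the character to the origin and
    scale it so its bounding box has the target height."""
    pts = [p for s in strokes for p in s]
    min_x = min(p[0] for p in pts)
    min_y = min(p[1] for p in pts)
    max_y = max(p[1] for p in pts)
    scale = target / max(max_y - min_y, 1e-9)  # guard against flat input
    return [[((x - min_x) * scale, (y - min_y) * scale) for x, y in s]
            for s in strokes]
```

Deskewing and baseline drift correction would be further affine corrections applied in the same pass.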
In prior art handwriting recognition systems the next step, recognition, is quite difficult. Known recognition methodologies which have been employed include feature matching, in which features of the to-be-recognized character are tested for match with characters in a feature database; time sequence of zones, directions, or extremes, where a simple database match, or a binary tree analysis, may be employed regarding relatively discrete information about the character-formation temporal data; curve matching, in which the curves making up the character are compared with standard curves in a database; or stroke codes, where parts of a character are classified and enumerated, then compared with a database. As mentioned above, all these methodologies suffer, when used in prior art handwriting recognition systems, from the problem that they are of no use unless they give a single highly unambiguous result for each recognized character. To achieve this singular result, it is generally required that the size of the character set be constrained, that the font be constrained, or that the number of users be constrained. At block 4 of Fig. 1, the recognition is performed, but the recognition algorithm only requires that the range of possible recognition results be confined to a small number of matches. Where letters of an alphabet are being recognized, the number of candidates can be half a dozen or so. Where Chinese characters are being recognized, the number of candidates may typically be a few dozen or a few hundred.
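The relaxed requirement of block 4 — a short candidate list rather than a single forced result — might be sketched as below. The feature vectors, the Euclidean distance measure, and the candidate count are illustrative assumptions standing in for whichever of the matching methodologies above is used:

```python
def candidate_list(features, feature_db, k=6):
    """Return the k best-matching characters from a feature database,
    instead of forcing a single (possibly wrong) recognition result.

    features:   feature vector of the written character.
    feature_db: dict mapping each known character to its feature vector.
    """
    def distance(a, b):
        # Simple Euclidean distance as a stand-in correlation measure.
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    ranked = sorted(feature_db.items(),
                    key=lambda item: distance(features, item[1]))
    return [char for char, _ in ranked[:k]]
```

For an alphabet, k of about six matches the "half a dozen or so" figure above; for Chinese characters k would be a few dozen to a few hundred.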
The next stage of processing, shown as block 5, is the employment of pronunciation-by-rule methods to arrive at a pronunciation for each of the candidates. Some letters or characters may have more than one pronunciation, and if so both pronunciations are arrived at. Such letters (or other speech portions) are called allophones, and do not cause a problem for the apparatus of the invention. Feature templates for each of the pronunciations are stored; the number of feature templates to be compared with the spoken features may thus exceed the number of candidate speech portions. Similarly, it may be desired to permit, say, American and British templates for speech portions pronounced differently in the two dialects, or for speech portions pronounced by men or women.
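A toy sketch of the pronunciation-by-rule stage of block 5 follows. The rule table here is invented for illustration (a real system would carry a full grapheme-to-phoneme rule set), but it shows how a letter with more than one pronunciation multiplies the set of feature templates, as described above:

```python
# Hypothetical letter-to-phone rules; "c" has two pronunciations
# (hard and soft), so any word containing it yields two templates.
RULES = {
    "c": ["k", "s"],
    "a": ["ae"],
    "t": ["t"],
}

def pronunciations(word):
    """Expand a candidate word into every pronunciation the rules permit."""
    results = [""]
    for letter in word.lower():
        phones = RULES.get(letter, [letter])  # unknown letters pass through
        results = [r + p for r in results for p in phones]
    return results
```

So a single candidate such as "cat" produces two pronunciation templates, and the number of templates compared with the spoken features can exceed the number of candidates, exactly as the text notes (and dialect or speaker-sex variants would multiply the count further).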
At about the same time, the user is asked to speak the character into a microphone. Well-known speech recognition steps are taken, starting with storage of the spoken information, preferably in RAM. A few megabytes suffice for storage of several seconds of speech, collected by microphone 6 and audio storage means 7. Features are extracted by known means, such as determination of the amount of speech energy within various frequency ranges as a function of time.
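The feature extraction just described — speech energy within various frequency ranges as a function of time — can be sketched with a naive discrete Fourier transform (illustrative only; the sample rate, frame size, and band edges are assumptions, and a real system would use an FFT or analog filter bank):

```python
import math

def band_energies(samples, rate=8000, frame=64,
                  bands=((0, 500), (500, 1500), (1500, 4000))):
    """For each fixed-size frame of audio samples, return the energy
    falling in each frequency band, computed via a naive DFT."""
    features = []
    for start in range(0, len(samples) - frame + 1, frame):
        seg = samples[start:start + frame]
        energies = []
        for lo, hi in bands:
            e = 0.0
            for k in range(frame // 2):          # bins up to Nyquist
                freq = k * rate / frame          # center frequency of bin k
                if lo <= freq < hi:
                    re = sum(seg[n] * math.cos(2 * math.pi * k * n / frame)
                             for n in range(frame))
                    im = sum(seg[n] * math.sin(2 * math.pi * k * n / frame)
                             for n in range(frame))
                    e += re * re + im * im
            energies.append(e)
        features.append(energies)
    return features
```

The resulting sequence of per-frame band-energy vectors is the kind of feature stream that block 8 would hand to the correlation step.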
In accordance with the invention, the novel step of correlation is performed. The extracted features of the spoken information are compared one by one with the features of the candidates mentioned before. This step, shown as functional block 9, assumes a meaningful correlation measure which, in the general case, can be expected to permit selection of one of the candidates (call it candidate #1) as being a closer feature match than the others. In order for this correlation to result in an "acceptance", two requirements must be met:
(1) the extracted features of the spoken information must match the features of candidate #1 with a correlation value exceeding some predetermined threshold (i.e. it must be a "sufficiently close" match); and
(2) the extracted features of the spoken information must match the features of candidate #1 with a correlation value exceeding the correlation values for the other candidates by some predetermined margin (i.e. it must be a "better" match than all the runners-up).
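By way of illustration only, the two acceptance requirements may be sketched as follows; the particular threshold and margin values are illustrative assumptions:

```python
def accept(correlations, threshold=0.80, margin=0.10):
    """Apply the two acceptance requirements to candidate correlations.

    Returns the winning candidate if its correlation (1) meets the
    threshold and (2) beats every runner-up by the margin; otherwise
    returns None, signalling that a second-pass remedial measure is needed.
    """
    ranked = sorted(correlations.items(), key=lambda kv: kv[1], reverse=True)
    best_name, best_value = ranked[0]
    runner_up_value = ranked[1][1] if len(ranked) > 1 else 0.0
    if best_value >= threshold and best_value - runner_up_value >= margin:
        return best_name
    return None
```

A None result corresponds to the failure of either requirement, whether because the best match was not good enough or because it was not sufficiently better than the runners-up.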
Experience suggests that with a properly chosen threshold and margin (and, of course, with a workable correlation measure), most of the time the two requirements will be met, and recognition will have been accomplished on the "first pass". However, according to the invention any of several remedial measures may be taken during a so-called "second pass". For example, the system may:
(1) offer a number of candidates, ranked by correlation value, for multiple-choice selection by the user; or
(2) have the user repeat the reciting and writing process; or
(3) where an alphabet has not yet been used but is available, have the user enter the information character by character.
In the case of Kanji entry, one example of the last option is having the user enter the information by kana. In the case of entry of a cursive English word, one example of the last option is having the user enter the word letter by letter.
Optionally, each feature match is followed as in block 10 with an audio synthesis of the match character. The synthesized speech is replayed for the user, and assuming the user agrees with the match, the process continues with the next handwriting to be recognized. In the (hopefully rare) case that the match was incorrect, the user will so indicate, either by touchpad input or by speech input to the system.
Instead of or in addition to an audio synthesis of the match character, there may be a visual display of the match character as indicated by block 28. Assuming the user agrees with the match, the process continues with the next handwriting to be recognized. As with the audio synthesis technique just described, in the (hopefully rare) case that the match was incorrect, the user will so indicate, either by touchpad input or by speech input to the system.

Turning now to Fig. 2, there is shown a block diagram of a system according to the invention based on a known personal computer. CPU 12, preferably a fast processor such as a high-speed 80386 microprocessor, forms an AT-compatible or EISA-compatible system with BIOS 13, fixed disk 16 (and disk controller 15), optional floppy disk 17, RAM 14, and I/O interface hardware 18 to drive keyboard 20, display 29, and printer 19. All the above-mentioned are in communication by means of bus 27. Provided as peripherals to the microcomputer are touchpad tablet 1 with interface 22 plugged into an expansion slot of bus 27, microphone 6 with speech recognition (feature extraction) interface 24, and an optional speaker with speech synthesis card 26. All three expansion peripherals comprise known hardware, available off the shelf. The steps depicted functionally in Fig. 1, blocks 2 through 5 and 7 through 9 (and optionally 10 and 28), are all performed essentially by the CPU 12 under program control. That is, nearly all the steps are performed in software.
With data entry rates that are typical for humans (several tenths of a second for Roman characters, a second or two for Chinese characters), theory suggests that the bandwidth of a fast AT or EISA bus and the microprocessor throughput can permit recognition according to the invention not only for individual letters but also for Chinese characters or entire words in Roman-alphabet languages.
A variant of the preferred embodiment is possible in which, instead of pronunciation-by-rule, a template store is employed. The candidates obtained by handwriting recognition are pointers into a template store (preferably in fixed disk 16) containing speech templates, one (or more, in the case of allophones) for each candidate. Once the template entries corresponding to the candidates have been identified, the correlation step is performed, to find out which entry (and thus which handwriting candidate) is the closest match to the spoken input.

Note that the embodiment just described has some elements in common with known speech recognition systems. But the embodiment offers many advantages over such known systems. For example, it will be appreciated that some prior art speech recognition systems try to match the received features with every entry in a template store. In the case where the template store is large, with perhaps many tens of thousands of entries, this requires many tens of thousands of comparisons and is fraught with the danger of an incorrect result, since several entries may turn out to have correlation coefficients not far below that of the entry with the highest correlation coefficient. In the case where the template store is small, on the other hand, the prospect of having one entry correlate much better than the others is greater, but the obvious drawback is the limited recognition vocabulary.
Thus one advantage of the inventive embodiment may be seen — where analysis begins with the handwritten input the portion of the template store that must be compared with the received features is reduced substantially. This saves computational time and enhances the prospect of correct recognition on the first try.
A hybrid embodiment is also possible, in which the candidate words (or other speech portions) are used as pointers into the template store for those words which happen to be in the template store; for other words (namely, words not having entries in the template store) pronunciation-by-rule is used to derive pronunciation templates. The features extracted from the spoken input are compared to template candidates from either or both of the sources, namely from the template store or from the pronunciation-by-rule activity. As with the previous systems, a "best match" is found which preferably matches more closely by a predetermined margin than any of the other matches.

It will be appreciated that it is not strictly necessary first to store the correlation coefficients for all of the candidates, and only later to determine which candidate had the highest coefficient (and thus the best match). Instead, one approach is to establish and maintain only two coefficients, each initialized to zero. One is the "best" coefficient found thus far in the comparison process, and the other is the "second best" coefficient found thus far. Only if a particular comparison yields a coefficient better than either the "best" or "second best" found thus far will it be necessary to update one or the other, or both.

It will be noted that the above-described embodiment is disclosed with respect to one particular sequence, namely (1) analyzing the handwritten input to derive a number of candidate writing portions (e.g. words) and (2) receiving spoken input to disambiguate (e.g. narrow down the candidates to one result).
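The two-coefficient approach just described may be sketched as follows; the shape of the `score` callable is an illustrative assumption:

```python
def best_two(candidates, score):
    """Track only the best and second-best coefficients in a single pass.

    Avoids storing every correlation coefficient: each comparison updates
    at most the two running (candidate, coefficient) pairs, both
    initialized with a coefficient of zero.
    """
    best = (None, 0.0)      # best coefficient found thus far
    second = (None, 0.0)    # second-best coefficient found thus far
    for cand in candidates:
        c = score(cand)
        if c > best[1]:
            best, second = (cand, c), best   # old best becomes second best
        elif c > second[1]:
            second = (cand, c)
    return best, second
```

The returned pair supplies exactly what the acceptance test needs: the best coefficient (for the threshold requirement) and the second-best (for the margin requirement).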
This is the preferred embodiment because experience suggests the present state of technology favors recognition of a relatively high-quality graphical input; speech recognition, over the anticipated wide vocabulary to which the invention would be applied, is less reliable in and of itself, though it is thought sufficiently workable for the above-mentioned disambiguation.
The invention may optionally be embodied with the opposite sequence, namely (1) analyzing the spoken input to derive a number of candidate speech portions (e.g. words) and (2) receiving written input to disambiguate (e.g. narrow down the candidates to one result). With present-day speech recognition technology, this is thought less practical than the preferred embodiment. The relative merits of the two approaches depend on the relative effectiveness of the two component technologies as applied to a particular pair of inputs. For example, if the user were to input a relatively high-quality (in terms of the system's ability to recognize) graph and a low-quality speech-bite, the preferred approach would be more effective. A system could, in fact, apply both approaches to a set of inputs, and choose the result with the higher degree of confidence.
In an alternative embodiment, then, the system according to the invention would receive spoken words from the user by means of the data collection 7 and feature extraction 8 processes of Fig. 1. With reference to Fig. 2, the extracted features would be made available to CPU 12 by data channel 27. As mentioned above in connection with the preferred embodiment, it is known in the speech-recognition art to extract these features, which are indicative typically of the intensity of sound energy in each of a number of preselected frequency bands, indicative of the overall sound intensity, and so on. In known speech recognition systems one of the most vexing problems is figuring out when one spoken word ends and the next begins, but in the apparatus of this embodiment it is assumed that the user speaks only one speech portion (word, letter, syllable, or phrase, depending on the language involved and the design choices made) at a time, in response to synthesized prompts.
CPU 12 has access to a template store located in fixed disk 16 (or alternatively, in floppy disk 17 or elsewhere) which contains all the words (or other speech portions, such as syllables, phrases, and the like) to be potentially recognized, which may range from a few hundred to as many as a hundred thousand.
The candidates in the template store are winnowed to arrive at a relatively small number of candidates from the template store. An attempt is then made to correlate, on some appropriate measure, the data received at the touchpad with one or another of the candidates from the template store.
Measures of correlation would, in the simple case, take into account the number of strokes and other relatively unambiguous aspects of the data received at the touchpad, to find a closest match with one or another of the candidates from the template store. As with the preferred embodiment, the "closest match" will generally be defined as a match that is closer by a predetermined amount than any of the other candidates, and that at least meets some predetermined standard. In the absence of a close enough match (either because the best match was not very good, or because it was not much better than the second-place match) remedial steps are taken as described above in connection with the first embodiment.
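By way of illustration only, the winnowing of the template store by a coarse, relatively unambiguous measure such as stroke count may be sketched as follows; the store layout and tolerance value are illustrative assumptions:

```python
def winnow(entries, observed_strokes, tolerance=1):
    """Winnow a template store by stroke count before fine correlation.

    Keeps only entries whose recorded stroke count is within `tolerance`
    of the count observed at the touchpad; the finer (and costlier)
    correlation measure then runs only on this reduced candidate set.
    """
    return [name for name, strokes in entries.items()
            if abs(strokes - observed_strokes) <= tolerance]
```

Winnowing on cheap measures first is what reduces the portion of the template store that must be compared in detail, as discussed above.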
Those skilled in the art will appreciate that while the embodiments have been disclosed with respect to one particular way of characterizing a spoken input (extraction of acoustic features and comparison with a set of acoustic-feature templates), the inventive aspects do not depend on the use of such templates. For example, one may employ instead what is known as transcription. A transcript is derived from the spoken input, where the transcript is a series of characters or symbols, capable of being stored in the computer, which approximate the spoken input. The transcript of the spoken input is then compared with transcripts associated with each of the candidates.
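The transcript-comparison alternative may be sketched as follows; the symbol-sequence similarity ratio shown stands in for whatever comparison measure is actually chosen, and the stored transcripts are illustrative:

```python
import difflib

def closest_by_transcript(spoken_transcript, candidates):
    """Compare a transcript of the spoken input against stored transcripts.

    `candidates` maps each writing candidate to its stored transcript; a
    symbol-sequence similarity ratio replaces acoustic-feature correlation,
    and the candidate with the most similar transcript is returned.
    """
    def similarity(a, b):
        return difflib.SequenceMatcher(None, a, b).ratio()

    return max(candidates,
               key=lambda c: similarity(spoken_transcript, candidates[c]))
```

In a full system the same threshold-and-margin acceptance test described earlier would be applied to these similarity values before accepting the result.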
While the embodiments have been disclosed in connection with the recognition of letters, the scope of the invention should not be so limited. For example, much attention has been paid to the recognition of
Chinese characters (kanji) through touchpad input means and the like. While typical ambiguity for recognition of a letter may be two or three candidates, it is not unheard-of for a given set of calligraphic data to give rise to several dozen candidate kanji. With prior art systems the inability of the algorithm to exclude all choices but one makes the entire enterprise fruitless. With the method of the invention, however, the user may also speak the kanji, and speech feature extraction and matching permit disambiguation.

In yet another application, the recognition of words (formed of letters) may be accomplished by the method of the invention. The user would enter a sequence of letters at the touchpad, and speak the word that was spelled. What is generated in this case is a set of candidate words, allowing for ambiguity in the recognition of individual letters. Then the user speaks the word, and the candidate with the nearest match is taken to be the correct match.

While most of the above examples use Roman letters, it should be appreciated that the apparatus of the invention may also be set up to recognize katakana or hiragana characters, or alphabets other than the Roman alphabet, without departing from the spirit of the invention. These include Chinese, Japanese, Korean, and Vietnamese, as well as editing marks, mathematical and scientific symbols, and any other graphical shape having an associated sound.

Finally, depending on the language it is possible to recognize entire phrases rather than words singly. In the case of phrases, an entire phrase is handwritten at the touchpad and the entire phrase is spoken by the user; the method of any of the embodiments described above is applied to the phrase just as it would be applied to a letter, syllable, character or word.
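The generation of candidate words from ambiguous letter recognitions, mentioned above, may be sketched as follows; the per-letter option sets and the vocabulary are illustrative assumptions:

```python
from itertools import product

def candidate_words(letter_options, vocabulary):
    """Expand per-letter ambiguity into a set of candidate words.

    `letter_options` holds, for each handwritten letter, the small set of
    letters it might be; only combinations that form vocabulary words
    survive as candidates for the spoken-word disambiguation step.
    """
    return [w for w in ("".join(p) for p in product(*letter_options))
            if w in vocabulary]
```

The vocabulary check keeps the candidate list short even when several letters are individually ambiguous, since most letter combinations spell no word at all.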
Any of the above-mentioned embodiments may be improved upon by employing ranking methodologies. In the simple cases described above, one of two recognition inputs is used to derive a set of candidates and the other recognition input is used to disambiguate from among the candidates. In those simple cases all the candidates on the list are treated as equally plausible in light of the recognition input that was used to generate the list; any information as to whether one of the candidates was better in light of that input is discarded. In those simple cases, then, the other of the two inputs gives rise to the correlation coefficients that permit disambiguation and selection of the "best match".
When ranking methodologies are used, a higher confidence result can be obtained. Take, for example, the embodiment in which handwriting has been recognized yielding a list of five candidate words. In the simple case, the speech recognition input (hopefully) permits the conclusion that one of the five candidates is the best match. In the "ranking" embodiment, the apparatus stores not only the list of five words resulting from the handwriting recognition, but also stores for each candidate a corresponding confidence level weighting value. Then, when the speech recognition features are compared for a match with each of the candidates, the closeness of the match is factored in together with the confidence level weighting value.
In the "ranking" embodiment, then, the results of the recognition process take into account not only how closely the spoken input agrees with one of several candidates resulting from the handwritten input, but also how "close" any given candidate is to being the right one for the handwritten input.
A computational example will show that there are several meaningful ways to accomplish the ranking.
Assume a system where handwriting recognition is used to find candidate words, of which there are n, and that speech recognition is to be used to select from among the n words. Let s(i) be the correlation coefficients (where i ranges from 1 to n) between the features received from the speech input and the n candidates. In the simple embodiment, if there is a successful recognition it is because for some j there is an s(j) which is larger than the other s(i) by at least some margin, and that s(j) is itself above a predetermined threshold; in that case candidate j is taken as the correctly recognized word.
In the ranking embodiment, another set of coefficients h(i) represents the confidence levels, that is, the relative likelihood that a given candidate is the correct one in light of the handwriting input. The overall conclusion as to which of the candidates is correct results from a combination of the coefficients s(i) and h(i). If the s(i) and h(i) are scaled, for example, to fall between 0 and 1, then the sum or the product of the coefficients may be used. In the case of the sum, recognition is successful if for some j there is a sum s(j)+h(j) which is larger than the other s(i)+h(i) by at least some margin, and that s(j)+h(j) is itself above a predetermined threshold; in that case candidate j is taken as the correctly recognized word. Products may also be used.
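By way of illustration, the summed ranking just described, together with the relative weighting of the two inputs, may be sketched as follows; the threshold, margin, and weight values are illustrative assumptions (with both coefficient sets scaled to fall between 0 and 1, the unweighted sum falls between 0 and 2):

```python
def combined_ranking(s, h, w_speech=1.0, w_hand=1.0,
                     threshold=1.2, margin=0.1):
    """Combine speech coefficients s(i) and handwriting confidences h(i).

    The weighted sum implements the ranking embodiment; the weights may be
    static, or varied per speech portion (e.g. lowering w_speech when
    ambient noise degrades a particular utterance).  Returns the winning
    candidate, or None if the threshold or margin requirement fails.
    """
    totals = {i: w_speech * s[i] + w_hand * h[i] for i in s}
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    best, best_v = ranked[0]
    second_v = ranked[1][1] if len(ranked) > 1 else 0.0
    if best_v >= threshold and best_v - second_v >= margin:
        return best
    return None
```

Replacing the sum with a product is a one-line change and gives the product variant mentioned in the text.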
An additional level of sophistication may be had by weighting the two inputs relative to each other; the weighting may be static or dynamic. With static weighting one of the two inputs is given a greater weight than the other; for example, if the ambient noise level is always great the handwriting coefficients may be scaled to fall between 0 and 2, and then summed with the speech coefficients. With dynamic weighting the relative weight given to the s(i) and the h(i) may vary from speech portion to speech portion; if ambient noise made one particular speech portion harder to extract features from, then the s(i) could be given slightly less weight relative to the h(i).
The ranking methodology may be applied to any of the above embodiments — where either of the two recognition inputs is used to generate a list of candidates and where the other of the two is used to disambiguate.
Those skilled in the art will appreciate that nothing in the invention requires that the touchpad entry be first and the spoken input second; the opposite order or even simultaneous entry can be accommodated by appropriate hardware and software reconfiguration.

Claims
1. A data-entry apparatus for entering data comprising handwritten word portions corresponding to spoken speech portions, comprising: writing-pad receiving means for receiving a group of writing signals; sound receiving means for detecting sounds corresponding to a spoken speech portion and for extracting features for said sounds; template generation means for generation of a multiplicity of speech portion templates, each template indicative of a word portion corresponding to said group of writing signals; and correlating means responsive to the extracted features for evaluating the correlation between the extracted features and the features of each generated speech portion template, and for identifying the generated speech portion template in the multiplicity of speech portion templates having the highest correlation with the extracted features.
2. The data-entry apparatus of claim 1, further comprising a speech synthesizer responsive to the correlating means for synthesizing the speech portion corresponding to the speech portion template in the multiplicity of speech portion templates having the highest correlation with the extracted features.
3. The data-entry apparatus of claim 1 wherein the writing-pad receiving means is a two-dimensional touchpad.
4. The data-entry apparatus of claim 3 wherein the handwritten word portions are Chinese characters and the sound speech portion is a spoken Chinese character.
5. The data-entry apparatus of claim 3 wherein the handwritten word portions are Kanji characters and the sound speech portion is a spoken Kanji character.
6. The data-entry apparatus of claim 3 wherein the handwritten word portions are katakana characters and the sound speech portion is a spoken katakana character.
7. The data-entry apparatus of claim 3 wherein the handwritten word portions are hiragana characters and the sound speech portion is a spoken hiragana character.
8. The data-entry apparatus of claim 3 wherein the handwritten word portions are words written in an alphabet and the sound speech portion is a spoken word.
9. A data-entry apparatus for entering data comprising spoken speech portions corresponding to handwritten word portions, comprising: writing-pad receiving means for receiving a group of writing signals; sound receiving means for detecting sounds corresponding to a spoken speech portion and for extracting features for said sounds; word-candidate generation means for generation of a multiplicity of speech portion word-candidates, each word-candidate indicative of a candidate word portion corresponding to the features for the spoken speech portion; and correlating means responsive to the group of writing signals for evaluating the correlation between the group of writing signals and the features of each generated speech portion word-candidate, and for identifying the generated speech portion word-candidate in the multiplicity of speech portion word-candidates having the highest correlation with the group of writing signals.
10. The data-entry apparatus of claim 9, further comprising a speech synthesizer responsive to the correlating means for synthesizing the speech portion corresponding to the speech portion word-candidate in the multiplicity of speech portion word-candidates having the highest correlation with the group of writing signals.
11. The data-entry apparatus of claim 9 wherein the writing-pad receiving means is a two-dimensional touchpad.
12. The data-entry apparatus of claim 11 wherein the handwritten word portions are Chinese characters and the sound speech portion is a spoken Chinese character.
13. The data-entry apparatus of claim 11 wherein the handwritten word portions are Kanji characters and the sound speech portion is a spoken Kanji character.
14. The data-entry apparatus of claim 11 wherein the handwritten word portions are katakana characters and the sound speech portion is a spoken katakana character.
15. The data-entry apparatus of claim 11 wherein the handwritten word portions are hiragana characters and the sound speech portion is a spoken hiragana character.
16. The data-entry apparatus of claim 11 wherein the handwritten word portions are words written in an alphabet and the sound speech portion is a spoken word.
PCT/US1991/006874 1990-09-26 1991-09-23 Audio-augmented handwriting recognition WO1992005517A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US58818590A 1990-09-26 1990-09-26
US588,185 1990-09-26

Publications (1)

Publication Number Publication Date
WO1992005517A1 true WO1992005517A1 (en) 1992-04-02

Family

ID=24352831

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1991/006874 WO1992005517A1 (en) 1990-09-26 1991-09-23 Audio-augmented handwriting recognition

Country Status (2)

Country Link
AU (1) AU8641891A (en)
WO (1) WO1992005517A1 (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS5759241A (en) * 1980-09-27 1982-04-09 Nobuhiko Sasaki Inputting method for kanji (chinese character)
JPS5858637A (en) * 1981-10-02 1983-04-07 Nec Corp Sentence input device
JPS58134371A (en) * 1982-02-03 1983-08-10 Nec Corp Japanese word input device
JPS60189070A (en) * 1984-03-08 1985-09-26 Fujitsu Ltd Character input device
JPS61240361A (en) * 1985-04-17 1986-10-25 Hitachi Electronics Eng Co Ltd Documentation device with hand-written character


Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
PATENT ABSTRACTS OF JAPAN vol. 10, no. 45 (P-430)21 February 1986 & JP,A,60 189 070 ( FUJITSU K.K. ) 26 September 1985 see abstract *
PATENT ABSTRACTS OF JAPAN vol. 11, no. 87 (P-557)17 March 1987 & JP,A,61 240 361 ( HITACHI ELECTRONICS ENG. CO. LTD. ) 25 October 1986 see abstract *
PATENT ABSTRACTS OF JAPAN vol. 6, no. 137 (P-130)24 July 1982 & JP,A,57 059 241 ( SASAKI NOBUHIKO ) 9 April 1982 see abstract *
PATENT ABSTRACTS OF JAPAN vol. 7, no. 146 (P-206)25 June 1983 & JP,A,58 058 637 ( NIPPON DENKI K.K. ) 7 April 1983 see abstract *
PATENT ABSTRACTS OF JAPAN vol. 7, no. 249 (P-234)5 November 1983 & JP,A,58 134 371 ( NIPPON DENKI K.K. ) 10 August 1983 see abstract *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6285785B1 (en) 1991-03-28 2001-09-04 International Business Machines Corporation Message recognition employing integrated speech and handwriting information
US5491758A (en) * 1993-01-27 1996-02-13 International Business Machines Corporation Automatic handwriting recognition using both static and dynamic parameters
US5539839A (en) * 1993-01-27 1996-07-23 International Business Machines Corporation Automatic handwriting recognition using both static and dynamic parameters
US5544264A (en) * 1993-01-27 1996-08-06 International Business Machines Corporation Automatic handwriting recognition using both static and dynamic parameters
US5544261A (en) * 1993-01-27 1996-08-06 International Business Machines Corporation Automatic handwriting recognition using both static and dynamic parameters
US5550931A (en) * 1993-01-27 1996-08-27 International Business Machines Corporation Automatic handwriting recognition using both static and dynamic parameters
WO2002045002A1 (en) * 2000-11-28 2002-06-06 Siemens Aktiengesellschaft Method and system for reducing the error rate in pattern recognitions
JP2008537806A (en) * 2005-02-08 2008-09-25 テジック コミュニケーションズ インク Method and apparatus for resolving manually input ambiguous text input using speech input
JP4829901B2 (en) * 2005-02-08 2011-12-07 テジック コミュニケーションズ インク Method and apparatus for confirming manually entered indeterminate text input using speech input
GB2428125A (en) * 2005-07-07 2007-01-17 Hewlett Packard Development Co Digital pen with speech input
US20100023312A1 (en) * 2008-07-23 2010-01-28 The Quantum Group, Inc. System and method enabling bi-translation for improved prescription accuracy
US9230222B2 (en) * 2008-07-23 2016-01-05 The Quantum Group, Inc. System and method enabling bi-translation for improved prescription accuracy

Also Published As

Publication number Publication date
AU8641891A (en) 1992-04-15

Similar Documents

Publication Publication Date Title
US5502774A (en) Automatic recognition of a consistent message using multiple complimentary sources of information
US6487532B1 (en) Apparatus and method for distinguishing similar-sounding utterances speech recognition
US7174288B2 (en) Multi-modal entry of ideogrammatic languages
EP1141941B1 (en) Handwritten or spoken words recognition with neural networks
Burr Designing a handwriting reader
US7336827B2 (en) System, process and software arrangement for recognizing handwritten characters
EP1564675B1 (en) Apparatus and method for searching for digital ink query
US20080008387A1 (en) Method and apparatus for recognition of handwritten symbols
Kavallieratou et al. Slant estimation algorithm for OCR systems
EP0505621A2 (en) Improved message recognition employing integrated speech and handwriting information
US4468756A (en) Method and apparatus for processing languages
US6826306B1 (en) System and method for automatic quality assurance of user enrollment in a recognition system
US7424156B2 (en) Recognition method and the same system of ingegrating vocal input and handwriting input
KR100480316B1 (en) Character recognition method and apparatus using writer-specific reference vectors generated during character-recognition processing
Oni et al. Computational modelling of an optical character recognition system for Yorùbá printed text images
WO1992005517A1 (en) Audio-augmented handwriting recognition
JP3444108B2 (en) Voice recognition device
CN112749629A (en) Engineering optimization method for Chinese lip language recognition of identity verification system
Lee et al. A Markov language model in Chinese text recognition
JP2989387B2 (en) Term recognition device and term recognition method in input character processing device
JP2660998B2 (en) Japanese language processor
Wagner et al. Isolated-word recognition of the complete vocabulary of spoken Chinese
KR100204618B1 (en) Method and system for recognition of character or graphic
Frosini et al. A fuzzy classification based system for handwritten character recognition
Goni et al. Scientific African

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AU BG BR CA FI HU JP KR NO RO SU

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH DE DK ES FR GB GR IT LU NL SE

NENP Non-entry into the national phase

Ref country code: CA