WO2023139769A1 - Grammar adjustment device and computer-readable storage medium - Google Patents
Grammar adjustment device and computer-readable storage medium
- Publication number
- WO2023139769A1 (PCT/JP2022/002282)
- Authority
- WO
- WIPO (PCT)
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
Definitions
- The present invention relates to a speech recognition grammar adjustment device and a computer-readable storage medium.
- The operation panel of an industrial device typically has many buttons and operation screens; operating it is complicated and takes time to master.
- A voice input interface lets users perform desired operations simply by uttering voice commands, so attempts have been made to improve operability using a voice input interface.
- The voice commands used to operate a device can be anticipated from the type of device, the site where it is installed, and the operations it supports. Expected voice commands can therefore be written out as a grammar (syntax and words). For example, see Patent Document 1.
- Evaluation data is used to check whether the created grammar achieves high recognition accuracy.
- The creator of the speech recognition system checks the recognition accuracy obtained with the created grammar and edits the grammar accordingly. Speech recognition grammars are often created manually.
- A grammar adjustment device according to one aspect of the present disclosure includes: a grammar storage unit that stores a grammar of voice commands for operating an industrial device; a grammar extraction unit that extracts a part of the grammar; a target registration unit that receives registration of a target for the speech recognition evaluation value of the extracted grammar; a speech recognition unit that performs speech recognition of evaluation speech data using the extracted grammar; and an evaluation value calculation unit that calculates the speech recognition evaluation value of the extracted grammar from the recognition results and the correct data of the evaluation speech data.
- The device further includes a grammar selection unit that selects, from among the one or more extracted grammars produced by the grammar extraction unit, a grammar that satisfies the target.
- A storage medium according to one aspect of the present disclosure stores processor-readable instructions that, when executed by one or more processors, store a grammar of voice commands for operating an industrial device, extract a part of the grammar, receive registration of a target for the speech recognition evaluation value of the extracted grammar, perform speech recognition of evaluation speech data using the extracted grammar, calculate the speech recognition evaluation value of the extracted grammar from the recognition results and the correct data of the evaluation speech data, and select, from among the one or more extracted grammars, a grammar that satisfies the target.
- In this way, grammar creation for speech recognition can be supported.
- FIG. 1 is a block diagram showing the configuration of the grammar adjustment device.
- FIG. 2 is a diagram showing examples of syntax definitions and word definitions.
- FIG. 3 is a diagram showing example combinations of speakers and recording locations for the evaluation data.
- FIG. 4 is a diagram showing an example of calculated evaluation values.
- FIG. 5 is a diagram showing an example of a target registration screen.
- FIG. 6 is a diagram showing example accuracy rates of different grammars.
- FIG. 7 is a flowchart explaining the processing of the grammar adjustment device.
- FIG. 8 is a diagram showing the hardware configuration of the grammar adjustment device.
- The grammar adjustment device 100 is described below.
- The grammar adjustment device 100 is implemented on an information processing device having an arithmetic unit and a storage unit. Examples of such information processing devices include PCs (personal computers) and mobile terminals, but the device is not limited to these.
- FIG. 1 shows the basic configuration of the grammar adjustment device 100.
- The grammar adjustment device 100 comprises an evaluation data storage unit 11, a target registration unit 12, a basic grammar storage unit 13, a grammar extraction unit 14, a speech recognition unit 15, an extracted grammar storage unit 16, an evaluation value calculation unit 17, and a grammar selection unit 18.
- The basic grammar storage unit 13 stores the grammars of voice commands that serve as the base.
- A voice command is a command for operating equipment in an industrial setting by voice.
- The grammar of a voice command consists of syntax and words.
- The basic grammar storage unit 13 includes a syntax storage unit 19 that stores syntax and a word storage unit 20 that stores words. Words include the words that make up voice commands and their phoneme representations. Syntax defines the arrangement of the words that make up a voice command.
- The base grammar is created exhaustively to cover as many voice commands as possible that are expected to be used in the field. For example, voice commands for setting the "override" of a numerical controller to "30" might include "override 30", "set override to 30", and so on.
- The grammar author constructs as many such variations as possible.
- The basic grammar is determined by the type of device that recognizes the voice commands, its specifications, the work content, and so on.
- A plurality of phoneme arrays may be assigned to one word.
- For example, the word "override" can be represented by multiple phoneme strings such as "o:ba:raido", "oubaaraido", and "oubaraido".
- The base grammar is constructed to cover as many such phoneme variants as possible.
- FIG. 2 shows an example syntax definition and an example word definition.
- The syntax definition defines the words that make up a voice command and the order of those words.
- "S" is the start symbol of the voice command.
- "NS_B" and "NS_E" are silent sections at the beginning and end of the sentence.
- The second and third lines define the "tags" that can appear in "COMMAND".
- The second line defines that the syntax element "COMMAND" consists of the tags "ROBOT" and "INTERFACE".
- The third line defines that the syntax element "COMMAND" consists of the tags "NAIGAI" and "INTERFACE".
- The first and second lines of the word definition define the Japanese notation and phoneme notation of the tag "ROBOT".
- The Japanese notation of the tag "ROBOT" is "robot" and its phoneme notation is "roboqto".
- The 3rd to 5th lines of the word definition define the Japanese notation and phoneme notation of the words included in the tag "NAIGAI".
- The tag "NAIGAI" contains two Japanese words, "external" and "internal".
- The phoneme notation of "external" is "gaibu" and that of "internal" is "naibu".
- The 6th to 8th lines of the word definition define the Japanese notation and phoneme notation of the word included in the tag "INTERFACE".
- The tag "INTERFACE" contains the single Japanese word "interface".
- "Interface" has two phoneme notations, "iNtafe:su" and "iNta:feisu". "%NS_B" defines the silence section [s] at the beginning of a sentence, and "%NS_E" defines the silence section [/s] at the end of a sentence.
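The text describes the FIG. 2 files only abstractly. As a hedged illustration, the definitions above resemble the grammar/voca notation used by recognizers such as Julius; a minimal sketch might look as follows (the concrete file syntax and the silB/silE silence models are assumptions — only the symbols, notations, and phoneme strings come from the description):

```
# Syntax definition (sketch of the FIG. 2 description)
S       : NS_B COMMAND NS_E
COMMAND : ROBOT INTERFACE
COMMAND : NAIGAI INTERFACE

# Word definition (sketch): notation followed by phoneme string
% ROBOT
robot       roboqto
% NAIGAI
external    gaibu
internal    naibu
% INTERFACE
interface   iNtafe:su
interface   iNta:feisu
% NS_B
[s]         silB
% NS_E
[/s]        silE
```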
- The grammar extraction unit 14 extracts a part of the grammar from the exhaustive grammar stored in the basic grammar storage unit 13.
- For the extraction, clustering by the k-means method is used.
- A method other than the k-means method may also be used for grammar extraction.
- The k-means clustering uses the acoustic distance between grammar entries. Acoustic distance can be obtained from an acoustic spectrum, from a phoneme string, or the like. When calculating acoustic distance from the acoustic spectrum, the acoustic spectrum of the voice command is vectorized and the cosine distance or Euclidean distance between the vectors is computed.
- When calculating acoustic distance from a phoneme string, cosine distance, Levenshtein distance, Jaro-Winkler distance, or Hamming distance is used. Cosine distance, Euclidean distance, Levenshtein distance, Jaro-Winkler distance, and Hamming distance are all well known.
- By computing the acoustic distance between the phoneme strings of words included in voice commands (e.g., "iNtafe:su", "iNta:feisu"), a part of the grammar can be extracted from each cluster of acoustically close words.
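As one concrete example of a phoneme-string distance, here is a minimal Python sketch of the Levenshtein (edit) distance applied to the two phoneme notations above. The patent names the distance but does not prescribe an implementation, so this is only one possible realization:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from a[:0] to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute ca -> cb
        prev = cur
    return prev[-1]

# The two phoneme notations of "interface" are acoustically close:
print(levenshtein("iNtafe:su", "iNta:feisu"))  # -> 2
```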
- The k-means method sets the centers of K clusters using random numbers, then (a) assigns each voice command (or word) to the nearest center and (b) recomputes the center of each cluster. Steps (a) and (b) are repeated until no cluster center changes, dividing the voice commands into clusters.
- The grammar extraction unit 14 extracts the grammars (syntax and words) of the voice commands included in the same cluster and outputs them to the speech recognition unit as grammars for evaluation.
- The k-means method is one example of a method for extracting grammars that are close in distance; methods other than k-means may be used.
- The results of the k-means method are affected by the random number used for the initial values and by the number of clusters K. The random number and the number of clusters K may be set manually by the user or set automatically by the grammar extraction unit 14.
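The patent leaves the feature representation used for clustering open. The sketch below, purely as an assumption, vectorizes each phoneme string as character-bigram counts and runs plain k-means with Euclidean distance, following steps (a) and (b) above:

```python
import random
from collections import Counter

def bigram_vector(s: str) -> Counter:
    """One possible vectorization: counts of adjacent character pairs."""
    return Counter(s[i:i + 2] for i in range(len(s) - 1))

def euclidean(u: Counter, v: Counter) -> float:
    return sum((u[k] - v[k]) ** 2 for k in set(u) | set(v)) ** 0.5

def mean_vector(vectors: list) -> Counter:
    total = Counter()
    for v in vectors:
        total.update(v)
    return Counter({k: n / len(vectors) for k, n in total.items()})

def kmeans(words: list, k: int, seed: int = 0) -> dict:
    random.seed(seed)  # initial centers depend on this random number
    vecs = {w: bigram_vector(w) for w in words}
    centers = [vecs[w] for w in random.sample(words, k)]
    assign = {}
    while True:
        # (a) assign each word to the nearest center
        new = {w: min(range(k), key=lambda c: euclidean(vecs[w], centers[c]))
               for w in words}
        if new == assign:  # stop when assignments (hence centers) are stable
            return assign
        assign = new
        # (b) recompute the center of each cluster
        for c in range(k):
            members = [vecs[w] for w in words if assign[w] == c]
            if members:
                centers[c] = mean_vector(members)

words = ["o:ba:raido", "oubaaraido", "oubaraido", "iNtafe:su", "iNta:feisu"]
print(kmeans(words, k=2))  # cluster assignments; results depend on the seed and on K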
- The evaluation data storage unit 11 stores voice data containing voice commands, recorded by a plurality of speakers at a plurality of recording locations, in association with correct data, which is the correct text for that voice data. For example, voice data of "external interface" uttered by several speakers at several recording locations is stored in association with the correct text "external interface".
- The speech data in the evaluation data storage unit 11 is recorded at different locations by speakers with different attributes (gender, age). Because the evaluation data is recorded at the sites where the voice commands are used, it includes the noise present at those sites.
- FIG. 3 is a table showing the relationship between the speakers of the evaluation data and the recording locations.
- The evaluation data in FIG. 3 includes voices recorded by speaker A (male, 60 years old) at factories A and B, voices recorded by speaker B (female, 30 years old) at factories C and D, and so on.
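As a sketch of how such evaluation records might be held — the field names and file names here are illustrative, not from the patent:

```python
from dataclasses import dataclass

@dataclass
class EvalSample:
    audio_path: str   # recorded utterance, including on-site noise
    transcript: str   # correct data: the correct text for the utterance
    speaker: str      # speaker attributes, e.g. gender and age
    location: str     # recording location

evaluation_data = [
    EvalSample("speakerA_factoryA_001.wav", "external interface", "A (male, 60)", "factory A"),
    EvalSample("speakerB_factoryC_001.wav", "external interface", "B (female, 30)", "factory C"),
]
```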
- The speech recognition unit 15 receives a voice command from the evaluation data storage unit 11 and performs speech recognition on it.
- The speech recognition unit 15 is generally composed of an acoustic model, a language model, and a decoder.
- The acoustic model receives speech data and, based on its feature values, outputs the phonemes (senones) that make up the speech data.
- The language model outputs the probability of occurrence of word strings.
- The language model selects hypothesis word strings based on the phonemes and outputs linguistically plausible candidates.
- The decoder outputs the word string with the highest probability as the recognition result, based on the outputs of the statistically trained acoustic model and language model.
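Schematically, the decoder searches for the word string W that maximizes the combined acoustic and language scores. A toy sketch of that criterion follows — it illustrates the standard log P(X|W) + λ·log P(W) combination, not the patent's implementation, and the candidate scores are made up:

```python
import math

def decode(candidates, acoustic_logprob, lm_prob, lm_weight=1.0):
    """Return the candidate word string with the highest combined score:
    log P(X | W) + lm_weight * log P(W)."""
    return max(candidates,
               key=lambda w: acoustic_logprob(w) + lm_weight * math.log(lm_prob(w)))

cands = ["external interface", "internal interface"]
best = decode(
    cands,
    acoustic_logprob=lambda w: {"external interface": -4.2, "internal interface": -5.0}[w],
    lm_prob=lambda w: {"external interface": 0.6, "internal interface": 0.4}[w],
)
print(best)  # "external interface"
```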
- The extracted grammar storage unit 16 includes a syntax storage unit 21 and a word storage unit 22, and stores the grammars extracted by the grammar extraction unit 14.
- The speech recognition unit 15 performs speech recognition using the grammar stored in the extracted grammar storage unit 16.
- The evaluation value calculation unit 17 compares the correct text from the evaluation data storage unit 11 with the recognition result of the speech recognition unit 15 and calculates the accuracy rate of speech recognition.
- FIG. 4 shows an example of the accuracy rate used as the evaluation value.
- Types of voice commands include, for example, approval commands, numerical commands, and transition commands.
- An approval command is a command indicating approval, for example "yes", "no", "execute", and "abort".
- Numerical commands are commands for designating numerical values such as "0.5", "1", "2", and "100".
- A transition command is a command for designating a display screen such as the "home screen" or the "speed setting screen".
- Machine operation commands such as "set a workpiece" are also conceivable. The per-type accuracy computation is sketched below.
- The target registration unit 12 accepts registration of target values for speech recognition.
- The target registration unit 12 receives target values such as a target accuracy rate for all voice commands, a target accuracy rate for each type of voice command, and a maximum execution time.
- FIG. 5 shows an example of the target registration screen.
- In this example, the target accuracy rates per command type are set as "approval command: 95% or more", "numerical command: 90% or more", and "transition command: 80% or more", and the maximum execution time is set to "within 30 minutes".
- The grammar selection unit 18 compares the speech recognition results with the target accuracy rates, and if a grammar is determined to satisfy the targets, it selects that grammar as an appropriate grammar. The processing of the grammar selection unit 18 is repeated until the target time for grammar adjustment elapses or the target accuracy rates are met. When the target time for grammar adjustment has elapsed, an appropriate grammar is selected from among the grammars evaluated so far.
- Alternatively, the grammar selection unit 18 may present the accuracy rate of each grammar to the grammar creator and let the creator select the grammar.
- The accuracy rate of voice commands is calculated separately for approval commands, transition commands, and numerical commands.
- Approval commands are used to confirm operations, so a high accuracy rate is required.
- Numerical commands, which specify numerical values, also require a high accuracy rate.
- Transition commands, which instruct screen transitions, may tolerate a lower accuracy rate than approval or numerical commands.
- A target accuracy rate can be set for each voice command or for each type of voice command, and a grammar that achieves the targets is selected automatically. For example, FIG. 6 shows the accuracy rates of grammar A and grammar B.
- Since the accuracy rates of the "approval command", "numerical command", and "transition command" of grammar A all satisfy the target accuracy rates registered by the target registration unit 12, grammar A is selected as the appropriate grammar. Additional conditions may be set for the case where multiple grammars satisfy the targets. A check of this kind is sketched below.
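The selection check reduces to comparing each per-type accuracy rate against its registered target. In the sketch below, the targets mirror FIG. 5; the accuracy values for grammar A and grammar B are illustrative, not taken from FIG. 6:

```python
def satisfies_targets(accuracy: dict, targets: dict) -> bool:
    """True if every command type meets its registered target accuracy rate."""
    return all(accuracy.get(ctype, 0.0) >= goal for ctype, goal in targets.items())

targets   = {"approval": 0.95, "numerical": 0.90, "transition": 0.80}  # from FIG. 5
grammar_a = {"approval": 0.99, "numerical": 0.95, "transition": 0.85}  # illustrative
grammar_b = {"approval": 0.93, "numerical": 0.97, "transition": 0.85}  # illustrative

print(satisfies_targets(grammar_a, targets))  # True  -> grammar A is selected
print(satisfies_targets(grammar_b, targets))  # False -> approval rate misses its target
```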
- The grammar adjustment device 100 receives registration of a target accuracy rate for voice commands (step S1), registration of the maximum execution time for grammar adjustment (step S2), and registration of a cluster division criterion (step S3).
- The grammar adjustment device 100 then clusters the voice commands (or the words included in the voice commands) stored in the basic grammar storage unit 13 (step S4), extracts one or more representative voice commands (or words) from each cluster (step S5), and reconstructs the grammar using the extracted voice commands (or words) (step S6).
- The grammar adjustment device 100 performs speech recognition on the evaluation data using the grammar reconstructed in step S6 (step S7).
- The grammar adjustment device 100 calculates an evaluation value for the speech recognition (step S8).
- The grammar adjustment device 100 compares the evaluation result with the target accuracy rate, and if the evaluation result satisfies the target (step S9; Yes), selects that grammar (step S10).
- If the evaluation result does not satisfy the target accuracy rate (step S9; No), it is determined whether the maximum execution time has been reached (step S11).
- If the maximum execution time has been reached (step S11; Yes), the grammar adjustment device 100 presents the grammars evaluated so far to the user and accepts a grammar selection (step S10).
- If the maximum execution time has not been reached (step S11; No), the process returns to step S4, and steps S4 to S9 are repeated.
- In this flow, grammar adjustment ends once the target accuracy rate is satisfied, but the adjustment may instead be continued until the maximum execution time is reached. The overall loop is sketched below.
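Putting steps S4 to S11 together, the adjustment loop could be sketched as follows. The five callables passed in (cluster, extract, rebuild, recognize, choose) are hypothetical placeholders for the units described above, and accuracy_by_type / satisfies_targets are the sketches given earlier:

```python
import time

def adjust_grammar(base_grammar, eval_data, targets, max_minutes, k,
                   cluster, extract, rebuild, recognize, choose):
    """Schematic loop over steps S4-S11; the callables are caller-supplied stand-ins."""
    deadline = time.monotonic() + max_minutes * 60
    tried = []
    while True:
        clusters = cluster(base_grammar, k)       # S4: cluster voice commands
        grammar = rebuild(extract(clusters))      # S5 + S6: representatives -> new grammar
        results = recognize(grammar, eval_data)   # S7: recognition on evaluation data
        accuracy = accuracy_by_type(results)      # S8: evaluation value
        tried.append((grammar, accuracy))
        if satisfies_targets(accuracy, targets):  # S9: targets met?
            return grammar                        # S10: select this grammar
        if time.monotonic() >= deadline:          # S11: maximum execution time reached
            return choose(tried)                  # S10: user picks from evaluated grammars
```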
- The grammar adjustment device 100 of the present disclosure is a device that supports the creation of voice command grammars: it extracts a part of the exhaustively created grammar, reconstructs the grammar, and selects a grammar with a high accuracy rate.
- Because the accuracy rate is calculated for each type of voice command, the grammar can be adjusted so that its accuracy suits the situation in which speech recognition is used.
- Because the evaluation data is recorded at the sites where voice commands are used, a grammar can be constructed that is well suited to recognizing voice data containing the noise peculiar to a site or time period.
- Even when noise is included, recognition candidates are chosen from the words and syntax registered in the grammar, which improves the accuracy rate.
- Because the grammar adjustment device 100 of the present disclosure adjusts the grammar automatically, it can optimize the grammar based on objective criteria without depending on the subjectivity or know-how of the grammar creator. Moreover, since the adjustment is automatic, even an inexperienced technician can adjust the grammar.
- The hardware configuration of the grammar adjustment device 100 is described with reference to FIG. 8.
- The CPU 111 included in the grammar adjustment device 100 is a processor that controls the grammar adjustment device 100 as a whole.
- The CPU 111 reads the system program stored in the ROM 112 via the bus and controls the entire grammar adjustment device 100 according to that system program.
- The RAM 113 temporarily stores calculation data, display data, various data input by the user via the input unit 71, and the like.
- The display unit 70 is a monitor or the like attached to the grammar adjustment device 100.
- The display unit 70 displays the operation screens, setting screens, and the like of the grammar adjustment device 100.
- The input unit 71 is either integrated with the display unit 70 or separate from it, such as a keyboard, touch panel, or operation buttons.
- The user operates the input unit 71 to provide input to the screens displayed on the display unit 70.
- The display unit 70 and the input unit 71 may be a mobile terminal.
- The non-volatile memory 114 is, for example, memory backed up by a battery (not shown) so that its contents are retained even when the grammar adjustment device 100 is powered off.
- The non-volatile memory 114 stores machining programs, system programs, available options, billing tables, and the like.
- The non-volatile memory 114 also stores programs read from external devices via an interface (not shown), programs input via the input unit 71, and various data obtained from the parts of the grammar adjustment device 100, machine tools, and the like (for example, setting parameters obtained from a machine tool). The programs and data stored in the non-volatile memory 114 may be loaded into the RAM 113 at the time of execution or use.
- Various system programs are written into the ROM 112 in advance.
- Reference numerals: 100 grammar adjustment device; 11 evaluation data storage unit; 12 target registration unit; 13 basic grammar storage unit; 14 grammar extraction unit; 15 speech recognition unit; 16 extracted grammar storage unit; 17 evaluation value calculation unit; 18 grammar selection unit; 19 syntax storage unit; 20 word storage unit; 21 syntax storage unit; 22 word storage unit; 70 display unit; 71 input unit; 111 CPU; 112 ROM; 113 RAM; 114 non-volatile memory
Priority Applications (2)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/JP2022/002282 (WO2023139769A1) | 2022-01-21 | 2022-01-21 | Grammar adjustment device and computer-readable storage medium |
| JP2023575014A (JPWO2023139769A1) | 2022-01-21 | 2022-01-21 | |

Applications Claiming Priority (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/JP2022/002282 (WO2023139769A1) | 2022-01-21 | 2022-01-21 | Grammar adjustment device and computer-readable storage medium |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| WO2023139769A1 | 2023-07-27 |

Family

ID=87348531

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/JP2022/002282 (WO2023139769A1) | Grammar adjustment device and computer-readable storage medium | 2022-01-21 | 2022-01-21 |

Country Status (2)

| Country | Link |
|---|---|
| JP | JPWO2023139769A1 |
| WO | WO2023139769A1 |
Citations (5)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JPS602998A | 1983-06-20 | 1985-01-09 | Fujitsu Limited | Speech recognition device |
| JPH0250197A | 1988-05-06 | 1990-02-20 | Ricoh Co., Ltd. | Dictionary pattern creation device |
| JP2009217006A | 2008-03-11 | 2009-09-24 | Nippon Hoso Kyokai (NHK) | Dictionary correction device, system, and computer program |
| JP2009229529A | 2008-03-19 | 2009-10-08 | Toshiba Corp. | Speech recognition device and speech recognition method |
| JP2014191246A | 2013-03-28 | 2014-10-06 | NEC Corp. | Recognition processing control device, recognition processing control method, and recognition processing control program |
Also Published As

| Publication number | Publication date |
|---|---|
| JPWO2023139769A1 | 2023-07-27 |
Legal Events

| Code | Title | Description |
|---|---|---|
| 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 22921926; Country of ref document: EP; Kind code of ref document: A1 |
| WWE | WIPO information: entry into national phase | Ref document number: 2023575014; Country of ref document: JP |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | EP: PCT application non-entry in European phase | Ref document number: 22921926; Country of ref document: EP; Kind code of ref document: A1 |