WO2023139770A1 - Grammar creation support device and computer-readable storage medium - Google Patents
- Publication number
- WO2023139770A1 (PCT/JP2022/002285)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- grammar
- evaluation
- data
- support device
- recognition
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
Definitions
- the present invention relates to a speech recognition grammar creation support device and a computer-readable storage medium.
- the operation part of such a device has many buttons and operation screens; operation is complicated and can take time to master.
- a voice input interface allows users to perform desired operations simply by uttering voice commands. Therefore, attempts have been made to improve operability using a voice input interface.
- the voice commands used to operate the device can be assumed depending on the type of device that uses the voice command, the site where the device is installed, and the operation details of the device. Therefore, expected voice commands can be created in grammar (syntax and words). For example, see Patent Document 1.
- Evaluation data is used to evaluate whether the accuracy of the created grammar is high.
- the creator of the speech recognition system checks the accuracy of speech recognition when using the created grammar, and edits the grammar. Grammar for speech recognition is often described in text.
- a grammar creation support device, which is one aspect of the present disclosure, includes: a grammar storage unit that stores grammars of voice commands for operating industrial equipment; a speech recognition unit that performs speech recognition based on the grammars; an evaluation data storage unit that stores evaluation data including speech data for grammar evaluation and correct data for the evaluation speech data; a recognition result evaluation unit that creates a summary of recognition results based on the recognition results of the speech recognition and the correct data, and presents the summary in association with the grammar; and a grammar processing unit that accepts processing of the grammar.
- a storage medium which is one aspect of the present disclosure, stores a grammar of a voice command for operating an industrial device, performs speech recognition of speech data for evaluation of the grammar based on the grammar by being executed by one or more processors, creates a summary of the recognition result based on the recognition result of the speech recognition and correct data of the speech data for evaluation, presents the summary of the recognition result and the grammar in association with each other, and stores processor-readable instructions for accepting processing of the grammar.
- grammar creation for speech recognition can be supported.
- FIG. 1 is a block diagram showing the configuration of the grammar creation support device.
- FIG. 2 is a diagram showing examples of syntax definitions and word definitions.
- FIG. 3 is a diagram showing example combinations of speakers and recording locations for the evaluation data.
- FIG. 4 is a diagram showing an example of an evaluation result display screen.
- FIG. 5 is a diagram showing an example of a history display (log).
- FIG. 6 is a diagram showing an example of a grammar image display.
- FIG. 7 is a diagram showing a processing example of the grammar.
- A flowchart explaining the processing of the grammar creation support device and a diagram showing its hardware configuration are also provided.
- the grammar creation support device 100 will be described below.
- the grammar creation support device 100 is implemented in an information processing device having a calculation unit and a storage unit. Examples of such information processing devices include PCs (personal computers) and mobile terminals, but are not limited to these.
- Fig. 1 shows the basic configuration of the grammar creation support device 100.
- the grammar creation support device 100 comprises an evaluation data storage unit 11, a target performance registration unit 12, a speech recognition unit 13, a grammar storage unit 14, a recognition result evaluation unit 15, a grammar processing unit 16, and an evaluation history storage unit 17.
- the speech recognition unit 13 receives voice data as input and outputs the recognition result of the input voice data in text format.
- the speech recognition unit 13 is generally composed of an acoustic model, a language model and a decoder.
- the acoustic model receives speech data and outputs phonemes (senones) that form the speech data based on the feature amount of the speech data.
- the language model outputs the probability of occurrence of word strings.
- the language model selects hypothetical word strings based on phonemes and outputs linguistically plausible candidates.
- the decoder outputs a word string with a high probability as the recognition result, based on the outputs of the statistically created acoustic model and language model.
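As a minimal sketch of this selection (the word strings and scores below are hypothetical; a real decoder searches a large hypothesis lattice rather than a short list), the decoder can be modeled as picking the hypothesis with the highest combined log-probability:

```python
import math

def decode(hypotheses):
    """Pick the word string with the highest combined score.

    Each hypothesis carries an acoustic log-probability (how well the
    phonemes match the audio) and a language-model log-probability
    (how plausible the word string is). The decoder returns the
    hypothesis maximizing their sum.
    """
    return max(hypotheses, key=lambda h: h["acoustic"] + h["lm"])

# Hypothetical scores for two competing word strings.
candidates = [
    {"words": "external interface", "acoustic": math.log(0.6), "lm": math.log(0.5)},
    {"words": "eternal interface",  "acoustic": math.log(0.3), "lm": math.log(0.1)},
]
best = decode(candidates)
```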
- the grammar storage unit 14 stores the grammar of voice commands.
- Voice commands here are spoken commands for operating equipment in the industrial field.
- a speech recognition unit 13 selects a speech command defined in the grammar.
- the grammar of voice commands consists of syntax and words.
- the grammar storage unit 14 includes a syntax storage unit 18 that stores syntax and a word storage unit 19 that stores words. Words include words to be recognized by speech recognition and phoneme representations of the words. Syntax defines the words that make up a voice command and the order of the words.
- the base grammar is exhaustively created to cover as many voice commands as possible that are expected to be used in the field.
- the grammar creation support device 100 supports the creation of an appropriate grammar by processing the base grammar based on the recognition results of the evaluation data.
- the base grammar is determined by the type of device that recognizes voice commands, the type of work, and so on.
- FIG. 2 shows an example syntax definition and an example word definition.
- An example syntax definition defines the words that make up a voice command and the order of the words.
- “S” is the start symbol of the voice command
- "NS_B” and “NS_E” are silent sections at the beginning and end of the sentence.
- the second and third lines define "tags" that go into “COMMAND”.
- the second line defines that the syntax element "COMMAND” includes tags "ROBOT” and "INTERFACE”
- the third line defines that the syntax element "COMMAND” includes tags "NAIGAI” and "INTERFACE”.
- the first and second lines of the word definition define the Japanese notation and phoneme notation of the tag "ROBOT".
- the Japanese notation of the tag "ROBOT” is "robot” and the phoneme notation is "roboqto”.
- the 3rd to 5th lines of the word definition define the Japanese notation and the phoneme notation of the Japanese included in the tag "NAIGAI”.
- the tag "NAIGAI” contains two Japanese words, "external” and "internal.”
- the "outside” phoneme is "gaibu” and the "inside” phoneme is "naibu”.
- the 6th to 8th lines of the word definition define the Japanese notation and the phoneme notation of the Japanese included in the tag "INTERFACE".
- the tag "INTERFACE” contains one Japanese word "interface”.
- the “interface” has two types of phoneme notation “iNtafe:su” and “iNta:feisu”. "%NS_B” defines a silence section [s] at the beginning of a sentence, and “%NS_E” defines a silence section [/s] at the end of a sentence.
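The syntax and word definitions above can be represented as plain data. The following sketch (Python; the tag contents are taken from the example in FIG. 2, everything else is illustrative) expands the two COMMAND alternatives into every word sequence the grammar accepts:

```python
from itertools import product

# Syntax: each COMMAND alternative is an ordered sequence of tags.
SYNTAX = [
    ["ROBOT", "INTERFACE"],
    ["NAIGAI", "INTERFACE"],
]

# Words: tag -> list of (notation, phoneme notation) pairs.
WORDS = {
    "ROBOT": [("robot", "roboqto")],
    "NAIGAI": [("internal", "naibu"), ("external", "gaibu")],
    "INTERFACE": [("interface", "iNtafe:su"), ("interface", "iNta:feisu")],
}

def enumerate_commands():
    """Expand the syntax into every word sequence the grammar accepts."""
    commands = []
    for alternative in SYNTAX:
        choices = [WORDS[tag] for tag in alternative]
        for combo in product(*choices):
            words = " ".join(w for w, _ in combo)
            phonemes = " ".join(p for _, p in combo)
            commands.append((words, phonemes))
    return commands
```

With the example definitions this yields six accepted commands: one ROBOT variant and two NAIGAI variants, each combined with the two phoneme notations of INTERFACE.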
- the evaluation data storage unit 11 associates and stores voice data containing voice commands recorded by a plurality of speakers at a plurality of recording locations with correct data, which is a correct text for the voice data. For example, voice data of utterances of "external interface" by a plurality of speakers at a plurality of recording locations and correct data (text) of "external interface” are stored in association with each other.
- the evaluation data includes voice data recorded at different recording locations by speakers with different attributes (gender, age).
- FIG. 3 is a table showing the relationship between the speaker of the evaluation data and the recording location.
- the evaluation data in FIG. 3 includes voices recorded by speaker A (male, 60 years old) at factories A and B, voices recorded by speaker B (female, 30 years old) at factories C and D, and the like.
- the target performance registration unit 12 accepts registration of target performance for speech recognition.
- the target performance registration unit 12 receives target values such as the accuracy rate of voice commands, the accuracy rate for each type of voice command, and the processing time (average value) of voice recognition.
- the registered contents of the target performance are reflected on the evaluation result display screen described later.
- the recognition result evaluation unit 15 compares the correct text stored in the evaluation data storage unit with the speech data recognition result, creates a summary of the grammar evaluation result, and displays the created summary on the display unit.
- FIG. 4 is an example of a recognition result display screen. In the example of FIG. 4, the evaluation of the entire voice command and the evaluation of each type of voice command are displayed.
- Types of voice commands include, for example, approval commands, numerical commands, and transition commands.
- An approval command is a command indicating approval. Assume that the approval commands include "yes," "no," "execute," "abort," and the like.
- Numerical commands are commands for designating numerical values such as "0.5", "1", "2", and "100".
- a “transition command” is a command for designating a display screen such as a "home screen” or a "speed setting screen”.
- a “machine operation command” such as "set a workpiece” may be considered.
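The per-type evaluation shown on the evaluation result display screen can be sketched as a comparison of recognition results against the correct data, grouped by command type. The field names below are assumptions for illustration, not the device's actual data format:

```python
from collections import defaultdict

def summarize(results):
    """Compute the overall and per-type accuracy rates.

    `results` is a list of dicts with hypothetical keys: "type"
    (e.g. "approval"), "correct" (correct text from the evaluation
    data), and "recognized" (the recognizer's text output).
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in results:
        totals[r["type"]] += 1
        if r["recognized"] == r["correct"]:
            hits[r["type"]] += 1
    per_type = {t: hits[t] / totals[t] for t in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return overall, per_type
```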
- the processing time of speech recognition may be displayed on the recognition result display screen.
- the target performance registered by the target performance registration unit may be displayed.
- the recognition result evaluation unit 15 may display a history of recognition results.
- FIG. 5 shows a history display screen. On the history display screen, past voice recognition data can be selected.
- the identification number of the evaluation result and the execution time of the speech recognition are displayed. Selecting a time or identification number displays the selected speech recognition evaluation and the grammar used for that recognition. Note that the history display screen is not limited to the arrangement shown in FIG. 5.
- the grammar processing unit 16 accepts grammar processing (editing).
- the creator of the grammar can process (edit) the grammar while confirming the evaluation result of speech recognition and the grammar corresponding to the evaluation result.
- the grammar may be displayed as text or as an image.
- in the image display, the acoustic distance between the words of the voice commands is calculated, the words are arranged accordingly, and the paths between words are connected by links.
- the acoustic distance may be calculated from the speech data or the correct answer data of the evaluation data, or may be calculated from the phoneme notation of the grammar.
- FIG. 6 shows an image display example of the grammar.
- FIG. 6 is an image display example of the syntax definitions and word definitions of FIG.
- the words defined by "ROBOT" and "INTERFACE", and the words defined by "NAIGAI" and "INTERFACE", are included in the grammatical element "COMMAND".
- the grammar processor 16 finds the acoustic distances of these words.
- "naibu" and "gaibu", and "iNtafe:su" and "iNta:feisu", are acoustically close, so they are displayed at close positions.
- "roboqto” is displayed in a distant position because it is acoustically distant from any other word.
- the grammar processing unit 16 arranges words that can be included in the syntax on the screen and connects the paths between the words with links. For example, in the example of FIG. 6, the words in "ROBOT” and the words in "INTERFACE”, and the words in "NAIGAI” and the words in "INTERFACE” are connected by links.
- a well-known network visualization method is used for arranging words.
- a spring model is exemplified as one of network visualization methods.
- the spring model of the present disclosure treats words as nodes and calculates the acoustic distance between any two nodes. The acoustic distance between two nodes is treated as the natural length of a spring placed between the two nodes. After the words are arranged in the graph, the words are connected with links according to the syntax.
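A minimal sketch of such a spring model follows (plain gradient iteration; the distance values are hypothetical, and a production implementation would use an established graph-layout algorithm):

```python
import random

def spring_layout(nodes, dist, iterations=500, step=0.05):
    """Arrange nodes in 2-D so that the Euclidean gaps approximate the
    given acoustic distances, treated as spring natural lengths.

    `dist[(a, b)]` holds the acoustic distance for each sorted node pair.
    """
    random.seed(0)
    pos = {n: (random.random(), random.random()) for n in nodes}
    for _ in range(iterations):
        for a in nodes:
            fx = fy = 0.0
            ax, ay = pos[a]
            for b in nodes:
                if a == b:
                    continue
                bx, by = pos[b]
                dx, dy = bx - ax, by - ay
                d = max((dx * dx + dy * dy) ** 0.5, 1e-9)
                target = dist[tuple(sorted((a, b)))]
                # Spring force: pull together if too far apart, push if too close.
                f = (d - target) / d
                fx += f * dx
                fy += f * dy
            pos[a] = (ax + step * fx, ay + step * fy)
    return pos

# Hypothetical acoustic distances: "naibu" and "gaibu" are close,
# "roboqto" is far from both.
DIST = {("gaibu", "naibu"): 0.2,
        ("gaibu", "roboqto"): 1.0,
        ("naibu", "roboqto"): 1.0}
positions = spring_layout(["naibu", "gaibu", "roboqto"], DIST)
```

After convergence, the acoustically close pair ends up plotted near each other while "roboqto" sits apart, matching the display described above.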
- the matching portion of the phonemes includes the phoneme “aib” included in “naibu” and "gaibu”.
- An example of a part where the distance between phonemes is close is the phoneme "afe:" included in "iNtafe:su" and the phoneme ":fei" included in "iNta:feisu".
- bold type is used to highlight these.
- the high appearance rate, the high matching rate, and the like may be expressed by the size of characters.
- FIG. 7 is a modified example of the grammar of FIG.
- the "naibu” link is removed.
- the grammar creator can remove the link of "naibu" if "naibu" and "gaibu" are being misrecognized and the specifications allow "naibu" to be dropped. If the specifications require the word "naibu", the link can be left manually.
- words and syntax that cannot be removed from the specifications can be left at the creator's discretion.
- Grammar processing and recognition result evaluation are repeated.
- the creator of the grammar can confirm the evaluation of the recognition results (for example, accuracy rate) in relation to the processing of the grammar, and can customize the grammar by processing the grammar within a range that complies with the specifications.
- the evaluation history storage unit 17 stores recognition results and grammars in association with each other.
- when a grammar stored in the evaluation history storage unit 17 is selected, the evaluation result display screen shown in FIG. 4 is displayed.
- the creator of the grammar processes the grammar while referring to summary information such as the accuracy rate of speech recognition.
- approval commands such as "yes” and "no” are used for final confirmation, so a high accuracy rate is required.
- Numerical commands that specify numerical values also require a high accuracy rate.
- a transition command specifying a screen transition may have a lower accuracy rate than an approval command or a numerical command.
- Grammar authors can register such performance targets and refine the grammars while considering the needs of each site.
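Checking a recognition summary against the registered targets can be sketched as follows. The target values are hypothetical and only reflect the relative requirements discussed above (approval and numerical commands need high accuracy, transition commands may tolerate less):

```python
# Hypothetical per-type target accuracy rates.
TARGETS = {"approval": 0.99, "numerical": 0.99, "transition": 0.95}

def meets_targets(per_type_accuracy, targets=TARGETS):
    """Return the command types whose accuracy rate is below target."""
    return [t for t, target in targets.items()
            if per_type_accuracy.get(t, 0.0) < target]
```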
- the grammar creation support apparatus 100 receives registration of the target performance of speech recognition (step S1) and registration of the number of saved evaluation histories of speech recognition (step S2).
- the grammar creation support device 100 acquires grammar evaluation data (step S3).
- the creator of the grammar creates the base grammar based on the site specifications.
- the base grammar is created as comprehensively as possible according to the requests of the device users.
- the grammar creation support device 100 stores the base grammar (step S4).
- the grammar creation support device 100 performs speech recognition of the evaluation data using the registered grammar (step S5).
- the grammar creation support device 100 summarizes the recognition result of step S5 and presents it to the creator (step S6).
- when the creator confirms the recognition result and determines that the grammar is complete (step S7; YES), the creator finishes creating the grammar.
- when the creator checks the recognition result and determines that the grammar needs to be corrected (step S7; NO), the previously created grammar and a summary of the recognition result are stored in the evaluation history storage unit, and the grammar is processed (step S8).
- the grammar creation support device 100 registers the processed grammar in step S8, proceeds to step S5, and performs speech recognition using the registered grammar.
- the grammar creator compares the previously created grammar with the newly created grammar.
- Grammar creation support device 100 repeats the processing from step S5 to step S8 until the creator determines that the grammar is complete.
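The loop from step S5 to step S8 can be sketched as follows; the callables are stand-ins for the device's recognition, summarization, and the creator's decisions, not an actual implementation:

```python
def refine_grammar(base_grammar, evaluation_data, recognize, summarize, is_complete, edit):
    """Iterate speech recognition (S5), summarization (S6), completion
    check (S7), and grammar processing (S8) until the creator decides
    the grammar is complete.

    `recognize(grammar, data)` returns recognition results,
    `summarize(results)` builds the summary presented to the creator,
    `is_complete(summary)` models the S7 decision, and
    `edit(grammar, summary)` models the creator's processing in S8.
    """
    history = []                  # past grammars and summaries, stored before each edit
    grammar = base_grammar        # S4: the registered base grammar
    while True:
        results = recognize(grammar, evaluation_data)   # S5
        summary = summarize(results)                    # S6
        if is_complete(summary):                        # S7: YES
            return grammar, history
        history.append((grammar, summary))              # store before editing
        grammar = edit(grammar, summary)                # S8
```

For example, modeling the grammar as a word set and the summary as an accuracy rate, removing a confusable word can lift the accuracy above a threshold and terminate the loop.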
- the grammar creation support device 100 of the present disclosure is a device that supports the creation of voice command grammars; it performs speech recognition of evaluation data using the created grammar, summarizes the recognition results of the evaluation data, and presents the summary to the grammar creator.
- the recognition result of the evaluation data is calculated for all voice commands and for each type of voice command.
- the target performance differs for each type of voice command.
- a grammar creator can manipulate the grammar to achieve the target performance for each type of voice command.
- the grammar can be displayed as text or as an image.
- the acoustic distance of words is used to connect words (nodes) with links according to the syntax. Because acoustic distance is used to place words, grammatical structure can be visually determined.
- the acoustic distance may be calculated from the speech data of the evaluation data, or may be calculated from the phonemes expressed in text.
- methods of calculating the acoustic distance from speech data include the distance between distributions. Methods of calculating the acoustic distance from phonemes expressed in text include the cosine distance, Euclidean distance, Levenshtein distance, Jaro-Winkler distance, and Hamming distance. The method of calculating the acoustic distance is not limited to these.
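As one concrete option among the text-based methods listed, the Levenshtein (edit) distance between phoneme notations can be computed with a standard dynamic-programming sketch:

```python
def levenshtein(a, b):
    """Edit distance between two phoneme strings, usable as a simple
    text-based acoustic distance between words."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]
```

With the phoneme notations from FIG. 2, "naibu" and "gaibu" differ by a single substitution, consistent with the close placement described earlier.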
- Industrial equipment is installed in noisy sites such as factories. Noise is characteristic for each site or time period.
- evaluation data is acquired at the site where the equipment is installed, and evaluation is performed in consideration of noise unique to the site.
- the hardware configuration of the grammar creation support device 100 will be described with reference to FIG.
- the CPU 111 included in the grammar creation support device 100 is a processor that controls the grammar creation support device 100 as a whole.
- the CPU 111 reads the system program stored in the ROM 112 via the bus, and controls the entire grammar creation support device 100 according to the system program.
- the RAM 113 temporarily stores calculation data, display data, various data input by the user via the input unit 71, and the like.
- the display unit 70 is a monitor or the like attached to the grammar creation support device 100.
- the display unit 70 displays an operation screen, a setting screen, and the like of the grammar creation support device 100 .
- the input unit 71 is integrated with the display unit 70 or is a keyboard, touch panel, operation button, etc. separate from the display unit 70 .
- the user operates the input unit 71 to perform input to the screen displayed on the display unit 70 .
- the display unit 70 and the input unit 71 may be mobile terminals.
- the non-volatile memory 114 is, for example, a memory that is backed up by a battery (not shown) so that the memory state is maintained even when the power of the grammar creation support device 100 is turned off.
- the non-volatile memory 114 stores machining programs, system programs, available options, billing tables, and the like.
- the nonvolatile memory 114 stores a program read from an external device via an interface (not shown), a program input via the input unit 71, and various data obtained from each unit of the grammar creation support device 100, a machine tool, etc. (for example, setting parameters obtained from the machine tool, etc.). Programs and various data stored in the non-volatile memory 114 may be developed in the RAM 113 at the time of execution/use.
- Various system programs are pre-written in the ROM 112 .
- Reference numerals: 100 grammar creation support device; 11 evaluation data storage unit; 12 target performance registration unit; 13 speech recognition unit; 14 grammar storage unit; 15 recognition result evaluation unit; 16 grammar processing unit; 17 evaluation history storage unit; 18 syntax storage unit; 19 word storage unit; 70 display unit; 71 input unit; 111 CPU; 112 ROM; 113 RAM; 114 non-volatile memory
Landscapes
- Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Machine Translation (AREA)
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2023575015A JPWO2023139770A1 | 2022-01-21 | 2022-01-21 | |
PCT/JP2022/002285 WO2023139770A1 (ja) | 2022-01-21 | 2022-01-21 | Grammar creation support device and computer-readable storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2022/002285 WO2023139770A1 (ja) | 2022-01-21 | 2022-01-21 | Grammar creation support device and computer-readable storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023139770A1 true WO2023139770A1 (ja) | 2023-07-27 |
Family
ID=87348529
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2022/002285 WO2023139770A1 (ja) | 2022-01-21 | 2022-01-21 | Grammar creation support device and computer-readable storage medium |
Country Status (2)
Country | Link |
---|---|
JP (1) | JPWO2023139770A1 |
WO (1) | WO2023139770A1 (enrdf_load_stackoverflow) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2004151547A (ja) * | 2002-10-31 | 2004-05-27 | Toshiba Corp | Recognition grammar model creation method and inspection method |
JP2009229529A (ja) * | 2008-03-19 | 2009-10-08 | Toshiba Corp | Speech recognition device and speech recognition method |
JP2018040906A (ja) * | 2016-09-06 | 2018-03-15 | Toshiba Corp | Dictionary update device and program |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP5161174B2 (ja) * | 2009-08-28 | 2013-03-13 | Nippon Telegraph and Telephone Corporation | Route search device, speech recognition device, methods thereof, and program |
2022
- 2022-01-21 JP JP2023575015A patent/JPWO2023139770A1/ja active Pending
- 2022-01-21 WO PCT/JP2022/002285 patent/WO2023139770A1/ja active Application Filing
Also Published As
Publication number | Publication date |
---|---|
JPWO2023139770A1 | 2023-07-27 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 22921927 Country of ref document: EP Kind code of ref document: A1 |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2023575015 Country of ref document: JP |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 22921927 Country of ref document: EP Kind code of ref document: A1 |