WO2023139769A1 - Grammar adjustment device and computer-readable storage medium - Google Patents

Grammar adjustment device and computer-readable storage medium

Info

Publication number
WO2023139769A1
Authority
WO
WIPO (PCT)
Prior art keywords
grammar
extracted
speech recognition
unit
grammars
Prior art date
Application number
PCT/JP2022/002282
Other languages
English (en)
French (fr)
Japanese (ja)
Inventor
Yasuhiro Shibasaki
Original Assignee
FANUC Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by FANUC Corporation
Priority to PCT/JP2022/002282
Priority to JP2023575014A
Publication of WO2023139769A1

Links

Images

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice

Definitions

  • the present invention relates to a speech recognition grammar adjustment device and a computer-readable storage medium.
  • the operation part of such a device has many buttons and operation screens, so operation is complicated and takes time to master.
  • a voice input interface allows users to perform desired operations simply by uttering voice commands. Therefore, attempts have been made to improve operability using a voice input interface.
  • the voice commands used to operate the device can be assumed depending on the type of device that uses the voice command, the site where the device is installed, and the operation details of the device. Therefore, expected voice commands can be created in grammar (syntax and words). For example, see Patent Document 1.
  • Evaluation data is used to evaluate whether the accuracy of the created grammar is high.
  • the creator of the speech recognition system checks the accuracy of speech recognition when using the created grammar, and edits the grammar. Speech recognition grammars are often created manually.
  • a grammar adjustment device that is an aspect of the present disclosure includes a grammar storage unit that stores a grammar of a voice command for operating an industrial device, a grammar extraction unit that extracts a part of the grammar, a target registration unit that receives registration of a target for the speech recognition evaluation value of the extracted grammar, a speech recognition unit that performs speech recognition of evaluation speech data using the extracted grammar, and an evaluation value calculation unit that calculates the speech recognition evaluation value of the extracted grammar based on the results of speech recognition using the extracted grammar and correct data of the evaluation speech data.
  • the grammar adjustment device further includes a grammar selection unit that selects a grammar that satisfies the target from among the one or more extracted grammars extracted by the grammar extraction unit.
  • a storage medium, which is one aspect of the present disclosure, stores processor-readable instructions that, when executed by one or more processors, store a grammar of a voice command for operating an industrial device, extract a part of the grammar, receive registration of a target for the speech recognition evaluation value of the extracted grammar, perform speech recognition of evaluation speech data using the extracted grammar, calculate the speech recognition evaluation value of the extracted grammar based on the result of the speech recognition and correct data of the evaluation speech data, and select, from among one or more extracted grammars, a grammar that satisfies the target.
  • grammar creation for speech recognition can be supported.
  • FIG. 1 is a block diagram showing the configuration of a grammar adjustment device.
  • FIG. 2 is a diagram showing examples of syntax definitions and word definitions.
  • FIG. 3 is a diagram showing example combinations of speakers and recording locations for the evaluation data.
  • FIG. 4 is a diagram showing an example of calculated evaluation values.
  • FIG. 5 is a diagram showing an example of a target registration screen.
  • FIG. 6 is a diagram showing example accuracy rates of different grammars.
  • FIG. 7 is a flowchart explaining the processing of the grammar adjustment device.
  • FIG. 8 is a diagram showing the hardware configuration of the grammar adjustment device.
  • the grammar adjustment device 100 will be described below.
  • the grammar adjustment device 100 is implemented in an information processing device having an arithmetic unit and a storage unit. Examples of such information processing devices include PCs (personal computers) and mobile terminals, but are not limited to these.
  • Fig. 1 shows the basic configuration of the grammar adjustment device 100.
  • the grammar adjustment device 100 comprises an evaluation data storage unit 11, a target registration unit 12, a basic grammar storage unit 13, a grammar extraction unit 14, a speech recognition unit 15, an extracted grammar storage unit 16, an evaluation value calculation unit 17, and a grammar selection unit 18.
  • the basic grammar storage unit 13 stores grammars of voice commands that serve as bases.
  • a voice command is a command for operating equipment in the industrial field by voice.
  • the grammar of voice commands consists of syntax and words.
  • the basic grammar storage unit 13 includes a syntax storage unit 19 that stores syntax and a word storage unit 20 that stores words. Words include words that make up voice commands and phoneme representations of words. Syntax defines the arrangement of words that make up a voice command.
  • the base grammar is created exhaustively to cover as many voice commands as possible that are expected to be used in the field. For example, the syntax of a voice command for setting the "override" of a numerical controller to "30" is assumed to include variants such as "override 30" and "set override to 30".
  • a grammar author constructs as many grammars as possible.
  • the basic grammar is determined by the type of device that recognizes voice commands, specifications, work content, and so on.
  • a plurality of phoneme arrays may be assigned to one word.
  • the word “override” can be represented by multiple phonemes such as "o:ba:raido", “oubaaraido” and "oubaraido”.
  • the base grammar is constructed to cover as many phonemes of such words as possible.
  • FIG. 2 shows an example syntax definition and an example word definition.
  • An example syntax definition defines the words that make up a voice command and the order of the words.
  • “S” is the start symbol of the voice command
  • "NS_B” and “NS_E” are silent sections at the beginning and end of the sentence.
  • the second and third lines define "tags" that go into "COMMAND”.
  • the second line defines that the syntax element "COMMAND” includes tags "ROBOT” and "INTERFACE”
  • the third line defines that the syntax element "COMMAND” includes tags "NAIGAI” and "INTERFACE”.
  • the first and second lines of the word definition define the Japanese notation and phoneme notation of the tag "ROBOT".
  • the Japanese notation of the tag "ROBOT” is "robot” and the phoneme notation is "roboqto”.
  • the 3rd to 5th lines of the word definition define the Japanese notation and the phoneme notation of the Japanese included in the tag "NAIGAI”.
  • the tag "NAIGAI” contains two Japanese words, "external” and "internal.”
  • the phoneme notation of "external" is "gaibu" and that of "internal" is "naibu".
  • the 6th to 8th lines of the word definition define the Japanese notation and the phoneme notation of the Japanese included in the tag "INTERFACE".
  • the tag "INTERFACE” contains one Japanese word "interface”.
  • the “interface” has two types of phoneme notation “iNtafe:su” and “iNta:feisu”. "%NS_B” defines a silence section [s] at the beginning of a sentence, and “%NS_E” defines a silence section [/s] at the end of a sentence.
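  • To make the FIG. 2 walkthrough concrete, the two definitions could be written out roughly as below. This is a hedged reconstruction in the BNF-style grammar/vocabulary file format used by grammar-based recognizers such as Julius; the exact file layout and the silence-word entries (silB, silE) are assumptions, not a verbatim copy of FIG. 2.

        # Syntax definition: word categories and their order in a voice command
        S       : NS_B COMMAND NS_E
        COMMAND : ROBOT INTERFACE
        COMMAND : NAIGAI INTERFACE

        # Word definition: notation followed by phoneme notation
        % ROBOT
        robot       roboqto
        % NAIGAI
        external    gaibu
        internal    naibu
        % INTERFACE
        interface   iNtafe:su
        interface   iNta:feisu
        % NS_B
        [s]         silB
        % NS_E
        [/s]        silE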
  • the grammar extraction unit 14 extracts a part of the grammar from the exhaustive grammar stored in the basic grammar storage unit 13.
  • for grammar extraction, cluster division by the k-means method is used, for example.
  • a method other than the k-means method may be used for grammar extraction.
  • the clustering of the k-means method uses the acoustic distance of the grammar. Acoustic distance can be obtained from an acoustic spectrum, from a phoneme string, or the like. According to the method of calculating the acoustic distance from the acoustic spectrum, the acoustic spectrum of the voice command is vectorized and the cosine distance or Euclidean distance between the vectors is calculated.
  • Cosine distance, Levenshtein distance, Jaro-Winkler distance, and Hamming distance are used in methods that calculate acoustic distance from a phoneme string. Cosine distance, Euclidean distance, Levenshtein distance, Jaro-Winkler distance, and Hamming distance are well known.
  • for example, the acoustic distance between phoneme notations of words included in voice commands (e.g., "iNtafe:su", "iNtafeisu") can be calculated, and a part of the grammar can be extracted from a cluster of words whose acoustic distances are close.
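  • As an illustration of the phoneme-string approach, the following is a minimal sketch in plain Python (the patent does not specify an implementation, and treating each character of the phoneme notation as one symbol is a simplifying assumption):

        # Levenshtein (edit) distance between two phoneme strings.
        def levenshtein(a: str, b: str) -> int:
            prev = list(range(len(b) + 1))
            for i, ca in enumerate(a, start=1):
                cur = [i]
                for j, cb in enumerate(b, start=1):
                    cur.append(min(prev[j] + 1,                 # deletion
                                   cur[j - 1] + 1,              # insertion
                                   prev[j - 1] + (ca != cb)))   # substitution
                prev = cur
            return prev[-1]

        # Two phoneme notations of "interface" from the word definition example:
        # a small distance means the notations are acoustically close.
        print(levenshtein("iNtafe:su", "iNta:feisu"))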
  • the k-means method sets the centers of K clusters using random numbers, then (a) assigns each voice command (or word) to the nearest center and (b) recalculates the center of each cluster. Steps (a) and (b) are repeated until no cluster center changes, dividing the voice commands into clusters.
  • the grammar extraction unit 14 extracts the grammars (syntax and words) of voice commands included in the same cluster, and outputs them to the speech recognition unit 15 as grammars for evaluation.
  • the k-means method is an example of a method for extracting grammars with close distances, and methods other than the k-means method may be used.
  • the results of the k-means method are affected by the random number used for the initial values and by the number of clusters K. The random number and the number of clusters K may be set manually by the user, or may be set automatically by the grammar extraction unit 14.
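  • A minimal sketch of the clustering step, under stated assumptions (voice commands are represented here as small numeric feature vectors, e.g. a vectorized acoustic spectrum; the feature representation and the Euclidean metric are illustrative choices, not prescribed by the text):

        import random

        def kmeans(points, k, seed=0, max_iter=100):
            """(a) assign each point to the nearest center, (b) recompute each
            center; repeat until no center changes. `seed` plays the role of
            the random number for the initial values mentioned in the text."""
            rng = random.Random(seed)
            centers = rng.sample(points, k)
            for _ in range(max_iter):
                clusters = [[] for _ in range(k)]
                for p in points:                   # (a) nearest-center assignment
                    i = min(range(k),
                            key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centers[c])))
                    clusters[i].append(p)
                new_centers = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centers[i]
                               for i, cl in enumerate(clusters)]
                if new_centers == centers:         # (b) stop when all centers are stable
                    return clusters
                centers = new_centers
            return clusters

        # Hypothetical 2-D features for four voice commands, split into K=2 clusters.
        print(kmeans([(0.10, 0.20), (0.12, 0.22), (0.90, 0.80), (0.88, 0.79)], k=2))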
  • the evaluation data storage unit 11 associates and stores voice data including voice commands recorded by a plurality of speakers at a plurality of recording locations with correct data, which is a correct text for the voice data. For example, voice data of utterances of "external interface" by a plurality of speakers at a plurality of recording locations and correct data (text) of "external interface” are stored in association with each other.
  • the speech data in the evaluation data storage unit 11 are recorded at different locations by speakers with different attributes (gender, age). Since the evaluation data is recorded at the sites where the voice commands are used, it includes the noise of those sites.
  • FIG. 3 is a table showing the relationship between the speaker of the evaluation data and the recording location.
  • the evaluation data in FIG. 3 includes voices recorded by speaker A (male, 60 years old) at factories A and B, voices recorded by speaker B (female, 30 years old) at factories C and D, and the like.
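  • To show how such records might be organized, a small sketch follows (the field names and values are hypothetical; the actual table of FIG. 3 is not reproduced here):

        # Each evaluation record pairs recorded audio with its correct text,
        # plus speaker attributes and the recording location.
        evaluation_data = [
            {"audio": "cmd_0001.wav", "correct": "external interface",
             "speaker": "A (male, 60)", "location": "factory A"},
            {"audio": "cmd_0002.wav", "correct": "external interface",
             "speaker": "B (female, 30)", "location": "factory C"},
        ]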
  • the voice recognition unit 15 receives a voice command from the evaluation data storage unit 11 and performs voice recognition of the input voice command.
  • the speech recognition unit 15 is generally composed of an acoustic model, a language model, and a decoder.
  • the acoustic model receives speech data and outputs phonemes (senones) that form the speech data based on the feature amount of the speech data.
  • the language model outputs the probability of occurrence of word strings.
  • the language model selects hypothetical word strings based on phonemes and outputs linguistically plausible candidates.
  • the decoder outputs a word string with a high probability as the recognition result, based on the outputs of the statistically created acoustic model and language model.
  • the extracted grammar storage unit 16 includes a syntax storage unit 21 and a word storage unit 22, and stores the grammars extracted by the grammar extraction unit 14.
  • the speech recognition unit 15 performs speech recognition using the grammars stored in the extracted grammar storage unit 16.
  • the evaluation value calculation unit 17 compares the correct text from the evaluation data storage unit with the recognition result of the speech recognition unit 15, and calculates the accuracy rate of speech recognition.
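  • A minimal sketch of this comparison (exact-match scoring is an assumption; other evaluation values, e.g. word error rate, would also fit the description):

        def accuracy_rate(recognized: list[str], correct: list[str]) -> float:
            """Fraction of utterances whose recognition result exactly
            matches the correct text from the evaluation data."""
            assert len(recognized) == len(correct)
            return sum(r == c for r, c in zip(recognized, correct)) / len(correct)

        recognized = ["external interface", "internal interface", "external interface"]
        reference  = ["external interface", "internal interface", "internal interface"]
        print(f"{accuracy_rate(recognized, reference):.0%}")  # prints 67%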
  • FIG. 4 is an example of the accuracy rate as an evaluation value.
  • Types of voice commands include, for example, approval commands, numerical commands, and transition commands.
  • An approval command is a command indicating approval. Assume that the approval commands include "yes", "no", "execute", "abort", and the like.
  • Numerical commands are commands for designating numerical values such as "0.5", "1", "2", and "100".
  • a “transition command” is a command for designating a display screen such as a "home screen” or a "speed setting screen”.
  • a "machine operation command” such as "set a workpiece” may be considered.
  • the target registration unit 12 accepts registration of target values for speech recognition.
  • the target registration unit 12 receives target values such as a target accuracy rate for all voice commands, a target accuracy rate for each type of voice command, and a target search time.
  • FIG. 5 is an example of a target registration screen.
  • the target accuracy rate for each type of voice command is set as "approval command: 95% or more”, “numerical command: 90% or more”, “transition command: 80% or more”, and maximum execution time "within 30 minutes”.
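  • In code, the registered targets might be held in a structure like the following (the dictionary layout and key names are assumptions; the values are those from the screen described above):

        # Target accuracy rates per voice-command type, plus the maximum
        # execution time, as registered on the FIG. 5 screen.
        targets = {
            "approval_command":   0.95,   # approval commands: 95% or more
            "numerical_command":  0.90,   # numerical commands: 90% or more
            "transition_command": 0.80,   # transition commands: 80% or more
        }
        max_execution_minutes = 30        # maximum execution time: within 30 minutes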
  • the grammar selection unit 18 compares the speech recognition results with the target accuracy rates and, if there is a grammar determined to satisfy the target accuracy rates, selects that grammar as an appropriate grammar. The processing of the grammar selection unit 18 is repeated until the target time for grammar adjustment elapses or the target accuracy rates are met. When the target time for grammar adjustment has passed, an appropriate grammar is selected from among the grammars for which speech recognition has been performed so far.
  • the grammar selection unit 18 may present the accuracy rate of each grammar to the creator of the grammar, and the creator of the grammar may select the grammar.
  • the accuracy rate of voice commands is calculated for each type of approval command, transition command, and numerical command.
  • the approval command is used for confirming the operation, so a high accuracy rate is required.
  • Numerical commands that specify numerical values also require a high accuracy rate.
  • a transition command that instructs a screen transition may have a lower accuracy rate than an approval command or a numerical command.
  • a target accuracy rate can be set for each voice command or for each type of voice command, and the grammar that achieves the target accuracy rates is selected automatically. For example, FIG. 6 shows the accuracy rates of grammar A and grammar B.
  • since the accuracy rates of the "approval command", "numerical command", and "transition command" of grammar A satisfy the target accuracy rates registered by the target registration unit 12, grammar A is selected as an appropriate grammar. Further conditions may be set for the case where a plurality of grammars satisfy the target accuracy rates.
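  • A sketch of that selection test (the per-type rates for grammars A and B below are hypothetical stand-ins, not the values of FIG. 6):

        def satisfies_targets(rates: dict[str, float], targets: dict[str, float]) -> bool:
            """A grammar is appropriate when every command type meets its target."""
            return all(rates.get(t, 0.0) >= goal for t, goal in targets.items())

        targets   = {"approval_command": 0.95, "numerical_command": 0.90, "transition_command": 0.80}
        grammar_a = {"approval_command": 0.97, "numerical_command": 0.92, "transition_command": 0.85}
        grammar_b = {"approval_command": 0.96, "numerical_command": 0.88, "transition_command": 0.90}
        selected = [n for n, r in {"A": grammar_a, "B": grammar_b}.items()
                    if satisfies_targets(r, targets)]
        print(selected)  # ['A'] (grammar B misses the numerical-command target)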
  • the grammar adjustment device 100 receives registration of a target accuracy rate for voice commands (step S1), registration of the maximum execution time for grammar adjustment (step S2), and registration of a cluster division criterion (step S3).
  • the grammar adjustment device 100 clusters the voice commands (or words included in the voice commands) stored in the basic grammar storage unit 13 (step S4), extracts one or more representative voice commands (or words included in the voice commands) from each cluster (step S5), and reconstructs the grammar using the extracted voice commands (or words included in the voice commands) (step S6).
  • the grammar adjustment device 100 performs speech recognition on the evaluation data using the grammar reconstructed in step S6 (step S7).
  • Grammar adjustment device 100 calculates an evaluation value for speech recognition (step S8).
  • Grammar adjustment device 100 compares the evaluation result with the target accuracy rate, and if the evaluation result satisfies the target accuracy rate (step S9; Yes), selects the grammar (step S10).
  • if the evaluation result does not satisfy the target accuracy rate (step S9; No), it is determined whether or not the maximum execution time has been reached (step S11).
  • when the maximum execution time has been reached (step S11; Yes), the grammar adjustment device 100 presents the grammars that have undergone speech recognition so far to the user and accepts the selection of a grammar (step S10).
  • if the maximum execution time has not been reached (step S11; No), the process returns to step S4, and the processes from step S4 to step S9 are repeated.
  • in this example, the adjustment of the grammar ends when the target accuracy rate is satisfied, but the adjustment may instead be continued until the maximum execution time is reached, without ending at that point.
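  • Steps S1 to S11 can be summarized as the following loop (a sketch; the callables are hypothetical stand-ins for the grammar extraction, speech recognition, and evaluation units, since the patent does not prescribe an implementation):

        import time
        from typing import Callable

        def adjust_grammar(cluster_and_rebuild: Callable[[], object],
                           evaluate: Callable[[object], dict],
                           satisfies: Callable[[dict], bool],
                           max_minutes: float):
            deadline = time.monotonic() + max_minutes * 60
            tried = []
            while time.monotonic() < deadline:       # S11: respect the maximum execution time
                grammar = cluster_and_rebuild()      # S4-S6: cluster, extract, reconstruct
                rates = evaluate(grammar)            # S7-S8: recognize evaluation data, score it
                tried.append((grammar, rates))
                if satisfies(rates):                 # S9: compare with the target accuracy rates
                    return grammar                   # S10: select this grammar automatically
            return tried                             # timeout: present tried grammars for manual selection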
  • the grammar adjustment device 100 of the present disclosure is a device that supports the creation of voice command grammar, extracts a part of the comprehensively created grammar, reconstructs the grammar, and selects a grammar with a high accuracy rate.
  • the grammar accuracy rate is calculated for each type of voice command, so it is possible to adjust the grammar so that the accuracy is suitable for the situation where voice recognition is used.
  • Grammar evaluation data is recorded at the site where voice commands are used, so it is possible to construct a grammar suitable for recognizing voice data containing noise peculiar to the site or time period.
  • recognition candidates are selected from the words and syntax registered in the grammar even if noise is included, so the accuracy rate is improved.
  • the grammar adjustment device 100 of the present disclosure automatically adjusts grammar, it can optimize the grammar based on objective criteria without depending on the subjectivity or know-how of the grammar creator. In addition, since the grammar is automatically adjusted, even an inexperienced technician can adjust the grammar.
  • the hardware configuration of the grammar adjustment device 100 will be described with reference to FIG. 8.
  • the CPU 111 included in the grammar adjustment device 100 is a processor that controls the grammar adjustment device 100 as a whole.
  • the CPU 111 reads the system program stored in the ROM 112 via the bus and controls the entire grammar adjustment device 100 according to the system program.
  • the RAM 113 temporarily stores calculation data, display data, various data input by the user via the input unit 71, and the like.
  • the display unit 70 is a monitor or the like attached to the grammar adjustment device 100.
  • the display unit 70 displays an operation screen, a setting screen, and the like of the grammar adjustment device 100 .
  • the input unit 71 is integrated with the display unit 70 or is a keyboard, touch panel, operation button, etc. separate from the display unit 70 .
  • the user operates the input unit 71 to perform input to the screen displayed on the display unit 70 .
  • the display unit 70 and the input unit 71 may be mobile terminals.
  • the non-volatile memory 114 is, for example, a memory that is backed up by a battery (not shown) so that the stored state is retained even when the power of the grammar adjustment apparatus 100 is turned off.
  • the non-volatile memory 114 stores machining programs, system programs, available options, billing tables, and the like.
  • the nonvolatile memory 114 stores a program read from an external device via an interface (not shown), a program input via the input unit 71, and various data obtained from each part of the grammar adjustment apparatus 100, a machine tool, etc. (for example, setting parameters obtained from the machine tool, etc.). Programs and various data stored in the non-volatile memory 114 may be developed in the RAM 113 at the time of execution/use.
  • Various system programs are pre-written in the ROM 112.
  • 100 grammar adjustment device; 11 evaluation data storage unit; 12 target registration unit; 13 basic grammar storage unit; 14 grammar extraction unit; 15 speech recognition unit; 16 extracted grammar storage unit; 17 evaluation value calculation unit; 18 grammar selection unit; 19 syntax storage unit; 20 word storage unit; 21 syntax storage unit; 22 word storage unit; 70 display unit; 71 input unit; 111 CPU; 112 ROM; 113 RAM; 114 non-volatile memory

Landscapes

  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)
PCT/JP2022/002282 2022-01-21 2022-01-21 Grammar adjustment device and computer-readable storage medium WO2023139769A1 (ja)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/JP2022/002282 WO2023139769A1 (ja) 2022-01-21 2022-01-21 Grammar adjustment device and computer-readable storage medium
JP2023575014A JPWO2023139769A1 2022-01-21 2022-01-21

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/002282 WO2023139769A1 (ja) 2022-01-21 2022-01-21 Grammar adjustment device and computer-readable storage medium

Publications (1)

Publication Number Publication Date
WO2023139769A1 (ja) 2023-07-27

Family

ID=87348531

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2022/002282 WO2023139769A1 (ja) 2022-01-21 2022-01-21 Grammar adjustment device and computer-readable storage medium

Country Status (2)

Country Link
JP (1) JPWO2023139769A1
WO (1) WO2023139769A1

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS602998A (ja) * 1983-06-20 1985-01-09 Fujitsu Ltd Speech recognition device
JPH0250197A (ja) * 1988-05-06 1990-02-20 Ricoh Co Ltd Dictionary pattern creation device
JP2009217006A (ja) * 2008-03-11 2009-09-24 Nippon Hoso Kyokai (NHK) Dictionary correction device, system, and computer program
JP2009229529A (ja) * 2008-03-19 2009-10-08 Toshiba Corp Speech recognition device and speech recognition method
JP2014191246A (ja) * 2013-03-28 2014-10-06 Nec Corp Recognition processing control device, recognition processing control method, and recognition processing control program

Also Published As

Publication number Publication date
JPWO2023139769A1 2023-07-27


Legal Events

Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 22921926; Country of ref document: EP; Kind code of ref document: A1)
WWE WIPO information: entry into national phase (Ref document number: 2023575014; Country of ref document: JP)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 22921926; Country of ref document: EP; Kind code of ref document: A1)