WO2023139769A1 - Grammar adjustment device and computer-readable storage medium - Google Patents

Grammar adjustment device and computer-readable storage medium Download PDF

Info

Publication number
WO2023139769A1
WO2023139769A1 (Application PCT/JP2022/002282)
Authority
WO
WIPO (PCT)
Prior art keywords
grammar
extracted
speech recognition
unit
grammars
Prior art date
Application number
PCT/JP2022/002282
Other languages
French (fr)
Japanese (ja)
Inventor
泰弘 芝崎 (Yasuhiro Shibasaki)
Original Assignee
ファナック株式会社 (FANUC Corporation)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ファナック株式会社 (FANUC Corporation)
Priority to PCT/JP2022/002282
Publication of WO2023139769A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice

Definitions

  • the present invention relates to a speech recognition grammar adjustment device and a computer-readable storage medium.
  • device operation units have many buttons and operation screens, and their operation can be complicated and take time to master.
  • a voice input interface allows users to perform desired operations simply by uttering voice commands. Therefore, attempts have been made to improve operability using a voice input interface.
  • the voice commands used to operate a device can be anticipated from the type of device, the site where it is installed, and the operations it performs. Therefore, the expected voice commands can be described as a grammar (syntax and words). See, for example, Patent Document 1.
  • evaluation data is used to evaluate whether the created grammar is accurate.
  • the creator of the speech recognition system checks the accuracy of speech recognition when using the created grammar, and edits the grammar. Speech recognition grammars are often created manually.
  • a grammar adjustment device that is one aspect of the present disclosure includes: a grammar storage unit that stores a grammar of voice commands for operating an industrial device; a grammar extraction unit that extracts a part of the grammar; a target registration unit that receives registration of a target for the speech recognition evaluation value of the extracted grammar; a speech recognition unit that performs speech recognition of evaluation speech data using the extracted grammar; and an evaluation value calculation unit that calculates the speech recognition evaluation value of the extracted grammar based on the results of speech recognition using the extracted grammar and the correct data of the evaluation speech data.
  • the device further includes a grammar selection unit that selects a grammar that satisfies the target from among the one or more extracted grammars extracted by the grammar extraction unit.
  • a storage medium that is one aspect of the present disclosure stores processor-readable instructions that, when executed by one or more processors, store a grammar of voice commands for operating an industrial device, extract a part of the grammar, receive registration of a target for the speech recognition evaluation value of the extracted grammar, perform speech recognition of evaluation speech data using the extracted grammar, calculate the speech recognition evaluation value of the extracted grammar based on the results of speech recognition using the extracted grammar and the correct data of the evaluation speech data, and select, from among the one or more extracted grammars, a grammar that satisfies the target.
  • according to one aspect of the present invention, grammar creation for speech recognition can be supported.
  • FIG. 1 is a block diagram showing the configuration of the grammar adjustment device.
  • FIG. 2 is a diagram showing examples of a syntax definition and word definitions.
  • FIG. 3 is a diagram showing example combinations of speakers and recording locations for the evaluation data.
  • FIG. 4 is a diagram showing an example of calculated evaluation values.
  • FIG. 5 is a diagram showing an example of the target registration screen.
  • FIG. 6 is a diagram showing example accuracy rates of different grammars.
  • FIG. 7 is a flowchart explaining the processing of the grammar adjustment device.
  • FIG. 8 shows the hardware configuration of the grammar adjustment device.
  • the grammar adjustment device 100 will be described below.
  • the grammar adjustment device 100 is implemented in an information processing device having an arithmetic unit and a storage unit. Examples of such information processing devices include PCs (personal computers) and mobile terminals, but are not limited to these.
  • Fig. 1 shows the basic configuration of the grammar adjustment device 100.
  • the grammar adjustment device 100 comprises an evaluation data storage unit 11 , a target registration unit 12 , a basic grammar storage unit 13 , a grammar extraction unit 14 , a speech recognition unit 15 , an extracted grammar storage unit 16 , an evaluation value calculation unit 17 and a grammar selection unit 18 .
  • the basic grammar storage unit 13 stores grammars of voice commands that serve as bases.
  • a voice command is a command for operating equipment in the industrial field by voice.
  • the grammar of voice commands consists of syntax and words.
  • the basic grammar storage unit 13 includes a syntax storage unit 19 that stores syntax and a word storage unit 20 that stores words. Words include words that make up voice commands and phoneme representations of words. Syntax defines the arrangement of words that make up a voice command.
  • the base grammar is created exhaustively to cover as many of the voice commands expected in the field as possible. For example, for a voice command that sets the "override" of a numerical controller to "30", syntax such as "override 30", "set override to 30", and "make the override 30" is assumed.
  • a grammar author constructs as many grammars as possible.
  • the basic grammar is determined by the type of device that recognizes voice commands, specifications, work content, and so on.
  • a plurality of phoneme arrays may be assigned to one word.
  • the word "override" can be represented by multiple phoneme sequences, such as "o:ba:raido", "oubaaraido", and "oubaraido".
  • the base grammar is constructed to cover as many phonemes of such words as possible.
  • FIG. 2 shows an example syntax definition and an example word definition.
  • An example syntax definition defines the words that make up a voice command and the order of the words.
  • “S” is the start symbol of the voice command
  • "NS_B” and “NS_E” are silent sections at the beginning and end of the sentence.
  • the second and third lines define "tags" that go into "COMMAND”.
  • the second line defines that the syntax element "COMMAND” includes tags "ROBOT” and "INTERFACE”
  • the third line defines that the syntax element "COMMAND” includes tags "NAIGAI” and "INTERFACE”.
  • the first and second lines of the word definition define the Japanese notation and phoneme notation of the tag "ROBOT".
  • the Japanese notation of the tag "ROBOT” is "robot” and the phoneme notation is "roboqto”.
  • the 3rd to 5th lines of the word definition define the Japanese notation and the phoneme notation of the Japanese included in the tag "NAIGAI”.
  • the tag "NAIGAI” contains two Japanese words, "external” and "internal.”
  • the "outside” phoneme is "gaibu” and the "inside” phoneme is "naibu”.
  • the 6th to 8th lines of the word definition define the Japanese notation and the phoneme notation of the Japanese included in the tag "INTERFACE".
  • the tag "INTERFACE” contains one Japanese word "interface”.
  • the “interface” has two types of phoneme notation “iNtafe:su” and “iNta:feisu”. "%NS_B” defines a silence section [s] at the beginning of a sentence, and “%NS_E” defines a silence section [/s] at the end of a sentence.
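  • as a concrete illustration, the definitions of FIG. 2 could be written as the following Julius-style grammar/voca pair. This is a minimal sketch for illustration only: the disclosure does not name a recognition engine, so the file format and the silence-model names "silB"/"silE" are assumptions.

      # syntax definition (cf. FIG. 2): word order of a voice command
      S        : NS_B COMMAND NS_E
      COMMAND  : ROBOT INTERFACE
      COMMAND  : NAIGAI INTERFACE

      # word definitions: Japanese notation and phoneme notation for each tag
      % ROBOT
      ロボット            roboqto
      % NAIGAI
      外部                gaibu
      内部                naibu
      % INTERFACE
      インターフェース    iNtafe:su
      インターフェース    iNta:feisu
      % NS_B
      [s]                 silB
      % NS_E
      [/s]                silE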
  • the grammar extraction unit 14 extracts a part of the grammar from the exhaustive grammar stored in the basic grammar storage unit 13 .
  • as one example of the extraction method, cluster division by the k-means method is used.
  • a method other than the k-means method may be used for grammar extraction.
  • the k-means clustering uses the acoustic distance between grammars. The acoustic distance can be obtained from an acoustic spectrum, from a phoneme string, or the like. In the method that calculates the acoustic distance from the acoustic spectrum, the acoustic spectrum of the voice command is vectorized and the cosine distance or Euclidean distance between the vectors is calculated.
  • in the method that calculates the acoustic distance from a phoneme string, the cosine distance, Levenshtein distance, Jaro-Winkler distance, or Hamming distance is used. The cosine, Euclidean, Levenshtein, Jaro-Winkler, and Hamming distances are well known.
  • in addition to computing the distance between whole voice commands, the acoustic distance between the phoneme strings of the words included in the voice commands (e.g., "iNtafe:su", "iNtafeisu") can be calculated, and a subset can be extracted from a cluster of words with close acoustic distances.
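  • as a concrete illustration of a phoneme-string distance, the following minimal Python sketch (the function name is illustrative, not from the disclosure) computes the Levenshtein distance between the two phoneme notations of "interface":

      def levenshtein(a: str, b: str) -> int:
          """Edit distance between two phoneme strings."""
          prev = list(range(len(b) + 1))  # distances for the empty prefix of a
          for i, ca in enumerate(a, 1):
              curr = [i]
              for j, cb in enumerate(b, 1):
                  curr.append(min(prev[j] + 1,                 # deletion
                                  curr[j - 1] + 1,             # insertion
                                  prev[j - 1] + (ca != cb)))   # substitution
              prev = curr
          return prev[-1]

      print(levenshtein("iNtafe:su", "iNta:feisu"))  # -> 2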
  • the k-means method sets the centers of K clusters using random numbers, then (a) assigns each voice command (or word) to its nearest center, and (b) recomputes the center of each cluster. Steps (a) and (b) are repeated until no cluster center changes, dividing the voice commands into clusters.
  • the grammar extraction unit 14 extracts the grammar (syntax and words) of the voice commands belonging to the same cluster, and outputs it to the speech recognition unit as a grammar for evaluation.
  • the k-means method is an example of a method for extracting grammars with close distances, and methods other than the k-means method may be used.
  • the results of the k-means method are affected by the initial random seed and the number of clusters K. The random seed and the number of clusters K may be set manually by the user, or may be set automatically by the grammar extraction unit 14.
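  • the following Python sketch illustrates the extraction step under simplifying assumptions (all names are illustrative): each phoneme string is vectorized by character-bigram counts, which is one way to realize the vectorization described above; plain k-means is run on the vectors; and the member nearest each final center is taken as the cluster representative.

      import random
      from collections import Counter

      def vectorize(phonemes: str, dims: list[str]) -> list[float]:
          """Character-bigram count vector for a phoneme string."""
          counts = Counter(phonemes[i:i + 2] for i in range(len(phonemes) - 1))
          return [float(counts[d]) for d in dims]

      def kmeans(vecs, k, iters=50, seed=0):
          rng = random.Random(seed)        # results depend on this seed and on K
          centers = rng.sample(vecs, k)
          for _ in range(iters):
              # (a) assign each vector to its nearest center
              clusters = [[] for _ in range(k)]
              for v in vecs:
                  i = min(range(k),
                          key=lambda c: sum((x - y) ** 2 for x, y in zip(v, centers[c])))
                  clusters[i].append(v)
              # (b) recompute each center as the mean of its members
              new_centers = [[sum(col) / len(cl) for col in zip(*cl)] if cl else centers[i]
                             for i, cl in enumerate(clusters)]
              if new_centers == centers:   # stop when no center changes
                  break
              centers = new_centers
          return centers, clusters

      words = ["iNtafe:su", "iNta:feisu", "gaibu", "naibu", "roboqto"]
      dims = sorted({w[i:i + 2] for w in words for i in range(len(w) - 1)})
      vecs = [vectorize(w, dims) for w in words]
      centers, clusters = kmeans(vecs, k=2)
      for center, cluster in zip(centers, clusters):
          if cluster:
              rep = min(cluster, key=lambda v: sum((x - y) ** 2 for x, y in zip(v, center)))
              print(words[vecs.index(rep)])  # one representative word per cluster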
  • the evaluation data storage unit 11 associates and stores voice data including voice commands recorded by a plurality of speakers at a plurality of recording locations with correct data, which is a correct text for the voice data. For example, voice data of utterances of "external interface" by a plurality of speakers at a plurality of recording locations and correct data (text) of "external interface” are stored in association with each other.
  • the voice data in the evaluation data storage unit 11 is recorded at different recording locations by speakers with different attributes (gender, age). Because the evaluation data is recorded at the sites where the voice commands are used, it contains the noise of those sites.
  • FIG. 3 is a table showing the relationship between the speaker of the evaluation data and the recording location.
  • the evaluation data in FIG. 3 includes voices recorded by speaker A (male, 60 years old) at factories A and B, voices recorded by speaker B (female, 30 years old) at factories C and D, and the like.
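  • a minimal sketch of how such evaluation records might be represented; the field names are illustrative, not from the disclosure:

      from dataclasses import dataclass

      @dataclass
      class EvaluationRecord:
          wav_path: str      # recorded utterance containing a voice command
          correct_text: str  # correct data: the transcript of the utterance
          speaker: str       # speaker attributes (gender, age)
          location: str      # recording site, so site-specific noise is covered

      records = [
          EvaluationRecord("a_factoryA_001.wav", "external interface", "A (male, 60)", "factory A"),
          EvaluationRecord("b_factoryC_001.wav", "external interface", "B (female, 30)", "factory C"),
      ]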
  • the speech recognition unit 15 receives voice data containing a voice command from the evaluation data storage unit 11 and performs speech recognition on it.
  • the speech recognition unit 15 is generally composed of an acoustic model, a language model, and a decoder.
  • the acoustic model receives speech data and outputs phonemes (senones) that form the speech data based on the feature amount of the speech data.
  • the language model outputs the probability of occurrence of word strings.
  • the language model selects hypothetical word strings based on phonemes and outputs linguistically plausible candidates.
  • the decoder outputs a high-probability word string as the recognition result based on the outputs of the statistically created acoustic model and language model.
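  • the statistical decoder itself is beyond a short example, but the effect of constraining recognition to the grammar can be illustrated with a toy Python sketch: the candidate sentences and their phoneme strings are enumerated from the extracted grammar, and the candidate closest to the acoustic model's phoneme output is returned. The candidate values below are illustrative, not from the disclosure.

      import difflib

      # candidate sentences from the extracted grammar, with assumed phoneme strings
      candidates = {
          "external interface": "gaibuiNtafe:su",
          "internal interface": "naibuiNtafe:su",
      }

      def recognize(observed_phonemes: str) -> str:
          """Return the grammar sentence whose phonemes best match the observation."""
          return max(candidates,
                     key=lambda s: difflib.SequenceMatcher(
                         None, candidates[s], observed_phonemes).ratio())

      print(recognize("gaibuiNtafeisu"))  # -> external interface, despite noisy phonemes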
  • the extracted grammar storage unit 16 includes a syntax storage unit 21 and a word storage unit 22, and stores the grammars extracted by the grammar extraction unit 14.
  • the speech recognition unit 15 performs speech recognition using the grammar stored in the extracted grammar storage unit 16.
  • the evaluation value calculation unit 17 compares the correct text from the evaluation data storage unit 11 with the recognition results of the speech recognition unit 15, and calculates the accuracy rate of speech recognition.
  • FIG. 4 is an example of the accuracy rate as an evaluation value.
  • in the present disclosure, in addition to evaluating voice commands as a whole, an accuracy rate is calculated for each type of voice command. Types of voice commands include, for example, approval commands, numerical commands, and transition commands.
  • an approval command is a command indicating approval, such as "hai" (yes), "iie" (no), "yes", "no", "execute", and "abort".
  • Numerical commands are commands for designating numerical values such as "0.5", "1", "2", and "100".
  • a “transition command” is a command for designating a display screen such as a "home screen” or a "speed setting screen”.
  • a "machine operation command" that instructs a machine movement, such as "set a workpiece", is also conceivable.
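  • a minimal Python sketch of this per-type evaluation (the data and names are illustrative): each evaluation utterance carries its command type, and the accuracy rate is the fraction of utterances per type whose recognition result matches the correct data.

      from collections import defaultdict

      # (command_type, correct_text, recognized_text) triples; illustrative data
      results = [
          ("approval",   "yes",                  "yes"),
          ("approval",   "abort",                "abort"),
          ("numerical",  "override 30",          "override 13"),
          ("transition", "home screen",          "home screen"),
          ("transition", "speed setting screen", "speed setting"),
      ]

      def accuracy_by_type(results):
          totals, hits = defaultdict(int), defaultdict(int)
          for cmd_type, correct, recognized in results:
              totals[cmd_type] += 1
              hits[cmd_type] += (correct == recognized)
          return {t: hits[t] / totals[t] for t in totals}

      print(accuracy_by_type(results))
      # -> {'approval': 1.0, 'numerical': 0.0, 'transition': 0.5}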
  • the target registration unit 12 accepts registration of target values for speech recognition.
  • the target registration unit 12 receives target values such as a target accuracy rate for all voice commands, a target accuracy rate for each type of voice command, and a target search time.
  • FIG. 5 is an example of a target registration screen.
  • in FIG. 5, the target accuracy rates for each type of voice command are set as "approval command: 95% or more", "numerical command: 90% or more", and "transition command: 80% or more", with a maximum execution time of "within 30 minutes".
  • the grammar selection unit 18 compares the speech recognition results with the target accuracy rates and, if a grammar is determined to satisfy the targets, selects that grammar as an appropriate grammar. The processing of the grammar selection unit 18 is repeated until the target accuracy rates are met or the target time for grammar adjustment elapses. If the target time elapses, an appropriate grammar is selected from among the grammars evaluated so far.
  • the grammar selection unit 18 may present the accuracy rate of each grammar to the creator of the grammar, and the creator of the grammar may select the grammar.
  • the accuracy rate of voice commands is calculated for each type: approval commands, transition commands, and numerical commands.
  • the approval command is used for confirming the operation, so a high accuracy rate is required.
  • Numerical commands that specify numerical values also require a high accuracy rate.
  • a transition command that instructs a screen transition may have a lower accuracy rate than an approval command or a numerical command.
  • a target accuracy rate can be set for each voice command or for each type of voice command, and the grammar that achieves the target accuracy rates is selected automatically. For example, FIG. 6 shows the accuracy rates of grammar A and grammar B.
  • since the accuracy rates of grammar A for the "approval command", "numerical command", and "transition command" satisfy the target accuracy rates registered in the target registration unit 12, grammar A is selected as the appropriate grammar. Additional conditions may be set for the case where more than one grammar satisfies the target accuracy rates.
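  • the selection itself reduces to comparing per-type accuracy rates against the registered targets, as in this Python sketch; the target values follow FIG. 5, while the per-grammar accuracy figures are illustrative:

      targets = {"approval": 0.95, "numerical": 0.90, "transition": 0.80}

      grammar_scores = {
          "grammar A": {"approval": 0.97, "numerical": 0.93, "transition": 0.85},
          "grammar B": {"approval": 0.92, "numerical": 0.88, "transition": 0.90},
      }

      def meets_targets(scores, targets):
          return all(scores.get(t, 0.0) >= goal for t, goal in targets.items())

      suitable = [g for g, s in grammar_scores.items() if meets_targets(s, targets)]
      print(suitable)  # -> ['grammar A']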
  • as a preparation step, the grammar adjustment device 100 receives registration of the target accuracy rates for voice commands (step S1), registration of the maximum execution time for grammar adjustment (step S2), and registration of the cluster division criteria (step S3). When the k-means method is used, the initial random seed and the number of clusters K are registered as the cluster division criteria.
  • the grammar adjustment device 100 clusters the voice commands (or words included in the voice commands) stored in the basic grammar storage unit 13 (step S4), extracts one or more representative voice commands (or words included in the voice commands) from each cluster (step S5), and reconstructs the grammar using the extracted voice commands (or words included in the voice commands) (step S6).
  • the grammar adjustment device 100 performs speech recognition on the evaluation data using the grammar reconstructed in step S6 (step S7).
  • Grammar adjustment device 100 calculates an evaluation value for speech recognition (step S8).
  • Grammar adjustment device 100 compares the evaluation result with the target accuracy rate, and if the evaluation result satisfies the target accuracy rate (step S9; Yes), selects the grammar (step S10).
  • in step S9, if the evaluation result does not satisfy the target accuracy rates (step S9; No), it is determined whether the maximum execution time has been reached (step S11).
  • when the target time for grammar adjustment has been reached (step S11; Yes), the grammar adjustment device 100 presents the grammars evaluated so far to the user and accepts a selection of a grammar (step S10).
  • if the target time for grammar adjustment has not been reached (step S11; No), the process returns to step S4, and the processing from step S4 to step S9 is repeated.
  • in this flowchart, grammar adjustment ends when the target accuracy rates are satisfied, but the adjustment may instead be continued until the maximum execution time is reached.
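  • putting steps S1 to S11 together, the adjustment loop can be sketched in Python as follows. The clustering and evaluation stages are passed in as functions because they stand for the units described above; the demo pipeline at the bottom is a fake that merely "improves" with each seed.

      import time

      def adjust_grammar(cluster_and_rebuild, evaluate, targets, max_seconds):
          """FIG. 7 loop: cluster/extract/rebuild (S4-S6), recognize and evaluate
          (S7-S8), until the targets are met (S9) or time runs out (S11)."""
          deadline = time.monotonic() + max_seconds   # S2: maximum execution time
          tried = []
          seed = 0
          while time.monotonic() < deadline:          # S11
              grammar = cluster_and_rebuild(seed)     # S4-S6, e.g. k-means extraction
              scores = evaluate(grammar)              # S7-S8: per-type accuracy rates
              tried.append((grammar, scores))
              if all(scores.get(t, 0.0) >= g for t, g in targets.items()):  # S9
                  return grammar                      # S10: select this grammar
              seed += 1                               # retry with another clustering
          return tried                                # time up: present candidates (S10)

      result = adjust_grammar(
          cluster_and_rebuild=lambda seed: f"grammar-{seed}",
          evaluate=lambda g: {"approval": 0.90 + 0.02 * int(g.split("-")[1]),
                              "numerical": 0.95, "transition": 0.85},
          targets={"approval": 0.95, "numerical": 0.90, "transition": 0.80},
          max_seconds=1.0,
      )
      print(result)  # -> grammar-3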
  • the grammar adjustment device 100 of the present disclosure is a device that supports the creation of voice command grammar, extracts a part of the comprehensively created grammar, reconstructs the grammar, and selects a grammar with a high accuracy rate.
  • the grammar accuracy rate is calculated for each type of voice command, so it is possible to adjust the grammar so that the accuracy is suitable for the situation where voice recognition is used.
  • Grammar evaluation data is recorded at the site where voice commands are used, so it is possible to construct a grammar suitable for recognizing voice data containing noise peculiar to the site or time period.
  • in addition, because field-specific technical terms and expressions are registered in the grammar, recognition candidates are selected from the registered words and syntax even when noise is present, which improves the accuracy rate.
  • the grammar adjustment device 100 of the present disclosure automatically adjusts grammar, it can optimize the grammar based on objective criteria without depending on the subjectivity or know-how of the grammar creator. In addition, since the grammar is automatically adjusted, even an inexperienced technician can adjust the grammar.
  • the hardware configuration of the grammar adjustment device 100 will be described with reference to FIG.
  • the CPU 111 included in the grammar adjustment device 100 is a processor that controls the grammar adjustment device 100 as a whole.
  • the CPU 111 reads the system program stored in the ROM 112 via the bus and controls the entire grammar adjustment device 100 according to the system program.
  • the RAM 113 temporarily stores calculation data, display data, various data input by the user via the input unit 71, and the like.
  • the display unit 70 is a monitor or the like attached to the grammar adjustment device 100 .
  • the display unit 70 displays an operation screen, a setting screen, and the like of the grammar adjustment device 100 .
  • the input unit 71 is a keyboard, touch panel, operation buttons, or the like, either integrated with or separate from the display unit 70.
  • the user operates the input unit 71 to perform input to the screen displayed on the display unit 70 .
  • the display unit 70 and the input unit 71 may be mobile terminals.
  • the non-volatile memory 114 is, for example, a memory that is backed up by a battery (not shown) so that the stored state is retained even when the power of the grammar adjustment apparatus 100 is turned off.
  • the non-volatile memory 114 stores machining programs, system programs, available options, billing tables, and the like.
  • the nonvolatile memory 114 stores a program read from an external device via an interface (not shown), a program input via the input unit 71, and various data obtained from each part of the grammar adjustment apparatus 100, a machine tool, etc. (for example, setting parameters obtained from the machine tool, etc.). Programs and various data stored in the non-volatile memory 114 may be developed in the RAM 113 at the time of execution/use.
  • Various system programs are pre-written in the ROM 112 .
  • REFERENCE SIGNS LIST: 100 grammar adjustment device; 11 evaluation data storage unit; 12 target registration unit; 13 basic grammar storage unit; 14 grammar extraction unit; 15 speech recognition unit; 16 extracted grammar storage unit; 17 evaluation value calculation unit; 18 grammar selection unit; 19 syntax storage unit; 20 word storage unit; 21 syntax storage unit; 22 word storage unit; 70 display unit; 71 input unit; 111 CPU; 112 ROM; 113 RAM; 114 non-volatile memory

Landscapes

  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The present invention stores the grammar of voice commands that operate industrial machinery, uses one or more processors to extract a portion of the grammar, receives registration of a target for an evaluation value for voice recognition of the extracted grammar, uses the extracted grammar to perform voice recognition on voice data for evaluation, calculates an evaluation value for voice recognition of the extracted grammar on the basis of the results of the voice recognition that uses the extracted grammar and of correct answer data for the voice data for evaluation, and selects grammar that satisfies the target from the extracted grammar.

Description

Grammar adjustment device and computer-readable storage medium
The present invention relates to a speech recognition grammar adjustment device and a computer-readable storage medium.
Currently, in industrial fields such as manufacturing, various devices such as robots, conveyors, machine tools, and mechanical equipment are in operation. Many such devices have an operation unit, and the devices that control them, such as PLCs (Programmable Logic Controllers), NCs (Numerical Controllers), and control panels, often have operation units as well.
Device operation units have many buttons and operation screens, and their operation can be complicated and take time to master. A voice input interface allows users to perform desired operations simply by uttering voice commands. Therefore, attempts have been made to improve operability using voice input interfaces.
The voice commands used to operate a device can be anticipated from the type of device, the site where it is installed, and the operations it performs. Therefore, the expected voice commands can be described as a grammar (syntax and words). See, for example, Patent Document 1.
Patent Document 1: JP-A-9-325787
Whether the created grammar is accurate is evaluated using evaluation data. The creator of the speech recognition system checks the accuracy of speech recognition when the created grammar is used, and edits the grammar. Speech recognition grammars are often created manually.
In the industrial field, there is demand for technology that supports the creation of grammars for speech recognition.
A grammar adjustment device that is one aspect of the present disclosure includes: a grammar storage unit that stores a grammar of voice commands for operating an industrial device; a grammar extraction unit that extracts a part of the grammar; a target registration unit that receives registration of a target for the speech recognition evaluation value of the extracted grammar; a speech recognition unit that performs speech recognition of evaluation speech data using the extracted grammar; an evaluation value calculation unit that calculates the speech recognition evaluation value of the extracted grammar based on the results of speech recognition using the extracted grammar and the correct data of the evaluation speech data; and a grammar selection unit that selects a grammar that satisfies the target from among the one or more extracted grammars extracted by the grammar extraction unit.
A storage medium that is one aspect of the present disclosure stores processor-readable instructions that, when executed by one or more processors, store a grammar of voice commands for operating an industrial device, extract a part of the grammar, receive registration of a target for the speech recognition evaluation value of the extracted grammar, perform speech recognition of evaluation speech data using the extracted grammar, calculate the speech recognition evaluation value of the extracted grammar based on the results of speech recognition using the extracted grammar and the correct data of the evaluation speech data, and select, from among the one or more extracted grammars, a grammar that satisfies the target.
According to one aspect of the present invention, the creation of grammars for speech recognition can be supported.
FIG. 1 is a block diagram showing the configuration of the grammar adjustment device. FIG. 2 is a diagram showing examples of a syntax definition and word definitions. FIG. 3 is a diagram showing example combinations of speakers and recording locations for the evaluation data. FIG. 4 is a diagram showing an example of calculated evaluation values. FIG. 5 is a diagram showing an example of the target registration screen. FIG. 6 is a diagram showing example accuracy rates of different grammars. FIG. 7 is a flowchart explaining the processing of the grammar adjustment device. FIG. 8 shows the hardware configuration of the grammar adjustment device.
The grammar adjustment device 100 will be described below.
The grammar adjustment device 100 is implemented on an information processing device having an arithmetic unit and a storage unit. Examples of such information processing devices include PCs (personal computers) and mobile terminals, but are not limited to these.
FIG. 1 shows the basic configuration of the grammar adjustment device 100. The grammar adjustment device 100 comprises an evaluation data storage unit 11, a target registration unit 12, a basic grammar storage unit 13, a grammar extraction unit 14, a speech recognition unit 15, an extracted grammar storage unit 16, an evaluation value calculation unit 17, and a grammar selection unit 18.
The basic grammar storage unit 13 stores the grammar of the voice commands that serves as the base. A voice command is a command for operating industrial equipment by voice. The grammar of voice commands consists of syntax and words. The basic grammar storage unit 13 includes a syntax storage unit 19 that stores syntax and a word storage unit 20 that stores words.
The words include the words that make up the voice commands and their phoneme representations. The syntax defines the arrangement of the words that make up a voice command.
The base grammar is created exhaustively to cover as many of the voice commands expected in the field as possible. For example, for a voice command that sets the "override" of a numerical controller to "30", syntax such as "override 30", "set override to 30", and "make the override 30" is assumed. The grammar creator constructs as many grammars as possible. The base grammar is determined by the type of device that recognizes the voice commands, its specifications, the work content, and so on.
A plurality of phoneme sequences may be assigned to one word. For example, the word "override" can be represented by multiple phoneme sequences such as "o:ba:raido", "oubaaraido", and "oubaraido". The base grammar is constructed to cover as many such word phonemes as possible.
A grammar consists of syntax and words. FIG. 2 shows an example syntax definition and example word definitions. The syntax definition defines the words that make up a voice command and their order. In the first line of the syntax definition in FIG. 2, "S : NS_B COMMAND NS_E", "S" is the start symbol of the voice command, and "NS_B" and "NS_E" are the silent segments at the beginning and end of the sentence. The syntax element "COMMAND" sits between the silent segments.
The second and third lines define the "tags" that can fill "COMMAND". The second line defines that the syntax element "COMMAND" consists of the tags "ROBOT" and "INTERFACE", and the third line defines that it consists of the tags "NAIGAI" and "INTERFACE".
The first and second lines of the word definitions define the Japanese notation and phoneme notation of the tag "ROBOT". The Japanese notation of the tag "ROBOT" is 「ロボット」 ("robot") and its phoneme notation is "roboqto". The third to fifth lines define the Japanese notation and phoneme notation of the words belonging to the tag "NAIGAI". The tag "NAIGAI" contains two Japanese words, 「外部」 ("external") and 「内部」 ("internal"); the phoneme notation of "external" is "gaibu" and that of "internal" is "naibu". The sixth to eighth lines define the Japanese notation and phoneme notation of the word belonging to the tag "INTERFACE". The tag "INTERFACE" contains one Japanese word, 「インターフェース」 ("interface"), which has two phoneme notations, "iNtafe:su" and "iNta:feisu". "%NS_B" defines the silent segment [s] at the beginning of a sentence, and "%NS_E" defines the silent segment [/s] at the end of a sentence.
The grammar extraction unit 14 extracts a subset of the exhaustive grammar stored in the basic grammar storage unit 13. As one example of the extraction method, cluster division by the k-means method is used; methods other than k-means may also be used. The k-means clustering uses the acoustic distance between grammars. The acoustic distance can be obtained from an acoustic spectrum, from a phoneme string, or the like. In the method that calculates the acoustic distance from the acoustic spectrum, the acoustic spectrum of the voice command is vectorized and the cosine distance or Euclidean distance between the vectors is calculated. In the method that calculates the acoustic distance from a phoneme string, the cosine distance, Levenshtein distance, Jaro-Winkler distance, or Hamming distance is used. These distances are well known.
In addition to computing the distance between whole voice commands, the acoustic distance between the phoneme strings of the words included in the voice commands (e.g., "iNtafe:su", "iNtafeisu") can be calculated, and a subset can be extracted from a cluster of words with close acoustic distances.
In the k-means method given as an example in this disclosure, the centers of K clusters are set using random numbers; then (a) each voice command (or word) is assigned to its nearest center, and (b) the center of each cluster is recomputed. Steps (a) and (b) are repeated until no cluster center changes, dividing the voice commands into clusters.
The grammar extraction unit 14 extracts the grammar (syntax and words) of the voice commands belonging to the same cluster and outputs it to the speech recognition unit as a grammar for evaluation.
The k-means method is one example of a method for extracting grammars that are close in distance; other methods may be used. The results of the k-means method are affected by the initial random seed and the number of clusters K. The random seed and the number of clusters K may be set manually by the user or automatically by the grammar extraction unit 14.
The evaluation data storage unit 11 stores voice data containing voice commands recorded by a plurality of speakers at a plurality of recording locations, in association with the correct data, which is the correct text for the voice data. For example, voice data of multiple speakers uttering "external interface" at multiple recording locations is stored in association with the correct data (text) "external interface".
The voice data in the evaluation data storage unit 11 is recorded at different recording locations by speakers with different attributes (gender, age). Because the evaluation data is recorded at the sites where the voice commands are used, it contains the noise of those sites. FIG. 3 is a table showing the relationship between the speakers and recording locations of the evaluation data. The evaluation data in FIG. 3 includes voice recorded by speaker A (male, 60 years old) at factories A and B, voice recorded by speaker B (female, 30 years old) at factories C and D, and so on.
The speech recognition unit 15 receives voice data containing a voice command from the evaluation data storage unit 11 and performs speech recognition on it. The speech recognition unit 15 generally consists of an acoustic model, a language model, and a decoder. The acoustic model receives voice data and outputs the phonemes (senones) that make up the voice data based on its features. The language model outputs the occurrence probability of word strings: it selects hypothesis word strings based on the phonemes and outputs linguistically plausible candidates. The decoder outputs a high-probability word string as the recognition result based on the outputs of the statistically created acoustic model and language model.
The extracted grammar storage unit 16 includes a syntax storage unit 21 and a word storage unit 22, and stores the grammars extracted by the grammar extraction unit 14. The speech recognition unit 15 performs speech recognition using the grammar stored in the extracted grammar storage unit 16.
The evaluation value calculation unit 17 compares the correct text from the evaluation data storage unit 11 with the recognition results of the speech recognition unit 15, and calculates the accuracy rate of speech recognition. FIG. 4 is an example of the accuracy rate as an evaluation value. In the present disclosure, in addition to evaluating the voice commands as a whole, an accuracy rate is calculated for each type of voice command. Types of voice commands include, for example, approval commands, numerical commands, and transition commands. An approval command is a command indicating approval, such as "hai" (yes), "iie" (no), "yes", "no", "execute", and "abort". Numerical commands designate numerical values such as "0.5", "1", "2", and "100". A transition command designates a display screen, such as "home screen" or "speed setting screen". A "machine operation command" that instructs a machine movement, such as "set a workpiece", is also conceivable.
The target registration unit 12 accepts registration of target values for speech recognition, such as a target accuracy rate for all voice commands, a target accuracy rate for each type of voice command, and a target search time.
FIG. 5 is an example of the target registration screen. In FIG. 5, the target accuracy rates for each type of voice command are set as "approval command: 95% or more", "numerical command: 90% or more", and "transition command: 80% or more", with a maximum execution time of "within 30 minutes".
The grammar selection unit 18 compares the speech recognition results with the target accuracy rates and, if a grammar is determined to satisfy the targets, selects that grammar as an appropriate grammar. The processing of the grammar selection unit 18 is repeated until the target accuracy rates are met or the target time for grammar adjustment elapses. If the target time elapses, an appropriate grammar is selected from among the grammars evaluated so far. The grammar selection unit 18 may also present the accuracy rate of each grammar to the grammar creator and let the creator select a grammar.
An example of the grammar selection method is as follows. In the present disclosure, the accuracy rate of voice commands is calculated for each type: approval commands, transition commands, and numerical commands. Among these, approval commands are used to confirm operations, so a high accuracy rate is required. Numerical commands, which designate numerical values, also require a high accuracy rate. Transition commands, which instruct screen transitions, may have a lower accuracy rate than approval or numerical commands. In the grammar adjustment device of the present disclosure, a target accuracy rate can be set for each voice command or for each type of voice command, and the grammar that achieves the target accuracy rates is selected automatically.
For example, FIG. 6 shows the accuracy rates of grammar A and grammar B. Since the accuracy rates of grammar A for the "approval command", "numerical command", and "transition command" satisfy the target accuracy rates registered in the target registration unit 12, grammar A is selected as the appropriate grammar. Additional conditions may be set for the case where more than one grammar satisfies the target accuracy rates.
The processing of the grammar adjustment device 100 of the present disclosure will be described with reference to FIG. 7.
As a preparation step, the grammar adjustment device 100 receives registration of the target accuracy rates for voice commands (step S1), registration of the maximum execution time for grammar adjustment (step S2), and registration of the cluster division criteria (step S3). When the k-means method is used, the initial random seed and the number of clusters K are registered as the cluster division criteria.
The grammar adjustment device 100 clusters the voice commands (or the words included in the voice commands) stored in the basic grammar storage unit 13 (step S4), extracts one or more representative voice commands (or words) from each cluster (step S5), and reconstructs the grammar using the extracted voice commands (or words) (step S6).
The grammar adjustment device 100 performs speech recognition on the evaluation data using the grammar reconstructed in step S6 (step S7), and calculates an evaluation value for the speech recognition (step S8). The grammar adjustment device 100 compares the evaluation result with the target accuracy rates and, if the evaluation result satisfies them (step S9; Yes), selects that grammar (step S10).
In step S9, if the evaluation result does not satisfy the target accuracy rates (step S9; No), it is determined whether the maximum execution time has been reached (step S11). When the target time for grammar adjustment has been reached (step S11; Yes), the grammar adjustment device 100 presents the grammars evaluated so far to the user and accepts a selection of a grammar (step S10).
If the target time for grammar adjustment has not been reached (step S11; No), the process returns to step S4, and the processing from step S4 to step S9 is repeated.
In this flowchart, grammar adjustment ends when the target accuracy rates are satisfied, but the adjustment may instead be continued until the maximum execution time is reached.
As described above, the grammar adjustment device 100 of the present disclosure is a device that supports the creation of voice command grammars: it extracts a part of an exhaustively created grammar, reconstructs the grammar, and selects a grammar with a high accuracy rate.
Because the grammar accuracy rate is calculated for each type of voice command, the grammar can be adjusted so that its accuracy suits the site where speech recognition is used.
Because the grammar evaluation data is recorded at the sites where the voice commands are used, a grammar suited to recognizing voice data containing site- or time-specific noise can be constructed. Moreover, because field-specific technical terms and expressions are registered as the grammar, recognition candidates are selected from the registered words and syntax even when noise is included, which improves the accuracy rate.
Because the grammar adjustment device 100 of the present disclosure adjusts the grammar automatically, it can optimize the grammar according to objective criteria, without depending on the subjectivity or know-how of the grammar creator. Moreover, because adjustment is automatic, even an inexperienced engineer can adjust the grammar.
[Hardware configuration]
The hardware configuration of the grammar adjustment device 100 will be described with reference to FIG. 8. The CPU 111 of the grammar adjustment device 100 is a processor that controls the grammar adjustment device 100 as a whole. The CPU 111 reads the system program stored in the ROM 112 via the bus and controls the entire grammar adjustment device 100 according to the system program. The RAM 113 temporarily stores calculation data, display data, various data input by the user via the input unit 71, and the like.
The display unit 70 is a monitor or the like attached to the grammar adjustment device 100, and displays the operation screens, setting screens, and the like of the grammar adjustment device 100.
The input unit 71 is a keyboard, touch panel, operation buttons, or the like, either integrated with or separate from the display unit 70. The user operates the input unit 71 to make inputs on the screens displayed on the display unit 70. The display unit 70 and the input unit 71 may also be a mobile terminal.
The non-volatile memory 114 is a memory that retains its stored contents even when the grammar adjustment device 100 is powered off, for example by being backed up by a battery (not shown). The non-volatile memory 114 stores machining programs, system programs, available options, billing tables, and the like. It also stores programs read from external devices via an interface (not shown), programs input via the input unit 71, and various data obtained from each part of the grammar adjustment device 100, machine tools, and the like (for example, setting parameters obtained from a machine tool). Programs and data stored in the non-volatile memory 114 may be loaded into the RAM 113 at the time of execution or use. Various system programs are written in the ROM 112 in advance.
REFERENCE SIGNS LIST
100 grammar adjustment device
11 evaluation data storage unit
12 target registration unit
13 basic grammar storage unit
14 grammar extraction unit
15 speech recognition unit
16 extracted grammar storage unit
17 evaluation value calculation unit
18 grammar selection unit
19 syntax storage unit
20 word storage unit
21 syntax storage unit
22 word storage unit
70 display unit
71 input unit
111 CPU
112 ROM
113 RAM
114 non-volatile memory

Claims (7)

  1.  A grammar adjustment device comprising:
     a grammar storage unit that stores a grammar of voice commands for operating an industrial device;
     a grammar extraction unit that extracts a part of the grammar;
     a target registration unit that accepts registration of a target for a speech recognition evaluation value of the extracted grammar;
     a speech recognition unit that performs speech recognition of evaluation speech data using the extracted grammar;
     an evaluation value calculation unit that calculates the speech recognition evaluation value of the extracted grammar based on the results of speech recognition using the extracted grammar and the correct data of the evaluation speech data; and
     a grammar selection unit that selects a grammar that satisfies the target from among the one or more extracted grammars extracted by the grammar extraction unit.
  2.  The grammar adjustment device according to claim 1, wherein the evaluation value is an accuracy rate of the speech recognition.
  3.  The grammar adjustment device according to claim 1, wherein an execution time for grammar adjustment is accepted, and the extraction of a grammar by the grammar extraction unit, the speech recognition using the extracted grammar by the speech recognition unit, and the calculation of the speech recognition evaluation value of the extracted grammar by the evaluation value calculation unit are repeated until the execution time is reached.
  4.  The grammar adjustment device according to claim 1, wherein the grammar extraction unit clusters the grammars stored in the grammar storage unit and extracts representatives of the clustered grammars.
  5.  The grammar adjustment device according to claim 4, wherein the grammar extraction unit clusters the grammars using acoustic distances of the voice commands defined in the grammars.
  6.  The grammar adjustment device according to claim 4, wherein the grammar extraction unit clusters the grammars using acoustic distances of words included in the voice commands defined in the grammars.
  7.  A storage medium storing a grammar of voice commands for operating an industrial device, and storing processor-readable instructions that, when executed by one or more processors:
     extract a part of the grammar;
     accept registration of a target for a speech recognition evaluation value of the extracted grammar;
     perform speech recognition of evaluation speech data using the extracted grammar;
     calculate the speech recognition evaluation value of the extracted grammar based on the results of speech recognition using the extracted grammar and the correct data of the evaluation speech data; and
     select, from among the one or more extracted grammars, a grammar that satisfies the target.
PCT/JP2022/002282 2022-01-21 2022-01-21 Grammar adjustment device and computer-readable storage medium WO2023139769A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/002282 WO2023139769A1 (en) 2022-01-21 2022-01-21 Grammar adjustment device and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/002282 WO2023139769A1 (en) 2022-01-21 2022-01-21 Grammar adjustment device and computer-readable storage medium

Publications (1)

Publication Number Publication Date
WO2023139769A1 2023-07-27

Family

ID=87348531

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2022/002282 WO2023139769A1 (en) 2022-01-21 2022-01-21 Grammar adjustment device and computer-readable storage medium

Country Status (1)

Country Link
WO (1) WO2023139769A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS602998A * 1983-06-20 1985-01-09 富士通株式会社 (Fujitsu Limited) Method of composing voice dictionary for voice recognition system
JPH0250197A (en) * 1988-05-06 1990-02-20 Ricoh Co Ltd Dictionary pattern producing device
JP2009217006A * 2008-03-11 2009-09-24 Nippon Hoso Kyokai (NHK) Dictionary correction device, system and computer program
JP2009229529A (en) * 2008-03-19 2009-10-08 Toshiba Corp Speech recognition device and speech recognition method
JP2014191246A (en) * 2013-03-28 2014-10-06 Nec Corp Recognition processing control device, recognition processing control method, and recognition processing control program


Similar Documents

Publication Publication Date Title
EP2309489B1 (en) Methods and systems for considering information about an expected response when performing speech recognition
US10068566B2 (en) Method and system for considering information about an expected response when performing speech recognition
JP4657736B2 (en) System and method for automatic speech recognition learning using user correction
US9275637B1 (en) Wake word evaluation
US20020123894A1 (en) Processing speech recognition errors in an embedded speech recognition system
US8731928B2 (en) Speaker adaptation of vocabulary for speech recognition
JP3980791B2 (en) Man-machine system with speech recognition device
US6952665B1 (en) Translating apparatus and method, and recording medium used therewith
US6934682B2 (en) Processing speech recognition errors in an embedded speech recognition system
EP2645364B1 (en) Spoken dialog system using prominence
JP2012238017A (en) Speech recognition method with substitution command
US7966177B2 (en) Method and device for recognising a phonetic sound sequence or character sequence
JP4186992B2 (en) Response generating apparatus, method, and program
WO2006097975A1 (en) Voice recognition program
US6963834B2 (en) Method of speech recognition using empirically determined word candidates
JPH0581920B2 (en)
JP2019101065A (en) Voice interactive device, voice interactive method and program
WO2023139769A1 (en) Grammar adjustment device and computer-readable storage medium
JP2005234332A (en) Electronic equipment controller
US20060136195A1 (en) Text grouping for disambiguation in a speech application
JP5544575B2 (en) Spoken language evaluation apparatus, method, and program
Raux Automated lexical adaptation and speaker clustering based on pronunciation habits for non-native speech recognition
WO2023139770A1 (en) Grammar generation support device and computer-readable storage medium
Kuzdeuov et al. Speech Command Recognition: Text-to-Speech and Speech Corpus Scraping Are All You Need
Avuclu et al. A Voice Recognition Based Game Design for More Accurate Pronunciation of English

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22921926

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2023575014

Country of ref document: JP