CN115394288B - Language identification method and system for civil aviation multi-language radio land-air conversation - Google Patents

Info

Publication number: CN115394288B (application CN202211331120.0A; earlier publication CN115394288A)
Authority: CN (China)
Prior art keywords: language, voice, recognition model, probability, output
Legal status: Active (granted)
Inventors: Zhang Huayong (张华勇), Liu Fengfeng (刘丰丰), Wang Xiaogang (王小刚)
Original and current assignee: Chengdu Aiwei Translation Technology Co., Ltd.
Original language: Chinese (zh)


Classifications

    • G PHYSICS › G10 MUSICAL INSTRUMENTS; ACOUSTICS › G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition › G10L 15/005 Language recognition
    • G10L 15/00 Speech recognition › G10L 15/06 Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice › G10L 15/063 Training
    • G10L 15/00 Speech recognition › G10L 15/26 Speech to text systems

Abstract

The invention discloses a language identification method and system for civil aviation multi-language radio land-air communication. The method comprises the following steps: constructing a text dictionary for each language, and separately training and deploying an end-to-end speech recognition model for each language; feeding the speech into the speech recognition model of each language to obtain a text and posterior probabilities; from the posterior probabilities output by each model, calculating the mean probability of the most probable character of each non-blank frame of the speech, and taking the mean output by each model as the confidence of the corresponding language; scaling the confidence of each language; and comparing the confidences obtained by the speech recognition models, selecting the language of the current speech, and returning the corresponding text. In application, the method requires no additional complex computation module, makes full use of the prior knowledge about each language embedded in the speech recognition models, and adapts well to language identification in complex environments.

Description

Language identification method and system for civil aviation multi-language radio land-air conversation
Technical Field
The invention relates to the field of civil aviation radio land-air communication, and in particular to a language identification method and system for civil aviation multi-language radio land-air communication.
Background
With the rapid growth of the global economy, civil aviation has developed quickly: the number of flights rises year by year, and ever-increasing air traffic keeps adding to controllers' workload. Controllers often need to communicate with pilots of domestic and international flights, and under high-intensity work their understanding of land-air communication content is prone to deviation, which brings safety risks to the normal operation of airports.
To reduce controllers' workload, a large number of air traffic control automation systems have emerged to assist controllers and reduce safety hazards in airport control work. Because controllers and flight crews communicate directly by voice, most systems introduce speech recognition technology as the means of converting speech into text. However, most existing mature speech recognition technologies are single-language, and because pronunciation dictionaries differ between languages, schemes based on mixed-language recognition struggle to reach high accuracy. Language identification is therefore a key technology for civil aviation land-air communication speech recognition.
Most existing language identification technologies are based on neural network algorithms: acoustic features are extracted from large speech-language databases, and a deep learning model is trained with the back-propagation algorithm. However, such models have low robustness and often perform poorly in unseen environments not covered in training; their inference is time-consuming; and the training and parameter-tuning steps are tedious. All of this increases system complexity.
Disclosure of Invention
The invention aims to solve the system-complexity problem of existing language identification, and provides a language identification method for civil aviation multilingual radio land-air communication. The invention also discloses a system implementing this language identification method.
The purpose of the invention is mainly realized by the following technical scheme:
the language identification method of civil aviation multi-language radio land-air conversation comprises the following steps:
constructing a text dictionary of each language, and respectively training and deploying an end-to-end speech recognition model of each language;
respectively sending the voice into voice recognition models of various languages for recognition to obtain texts, and obtaining posterior probabilities output by the voice recognition models;
from the posterior probabilities output by each speech recognition model, calculating the mean probability of the most probable character of each non-blank frame of the speech, and taking the mean output by each model as the confidence of the corresponding language;
scaling the confidence of each language so that the confidence probability means output by the speech recognition models become comparable;
and comparing, on the basis of the scaling factors, the confidences obtained by the speech recognition models, selecting the language corresponding to the model with the highest confidence as the language of the current speech, and returning that model's output text as the current text. In application, a speech recognition model oriented to civil aviation is trained separately for each language; the audio input by the user is fed into every model; the posterior probability of each character is obtained for every audio frame; the mean posterior probability of the characters output by each model is computed and used as the language confidence of the input speech under that model's language; all confidences are compared, the language with the highest confidence is chosen as the language of the current speech, and the recognition result of that language's model is taken as the recognition result of the current speech. Compared with traditional language identification methods, this method needs no additional module, has lower computational complexity, and, because it incorporates the prior knowledge of speech recognition, achieves higher generalization and robustness for language identification.
Further, the text dictionary is expressed as

$$W = \{\, w_1, w_2, \ldots, w_N \,\}$$

and a piece of text corresponding to a speech utterance is represented as

$$Y = (\, w_{y_1}, w_{y_2}, \ldots, w_{y_m} \,)$$

where $N$ is the size of the vocabulary, determined by the number of vocabulary items of each language; $w_N$ is the $N$-th character in the dictionary; $m$ represents the total number of audio frames of the utterance; $y_1, y_2, \ldots, y_m$ are the dictionary indices of the characters corresponding to frames $1, 2, 3, \ldots, m$; and $w_{y_m}$ is the character corresponding to the $m$-th frame;
the training and deployment of the end-to-end speech recognition models of the various languages comprises: constructing a deep learning speech recognition model for each language separately, extracting the FBank features of the speech as model input, and training with a CTC-loss-based strategy to obtain an end-to-end speech recognition model for each language.
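As a minimal sketch of the dictionary-building step, the snippet below collects the unique characters of a language corpus into an index map $W$ and encodes a transcript as the index sequence used as CTC training targets. All names and the tiny example corpus are illustrative assumptions, not the patent's implementation.

```python
# Hypothetical sketch: build a per-language text dictionary W = {w_1, ..., w_N}
# and encode a transcript as the index sequence (y_1, ..., y_m) for CTC targets.
BLANK = "<blank>"  # CTC blank symbol, conventionally placed at index 0

def build_dictionary(corpus_lines):
    """Collect the unique characters of a language corpus into an index map."""
    chars = sorted({ch for line in corpus_lines for ch in line})
    vocab = [BLANK] + chars          # N = len(vocab) is the vocabulary size
    return {w: i for i, w in enumerate(vocab)}

def encode(text, dictionary):
    """Map a transcript to the character-index sequence used as CTC targets."""
    return [dictionary[ch] for ch in text]

word2idx = build_dictionary(["right turn", "heading three five"])
targets = encode("three", word2idx)
```

In practice each language (Chinese characters, English letters) would get its own dictionary built this way before training its end-to-end model.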
Further, feeding the speech into the speech recognition model of each language for recognition to obtain a text, and obtaining the posterior probabilities output by each model, comprises:
the speech is fed into the speech recognition models of the various languages, and an audio sequence of $T$ feature vectors is computed,

$$X = (\, x_1, x_2, \ldots, x_T \,)$$

After $X$ is input into a speech recognition model, the posterior output over each character or letter of the corresponding language's dictionary is obtained for every audio frame, together with the greedily decoded text. The posterior output of the speech recognition model is

$$P = (\, p_1, p_2, \ldots, p_T \,)$$

where $p_t$ denotes the vector of weights over the words of the vocabulary at frame $t$, with $t$ taking the values $1, 2, \ldots, T$:

$$p_t = (\, z_{t,1}, z_{t,2}, \ldots, z_{t,N} \,)$$

Here $z_{t,k}$ denotes the weight assigned by the speech recognition model at frame $t$ to the $k$-th character of the vocabulary, with $k$ taking the values $1, 2, \ldots, N$, and the posterior probability of the audio output by the model is

$$P(w_k \mid x_t; \theta) = \mathrm{softmax}(z_{t,k}) = \frac{e^{z_{t,k}}}{\sum_{j=1}^{N} e^{z_{t,j}}}$$

where $x_t$ denotes the $t$-th audio frame, $\theta$ represents the parameters of the speech recognition model, and $P(w_k \mid x_t; \theta)$ is the posterior probability of the $k$-th character of the language given the frame.
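The softmax normalization of the per-frame weights can be sketched in a few lines of numpy. This is an illustrative sketch, not the patent's implementation; the tiny weight matrix is made up.

```python
# Turn per-frame weight vectors z_{t,k} of one language model into posterior
# probabilities P(w_k | x_t; theta) via a softmax over the vocabulary axis.
import numpy as np

def frame_posteriors(z):
    """z: (T, N) weights -> (T, N) posteriors; each row sums to 1."""
    z = z - z.max(axis=1, keepdims=True)   # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

z = np.array([[2.0, 0.0, 0.0],   # frame 1 favours character 1
              [0.0, 0.0, 3.0]])  # frame 2 favours character 3
P = frame_posteriors(z)
```

Each row of `P` is a distribution over the language's vocabulary for one audio frame, which is exactly the quantity the confidence score below consumes.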
Further, calculating, from the posterior probabilities output by each speech recognition model, the mean probability of the most probable character of each non-blank frame of the speech comprises:
for the output vector of each time step of each language's speech recognition model, taking the maximum value as the confidence of the current frame, and taking the character corresponding to that maximum as the recognized character of the current time step;
removing the probability output of blank frames, taking the natural logarithm of all probabilities output by each model, and then averaging all valid outputs of each model to obtain the mean probability of the whole utterance under the corresponding language's model. The score is calculated as follows:
$$\mathit{Score} = \frac{1}{\lvert \mathcal{T}_b \rvert} \sum_{t \in \mathcal{T}_b} \ln \Bigl( \max_{k} P(w_k \mid x_t; \theta) \Bigr), \qquad \mathcal{T}_b = \bigl\{\, t : \arg\max_{k} P(w_k \mid x_t; \theta) \neq b \,\bigr\}$$

where $\mathit{Score}$ represents the confidence of the speech under the speech recognition model; $b$ is the position of the blank character in the vocabulary; $\lvert \cdot \rvert$ takes the size of the set, i.e. the number of probability values of the valid characters of the audio frames; $\max$ takes the maximum probability value; $\ln$ takes the natural logarithm; $\theta$ represents the parameters of the speech recognition model; $P(w_k \mid x_t; \theta)$ is the posterior probability of the $k$-th character of the language given the frame; and $T$ represents the total number of feature vectors in the speech. In application, the probability output of blank frames is removed and does not enter the score calculation, because the model output usually contains a large number of blank, meaningless characters that carry no linguistic information and merely mark silent audio frames or the boundary between two characters. A natural logarithm is taken of each probability because, if the probabilities were confined to the interval $(0, 1)$, the contribution of individual frames would be diluted by averaging and the final confidences would fall in a narrow range; taking the logarithm makes the confidence of the most probable word in each frame more discriminative.
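The scoring step described above (drop blank-dominated frames, take the natural log of each remaining frame's maximum posterior, average) can be sketched as follows. This is a hedged sketch under the assumption that the blank symbol sits at index 0; the example matrix is made up.

```python
# Confidence score: mean ln(max posterior) over frames whose argmax is not
# the blank character. Values are <= 0; closer to zero means higher confidence.
import numpy as np

def confidence_score(posteriors, blank=0):
    """posteriors: (T, N) per-frame distributions -> scalar score."""
    best = posteriors.argmax(axis=1)
    valid = posteriors[best != blank]          # remove blank frames
    if valid.size == 0:                        # blank-only utterance: no evidence
        return float("-inf")
    return float(np.mean(np.log(valid.max(axis=1))))

P = np.array([[0.9, 0.05, 0.05],   # blank frame (argmax == 0) -> excluded
              [0.1, 0.8,  0.1 ],
              [0.2, 0.2,  0.6 ]])
score = confidence_score(P)        # mean of ln(0.8) and ln(0.6)
```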
Further, when the confidence corresponding to each language is scaled, for any two languages the scaling factor is computed as

$$\alpha_{12} = \frac{\frac{1}{T} \sum_{t=1}^{T} \ln \frac{1}{N_1}}{\frac{1}{T} \sum_{t=1}^{T} \ln \frac{1}{N_2}} = \frac{\ln N_1}{\ln N_2}$$

where $\alpha_{12}$ represents the scaling factor of the first language relative to the second language, $N_1$ denotes the vocabulary size of the first language, $N_2$ denotes the vocabulary size of the second language, $T$ represents the total number of feature vectors in the speech, and $t$ takes the values $1, 2, \ldots, T$.
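The scaling factor reduces to a ratio of log vocabulary sizes, so it is one line of code. The sizes 5207 and 28 below are taken from the patent's Chinese/English example purely as illustrative numbers.

```python
# Vocabulary-size scaling factor alpha_12 = ln(N1) / ln(N2).
import math

def scaling_factor(n1, n2):
    """Scaling factor of language 1 relative to language 2."""
    return math.log(n1) / math.log(n2)

alpha = scaling_factor(5207, 28)   # roughly 2.57 for the Chinese/English pair
```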
Further, the decision formula for comparing the confidences obtained by the speech recognition models is

$$\text{language} = \begin{cases} \text{first language}, & \mathit{Score}_1 > \alpha_{12} \cdot \mathit{Score}_2 \\ \text{second language}, & \mathit{Score}_1 < \alpha_{12} \cdot \mathit{Score}_2 \end{cases}$$

where $\mathit{Score}_1$ is the confidence output by the first speech recognition model, $\mathit{Score}_2$ is the confidence output by the second speech recognition model, and $\alpha_{12}$ is the scaling factor. After the scaling factor is applied, if the confidence of the first language is greater than that of the second language, the speech is in the first language; if it is smaller, the speech is in the second language.
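The pairwise decision rule can be sketched as below. The function name, labels, and default tie-breaking language are assumptions for illustration; the scores fed in are mean log-probabilities, hence non-positive, with values closer to zero winning.

```python
# Compare Score_1 against the scaled Score_2 and pick the language whose
# (scaled) confidence is larger; fall back to a default on an exact tie.
import math

def pick_language(score1, score2, n1, n2, default="language-1"):
    alpha = math.log(n1) / math.log(n2)      # scaling factor of lang 1 vs lang 2
    scaled2 = alpha * score2
    if score1 > scaled2:
        return "language-1"
    if score1 < scaled2:
        return "language-2"
    return default  # tie: e.g. blank or very short audio

lang = pick_language(-0.4, -1.2, 5207, 28)   # -0.4 beats 2.57 * -1.2
```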
The system implementing the language identification method for civil aviation multi-language radio land-air communication comprises:
a text dictionary building module, used to build the text dictionary of each language;
a speech recognition model training module, used to train and deploy the end-to-end speech recognition model of each language;
a text and posterior probability acquisition module, used to obtain the texts and posterior probabilities recognized by the speech recognition models of the various languages for the input speech;
a language analysis module, used to obtain the texts and posterior probabilities recognized by the speech recognition models, calculate from the posterior probabilities the mean probability of the most probable character of each non-blank frame of the speech, and take the mean output by each model as the confidence of the corresponding language;
a speech confidence scaling module, used to scale the confidence of each language so that the confidence probability means output by the models become comparable;
a confidence comparison module, used to compare, on the basis of the scaling factors, the confidences obtained by the speech recognition models and select the language corresponding to the model with the highest confidence as the language of the current speech;
and a text output module, used to output the text of the model with the highest confidence as the current text.
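The module flow above can be sketched end to end with stub recognizers. All names are assumptions, not the patent's API; each stub "model" returns a text and a (T, N) posterior matrix, and dividing each score by ln(N) is equivalent to applying the pairwise scaling factor ln(N1)/ln(N2).

```python
# Illustrative pipeline: score every language model on a common scale and
# return the winning language's label and recognized text.
import math
import numpy as np

def confidence(posteriors, blank=0):
    """Mean ln(max posterior) over non-blank frames (the per-model score)."""
    best = posteriors.argmax(axis=1)
    valid = posteriors[best != blank]
    return float(np.mean(np.log(valid.max(axis=1)))) if valid.size else float("-inf")

def identify(audio, models):
    """models: {lang: (recognize_fn, vocab_size)} -> (language, text)."""
    results = {}
    for lang, (recognize, n) in models.items():
        text, post = recognize(audio)
        # score / ln(N) puts all languages on one scale (same ordering as
        # the pairwise comparison Score_1 vs alpha_12 * Score_2)
        results[lang] = (confidence(post) / math.log(n), text)
    lang = max(results, key=lambda k: results[k][0])
    return lang, results[lang][1]

# Stub recognizers standing in for trained Chinese / English models
def zh_model(audio):
    return "zh-text", np.array([[0.1, 0.8, 0.1], [0.05, 0.05, 0.9]])

def en_model(audio):
    return "en-text", np.array([[0.6, 0.2, 0.2], [0.2, 0.5, 0.3]])

lang, text = identify(None, {"zh": (zh_model, 5207), "en": (en_model, 28)})
```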
In conclusion, compared with the prior art, the invention has the following beneficial effects. (1) The method builds on existing speech recognition models: it obtains the confidence of the speech under each model by averaging the posterior probabilities of the valid audio frames output by that model, then performs a weighted comparison of the confidences according to vocabulary size, and selects the language with the highest score, thereby obtaining the language information of the speech. Because the linguistic information of the speech is incorporated, the method overcomes the low generalization and robustness of traditional language identification methods and can adapt to a variety of complex noise environments.
(2) The invention uses only the posterior probabilities of speech recognition as the information for computing the language, so its computational complexity is almost negligible compared with traditional deep-learning schemes; and since the speech recognition models can run inference in parallel, it introduces no excessive performance loss.
(3) The method adapts to more complex speech environments at low computational cost, and can be effectively applied to civil aviation land-air communication speech analysis to assist airport controllers in decision-making.
(4) The invention can be applied to discriminating Chinese, English, Japanese, Russian, and other languages, and thus has a wide application range.
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention. In the drawings:
FIG. 1 is a flow chart of an embodiment of the present invention;
fig. 2 is a block diagram of a system in an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to examples and accompanying drawings, and the exemplary embodiments and descriptions thereof are only used for explaining the present invention and are not meant to limit the present invention.
Example 1:
as shown in fig. 1, the language identification method for civil aviation multi-language radio land-air conversation includes: constructing a text dictionary of each language, and respectively training and deploying an end-to-end speech recognition model of each language; respectively sending the voice into voice recognition models of various languages for recognition to obtain texts, and obtaining posterior probabilities output by the voice recognition models; calculating the probability average value of the character with the maximum probability of each non-blank frame of the voice through the posterior probability output by each voice recognition model, and taking the probability average value output by each voice recognition model as the confidence coefficient of the corresponding language; scaling the confidence degrees corresponding to the languages to make the confidence probability mean values output by the voice recognition models consistent; and comparing the confidence degrees obtained by the voice recognition models based on the scaling factors, selecting the language corresponding to the voice recognition model with the highest confidence degree as the language of the current voice, and returning the output text of the voice recognition model corresponding to the language as the current text.
The text dictionary of this embodiment is expressed as

$$W = \{\, w_1, w_2, \ldots, w_N \,\}$$

and a piece of text corresponding to a speech utterance is represented as

$$Y = (\, w_{y_1}, w_{y_2}, \ldots, w_{y_m} \,)$$

where $N$ is the size of the vocabulary, determined by the number of vocabulary items of each language; $w_N$ is the $N$-th character in the dictionary; $m$ represents the total number of audio frames of the utterance; $y_1, y_2, \ldots, y_m$ are the dictionary indices of the characters corresponding to frames $1, 2, 3, \ldots, m$; and $w_{y_m}$ is the character corresponding to the $m$-th frame. In this embodiment, training and deploying the end-to-end speech recognition models of the various languages specifically comprises: constructing a deep learning speech recognition model for each language separately, extracting the FBank features of the speech as model input, and training with a CTC-loss-based strategy to obtain an end-to-end speech recognition model for each language. Here CTC (Connectionist Temporal Classification) denotes the connectionist temporal classification loss function. In this embodiment CTC is used as the loss function of model training; after the speech passes through the model, the model's result is decoded and searched with a CTC-based strategy, so that a posterior probability vector in one-to-one correspondence with each audio frame is obtained.
In this embodiment, feeding the speech into the speech recognition model of each language for recognition to obtain a text, and obtaining the posterior probabilities output by each model, specifically comprises: the speech is fed into the speech recognition models of the various languages, and an audio sequence of $T$ feature vectors is computed,

$$X = (\, x_1, x_2, \ldots, x_T \,), \qquad x_t \in \mathbb{R}^{64}$$

The 64-dimensional feature vectors are input into each speech recognition model, and the posterior output over every word or letter of the corresponding language's dictionary is obtained for each audio frame, together with the greedily decoded text. The posterior output of a speech recognition model is

$$P = (\, p_1, p_2, \ldots, p_T \,)$$

where $p_t$ denotes the vector of weights over the words of the vocabulary at frame $t$, with $t$ taking the values $1, 2, \ldots, T$:

$$p_t = (\, z_{t,1}, z_{t,2}, \ldots, z_{t,N} \,)$$

Here $z_{t,k}$ denotes the weight assigned by the speech recognition model at frame $t$ to the $k$-th character of the vocabulary, with $k$ taking the values $1, 2, \ldots, N$, and the posterior probability of the audio output by the model is

$$P(w_k \mid x_t; \theta) = \mathrm{softmax}(z_{t,k}) = \frac{e^{z_{t,k}}}{\sum_{j=1}^{N} e^{z_{t,j}}}$$

where $x_t$ denotes the $t$-th audio frame, $\theta$ represents the parameters of the speech recognition model, $P(w_k \mid x_t; \theta)$ is the posterior probability of the $k$-th character of the language given the frame, and $\mathrm{softmax}$ indicates normalization of the input. In this embodiment, when the feature vectors of the audio sequence are calculated, the audio is first pre-emphasized and then divided into frames with a frame length of 20 ms and a frame shift of 10 ms; after framing, a discrete Fourier transform is applied to the audio sequence to obtain frequency-domain features, and the FBank features of the audio are then extracted through a 64-dimensional mel filter bank.
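A minimal numpy sketch of the feature pipeline described above follows: pre-emphasis, 20 ms frames with a 10 ms shift, DFT, then a 64-band mel filter bank. The Hamming window, 512-point FFT size, pre-emphasis coefficient, and log-flooring constant are assumptions; the patent specifies only the framing parameters, the DFT, and the 64-dimensional mel bank.

```python
# FBank feature extraction sketch: (samples,) signal -> (n_frames, 64) features.
import numpy as np

def fbank(signal, sr=16000, n_mels=64, frame_ms=20, shift_ms=10, preemph=0.97):
    # Pre-emphasis: x[n] - 0.97 * x[n-1]
    sig = np.append(signal[0], signal[1:] - preemph * signal[:-1])
    flen = int(sr * frame_ms / 1000)          # 320 samples at 16 kHz
    fshift = int(sr * shift_ms / 1000)        # 160 samples at 16 kHz
    n_frames = 1 + max(0, (len(sig) - flen) // fshift)
    idx = np.arange(flen)[None, :] + fshift * np.arange(n_frames)[:, None]
    frames = sig[idx] * np.hamming(flen)      # windowed frames
    nfft = 512
    power = np.abs(np.fft.rfft(frames, nfft)) ** 2 / nfft   # power spectrum
    # Triangular mel filter bank spanning 0 .. sr/2
    def hz2mel(f): return 2595 * np.log10(1 + f / 700)
    def mel2hz(m): return 700 * (10 ** (m / 2595) - 1)
    mels = np.linspace(hz2mel(0), hz2mel(sr / 2), n_mels + 2)
    bins = np.floor((nfft + 1) * mel2hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, nfft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        if c > l: fb[i - 1, l:c] = (np.arange(l, c) - l) / (c - l)
        if r > c: fb[i - 1, c:r] = (r - np.arange(c, r)) / (r - c)
    return np.log(power @ fb.T + 1e-10)       # log mel-filterbank energies

feats = fbank(np.sin(np.linspace(0.0, 100.0, 16000)))  # 1 s of toy audio
```

One second of 16 kHz audio yields 99 frames of 64-dimensional features, matching the 20 ms / 10 ms framing in the text.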
In this embodiment, calculating the mean probability of the most probable character of each non-blank frame of the speech from the posterior probabilities output by each speech recognition model specifically comprises: for the output vector of each time step of each language's model, taking the maximum value as the confidence of the current frame and the character corresponding to that maximum as the recognized character of the current time step; removing the probability output of blank frames, taking the natural logarithm of all probabilities output by each model, and then averaging all valid outputs of each model to obtain the mean probability of the whole utterance under the corresponding language's model. The score is calculated as follows:

$$\mathit{Score} = \frac{1}{\lvert \mathcal{T}_b \rvert} \sum_{t \in \mathcal{T}_b} \ln \Bigl( \max_{k} P(w_k \mid x_t; \theta) \Bigr), \qquad \mathcal{T}_b = \bigl\{\, t : \arg\max_{k} P(w_k \mid x_t; \theta) \neq b \,\bigr\}$$

where $\mathit{Score}$ indicates the confidence of the speech under the speech recognition model (the logarithms are non-positive, so a score closer to zero indicates higher confidence); $b$ indicates the position of the blank character in the vocabulary; $\lvert \cdot \rvert$ takes the size of the set, i.e. the number of probability values of the valid characters of the audio frames; $\max$ takes the maximum probability value; $\ln$ takes the natural logarithm; $\theta$ represents the parameters of the speech recognition model; $P(w_k \mid x_t; \theta)$ is the posterior probability of the $k$-th character of the language given the frame; and $T$ represents the total number of feature vectors in the speech.
In application, because the vocabulary sizes of the languages differ, the softmax normalization causes the confidence probability means computed by the speech recognition models to be inconsistent, so they cannot be compared directly. Suppose there is an audio for which each speech recognition model predicts, at every frame, an equal probability for every word; this means no model can correctly recognize the text of the speech and each is equally confused about every frame, so the final scores of the models should be equal. On this basis, adding a scaling factor to the score of one of any two compared speech recognition models yields the identity

$$\frac{1}{T} \sum_{t=1}^{T} \ln \frac{1}{N_1} = \alpha_{12} \cdot \frac{1}{T} \sum_{t=1}^{T} \ln \frac{1}{N_2}$$

from which the scaling factor for any two languages can be derived by transformation:

$$\alpha_{12} = \frac{\ln N_1}{\ln N_2}$$

where $\alpha_{12}$ represents the scaling factor of the first language relative to the second language, $N_1$ denotes the vocabulary size of the first language, $N_2$ denotes the vocabulary size of the second language, $T$ represents the total number of feature vectors in the speech, and $t$ takes the values $1, 2, \ldots, T$.
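The identity above can be checked numerically: for a maximally confused model, every frame's maximum posterior is $1/N$, so each score collapses to $\ln(1/N_i)$ and the scaled scores of the two languages coincide. The vocabulary sizes and frame count below are illustrative.

```python
# Numeric check of the scaling identity for uniform (maximally confused) models.
import math

n1, n2, T = 5207, 28, 100            # example vocabulary sizes and frame count
score1 = sum(math.log(1.0 / n1) for _ in range(T)) / T   # = ln(1/N1)
score2 = sum(math.log(1.0 / n2) for _ in range(T)) / T   # = ln(1/N2)
alpha12 = math.log(n1) / math.log(n2)
# score1 equals alpha12 * score2 up to floating-point error
```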
The decision formula for comparing the confidences obtained by the speech recognition models in this embodiment is

$$\text{language} = \begin{cases} \text{first language}, & \mathit{Score}_1 > \alpha_{12} \cdot \mathit{Score}_2 \\ \text{second language}, & \mathit{Score}_1 < \alpha_{12} \cdot \mathit{Score}_2 \end{cases}$$

where $\mathit{Score}_1$ is the confidence output by the first speech recognition model, $\mathit{Score}_2$ is the confidence output by the second speech recognition model, and $\alpha_{12}$ is the scaling factor. After the scaling factor is applied, if the confidence of the first language is greater than that of the second language, the speech is in the first language; if it is smaller, the speech is in the second language. Finally, once the language has been judged, the speech recognition text of the corresponding model is output as the true text. If the confidence of the first language equals that of the second, the speech segment is usually blank speech or very short (less than 0.5 seconds); in this extreme case, the embodiment may default to outputting a preset language as the language judgment.
As shown in fig. 2, this embodiment further discloses a system implementing the language identification method for civil aviation multi-language radio land-air communication, comprising: a text dictionary building module, used to build the text dictionary of each language; a speech recognition model training module, used to train and deploy the end-to-end speech recognition model of each language; a text and posterior probability acquisition module, used to obtain the texts and posterior probabilities recognized by the speech recognition models of the various languages for the input speech; a language analysis module, used to obtain the texts and posterior probabilities recognized by the models, calculate from the posterior probabilities the mean probability of the most probable character of each non-blank frame of the speech, and take the mean output by each model as the confidence of the corresponding language; a speech confidence scaling module, used to scale the confidence of each language so that the confidence probability means output by the models become comparable; a confidence comparison module, used to compare, on the basis of the scaling factors, the confidences obtained by the models and select the language corresponding to the model with the highest confidence as the language of the current speech; and a text output module, used to output the text of the model with the highest confidence as the current text.
In application, the prior linguistic information learned during speech recognition is used to score the confidence of each frame of the speech, and the language of the speech is judged by comparing the confidences output by the speech recognition models of the different languages.
The following example applies this embodiment to Chinese and English recognition. The Chinese text dictionary contains 5207 characters in total (including a blank character), and the English dictionary contains 28 letters (including the space and blank characters). When text is represented, a Chinese sentence glossed character by character as "national aviation three hole two right turn heading to three five" is represented as the character sequence (nation, navigation, three, hole, two, right, turn, navigation, going to, three, five).

For an audio segment with T computed feature vectors, the sequence X = (x_1, x_2, …, x_T) is input to the speech recognition models, yielding for each audio frame the posterior output over every word or letter in the dictionary of the corresponding Chinese or English model, together with the greedily decoded text. The posterior outputs for Chinese and English are, respectively:

Y_cn = (y_1^cn, y_2^cn, …, y_T^cn), Y_en = (y_1^en, y_2^en, …, y_T^en)

where N_cn and N_en denote the sizes of the Chinese and English vocabularies, and y_t^cn and y_t^en are the probability vectors over the words of the respective vocabulary in the t-th frame of audio:

y_t = (y_t(1), y_t(2), …, y_t(N))

The posterior probabilities of the audio output by the Chinese and English speech recognition models can be expressed as:

P(π_t = k | x_t, θ_cn) = exp(y_t^cn(k)) / Σ_{j=1}^{N_cn} exp(y_t^cn(j)), P(π_t = k | x_t, θ_en) = exp(y_t^en(k)) / Σ_{j=1}^{N_en} exp(y_t^en(j))

where θ_cn denotes the parameters of the Chinese model, θ_en the parameters of the English model, and P(π_t = k | x_t, θ_cn) and P(π_t = k | x_t, θ_en) the posterior probabilities that the Chinese and English recognition models assign to the k-th character in the given frame.
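The softmax that turns a frame's output weights y_t(k) into posterior probabilities can be sketched as follows, assuming each frame's weights arrive as a plain list of floats:

```python
import math

def softmax(weights):
    """Posterior probabilities P(pi_t = k | x_t) from one frame's output weights y_t(k)."""
    m = max(weights)                            # shift by the max for numerical stability
    exps = [math.exp(w - m) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]

def posteriors(weight_matrix):
    """Apply the softmax independently to each of the T frames."""
    return [softmax(frame) for frame in weight_matrix]
```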
The Chinese and English scores are computed as:

Score_cn = (1/|A_cn|) Σ_{t∈A_cn} ln max_k P(π_t = k | x_t, θ_cn), Score_en = (1/|A_en|) Σ_{t∈A_en} ln max_k P(π_t = k | x_t, θ_en)

where A_cn and A_en are the sets of frames whose most probable character is not the blank character, and Score_cn and Score_en represent the confidence of the speech under the Chinese and English speech recognition models, respectively.
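A direct reading of the score computation above, assuming the posterior matrix is a list of per-frame probability lists and the blank symbol occupies a known index; the function also returns the greedy (best-character) indices of the valid frames:

```python
import math

def language_score(posteriors, blank_id=0):
    """Mean of ln(max_k P(k | x_t)) over frames whose best character is not blank."""
    logs, chars = [], []
    for frame in posteriors:
        k = max(range(len(frame)), key=frame.__getitem__)   # greedy character of this frame
        if k == blank_id:                                   # drop blank frames entirely
            continue
        logs.append(math.log(frame[k]))
        chars.append(k)
    score = sum(logs) / len(logs) if logs else float("-inf")
    return score, chars
```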
Because the sizes of the Chinese and English vocabularies differ, the softmax normalization causes the confidence means computed by the Chinese and English models to lie on different scales, so they cannot be compared directly. Suppose an audio for which each model predicts every word of its vocabulary with equal probability in every frame, i.e.

P(π_t = k | x_t, θ_cn) = 1/N_cn, P(π_t = k | x_t, θ_en) = 1/N_en

Specifically, with a Chinese model over 5207 common characters and an English model over 28 letters, the Chinese model outputs probability 1/5207 for each character and the English model outputs 1/28 for each letter. Equal probabilities for every word mean that neither model can correctly recognize the text of the speech; every frame of the audio is equally confusing to both models, so the final scores of the Chinese and English speech recognition models should be equal. Adding a scaling factor to the score of the English model therefore gives the identity:

α · (1/T) Σ_{t=1}^{T} ln(1/N_en) = (1/T) Σ_{t=1}^{T} ln(1/N_cn)

where α denotes the scaling factor, N_cn the size of the Chinese vocabulary, and N_en the size of the English vocabulary. Rearranging yields:

α = ln N_cn / ln N_en

The language of the speech recognition model with the largest confidence, after scaling, is taken as the current language. Specifically, with vocabulary sizes of 5207 for Chinese and 28 for English,

α = ln 5207 / ln 28 ≈ 2.57

and the final language is judged by:

language = Chinese if Score_cn > α · Score_en, otherwise English
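Under the equal-confusion assumption, the scaling factor reduces to a ratio of natural-log vocabulary sizes. A small sketch using the vocabulary sizes of this example (5207 Chinese characters, 28 English letters):

```python
import math

def scaling_factor(n_first, n_second):
    """alpha such that alpha * ln(1/n_second) == ln(1/n_first): uniform posteriors tie."""
    return math.log(n_first) / math.log(n_second)

def judge(score_first, score_second, alpha):
    """Return the winning language index (1 or 2) after scaling the second score."""
    return 1 if score_first > alpha * score_second else 2

alpha = scaling_factor(5207, 28)   # approximately 2.57 for Chinese vs. English
```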
After the scaling factor is added, if the Chinese confidence is greater than the English confidence, the speech is Chinese; otherwise, it is English. Finally, the text recognized by the winning model is output as the true text after the language judgment. As a concrete example, for one audio segment the Chinese model recognizes the text (dog postbox) with a posterior probability for each output character, while the English model recognizes (go home) with a probability for each letter. Taking the natural logarithm of each probability and averaging gives the means Score_cn and Score_en; since α · Score_en > Score_cn, the audio is determined to be English and the text is (go home).
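The decision of this worked example can be reproduced end to end. The per-character probabilities below are hypothetical stand-ins (the concrete values belong to the original figures and are not recoverable here), chosen only so that the Chinese model is unsure and the English model is confident:

```python
import math

alpha = math.log(5207) / math.log(28)            # Chinese-vs-English scaling factor

# Hypothetical probabilities: three Chinese characters ("dog postbox"),
# six English letters ("go home").
p_cn = [0.30, 0.25, 0.20]
p_en = [0.90, 0.85, 0.92, 0.88, 0.91, 0.89]

score_cn = sum(math.log(p) for p in p_cn) / len(p_cn)
score_en = sum(math.log(p) for p in p_en) / len(p_en)

language = "English" if alpha * score_en > score_cn else "Chinese"
```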
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (6)

1. A language identification method for civil aviation multi-language radio land-air communication, characterized by comprising the following steps:
constructing a text dictionary for each language, and respectively training and deploying an end-to-end speech recognition model for each language;
sending the speech into the speech recognition model of each language for recognition to obtain texts, and obtaining the posterior probabilities output by the speech recognition models;
calculating, from the posterior probability output by each speech recognition model, the mean probability of the most probable character of each non-blank frame of the speech, and taking the mean probability output by each speech recognition model as the confidence of the corresponding language;
scaling the confidence of each language so that the confidence means output by the speech recognition models are comparable;
comparing, based on the scaling factor, the confidences obtained by the speech recognition models, selecting the language of the speech recognition model with the largest confidence as the language of the current speech, and returning the output text of that model as the current text;
wherein calculating the mean probability of the most probable character of each non-blank frame of the speech from the posterior probability output by each speech recognition model comprises:
for the output vector of each time step of the speech recognition model of each language, taking the maximum value as the confidence of the current frame, and taking the character corresponding to the maximum value as the speech recognition output character of the current time step;
removing the probability output of blank frames, taking the natural logarithm of all probabilities output by each speech recognition model, and then averaging all valid outputs of each speech recognition model to obtain the probability mean of the whole speech under the corresponding language speech recognition model; the score is computed as:
Score = (1/|A|) Σ_{t∈A} ln max_k P(π_t = k | x_t, θ), A = { t : argmax_k P(π_t = k | x_t, θ) ≠ <blank>, t = 1, 2, …, T }
wherein Score represents the confidence of the speech under the speech recognition model, <blank> denotes the position of the blank character in the vocabulary, |A| is the size of the set A, i.e. the number of probability values of valid characters over the audio frames, max takes the maximum probability value, ln takes the natural logarithm, θ represents the parameters of the speech recognition model, P(π_t = k | x_t, θ) represents the posterior probability that the speech recognition model of a language assigns to the k-th character in the given frame, and T represents the total number of feature vectors in the speech.
2. The language identification method of civil aviation multi-language radio land-air communication according to claim 1, characterized in that the text dictionary is represented as D = (w_1, w_2, …, w_N), and a piece of text corresponding to a speech is represented as (w_{i_1}, w_{i_2}, …, w_{i_m}), wherein N is the size of the vocabulary, determined by the number of entries of each language, w_N is the N-th character in the dictionary, m represents the total number of audio frames of a piece of speech, i_1, i_2, i_3, …, i_m are the indices in the dictionary of the characters corresponding to the 1st, 2nd, 3rd, …, m-th frames of audio, and w_{i_m} denotes the character corresponding to the m-th frame of audio;
the training and deployment of the end-to-end speech recognition models of the languages comprises: constructing a deep learning speech recognition model for each language, extracting FBank features of the speech as the model input, and training with a CTC loss function to obtain an end-to-end speech recognition model for each language.
3. The language identification method of civil aviation multi-language radio land-air communication according to claim 1, wherein sending the speech into the speech recognition models of the languages for recognition to obtain texts and obtaining the posterior probabilities output by the speech recognition models comprises:
feeding the speech into the speech recognition model of each language, computing an audio sequence of T feature vectors X = (x_1, x_2, …, x_T), and inputting it into the speech recognition model to obtain, for each audio frame, the posterior output over every word or letter in the dictionary of the corresponding language and the greedily decoded text, the posterior output of the speech recognition model being:
Y = (y_1, y_2, …, y_T)
wherein N is the vocabulary size, y_t = (y_t(1), y_t(2), …, y_t(N)) denotes the probability vector over the words of the vocabulary in the t-th audio frame, t takes the values 1, 2, …, T, and y_t(k) denotes the weight assigned by the speech recognition model in the t-th frame to the k-th character of the vocabulary, k taking the values 1, 2, …, N; the posterior probability of the audio output by the speech recognition model is:
P(π_t = k | x_t, θ) = exp(y_t(k)) / Σ_{j=1}^{N} exp(y_t(j))
wherein x_t denotes the t-th frame of audio, θ represents the parameters of the speech recognition model, and P(π_t = k | x_t, θ) represents the posterior probability that the speech recognition model of a language assigns to the k-th character in the given frame.
4. The language identification method of civil aviation multi-language radio land-air communication according to claim 1, wherein, when the confidences of the languages are scaled, the scaling factor for any two languages is calculated as:
α = ((1/T) Σ_{t=1}^{T} ln(1/N_1)) / ((1/T) Σ_{t=1}^{T} ln(1/N_2)) = ln N_1 / ln N_2
wherein α represents the scaling factor of the first language relative to the second language, N_1 represents the size of the vocabulary of the first language, N_2 represents the size of the vocabulary of the second language, T represents the total number of feature vectors in the speech, and t takes the values 1, 2, …, T.
5. The language identification method of civil aviation multi-language radio land-air communication according to claim 1, wherein the confidences obtained by the speech recognition models are compared according to the following formula:
language = first language if Score_1 > α · Score_2, otherwise second language
wherein Score_1 is the confidence output by the first speech recognition model, Score_2 is the confidence output by the second speech recognition model, and α is the scaling factor; after the scaling factor is added, if the confidence of the first language is greater than that of the second language, the speech is the first language, and if it is less, the speech is the second language.
6. A system for implementing the language identification method for civil aviation multi-language radio land-air communication according to any one of claims 1 to 5, comprising:
a text dictionary building module, used for building text dictionaries of the languages;
a speech recognition model training module, used for training and deploying end-to-end speech recognition models of the languages;
a text and posterior probability acquisition module, used for obtaining the texts and posterior probabilities produced when the input speech is recognized by the speech recognition models of the languages;
a language analysis module, used for taking the texts and posterior probabilities recognized by the speech recognition models, calculating from the posterior probability output by each speech recognition model the mean probability of the most probable character of each non-blank frame of the speech, and taking the mean probability output by each speech recognition model as the confidence of the corresponding language;
a speech confidence scaling module, used for scaling the confidence of each language so that the confidence means output by the speech recognition models are comparable;
a confidence comparison module, used for comparing, based on the scaling factor, the confidences obtained by the speech recognition models and selecting the language of the speech recognition model with the largest confidence as the language of the current speech;
a text output module, used for outputting the text of the speech recognition model with the largest confidence as the current text;
wherein calculating the mean probability of the most probable character of each non-blank frame of the speech from the posterior probability output by each speech recognition model comprises:
for the output vector of each time step of the speech recognition model of each language, taking the maximum value as the confidence of the current frame, and taking the character corresponding to the maximum value as the speech recognition output character of the current time step;
removing the probability output of blank frames, taking the natural logarithm of all probabilities output by each speech recognition model, and then averaging all valid outputs of each speech recognition model to obtain the probability mean of the whole speech under the corresponding language speech recognition model; the score is computed as:
Score = (1/|A|) Σ_{t∈A} ln max_k P(π_t = k | x_t, θ), A = { t : argmax_k P(π_t = k | x_t, θ) ≠ <blank>, t = 1, 2, …, T }
wherein Score represents the confidence of the speech under the speech recognition model, <blank> denotes the position of the blank character in the vocabulary, |A| is the number of probability values of valid characters over the audio frames, max takes the maximum probability value, ln takes the natural logarithm, θ represents the parameters of the speech recognition model, P(π_t = k | x_t, θ) represents the posterior probability that the speech recognition model of a language assigns to the k-th character in the given frame, and T represents the total number of feature vectors in the speech.
CN202211331120.0A 2022-10-28 2022-10-28 Language identification method and system for civil aviation multi-language radio land-air conversation Active CN115394288B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211331120.0A CN115394288B (en) 2022-10-28 2022-10-28 Language identification method and system for civil aviation multi-language radio land-air conversation


Publications (2)

Publication Number Publication Date
CN115394288A CN115394288A (en) 2022-11-25
CN115394288B true CN115394288B (en) 2023-01-24

Family

ID=84115019

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211331120.0A Active CN115394288B (en) 2022-10-28 2022-10-28 Language identification method and system for civil aviation multi-language radio land-air conversation

Country Status (1)

Country Link
CN (1) CN115394288B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105280181A (en) * 2014-07-15 2016-01-27 中国科学院声学研究所 Training method for language recognition model and language recognition method
CN110895932A (en) * 2018-08-24 2020-03-20 中国科学院声学研究所 Multi-language voice recognition method based on language type and voice content collaborative classification
CN111402861A (en) * 2020-03-25 2020-07-10 苏州思必驰信息科技有限公司 Voice recognition method, device, equipment and storage medium
CN112017676A (en) * 2019-05-31 2020-12-01 京东数字科技控股有限公司 Audio processing method, apparatus and computer readable storage medium
CN112951240A (en) * 2021-05-14 2021-06-11 北京世纪好未来教育科技有限公司 Model training method, model training device, voice recognition method, voice recognition device, electronic equipment and storage medium
CN113298188A (en) * 2021-06-28 2021-08-24 深圳市商汤科技有限公司 Character recognition and neural network training method and device
CN114648976A (en) * 2022-02-16 2022-06-21 普强时代(珠海横琴)信息技术有限公司 Language identification method and device, electronic equipment and medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5530729B2 (en) * 2009-01-23 2014-06-25 本田技研工業株式会社 Speech understanding device
JP5967569B2 (en) * 2012-07-09 2016-08-10 国立研究開発法人情報通信研究機構 Speech processing system
CN106782513B (en) * 2017-01-25 2019-08-23 上海交通大学 Speech recognition realization method and system based on confidence level
CN109119072A (en) * 2018-09-28 2019-01-01 中国民航大学 Civil aviaton's land sky call acoustic model construction method based on DNN-HMM
CN112233653B (en) * 2020-12-10 2021-03-12 北京远鉴信息技术有限公司 Method, device and equipment for training multi-dialect accent mandarin speech recognition model


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
基于多任务神经网络的语种识别研究;秦晨光;《中国优秀硕士学位论文全文数据库》;20210215(第2期);全文 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant