CN115394288B - Language identification method and system for civil aviation multi-language radio land-air conversation - Google Patents
Language identification method and system for civil aviation multi-language radio land-air conversation
- Publication number
- CN115394288B (application CN202211331120.0A)
- Authority
- CN
- China
- Prior art keywords
- language
- voice
- recognition model
- probability
- output
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/005—Language recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L15/26—Speech to text systems
Abstract
The invention discloses a language identification method and system for civil aviation multi-language radio land-air communication. The method comprises the following steps: constructing a text dictionary for each language, and training and deploying an end-to-end speech recognition model for each language; feeding the speech into the speech recognition model of each language to obtain a text and posterior probabilities; from the posterior probabilities output by each speech recognition model, calculating the average probability of the most probable character in each non-blank frame of the speech, and taking each model's average as the confidence of the corresponding language; scaling the confidence of each language; and comparing the confidences obtained from the speech recognition models, selecting the language of the current speech, and returning the text. In application the method requires no additional complex computation module, makes full use of the prior knowledge the speech recognition models hold about each language, and adapts well to language identification in complex environments.
Description
Technical Field
The invention relates to the field of civil aviation radio land-air communication, and in particular to a language identification method and system for civil aviation multi-language radio land-air communication.
Background
With the rapid development of the global economy, civil aviation has grown quickly: the number of flights increases year by year, and the steadily rising air traffic flow keeps adding to controllers' workload. Controllers frequently need to communicate with domestic and international flight crews, and under high-intensity work their understanding of the land-air communication content is prone to deviation, which poses safety risks to the normal operation of airports.
To reduce controller workload, a large number of air traffic control automation systems have emerged to assist controllers and reduce potential safety hazards in airport control work. Because controllers and flight crews communicate directly by voice, most of these systems introduce speech recognition technology as the means of converting speech into text. However, most mature speech recognition technologies handle a single language, and because pronunciation dictionaries differ between languages, schemes based on mixed-language recognition struggle to achieve high accuracy. Language identification is therefore a key technology for civil aviation land-air communication speech recognition.
Most existing language identification technologies are based on neural network algorithms: acoustic features are extracted from large speech-language databases and a deep learning model is trained with back-propagation. However, such deep learning models have low robustness and often perform poorly in unseen environments absent from training; their inference is time-consuming; and their training and tuning steps are tedious, which increases system complexity.
Disclosure of Invention
The invention aims to solve the system-complexity problem of existing language identification, and provides a language identification method for civil aviation multi-language radio land-air communication. The invention also discloses a system implementing this language identification method.
The purpose of the invention is mainly realized by the following technical scheme:
the language identification method of civil aviation multi-language radio land-air conversation comprises the following steps:
constructing a text dictionary of each language, and respectively training and deploying an end-to-end speech recognition model of each language;
feeding the speech into the speech recognition model of each language for recognition to obtain a text, and obtaining the posterior probabilities output by each speech recognition model;
calculating, from the posterior probabilities output by each speech recognition model, the average probability of the most probable character in each non-blank frame of the speech, and taking each model's average as the confidence of the corresponding language;
scaling the confidence of each language so that the confidence means output by the speech recognition models are directly comparable;
and comparing, with the scaling factor applied, the confidences obtained from the speech recognition models, selecting the language whose model has the highest scaled confidence as the language of the current speech, and returning that model's output text as the current text. In application, a civil-aviation speech recognition model is trained for each language; the user's audio is fed into every model to obtain, for each audio frame, the posterior probability of every character; for each model the average posterior probability of the selected characters is computed and used as the language confidence of the input speech under that model's language; the confidences are compared, the language with the higher confidence is output as the language of the current speech, and that language's recognition result is returned as the recognition result of the current speech. Compared with traditional language identification methods, this requires no additional module, has lower computational complexity, and, because it exploits the prior knowledge embedded in speech recognition, offers better generalization and robustness for language identification.
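The steps above can be sketched end-to-end as follows. This is a minimal illustration, not the patent's implementation: the function names, the toy posterior matrices, and the per-language normalisation by ln(vocabulary size) (mathematically equivalent to the pairwise scaling factor described later) are all assumptions for the sketch.

```python
import numpy as np

def confidence_score(posteriors: np.ndarray, blank_id: int) -> float:
    """Mean natural log of each frame's maximum posterior, skipping frames
    whose most likely symbol is the CTC blank (they carry no linguistic info)."""
    best = posteriors.argmax(axis=1)           # most likely symbol per frame
    valid = best != blank_id                   # keep only non-blank frames
    if not valid.any():                        # blank-only audio: no evidence
        return float("-inf")
    return float(np.log(posteriors[valid].max(axis=1)).mean())

def identify_language(posteriors_by_lang, blank_ids, vocab_sizes):
    """Pick the language whose model is most confident, after dividing each
    score by ln(vocab size) so vocabularies of different sizes compare fairly
    (dividing both sides of Score1 > (lnN1/lnN2)*Score2 by lnN1 gives this)."""
    scaled = {
        lang: confidence_score(p, blank_ids[lang]) / np.log(vocab_sizes[lang])
        for lang, p in posteriors_by_lang.items()
    }
    return max(scaled, key=scaled.get)
```

Given one posterior matrix per language (frames × vocabulary), `identify_language` returns the winning language key; the matrices would come from the per-language recognizers.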
Further, the text dictionary is expressed as $W = \{w_1, w_2, \ldots, w_N\}$, and a piece of text corresponding to a speech segment is represented as $S = (w_{i_1}, w_{i_2}, \ldots, w_{i_m})$, where $N$ is the size of the vocabulary, determined by the number of vocabulary items of each language, $w_N$ is the $N$-th character in the dictionary, $m$ is the total number of audio frames of the speech segment, $i_1, i_2, i_3, \ldots, i_m$ are the dictionary indices of the characters corresponding to frames $1, 2, 3, \ldots, m$, and $w_{i_m}$ is the character corresponding to frame $m$;
the training and deployment of the end-to-end speech recognition models of the various languages comprises: constructing a deep learning speech recognition model for each language, extracting FBank features of the speech as model input, and training with a CTC-loss-based strategy to obtain an end-to-end speech recognition model for each language.
Further, feeding the speech into the speech recognition models of the respective languages for recognition to obtain a text, and obtaining the posterior probability output by each speech recognition model, comprises:
the speech is fed into the speech recognition model of each language: from the speech an audio sequence $X = (x_1, x_2, \ldots, x_T)$ of $T$ feature vectors is computed and input into the speech recognition model, which yields, for every audio frame, the posterior output over each character or letter in the dictionary of that language's model, together with the greedily decoded text. The posterior output of the speech recognition model is:

$z_t = (z_{t,1}, z_{t,2}, \ldots, z_{t,N})$

where $z_t$ denotes the vector in frame $t$ corresponding to each word in the vocabulary, $t = 1, 2, \ldots, T$, and $z_{t,n}$ denotes the weight the speech recognition model assigns in frame $t$ to the $n$-th character of the vocabulary, $n = 1, 2, \ldots, N$. The posterior probability of the audio output by the speech recognition model is:

$P(w_n \mid x_t, \theta) = \mathrm{softmax}(z_t)_n = \dfrac{e^{z_{t,n}}}{\sum_{k=1}^{N} e^{z_{t,k}}}$

where $x_t$ denotes the $t$-th audio frame, $\theta$ denotes the parameters of the speech recognition model, and $P(w_n \mid x_t, \theta)$ denotes the posterior probability the model assigns to the $n$-th character of the language in the given frame.
Further, calculating, from the posterior probabilities output by each speech recognition model, the average probability of the most probable character in each non-blank frame of the speech comprises:
for the output vector of each time step of each language's speech recognition model, taking the maximum value as the confidence of the current frame, and taking the character attaining that maximum as the speech recognition output character of the current time step;
removing the probability output of blank frames, taking the natural logarithm of all probabilities output by each speech recognition model, and then averaging all valid outputs of each model to obtain the average probability of the whole speech segment under the corresponding language's model; the score is calculated as:

$Score = \dfrac{1}{|V|} \sum_{t \in V} \ln \max_{n} P(w_n \mid x_t, \theta), \quad V = \{\, t \mid \arg\max_{n} P(w_n \mid x_t, \theta) \neq b \,\}$

where $Score$ denotes the confidence of the speech under the speech recognition model, $b$ denotes the position of the blank character in the vocabulary, $|V|$ is the size of the set $V$, i.e. the number of probability values of valid (non-blank) audio frames, $\max$ takes the maximum probability value, $\ln$ takes the natural logarithm, $\theta$ denotes the parameters of the speech recognition model, $P(w_n \mid x_t, \theta)$ denotes the posterior probability the model assigns to the $n$-th character of the language in the given frame, and $T$ denotes the total number of feature vectors in the speech. In application, the probability output of blank frames is removed and does not enter the score calculation because speech recognition output usually contains a large number of blank, meaningless symbols that carry no linguistic information and merely mark silent frames of the audio or the boundary between two characters. The natural logarithm is taken of each probability because, if the probabilities stayed confined to the interval 0–1, the probabilities of individual frames would be diluted by averaging and the final confidences would be squeezed into a narrow range; taking the logarithm makes the confidence of the most probable word in a frame more discriminative.
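A tiny worked instance of this scoring rule; the symbols and probability values below are hypothetical, not taken from the patent:

```python
import math

# Frame-wise (argmax symbol, max posterior) pairs from a hypothetical CTC model;
# "<blk>" marks frames whose most likely symbol is the blank.
frames = [("<blk>", 0.99), ("g", 0.90), ("o", 0.85), ("<blk>", 0.97), ("h", 0.80)]

valid = [p for sym, p in frames if sym != "<blk>"]    # drop blank frames
score = sum(math.log(p) for p in valid) / len(valid)  # mean natural log
```

Only the three non-blank frames contribute, and the score is the mean of their log-probabilities (a non-positive number, closer to zero when the model is more confident).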
Further, when scaling the confidence corresponding to each language, for any two languages the scaling factor is calculated as:

$\alpha_{1,2} = \dfrac{\ln N_1}{\ln N_2}$

where $\alpha_{1,2}$ denotes the scaling factor of the first language relative to the second language, $N_1$ denotes the vocabulary size of the first language, $N_2$ denotes the vocabulary size of the second language, $T$ denotes the total number of feature vectors in the speech, and $t = 1, 2, \ldots, T$.
Further, the decision formula for comparing the confidences obtained by the speech recognition models is:

$\text{language} = \begin{cases} \text{first language}, & Score_1 > \alpha_{1,2} \cdot Score_2 \\ \text{second language}, & Score_1 < \alpha_{1,2} \cdot Score_2 \end{cases}$

where $Score_1$ is the confidence output by the first speech recognition model, $Score_2$ is the confidence output by the second speech recognition model, and $\alpha_{1,2}$ is the scaling factor. With the scaling factor applied, if the confidence of the first language is greater than that of the second language, the speech is identified as the first language; if it is less, the speech is identified as the second language.
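The scaling factor and the pairwise decision can be sketched directly; the function names and the tie-breaking choice (ties fall to the second language) are illustrative assumptions:

```python
import math

def scaling_factor(n1: int, n2: int) -> float:
    """alpha_{1,2} = ln(N1)/ln(N2): equalises the score ln(1/N) that a
    fully confused model produces, whatever its vocabulary size."""
    return math.log(n1) / math.log(n2)

def decide(score1: float, score2: float, n1: int, n2: int) -> int:
    """Return 1 if the first language wins, else 2 (ties fall to language 2)."""
    return 1 if score1 > scaling_factor(n1, n2) * score2 else 2
```

With vocabulary sizes 5207 (Chinese) and 28 (English), the factor comes out near 2.57, matching the Chinese/English example later in the description.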
The system implementing the language identification method for civil aviation multi-language radio land-air communication comprises:
a text dictionary construction module, used for constructing the text dictionary of each language;
a speech recognition model training module, used for training and deploying the end-to-end speech recognition model of each language;
a text and posterior probability acquisition module, used for obtaining the text and posterior probabilities recognized by the speech recognition models of the various languages from the input speech;
a language analysis module, used for obtaining the texts and posterior probabilities recognized by the speech recognition models, calculating from the posterior probabilities the average probability of the most probable character in each non-blank frame of the speech, and taking each model's average as the confidence of the corresponding language;
a speech confidence scaling module, used for scaling the confidence of each language so that the confidence means output by the speech recognition models are directly comparable;
a confidence comparison module, used for comparing, with the scaling factor applied, the confidences obtained by the speech recognition models and selecting the language of the model with the highest scaled confidence as the language of the current speech;
and a text output module, used for outputting the text produced by the speech recognition model with the highest scaled confidence as the current text.
In conclusion, compared with the prior art, the invention has the following beneficial effects: (1) The method builds on existing speech recognition models: the confidence of the speech under a model is obtained by averaging the log posterior probabilities of the valid audio frames output by that model; the confidences of the different models are then compared after weighting by vocabulary size, and the language whose model attains the maximum score is selected, yielding the language information of the speech. Because linguistic information of the speech is exploited, the method overcomes the low generalization and robustness of traditional language identification methods and can adapt to a variety of complex noise environments.
(2) The invention uses only the posterior probability of speech recognition as the information for language identification, so its additional computational complexity is almost negligible compared with traditional deep-learning schemes, and since the speech recognition models can run inference in parallel it introduces no excessive performance loss.
(3) The method adapts to more complex speech environments without introducing high computational complexity, and can be effectively applied to civil aviation land-air communication speech analysis tasks to assist airport controllers in decision-making.
(4) The invention can be applied to the discrimination of Chinese, English, Japanese, Russian and other languages, and has a wide application range.
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention. In the drawings:
FIG. 1 is a flow chart of an embodiment of the present invention;
fig. 2 is a block diagram of a system in an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the invention is further described in detail below with reference to the examples and accompanying drawings; the exemplary embodiments and their description are used only to explain the invention and are not meant to limit it.
Example 1:
as shown in fig. 1, the language identification method for civil aviation multi-language radio land-air communication includes: constructing a text dictionary for each language, and training and deploying an end-to-end speech recognition model for each language; feeding the speech into the speech recognition model of each language to obtain a recognized text and the posterior probabilities output by each model; calculating, from the posterior probabilities output by each speech recognition model, the average probability of the most probable character in each non-blank frame, and taking each model's average as the confidence of the corresponding language; scaling the confidence of each language so that the confidence means output by the models are directly comparable; and comparing, with the scaling factor applied, the confidences obtained by the speech recognition models, selecting the language of the model with the highest scaled confidence as the language of the current speech, and returning that model's output text as the current text.
The text dictionary of this embodiment is expressed as $W = \{w_1, w_2, \ldots, w_N\}$, and a piece of text corresponding to a speech segment is represented as $S = (w_{i_1}, w_{i_2}, \ldots, w_{i_m})$, where $N$ is the vocabulary size determined by the number of vocabulary items of each language, $w_N$ is the $N$-th character in the dictionary, $m$ is the total number of audio frames of the speech segment, $i_1, i_2, i_3, \ldots, i_m$ are the dictionary indices of the characters corresponding to frames $1, 2, 3, \ldots, m$, and $w_{i_m}$ is the character corresponding to frame $m$. In this embodiment, training and deploying the end-to-end speech recognition models of the various languages specifically includes: constructing a deep learning speech recognition model for each language, extracting FBank features of the speech as model input, and training with a CTC-loss-based strategy to obtain the end-to-end speech recognition model of each language. CTC (Connectionist Temporal Classification) is a sequence-alignment loss function: in this embodiment CTC is used as the loss function during training, and at inference the model output is decoded and searched under the CTC modeling strategy, so that a posterior probability vector in one-to-one correspondence with each audio frame is obtained.
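The greedy CTC decoding referred to above (take the per-frame argmax, collapse consecutive repeats, then drop blanks) can be sketched as follows; the function name and the blank index 0 are illustrative assumptions:

```python
def ctc_greedy_decode(best_ids, blank_id=0):
    """Greedy CTC decoding: collapse consecutive repeats of the same symbol,
    then remove blanks, turning per-frame argmax ids into a label sequence."""
    out, prev = [], None
    for i in best_ids:
        if i != prev and i != blank_id:
            out.append(i)
        prev = i
    return out
```

For example, the frame sequence `[0, 1, 1, 0, 2, 2, 2, 0, 1]` decodes to the labels `[1, 2, 1]`.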
In this embodiment, feeding the speech into the speech recognition models of the respective languages for recognition to obtain text, and obtaining the posterior probability output by each model, specifically includes: the speech is fed into the speech recognition model of each language; an audio sequence $X = (x_1, x_2, \ldots, x_T)$ of $T$ feature vectors, each of 64 dimensions, is computed and input into the model to obtain, for every audio frame, the posterior output over each word or letter in the dictionary of that language's model, together with the greedily decoded text. The posterior output of the speech recognition model is:

$z_t = (z_{t,1}, z_{t,2}, \ldots, z_{t,N}), \quad t = 1, 2, \ldots, T$

where $z_t$ denotes the vector in frame $t$ corresponding to each word in the vocabulary and $z_{t,n}$ denotes the weight assigned in frame $t$ to the $n$-th character, $n = 1, 2, \ldots, N$. The posterior probability of the audio output by the model is:

$P(w_n \mid x_t, \theta) = \mathrm{softmax}(z_t)_n = \dfrac{e^{z_{t,n}}}{\sum_{k=1}^{N} e^{z_{t,k}}}$

where $x_t$ denotes the $t$-th audio frame, $\theta$ denotes the model parameters, $P(w_n \mid x_t, \theta)$ denotes the posterior probability of the $n$-th character in the given frame, and softmax normalizes the input. In this embodiment, when computing the feature vectors of the audio sequence, the audio is first pre-emphasized and then divided into frames with a frame length of 20 ms and a frame shift of 10 ms; after framing, a discrete Fourier transform is applied to the audio sequence to obtain frequency-domain features, and the FBank features are then extracted through a 64-dimensional mel filter bank.
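A rough sketch of the FBank front end described in this paragraph (pre-emphasis, 20 ms frames with 10 ms shift, DFT, 64-band mel filter bank). The FFT size, Hamming window, and pre-emphasis coefficient are illustrative assumptions not specified in the text:

```python
import numpy as np

def fbank(signal, sr=16000, n_mels=64, frame_ms=20, shift_ms=10, preemph=0.97):
    """Log mel filterbank (FBank) features: one 64-dim vector per 10 ms frame."""
    sig = np.append(signal[0], signal[1:] - preemph * signal[:-1])  # pre-emphasis
    flen, fshift = int(sr * frame_ms / 1000), int(sr * shift_ms / 1000)
    n_frames = 1 + max(0, (len(sig) - flen) // fshift)
    idx = np.arange(flen)[None, :] + fshift * np.arange(n_frames)[:, None]
    frames = sig[idx] * np.hamming(flen)                            # windowed frames
    nfft = 512
    power = np.abs(np.fft.rfft(frames, nfft)) ** 2                  # power spectrum

    # triangular mel filterbank spanning 0 .. sr/2
    def hz2mel(f): return 2595 * np.log10(1 + f / 700)
    def mel2hz(m): return 700 * (10 ** (m / 2595) - 1)
    mels = np.linspace(hz2mel(0), hz2mel(sr / 2), n_mels + 2)
    bins = np.floor((nfft + 1) * mel2hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, nfft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        if c > l: fb[i - 1, l:c] = (np.arange(l, c) - l) / (c - l)
        if r > c: fb[i - 1, c:r] = (r - np.arange(c, r)) / (r - c)
    return np.log(power @ fb.T + 1e-10)                             # log mel energies
```

One second of 16 kHz audio yields 99 frames of 64-dimensional features under these settings.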
In this embodiment, calculating, from the posterior probabilities output by each speech recognition model, the average probability of the most probable character in each non-blank frame specifically includes: for the output vector of each time step of each language's model, the maximum value is taken as the confidence of the current frame, and the character attaining the maximum is taken as the speech recognition output character of that time step; the probability output of blank frames is removed, the natural logarithm of all remaining probabilities is taken, and all valid outputs of each model are averaged to obtain the average probability of the whole speech segment under the corresponding language's model; the score is calculated as:

$Score = \dfrac{1}{|V|} \sum_{t \in V} \ln \max_{n} P(w_n \mid x_t, \theta), \quad V = \{\, t \mid \arg\max_{n} P(w_n \mid x_t, \theta) \neq b \,\}$

where $Score$ denotes the confidence of the speech under the speech recognition model (each term is a log-probability, so scores are non-positive and a score closer to zero indicates higher confidence), $b$ denotes the position of the blank character in the vocabulary, $|V|$ is the number of valid (non-blank) audio frames, $\theta$ denotes the model parameters, $P(w_n \mid x_t, \theta)$ denotes the posterior probability of the $n$-th character in the given frame, and $T$ denotes the total number of feature vectors in the speech.
In application, the vocabulary sizes of the various languages differ, and because of the softmax normalization the confidence means computed by the different speech recognition models are not directly comparable. Suppose there is an audio for which every speech recognition model predicts, in every frame, an equal probability for every word: equal probabilities mean that no model can correctly recognize the text of the speech and every model is equally confused about every frame, so the final scores of the models should be equal. On this basis, a scaling factor is added to the score of one of any two compared speech recognition models, giving the identity:

$\alpha_{1,2} \cdot \dfrac{1}{T}\sum_{t=1}^{T} \ln \dfrac{1}{N_2} = \dfrac{1}{T}\sum_{t=1}^{T} \ln \dfrac{1}{N_1}$

from which, by transformation, the scaling factor for any two languages is derived as:

$\alpha_{1,2} = \dfrac{\ln N_1}{\ln N_2}$

where $\alpha_{1,2}$ denotes the scaling factor of the first language relative to the second language, $N_1$ denotes the vocabulary size of the first language, $N_2$ denotes the vocabulary size of the second language, $T$ denotes the total number of feature vectors in the speech, and $t = 1, 2, \ldots, T$.
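The identity can be checked numerically for the Chinese/English vocabulary sizes used later in this embodiment:

```python
import math

n_zh, n_en = 5207, 28                # vocabulary sizes (incl. blank symbols)
alpha = math.log(n_zh) / math.log(n_en)

# A fully confused model scores ln(1/N) on every valid frame, so the
# averaged scores obey score_zh == alpha * score_en exactly.
score_zh = math.log(1 / n_zh)
score_en = math.log(1 / n_en)
```

Here `alpha` comes out to roughly 2.57, and the two fully-confused scores match exactly after scaling.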
The decision formula for comparing the confidences obtained by the speech recognition models in this embodiment is:

$\text{language} = \begin{cases} \text{first language}, & Score_1 > \alpha_{1,2} \cdot Score_2 \\ \text{second language}, & Score_1 < \alpha_{1,2} \cdot Score_2 \end{cases}$

where $Score_1$ is the confidence output by the first speech recognition model, $Score_2$ is the confidence output by the second speech recognition model, and $\alpha_{1,2}$ is the scaling factor. With the scaling factor applied, if the confidence of the first language is greater than that of the second language, the speech is the first language; if it is less, the speech is the second language. Finally, the speech recognition text of the judged language is output as the true text of the model. If the two confidences are equal, usually because the segment is blank speech or the speech content is very short (less than 0.5 seconds), this embodiment defaults to selecting one language as the judgment in this extreme case.
As shown in fig. 2, this embodiment further discloses a system implementing the language identification method for civil aviation multi-language radio land-air communication, comprising: a text dictionary construction module, for constructing the text dictionary of each language; a speech recognition model training module, for training and deploying the end-to-end speech recognition model of each language; a text and posterior probability acquisition module, for obtaining the text and posterior probabilities recognized by the speech recognition models of the various languages from the input speech; a language analysis module, for obtaining the texts and posterior probabilities recognized by the models, calculating from the posterior probabilities the average probability of the most probable character in each non-blank frame, and taking each model's average as the confidence of the corresponding language; a speech confidence scaling module, for scaling the confidence of each language so that the confidence means output by the models are directly comparable; a confidence comparison module, for comparing, with the scaling factor applied, the confidences obtained by the models and selecting the language of the model with the highest scaled confidence as the language of the current speech; and a text output module, for outputting the text produced by the model with the highest scaled confidence as the current text.
In application, each frame of the speech is scored for confidence using the prior linguistic information of speech recognition, and the language of the speech is judged by comparing the confidences output by the speech recognition models of the different languages.
The application of this embodiment to Chinese and English recognition is illustrated below. The Chinese text dictionary contains 5207 characters in total (including a blank character), and the English dictionary 28 letters (including the space and blank characters). As an example of text representation, a Chinese control phrase such as "Air China three zero two, turn right heading three five" is represented as the sequence of its individual Chinese characters. For a computed audio sequence $X = (x_1, x_2, \ldots, x_T)$ of $T$ feature vectors, input to the speech recognition models yields, for every audio frame, the posterior outputs over each word or letter in the dictionaries of the Chinese and English models respectively, together with the greedily decoded texts. Using $N_{zh}$ and $N_{en}$ to denote the sizes of the Chinese and English vocabularies, and $z_t^{zh}$ and $z_t^{en}$ the vectors in frame $t$ corresponding to each word in the respective vocabularies, the posterior probabilities of the audio under the Chinese and English speech recognition models can be expressed as:

$P(w_n \mid x_t, \theta_{zh}) = \mathrm{softmax}(z_t^{zh})_n, \qquad P(w_n \mid x_t, \theta_{en}) = \mathrm{softmax}(z_t^{en})_n$

where $\theta_{zh}$ denotes the parameters of the Chinese model, $\theta_{en}$ denotes the parameters of the English model, and $P(w_n \mid x_t, \theta_{zh})$, $P(w_n \mid x_t, \theta_{en})$ denote the posterior probabilities the Chinese and English recognition models assign to the $n$-th character of the respective language in the given frame.
The Chinese and English scores are calculated in the same way as above:

$Score_{zh} = \dfrac{1}{|V_{zh}|}\sum_{t \in V_{zh}} \ln \max_{n} P(w_n \mid x_t, \theta_{zh}), \qquad Score_{en} = \dfrac{1}{|V_{en}|}\sum_{t \in V_{en}} \ln \max_{n} P(w_n \mid x_t, \theta_{en})$

where $Score_{zh}$ and $Score_{en}$ denote the confidences of the speech under the Chinese and English speech recognition models respectively.
Considering the inconsistency of the Chinese and English word lists in size, becauseThe normalization calculation of (2) can cause the mean value of confidence probability calculated by Chinese and English models to be inconsistent, and effective comparison cannot be carried out. Assuming an audio, the probability of each word predicted by the Chinese-English model for each frame is equal, i.e. the probability is equal,Specifically, if a Chinese model based on 5207 common words and an English model based on 28 letters are used, the output probability of the language module corresponding to the Chinese model isThe probability output by the language module corresponding to the English recognition model is. The probability of each word is equal, which means that the Chinese-English model cannot correctly recognize the text of the voice, and each frame of the audio is equally confused, so that the final score of the Chinese-English voice recognition model should be equal, and based on the result, a scaling factor is added to the score of the English recognition model, the following identity can be obtained:
(1/T) Σ_{t=1}^{T} ln(1/N_zh) = α · (1/T) Σ_{t=1}^{T} ln(1/N_en)

wherein α represents the scaling factor, N_zh represents the size of the Chinese vocabulary, and N_en represents the size of the English vocabulary. By transformation it can be deduced that:

α = ln(N_zh) / ln(N_en)
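The derivation above reduces to a one-line computation; a sketch (function name hypothetical), using the vocabulary sizes 5207 and 28 from the text:

```python
import math

def scaling_factor(vocab_size_1, vocab_size_2):
    """alpha such that mean ln(1/N_1) == alpha * mean ln(1/N_2),
    i.e. the two models score equally on a uniformly confused frame."""
    return math.log(vocab_size_1) / math.log(vocab_size_2)

alpha = scaling_factor(5207, 28)  # Chinese vs English vocabulary sizes
```

With these sizes alpha is roughly 2.57, so the (negative) English log score is stretched onto the Chinese scale before comparison.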
By comparing the confidences with the obtained scaling factor, the language corresponding to the speech recognition model with the maximum scaled confidence is taken as the current language. Specifically, if the vocabulary sizes of Chinese and English are 5207 and 28, respectively, then α = ln(5207) / ln(28) ≈ 2.57. The final language can be judged by the following formula:

language = Chinese, if Score_zh > α · Score_en; English, otherwise
after the scaling factor is added, if the Chinese confidence coefficient is greater than the English confidence coefficient, the voice is represented as Chinese, otherwise, the voice is represented as English. Finally, combining the text of voice recognition as the real text of the model after language judgment and outputting. Specifically, the audio has the following section, the Chinese character is identified as (dog postbox), and after passing through the Chinese speech recognition model, the probability of each output character isEnglish recognition is (go home), and the probability of each letter is. Respectively taking the natural logarithm value as,. To obtain the mean values thereof respectivelyAnd finallyTherefore, the audio is determined to be english, and the text is determined to be (go home).
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (6)
1. The language identification method of civil aviation multi-language radio land-air conversation is characterized by comprising the following steps:
constructing a text dictionary of each language, and respectively training and deploying an end-to-end speech recognition model of each language;
respectively sending the voice into voice recognition models of various languages for recognition to obtain texts, and obtaining posterior probabilities output by the voice recognition models;
calculating, from the posterior probability output by each speech recognition model, the mean probability of the highest-probability character in each non-blank frame of the speech, and taking the mean probability output by each speech recognition model as the confidence of the corresponding language;
scaling the confidence degrees corresponding to each language to make the confidence probability mean values output by each speech recognition model consistent;
comparing, based on the scaling factors, the confidences obtained by the speech recognition models, selecting the language corresponding to the speech recognition model with the highest scaled confidence as the language of the current speech, and returning the output text of the speech recognition model of that language as the current text;
the calculating the probability average value of each non-blank frame probability maximum character of the voice through the posterior probability output by each voice recognition model comprises the following steps:
for the output vector of each time step of each language voice recognition model, taking the maximum value as the confidence coefficient of the current frame, and taking the character corresponding to the maximum value as the voice recognition output character of the current time step;
removing the probability output of blank frames, taking the natural logarithm of all probabilities output by each speech recognition model, and then taking the average of all valid outputs of each speech recognition model to obtain the mean probability of the whole speech under the speech recognition model of the corresponding language; the score calculation formula is as follows:

Score = (1 / |T_valid|) Σ_{t ∈ T_valid} ln max_n p(y_t = w_n | x_t; θ),  T_valid = { t ∈ {1, …, T} : argmax_n p(y_t = w_n | x_t; θ) ≠ blank }
wherein Score represents the confidence of the speech under the speech recognition model, blank indicates the position of the blank character in the vocabulary, |T_valid| is the size of the set T_valid, i.e. the number of probability values of valid characters over all audio frames, max denotes taking the maximum probability value, ln denotes taking the natural logarithm, θ represents the parameters of the speech recognition model, p(y_t | x_t; θ) represents the posterior probability the speech recognition model assigns to the t-th character of a language in a given frame, and T represents the total number of feature vectors in the speech.
2. The language identification method for civil aviation multilingual radio land-air communication according to claim 1, wherein the text dictionary is represented as D = (w_1, w_2, …, w_N), and a piece of text corresponding to a speech is represented as W = (w_{s_1}, w_{s_2}, …, w_{s_m}), wherein N is the size of the vocabulary, determined according to the number of vocabulary items of each language, w_N is the N-th character in the dictionary, m represents the total number of audio frames of a piece of speech, s_1, s_2, s_3, …, s_m are the indices in the dictionary of the characters corresponding to the 1st, 2nd, 3rd, …, m-th frames of audio, and w_{s_m} denotes the character corresponding to the m-th frame of audio;
the training and deploying of the end-to-end speech recognition models of various languages comprises the following steps: and respectively constructing a deep learning voice recognition model for each language, extracting FBank characteristics of the voice as the input of the model, and training to obtain an end-to-end voice recognition model of each language through a training strategy based on a CTC loss function.
3. The language identification method of civil aviation multilingual radio land-air communication according to claim 1, wherein the step of sending the speech into the speech recognition models of different languages respectively for recognition to obtain texts, and obtaining the posterior probabilities output by the speech recognition models comprises:
the speech is fed into the speech recognition model of each language; for a piece of audio whose features have been calculated, the audio is a sequence of T feature vectors X = (x_1, x_2, …, x_T); after X is input into the speech recognition model, the posterior output over each word or letter of the dictionary of the speech recognition model of the corresponding language and the greedy-decoded text are obtained for every audio frame, and the posterior output of the speech recognition model is:

Y = (y_1, y_2, …, y_T)
wherein N is the vocabulary size, y_t denotes the probability vector over the words of the vocabulary in the t-th frame of audio, and t takes the values 1, 2, …, T:

y_t = (p_t(w_1), p_t(w_2), …, p_t(w_N))
wherein p_t(w_n) denotes the weight assigned by the speech recognition model in the t-th frame to the n-th character of the vocabulary, n takes the values 1, 2, …, N, and the posterior probability of the audio output by the speech recognition model is:

P(Y | X; θ) = Π_{t=1}^{T} p(y_t | x_t; θ)
wherein x_t denotes the t-th frame of audio, θ represents the parameters of the speech recognition model, and p(y_t | x_t; θ) represents the posterior probability the speech recognition model assigns to the t-th character of a language given the frame.
4. The language identification method of civil aviation multilingual radio land-air communication according to claim 1, wherein, when the confidences corresponding to the languages are scaled, the scaling factor for any two languages is calculated as follows:

(1/T) Σ_{t=1}^{T} ln(1/N_1) = α · (1/T) Σ_{t=1}^{T} ln(1/N_2), whence α = ln(N_1) / ln(N_2)
wherein α represents the scaling factor of the first language relative to the second language, N_1 represents the size of the vocabulary of the first language, N_2 represents the size of the vocabulary of the second language, T represents the total number of feature vectors in the speech, and t takes the values 1, 2, …, T.
5. The language identification method of civil aviation multi-language radio land-air communication according to claim 1, wherein the confidences obtained by the speech recognition models are compared and the language is determined according to the following formula:

language = first language, if Score_1 > α · Score_2; second language, otherwise
wherein Score_1 is the confidence output by the first speech recognition model, Score_2 is the confidence output by the second speech recognition model, and α is the scaling factor; after the scaling factor is applied, if the confidence of the first language is greater than the scaled confidence of the second language, the speech is judged to be the first language, and if it is less, the speech is judged to be the second language.
6. A system for implementing the language identification method for civil aviation multilingual radio land-air communication according to any one of claims 1 to 5, comprising:
the text dictionary building module is used for building text dictionaries of various languages;
the speech recognition model training module is used for training and deploying speech recognition models of various languages from end to end;
the text and posterior probability acquisition module is used for acquiring texts and posterior probabilities which are obtained by recognition in the voice recognition models of various languages of the voice input;
the language analysis module is used for acquiring texts and posterior probabilities obtained by the recognition of the voice recognition models, calculating the probability average value of the maximum character of each non-blank frame probability of the voice according to the posterior probabilities output by the voice recognition models, and taking the probability average value output by each voice recognition model as the confidence coefficient of the corresponding language;
the voice confidence scaling module is used for scaling the confidence corresponding to each language so as to lead the confidence probability mean values output by each voice recognition model to be consistent;
the confidence comparison module is used for comparing, based on the scaling factor, the confidences obtained by the speech recognition models and selecting the language corresponding to the speech recognition model with the highest scaled confidence as the language of the current speech;
the text output module is used for outputting the output text of the speech recognition model with the maximum confidence coefficient as the current text;
the calculating the probability average value of each non-blank frame probability maximum character of the voice through the posterior probability output by each voice recognition model comprises the following steps:
for the output vector of each time step of each language voice recognition model, taking the maximum value as the confidence coefficient of the current frame, and taking the character corresponding to the maximum value as the voice recognition output character of the current time step;
removing the probability output of blank frames, taking the natural logarithm of all probabilities output by each speech recognition model, and then taking the average of all valid outputs of each speech recognition model to obtain the mean probability of the whole speech under the speech recognition model of the corresponding language; the score calculation formula is as follows:

Score = (1 / |T_valid|) Σ_{t ∈ T_valid} ln max_n p(y_t = w_n | x_t; θ),  T_valid = { t ∈ {1, …, T} : argmax_n p(y_t = w_n | x_t; θ) ≠ blank }
wherein Score represents the confidence of the speech under the speech recognition model, blank indicates the position of the blank character in the vocabulary, |T_valid| is the size of the set T_valid, i.e. the number of probability values of valid characters over all audio frames, max denotes taking the maximum probability value, ln denotes taking the natural logarithm, θ represents the parameters of the speech recognition model, p(y_t | x_t; θ) represents the posterior probability the speech recognition model assigns to the t-th character of a language in a given frame, and T represents the total number of feature vectors in the speech.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211331120.0A CN115394288B (en) | 2022-10-28 | 2022-10-28 | Language identification method and system for civil aviation multi-language radio land-air conversation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115394288A CN115394288A (en) | 2022-11-25 |
CN115394288B true CN115394288B (en) | 2023-01-24 |
Family
ID=84115019
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211331120.0A Active CN115394288B (en) | 2022-10-28 | 2022-10-28 | Language identification method and system for civil aviation multi-language radio land-air conversation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115394288B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105280181A (en) * | 2014-07-15 | 2016-01-27 | 中国科学院声学研究所 | Training method for language recognition model and language recognition method |
CN110895932A (en) * | 2018-08-24 | 2020-03-20 | 中国科学院声学研究所 | Multi-language voice recognition method based on language type and voice content collaborative classification |
CN111402861A (en) * | 2020-03-25 | 2020-07-10 | 苏州思必驰信息科技有限公司 | Voice recognition method, device, equipment and storage medium |
CN112017676A (en) * | 2019-05-31 | 2020-12-01 | 京东数字科技控股有限公司 | Audio processing method, apparatus and computer readable storage medium |
CN112951240A (en) * | 2021-05-14 | 2021-06-11 | 北京世纪好未来教育科技有限公司 | Model training method, model training device, voice recognition method, voice recognition device, electronic equipment and storage medium |
CN113298188A (en) * | 2021-06-28 | 2021-08-24 | 深圳市商汤科技有限公司 | Character recognition and neural network training method and device |
CN114648976A (en) * | 2022-02-16 | 2022-06-21 | 普强时代(珠海横琴)信息技术有限公司 | Language identification method and device, electronic equipment and medium |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP5530729B2 (en) * | 2009-01-23 | 2014-06-25 | 本田技研工業株式会社 | Speech understanding device |
JP5967569B2 (en) * | 2012-07-09 | 2016-08-10 | 国立研究開発法人情報通信研究機構 | Speech processing system |
CN106782513B (en) * | 2017-01-25 | 2019-08-23 | 上海交通大学 | Speech recognition realization method and system based on confidence level |
CN109119072A (en) * | 2018-09-28 | 2019-01-01 | 中国民航大学 | Civil aviaton's land sky call acoustic model construction method based on DNN-HMM |
CN112233653B (en) * | 2020-12-10 | 2021-03-12 | 北京远鉴信息技术有限公司 | Method, device and equipment for training multi-dialect accent mandarin speech recognition model |
- 2022-10-28: application CN202211331120.0A filed in China; granted as CN115394288B (status: Active)
Non-Patent Citations (1)
Title |
---|
Qin Chenguang, "Research on Language Identification Based on Multi-task Neural Networks," China Master's Theses Full-text Database, No. 2, 2021-02-15, full text. *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||