CN115394288B - Language identification method and system for civil aviation multi-language radio land-air conversation - Google Patents

Info

Publication number: CN115394288B (application CN202211331120.0A; earlier publication CN115394288A)
Authority: CN (China)
Prior art keywords: language, voice, recognition model, probability, output
Legal status: Active (granted)
Inventors: Zhang Huayong (张华勇), Liu Fengfeng (刘丰丰), Wang Xiaogang (王小刚)
Original and current assignee: Chengdu Aiwei Translation Technology Co., Ltd.
Original language: Chinese (zh)


Classifications

    • G PHYSICS › G10 MUSICAL INSTRUMENTS; ACOUSTICS › G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition › G10L 15/005 Language recognition
    • G10L 15/00 Speech recognition › G10L 15/06 Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice › G10L 15/063 Training
    • G10L 15/00 Speech recognition › G10L 15/26 Speech to text systems

Abstract

The invention discloses a language identification method and system for civil aviation multi-language radio land-air communication. The method comprises the following steps: constructing a text dictionary for each language, and separately training and deploying an end-to-end speech recognition model for each language; feeding the speech into the speech recognition model of each language to obtain a text and posterior probabilities; from the posterior probabilities output by each model, calculating the mean probability of the most probable character of each non-blank frame of the speech, and taking the mean output by each model as the confidence of the corresponding language; scaling the confidence of each language; and comparing the confidences obtained by the speech recognition models, selecting the language of the current speech, and returning the corresponding text. In application, the method requires no additional complex computation module, makes full use of the prior knowledge about each language embedded in the speech recognition models, and adapts well to language identification in complex environments.

Description

Language identification method and system for civil aviation multi-language radio land-air conversation
Technical Field
The invention relates to the field of civil aviation radio land-air communication, and in particular to a language identification method and system for civil aviation multi-language radio land-air communication.
Background
With the rapid growth of the global economy, civil aviation has developed quickly: the number of flights rises year by year, and ever-increasing air traffic keeps adding to controllers' workload. Controllers often need to communicate with pilots of domestic and international flights, and under high-intensity work their understanding of land-air communication content is prone to deviation, which brings safety risks to the normal operation of airports.
To reduce controllers' workload, a large number of air traffic control automation systems have emerged to assist controllers and reduce safety hazards in airport control work. Because controllers and flight crews communicate directly by voice, most systems introduce speech recognition technology as the means of converting speech into text. However, most existing mature speech recognition technologies are single-language, and because pronunciation dictionaries differ between languages, schemes based on mixed-language recognition struggle to reach high accuracy. Language identification is therefore a key technology for civil aviation land-air communication speech recognition.
Most existing language identification technologies are based on neural network algorithms: acoustic features are extracted from large speech-language databases, and a deep learning model is trained with the back-propagation algorithm. However, such models have low robustness and often perform poorly in unseen environments not covered in training; their inference is time-consuming; and the training and parameter-tuning steps are tedious. All of this increases system complexity.
Disclosure of Invention
The invention aims to solve the system-complexity problem of existing language identification, and provides a language identification method for civil aviation multilingual radio land-air communication. The invention also discloses a system implementing this language identification method.
The purpose of the invention is mainly realized by the following technical scheme:
the language identification method of civil aviation multi-language radio land-air conversation comprises the following steps:
constructing a text dictionary of each language, and respectively training and deploying an end-to-end speech recognition model of each language;
respectively sending the voice into voice recognition models of various languages for recognition to obtain texts, and obtaining posterior probabilities output by the voice recognition models;
from the posterior probabilities output by each speech recognition model, calculating the mean probability of the most probable character of each non-blank frame of the speech, and taking the mean output by each model as the confidence of the corresponding language;
scaling the confidence of each language so that the confidence probability means output by the speech recognition models become comparable;
and comparing, on the basis of the scaling factors, the confidences obtained by the speech recognition models, selecting the language corresponding to the model with the highest confidence as the language of the current speech, and returning that model's output text as the current text. In application, a speech recognition model oriented to civil aviation is trained separately for each language; the audio input by the user is fed into every model; the posterior probability of each character is obtained for every audio frame; the mean posterior probability of the characters output by each model is computed and used as the language confidence of the input speech under that model's language; all confidences are compared, the language with the highest confidence is chosen as the language of the current speech, and the recognition result of that language's model is taken as the recognition result of the current speech. Compared with traditional language identification methods, this method needs no additional module, has lower computational complexity, and, because it incorporates the prior knowledge of speech recognition, achieves higher generalization and robustness for language identification.
Further, the text dictionary is expressed as

$$W = \{\, w_1, w_2, \ldots, w_N \,\}$$

and a piece of text corresponding to a speech utterance is represented as

$$Y = (\, w_{y_1}, w_{y_2}, \ldots, w_{y_m} \,)$$

where $N$ is the size of the vocabulary, determined by the number of vocabulary items of each language; $w_N$ is the $N$-th character in the dictionary; $m$ represents the total number of audio frames of the utterance; $y_1, y_2, \ldots, y_m$ are the dictionary indices of the characters corresponding to frames $1, 2, 3, \ldots, m$; and $w_{y_m}$ is the character corresponding to the $m$-th frame;
the training and deployment of the end-to-end speech recognition models of the various languages comprises: constructing a deep learning speech recognition model for each language separately, extracting the FBank features of the speech as model input, and training with a CTC-loss-based strategy to obtain an end-to-end speech recognition model for each language.
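As a minimal sketch of the dictionary-building step, the snippet below collects the unique characters of a language corpus into an index map $W$ and encodes a transcript as the index sequence used as CTC training targets. All names and the tiny example corpus are illustrative assumptions, not the patent's implementation.

```python
# Hypothetical sketch: build a per-language text dictionary W = {w_1, ..., w_N}
# and encode a transcript as the index sequence (y_1, ..., y_m) for CTC targets.
BLANK = "<blank>"  # CTC blank symbol, conventionally placed at index 0

def build_dictionary(corpus_lines):
    """Collect the unique characters of a language corpus into an index map."""
    chars = sorted({ch for line in corpus_lines for ch in line})
    vocab = [BLANK] + chars          # N = len(vocab) is the vocabulary size
    return {w: i for i, w in enumerate(vocab)}

def encode(text, dictionary):
    """Map a transcript to the character-index sequence used as CTC targets."""
    return [dictionary[ch] for ch in text]

word2idx = build_dictionary(["right turn", "heading three five"])
targets = encode("three", word2idx)
```

In practice each language (Chinese characters, English letters) would get its own dictionary built this way before training its end-to-end model.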
Further, feeding the speech into the speech recognition model of each language for recognition to obtain a text, and obtaining the posterior probabilities output by each model, comprises:
the speech is fed into the speech recognition models of the various languages, and an audio sequence of $T$ feature vectors is computed,

$$X = (\, x_1, x_2, \ldots, x_T \,)$$

After $X$ is input into a speech recognition model, the posterior output over each character or letter of the corresponding language's dictionary is obtained for every audio frame, together with the greedily decoded text. The posterior output of the speech recognition model is

$$P = (\, p_1, p_2, \ldots, p_T \,)$$

where $p_t$ denotes the vector of weights over the words of the vocabulary at frame $t$, with $t$ taking the values $1, 2, \ldots, T$:

$$p_t = (\, z_{t,1}, z_{t,2}, \ldots, z_{t,N} \,)$$

Here $z_{t,k}$ denotes the weight assigned by the speech recognition model at frame $t$ to the $k$-th character of the vocabulary, with $k$ taking the values $1, 2, \ldots, N$, and the posterior probability of the audio output by the model is

$$P(w_k \mid x_t; \theta) = \mathrm{softmax}(z_{t,k}) = \frac{e^{z_{t,k}}}{\sum_{j=1}^{N} e^{z_{t,j}}}$$

where $x_t$ denotes the $t$-th audio frame, $\theta$ represents the parameters of the speech recognition model, and $P(w_k \mid x_t; \theta)$ is the posterior probability of the $k$-th character of the language given the frame.
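The softmax normalization of the per-frame weights can be sketched in a few lines of numpy. This is an illustrative sketch, not the patent's implementation; the tiny weight matrix is made up.

```python
# Turn per-frame weight vectors z_{t,k} of one language model into posterior
# probabilities P(w_k | x_t; theta) via a softmax over the vocabulary axis.
import numpy as np

def frame_posteriors(z):
    """z: (T, N) weights -> (T, N) posteriors; each row sums to 1."""
    z = z - z.max(axis=1, keepdims=True)   # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

z = np.array([[2.0, 0.0, 0.0],   # frame 1 favours character 1
              [0.0, 0.0, 3.0]])  # frame 2 favours character 3
P = frame_posteriors(z)
```

Each row of `P` is a distribution over the language's vocabulary for one audio frame, which is exactly the quantity the confidence score below consumes.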
Further, calculating, from the posterior probabilities output by each speech recognition model, the mean probability of the most probable character of each non-blank frame of the speech comprises:
for the output vector of each time step of each language's speech recognition model, taking the maximum value as the confidence of the current frame, and taking the character corresponding to that maximum as the recognized character of the current time step;
removing the probability output of blank frames, taking the natural logarithm of all probabilities output by each model, and then averaging all valid outputs of each model to obtain the mean probability of the whole utterance under the corresponding language's model. The score is calculated as follows:
$$\mathit{Score} = \frac{1}{\lvert \mathcal{T}_b \rvert} \sum_{t \in \mathcal{T}_b} \ln \Bigl( \max_{k} P(w_k \mid x_t; \theta) \Bigr), \qquad \mathcal{T}_b = \bigl\{\, t : \arg\max_{k} P(w_k \mid x_t; \theta) \neq b \,\bigr\}$$

where $\mathit{Score}$ represents the confidence of the speech under the speech recognition model; $b$ is the position of the blank character in the vocabulary; $\lvert \cdot \rvert$ takes the size of the set, i.e. the number of probability values of the valid characters of the audio frames; $\max$ takes the maximum probability value; $\ln$ takes the natural logarithm; $\theta$ represents the parameters of the speech recognition model; $P(w_k \mid x_t; \theta)$ is the posterior probability of the $k$-th character of the language given the frame; and $T$ represents the total number of feature vectors in the speech. In application, the probability output of blank frames is removed and does not enter the score calculation, because the model output usually contains a large number of blank, meaningless characters that carry no linguistic information and merely mark silent audio frames or the boundary between two characters. A natural logarithm is taken of each probability because, if the probabilities were confined to the interval $(0, 1)$, the contribution of individual frames would be diluted by averaging and the final confidences would fall in a narrow range; taking the logarithm makes the confidence of the most probable word in each frame more discriminative.
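The scoring step described above (drop blank-dominated frames, take the natural log of each remaining frame's maximum posterior, average) can be sketched as follows. This is a hedged sketch under the assumption that the blank symbol sits at index 0; the example matrix is made up.

```python
# Confidence score: mean ln(max posterior) over frames whose argmax is not
# the blank character. Values are <= 0; closer to zero means higher confidence.
import numpy as np

def confidence_score(posteriors, blank=0):
    """posteriors: (T, N) per-frame distributions -> scalar score."""
    best = posteriors.argmax(axis=1)
    valid = posteriors[best != blank]          # remove blank frames
    if valid.size == 0:                        # blank-only utterance: no evidence
        return float("-inf")
    return float(np.mean(np.log(valid.max(axis=1))))

P = np.array([[0.9, 0.05, 0.05],   # blank frame (argmax == 0) -> excluded
              [0.1, 0.8,  0.1 ],
              [0.2, 0.2,  0.6 ]])
score = confidence_score(P)        # mean of ln(0.8) and ln(0.6)
```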
Further, when the confidence corresponding to each language is scaled, for any two languages the scaling factor is computed as

$$\alpha_{12} = \frac{\frac{1}{T} \sum_{t=1}^{T} \ln \frac{1}{N_1}}{\frac{1}{T} \sum_{t=1}^{T} \ln \frac{1}{N_2}} = \frac{\ln N_1}{\ln N_2}$$

where $\alpha_{12}$ represents the scaling factor of the first language relative to the second language, $N_1$ denotes the vocabulary size of the first language, $N_2$ denotes the vocabulary size of the second language, $T$ represents the total number of feature vectors in the speech, and $t$ takes the values $1, 2, \ldots, T$.
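The scaling factor reduces to a ratio of log vocabulary sizes, so it is one line of code. The sizes 5207 and 28 below are taken from the patent's Chinese/English example purely as illustrative numbers.

```python
# Vocabulary-size scaling factor alpha_12 = ln(N1) / ln(N2).
import math

def scaling_factor(n1, n2):
    """Scaling factor of language 1 relative to language 2."""
    return math.log(n1) / math.log(n2)

alpha = scaling_factor(5207, 28)   # roughly 2.57 for the Chinese/English pair
```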
Further, the decision formula for comparing the confidences obtained by the speech recognition models is

$$\text{language} = \begin{cases} \text{first language}, & \mathit{Score}_1 > \alpha_{12} \cdot \mathit{Score}_2 \\ \text{second language}, & \mathit{Score}_1 < \alpha_{12} \cdot \mathit{Score}_2 \end{cases}$$

where $\mathit{Score}_1$ is the confidence output by the first speech recognition model, $\mathit{Score}_2$ is the confidence output by the second speech recognition model, and $\alpha_{12}$ is the scaling factor. After the scaling factor is applied, if the confidence of the first language is greater than that of the second language, the speech is in the first language; if it is smaller, the speech is in the second language.
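The pairwise decision rule can be sketched as below. The function name, labels, and default tie-breaking language are assumptions for illustration; the scores fed in are mean log-probabilities, hence non-positive, with values closer to zero winning.

```python
# Compare Score_1 against the scaled Score_2 and pick the language whose
# (scaled) confidence is larger; fall back to a default on an exact tie.
import math

def pick_language(score1, score2, n1, n2, default="language-1"):
    alpha = math.log(n1) / math.log(n2)      # scaling factor of lang 1 vs lang 2
    scaled2 = alpha * score2
    if score1 > scaled2:
        return "language-1"
    if score1 < scaled2:
        return "language-2"
    return default  # tie: e.g. blank or very short audio

lang = pick_language(-0.4, -1.2, 5207, 28)   # -0.4 beats 2.57 * -1.2
```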
The system implementing the language identification method for civil aviation multi-language radio land-air communication comprises:
a text dictionary building module, used to build the text dictionary of each language;
a speech recognition model training module, used to train and deploy the end-to-end speech recognition model of each language;
a text and posterior probability acquisition module, used to obtain the texts and posterior probabilities recognized by the speech recognition models of the various languages for the input speech;
a language analysis module, used to obtain the texts and posterior probabilities recognized by the speech recognition models, calculate from the posterior probabilities the mean probability of the most probable character of each non-blank frame of the speech, and take the mean output by each model as the confidence of the corresponding language;
a speech confidence scaling module, used to scale the confidence of each language so that the confidence probability means output by the models become comparable;
a confidence comparison module, used to compare, on the basis of the scaling factors, the confidences obtained by the speech recognition models and select the language corresponding to the model with the highest confidence as the language of the current speech;
and a text output module, used to output the text of the model with the highest confidence as the current text.
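The module flow above can be sketched end to end with stub recognizers. All names are assumptions, not the patent's API; each stub "model" returns a text and a (T, N) posterior matrix, and dividing each score by ln(N) is equivalent to applying the pairwise scaling factor ln(N1)/ln(N2).

```python
# Illustrative pipeline: score every language model on a common scale and
# return the winning language's label and recognized text.
import math
import numpy as np

def confidence(posteriors, blank=0):
    """Mean ln(max posterior) over non-blank frames (the per-model score)."""
    best = posteriors.argmax(axis=1)
    valid = posteriors[best != blank]
    return float(np.mean(np.log(valid.max(axis=1)))) if valid.size else float("-inf")

def identify(audio, models):
    """models: {lang: (recognize_fn, vocab_size)} -> (language, text)."""
    results = {}
    for lang, (recognize, n) in models.items():
        text, post = recognize(audio)
        # score / ln(N) puts all languages on one scale (same ordering as
        # the pairwise comparison Score_1 vs alpha_12 * Score_2)
        results[lang] = (confidence(post) / math.log(n), text)
    lang = max(results, key=lambda k: results[k][0])
    return lang, results[lang][1]

# Stub recognizers standing in for trained Chinese / English models
def zh_model(audio):
    return "zh-text", np.array([[0.1, 0.8, 0.1], [0.05, 0.05, 0.9]])

def en_model(audio):
    return "en-text", np.array([[0.6, 0.2, 0.2], [0.2, 0.5, 0.3]])

lang, text = identify(None, {"zh": (zh_model, 5207), "en": (en_model, 28)})
```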
In conclusion, compared with the prior art, the invention has the following beneficial effects. (1) The method builds on existing speech recognition models: it obtains the confidence of the speech under each model by averaging the posterior probabilities of the valid audio frames output by that model, then performs a weighted comparison of the confidences according to vocabulary size, and selects the language with the highest score, thereby obtaining the language information of the speech. Because the linguistic information of the speech is incorporated, the method overcomes the low generalization and robustness of traditional language identification methods and can adapt to a variety of complex noise environments.
(2) The invention uses only the posterior probabilities of speech recognition as the information for computing the language, so its computational complexity is almost negligible compared with traditional deep-learning schemes; and since the speech recognition models can run inference in parallel, it introduces no excessive performance loss.
(3) The method adapts to more complex speech environments at low computational cost, and can be effectively applied to civil aviation land-air communication speech analysis to assist airport controllers in decision-making.
(4) The invention can be applied to discriminating Chinese, English, Japanese, Russian, and other languages, and thus has a wide application range.
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention. In the drawings:
FIG. 1 is a flow chart of an embodiment of the present invention;
fig. 2 is a block diagram of a system in an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to examples and accompanying drawings, and the exemplary embodiments and descriptions thereof are only used for explaining the present invention and are not meant to limit the present invention.
Example 1:
as shown in fig. 1, the language identification method for civil aviation multi-language radio land-air conversation includes: constructing a text dictionary of each language, and respectively training and deploying an end-to-end speech recognition model of each language; respectively sending the voice into voice recognition models of various languages for recognition to obtain texts, and obtaining posterior probabilities output by the voice recognition models; calculating the probability average value of the character with the maximum probability of each non-blank frame of the voice through the posterior probability output by each voice recognition model, and taking the probability average value output by each voice recognition model as the confidence coefficient of the corresponding language; scaling the confidence degrees corresponding to the languages to make the confidence probability mean values output by the voice recognition models consistent; and comparing the confidence degrees obtained by the voice recognition models based on the scaling factors, selecting the language corresponding to the voice recognition model with the highest confidence degree as the language of the current voice, and returning the output text of the voice recognition model corresponding to the language as the current text.
The text dictionary of this embodiment is expressed as

$$W = \{\, w_1, w_2, \ldots, w_N \,\}$$

and a piece of text corresponding to a speech utterance is represented as

$$Y = (\, w_{y_1}, w_{y_2}, \ldots, w_{y_m} \,)$$

where $N$ is the size of the vocabulary, determined by the number of vocabulary items of each language; $w_N$ is the $N$-th character in the dictionary; $m$ represents the total number of audio frames of the utterance; $y_1, y_2, \ldots, y_m$ are the dictionary indices of the characters corresponding to frames $1, 2, 3, \ldots, m$; and $w_{y_m}$ is the character corresponding to the $m$-th frame. In this embodiment, training and deploying the end-to-end speech recognition models of the various languages specifically comprises: constructing a deep learning speech recognition model for each language separately, extracting the FBank features of the speech as model input, and training with a CTC-loss-based strategy to obtain an end-to-end speech recognition model for each language. Here CTC (Connectionist Temporal Classification) denotes the connectionist temporal classification loss function. In this embodiment CTC is used as the loss function of model training; after the speech passes through the model, the model's result is decoded and searched with a CTC-based strategy, so that a posterior probability vector in one-to-one correspondence with each audio frame is obtained.
In this embodiment, feeding the speech into the speech recognition model of each language for recognition to obtain a text, and obtaining the posterior probabilities output by each model, specifically comprises: the speech is fed into the speech recognition models of the various languages, and an audio sequence of $T$ feature vectors is computed,

$$X = (\, x_1, x_2, \ldots, x_T \,), \qquad x_t \in \mathbb{R}^{64}$$

The 64-dimensional feature vectors are input into each speech recognition model, and the posterior output over every word or letter of the corresponding language's dictionary is obtained for each audio frame, together with the greedily decoded text. The posterior output of a speech recognition model is

$$P = (\, p_1, p_2, \ldots, p_T \,)$$

where $p_t$ denotes the vector of weights over the words of the vocabulary at frame $t$, with $t$ taking the values $1, 2, \ldots, T$:

$$p_t = (\, z_{t,1}, z_{t,2}, \ldots, z_{t,N} \,)$$

Here $z_{t,k}$ denotes the weight assigned by the speech recognition model at frame $t$ to the $k$-th character of the vocabulary, with $k$ taking the values $1, 2, \ldots, N$, and the posterior probability of the audio output by the model is

$$P(w_k \mid x_t; \theta) = \mathrm{softmax}(z_{t,k}) = \frac{e^{z_{t,k}}}{\sum_{j=1}^{N} e^{z_{t,j}}}$$

where $x_t$ denotes the $t$-th audio frame, $\theta$ represents the parameters of the speech recognition model, $P(w_k \mid x_t; \theta)$ is the posterior probability of the $k$-th character of the language given the frame, and $\mathrm{softmax}$ indicates normalization of the input. In this embodiment, when the feature vectors of the audio sequence are calculated, the audio is first pre-emphasized and then divided into frames with a frame length of 20 ms and a frame shift of 10 ms; after framing, a discrete Fourier transform is applied to the audio sequence to obtain frequency-domain features, and the FBank features of the audio are then extracted through a 64-dimensional mel filter bank.
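A minimal numpy sketch of the feature pipeline described above follows: pre-emphasis, 20 ms frames with a 10 ms shift, DFT, then a 64-band mel filter bank. The Hamming window, 512-point FFT size, pre-emphasis coefficient, and log-flooring constant are assumptions; the patent specifies only the framing parameters, the DFT, and the 64-dimensional mel bank.

```python
# FBank feature extraction sketch: (samples,) signal -> (n_frames, 64) features.
import numpy as np

def fbank(signal, sr=16000, n_mels=64, frame_ms=20, shift_ms=10, preemph=0.97):
    # Pre-emphasis: x[n] - 0.97 * x[n-1]
    sig = np.append(signal[0], signal[1:] - preemph * signal[:-1])
    flen = int(sr * frame_ms / 1000)          # 320 samples at 16 kHz
    fshift = int(sr * shift_ms / 1000)        # 160 samples at 16 kHz
    n_frames = 1 + max(0, (len(sig) - flen) // fshift)
    idx = np.arange(flen)[None, :] + fshift * np.arange(n_frames)[:, None]
    frames = sig[idx] * np.hamming(flen)      # windowed frames
    nfft = 512
    power = np.abs(np.fft.rfft(frames, nfft)) ** 2 / nfft   # power spectrum
    # Triangular mel filter bank spanning 0 .. sr/2
    def hz2mel(f): return 2595 * np.log10(1 + f / 700)
    def mel2hz(m): return 700 * (10 ** (m / 2595) - 1)
    mels = np.linspace(hz2mel(0), hz2mel(sr / 2), n_mels + 2)
    bins = np.floor((nfft + 1) * mel2hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, nfft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        if c > l: fb[i - 1, l:c] = (np.arange(l, c) - l) / (c - l)
        if r > c: fb[i - 1, c:r] = (r - np.arange(c, r)) / (r - c)
    return np.log(power @ fb.T + 1e-10)       # log mel-filterbank energies

feats = fbank(np.sin(np.linspace(0.0, 100.0, 16000)))  # 1 s of toy audio
```

One second of 16 kHz audio yields 99 frames of 64-dimensional features, matching the 20 ms / 10 ms framing in the text.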
In this embodiment, calculating the mean probability of the most probable character of each non-blank frame of the speech from the posterior probabilities output by each speech recognition model specifically comprises: for the output vector of each time step of each language's model, taking the maximum value as the confidence of the current frame and the character corresponding to that maximum as the recognized character of the current time step; removing the probability output of blank frames, taking the natural logarithm of all probabilities output by each model, and then averaging all valid outputs of each model to obtain the mean probability of the whole utterance under the corresponding language's model. The score is calculated as follows:

$$\mathit{Score} = \frac{1}{\lvert \mathcal{T}_b \rvert} \sum_{t \in \mathcal{T}_b} \ln \Bigl( \max_{k} P(w_k \mid x_t; \theta) \Bigr), \qquad \mathcal{T}_b = \bigl\{\, t : \arg\max_{k} P(w_k \mid x_t; \theta) \neq b \,\bigr\}$$

where $\mathit{Score}$ indicates the confidence of the speech under the speech recognition model (the logarithms are non-positive, so a score closer to zero indicates higher confidence); $b$ indicates the position of the blank character in the vocabulary; $\lvert \cdot \rvert$ takes the size of the set, i.e. the number of probability values of the valid characters of the audio frames; $\max$ takes the maximum probability value; $\ln$ takes the natural logarithm; $\theta$ represents the parameters of the speech recognition model; $P(w_k \mid x_t; \theta)$ is the posterior probability of the $k$-th character of the language given the frame; and $T$ represents the total number of feature vectors in the speech.
In application, because the vocabulary sizes of the languages differ, the softmax normalization causes the confidence probability means computed by the speech recognition models to be inconsistent, so they cannot be compared directly. Suppose there is an audio for which each speech recognition model predicts, at every frame, an equal probability for every word; this means no model can correctly recognize the text of the speech and each is equally confused about every frame, so the final scores of the models should be equal. On this basis, adding a scaling factor to the score of one of any two compared speech recognition models yields the identity

$$\frac{1}{T} \sum_{t=1}^{T} \ln \frac{1}{N_1} = \alpha_{12} \cdot \frac{1}{T} \sum_{t=1}^{T} \ln \frac{1}{N_2}$$

from which the scaling factor for any two languages can be derived by transformation:

$$\alpha_{12} = \frac{\ln N_1}{\ln N_2}$$

where $\alpha_{12}$ represents the scaling factor of the first language relative to the second language, $N_1$ denotes the vocabulary size of the first language, $N_2$ denotes the vocabulary size of the second language, $T$ represents the total number of feature vectors in the speech, and $t$ takes the values $1, 2, \ldots, T$.
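The identity above can be checked numerically: for a maximally confused model, every frame's maximum posterior is $1/N$, so each score collapses to $\ln(1/N_i)$ and the scaled scores of the two languages coincide. The vocabulary sizes and frame count below are illustrative.

```python
# Numeric check of the scaling identity for uniform (maximally confused) models.
import math

n1, n2, T = 5207, 28, 100            # example vocabulary sizes and frame count
score1 = sum(math.log(1.0 / n1) for _ in range(T)) / T   # = ln(1/N1)
score2 = sum(math.log(1.0 / n2) for _ in range(T)) / T   # = ln(1/N2)
alpha12 = math.log(n1) / math.log(n2)
# score1 equals alpha12 * score2 up to floating-point error
```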
The decision formula for comparing the confidences obtained by the speech recognition models in this embodiment is

$$\text{language} = \begin{cases} \text{first language}, & \mathit{Score}_1 > \alpha_{12} \cdot \mathit{Score}_2 \\ \text{second language}, & \mathit{Score}_1 < \alpha_{12} \cdot \mathit{Score}_2 \end{cases}$$

where $\mathit{Score}_1$ is the confidence output by the first speech recognition model, $\mathit{Score}_2$ is the confidence output by the second speech recognition model, and $\alpha_{12}$ is the scaling factor. After the scaling factor is applied, if the confidence of the first language is greater than that of the second language, the speech is in the first language; if it is smaller, the speech is in the second language. Finally, once the language has been judged, the speech recognition text of the corresponding model is output as the true text. If the confidence of the first language equals that of the second, the speech segment is usually blank speech or very short (less than 0.5 seconds); in this extreme case, the embodiment may default to outputting a preset language as the language judgment.
As shown in fig. 2, this embodiment further discloses a system implementing the language identification method for civil aviation multi-language radio land-air communication, comprising: a text dictionary building module, used to build the text dictionary of each language; a speech recognition model training module, used to train and deploy the end-to-end speech recognition model of each language; a text and posterior probability acquisition module, used to obtain the texts and posterior probabilities recognized by the speech recognition models of the various languages for the input speech; a language analysis module, used to obtain the texts and posterior probabilities recognized by the models, calculate from the posterior probabilities the mean probability of the most probable character of each non-blank frame of the speech, and take the mean output by each model as the confidence of the corresponding language; a speech confidence scaling module, used to scale the confidence of each language so that the confidence probability means output by the models become comparable; a confidence comparison module, used to compare, on the basis of the scaling factors, the confidences obtained by the models and select the language corresponding to the model with the highest confidence as the language of the current speech; and a text output module, used to output the text of the model with the highest confidence as the current text.
In application, the prior linguistic information learned during speech recognition is used to score the confidence of each frame of the speech, and the language of the speech is judged by comparing the confidences output by the speech recognition models of the different languages.
The following example applies this embodiment to Chinese and English recognition. The Chinese text dictionary contains 5207 characters in total (including a blank character), and the English dictionary contains 28 letters (including the space and blank characters). When text is represented, a Chinese sentence glossed character by character as "national aviation three hole two right turn heading to three five" is represented as the character sequence (nation, navigation, three, hole, two, right, turn, navigation, going to, three, five).

For an audio segment with T computed feature vectors, the sequence X = (x_1, x_2, …, x_T) is input to the speech recognition models, yielding for each audio frame the posterior output over every word or letter in the dictionary of the corresponding Chinese or English model, together with the greedily decoded text. The posterior outputs for Chinese and English are, respectively:

Y_cn = (y_1^cn, y_2^cn, …, y_T^cn), Y_en = (y_1^en, y_2^en, …, y_T^en)

where N_cn and N_en denote the sizes of the Chinese and English vocabularies, and y_t^cn and y_t^en are the probability vectors over the words of the respective vocabulary in the t-th frame of audio:

y_t = (y_t(1), y_t(2), …, y_t(N))

The posterior probabilities of the audio output by the Chinese and English speech recognition models can be expressed as:

P(π_t = k | x_t, θ_cn) = exp(y_t^cn(k)) / Σ_{j=1}^{N_cn} exp(y_t^cn(j)), P(π_t = k | x_t, θ_en) = exp(y_t^en(k)) / Σ_{j=1}^{N_en} exp(y_t^en(j))

where θ_cn denotes the parameters of the Chinese model, θ_en the parameters of the English model, and P(π_t = k | x_t, θ_cn) and P(π_t = k | x_t, θ_en) the posterior probabilities that the Chinese and English recognition models assign to the k-th character in the given frame.
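The softmax that turns a frame's output weights y_t(k) into posterior probabilities can be sketched as follows, assuming each frame's weights arrive as a plain list of floats:

```python
import math

def softmax(weights):
    """Posterior probabilities P(pi_t = k | x_t) from one frame's output weights y_t(k)."""
    m = max(weights)                            # shift by the max for numerical stability
    exps = [math.exp(w - m) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]

def posteriors(weight_matrix):
    """Apply the softmax independently to each of the T frames."""
    return [softmax(frame) for frame in weight_matrix]
```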
The Chinese and English scores are computed as:

Score_cn = (1/|A_cn|) Σ_{t∈A_cn} ln max_k P(π_t = k | x_t, θ_cn), Score_en = (1/|A_en|) Σ_{t∈A_en} ln max_k P(π_t = k | x_t, θ_en)

where A_cn and A_en are the sets of frames whose most probable character is not the blank character, and Score_cn and Score_en represent the confidence of the speech under the Chinese and English speech recognition models, respectively.
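A direct reading of the score computation above, assuming the posterior matrix is a list of per-frame probability lists and the blank symbol occupies a known index; the function also returns the greedy (best-character) indices of the valid frames:

```python
import math

def language_score(posteriors, blank_id=0):
    """Mean of ln(max_k P(k | x_t)) over frames whose best character is not blank."""
    logs, chars = [], []
    for frame in posteriors:
        k = max(range(len(frame)), key=frame.__getitem__)   # greedy character of this frame
        if k == blank_id:                                   # drop blank frames entirely
            continue
        logs.append(math.log(frame[k]))
        chars.append(k)
    score = sum(logs) / len(logs) if logs else float("-inf")
    return score, chars
```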
Because the sizes of the Chinese and English vocabularies differ, the softmax normalization causes the confidence means computed by the Chinese and English models to lie on different scales, so they cannot be compared directly. Suppose an audio for which each model predicts every word of its vocabulary with equal probability in every frame, i.e.

P(π_t = k | x_t, θ_cn) = 1/N_cn, P(π_t = k | x_t, θ_en) = 1/N_en

Specifically, with a Chinese model over 5207 common characters and an English model over 28 letters, the Chinese model outputs probability 1/5207 for each character and the English model outputs 1/28 for each letter. Equal probabilities for every word mean that neither model can correctly recognize the text of the speech; every frame of the audio is equally confusing to both models, so the final scores of the Chinese and English speech recognition models should be equal. Adding a scaling factor to the score of the English model therefore gives the identity:

α · (1/T) Σ_{t=1}^{T} ln(1/N_en) = (1/T) Σ_{t=1}^{T} ln(1/N_cn)

where α denotes the scaling factor, N_cn the size of the Chinese vocabulary, and N_en the size of the English vocabulary. Rearranging yields:

α = ln N_cn / ln N_en

The language of the speech recognition model with the largest confidence, after scaling, is taken as the current language. Specifically, with vocabulary sizes of 5207 for Chinese and 28 for English,

α = ln 5207 / ln 28 ≈ 2.57

and the final language is judged by:

language = Chinese if Score_cn > α · Score_en, otherwise English
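Under the equal-confusion assumption, the scaling factor reduces to a ratio of natural-log vocabulary sizes. A small sketch using the vocabulary sizes of this example (5207 Chinese characters, 28 English letters):

```python
import math

def scaling_factor(n_first, n_second):
    """alpha such that alpha * ln(1/n_second) == ln(1/n_first): uniform posteriors tie."""
    return math.log(n_first) / math.log(n_second)

def judge(score_first, score_second, alpha):
    """Return the winning language index (1 or 2) after scaling the second score."""
    return 1 if score_first > alpha * score_second else 2

alpha = scaling_factor(5207, 28)   # approximately 2.57 for Chinese vs. English
```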
After the scaling factor is added, if the Chinese confidence is greater than the English confidence, the speech is Chinese; otherwise, it is English. Finally, the text recognized by the winning model is output as the true text after the language judgment. As a concrete example, for one audio segment the Chinese model recognizes the text (dog postbox) with a posterior probability for each output character, while the English model recognizes (go home) with a probability for each letter. Taking the natural logarithm of each probability and averaging gives the means Score_cn and Score_en; since α · Score_en > Score_cn, the audio is determined to be English and the text is (go home).
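The decision of this worked example can be reproduced end to end. The per-character probabilities below are hypothetical stand-ins (the concrete values belong to the original figures and are not recoverable here), chosen only so that the Chinese model is unsure and the English model is confident:

```python
import math

alpha = math.log(5207) / math.log(28)            # Chinese-vs-English scaling factor

# Hypothetical probabilities: three Chinese characters ("dog postbox"),
# six English letters ("go home").
p_cn = [0.30, 0.25, 0.20]
p_en = [0.90, 0.85, 0.92, 0.88, 0.91, 0.89]

score_cn = sum(math.log(p) for p in p_cn) / len(p_cn)
score_en = sum(math.log(p) for p in p_en) / len(p_en)

language = "English" if alpha * score_en > score_cn else "Chinese"
```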
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (6)

1. A language identification method for civil aviation multi-language radio land-air communication, characterized by comprising the following steps:
constructing a text dictionary for each language, and respectively training and deploying an end-to-end speech recognition model for each language;
sending the speech into the speech recognition model of each language for recognition to obtain texts, and obtaining the posterior probabilities output by the speech recognition models;
calculating, from the posterior probability output by each speech recognition model, the mean probability of the most probable character of each non-blank frame of the speech, and taking the mean probability output by each speech recognition model as the confidence of the corresponding language;
scaling the confidence of each language so that the confidence means output by the speech recognition models are comparable;
comparing, based on the scaling factor, the confidences obtained by the speech recognition models, selecting the language of the speech recognition model with the largest confidence as the language of the current speech, and returning the output text of that model as the current text;
wherein calculating the mean probability of the most probable character of each non-blank frame of the speech from the posterior probability output by each speech recognition model comprises:
for the output vector of each time step of the speech recognition model of each language, taking the maximum value as the confidence of the current frame, and taking the character corresponding to the maximum value as the speech recognition output character of the current time step;
removing the probability output of blank frames, taking the natural logarithm of all probabilities output by each speech recognition model, and then averaging all valid outputs of each speech recognition model to obtain the probability mean of the whole speech under the corresponding language speech recognition model; the score is computed as:
Score = (1/|A|) Σ_{t∈A} ln max_k P(π_t = k | x_t, θ), A = { t : argmax_k P(π_t = k | x_t, θ) ≠ <blank>, t = 1, 2, …, T }
wherein Score represents the confidence of the speech under the speech recognition model, <blank> denotes the position of the blank character in the vocabulary, |A| is the size of the set A, i.e. the number of probability values of valid characters over the audio frames, max takes the maximum probability value, ln takes the natural logarithm, θ represents the parameters of the speech recognition model, P(π_t = k | x_t, θ) represents the posterior probability that the speech recognition model of a language assigns to the k-th character in the given frame, and T represents the total number of feature vectors in the speech.
2. The language identification method of civil aviation multi-language radio land-air communication according to claim 1, characterized in that the text dictionary is represented as D = (w_1, w_2, …, w_N), and a piece of text corresponding to a speech is represented as (w_{i_1}, w_{i_2}, …, w_{i_m}), wherein N is the size of the vocabulary, determined by the number of entries of each language, w_N is the N-th character in the dictionary, m represents the total number of audio frames of a piece of speech, i_1, i_2, i_3, …, i_m are the indices in the dictionary of the characters corresponding to the 1st, 2nd, 3rd, …, m-th frames of audio, and w_{i_m} denotes the character corresponding to the m-th frame of audio;
the training and deployment of the end-to-end speech recognition models of the languages comprises: constructing a deep learning speech recognition model for each language, extracting FBank features of the speech as the model input, and training with a CTC loss function to obtain an end-to-end speech recognition model for each language.
3. The language identification method of civil aviation multi-language radio land-air communication according to claim 1, wherein sending the speech into the speech recognition models of the languages for recognition to obtain texts and obtaining the posterior probabilities output by the speech recognition models comprises:
feeding the speech into the speech recognition model of each language, computing an audio sequence of T feature vectors X = (x_1, x_2, …, x_T), and inputting it into the speech recognition model to obtain, for each audio frame, the posterior output over every word or letter in the dictionary of the corresponding language and the greedily decoded text, the posterior output of the speech recognition model being:
Y = (y_1, y_2, …, y_T)
wherein N is the vocabulary size, y_t = (y_t(1), y_t(2), …, y_t(N)) denotes the probability vector over the words of the vocabulary in the t-th audio frame, t takes the values 1, 2, …, T, and y_t(k) denotes the weight assigned by the speech recognition model in the t-th frame to the k-th character of the vocabulary, k taking the values 1, 2, …, N; the posterior probability of the audio output by the speech recognition model is:
P(π_t = k | x_t, θ) = exp(y_t(k)) / Σ_{j=1}^{N} exp(y_t(j))
wherein x_t denotes the t-th frame of audio, θ represents the parameters of the speech recognition model, and P(π_t = k | x_t, θ) represents the posterior probability that the speech recognition model of a language assigns to the k-th character in the given frame.
4. The language identification method of civil aviation multi-language radio land-air communication according to claim 1, wherein, when the confidences of the languages are scaled, the scaling factor for any two languages is calculated as:
α = ((1/T) Σ_{t=1}^{T} ln(1/N_1)) / ((1/T) Σ_{t=1}^{T} ln(1/N_2)) = ln N_1 / ln N_2
wherein α represents the scaling factor of the first language relative to the second language, N_1 represents the size of the vocabulary of the first language, N_2 represents the size of the vocabulary of the second language, T represents the total number of feature vectors in the speech, and t takes the values 1, 2, …, T.
5. The language identification method of civil aviation multi-language radio land-air communication according to claim 1, wherein the confidences obtained by the speech recognition models are compared according to the following formula:
language = first language if Score_1 > α · Score_2, otherwise second language
wherein Score_1 is the confidence output by the first speech recognition model, Score_2 is the confidence output by the second speech recognition model, and α is the scaling factor; after the scaling factor is added, if the confidence of the first language is greater than that of the second language, the speech is the first language, and if it is less, the speech is the second language.
6. A system for implementing the language identification method for civil aviation multi-language radio land-air communication according to any one of claims 1 to 5, comprising:
a text dictionary building module, used for building text dictionaries of the languages;
a speech recognition model training module, used for training and deploying end-to-end speech recognition models of the languages;
a text and posterior probability acquisition module, used for obtaining the texts and posterior probabilities produced when the input speech is recognized by the speech recognition models of the languages;
a language analysis module, used for taking the texts and posterior probabilities recognized by the speech recognition models, calculating from the posterior probability output by each speech recognition model the mean probability of the most probable character of each non-blank frame of the speech, and taking the mean probability output by each speech recognition model as the confidence of the corresponding language;
a speech confidence scaling module, used for scaling the confidence of each language so that the confidence means output by the speech recognition models are comparable;
a confidence comparison module, used for comparing, based on the scaling factor, the confidences obtained by the speech recognition models and selecting the language of the speech recognition model with the largest confidence as the language of the current speech;
a text output module, used for outputting the text of the speech recognition model with the largest confidence as the current text;
wherein calculating the mean probability of the most probable character of each non-blank frame of the speech from the posterior probability output by each speech recognition model comprises:
for the output vector of each time step of the speech recognition model of each language, taking the maximum value as the confidence of the current frame, and taking the character corresponding to the maximum value as the speech recognition output character of the current time step;
removing the probability output of blank frames, taking the natural logarithm of all probabilities output by each speech recognition model, and then averaging all valid outputs of each speech recognition model to obtain the probability mean of the whole speech under the corresponding language speech recognition model; the score is computed as:
Score = (1/|A|) Σ_{t∈A} ln max_k P(π_t = k | x_t, θ), A = { t : argmax_k P(π_t = k | x_t, θ) ≠ <blank>, t = 1, 2, …, T }
wherein Score represents the confidence of the speech under the speech recognition model, <blank> denotes the position of the blank character in the vocabulary, |A| is the number of probability values of valid characters over the audio frames, max takes the maximum probability value, ln takes the natural logarithm, θ represents the parameters of the speech recognition model, P(π_t = k | x_t, θ) represents the posterior probability that the speech recognition model of a language assigns to the k-th character in the given frame, and T represents the total number of feature vectors in the speech.
CN202211331120.0A 2022-10-28 2022-10-28 Language identification method and system for civil aviation multi-language radio land-air conversation Active CN115394288B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211331120.0A CN115394288B (en) 2022-10-28 2022-10-28 Language identification method and system for civil aviation multi-language radio land-air conversation


Publications (2)

Publication Number Publication Date
CN115394288A CN115394288A (en) 2022-11-25
CN115394288B true CN115394288B (en) 2023-01-24

Family

ID=84115019

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211331120.0A Active CN115394288B (en) 2022-10-28 2022-10-28 Language identification method and system for civil aviation multi-language radio land-air conversation

Country Status (1)

Country Link
CN (1) CN115394288B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105280181A (en) * 2014-07-15 2016-01-27 中国科学院声学研究所 Training method for language recognition model and language recognition method
CN110895932A (en) * 2018-08-24 2020-03-20 中国科学院声学研究所 Multi-language voice recognition method based on language type and voice content collaborative classification
CN111402861A (en) * 2020-03-25 2020-07-10 苏州思必驰信息科技有限公司 Voice recognition method, device, equipment and storage medium
CN112017676A (en) * 2019-05-31 2020-12-01 京东数字科技控股有限公司 Audio processing method, apparatus and computer readable storage medium
CN112951240A (en) * 2021-05-14 2021-06-11 北京世纪好未来教育科技有限公司 Model training method, model training device, voice recognition method, voice recognition device, electronic equipment and storage medium
CN113298188A (en) * 2021-06-28 2021-08-24 深圳市商汤科技有限公司 Character recognition and neural network training method and device
CN114648976A (en) * 2022-02-16 2022-06-21 普强时代(珠海横琴)信息技术有限公司 Language identification method and device, electronic equipment and medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5530729B2 (en) * 2009-01-23 2014-06-25 本田技研工業株式会社 Speech understanding device
JP5967569B2 (en) * 2012-07-09 2016-08-10 国立研究開発法人情報通信研究機構 Speech processing system
CN106782513B (en) * 2017-01-25 2019-08-23 上海交通大学 Speech recognition realization method and system based on confidence level
CN109119072A (en) * 2018-09-28 2019-01-01 中国民航大学 Civil aviaton's land sky call acoustic model construction method based on DNN-HMM
CN112233653B (en) * 2020-12-10 2021-03-12 北京远鉴信息技术有限公司 Method, device and equipment for training multi-dialect accent mandarin speech recognition model


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
基于多任务神经网络的语种识别研究;秦晨光;《中国优秀硕士学位论文全文数据库》;20210215(第2期);全文 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant