CN113053366B - Multimodal fusion-based control voice readback consistency verification method


Info

Publication number
CN113053366B
CN113053366B (application CN202110270332.1A / CN202110270332A)
Authority
CN
China
Prior art keywords
voice
text
layer
input
feature
Prior art date
Legal status
Active
Application number
CN202110270332.1A
Other languages
Chinese (zh)
Other versions
CN113053366A (en)
Inventor
王煊
彭佳
蒋伟煜
徐秋程
丁辉
严勇杰
Current Assignee
CETC 28 Research Institute
Original Assignee
CETC 28 Research Institute
Priority date
Filing date
Publication date
Application filed by CETC 28 Research Institute filed Critical CETC 28 Research Institute
Priority to CN202110270332.1A priority Critical patent/CN113053366B/en
Publication of CN113053366A publication Critical patent/CN113053366A/en
Application granted granted Critical
Publication of CN113053366B publication Critical patent/CN113053366B/en
Legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/063: Speech recognition; creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/16: Speech classification or search using artificial neural networks
    • G10L 25/24: Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L 25/51: Speech or voice analysis techniques specially adapted for particular use, for comparison or discrimination
    • G10L 2015/0631: Creating reference templates; clustering


Abstract

The invention provides a control voice readback consistency verification method based on multimodal fusion. By verifying the semantic consistency between a control instruction and its readback, the method provides an assistive function: it automatically judges whether the pilot's readback is semantically consistent with the controller's instruction. Exploiting the complementary information carried by different modalities, the method uses the voice signal to compensate for information lost in the text data, a loss that usually stems from the limited accuracy of speech recognition. A highly reliable judgment result can therefore be obtained.

Description

Multimodal fusion-based control voice readback consistency verification method
Technical Field
The invention belongs to the technical field of air traffic control automation systems, and particularly relates to a multimodal fusion-based control voice readback consistency verification method.
Background
To ensure flight safety, air traffic controllers ("controllers") and pilots must understand each other's intent in a timely and accurate manner. In actual control work, after a controller issues a control instruction, the pilot must repeat the instruction back once its content is understood, a process known as readback. The accuracy of the readback is critical to aviation safety, yet judging it still depends on the controller alone, without the assistance of an automated system. Long hours of high-intensity work cause fatigue, making it difficult at times for the controller to judge whether the readback is semantically consistent with the control instruction, and accidents caused by inconsistent readbacks have occurred. Because both the control instruction and the readback are transmitted as voice, performing semantic consistency verification with artificial intelligence techniques such as speech recognition and natural language processing can help the controller detect inconsistent readbacks in time, improving the safety of air traffic operations and reducing controller workload.
Existing deep-learning-based readback consistency verification methods first convert the voice data into text with speech recognition and then judge the consistency of the control and readback texts with a deep neural network. Limited by the accuracy of speech recognition, such methods suffer from poor accuracy. Multimodal fusion exploits the complementary information carried by different modalities: the control voice signal compensates for deficiencies in the control text data, strengthening the automated system's semantic understanding of control instructions and improving the accuracy and reliability of the consistency check.
Disclosure of Invention
The purpose of the invention: aiming at the shortcomings of the prior art, the invention provides a control voice readback consistency check method based on multimodal fusion, comprising the following steps:
step 1, collecting control voice and readback voice data to form positive sample training data; generating incorrect readback voice according to the control voice to form negative sample training data; processing the collected control voice and readback voice training data with speech recognition technology (see Dong Yu and Li Deng, Analytical Deep Learning: Speech Recognition Practice, Publishing House of Electronics Industry, 2016) to generate text data, the text data comprising control text data and readback text data;
step 2, constructing a single-voice single-text multimodal fusion model, inputting the text data into the single-voice single-text multimodal fusion model, and outputting a probability distribution;
step 3, constructing a dual-voice dual-text multimodal fusion model, inputting the text data into the dual-voice dual-text multimodal fusion model, and outputting a probability distribution;
step 4, constructing a fully connected neural network classification model, inputting the probability distributions obtained in step 2 and step 3 into the fully connected neural network classification model, and outputting the control voice readback consistency check result.
In step 2, the single-voice single-text multimodal fusion model comprises a first high-level feature extraction layer, a first attention-based feature alignment layer, a first multimodal feature fusion layer and a first semantic consistency check layer.
In step 2, constructing the single-voice single-text multimodal fusion model specifically comprises:
Step 2-1, constructing the first high-level feature extraction layer to obtain high-level features: taking the collected control voice and readback voice training data as input and framing them, dividing control voice data of length n seconds into m frames so that each frame covers n/m seconds; applying a fast Fourier transform to each frame to convert the time-domain representation into a spectral representation, and then applying a Mel filter bank to obtain a sequence representation based on Mel-frequency cepstral coefficients, i.e. the low-level features of the voice signal;
performing word embedding on the text data: generating a word sequence by word segmentation, converting each word into a word vector with the Word2Vec method, and combining the vectors into a vector representation of the text data, i.e. the low-level features of the text data;
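By way of illustration, a minimal sketch of this low-level feature extraction, assuming librosa for the MFCC computation and gensim for Word2Vec; neither library, nor the 8 kHz sample rate, 40 MFCC coefficients, 128-dimensional embeddings or 50-token padding, is prescribed by the patent:

```python
import numpy as np
import librosa
from gensim.models import Word2Vec

def speech_low_level_features(wav_path, n_frames=10, n_mfcc=40):
    """Split an utterance into n_frames frames and compute one MFCC vector per frame."""
    signal, sr = librosa.load(wav_path, sr=8000)       # assumed 8 kHz radio audio
    frame_len = len(signal) // n_frames                # each frame covers n/m seconds
    feats = []
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        # librosa applies the FFT and Mel filter bank internally to produce MFCCs
        mfcc = librosa.feature.mfcc(y=frame, sr=sr, n_mfcc=n_mfcc)
        feats.append(mfcc.mean(axis=1))                # average within the frame
    return np.stack(feats)                             # low-level voice features, (n_frames, n_mfcc)

def text_low_level_features(tokenized_corpus, sentence, dim=128, max_len=50):
    """Word2Vec embeddings for a segmented sentence, zero-padded to a fixed length."""
    w2v = Word2Vec(tokenized_corpus, vector_size=dim, min_count=1)
    vecs = [w2v.wv[w] if w in w2v.wv else np.zeros(dim) for w in sentence]
    vecs += [np.zeros(dim)] * max(0, max_len - len(vecs))
    return np.stack(vecs[:max_len])                    # low-level text features, (max_len, dim)
```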
constructing a bidirectional long short-term memory (LSTM) network layer (see Sepp Hochreiter and Jürgen Schmidhuber, "Long short-term memory", Neural Computation, 9(8):1735-1780, 1997), which refines the low-level features of the voice signal and of the text data, respectively, into high-level features of the voice signal and of the text data;
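A sketch of this bidirectional LSTM refinement, assuming PyTorch (the patent does not name a framework); the hidden size l = 256 and the input sizes, which follow the sketch above, are illustrative:

```python
import torch
import torch.nn as nn

class HighLevelEncoder(nn.Module):
    """Bidirectional LSTM mapping low-level voice or text features to l-dimensional high-level features."""
    def __init__(self, in_dim, l=256):
        super().__init__()
        # l//2 hidden units per direction, so the concatenated output is l-dimensional
        self.bilstm = nn.LSTM(in_dim, l // 2, batch_first=True, bidirectional=True)

    def forward(self, x):               # x: (batch, seq_len, in_dim)
        out, _ = self.bilstm(x)         # out: (batch, seq_len, l)
        return out

speech_encoder = HighLevelEncoder(in_dim=40)    # E_S in R^{10 x 40}  -> E'_S in R^{10 x 256}
text_encoder = HighLevelEncoder(in_dim=128)     # E_T in R^{50 x 128} -> E'_T in R^{50 x 256}
```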
Step 2-2, constructing the first attention-based feature alignment layer, which uses one fully connected layer to compare the voice features and text features generated by the bidirectional LSTM layer and obtain the distribution of attention values between them: let the processed high-level voice features and high-level text features be E'_S ∈ R^{m_S×l} and E'_T ∈ R^{m_T×l}, where R is the set of real numbers, m_S and m_T denote the lengths of the voice and text feature sequences, and l denotes the feature dimension; the attention values computed by the fully connected layer are:

a_ij = softmax(E'_S · E'_T^T)   (1)

wherein a_ij represents the similarity between the i-th voice frame and the j-th word of the text; the attention value distribution is used to weight the voice features, realizing the alignment operation, with output features a_ij · E'_S.
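A sketch of the alignment in equation (1) under one plausible reading, assuming PyTorch: the dot-product scores are normalized over the text positions and used to pool the voice frames onto each text token, and the aligned voice features are then concatenated with the text features along the feature axis; the patent's fully connected scoring layer and exact concatenation scheme may differ:

```python
import torch
import torch.nn.functional as F

def align_voice_to_text(E_s, E_t):
    """Equation (1): attention between every voice frame and every text token.

    E_s: (m_S, l) high-level voice features; E_t: (m_T, l) high-level text features.
    Returns the attention map A (m_S, m_T) and the attention-weighted voice features.
    """
    scores = E_s @ E_t.T                    # a_ij before normalization
    A = F.softmax(scores, dim=-1)           # normalize over the text positions
    weighted_voice = A.T @ E_s              # (m_T, l): voice evidence aligned to each token
    return A, weighted_voice

E_s = torch.randn(10, 256)                  # E'_S: 10 voice frames
E_t = torch.randn(50, 256)                  # E'_T: 50 text tokens
A, aligned = align_voice_to_text(E_s, E_t)
fused_input = torch.cat([E_t, aligned], dim=-1)   # (50, 512), fed to the fusion BiLSTM
```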
Step 2-3, inputting the output weighted features into a bidirectional LSTM layer, and concatenating the text high-level features obtained after processing by the bidirectional LSTM layer with the weight-aligned voice high-level features obtained in step 2-2 to obtain the concatenation E = [E'_T, a_ij · E'_S]; taking E as the input of the model and outputting the high-level features obtained after fusing the two modality data;
Step 2-4, constructing a forward fully connected neural network as the output layer and checking semantic consistency, namely:

y = softmax(W·E + b)   (2)

wherein y ∈ R^{1×2} represents the output judgment result, i.e. the probability distribution over consistent and inconsistent, W ∈ R^{l×2} is the weight of the fully connected layer and b ∈ R^{1×2} is its bias parameter; the high-level features output in step 2-3 are taken as the input of this layer, and a classification result based on a binary probability distribution is output, representing the probability that the semantics are consistent and the probability that they are inconsistent.
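A sketch of the fusion BiLSTM and the output layer of equation (2), assuming PyTorch; mean-pooling the fused sequence into the fixed-size vector E is an assumption, since the patent does not state how the sequence is reduced before the softmax layer:

```python
import torch
import torch.nn as nn

class ConsistencyHead(nn.Module):
    """Fusion BiLSTM followed by a forward fully connected layer, y = softmax(W*E + b)."""
    def __init__(self, in_dim=512, l=256):
        super().__init__()
        self.fusion = nn.LSTM(in_dim, l // 2, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(l, 2)           # W in R^{l x 2}, b in R^{1 x 2}

    def forward(self, fused_seq):           # fused_seq: (batch, seq_len, in_dim)
        out, _ = self.fusion(fused_seq)
        E = out.mean(dim=1)                 # pool the fused sequence into E
        return torch.softmax(self.fc(E), dim=-1)   # probabilities: consistent / inconsistent

head = ConsistencyHead()
y = head(torch.randn(1, 50, 512))           # y in R^{1 x 2}
```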
In step 3, the dual-voice dual-text multimodal fusion model comprises a second high-level feature extraction layer, a second attention-based feature alignment layer, a second multimodal feature fusion layer and an output layer.
In step 3, constructing the dual-voice dual-text multimodal fusion model specifically comprises:
Step 3-1, constructing the second high-level feature extraction layer to obtain high-level feature representations: taking the collected control voice and readback voice training data as input and framing the voice signal, dividing control voice data of length n seconds into m frames so that each frame covers n/m seconds; applying a fast Fourier transform to each frame to convert the time-domain representation into a spectral representation, and then applying a Mel filter bank to obtain a sequence representation based on Mel-frequency cepstral coefficients, i.e. the low-level features of the voice signal; concatenating the control voice data with the readback voice data and the control text data with the readback text data to form the voice input and the text input, respectively;
Because the Transformer model has strong feature extraction capability and a strong capacity for fusing multimodal data features, the Transformer is selected as the feature extraction layer. A Transformer model is constructed to process the voice input and the text input separately; the input of the Transformer model is the vector sequence of the voice signal and of the text data, and the position of each feature vector is represented by a position encoding given by:

PE(pos, 2i) = sin(pos / 10000^{2i/d_model})   (3)
PE(pos, 2i+1) = cos(pos / 10000^{2i/d_model})   (4)

wherein PE(pos) represents the position encoding of the feature vector at position pos, PE(pos, 2i) is the sine component and PE(pos, 2i+1) is the cosine component; d_model = 512 is the encoding dimension, the even-numbered dimensions are obtained with formula (3) and the odd-numbered dimensions with formula (4), the position encoding has shape (l, 512) with l denoting the length of the input sequence, and the position encoding is added to the input vector sequence to obtain the input of the Transformer model;
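A sketch of the sinusoidal position encoding of equations (3) and (4) with d_model = 512, assuming PyTorch; the sequence length in the usage line is illustrative:

```python
import torch

def positional_encoding(seq_len, d_model=512):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)    # (seq_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)             # even dimension indices
    angle = pos / torch.pow(torch.tensor(10000.0), i / d_model)      # (seq_len, d_model/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)          # even dimensions: sine component
    pe[:, 1::2] = torch.cos(angle)          # odd dimensions: cosine component
    return pe

x = torch.randn(120, 512)                   # projected voice or text input sequence
x = x + positional_encoding(x.size(0))      # Transformer input with position information
```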
The Transformer model uses multi-head attention to compute attention values between the feature vectors of the input sequences and uses these attention values to improve the vector representation of the input. Specifically, the multi-head attention consists of h scaled dot-product attention heads, and the feature vectors of the input are transformed by

X × W_Q = Q   (5)
X × W_K = K   (6)
X × W_V = V   (7)

where X denotes the features of the input sequence, Q, K and V denote the query, key and value vectors, K and V form the key-value pairs, and W_Q, W_K and W_V are the parameter matrices that produce Q, K and V. The attention value Attention(Q, K, V) is computed as

Attention(Q, K, V) = softmax(Q·K^T / √d_k) · V   (8)

where Attention(Q, K, V) is the attention value between the voice signal and the text data, d_k is a scaling factor (the dimension of the key vectors), and the Softmax function is a normalized activation function that maps the outputs into the interval (0, 1);
In the Transformer model, Q is taken from the input sequence of the voice signal while K and V are taken from the input sequence of the text data; after the Transformer model, a semantically processed high-level feature representation is obtained;
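A sketch of equations (5) to (8), assuming PyTorch: Q is projected from the voice sequence and K, V from the text sequence, as described above; a full multi-head layer (h = 8 heads in the detailed description) would repeat this in parallel over projected sub-spaces with learned projection matrices rather than the random W matrices used here:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # similarity between voice and text vectors
    return F.softmax(scores, dim=-1) @ V

d_model = 512
X_voice = torch.randn(20, d_model)          # concatenated control + readback voice features
X_text = torch.randn(100, d_model)          # concatenated control + readback text features
W_q, W_k, W_v = (torch.randn(d_model, d_model) for _ in range(3))
Q, K, V = X_voice @ W_q, X_text @ W_k, X_text @ W_v     # equations (5)-(7)
fused = scaled_dot_product_attention(Q, K, V)           # (20, 512) text-informed voice features
```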
Step 3-2, constructing the second attention-based feature alignment layer, which uses one fully connected layer to integrate the feature outputs of the multi-head attention mechanism and outputs the features obtained after aligning the high-level features of the voice signal with the high-level features of the text data;
Step 3-3, constructing the output layer with a forward fully connected neural network to check whether the semantics are consistent, taking the features obtained in step 3-2 as input and outputting a classification result based on a binary probability distribution, representing the probability that the semantics are consistent and the probability that they are inconsistent.
In step 3-1, the Softmax function is defined as

softmax(x_i) = e^{x_i} / Σ_j e^{x_j}   (9)

where x_i is the i-th element of the input vector.
Step 4 comprises: constructing a classification layer with a fully connected neural network, whose input is the probability distributions obtained from the two single-voice single-text multimodal fusion models (one taking the control voice and the readback text as input, the other taking the control text and the readback voice as input; they share the single-voice single-text structure but differ in their inputs) and from the dual-voice dual-text multimodal fusion model, and whose output is passed through a Softmax function to obtain a normalized probability distribution, namely the probability that the readback is consistent and the probability that it is inconsistent; this probability is the judgment result.
Beneficial effects: the method realizes an intelligent check of readback semantic consistency. Exploiting the complementary information carried by different modalities, it uses the voice signal to compensate for information missing from the text data, a loss usually caused by the limited accuracy of speech recognition, so a highly reliable judgment result can be obtained.
Drawings
The foregoing and other advantages of the invention will become more apparent from the following detailed description of the invention when taken in conjunction with the accompanying drawings.
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a block diagram of a single-voice single-text multimodal fusion model.
FIG. 3 is a block diagram of the dual-voice dual-text multimodal fusion model.
Fig. 4 is a diagram of a bidirectional LSTM model structure.
Fig. 5 is a schematic diagram of attention alignment weighting.
FIG. 6 is a structural diagram of the Transformer model.
Detailed Description
The invention provides a control voice readback consistency check method based on multimodal fusion; the flow is shown in FIG. 1 and specifically comprises the following steps:
Step 1: collect control voice and readback voice data to form positive sample pairs, and have professionals generate incorrect readback voice from the control voice to form negative sample pairs. The voice data is processed with speech recognition technology to generate text data, and the deep neural network models are trained with these data to obtain trained models. Since the actual consistency check proceeds in the same way as training, the composition of the models and the judgment (training) process are described together below.
Example: the following control voice is collected: "Eastern three nine eight four, runway three five, cleared for takeoff", together with the readback voice: "Runway three five, cleared for takeoff, Eastern three nine eight four". A professional then generates an incorrect readback, e.g. "Runway three five, holding for takeoff, Eastern three nine eight four", producing a negative sample pair. These voice data are converted into text, forming the voice and text multimodal training data.
Step 2: construct a single-voice single-text multimodal fusion model (see FIG. 2; cf. Haiyang Xu, Hui Zhang, Kun Han, Yun Wang, Yiping Peng, Xiangang Li, "Learning Alignment for Multimodal Emotion Recognition from Speech", Interspeech 2019, https://arxiv.org/abs/1909.05645). Its role is to judge the consistency of the control voice with the readback text, or of the readback voice with the control text. The model consists of four parts: 1) a high-level feature extraction layer; 2) an attention-based feature alignment layer; 3) a multimodal feature fusion layer; 4) a semantic consistency check layer.
1) Frame the input voice signal: control voice data of length n seconds is divided into m frames, so each frame covers n/m seconds. Apply a fast Fourier transform (FFT) to each frame to convert the time-domain representation into a spectral representation, then apply a Mel filter bank to obtain a sequence representation based on Mel-frequency cepstral coefficients (MFCC), i.e. the low-level features of the voice signal.
Suppose the input voice data lasts 20 seconds and each frame is 2 seconds long; the voice data is then divided into an input sequence of 10 frames, which after processing gives the input E_S ∈ R^{10×l_S}, where l_S denotes the feature dimension.
Perform word embedding on the input text data: generate a word sequence by word segmentation, convert each word into a word vector with the Word2Vec method, and combine the vectors into a vector representation of the text data, i.e. the low-level features of the text data.
Taking "Eastern three nine eight four, runway three five, cleared for takeoff" as an example, the sentence contains 16 tokens; a unified token count, e.g. 50 tokens, is adopted as the length of the input text, and the text is padded to this length, producing the input features E_T ∈ R^{50×l_T}, where l_T denotes the feature dimension.
Construct separate bidirectional LSTM model layers (see FIG. 4), which refine the low-level features of the input voice and text, respectively, into high-level features. After the bidirectional LSTM processing, E_S and E_T are converted into high-level features of the same dimension, E'_S ∈ R^{10×l} and E'_T ∈ R^{50×l}, where l denotes the high-level feature dimension; once the voice and text features share the same dimension they can be concatenated.
2) Construct the attention layer (see FIG. 5). The attention layer uses one fully connected layer to compare the input voice features with the text features and obtain the distribution of attention values between them; this distribution is used to weight the voice features, realizing the alignment with the text features, and the weight-aligned voice features, of shape R^{10×l}, are output.
3) Construct the multimodal feature fusion layer with a bidirectional LSTM model: concatenate the text features with the weight-aligned voice features, feed them into the bidirectional LSTM of the fusion layer, and output the fused features.
4) Construct the semantic consistency check layer with a forward fully connected neural network. It takes the fused features as input and outputs a binary probability distribution, namely the probability that the semantics are consistent and the probability that they are inconsistent; if the input voice-text data pair is consistent, the probability output for label 1 should be larger than that for label 0, and vice versa.
Step 3: construct a dual-voice dual-text multimodal fusion model (see FIG. 3), whose role is to perform the semantic consistency check using four kinds of data simultaneously: control voice, readback voice, control text and readback text. The model consists of four parts: 1) a high-level feature extraction layer; 2) an attention-based feature alignment layer; 3) a multimodal feature fusion layer; 4) a semantic consistency check layer.
For this step, the input is the concatenation of the control and readback voice or text, e.g.: "Eastern three nine eight four, runway three five, cleared for takeoff; runway three five, cleared for takeoff, Eastern three nine eight four", or: "Eastern three nine eight four, runway three five, cleared for takeoff; runway three five, holding for takeoff, Eastern three nine eight four". Correspondingly, the input feature dimensions become E_S ∈ R^{20×l_S} and E_T ∈ R^{100×l_T}: the lengths of the input voice and text are each twice those of the single inputs, and the ";" symbol is used to separate the two parts of the representation.
1) The low-level feature extraction for the voice signal and the text is the same as in step 2. After the low-level features are obtained, the control voice data is concatenated with the readback voice data and the control text data with the readback text data, forming the voice input and the text input, respectively.
Construct a Transformer model (see FIG. 6) to process the concatenated voice input and text input separately, so as to discover the degree of semantic correlation between the control voice (text) and the readback voice (text); this semantic correlation is used to improve the feature representation so that it carries stronger semantic relevance. The input feature vector sequence of the Transformer model represents the position of each feature vector with a position encoding:

PE(pos, 2i) = sin(pos / 10000^{2i/d_model})   (1)
PE(pos, 2i+1) = cos(pos / 10000^{2i/d_model})   (2)

where PE(pos) denotes the position encoding of the feature vector at position pos and d_model = 512 is the encoding dimension, equal to the dimension of the input feature vectors; the sine term (1) gives the even-numbered dimensions and the cosine term (2) the odd-numbered dimensions. The position encoding has shape (l, 512), l being the length of the input sequence, and it is added to the input to form the input of the Transformer model.
The Transformer model computes attention values between the feature vectors of the input with multi-head attention, uses these attention values to improve the vector representation of the input, and thereby strengthens its ability to extract semantic features. The multi-head attention consists of h = 8 scaled dot-product attention heads, and the feature vectors of the input are transformed by

X × W_Q = Q   (3)
X × W_K = K   (4)
X × W_V = V   (5)

giving Q, K and V, which denote the query, key and value vectors respectively; W_Q, W_K and W_V are the corresponding transformation matrices. The scaled dot-product attention is computed as

Attention(Q, K, V) = softmax(Q·K^T / √d_k) · V   (6)

where d_k is a scaling factor and the Softmax function is a normalized activation function that maps the outputs of multiple neurons into the interval (0, 1). The Softmax function is defined as

softmax(x_i) = e^{x_i} / Σ_j e^{x_j}   (7)

After the Transformer model, a semantically processed high-level feature representation is obtained.
2) Construct the attention layer (see FIG. 5). The attention layer uses one fully connected layer to compare the input voice features with the text features and obtain the distribution of attention values between them; this distribution is used to weight the voice features, realizing the alignment with the text features, and the weight-aligned voice features are output.
3) Construct the multimodal fusion layer with a bidirectional LSTM model: concatenate the text features with the aligned voice features, feed them into the bidirectional LSTM of the fusion layer, and output the fused features.
4) Construct the semantic consistency check layer with a forward fully connected neural network; it takes the fused features as input and outputs a binary probability distribution, namely the probability that the semantics are consistent and the probability that they are inconsistent.
Step 4: construct a simple fully connected neural network classification model. Its input is the probability distributions obtained from the three models (the two single-voice single-text multimodal fusion models and the dual-voice dual-text multimodal fusion model); its output, after a Softmax function (see formula (7)), is a normalized probability distribution representing the probability of consistency and the probability of inconsistency, which is the judgment result. For example, if the normalized probability of consistency is 0.76 and the probability of inconsistency is 0.24, the control voice and readback voice data are judged to be consistent.
Each of the three classifiers outputs the probability that the sample is consistent or inconsistent, i.e. y ∈ R^{1×2}; these outputs are concatenated to form X ∈ R^{1×6} and fed into the fully connected neural network classification model, which produces the final result Y ∈ R^{1×2}, i.e. the classification result.
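A sketch of this result-integration step, assuming PyTorch: the three 2-dimensional probability outputs are concatenated into X ∈ R^{1×6} and passed through a fully connected layer with a softmax to produce the final verdict Y ∈ R^{1×2}; the example probabilities are arbitrary:

```python
import torch
import torch.nn as nn

class ResultFusion(nn.Module):
    """Concatenates the three classifiers' outputs (each in R^{1x2}) and re-classifies them."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(6, 2)

    def forward(self, y_voice_text, y_text_voice, y_dual):
        X = torch.cat([y_voice_text, y_text_voice, y_dual], dim=-1)   # X in R^{1x6}
        return torch.softmax(self.fc(X), dim=-1)                      # Y in R^{1x2}

fusion = ResultFusion()
# outputs of the two single-voice single-text models and the dual-voice dual-text model
Y = fusion(torch.tensor([[0.7, 0.3]]), torch.tensor([[0.8, 0.2]]), torch.tensor([[0.75, 0.25]]))
consistent = Y[0, 0] > Y[0, 1]              # final readback consistency verdict
```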
As shown in FIG. 1, the method requires preparing sample data, training the models, and judging whether the readback is consistent; for convenience, the description is divided into three steps.
Step one: prepare control voice data and readback data and label them manually. The manual labelling requires professional controllers to generate negative samples according to readback errors that may occur in actual work; control texts and readback texts are then generated with speech recognition technology, and the samples are labelled: samples with a consistent readback are labelled with output 1, and samples with an inconsistent readback are labelled with output 0.
Step two: from the four kinds of sample data, form two types of pairs: 1) control voice signal - readback text data; 2) readback voice signal - control text data. The contents of these pairs must correspond one to one according to the labelling result, and they are used to train the single-voice single-text multimodal fusion models. The voice data are also concatenated back to back, and the corresponding text data likewise, to generate sample data used to train the dual-voice dual-text multimodal fusion model.
Train the multimodal fusion models with the labelled sample data to obtain the classification models; then train the forward fully connected model on their classification results so that it outputs the readback consistency judgment (1 or 0). Training yields the result-integration model.
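A minimal end-to-end training sketch for step two, assuming PyTorch and a cross-entropy loss on the 1/0 consistency labels; the DummyFusionModel modules and the pooled 64-dimensional features are hypothetical placeholders standing in for the fusion models sketched earlier:

```python
import torch
import torch.nn as nn

class DummyFusionModel(nn.Module):
    """Placeholder for a multimodal fusion model: concatenates two pooled inputs and classifies."""
    def __init__(self, feat_dim):
        super().__init__()
        self.fc = nn.Linear(2 * feat_dim, 2)
    def forward(self, a, b):                          # a, b: (batch, feat_dim)
        return torch.softmax(self.fc(torch.cat([a, b], dim=-1)), dim=-1)

single_a, single_b = DummyFusionModel(64), DummyFusionModel(64)
dual = DummyFusionModel(128)                          # takes concatenated control+readback inputs
fusion = nn.Linear(6, 2)                              # result-integration model of step 4
criterion = nn.CrossEntropyLoss()
params = [p for m in (single_a, single_b, dual, fusion) for p in m.parameters()]
optimizer = torch.optim.Adam(params, lr=1e-4)

# one synthetic training step: random pooled features, label 1 = consistent readback
v_ctrl, t_ctrl, v_read, t_read = (torch.randn(4, 64) for _ in range(4))
label = torch.ones(4, dtype=torch.long)
y = torch.cat([single_a(v_ctrl, t_read),              # control voice  + readback text
               single_b(v_read, t_ctrl),              # readback voice + control text
               dual(torch.cat([v_ctrl, v_read], dim=-1),
                    torch.cat([t_ctrl, t_read], dim=-1))], dim=-1)
loss = criterion(fusion(y), label)                    # cross-entropy on the consistency label
optimizer.zero_grad()
loss.backward()
optimizer.step()
```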
Step three: in actual control work, collect the control voice issued by the controller and the voice signal of the pilot's readback, obtain the corresponding text data with speech recognition, and analyse the data with the trained classification models to obtain the readback consistency check result.
The invention provides a control voice readback consistency verification method based on multimodal fusion. There are many ways and approaches to implement this technical solution; the above is only a preferred embodiment of the invention, and it should be noted that those skilled in the art can make a number of improvements and modifications without departing from the principle of the invention, and such improvements and modifications are also considered to fall within the protection scope of the invention. Components not explicitly described in this embodiment can be implemented with existing technology.

Claims (3)

1. A multimodal fusion-based control voice readback consistency check method, characterized by comprising the following steps:
step 1, collecting control voice and readback voice data to form positive sample training data; generating incorrect readback voice according to the control voice to form negative sample training data; processing the collected control voice and readback voice training data with speech recognition technology to generate text data, wherein the text data comprises control text data and readback text data;
step 2, constructing a single-voice single-text multimodal fusion model, inputting the text data into the single-voice single-text multimodal fusion model, and outputting a probability distribution;
step 3, constructing a dual-voice dual-text multimodal fusion model, inputting the text data into the dual-voice dual-text multimodal fusion model, and outputting a probability distribution;
step 4, constructing a fully connected neural network classification model, inputting the probability distributions obtained in step 2 and step 3 into the fully connected neural network classification model, and outputting the control voice readback consistency check result;
in step 2, the single-voice single-text multimodal fusion model comprises a first high-level feature extraction layer, a first attention-based feature alignment layer, a first multimodal feature fusion layer and a first semantic consistency check layer;
in step 2, constructing the single-voice single-text multimodal fusion model specifically comprises:
step 2-1, constructing the first high-level feature extraction layer to obtain high-level features: taking the collected control voice and readback voice training data as input and framing them, dividing control voice data of length n seconds into m frames so that each frame covers n/m seconds; applying a fast Fourier transform to each frame to convert the time-domain representation into a spectral representation, and then applying a Mel filter bank to obtain a sequence representation based on Mel-frequency cepstral coefficients, i.e. the low-level features of the voice signal;
performing word embedding on the text data, generating a word sequence by word segmentation, converting each word into a word vector with the Word2Vec method, and combining the vectors into a vector representation of the text data, i.e. the low-level features of the text data;
constructing a bidirectional long short-term memory (LSTM) network layer, which refines the low-level features of the voice signal and the low-level features of the text data, respectively, into high-level features of the voice signal and high-level features of the text data;
step 2-2, constructing the first attention-based feature alignment layer, which uses one fully connected layer to compare the voice features and text features generated by the bidirectional LSTM layer and obtain the distribution of attention values between them: let the processed high-level voice features and high-level text features be E'_S ∈ R^{m_S×l} and E'_T ∈ R^{m_T×l}, where R is the set of real numbers, m_S and m_T denote the lengths of the voice and text feature sequences, and l denotes the feature dimension; the attention values computed by the fully connected layer are:

a_ij = softmax(E'_S · E'_T^T)   (1)
wherein a_ij represents the similarity between the i-th voice frame and the j-th word of the text; the attention value distribution is used to weight the voice features, realizing the alignment operation, with output features a_ij · E'_S;
step 2-3, inputting the output weighted features into a bidirectional LSTM layer, and concatenating the text high-level features obtained after processing by the bidirectional LSTM layer with the weight-aligned voice high-level features obtained in step 2-2 to obtain the concatenation E = [E'_T, a_ij · E'_S]; taking E as the input of the model and outputting the high-level features obtained after fusing the two modality data;
step 2-4, constructing a forward fully-connected neural network as an output layer, and checking semantic consistency, namely:
y=softmax(W·E+b) (2)
wherein y ∈ R^{1×2} represents the output judgment result, i.e. the probability distribution over consistent and inconsistent, W ∈ R^{l×2} is the weight of the fully connected layer and b ∈ R^{1×2} is its bias parameter; the high-level features output in step 2-3 are taken as the input of this layer, and a classification result based on a binary probability distribution is output, representing the probability that the semantics are consistent and the probability that they are inconsistent;
in step 3, the dual-voice dual-text multimodal fusion model comprises a second high-level feature extraction layer, a second attention-based feature alignment layer, a second multimodal feature fusion layer and an output layer;
in step 3, constructing the dual-voice dual-text multimodal fusion model specifically comprises:
step 3-1, constructing the second high-level feature extraction layer to obtain high-level feature representations: taking the collected control voice and readback voice training data as input and framing the input voice signal, dividing control voice data of length n seconds into m frames so that each frame covers n/m seconds; applying a fast Fourier transform to each frame to convert the time-domain representation into a spectral representation, and then applying a Mel filter bank to obtain a sequence representation based on Mel-frequency cepstral coefficients, i.e. the low-level features of the voice signal; concatenating the control voice data with the readback voice data and the control text data with the readback text data to form the voice input and the text input, respectively;
constructing a Transformer model to process the voice input and the text input separately, wherein the input of the Transformer model is the vector sequence of the voice signal and of the text data, and the position of each feature vector is represented by a position encoding given by:

PE(pos, 2i) = sin(pos / 10000^{2i/d_model})   (3)
PE(pos, 2i+1) = cos(pos / 10000^{2i/d_model})   (4)

wherein PE(pos) represents the position encoding of the feature vector at position pos, PE(pos, 2i) is the sine component and PE(pos, 2i+1) is the cosine component; d_model = 512 is the encoding dimension, the even-numbered dimensions of the position encoding are obtained with formula (3) and the odd-numbered dimensions with formula (4), the position encoding has shape (l, 512) with l denoting the length of the input sequence, and the position encoding is added to the input vector sequence to obtain the input of the Transformer model;
the Transformer model uses multi-head attention to compute attention values between the feature vectors of the input, and uses the attention values to improve the vector representation of the input, specifically comprising: the multi-head attention comprises h scaled dot-product attention heads, and the feature vectors of the input are calculated by the following formulas:
X × W_Q = Q   (5)
X × W_K = K   (6)
X × W_V = V   (7)
wherein X represents the features of the input sequence, Q, K and V represent the query vector, key vector and value vector respectively, K and V form the key-value pairs, and W_Q, W_K, W_V are the parameter matrices producing Q, K and V; the attention value is computed as:

Attention(Q, K, V) = softmax(Q·K^T / √d_k) · V   (8)

wherein Attention(Q, K, V) is the attention value between the voice signal and the text data, d_k is a scaling factor, and the Softmax function is a normalized activation function mapping the output into the interval (0, 1);
in the Transformer model, Q represents the input sequence of the voice signal and K and V represent the input sequence of the text data, and a semantically processed high-level feature representation is obtained after the Transformer model;
step 3-2, constructing the second attention-based feature alignment layer, which uses one fully connected layer to integrate the feature results output by the multi-head attention mechanism and outputs the features obtained after aligning the high-level features of the voice signal with the high-level features of the text data;
and step 3-3, constructing the output layer with a forward fully connected neural network to check whether the semantics are consistent, taking the features obtained in step 3-2 as input, and outputting a classification result based on a binary probability distribution, representing the probability that the semantics are consistent and the probability that they are inconsistent.
2. The method according to claim 1, wherein in step 3-1, the Softmax function is defined as follows:

softmax(x_i) = e^{x_i} / Σ_j e^{x_j}   (9)
3. The method according to claim 2, wherein step 4 comprises: constructing a classification layer with a fully connected neural network, whose input is the probability distributions obtained from the two single-voice single-text multimodal fusion models and the dual-voice dual-text multimodal fusion model, and whose output, after a Softmax function, is a normalized probability distribution, namely the probability of consistency and the probability of inconsistency, this probability being the judgment result.
CN202110270332.1A 2021-03-12 2021-03-12 Multi-mode fusion-based control voice duplicate consistency verification method Active CN113053366B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110270332.1A CN113053366B (en) 2021-03-12 2021-03-12 Multi-mode fusion-based control voice duplicate consistency verification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110270332.1A CN113053366B (en) 2021-03-12 2021-03-12 Multi-mode fusion-based control voice duplicate consistency verification method

Publications (2)

Publication Number Publication Date
CN113053366A CN113053366A (en) 2021-06-29
CN113053366B true CN113053366B (en) 2023-11-21

Family

ID=76511988

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110270332.1A Active CN113053366B (en) 2021-03-12 2021-03-12 Multi-mode fusion-based control voice duplicate consistency verification method

Country Status (1)

Country Link
CN (1) CN113053366B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113627266B (en) * 2021-07-15 2023-08-18 武汉大学 Video pedestrian re-recognition method based on transform space-time modeling
CN114267345B (en) * 2022-02-25 2022-05-17 阿里巴巴达摩院(杭州)科技有限公司 Model training method, voice processing method and device
CN115062143A (en) * 2022-05-20 2022-09-16 青岛海尔电冰箱有限公司 Voice recognition and classification method, device, equipment, refrigerator and storage medium
CN114898871A (en) * 2022-07-14 2022-08-12 陕西省人民医院 Heart disease diagnosis research method based on artificial neural network
CN115810351B (en) * 2023-02-09 2023-04-25 四川大学 Voice recognition method and device for controller based on audio-visual fusion
CN116011505B (en) * 2023-03-15 2024-05-14 图灵人工智能研究院(南京)有限公司 Multi-module dynamic model training method and device based on feature comparison
CN116701568A (en) * 2023-05-09 2023-09-05 湖南工商大学 Short video emotion classification method and system based on 3D convolutional neural network

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108428447A (en) * 2018-06-19 2018-08-21 科大讯飞股份有限公司 A kind of speech intention recognition methods and device
WO2019204186A1 (en) * 2018-04-18 2019-10-24 Sony Interactive Entertainment Inc. Integrated understanding of user characteristics by multimodal processing
CN110827799A (en) * 2019-11-21 2020-02-21 百度在线网络技术(北京)有限公司 Method, apparatus, device and medium for processing voice signal
CN111274784A (en) * 2020-01-15 2020-06-12 中国民航大学 Automatic verification method for air-ground communication repeating semantics based on BilSTM-Attention
CN112287675A (en) * 2020-12-29 2021-01-29 南京新一代人工智能研究院有限公司 Intelligent customer service intention understanding method based on text and voice information fusion

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019204186A1 (en) * 2018-04-18 2019-10-24 Sony Interactive Entertainment Inc. Integrated understanding of user characteristics by multimodal processing
CN108428447A (en) * 2018-06-19 2018-08-21 科大讯飞股份有限公司 A kind of speech intention recognition methods and device
CN110827799A (en) * 2019-11-21 2020-02-21 百度在线网络技术(北京)有限公司 Method, apparatus, device and medium for processing voice signal
CN111274784A (en) * 2020-01-15 2020-06-12 中国民航大学 Automatic verification method for air-ground communication repeating semantics based on BilSTM-Attention
CN112287675A (en) * 2020-12-29 2021-01-29 南京新一代人工智能研究院有限公司 Intelligent customer service intention understanding method based on text and voice information fusion

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
A unified framework for multilingual speech recognition in air traffic control systems; Yi Lin et al.; IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 8; entire document *
Semantic consistency verification of air-ground communication based on deep CNN; Yang Jinfeng et al.; Journal of Civil Aviation University of China, no. 1; entire document *
Research on safety monitoring technology for air traffic control command based on deep learning; Yang Bo et al.; Proceedings of the First Annual Conference on Air Traffic Management System Technology; entire document *
A survey of multimodal sentiment analysis research; Zhang Yazhou et al.; Pattern Recognition and Artificial Intelligence, no. 5; entire document *

Also Published As

Publication number Publication date
CN113053366A (en) 2021-06-29

Similar Documents

Publication Publication Date Title
CN113053366B (en) Multi-mode fusion-based control voice duplicate consistency verification method
CN110321418B (en) Deep learning-based field, intention recognition and groove filling method
US11488586B1 (en) System for speech recognition text enhancement fusing multi-modal semantic invariance
CN114023316B (en) TCN-transducer-CTC-based end-to-end Chinese speech recognition method
CN111666381B (en) Task type question-answer interaction system oriented to intelligent control
CN111353029B (en) Semantic matching-based multi-turn spoken language understanding method
CN114973062A (en) Multi-modal emotion analysis method based on Transformer
CN110717341B (en) Method and device for constructing old-Chinese bilingual corpus with Thai as pivot
CN113160798B (en) Chinese civil aviation air traffic control voice recognition method and system
CN112101044B (en) Intention identification method and device and electronic equipment
CN113223509B (en) Fuzzy statement identification method and system applied to multi-person mixed scene
CN111785257B (en) Empty pipe voice recognition method and device for small amount of labeled samples
Liu et al. Turn-Taking Estimation Model Based on Joint Embedding of Lexical and Prosodic Contents.
CN114385802A (en) Common-emotion conversation generation method integrating theme prediction and emotion inference
CN115910066A (en) Intelligent dispatching command and operation system for regional power distribution network
CN115238029A (en) Construction method and device of power failure knowledge graph
CN111553157A (en) Entity replacement-based dialog intention identification method
CN114360584A (en) Phoneme-level-based speech emotion layered recognition method and system
CN114694255A (en) Sentence-level lip language identification method based on channel attention and time convolution network
CN113642862A (en) Method and system for identifying named entities of power grid dispatching instructions based on BERT-MBIGRU-CRF model
CN117591648A (en) Power grid customer service co-emotion dialogue reply generation method based on emotion fine perception
CN115359784B (en) Civil aviation land-air voice recognition model training method and system based on transfer learning
CN114238605B (en) Automatic conversation method and device for intelligent voice customer service robot
Prasad et al. Grammar Based Speaker Role Identification for Air Traffic Control Speech Recognition
CN115238048A (en) Quick interaction method for joint chart identification and slot filling

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant