CN113053366B - Multimodal fusion-based control voice readback consistency verification method


Info

Publication number
CN113053366B
CN113053366B (application CN202110270332.1A / CN202110270332A)
Authority
CN
China
Prior art keywords
voice
text
layer
input
feature
Prior art date
Legal status
Active
Application number
CN202110270332.1A
Other languages
Chinese (zh)
Other versions
CN113053366A (en)
Inventor
王煊
彭佳
蒋伟煜
徐秋程
丁辉
严勇杰
Current Assignee
CETC 28 Research Institute
Original Assignee
CETC 28 Research Institute
Priority date
Filing date
Publication date
Application filed by CETC 28 Research Institute filed Critical CETC 28 Research Institute
Priority to CN202110270332.1A priority Critical patent/CN113053366B/en
Publication of CN113053366A publication Critical patent/CN113053366A/en
Application granted granted Critical
Publication of CN113053366B publication Critical patent/CN113053366B/en
Legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/063: Speech recognition; creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/16: Speech classification or search using artificial neural networks
    • G10L 25/24: Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L 25/51: Speech or voice analysis techniques specially adapted for particular use, for comparison or discrimination
    • G10L 2015/0631: Creating reference templates; clustering


Abstract

The invention provides a control voice readback consistency verification method based on multimodal fusion. By verifying the semantic consistency between a control instruction and its readback, the method provides an assistive function: it automatically judges whether the pilot's readback is semantically consistent with the controller's instruction. Exploiting the complementary information carried by different modalities, the method uses the voice signal to compensate for information lost in the text data, a loss that usually stems from the limited accuracy of speech recognition. A highly reliable judgment result can therefore be obtained.

Description

Multimodal fusion-based control voice readback consistency verification method
Technical Field
The invention belongs to the technical field of air traffic control automation systems, and particularly relates to a multimodal fusion-based control voice readback consistency verification method.
Background
To ensure flight safety, air traffic controllers ("controllers") and pilots must understand each other's intent in a timely and accurate manner. In actual control work, after a controller issues a control instruction, the pilot must repeat the instruction back once its content is understood, a process known as readback. The accuracy of the readback is critical to aviation safety, yet judging it still depends on the controller alone, without the assistance of an automated system. Long hours of high-intensity work cause fatigue, making it difficult at times for the controller to judge whether the readback is semantically consistent with the control instruction, and accidents caused by inconsistent readbacks have occurred. Because both the control instruction and the readback are transmitted as voice, performing semantic consistency verification with artificial intelligence techniques such as speech recognition and natural language processing can help the controller detect inconsistent readbacks in time, improving the safety of air traffic operations and reducing controller workload.
Existing deep-learning-based readback consistency verification methods first convert the voice data into text with speech recognition and then judge the consistency of the control and readback texts with a deep neural network. Limited by the accuracy of speech recognition, such methods suffer from poor accuracy. Multimodal fusion exploits the complementary information carried by different modalities: the control voice signal compensates for deficiencies in the control text data, strengthening the automated system's semantic understanding of control instructions and improving the accuracy and reliability of the consistency check.
Disclosure of Invention
The purpose of the invention: aiming at the shortcomings of the prior art, the invention provides a control voice readback consistency check method based on multimodal fusion, comprising the following steps:
step 1, collecting control voice and readback voice data to form positive sample training data; generating incorrect readback voice according to the control voice to form negative sample training data; processing the collected control voice and readback voice training data with speech recognition technology (see Dong Yu and Li Deng, Analytical Deep Learning: Speech Recognition Practice, Publishing House of Electronics Industry, 2016) to generate text data, the text data comprising control text data and readback text data;
step 2, constructing a single-voice single-text multimodal fusion model, inputting the text data into the single-voice single-text multimodal fusion model, and outputting a probability distribution;
step 3, constructing a dual-voice dual-text multimodal fusion model, inputting the text data into the dual-voice dual-text multimodal fusion model, and outputting a probability distribution;
step 4, constructing a fully connected neural network classification model, inputting the probability distributions obtained in step 2 and step 3 into the fully connected neural network classification model, and outputting the control voice readback consistency check result.
In step 2, the single-voice single-text multimodal fusion model comprises a first high-level feature extraction layer, a first attention-based feature alignment layer, a first multimodal feature fusion layer and a first semantic consistency check layer.
In step 2, constructing the single-voice single-text multimodal fusion model specifically comprises:
Step 2-1, constructing the first high-level feature extraction layer to obtain high-level features: taking the collected control voice and readback voice training data as input and framing them, dividing control voice data of length n seconds into m frames so that each frame covers n/m seconds; applying a fast Fourier transform to each frame to convert the time-domain representation into a spectral representation, and then applying a Mel filter bank to obtain a sequence representation based on Mel-frequency cepstral coefficients, i.e. the low-level features of the voice signal;
performing word embedding on the text data: generating a word sequence by word segmentation, converting each word into a word vector with the Word2Vec method, and combining the vectors into a vector representation of the text data, i.e. the low-level features of the text data;
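By way of illustration, a minimal sketch of this low-level feature extraction, assuming librosa for the MFCC computation and gensim for Word2Vec; neither library, nor the 8 kHz sample rate, 40 MFCC coefficients, 128-dimensional embeddings or 50-token padding, is prescribed by the patent:

```python
import numpy as np
import librosa
from gensim.models import Word2Vec

def speech_low_level_features(wav_path, n_frames=10, n_mfcc=40):
    """Split an utterance into n_frames frames and compute one MFCC vector per frame."""
    signal, sr = librosa.load(wav_path, sr=8000)       # assumed 8 kHz radio audio
    frame_len = len(signal) // n_frames                # each frame covers n/m seconds
    feats = []
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        # librosa applies the FFT and Mel filter bank internally to produce MFCCs
        mfcc = librosa.feature.mfcc(y=frame, sr=sr, n_mfcc=n_mfcc)
        feats.append(mfcc.mean(axis=1))                # average within the frame
    return np.stack(feats)                             # low-level voice features, (n_frames, n_mfcc)

def text_low_level_features(tokenized_corpus, sentence, dim=128, max_len=50):
    """Word2Vec embeddings for a segmented sentence, zero-padded to a fixed length."""
    w2v = Word2Vec(tokenized_corpus, vector_size=dim, min_count=1)
    vecs = [w2v.wv[w] if w in w2v.wv else np.zeros(dim) for w in sentence]
    vecs += [np.zeros(dim)] * max(0, max_len - len(vecs))
    return np.stack(vecs[:max_len])                    # low-level text features, (max_len, dim)
```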
constructing a bidirectional long short-term memory (LSTM) network layer (see Sepp Hochreiter and Jürgen Schmidhuber, "Long short-term memory", Neural Computation, 9(8):1735-1780, 1997), which refines the low-level features of the voice signal and of the text data, respectively, into high-level features of the voice signal and of the text data;
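A sketch of this bidirectional LSTM refinement, assuming PyTorch (the patent does not name a framework); the hidden size l = 256 and the input sizes, which follow the sketch above, are illustrative:

```python
import torch
import torch.nn as nn

class HighLevelEncoder(nn.Module):
    """Bidirectional LSTM mapping low-level voice or text features to l-dimensional high-level features."""
    def __init__(self, in_dim, l=256):
        super().__init__()
        # l//2 hidden units per direction, so the concatenated output is l-dimensional
        self.bilstm = nn.LSTM(in_dim, l // 2, batch_first=True, bidirectional=True)

    def forward(self, x):               # x: (batch, seq_len, in_dim)
        out, _ = self.bilstm(x)         # out: (batch, seq_len, l)
        return out

speech_encoder = HighLevelEncoder(in_dim=40)    # E_S in R^{10 x 40}  -> E'_S in R^{10 x 256}
text_encoder = HighLevelEncoder(in_dim=128)     # E_T in R^{50 x 128} -> E'_T in R^{50 x 256}
```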
Step 2-2, constructing the first attention-based feature alignment layer, which uses one fully connected layer to compare the voice features and text features generated by the bidirectional LSTM layer and obtain the distribution of attention values between them: let the processed high-level voice features and high-level text features be E'_S ∈ R^{m_S×l} and E'_T ∈ R^{m_T×l}, where R is the set of real numbers, m_S and m_T denote the lengths of the voice and text feature sequences, and l denotes the feature dimension; the attention values computed by the fully connected layer are:

a_ij = softmax(E'_S · E'_T^T)   (1)

wherein a_ij represents the similarity between the i-th voice frame and the j-th word of the text; the attention value distribution is used to weight the voice features, realizing the alignment operation, with output features a_ij · E'_S.
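A sketch of the alignment in equation (1) under one plausible reading, assuming PyTorch: the dot-product scores are normalized over the text positions and used to pool the voice frames onto each text token, and the aligned voice features are then concatenated with the text features along the feature axis; the patent's fully connected scoring layer and exact concatenation scheme may differ:

```python
import torch
import torch.nn.functional as F

def align_voice_to_text(E_s, E_t):
    """Equation (1): attention between every voice frame and every text token.

    E_s: (m_S, l) high-level voice features; E_t: (m_T, l) high-level text features.
    Returns the attention map A (m_S, m_T) and the attention-weighted voice features.
    """
    scores = E_s @ E_t.T                    # a_ij before normalization
    A = F.softmax(scores, dim=-1)           # normalize over the text positions
    weighted_voice = A.T @ E_s              # (m_T, l): voice evidence aligned to each token
    return A, weighted_voice

E_s = torch.randn(10, 256)                  # E'_S: 10 voice frames
E_t = torch.randn(50, 256)                  # E'_T: 50 text tokens
A, aligned = align_voice_to_text(E_s, E_t)
fused_input = torch.cat([E_t, aligned], dim=-1)   # (50, 512), fed to the fusion BiLSTM
```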
Step 2-3, inputting the output weighted features into a bidirectional LSTM layer, and concatenating the text high-level features obtained after processing by the bidirectional LSTM layer with the weight-aligned voice high-level features obtained in step 2-2 to obtain the concatenation E = [E'_T, a_ij · E'_S]; taking E as the input of the model and outputting the high-level features obtained after fusing the two modality data;
Step 2-4, constructing a forward fully connected neural network as the output layer and checking semantic consistency, namely:

y = softmax(W·E + b)   (2)

wherein y ∈ R^{1×2} represents the output judgment result, i.e. the probability distribution over consistent and inconsistent, W ∈ R^{l×2} is the weight of the fully connected layer and b ∈ R^{1×2} is its bias parameter; the high-level features output in step 2-3 are taken as the input of this layer, and a classification result based on a binary probability distribution is output, representing the probability that the semantics are consistent and the probability that they are inconsistent.
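A sketch of the fusion BiLSTM and the output layer of equation (2), assuming PyTorch; mean-pooling the fused sequence into the fixed-size vector E is an assumption, since the patent does not state how the sequence is reduced before the softmax layer:

```python
import torch
import torch.nn as nn

class ConsistencyHead(nn.Module):
    """Fusion BiLSTM followed by a forward fully connected layer, y = softmax(W*E + b)."""
    def __init__(self, in_dim=512, l=256):
        super().__init__()
        self.fusion = nn.LSTM(in_dim, l // 2, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(l, 2)           # W in R^{l x 2}, b in R^{1 x 2}

    def forward(self, fused_seq):           # fused_seq: (batch, seq_len, in_dim)
        out, _ = self.fusion(fused_seq)
        E = out.mean(dim=1)                 # pool the fused sequence into E
        return torch.softmax(self.fc(E), dim=-1)   # probabilities: consistent / inconsistent

head = ConsistencyHead()
y = head(torch.randn(1, 50, 512))           # y in R^{1 x 2}
```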
In step 3, the dual-voice dual-text multimodal fusion model comprises a second high-level feature extraction layer, a second attention-based feature alignment layer, a second multimodal feature fusion layer and an output layer.
In step 3, constructing the dual-voice dual-text multimodal fusion model specifically comprises:
Step 3-1, constructing the second high-level feature extraction layer to obtain high-level feature representations: taking the collected control voice and readback voice training data as input and framing the voice signal, dividing control voice data of length n seconds into m frames so that each frame covers n/m seconds; applying a fast Fourier transform to each frame to convert the time-domain representation into a spectral representation, and then applying a Mel filter bank to obtain a sequence representation based on Mel-frequency cepstral coefficients, i.e. the low-level features of the voice signal; concatenating the control voice data with the readback voice data and the control text data with the readback text data to form the voice input and the text input, respectively;
Because the Transformer model has strong feature extraction capability and a strong capacity for fusing multimodal data features, the Transformer is selected as the feature extraction layer. A Transformer model is constructed to process the voice input and the text input separately; the input of the Transformer model is the vector sequence of the voice signal and of the text data, and the position of each feature vector is represented by a position encoding given by:

PE(pos, 2i) = sin(pos / 10000^{2i/d_model})   (3)
PE(pos, 2i+1) = cos(pos / 10000^{2i/d_model})   (4)

wherein PE(pos) represents the position encoding of the feature vector at position pos, PE(pos, 2i) is the sine component and PE(pos, 2i+1) is the cosine component; d_model = 512 is the encoding dimension, the even-numbered dimensions are obtained with formula (3) and the odd-numbered dimensions with formula (4), the position encoding has shape (l, 512) with l denoting the length of the input sequence, and the position encoding is added to the input vector sequence to obtain the input of the Transformer model;
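A sketch of the sinusoidal position encoding of equations (3) and (4) with d_model = 512, assuming PyTorch; the sequence length in the usage line is illustrative:

```python
import torch

def positional_encoding(seq_len, d_model=512):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)    # (seq_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)             # even dimension indices
    angle = pos / torch.pow(torch.tensor(10000.0), i / d_model)      # (seq_len, d_model/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)          # even dimensions: sine component
    pe[:, 1::2] = torch.cos(angle)          # odd dimensions: cosine component
    return pe

x = torch.randn(120, 512)                   # projected voice or text input sequence
x = x + positional_encoding(x.size(0))      # Transformer input with position information
```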
The Transformer model uses multi-head attention to compute attention values between the feature vectors of the input sequences and uses these attention values to improve the vector representation of the input. Specifically, the multi-head attention consists of h scaled dot-product attention heads, and the feature vectors of the input are transformed by

X × W_Q = Q   (5)
X × W_K = K   (6)
X × W_V = V   (7)

where X denotes the features of the input sequence, Q, K and V denote the query, key and value vectors, K and V form the key-value pairs, and W_Q, W_K and W_V are the parameter matrices that produce Q, K and V. The attention value Attention(Q, K, V) is computed as

Attention(Q, K, V) = softmax(Q·K^T / √d_k) · V   (8)

where Attention(Q, K, V) is the attention value between the voice signal and the text data, d_k is a scaling factor (the dimension of the key vectors), and the Softmax function is a normalized activation function that maps the outputs into the interval (0, 1);
In the Transformer model, Q is taken from the input sequence of the voice signal while K and V are taken from the input sequence of the text data; after the Transformer model, a semantically processed high-level feature representation is obtained;
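A sketch of equations (5) to (8), assuming PyTorch: Q is projected from the voice sequence and K, V from the text sequence, as described above; a full multi-head layer (h = 8 heads in the detailed description) would repeat this in parallel over projected sub-spaces with learned projection matrices rather than the random W matrices used here:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # similarity between voice and text vectors
    return F.softmax(scores, dim=-1) @ V

d_model = 512
X_voice = torch.randn(20, d_model)          # concatenated control + readback voice features
X_text = torch.randn(100, d_model)          # concatenated control + readback text features
W_q, W_k, W_v = (torch.randn(d_model, d_model) for _ in range(3))
Q, K, V = X_voice @ W_q, X_text @ W_k, X_text @ W_v     # equations (5)-(7)
fused = scaled_dot_product_attention(Q, K, V)           # (20, 512) text-informed voice features
```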
Step 3-2, constructing the second attention-based feature alignment layer, which uses one fully connected layer to integrate the feature outputs of the multi-head attention mechanism and outputs the features obtained after aligning the high-level features of the voice signal with the high-level features of the text data;
Step 3-3, constructing the output layer with a forward fully connected neural network to check whether the semantics are consistent, taking the features obtained in step 3-2 as input and outputting a classification result based on a binary probability distribution, representing the probability that the semantics are consistent and the probability that they are inconsistent.
In step 3-1, the Softmax function is defined as

softmax(x_i) = e^{x_i} / Σ_j e^{x_j}   (9)

where x_i is the i-th element of the input vector.
Step 4 comprises: constructing a classification layer with a fully connected neural network, whose input is the probability distributions obtained from the two single-voice single-text multimodal fusion models (one taking the control voice and the readback text as input, the other taking the control text and the readback voice as input; they share the single-voice single-text structure but differ in their inputs) and from the dual-voice dual-text multimodal fusion model, and whose output is passed through a Softmax function to obtain a normalized probability distribution, namely the probability that the readback is consistent and the probability that it is inconsistent; this probability is the judgment result.
Beneficial effects: the method realizes an intelligent check of readback semantic consistency. Exploiting the complementary information carried by different modalities, it uses the voice signal to compensate for information missing from the text data, a loss usually caused by the limited accuracy of speech recognition, so a highly reliable judgment result can be obtained.
Drawings
The foregoing and other advantages of the invention will become more apparent from the following detailed description of the invention when taken in conjunction with the accompanying drawings.
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a block diagram of a single-voice single-text multimodal fusion model.
FIG. 3 is a block diagram of the dual-voice dual-text multimodal fusion model.
Fig. 4 is a diagram of a bidirectional LSTM model structure.
Fig. 5 is a schematic diagram of attention alignment weighting.
FIG. 6 is a structural diagram of the Transformer model.
Detailed Description
The invention provides a control voice readback consistency check method based on multimodal fusion; the flow is shown in FIG. 1 and specifically comprises the following steps:
Step 1: collect control voice and readback voice data to form positive sample pairs, and have professionals generate incorrect readback voice from the control voice to form negative sample pairs. The voice data is processed with speech recognition technology to generate text data, and the deep neural network models are trained with these data to obtain trained models. Since the actual consistency check proceeds in the same way as training, the composition of the models and the judgment (training) process are described together below.
Example: the following control voice is collected: "Eastern three nine eight four, runway three five, cleared for takeoff", together with the readback voice: "Runway three five, cleared for takeoff, Eastern three nine eight four". A professional then generates an incorrect readback, e.g. "Runway three five, holding for takeoff, Eastern three nine eight four", producing a negative sample pair. These voice data are converted into text, forming the voice and text multimodal training data.
Step 2: construct a single-voice single-text multimodal fusion model (see FIG. 2; cf. Haiyang Xu, Hui Zhang, Kun Han, Yun Wang, Yiping Peng, Xiangang Li, "Learning Alignment for Multimodal Emotion Recognition from Speech", Interspeech 2019, https://arxiv.org/abs/1909.05645). Its role is to judge the consistency of the control voice with the readback text, or of the readback voice with the control text. The model consists of four parts: 1) a high-level feature extraction layer; 2) an attention-based feature alignment layer; 3) a multimodal feature fusion layer; 4) a semantic consistency check layer.
1) Frame the input voice signal: control voice data of length n seconds is divided into m frames, so each frame covers n/m seconds. Apply a fast Fourier transform (FFT) to each frame to convert the time-domain representation into a spectral representation, then apply a Mel filter bank to obtain a sequence representation based on Mel-frequency cepstral coefficients (MFCC), i.e. the low-level features of the voice signal.
Suppose the input voice data lasts 20 seconds and each frame is 2 seconds long; the voice data is then divided into an input sequence of 10 frames, which after processing gives the input E_S ∈ R^{10×l_S}, where l_S denotes the feature dimension.
Perform word embedding on the input text data: generate a word sequence by word segmentation, convert each word into a word vector with the Word2Vec method, and combine the vectors into a vector representation of the text data, i.e. the low-level features of the text data.
Taking "Eastern three nine eight four, runway three five, cleared for takeoff" as an example, the sentence contains 16 tokens; a unified token count, e.g. 50 tokens, is adopted as the length of the input text, and the text is padded to this length, producing the input features E_T ∈ R^{50×l_T}, where l_T denotes the feature dimension.
Construct separate bidirectional LSTM model layers (see FIG. 4), which refine the low-level features of the input voice and text, respectively, into high-level features. After the bidirectional LSTM processing, E_S and E_T are converted into high-level features of the same dimension, E'_S ∈ R^{10×l} and E'_T ∈ R^{50×l}, where l denotes the high-level feature dimension; once the voice and text features share the same dimension they can be concatenated.
2) Construct the attention layer (see FIG. 5). The attention layer uses one fully connected layer to compare the input voice features with the text features and obtain the distribution of attention values between them; this distribution is used to weight the voice features, realizing the alignment with the text features, and the weight-aligned voice features, of shape R^{10×l}, are output.
3) Construct the multimodal feature fusion layer with a bidirectional LSTM model: concatenate the text features with the weight-aligned voice features, feed them into the bidirectional LSTM of the fusion layer, and output the fused features.
4) Construct the semantic consistency check layer with a forward fully connected neural network. It takes the fused features as input and outputs a binary probability distribution, namely the probability that the semantics are consistent and the probability that they are inconsistent; if the input voice-text data pair is consistent, the probability output for label 1 should be larger than that for label 0, and vice versa.
Step 3: construct a dual-voice dual-text multimodal fusion model (see FIG. 3), whose role is to perform the semantic consistency check using four kinds of data simultaneously: control voice, readback voice, control text and readback text. The model consists of four parts: 1) a high-level feature extraction layer; 2) an attention-based feature alignment layer; 3) a multimodal feature fusion layer; 4) a semantic consistency check layer.
For this step, the input is the concatenation of the control and readback voice or text, e.g.: "Eastern three nine eight four, runway three five, cleared for takeoff; runway three five, cleared for takeoff, Eastern three nine eight four", or: "Eastern three nine eight four, runway three five, cleared for takeoff; runway three five, holding for takeoff, Eastern three nine eight four". Correspondingly, the input feature dimensions become E_S ∈ R^{20×l_S} and E_T ∈ R^{100×l_T}: the lengths of the input voice and text are each twice those of the single inputs, and the ";" symbol is used to separate the two parts of the representation.
1) The low-level feature extraction for the voice signal and the text is the same as in step 2. After the low-level features are obtained, the control voice data is concatenated with the readback voice data and the control text data with the readback text data, forming the voice input and the text input, respectively.
Construct a Transformer model (see FIG. 6) to process the concatenated voice input and text input separately, so as to discover the degree of semantic correlation between the control voice (text) and the readback voice (text); this semantic correlation is used to improve the feature representation so that it carries stronger semantic relevance. The input feature vector sequence of the Transformer model represents the position of each feature vector with a position encoding:

PE(pos, 2i) = sin(pos / 10000^{2i/d_model})   (1)
PE(pos, 2i+1) = cos(pos / 10000^{2i/d_model})   (2)

where PE(pos) denotes the position encoding of the feature vector at position pos and d_model = 512 is the encoding dimension, equal to the dimension of the input feature vectors; the sine term (1) gives the even-numbered dimensions and the cosine term (2) the odd-numbered dimensions. The position encoding has shape (l, 512), l being the length of the input sequence, and it is added to the input to form the input of the Transformer model.
The Transformer model computes attention values between the feature vectors of the input with multi-head attention, uses these attention values to improve the vector representation of the input, and thereby strengthens its ability to extract semantic features. The multi-head attention consists of h = 8 scaled dot-product attention heads, and the feature vectors of the input are transformed by

X × W_Q = Q   (3)
X × W_K = K   (4)
X × W_V = V   (5)

giving Q, K and V, which denote the query, key and value vectors respectively; W_Q, W_K and W_V are the corresponding transformation matrices. The scaled dot-product attention is computed as

Attention(Q, K, V) = softmax(Q·K^T / √d_k) · V   (6)

where d_k is a scaling factor and the Softmax function is a normalized activation function that maps the outputs of multiple neurons into the interval (0, 1). The Softmax function is defined as

softmax(x_i) = e^{x_i} / Σ_j e^{x_j}   (7)

After the Transformer model, a semantically processed high-level feature representation is obtained.
2) Construct the attention layer (see FIG. 5). The attention layer uses one fully connected layer to compare the input voice features with the text features and obtain the distribution of attention values between them; this distribution is used to weight the voice features, realizing the alignment with the text features, and the weight-aligned voice features are output.
3) Construct the multimodal fusion layer with a bidirectional LSTM model: concatenate the text features with the aligned voice features, feed them into the bidirectional LSTM of the fusion layer, and output the fused features.
4) Construct the semantic consistency check layer with a forward fully connected neural network; it takes the fused features as input and outputs a binary probability distribution, namely the probability that the semantics are consistent and the probability that they are inconsistent.
Step 4: construct a simple fully connected neural network classification model. Its input is the probability distributions obtained from the three models (the two single-voice single-text multimodal fusion models and the dual-voice dual-text multimodal fusion model); its output, after a Softmax function (see formula (7)), is a normalized probability distribution representing the probability of consistency and the probability of inconsistency, which is the judgment result. For example, if the normalized probability of consistency is 0.76 and the probability of inconsistency is 0.24, the control voice and readback voice data are judged to be consistent.
Each of the three classifiers outputs the probability that the sample is consistent or inconsistent, i.e. y ∈ R^{1×2}; these outputs are concatenated to form X ∈ R^{1×6} and fed into the fully connected neural network classification model, which produces the final result Y ∈ R^{1×2}, i.e. the classification result.
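A sketch of this result-integration step, assuming PyTorch: the three 2-dimensional probability outputs are concatenated into X ∈ R^{1×6} and passed through a fully connected layer with a softmax to produce the final verdict Y ∈ R^{1×2}; the example probabilities are arbitrary:

```python
import torch
import torch.nn as nn

class ResultFusion(nn.Module):
    """Concatenates the three classifiers' outputs (each in R^{1x2}) and re-classifies them."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(6, 2)

    def forward(self, y_voice_text, y_text_voice, y_dual):
        X = torch.cat([y_voice_text, y_text_voice, y_dual], dim=-1)   # X in R^{1x6}
        return torch.softmax(self.fc(X), dim=-1)                      # Y in R^{1x2}

fusion = ResultFusion()
# outputs of the two single-voice single-text models and the dual-voice dual-text model
Y = fusion(torch.tensor([[0.7, 0.3]]), torch.tensor([[0.8, 0.2]]), torch.tensor([[0.75, 0.25]]))
consistent = Y[0, 0] > Y[0, 1]              # final readback consistency verdict
```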
As shown in FIG. 1, the method requires preparing sample data, training the models, and judging whether the readback is consistent; for convenience, the description is divided into three steps.
Step one: prepare control voice data and readback data and label them manually. The manual labelling requires professional controllers to generate negative samples according to readback errors that may occur in actual work; control texts and readback texts are then generated with speech recognition technology, and the samples are labelled: samples with a consistent readback are labelled with output 1, and samples with an inconsistent readback are labelled with output 0.
Step two: from the four kinds of sample data, form two types of pairs: 1) control voice signal - readback text data; 2) readback voice signal - control text data. The contents of these pairs must correspond one to one according to the labelling result, and they are used to train the single-voice single-text multimodal fusion models. The voice data are also concatenated back to back, and the corresponding text data likewise, to generate sample data used to train the dual-voice dual-text multimodal fusion model.
Train the multimodal fusion models with the labelled sample data to obtain the classification models; then train the forward fully connected model on their classification results so that it outputs the readback consistency judgment (1 or 0). Training yields the result-integration model.
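A minimal end-to-end training sketch for step two, assuming PyTorch and a cross-entropy loss on the 1/0 consistency labels; the DummyFusionModel modules and the pooled 64-dimensional features are hypothetical placeholders standing in for the fusion models sketched earlier:

```python
import torch
import torch.nn as nn

class DummyFusionModel(nn.Module):
    """Placeholder for a multimodal fusion model: concatenates two pooled inputs and classifies."""
    def __init__(self, feat_dim):
        super().__init__()
        self.fc = nn.Linear(2 * feat_dim, 2)
    def forward(self, a, b):                          # a, b: (batch, feat_dim)
        return torch.softmax(self.fc(torch.cat([a, b], dim=-1)), dim=-1)

single_a, single_b = DummyFusionModel(64), DummyFusionModel(64)
dual = DummyFusionModel(128)                          # takes concatenated control+readback inputs
fusion = nn.Linear(6, 2)                              # result-integration model of step 4
criterion = nn.CrossEntropyLoss()
params = [p for m in (single_a, single_b, dual, fusion) for p in m.parameters()]
optimizer = torch.optim.Adam(params, lr=1e-4)

# one synthetic training step: random pooled features, label 1 = consistent readback
v_ctrl, t_ctrl, v_read, t_read = (torch.randn(4, 64) for _ in range(4))
label = torch.ones(4, dtype=torch.long)
y = torch.cat([single_a(v_ctrl, t_read),              # control voice  + readback text
               single_b(v_read, t_ctrl),              # readback voice + control text
               dual(torch.cat([v_ctrl, v_read], dim=-1),
                    torch.cat([t_ctrl, t_read], dim=-1))], dim=-1)
loss = criterion(fusion(y), label)                    # cross-entropy on the consistency label
optimizer.zero_grad()
loss.backward()
optimizer.step()
```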
Step three: in actual control work, collect the control voice issued by the controller and the voice signal of the pilot's readback, obtain the corresponding text data with speech recognition, and analyse the data with the trained classification models to obtain the readback consistency check result.
The invention provides a control voice readback consistency verification method based on multimodal fusion. There are many ways and approaches to implement this technical solution; the above is only a preferred embodiment of the invention, and it should be noted that those skilled in the art can make a number of improvements and modifications without departing from the principle of the invention, and such improvements and modifications are also considered to fall within the protection scope of the invention. Components not explicitly described in this embodiment can be implemented with existing technology.

Claims (3)

1. A multimodal fusion-based control voice readback consistency check method, characterized by comprising the following steps:
step 1, collecting control voice and readback voice data to form positive sample training data; generating incorrect readback voice according to the control voice to form negative sample training data; processing the collected control voice and readback voice training data with speech recognition technology to generate text data, wherein the text data comprises control text data and readback text data;
step 2, constructing a single-voice single-text multimodal fusion model, inputting the text data into the single-voice single-text multimodal fusion model, and outputting a probability distribution;
step 3, constructing a dual-voice dual-text multimodal fusion model, inputting the text data into the dual-voice dual-text multimodal fusion model, and outputting a probability distribution;
step 4, constructing a fully connected neural network classification model, inputting the probability distributions obtained in step 2 and step 3 into the fully connected neural network classification model, and outputting the control voice readback consistency check result;
in step 2, the single-voice single-text multimodal fusion model comprises a first high-level feature extraction layer, a first attention-based feature alignment layer, a first multimodal feature fusion layer and a first semantic consistency check layer;
in step 2, constructing the single-voice single-text multimodal fusion model specifically comprises:
step 2-1, constructing the first high-level feature extraction layer to obtain high-level features: taking the collected control voice and readback voice training data as input and framing them, dividing control voice data of length n seconds into m frames so that each frame covers n/m seconds; applying a fast Fourier transform to each frame to convert the time-domain representation into a spectral representation, and then applying a Mel filter bank to obtain a sequence representation based on Mel-frequency cepstral coefficients, i.e. the low-level features of the voice signal;
performing word embedding on the text data, generating a word sequence by word segmentation, converting each word into a word vector with the Word2Vec method, and combining the vectors into a vector representation of the text data, i.e. the low-level features of the text data;
constructing a bidirectional long short-term memory (LSTM) network layer, which refines the low-level features of the voice signal and the low-level features of the text data, respectively, into high-level features of the voice signal and high-level features of the text data;
step 2-2, constructing the first attention-based feature alignment layer, which uses one fully connected layer to compare the voice features and text features generated by the bidirectional LSTM layer and obtain the distribution of attention values between them: let the processed high-level voice features and high-level text features be E'_S ∈ R^{m_S×l} and E'_T ∈ R^{m_T×l}, where R is the set of real numbers, m_S and m_T denote the lengths of the voice and text feature sequences, and l denotes the feature dimension; the attention values computed by the fully connected layer are:

a_ij = softmax(E'_S · E'_T^T)   (1)
wherein a_ij represents the similarity between the i-th voice frame and the j-th word of the text; the attention value distribution is used to weight the voice features, realizing the alignment operation, with output features a_ij · E'_S;
step 2-3, inputting the output weighted features into a bidirectional LSTM layer, and concatenating the text high-level features obtained after processing by the bidirectional LSTM layer with the weight-aligned voice high-level features obtained in step 2-2 to obtain the concatenation E = [E'_T, a_ij · E'_S]; taking E as the input of the model and outputting the high-level features obtained after fusing the two modality data;
step 2-4, constructing a forward fully-connected neural network as an output layer, and checking semantic consistency, namely:
y=softmax(W·E+b) (2)
wherein y ∈ R^{1×2} represents the output judgment result, i.e. the probability distribution over consistent and inconsistent, W ∈ R^{l×2} is the weight of the fully connected layer and b ∈ R^{1×2} is its bias parameter; the high-level features output in step 2-3 are taken as the input of this layer, and a classification result based on a binary probability distribution is output, representing the probability that the semantics are consistent and the probability that they are inconsistent;
in step 3, the dual-voice dual-text multimodal fusion model comprises a second high-level feature extraction layer, a second attention-based feature alignment layer, a second multimodal feature fusion layer and an output layer;
in step 3, constructing the dual-voice dual-text multimodal fusion model specifically comprises:
step 3-1, constructing the second high-level feature extraction layer to obtain high-level feature representations: taking the collected control voice and readback voice training data as input and framing the input voice signal, dividing control voice data of length n seconds into m frames so that each frame covers n/m seconds; applying a fast Fourier transform to each frame to convert the time-domain representation into a spectral representation, and then applying a Mel filter bank to obtain a sequence representation based on Mel-frequency cepstral coefficients, i.e. the low-level features of the voice signal; concatenating the control voice data with the readback voice data and the control text data with the readback text data to form the voice input and the text input, respectively;
constructing a Transformer model to process the voice input and the text input separately, wherein the input of the Transformer model is the vector sequence of the voice signal and of the text data, and the position of each feature vector is represented by a position encoding given by:

PE(pos, 2i) = sin(pos / 10000^{2i/d_model})   (3)
PE(pos, 2i+1) = cos(pos / 10000^{2i/d_model})   (4)

wherein PE(pos) represents the position encoding of the feature vector at position pos, PE(pos, 2i) is the sine component and PE(pos, 2i+1) is the cosine component; d_model = 512 is the encoding dimension, the even-numbered dimensions of the position encoding are obtained with formula (3) and the odd-numbered dimensions with formula (4), the position encoding has shape (l, 512) with l denoting the length of the input sequence, and the position encoding is added to the input vector sequence to obtain the input of the Transformer model;
the Transformer model uses multi-head attention to compute attention values between the feature vectors of the input, and uses the attention values to improve the vector representation of the input, specifically comprising: the multi-head attention comprises h scaled dot-product attention heads, and the feature vectors of the input are calculated by the following formulas:
X × W_Q = Q   (5)
X × W_K = K   (6)
X × W_V = V   (7)
wherein X represents the features of the input sequence, Q, K and V represent the query vector, key vector and value vector respectively, K and V form the key-value pairs, and W_Q, W_K, W_V are the parameter matrices producing Q, K and V; the attention value is computed as:

Attention(Q, K, V) = softmax(Q·K^T / √d_k) · V   (8)

wherein Attention(Q, K, V) is the attention value between the voice signal and the text data, d_k is a scaling factor, and the Softmax function is a normalized activation function mapping the output into the interval (0, 1);
in the Transformer model, Q represents the input sequence of the voice signal and K and V represent the input sequence of the text data, and a semantically processed high-level feature representation is obtained after the Transformer model;
step 3-2, constructing the second attention-based feature alignment layer, which uses one fully connected layer to integrate the feature results output by the multi-head attention mechanism and outputs the features obtained after aligning the high-level features of the voice signal with the high-level features of the text data;
and step 3-3, constructing the output layer with a forward fully connected neural network to check whether the semantics are consistent, taking the features obtained in step 3-2 as input, and outputting a classification result based on a binary probability distribution, representing the probability that the semantics are consistent and the probability that they are inconsistent.
2. The method according to claim 1, wherein in step 3-1, the Softmax function is defined as follows:

softmax(x_i) = e^{x_i} / Σ_j e^{x_j}   (9)
3. The method according to claim 2, wherein step 4 comprises: constructing a classification layer with a fully connected neural network, whose input is the probability distributions obtained from the two single-voice single-text multimodal fusion models and the dual-voice dual-text multimodal fusion model, and whose output, after a Softmax function, is a normalized probability distribution, namely the probability of consistency and the probability of inconsistency, this probability being the judgment result.
CN202110270332.1A 2021-03-12 2021-03-12 Multi-mode fusion-based control voice duplicate consistency verification method Active CN113053366B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110270332.1A CN113053366B (en) 2021-03-12 2021-03-12 Multi-mode fusion-based control voice duplicate consistency verification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110270332.1A CN113053366B (en) 2021-03-12 2021-03-12 Multi-mode fusion-based control voice duplicate consistency verification method

Publications (2)

Publication Number Publication Date
CN113053366A CN113053366A (en) 2021-06-29
CN113053366B true CN113053366B (en) 2023-11-21

Family

ID=76511988

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110270332.1A Active CN113053366B (en) 2021-03-12 2021-03-12 Multi-mode fusion-based control voice duplicate consistency verification method

Country Status (1)

Country Link
CN (1) CN113053366B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113627266B (en) * 2021-07-15 2023-08-18 武汉大学 Video pedestrian re-recognition method based on transform space-time modeling
CN114267345B (en) * 2022-02-25 2022-05-17 阿里巴巴达摩院(杭州)科技有限公司 Model training method, voice processing method and device
CN115062143A (en) * 2022-05-20 2022-09-16 青岛海尔电冰箱有限公司 Voice recognition and classification method, device, equipment, refrigerator and storage medium
CN114898871A (en) * 2022-07-14 2022-08-12 陕西省人民医院 Heart disease diagnosis research method based on artificial neural network
CN115810351B (en) * 2023-02-09 2023-04-25 四川大学 Voice recognition method and device for controller based on audio-visual fusion
CN116011505B (en) * 2023-03-15 2024-05-14 图灵人工智能研究院(南京)有限公司 Multi-module dynamic model training method and device based on feature comparison
CN116701568A (en) * 2023-05-09 2023-09-05 湖南工商大学 Short video emotion classification method and system based on 3D convolutional neural network

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108428447A (en) * 2018-06-19 2018-08-21 科大讯飞股份有限公司 A kind of speech intention recognition methods and device
WO2019204186A1 (en) * 2018-04-18 2019-10-24 Sony Interactive Entertainment Inc. Integrated understanding of user characteristics by multimodal processing
CN110827799A (en) * 2019-11-21 2020-02-21 百度在线网络技术(北京)有限公司 Method, apparatus, device and medium for processing voice signal
CN111274784A (en) * 2020-01-15 2020-06-12 中国民航大学 Automatic verification method for air-ground communication repeating semantics based on BilSTM-Attention
CN112287675A (en) * 2020-12-29 2021-01-29 南京新一代人工智能研究院有限公司 Intelligent customer service intention understanding method based on text and voice information fusion

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019204186A1 (en) * 2018-04-18 2019-10-24 Sony Interactive Entertainment Inc. Integrated understanding of user characteristics by multimodal processing
CN108428447A (en) * 2018-06-19 2018-08-21 科大讯飞股份有限公司 A kind of speech intention recognition methods and device
CN110827799A (en) * 2019-11-21 2020-02-21 百度在线网络技术(北京)有限公司 Method, apparatus, device and medium for processing voice signal
CN111274784A (en) * 2020-01-15 2020-06-12 中国民航大学 Automatic verification method for air-ground communication repeating semantics based on BilSTM-Attention
CN112287675A (en) * 2020-12-29 2021-01-29 南京新一代人工智能研究院有限公司 Intelligent customer service intention understanding method based on text and voice information fusion

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
A unified framework for multilingual speech recognition in air traffic control systems; Yi Lin et al.; IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 8; entire document *
Semantic consistency verification of air-ground communication based on deep CNN; Yang Jinfeng et al.; Journal of Civil Aviation University of China, no. 1; entire document *
Research on safety monitoring technology for air traffic control command based on deep learning; Yang Bo et al.; Proceedings of the First Annual Conference on Air Traffic Management System Technology; entire document *
A survey of multimodal sentiment analysis research; Zhang Yazhou et al.; Pattern Recognition and Artificial Intelligence, no. 5; entire document *

Also Published As

Publication number Publication date
CN113053366A (en) 2021-06-29

Similar Documents

Publication Publication Date Title
CN113053366B (en) Multi-mode fusion-based control voice duplicate consistency verification method
CN110321418B (en) Deep learning-based field, intention recognition and groove filling method
US11488586B1 (en) System for speech recognition text enhancement fusing multi-modal semantic invariance
CN114023316B (en) TCN-transducer-CTC-based end-to-end Chinese speech recognition method
CN111666381B (en) Task type question-answer interaction system oriented to intelligent control
CN111353029B (en) Semantic matching-based multi-turn spoken language understanding method
CN114973062A (en) Multi-modal emotion analysis method based on Transformer
CN110717341B (en) Method and device for constructing old-Chinese bilingual corpus with Thai as pivot
CN113160798B (en) Chinese civil aviation air traffic control voice recognition method and system
CN112101044B (en) Intention identification method and device and electronic equipment
CN113223509B (en) Fuzzy statement identification method and system applied to multi-person mixed scene
CN111785257B (en) Empty pipe voice recognition method and device for small amount of labeled samples
Liu et al. Turn-Taking Estimation Model Based on Joint Embedding of Lexical and Prosodic Contents.
CN114385802A (en) Common-emotion conversation generation method integrating theme prediction and emotion inference
CN115910066A (en) Intelligent dispatching command and operation system for regional power distribution network
CN115238029A (en) Construction method and device of power failure knowledge graph
CN111553157A (en) Entity replacement-based dialog intention identification method
CN114360584A (en) Phoneme-level-based speech emotion layered recognition method and system
CN114694255A (en) Sentence-level lip language identification method based on channel attention and time convolution network
CN113642862A (en) Method and system for identifying named entities of power grid dispatching instructions based on BERT-MBIGRU-CRF model
CN117591648A (en) Power grid customer service co-emotion dialogue reply generation method based on emotion fine perception
CN115359784B (en) Civil aviation land-air voice recognition model training method and system based on transfer learning
CN114238605B (en) Automatic conversation method and device for intelligent voice customer service robot
Prasad et al. Grammar Based Speaker Role Identification for Air Traffic Control Speech Recognition
CN115238048A (en) Quick interaction method for joint chart identification and slot filling

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant