CN117912455A - Land-air communication voice conversion method and device, terminal equipment and storage medium - Google Patents

Land-air communication voice conversion method and device, terminal equipment and storage medium Download PDF

Info

Publication number
CN117912455A
Authority
CN
China
Prior art keywords
voice
text
data
land
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311399415.6A
Other languages
Chinese (zh)
Inventor
任晋
葛淑婷
师一华
江学锋
杨顺志
杨金锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Vocational And Technical University
Original Assignee
Shenzhen Vocational And Technical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Vocational And Technical University filed Critical Shenzhen Vocational And Technical University
Priority to CN202311399415.6A priority Critical patent/CN117912455A/en
Publication of CN117912455A publication Critical patent/CN117912455A/en
Pending legal-status Critical Current

Landscapes

  • Telephonic Communication Services (AREA)

Abstract

The invention discloses a land-air call voice conversion method and device, terminal equipment, and a storage medium. Land-air call voice data are acquired and input into a land-air voice recognition model to obtain text data corresponding to the land-air call voice data. The training of the land-air voice recognition model specifically comprises: based on a double-stage training strategy, inputting voice training data and text training data into a multi-mode voice recognition model for training to obtain the land-air voice recognition model. According to the invention, the multi-mode voice recognition model is trained through the double-stage training strategy, so that the trained land-air voice recognition model automatically recognizes land-air call voice data and outputs text data, assisting both parties of the land-air call in understanding the conversation intention and improving the voice conversion efficiency of land-air radio communication.

Description

Land-air communication voice conversion method and device, terminal equipment and storage medium
Technical Field
The present invention relates to the field of speech recognition, and in particular, to a method and apparatus for converting land-air call speech, a terminal device, and a storage medium.
Background
An air traffic controller (Air Traffic Controller, ATCO) interacts with the pilot by radio to confirm important flight information and maintains the safety and reliability of aircraft flight through double checking. However, communication misunderstandings may occur due to background noise interference, distraction, fatigue, stress, etc., which can lead to catastrophic aviation accidents. Therefore, existing land-air radio communication suffers from low semantic conversion efficiency.
Therefore, a voice conversion strategy for land-air communication is needed, so as to solve the problem of low semantic conversion efficiency of land-air radio communication.
Disclosure of Invention
The embodiment of the invention provides a land-air communication voice conversion method, a device, terminal equipment and a storage medium, so as to improve the semantic conversion efficiency of land-air radio communication.
In order to solve the above-mentioned problems, an embodiment of the present invention provides a method for converting land-air communication voice, comprising:
acquiring land-air call voice data;
Inputting the land-air communication voice data into a land-air voice recognition model to obtain text data corresponding to the land-air communication voice data; the training of the land-air voice recognition model specifically comprises the following steps: based on a double-stage training strategy, inputting voice training data and text training data into a multi-mode voice recognition model for training to obtain the land-air voice recognition model.
As an improvement of the above solution, the multi-modal speech recognition model includes: a text input representation module, a speech input representation module, a text encoder module, a cross-modal speech encoder module, and a decoder module; wherein the text encoder module comprises a plurality of text encoder units, and the cross-modal speech encoder module comprises a plurality of speech encoder units;
The output end of the text input representation module is connected with the input end of the text encoder module, the output end of the voice input representation module is connected with the input end of the cross-mode voice encoder module, and the output end of the cross-mode voice encoder module is connected with the input end of the decoder module; each text encoder unit in the text encoder module is connected with each other, each voice encoding unit of the cross-mode voice encoder module is connected with each other, each text encoder unit corresponds to each voice encoder unit one by one, and the text encoder units are connected with the corresponding voice encoder units.
As an improvement of the above, the text encoder unit includes: the system comprises a first multi-head self-attention layer, a first residual error connection and normalization layer, a first position feedforward network layer, a second residual error connection and normalization layer;
When the current text encoder unit is connected with the last text encoder unit, the received data of the current text encoder unit is text encoding data;
when the current text encoder unit is connected with the text input representation module, the received data of the current text encoder unit is text preprocessing data;
The data transmission of the current text encoder unit is specifically: transmitting the received data to the first multi-head self-attention layer and the first residual error connection and normalization layer respectively, transmitting the output data of the first multi-head self-attention layer to the first residual error connection and normalization layer, transmitting the output data of the first residual error connection and normalization layer to the first position feedforward network layer and the second residual error connection and normalization layer respectively, transmitting the output data of the first position feedforward network layer to the second residual error connection and normalization layer, and transmitting the output data of the second residual error connection and normalization layer to a next text encoder unit and a speech encoder unit corresponding to the current text encoder unit respectively.
As an improvement of the above-described scheme, the text encoder unit satisfies the following conditions:

$\mathrm{MultiHead}(Q_w,K_w,V_w)=\mathrm{Concat}(\mathrm{head}_1,\ldots,\mathrm{head}_h)W^O$, with $\mathrm{head}_n=\mathrm{Attention}(Q_wW_n^Q,\,K_wW_n^K,\,V_wW_n^V)$;

$\tilde{H}_w^i=\mathrm{LayerNorm}\big(H_w^{i-1}+\mathrm{MultiHead}(Q_w,K_w,V_w)\big)$;

$H_w^i=\mathrm{LayerNorm}\big(\tilde{H}_w^i+\mathrm{FFN}(\tilde{H}_w^i)\big)$, where $\mathrm{FFN}(x)=\max(0,\,xW_1+b_1)W_2+b_2$;

in which $Q_w$, $K_w$ and $V_w$ are respectively the query, key and value matrices obtained from the text feature sequence; $W^O$ is the linear transformation matrix applied after all attention heads are concatenated along the corresponding columns; the output $H_w^{i-1}$ of layer $i-1$ serves as the input of layer $i$, with $H_w^0=E_w$ being the output of the text input representation module; $H_w^i$ ($i\neq 0$) is the text encoding data output by the $i$-th text encoder unit; $W_1$ and $W_2$ are weight matrix parameters to be trained, and $b_1$ and $b_2$ are bias parameters to be trained.
As an improvement of the above, the speech encoder unit includes: the system comprises a second multi-head self-attention layer, a third residual error connection and normalization layer, a multi-head cross-modal attention layer, a fourth residual error connection and normalization layer, a second position feedforward network layer and a fifth residual error connection and normalization layer;
When the current voice encoder unit is connected with the last voice encoder unit, the received data of the current voice encoder unit is voice encoded data;
When the current voice encoder unit is connected with the voice input representation module, the received data of the current voice encoder unit is voice preprocessing data;
the data transmission of the current speech encoder unit is specifically: transmitting the received data to the second multi-head self-attention layer and the third residual error connection and normalization layer respectively, transmitting the output data of the second multi-head self-attention layer to the third residual error connection and normalization layer, and transmitting the output data of the third residual error connection and normalization layer to the multi-head cross-modal attention layer and the fourth residual error connection and normalization layer respectively, wherein the multi-head cross-modal attention layer receives the text coding data transmitted by the text encoder unit corresponding to the current speech encoder unit;
and transmitting the output data of the fourth residual connection and the normalization layer to the second position feedforward network layer and the fifth residual connection and the normalization layer respectively, transmitting the output data of the second position feedforward network layer to the fifth residual connection and the normalization layer, and transmitting the output data of the fifth residual connection and the normalization layer to a next voice encoder unit or decoder.
As an improvement of the above, the speech encoder unit satisfies the following conditions:

$H_{self}^j=\mathrm{LayerNorm}\big(H_s^{j-1}+\mathrm{MultiHead}(Q_s,K_s,V_s)\big)$;

$H_{cross}^j=\mathrm{LayerNorm}\big(H_{self}^j+\mathrm{MultiHead}(Q_s,\hat{K}_w,\hat{V}_w)\big)$;

$H_s^j=\mathrm{LayerNorm}\big(H_{cross}^j+\mathrm{FFN}(H_{cross}^j)\big)$;

in which $Q_s$, $K_s$ and $V_s$ respectively represent the query, key and value matrices obtained from the speech feature sequence; $W^O$ is the linear transformation matrix applied after all attention heads are concatenated along the corresponding columns; $H_{self}^j$ represents the information propagated within the speech modality; notably, the output of layer $j-1$ serves as the input of layer $j$, with $H_s^0=E_s$ being the output of the voice input representation module; $H_{cross}^j$ represents the cross-modal interaction information transferred from text to speech, with $d_s$ and $d_w$ kept equal to unify the modalities; $\hat{K}_w$ and $\hat{V}_w$ are the key and value matrices obtained from the text encoding data of the corresponding text encoder unit; $H_s^j$ ($j\neq 0$) is the speech encoding data output by the $j$-th speech encoder unit.
As an improvement of the above solution, the two-stage training strategy includes: a first stage pre-training strategy and a second stage fine tuning strategy; the dual-stage training strategy is based on, inputting voice training data and text training data into a multi-mode voice recognition model for training, and obtaining the land-air voice recognition model comprises the following steps:
acquiring voice training data and text training data based on the land-air call history data; wherein, each voice training data corresponds to each text data one by one;
In the first stage pre-training strategy, text training data is input into a text encoder module for training based on a mask language modeling strategy, and voice training data is input into a cross-modal voice encoder module for training based on a cross-modal mask acoustic modeling strategy so as to obtain an initial voice recognition model;
And in the second stage fine tuning strategy, performing secondary training on the initial speech recognition model based on a preset parameter adjustment strategy to obtain the land-air speech recognition model.
As an improvement of the above solution, the adjusting the parameters of the initial speech recognition model based on a preset adjustment policy to obtain the land-air speech recognition model includes:
Acquiring first voice secondary training data;
And inputting the first voice secondary training data into the initial voice recognition model with the text encoder module disabled for secondary training, so as to obtain the land-air voice recognition model.
As an improvement of the scheme, mask text data and second voice secondary training data are obtained;
And inputting the mask text data into the text encoder module of the initial speech recognition model, and inputting the second speech secondary training data into the cross-modal speech encoder module of the initial speech recognition model for secondary training, so as to obtain the land-air speech recognition model.
Correspondingly, an embodiment of the present invention further provides a device for converting voice of a land-air call, including: the data acquisition module and the voice recognition module;
The data acquisition module is used for acquiring the land-air call voice data;
The voice recognition module is used for inputting the land-air communication voice data into a land-air voice recognition model to obtain text data corresponding to the land-air communication voice data; the training of the land-air voice recognition model specifically comprises the following steps: based on a double-stage training strategy, inputting voice training data and text training data into a multi-mode voice recognition model for training to obtain the land-air voice recognition model.
From the above, the invention has the following beneficial effects:
The invention provides a land-air call voice conversion method, which is used for obtaining land-air call voice data; inputting the land-air communication voice data into a land-air voice recognition model to obtain text data corresponding to the land-air communication voice data; the training of the land-air voice recognition model specifically comprises the following steps: based on a double-stage training strategy, inputting voice training data and text training data into a multi-mode voice recognition model for training to obtain the land-air voice recognition model. According to the invention, the multi-mode voice recognition model is trained through the double-stage training strategy, so that the land-air communication voice data is automatically recognized and the text data is output through the land-air communication voice recognition model obtained through training, the two parties of the land-air communication are assisted in understanding the conversation intention, and the voice conversion efficiency of the land-air radio communication is improved.
More preferably, the invention fuses the multi-modal information through the interaction between the intra-mode and the inter-mode, the voice recognition model can realize the tight semantic alignment between the voice and the text modes in the encoding stage, and excellent recognition performance is obtained while the efficiency is maintained. A two-stage training strategy is designed to effectively obtain semantically perceived acoustic characterizations. The first stage focus pre-trains the speech-text multimodal coding module to enhance inter-modality semantic alignment and acoustic long-range context dependent modeling. And in the second stage, the whole network is subjected to end-to-end fine tuning to relieve the input modal change difference generation gap in the training and reasoning stage so as to improve generalization performance.
Drawings
Fig. 1 is a flow chart of a voice conversion method for land-air communication according to an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a voice conversion device for land-air communication according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a multi-modal speech recognition model according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the structure of a text encoder unit and a speech encoder unit according to an embodiment of the present invention;
FIG. 5 is a flow chart of a dual stage training strategy according to an embodiment of the present invention;
FIG. 6 is a flow chart of training strategies for two experimental setups provided in accordance with one embodiment of the present invention;
Fig. 7 is a schematic structural diagram of a terminal device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
Referring to fig. 1, fig. 1 is a flow chart of a method for converting a voice of a land-air call according to an embodiment of the present invention, as shown in fig. 1, the embodiment includes steps 101 to 102, and the steps are as follows:
step 101: and acquiring the land-air communication voice data.
In the present embodiment, the land-air call voice data is obtained by receiving the voice data acquired by the land-air radio communication system.
Step 102: inputting the land-air communication voice data into a land-air voice recognition model to obtain text data corresponding to the land-air communication voice data; the training of the land-air voice recognition model specifically comprises the following steps: based on a double-stage training strategy, inputting voice training data and text training data into a multi-mode voice recognition model for training to obtain the land-air voice recognition model.
As an improvement of the above solution, the multi-modal speech recognition model includes: a text input representation module, a speech input representation module, a text encoder module, a cross-modal speech encoder module, and a decoder module; wherein the text encoder module comprises a plurality of text encoder units, and the cross-modal speech encoder module comprises a plurality of speech encoder units;
The output end of the text input representation module is connected with the input end of the text encoder module, the output end of the voice input representation module is connected with the input end of the cross-mode voice encoder module, and the output end of the cross-mode voice encoder module is connected with the input end of the decoder module; each text encoder unit in the text encoder module is connected with each other, each voice encoding unit of the cross-mode voice encoder module is connected with each other, each text encoder unit corresponds to each voice encoder unit one by one, and the text encoder units are connected with the corresponding voice encoder units.
In a specific embodiment, please refer to fig. 3 for the structure of the multi-modal speech recognition model, the text encoder in fig. 3 is the text encoder module according to the present invention, the cross-modal speech encoder is the cross-modal speech encoder module according to the present invention, and the attention-based decoder is the decoder module according to the present invention.
In a specific embodiment, the composition of the text encoder units, the composition of the speech encoder units, the connection relationship of each text encoder unit, the connection relationship of each speech encoder unit and the connection relationship of each text encoder and each speech encoder are shown in fig. 4, as shown in fig. 4, the multi-head self-attention of the text encoder is the first multi-head self-attention layer according to the present invention, the lower residual connection & normalization is the first residual connection and normalization layer according to the present invention, the position feedforward network is the first position feedforward network layer according to the present invention, and the upper residual connection & normalization is the second residual connection and normalization layer according to the present invention;
the multi-head self-attention of the cross-mode voice encoder is the second multi-head self-attention layer, the lower residual connection and normalization is the third residual connection and normalization layer, the multi-head cross-mode attention is the multi-head cross-mode attention layer, the middle residual connection and normalization is the fourth residual connection and normalization layer, the position feedforward network is the second position feedforward network layer, and the upper residual connection and normalization is the fifth residual connection and normalization layer.
In a specific embodiment, the text input representation module is specifically: for the text modality, the input text is encoded using the RoBERTa-wwm tokenizer with a vocabulary size of 21128. In addition, the special identifiers <s> and </s> are introduced to mark the start and end of a sequence. The encoded text is then added to the corresponding position embedding to obtain the final text input representation $E_w \in \mathbb{R}^{T_w \times d_w}$, where $T_w$ denotes the text sequence length and $d_w$ denotes the hidden state dimension of the text representation.
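For illustration, the text input representation described above may be assembled roughly as follows; this is a minimal sketch in which the module name, the maximum sequence length and the use of learned position embeddings are assumptions rather than details fixed by this embodiment:

```python
import torch
import torch.nn as nn

class TextInputRepresentation(nn.Module):
    """Token embedding plus position embedding, producing E_w (illustrative sketch)."""
    def __init__(self, vocab_size: int = 21128, d_w: int = 768, max_len: int = 512):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_w)
        self.pos_emb = nn.Embedding(max_len, d_w)   # assumed learned position embedding

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, T_w) ids from the RoBERTa-wwm tokenizer,
        # already wrapped with the <s> / </s> start and end identifiers.
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.token_emb(token_ids) + self.pos_emb(positions)   # E_w: (batch, T_w, d_w)
```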
In a specific embodiment, the voice input representation module is specifically: for the speech modality, the input audio signal is first divided into a sequence of frames with a frame length of 50 ms and a frame shift of 12.5 ms. The corresponding mel spectrogram is then calculated using the Librosa toolkit to extract 80-dimensional FBank (filter bank) features from each frame. To comprehensively capture the time- and frequency-domain characteristics of the speech signal, the speech features are concatenated with their first derivatives to expand the feature dimension to 160. Finally, the processed acoustic features are projected through a dense layer and combined with position embeddings to obtain the input representation of the speech encoder $E_s \in \mathbb{R}^{T_s \times d_s}$, where $T_s$ is the total number of audio frames and $d_s$ denotes the hidden state dimension of the speech representation.
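As a sketch of this acoustic front end, the FBank extraction may be realized with Librosa roughly as follows; the function name and parameter choices are illustrative assumptions, and the subsequent dense projection and position embedding are omitted:

```python
import librosa
import numpy as np

def speech_input_features(wav_path: str, sr: int = 16000,
                          frame_len_ms: float = 50.0, frame_shift_ms: float = 12.5,
                          n_mels: int = 80) -> np.ndarray:
    """80-dim log-mel (FBank) features per 50 ms frame with 12.5 ms shift,
    concatenated with their first derivatives into 160-dim vectors (sketch)."""
    audio, _ = librosa.load(wav_path, sr=sr)
    n_fft = int(sr * frame_len_ms / 1000)          # 800 samples per 50 ms frame
    hop_length = int(sr * frame_shift_ms / 1000)   # 200 samples per 12.5 ms shift
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    fbank = librosa.power_to_db(mel)               # (80, T_s)
    delta = librosa.feature.delta(fbank)           # first derivatives, (80, T_s)
    return np.concatenate([fbank, delta], axis=0).T   # (T_s, 160)
```

The resulting (T_s, 160) matrix would then be passed through the dense layer and added to the position embeddings to form $E_s$.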
In a specific embodiment, the decoder employs an attention-based decoder: the attention-based decoder takes the output of the cross-modal speech encoder as input and dynamically focuses on different portions of the encoder output to generate each target sequence element. The context vector $c_{t-1}$ of the decoder at time step $t-1$ is obtained by weighting and summing the encoder hidden states with the attention weights:

$c_{t-1}=\sum_{k}\alpha_{t-1,k}\,h_k^s$

where $\alpha_{t-1,k}$ is the attention weight between the decoder state $d_{t-1}$ and the encoder hidden state $h_k^s$, computed by a position-aware attention mechanism:

$\alpha_{t-1,k}=\dfrac{\exp(s_{t-1,k})}{\sum_{k'}\exp(s_{t-1,k'})}$

where $s_{t-1,k}$ is the attention score measuring the similarity between the decoder state $d_{t-1}$ and the encoder hidden state $h_k^s$, and the hidden state $h_k^s$ is the $k$-th row vector of the final output speech representation $H_s$.
Thereafter, a long short-term memory (LSTM) network autoregressively computes a new hidden state along the time sequence from the current hidden state, the previous decoder output and the context vector:

$d_t=\mathrm{LSTM}(d_{t-1},\,y_{t-1},\,c_{t-1})$

The final decoder state is passed through a dense layer and a softmax function to obtain the predicted distribution vector $\hat{y}$. Finally, cross-entropy loss is used to measure the difference between the predicted distribution and the real label:

$\mathcal{L}_{CE}=-\sum_{t=1}^{T}\sum_{u=1}^{V} y_u\log\hat{y}_u$

where $V$ represents the vocabulary size, $T$ represents the sequence length of the predicted text, and $\hat{y}_u$ and $y_u$ represent the $u$-th element of the prediction distribution vector $\hat{y}$ and of the true label vector $y$, respectively.
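A simplified sketch of one decoding step is given below; it uses plain dot-product attention in place of the position-aware attention described above, and the class and argument names are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionDecoderStep(nn.Module):
    """One step: attend over encoder outputs, update the LSTM state, predict a character."""
    def __init__(self, d_model: int = 768, vocab_size: int = 21128):
        super().__init__()
        self.lstm = nn.LSTMCell(2 * d_model, d_model)   # input: [previous embedding; context]
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, enc_out, dec_state, prev_emb):
        # enc_out: (batch, T_s, d); dec_state: (h, c) each (batch, d); prev_emb: (batch, d)
        h, c = dec_state
        scores = torch.bmm(enc_out, h.unsqueeze(-1)).squeeze(-1)      # s_{t-1,k} (simplified)
        alpha = F.softmax(scores, dim=-1)                             # attention weights
        context = torch.bmm(alpha.unsqueeze(1), enc_out).squeeze(1)   # c_{t-1}
        h, c = self.lstm(torch.cat([prev_emb, context], dim=-1), (h, c))
        return self.out(h), (h, c)                                    # logits over the vocabulary

# Training would apply F.cross_entropy between the logits and the real label at each step.
```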
As an improvement of the above, the text encoder unit includes: the system comprises a first multi-head self-attention layer, a first residual error connection and normalization layer, a first position feedforward network layer, a second residual error connection and normalization layer;
When the current text encoder unit is connected with the last text encoder unit, the received data of the current text encoder unit is text encoding data;
when the current text encoder unit is connected with the text input representation module, the received data of the current text encoder unit is text preprocessing data;
The data transmission of the current text encoder unit is specifically: transmitting the received data to the first multi-head self-attention layer and the first residual error connection and normalization layer respectively, transmitting the output data of the first multi-head self-attention layer to the first residual error connection and normalization layer, transmitting the output data of the first residual error connection and normalization layer to the first position feedforward network layer and the second residual error connection and normalization layer respectively, transmitting the output data of the first position feedforward network layer to the second residual error connection and normalization layer, and transmitting the output data of the second residual error connection and normalization layer to a next text encoder unit and a speech encoder unit corresponding to the current text encoder unit respectively.
In a specific embodiment, the text encoder employs a stack of N original Transformer encoder layers. Each layer mainly consists of a multi-head self-attention sub-layer, a position-wise feed-forward network (FFN) sub-layer and two residual connection and normalization (Add & Norm) modules.
As an improvement of the above-described scheme, the text encoder unit satisfies the following conditions:

$\mathrm{MultiHead}(Q_w,K_w,V_w)=\mathrm{Concat}(\mathrm{head}_1,\ldots,\mathrm{head}_h)W^O$, with $\mathrm{head}_n=\mathrm{Attention}(Q_wW_n^Q,\,K_wW_n^K,\,V_wW_n^V)$;

$\tilde{H}_w^i=\mathrm{LayerNorm}\big(H_w^{i-1}+\mathrm{MultiHead}(Q_w,K_w,V_w)\big)$;

$H_w^i=\mathrm{LayerNorm}\big(\tilde{H}_w^i+\mathrm{FFN}(\tilde{H}_w^i)\big)$, where $\mathrm{FFN}(x)=\max(0,\,xW_1+b_1)W_2+b_2$;

in which $Q_w$, $K_w$ and $V_w$ are respectively the query, key and value matrices obtained from the text feature sequence; $W^O$ is the linear transformation matrix applied after all attention heads are concatenated along the corresponding columns; the output $H_w^{i-1}$ of layer $i-1$ serves as the input of layer $i$, with $H_w^0=E_w$ being the output of the text input representation module; $H_w^i$ ($i\neq 0$) is the text encoding data output by the $i$-th text encoder unit; $W_1$ and $W_2$ are weight matrix parameters to be trained, and $b_1$ and $b_2$ are bias parameters to be trained.
In a specific embodiment, for the i-th layer text encoder unit, the text feature sequence is first linearly projected by different weight matrices and converted into query, key and value matrices, respectively, so as to map it into different representation spaces. Then, a multi-head self-attention mechanism computes the attention of each head in parallel and concatenates the outputs of all heads; the model captures the semantic associations and importance in the input text sequence by weighting and aggregating the context information of different positions:

$\mathrm{MultiHead}(Q_w,K_w,V_w)=\mathrm{Concat}(\mathrm{head}_1,\ldots,\mathrm{head}_h)W^O$, with $\mathrm{head}_n=\mathrm{Attention}(Q_wW_n^Q,\,K_wW_n^K,\,V_wW_n^V)$

In addition, the position feed-forward sub-layer introduces a nonlinear activation function to enhance the fitting capability of the model, and the Add & Norm modules promote the deep propagation of information and accelerate model convergence. Thus, the text representation $H_w^i$ finally output by the i-th layer text encoder unit is derived as:

$\tilde{H}_w^i=\mathrm{LayerNorm}\big(H_w^{i-1}+\mathrm{MultiHead}(Q_w,K_w,V_w)\big)$

$H_w^i=\mathrm{LayerNorm}\big(\tilde{H}_w^i+\mathrm{FFN}(\tilde{H}_w^i)\big)$, where $\mathrm{FFN}(x)=\max(0,\,xW_1+b_1)W_2+b_2$
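For reference, one text encoder unit corresponds to a standard Transformer encoder layer, which may be sketched as follows; the feed-forward size and dropout value are assumptions for illustration:

```python
import torch.nn as nn

class TextEncoderLayer(nn.Module):
    """Multi-head self-attention + position-wise FFN, each with residual connection and LayerNorm."""
    def __init__(self, d_w: int = 768, n_heads: int = 12, d_ff: int = 3072, dropout: float = 0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_w, n_heads, dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(d_w)
        self.ffn = nn.Sequential(nn.Linear(d_w, d_ff), nn.ReLU(), nn.Linear(d_ff, d_w))
        self.norm2 = nn.LayerNorm(d_w)

    def forward(self, h_prev):
        # h_prev: H_w^{i-1} from the previous layer (H_w^0 = E_w), shape (batch, T_w, d_w)
        attn_out, _ = self.self_attn(h_prev, h_prev, h_prev)
        h = self.norm1(h_prev + attn_out)    # first residual connection and normalization
        h = self.norm2(h + self.ffn(h))      # second residual connection and normalization
        return h                             # H_w^i, also passed to the matching speech encoder unit
```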
As an improvement of the above, the speech encoder unit includes: the system comprises a second multi-head self-attention layer, a third residual error connection and normalization layer, a multi-head cross-modal attention layer, a fourth residual error connection and normalization layer, a second position feedforward network layer and a fifth residual error connection and normalization layer;
When the current voice encoder unit is connected with the last voice encoder unit, the received data of the current voice encoder unit is voice encoded data;
When the current voice encoder unit is connected with the voice input representation module, the received data of the current voice encoder unit is voice preprocessing data;
the data transmission of the current speech encoder unit is specifically: transmitting the received data to the second multi-head self-attention layer and the third residual error connection and normalization layer respectively, transmitting the output data of the second multi-head self-attention layer to the third residual error connection and normalization layer, and transmitting the output data of the third residual error connection and normalization layer to the multi-head cross-modal attention layer and the fourth residual error connection and normalization layer respectively, wherein the multi-head cross-modal attention layer receives the text coding data transmitted by the text encoder unit corresponding to the current speech encoder unit;
and transmitting the output data of the fourth residual connection and the normalization layer to the second position feedforward network layer and the fifth residual connection and the normalization layer respectively, transmitting the output data of the second position feedforward network layer to the fifth residual connection and the normalization layer, and transmitting the output data of the fifth residual connection and the normalization layer to a next voice encoder unit or decoder.
In a specific embodiment, the cross-modal speech encoder differs from the original Transformer in two attention modules: a self-attention module and a cross-modal attention module.
As an improvement of the above, the speech encoder unit satisfies the following conditions:

$H_{self}^j=\mathrm{LayerNorm}\big(H_s^{j-1}+\mathrm{MultiHead}(Q_s,K_s,V_s)\big)$;

$H_{cross}^j=\mathrm{LayerNorm}\big(H_{self}^j+\mathrm{MultiHead}(Q_s,\hat{K}_w,\hat{V}_w)\big)$;

$H_s^j=\mathrm{LayerNorm}\big(H_{cross}^j+\mathrm{FFN}(H_{cross}^j)\big)$;

in which $Q_s$, $K_s$ and $V_s$ respectively represent the query, key and value matrices obtained from the speech feature sequence; $W^O$ is the linear transformation matrix applied after all attention heads are concatenated along the corresponding columns; $H_{self}^j$ represents the information propagated within the speech modality; notably, the output of layer $j-1$ serves as the input of layer $j$, with $H_s^0=E_s$ being the output of the voice input representation module; $H_{cross}^j$ represents the cross-modal interaction information transferred from text to speech, with $d_s$ and $d_w$ kept equal to unify the modalities; $\hat{K}_w$ and $\hat{V}_w$ are the key and value matrices obtained from the text encoding data of the corresponding text encoder unit; $H_s^j$ ($j\neq 0$) is the speech encoding data output by the $j$-th speech encoder unit.
In a specific embodiment, unlike the original Transformer decoder, the look-ahead masking of future positions is eliminated and a bidirectional self-attention module is employed instead, so as to fully learn the internal information of the speech modality. The overall calculation process can be summarized as follows:

$H_{self}^j=\mathrm{LayerNorm}\big(H_s^{j-1}+\mathrm{MultiHead}(Q_s,K_s,V_s)\big)$
Another difference is the cross-modal attention module, which aims to establish strong links and semantic alignment between text and speech; the closeness of the association between the two modalities is determined by computing the dot product between the speech queries and the text keys. Then, the value items of each feature sequence in the text modality are weighted and summed with the attention weights to obtain the cross-modal interaction information:

$H_{cross}^j=\mathrm{LayerNorm}\big(H_{self}^j+\mathrm{MultiHead}(Q_s,\hat{K}_w,\hat{V}_w)\big)$

where $H_{cross}^j$ represents the cross-modal interaction information transferred from text to speech, while $d_s$ and $d_w$ are kept equal to unify the modalities. Notably, $\hat{K}_w$ and $\hat{V}_w$ represent the corresponding key and value representations of the last layer in the text encoder.
Finally, the representation capability is further enhanced by a fully connected feed-forward layer to obtain the cross-modal, semantically aware acoustic representation $H_s^j$:

$H_s^j=\mathrm{LayerNorm}\big(H_{cross}^j+\mathrm{FFN}(H_{cross}^j)\big)$
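Analogously, one cross-modal speech encoder unit may be sketched as follows; the class and argument names are illustrative assumptions, with $d_s=d_w$ as required above:

```python
import torch.nn as nn

class CrossModalSpeechEncoderLayer(nn.Module):
    """Bidirectional self-attention over speech, cross-modal attention with text keys/values,
    then a position-wise FFN; each sub-layer has residual connection and LayerNorm."""
    def __init__(self, d_s: int = 768, n_heads: int = 12, d_ff: int = 3072, dropout: float = 0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_s, n_heads, dropout=dropout, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_s, n_heads, dropout=dropout, batch_first=True)
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(d_s), nn.LayerNorm(d_s), nn.LayerNorm(d_s)
        self.ffn = nn.Sequential(nn.Linear(d_s, d_ff), nn.ReLU(), nn.Linear(d_ff, d_s))

    def forward(self, h_speech, h_text):
        # h_speech: H_s^{j-1} from the previous speech layer (H_s^0 = E_s)
        # h_text:   text encoding data supplying the keys and values (d_s == d_w assumed)
        a, _ = self.self_attn(h_speech, h_speech, h_speech)   # information within the speech modality
        h = self.norm1(h_speech + a)
        x, _ = self.cross_attn(h, h_text, h_text)             # text-to-speech interaction
        h = self.norm2(h + x)
        return self.norm3(h + self.ffn(h))                    # H_s^j
```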
As an improvement of the above solution, the two-stage training strategy includes: a first stage pre-training strategy and a second stage fine tuning strategy; the dual-stage training strategy is based on, inputting voice training data and text training data into a multi-mode voice recognition model for training, and obtaining the land-air voice recognition model comprises the following steps:
acquiring voice training data and text training data based on the land-air call history data; wherein, each voice training data corresponds to each text data one by one;
In the first stage pre-training strategy, text training data is input into a text encoder module for training based on a mask language modeling strategy, and voice training data is input into a cross-modal voice encoder module for training based on a cross-modal mask acoustic modeling strategy so as to obtain an initial voice recognition model;
And in the second stage fine tuning strategy, performing secondary training on the initial speech recognition model based on a preset parameter adjustment strategy to obtain the land-air speech recognition model.
In a specific embodiment, referring to fig. 5, fig. 5 discloses a flow diagram of a dual-stage training strategy;
In a first stage pre-training strategy, the voice text is used as the input of a multi-mode voice recognition model together for pre-training so as to realize the tight semantic alignment between voice and text modes;
The text encoder module employs a masked language modeling (MLM) strategy, akin to a cloze task, to help the model learn the context and semantic information of the text modality; the goal is to predict the masked symbols in the input text based on the context information. This embodiment follows the configuration of RoBERTa and dynamically masks input tokens with a probability of 15%, where 80% of the masked positions are replaced with the <mask> token, 10% are replaced with a random token, and the remaining 10% remain unchanged. Finally, the model is optimized with the cross-entropy loss as the objective function:

$\mathcal{L}_{MLM}=-\sum\log P\big(m(w)\mid w_{\backslash m(w)}\big)$

where $m(w)$ and $w_{\backslash m(w)}$ represent the masked characters and the remaining context characters, respectively, in the text sequence $w$;
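A minimal sketch of the dynamic masking described above is shown below; the helper name and the use of -100 as the ignored-label value are assumptions for illustration:

```python
import torch

def dynamic_mlm_mask(token_ids, mask_id, vocab_size, mask_prob=0.15):
    """Select 15% of tokens; of these, 80% -> <mask>, 10% -> random token, 10% unchanged."""
    labels = token_ids.clone()
    selected = torch.rand(token_ids.shape, device=token_ids.device) < mask_prob
    labels[~selected] = -100                     # only masked positions contribute to the loss
    roll = torch.rand(token_ids.shape, device=token_ids.device)
    masked = token_ids.clone()
    masked[selected & (roll < 0.8)] = mask_id                          # 80%: <mask> token
    rand_pos = selected & (roll >= 0.8) & (roll < 0.9)                 # 10%: random token
    masked[rand_pos] = torch.randint(vocab_size, (int(rand_pos.sum()),),
                                     device=token_ids.device)
    return masked, labels                        # remaining 10% of selected tokens stay unchanged
```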
The cross-modal speech encoder module employs a cross-modal masked acoustic modeling (CMAM) strategy to learn speech representations by masking a portion of the audio frames and directing the model to predict the masked portion from context. It thus extracts multi-modal interaction information with tight semantic alignment by combining context information from both modalities. Specifically, the audio is first divided into segments, each containing a certain number of consecutive frames, and some segments are selected with a 15% probability; 80% of the selected frames are masked to zero, 10% are replaced by randomly selected audio frames, and the rest remain unchanged. The model then reconstructs the masked acoustic features by optimizing an L1 loss that reduces the difference between the predicted representation and the true values at the masked locations:

$\mathcal{L}_{CMAM}=\dfrac{1}{T_{masked}}\sum_{i}\lVert s_i-\hat{s}_i\rVert_1$

where $T_{masked}$ represents the number of masked frames, $s_i$ represents the original audio feature at a masked location, and $\hat{s}_i$ represents the corresponding predicted audio feature.
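The segment-level masking and L1 reconstruction objective can be sketched as follows; the segment length of 7 frames and the function names are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def cmam_mask(features, seg_len=7, select_prob=0.15):
    """Split audio into consecutive-frame segments, select 15% of them;
    80% of selected frames are zeroed, 10% replaced by random frames, 10% kept."""
    masked = features.clone()                    # features: (T_s, feat_dim)
    flags = torch.zeros(features.size(0), dtype=torch.bool)
    for start in range(0, features.size(0), seg_len):
        if torch.rand(1).item() >= select_prob:
            continue
        end = min(start + seg_len, features.size(0))
        roll = torch.rand(1).item()
        if roll < 0.8:
            masked[start:end] = 0.0                                    # mask to zero
        elif roll < 0.9:
            idx = torch.randint(features.size(0), (end - start,))
            masked[start:end] = features[idx]                          # random audio frames
        flags[start:end] = True                                        # remaining 10% stay unchanged
    return masked, flags

def cmam_loss(pred, target, flags):
    """L1 reconstruction loss over the masked positions only."""
    return F.l1_loss(pred[flags], target[flags])
```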
In the second stage fine tuning strategy, unpaired pure speech input is employed and cross entropy loss is used to fine tune the entire speech-text multimodal speech recognition network end-to-end.
In particular, a speech-text multimodal coding module obtained by a one-stage pre-training is employed to initialize a speech-text multimodal encoder to take full advantage of the effectively aligned multimodal representation that it obtained in the first stage.
In this embodiment, two adjustment strategies are provided to accommodate pure speech input: 1) Disabling the text encoder and text input; 2) Using the full < mask > symbol as a text-side input;
as an improvement of the above solution, the adjusting the parameters of the initial speech recognition model based on a preset adjustment policy to obtain the land-air speech recognition model includes:
Acquiring first voice secondary training data;
And inputting the first voice secondary training data into the initial voice recognition model with the text encoder module disabled for secondary training, so as to obtain the land-air voice recognition model.
In a particular embodiment, the text encoder module and the cross-modal attention layer in each speech encoder unit are disabled, because the dual-tower architecture captures the features of the two modalities independently and uses the cross-modal attention only for their interaction.
As an improvement of the scheme, mask text data and second voice secondary training data are obtained;
And inputting the mask text data into the text encoder module of the initial speech recognition model, and inputting the second speech secondary training data into the cross-modal speech encoder module of the initial speech recognition model for secondary training, so as to obtain the land-air speech recognition model.
In a specific embodiment, the original multi-modal architecture remains valid when a sequence consisting entirely of <mask> symbols is used as the input to the text encoder module. However, a fixed text sequence input can hardly convey enough meaningful information to the model, so the speech encoder and decoder modules are fine-tuned preferentially to reduce the domain shift.
It will be appreciated that the two-stage training strategy provides a viable approach to establishing semantic associations between speech and text modalities, enabling multimodal models to better handle pure speech input during reasoning to obtain semantically perceived speech recognition results.
To better illustrate the effects of the present invention, the following examples are provided for illustration:
The present example uses a Chinese land-air call (ATCC) dataset, recorded in a quiet environment with reference to professional curriculum materials under the supervision and guidance of experienced air traffic control experts. Each sample in the dataset consists of an air traffic controller instruction spliced with the pilot's readback, so the sentences are longer than in common datasets. Each speech sample is stored in WAV format at a 16 kHz sampling rate with 16-bit, mono encoding. The dataset consists of 10971 speech samples with a total duration of about 25.2 hours. The whole dataset is randomly shuffled and divided into training, validation and test sets at a ratio of 8:1:1.
To further verify the effectiveness of the proposed method, this embodiment also conducts experiments on the AISHELL-1 dataset, an open-source, high-quality Mandarin Chinese speech recognition corpus that was likewise recorded in a quiet indoor environment. It is widely used in the speech community and covers 11 domains, including smart home, autonomous driving and industrial production. The total duration of the dataset is 178 hours, with a 16 kHz sampling rate, 16-bit mono encoding and WAV storage format. Furthermore, this embodiment divides the dataset into training, validation and test sets at the same ratio of 8:1:1. Table 1 shows the detailed division of the training, validation and test sets for the two datasets.
Table 1 the partitioning and size of the two data sets. "Utterances" and "Hours" denote the sample size and total duration of speech, respectively
In this embodiment, the text encoder module and the cross-modal speech encoder module each consist of a stack of 6 Transformer layers with a hidden dimension of 768, 12 attention heads and a feed-forward layer of size 3027, while the decoder module consists of a single-layer LSTM of dimension 768. The initial learning rate is 5e-5, and an Adam optimizer is used to optimize the trainable parameters during model training. In addition, a linear decay learning rate schedule with warm-up is employed. In the fine-tuning stage, this embodiment adopts an AdamW optimizer with an initial learning rate of 1e-5 together with a cosine annealing learning rate schedule to achieve the best performance. The experiments use four NVIDIA A100-SXM4-40G GPUs with a batch size of 8. The main modeling unit is the character, and the character error rate (CER) is used as the evaluation metric, computed as follows:
$CER=\dfrac{I+S+D}{N}\times 100\%$

where $I$, $S$ and $D$ represent the numbers of inserted, substituted and deleted characters, respectively, and $N$ represents the total number of characters in the real label.
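For reference, the CER can be computed from the edit distance between the predicted and reference character sequences; a small self-contained sketch:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: (insertions + substitutions + deletions) / N."""
    ref, hyp = list(reference), list(hypothesis)
    # Standard Levenshtein dynamic programming over characters.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[len(ref)][len(hyp)] / max(len(ref), 1)
```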
The two-stage training strategy proposed in this embodiment is as follows: the first stage pre-trains the speech-text multi-modal coding module for 50 epochs, and the second stage fine-tunes the whole end-to-end network for 30 epochs. An additional experimental setup is introduced for comparison to demonstrate the effectiveness of the proposed strategy, as shown in fig. 6 (b). In that setup, the first stage trains the encoder and decoder modules end-to-end with paired speech-text multi-modal input while optimizing the entire network jointly with the MLM loss, the CMAM loss and the cross-entropy loss; in the second stage, the entire end-to-end speech-text multi-modal speech recognition network is likewise fine-tuned using only speech data.
Tables 2 and 3 show the identification results for the same settings on the ATCC and AISHELL-1 datasets, respectively. Experimental results show that the multi-modal model herein performs much better than the single-modal baseline model for both data sets. Notably, the multi-modal approach (Ours ) employed by the present invention achieves optimal results on both datasets, proving the effectiveness of the proposed two-stage training strategy. This suggests that the first stage focus on the pre-training of the multi-modal encoder helps learn the tight semantic alignment between speech and text modalities and provides a beneficial initialization for the overall training process. Furthermore, the two second-stage trimming strategies are also correspondingly applicable to the two data sets, respectively.
Table 2 Comparison of the performance of different methods on the ATCC dataset. "∈" and "#" respectively denote the two experimental setup variants of the proposed two-stage training strategy, as shown in fig. 6 (a) and 6 (b). "+<mask>" and "-X" denote the second-stage fine-tuning strategies, as shown in fig. 5 (a) and 5 (b), respectively.
For ATCC datasets, the single-mode baseline model performs poorly because of its limited modeling ability on a small number of and long-sequence datasets. In contrast, the multi-modal method can effectively capture and align cross-modal information, has richer context modeling capability, and therefore improves recognition accuracy and generalization capability. Thus, as shown in Table 2, the best results are obtained when only the multi-modal encoder is pre-trained in the first stage and the text encoder is directly disabled in the second stage. The multi-modal interaction of the first stage enhances the acoustic long-distance context dependency, and the second stage disables the text encoder to facilitate the improvement of the generalization performance of semantic perception acoustic modeling.
For the AISHELL-1 dataset, the performance of the model was best when using the full < mask > token as the text-side input in the second stage, which demonstrates the feasibility of this approach for reconciliation and reasoning, as shown in Table 3. Based on the experimental results of the ATCC data set, the joint training of the end-to-end encoder-decoder is inferior to the training of only the multi-mode encoder module in the first stage. Ideally, after pre-training the multi-modal encoder, the model is more focused on the transfer learning of the semantic aware acoustic feature distribution during the fine tuning phase than on feature extraction. Otherwise, the model focuses more on parameter adjustment of the speech recognition task when simultaneously training the three target encoder-decoder network in the first stage, resulting in potential interference between sub-tasks.
Table 3 Comparison of the performance of different methods on the AISHELL-1 dataset. The two experimental setup variants of the proposed two-stage training strategy are shown in fig. 6 (a) and 6 (b), respectively. "+<mask>" and "-X" denote the second-stage fine-tuning strategies, as shown in fig. 5 (a) and 5 (b), respectively.
This example performs ablation experiments on the ATCC and AISHELL-1 datasets to verify the effectiveness of the key components in the model, as shown in Table 4. Comparing the results of setting (a) with our method, it can be observed that performance decreases when the text modality is not modeled, which demonstrates the effectiveness of the speech-text multi-modal method. In setting (b), the CER rises when the MLM task is removed from the text encoder, illustrating the importance of learning high-level semantic representations of the text modality for the speech recognition task. For setting (c), only the information flow between speech and text is preserved and the CMAM task is removed from the cross-modal speech encoder; the dramatic drop in performance indicates the necessity of modeling both the intra-modal acoustic information and the modal interactions between speech and text to fully understand and model the speech signal.
Table 4 key component ablation experiments for speech-text multimodal methods. "MAM" means that the speech coding module is pre-trained by masking the acoustic modeling target.
In this example, a hyperparameter selection experiment is performed on the ATCC dataset, and the effect of model capacity on recognition performance is studied by changing the number of encoder layers of both modalities, as shown in Table 5. As the number of encoder layers increases, recognition performance improves: the larger capacity enables the model to convey more semantic information and thus yields better results on the semantically related speech recognition task. The model performance peaks when the number of encoder layers is 6; beyond that, the model may become difficult to train and adapt due to the small amount of data, resulting in performance degradation.
TABLE 5 influence of model Capacity on recognition Performance on ATCC dataset
Recognition cases of the speech-text multi-modal speech recognition model presented herein on the ATCC and AISHELL-1 datasets are shown in Tables 6 and 7, respectively. The tables provide the Chinese pinyin pronunciation of the key words in each sentence and the English translation of the whole sentence, with a light gray background marking the incorrectly recognized characters. The multi-modal approach herein has better recognition accuracy on both datasets than the single-modal approach. As shown in Table 6, the single-modal model produces significant insertion-type errors due to insufficient acoustic modeling capability for long sequences. In terms of semantic reliability, the results obtained by the single-modal approach are syntactically reasonable, but there are some semantic recognition errors that are particularly apparent on the AISHELL-1 dataset.
In some cases, the recognition results of some specific characters sound the same as the real label but are semantically incorrect. For example, in case 1 of Table 7, the single-modal approach misrecognizes a word as "try-on" (shi4 zuo4), while the multi-modal approach recognizes it correctly, indicating that the multi-modal approach has superior representation capability with tighter semantic alignment between speech and text. Nevertheless, as shown in case 3 of Table 7, the multi-modal framework still does not perform well when dealing with rare proper nouns, and a more efficient semantically aware acoustic representation learning approach may be needed.
Table 6 Cases of different speech recognition methods on the ATCC dataset. The light gray background marks the incorrectly recognized characters; /…/ represents the pinyin of a Chinese character, where the digit represents the tone. A placeholder symbol is used to align character units so that insertion errors are easier to observe.
Table 7 Cases of different speech recognition methods on the AISHELL-1 dataset. The light gray background marks the incorrectly recognized characters; /…/ represents the pinyin of a Chinese character, where the digit represents the tone.
The present embodiment designs a two-stage training strategy to effectively obtain semantically aware acoustic representations. The first stage pre-trains the speech-text multi-modal coding module with the MLM and CMAM strategies to enhance inter-modal semantic alignment and acoustic long-range context-dependent modeling. The second stage fine-tunes the whole network end-to-end, either by disabling the text encoder or by using a full <mask> symbol sequence as the text-side input, to mitigate the gap caused by input modality differences between the training and inference stages.
Extensive experiments demonstrate the effectiveness of the speech-text multi-modal speech recognition method presented in this section on the public ATCC and AISHELL-1 datasets, reducing the character error rate to 6.54% and 8.73%, respectively. Notably, the first-stage pre-training process focuses on the multi-modal encoder, facilitating tight semantic alignment between the speech and text modalities and providing a beneficial initialization for the second-stage training process. The first-stage multi-modal interactions facilitate acoustic long-range context-dependency modeling despite the small size and long sequences of the ATCC dataset. Furthermore, the two second-stage fine-tuning strategies are each suited to one of the datasets: disabling the text encoder yields the best results on the ATCC dataset, while using the full <mask> symbol sequence as the text-side input in the second stage performs best on AISHELL-1.
Referring to fig. 2, fig. 2 is a schematic structural diagram of a land-air call voice conversion device according to an embodiment of the present invention, including: a data acquisition module 201 and a speech recognition module 202;
The data acquisition module is used for acquiring the land-air call voice data;
The voice recognition module is used for inputting the land-air communication voice data into a land-air voice recognition model to obtain text data corresponding to the land-air communication voice data; the training of the land-air voice recognition model specifically comprises the following steps: based on a double-stage training strategy, inputting voice training data and text training data into a multi-mode voice recognition model for training to obtain the land-air voice recognition model.
It can be understood that the above system item embodiment corresponds to the method item embodiment of the present invention, and may implement the land-air call voice conversion method provided by any one of the method item embodiments of the present invention.
The embodiment obtains the land-air communication voice data; inputting the land-air communication voice data into a land-air voice recognition model to obtain text data corresponding to the land-air communication voice data; the training of the multi-modal voice recognition model specifically comprises the following steps: based on a double-stage training strategy, inputting voice training data and text training data into a multi-mode voice recognition model for training to obtain the land-air voice recognition model. According to the invention, the multi-mode voice recognition model is trained through the double-stage training strategy, so that the land-air communication voice data is automatically recognized and the text data is output through the land-air communication voice recognition model obtained through training, the two parties of the land-air communication are assisted in understanding the conversation intention, and the voice conversion efficiency of the land-air radio communication is improved.
Example two
Referring to fig. 7, fig. 7 is a schematic structural diagram of a terminal device according to an embodiment of the present invention.
A terminal device of this embodiment includes: a processor 701, a memory 702 and a computer program stored in the memory 702 and executable on the processor 701. The processor 701, when executing the computer program, implements the steps of the above-described respective land-air call voice conversion method in the embodiment, such as all the steps of the land-air call voice conversion method shown in fig. 1. Or the processor, when executing the computer program, performs the functions of the modules in the above apparatus embodiments, for example: all the modules of the land-air call voice conversion device shown in fig. 2.
In addition, the embodiment of the invention also provides a computer readable storage medium, which comprises a stored computer program, wherein the computer program is used for controlling equipment where the computer readable storage medium is located to execute the land-air communication voice conversion method according to any embodiment.
It will be appreciated by those skilled in the art that the schematic diagram is merely an example of a terminal device and does not constitute a limitation of the terminal device, and may include more or less components than illustrated, or may combine certain components, or different components, e.g., the terminal device may further include an input-output device, a network access device, a bus, etc.
The processor 701 may be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like; the processor 701 is the control center of the terminal device and connects the various parts of the entire terminal device using various interfaces and lines.
The memory 702 may be used to store the computer program and/or module, and the processor 701 implements the various functions of the terminal device by running or executing the computer program and/or module stored in the memory and invoking data stored in the memory 702. The memory 702 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data created according to the use of the handset (such as audio data, a phonebook, etc.), and the like. In addition, the memory may include a high-speed random access memory, and may also include a non-volatile memory, such as a hard disk, a memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card (Flash Card), at least one magnetic disk storage device, a flash memory device, or other volatile solid-state storage device.
Wherein the terminal device integrated modules/units may be stored in a computer readable storage medium if implemented in the form of software functional units and sold or used as stand alone products. Based on such understanding, the present invention may implement all or part of the flow of the method of the above embodiment, or may be implemented by a computer program to instruct related hardware, where the computer program may be stored in a computer readable storage medium, and when the computer program is executed by a processor, the computer program may implement the steps of each of the method embodiments described above. Wherein the computer program comprises computer program code which may be in source code form, object code form, executable file or some intermediate form etc. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a U disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM, random Access Memory), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth.
It should be noted that the above-described apparatus embodiments are merely illustrative, and the units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the embodiment of the device provided by the invention, the connection relation between the modules represents that the modules have communication connection, and can be specifically implemented as one or more communication buses or signal lines. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
While the foregoing is directed to the preferred embodiments of the present invention, it will be appreciated by those skilled in the art that changes and modifications may be made without departing from the principles of the invention; such changes and modifications are also intended to fall within the scope of the invention.
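For illustration only, the following is a minimal sketch of how such a stored program on the terminal device might carry out the conversion: the processor loads the trained land-air voice recognition model from the computer-readable medium and converts call audio into text. The file names, the TorchScript packaging, and the id-to-text step are assumptions for the sketch, not details fixed by the embodiments.

import torch
import torchaudio


def convert_land_air_call(audio_path: str, model_path: str = "land_air_asr_model.pt") -> str:
    """Hypothetical entry point: convert land-air call voice data at audio_path into text."""
    waveform, _ = torchaudio.load(audio_path)                # acquire the call voice data
    model = torch.jit.load(model_path, map_location="cpu")   # model stored on the computer-readable medium
    model.eval()
    with torch.no_grad():
        token_ids = model(waveform)                          # land-air voice recognition model (assumed TorchScript module)
    # Placeholder id-to-text step; a real system would decode through its vocabulary.
    return " ".join(str(int(t)) for t in token_ids.flatten())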

Claims (10)

1. A land-air call voice conversion method, comprising:
acquiring land-air call voice data;
inputting the land-air call voice data into a land-air voice recognition model to obtain text data corresponding to the land-air call voice data; the training of the land-air voice recognition model specifically comprises: inputting, based on a dual-stage training strategy, voice training data and text training data into a multi-modal voice recognition model for training to obtain the land-air voice recognition model.
2. The land-air call voice conversion method according to claim 1, wherein said multi-modal voice recognition model comprises: a text input representation module, a speech input representation module, a text encoder module, a cross-modal speech encoder module, and a decoder module, wherein the text encoder module comprises a plurality of text encoder units, and the cross-modal speech encoder module comprises a plurality of speech encoder units;
the output end of the text input representation module is connected with the input end of the text encoder module, the output end of the speech input representation module is connected with the input end of the cross-modal speech encoder module, and the output end of the cross-modal speech encoder module is connected with the input end of the decoder module; the text encoder units in the text encoder module are connected with one another, the speech encoder units in the cross-modal speech encoder module are connected with one another, the text encoder units correspond one-to-one to the speech encoder units, and each text encoder unit is connected with its corresponding speech encoder unit.
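For illustration only, the following is a minimal PyTorch-style sketch of the module wiring described in claim 2. The TextEncoderUnit and CrossModalSpeechEncoderUnit classes are sketched after claims 4 and 6 below; the input representation modules, the decoder module, the layer count and the feature dimension are placeholder assumptions rather than the patented implementation.

import torch.nn as nn


class MultiModalSpeechRecognizer(nn.Module):
    def __init__(self, text_embed, speech_embed, decoder, num_layers: int = 6, d_model: int = 512):
        super().__init__()
        self.text_embed = text_embed        # text input representation module (assumed)
        self.speech_embed = speech_embed    # speech input representation module (assumed)
        self.text_units = nn.ModuleList([TextEncoderUnit(d_model) for _ in range(num_layers)])
        self.speech_units = nn.ModuleList([CrossModalSpeechEncoderUnit(d_model) for _ in range(num_layers)])
        self.decoder = decoder              # decoder module (assumed)

    def forward(self, speech, text):
        h_w = self.text_embed(text)         # E_w from the text input representation module
        h_s = self.speech_embed(speech)     # E_s from the speech input representation module
        for text_unit, speech_unit in zip(self.text_units, self.speech_units):
            h_w = text_unit(h_w)            # text encoding data, also fed to the paired speech unit
            h_s = speech_unit(h_s, h_w)     # cross-modal interaction from text to speech
        return self.decoder(h_s)            # only the cross-modal speech encoding reaches the decoder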
3. The land-air call voice conversion method according to claim 2, wherein said text encoder unit comprises: a first multi-head self-attention layer, a first residual connection and normalization layer, a first position-wise feed-forward network layer, and a second residual connection and normalization layer;
when the current text encoder unit is connected with the previous text encoder unit, the data received by the current text encoder unit is text encoding data;
when the current text encoder unit is connected with the text input representation module, the data received by the current text encoder unit is text preprocessing data;
the data transmission of the current text encoder unit is specifically: transmitting the received data to the first multi-head self-attention layer and the first residual connection and normalization layer respectively, transmitting the output data of the first multi-head self-attention layer to the first residual connection and normalization layer, transmitting the output data of the first residual connection and normalization layer to the first position-wise feed-forward network layer and the second residual connection and normalization layer respectively, transmitting the output data of the first position-wise feed-forward network layer to the second residual connection and normalization layer, and transmitting the output data of the second residual connection and normalization layer to the next text encoder unit and to the speech encoder unit corresponding to the current text encoder unit respectively.
4. The land-air call voice conversion method according to claim 3, wherein said text encoder unit satisfies the following conditions:

$\mathrm{MultiHead}(Q_w, K_w, V_w) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W_w^{O}$

$\tilde{H}_w^{(i)} = \mathrm{LayerNorm}\big(H_w^{(i-1)} + \mathrm{MultiHead}(H_w^{(i-1)}, H_w^{(i-1)}, H_w^{(i-1)})\big)$

$\mathrm{FFN}(x) = \max(0,\, x W_1 + b_1)\, W_2 + b_2$

$H_w^{(i)} = \mathrm{LayerNorm}\big(\tilde{H}_w^{(i)} + \mathrm{FFN}(\tilde{H}_w^{(i)})\big)$

wherein Q_w, K_w and V_w respectively denote the query, key and value matrices derived from the text feature sequence; W_w^O is the linear transformation matrix applied after all attention heads are concatenated along the corresponding columns; the output of layer i-1 serves as the input of layer i, and H_w^(0) = E_w is the output of the text input representation module; H_w^(i) is the text encoding data output by the text encoder unit of the i-th layer, i ≠ 0; W_1 and W_2 are weight matrix parameters to be trained, and b_1 and b_2 are bias parameters to be trained.
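For illustration only, a minimal PyTorch sketch of one text encoder unit following the layer structure of claim 3 and the formulas of claim 4; the hidden sizes, the ReLU activation and the use of nn.MultiheadAttention are assumptions, not details fixed by the claims.

import torch
import torch.nn as nn


class TextEncoderUnit(nn.Module):
    def __init__(self, d_model: int = 512, num_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)  # first multi-head self-attention layer
        self.norm1 = nn.LayerNorm(d_model)   # first residual connection and normalization layer
        self.ffn = nn.Sequential(            # first position-wise feed-forward network (W_1, b_1, W_2, b_2)
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm2 = nn.LayerNorm(d_model)   # second residual connection and normalization layer

    def forward(self, h_prev: torch.Tensor) -> torch.Tensor:
        # h_prev is E_w for the first unit, otherwise the previous unit's output H_w^(i-1).
        attn_out, _ = self.self_attn(h_prev, h_prev, h_prev)
        h = self.norm1(h_prev + attn_out)    # residual connection + layer normalization
        return self.norm2(h + self.ffn(h))   # feed-forward + residual connection + layer normalization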
5. The land-air call voice conversion method according to claim 3, wherein said speech encoder unit comprises: a second multi-head self-attention layer, a third residual connection and normalization layer, a multi-head cross-modal attention layer, a fourth residual connection and normalization layer, a second position-wise feed-forward network layer, and a fifth residual connection and normalization layer;
when the current speech encoder unit is connected with the previous speech encoder unit, the data received by the current speech encoder unit is speech encoding data;
when the current speech encoder unit is connected with the speech input representation module, the data received by the current speech encoder unit is speech preprocessing data;
the data transmission of the current speech encoder unit is specifically: transmitting the received data to the second multi-head self-attention layer and the third residual connection and normalization layer respectively, transmitting the output data of the second multi-head self-attention layer to the third residual connection and normalization layer, and transmitting the output data of the third residual connection and normalization layer to the multi-head cross-modal attention layer and the fourth residual connection and normalization layer respectively, wherein the multi-head cross-modal attention layer receives the text encoding data transmitted by the text encoder unit corresponding to the current speech encoder unit;
and transmitting the output data of the fourth residual connection and normalization layer to the second position-wise feed-forward network layer and the fifth residual connection and normalization layer respectively, transmitting the output data of the second position-wise feed-forward network layer to the fifth residual connection and normalization layer, and transmitting the output data of the fifth residual connection and normalization layer to the next speech encoder unit or to the decoder module.
6. The land-air call voice conversion method according to claim 5, wherein said speech encoder unit satisfies the following conditions:

$\mathrm{MultiHead}(Q_s, K_s, V_s) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W_s^{O}$

$\tilde{H}_s^{(j)} = \mathrm{LayerNorm}\big(H_s^{(j-1)} + \mathrm{MultiHead}(H_s^{(j-1)}, H_s^{(j-1)}, H_s^{(j-1)})\big)$

$C_s^{(j)} = \mathrm{LayerNorm}\big(\tilde{H}_s^{(j)} + \mathrm{MultiHead}(\tilde{H}_s^{(j)}, H_w^{(j)}, H_w^{(j)})\big)$

$H_s^{(j)} = \mathrm{LayerNorm}\big(C_s^{(j)} + \mathrm{FFN}(C_s^{(j)})\big)$

wherein Q_s, K_s and V_s respectively denote the query, key and value matrices derived from the speech feature sequence; W_s^O is the linear transformation matrix applied after all attention heads are concatenated along the corresponding columns; $\tilde{H}_s^{(j)}$ represents the information propagated within the speech modality; notably, the output of layer j-1 serves as the input of layer j, and H_s^(0) = E_s is the output of the speech input representation module; C_s^(j) represents the cross-modal interaction information transferred from text to speech, the dimensions d_s and d_w being kept equal so that the two modalities share a unified representation; H_s^(j) is the speech encoding data output by the speech encoder unit of the j-th layer, j ≠ 0.
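For illustration only, a minimal PyTorch sketch of one cross-modal speech encoder unit following claims 5 and 6, with the cross-modal attention taking its query from the speech stream and its key/value from the paired text encoder unit; hidden sizes and the activation are assumptions, and d_s is simply kept equal to d_w as claim 6 requires.

import torch
import torch.nn as nn


class CrossModalSpeechEncoderUnit(nn.Module):
    def __init__(self, d_model: int = 512, num_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)   # second multi-head self-attention layer
        self.norm1 = nn.LayerNorm(d_model)                                              # third residual connection and normalization layer
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)  # multi-head cross-modal attention layer
        self.norm2 = nn.LayerNorm(d_model)                                              # fourth residual connection and normalization layer
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))  # second position-wise feed-forward network
        self.norm3 = nn.LayerNorm(d_model)                                              # fifth residual connection and normalization layer

    def forward(self, h_speech: torch.Tensor, h_text: torch.Tensor) -> torch.Tensor:
        # h_speech: E_s or the previous speech unit's output; h_text: the text encoding data
        # from the text encoder unit paired with this speech encoder unit.
        s_attn, _ = self.self_attn(h_speech, h_speech, h_speech)   # information propagated within the speech modality
        s = self.norm1(h_speech + s_attn)
        c_attn, _ = self.cross_attn(s, h_text, h_text)             # cross-modal interaction from text to speech
        c = self.norm2(s + c_attn)
        return self.norm3(c + self.ffn(c))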
7. The land-air call voice conversion method according to any one of claims 1 to 6, wherein the dual-stage training strategy comprises a first-stage pre-training strategy and a second-stage fine-tuning strategy; and the inputting, based on the dual-stage training strategy, of voice training data and text training data into a multi-modal voice recognition model for training to obtain the land-air voice recognition model comprises:
acquiring voice training data and text training data based on land-air call history data, wherein each piece of voice training data corresponds one-to-one to a piece of text training data;
in the first-stage pre-training strategy, inputting the text training data into the text encoder module for training based on a masked language modeling strategy, and inputting the voice training data into the cross-modal speech encoder module for training based on a cross-modal masked acoustic modeling strategy, so as to obtain an initial voice recognition model;
and in the second-stage fine-tuning strategy, performing secondary training on the initial voice recognition model based on a preset parameter adjustment strategy to obtain the land-air voice recognition model.
8. The land-air call voice conversion method according to claim 7, wherein the performing secondary training on the initial voice recognition model based on a preset parameter adjustment strategy to obtain the land-air voice recognition model comprises:
acquiring first voice secondary training data;
and inputting the first voice secondary training data into the initial voice recognition model with the text encoder module deactivated for secondary training, to obtain the land-air voice recognition model.
9. The land-air call voice conversion method according to claim 7, wherein the performing secondary training on the initial voice recognition model based on a preset parameter adjustment strategy to obtain the land-air voice recognition model further comprises:
acquiring mask text data and second voice secondary training data;
and inputting the mask text data into the text encoder module of the initial voice recognition model, and inputting the second voice secondary training data into the cross-modal speech encoder module of the initial voice recognition model for secondary training, to obtain the land-air voice recognition model.
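For illustration only, a schematic sketch of the dual-stage training strategy of claims 7-9. The model interface (pretraining_losses, recognition_loss), the masking helpers, the masking ratio and the equal weighting of the two pre-training losses are assumptions for the sketch; the claims do not fix these details.

import torch


def mask_tokens(tokens: torch.Tensor, mask_id: int = 0, ratio: float = 0.15):
    """Hypothetical masked-language-modeling helper: replace a fraction of token ids with mask_id."""
    mask = torch.rand_like(tokens, dtype=torch.float) < ratio
    return tokens.masked_fill(mask, mask_id), tokens


def mask_frames(frames: torch.Tensor, ratio: float = 0.15):
    """Hypothetical cross-modal masked acoustic modeling helper: zero out a fraction of feature frames."""
    mask = torch.rand(frames.shape[:-1]) < ratio
    masked = frames.clone()
    masked[mask] = 0.0
    return masked, frames


def pretrain_step(model, optimizer, text_batch, speech_batch):
    """Stage 1: masked language modeling (text encoder) plus cross-modal masked acoustic modeling (speech encoder)."""
    masked_text, text_targets = mask_tokens(text_batch)
    masked_speech, speech_targets = mask_frames(speech_batch)
    mlm_loss, mam_loss = model.pretraining_losses(masked_text, text_targets, masked_speech, speech_targets)  # hypothetical interface
    loss = mlm_loss + mam_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


def finetune_step(model, optimizer, speech_batch, transcripts, use_text_branch: bool = False):
    """Stage 2: secondary training on land-air call data. With use_text_branch=False the text
    encoder module is deactivated (claim 8); with True, mask text data is fed to the text
    encoder alongside the speech data (claim 9)."""
    text_input = mask_tokens(transcripts)[0] if use_text_branch else None
    loss = model.recognition_loss(speech_batch, transcripts, text_input=text_input)  # hypothetical interface
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()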
10. A land-air call voice conversion device, comprising: a data acquisition module and a voice recognition module;
wherein the data acquisition module is used for acquiring land-air call voice data;
and the voice recognition module is used for inputting the land-air call voice data into a land-air voice recognition model to obtain text data corresponding to the land-air call voice data; the training of the land-air voice recognition model specifically comprises: inputting, based on a dual-stage training strategy, voice training data and text training data into a multi-modal voice recognition model for training to obtain the land-air voice recognition model.
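For illustration only, a minimal sketch of the device of claim 10 as two cooperating modules; all class and method names and the audio-source interface are hypothetical.

class LandAirCallVoiceConversionDevice:
    """Sketch of claim 10: a data acquisition module and a voice recognition module."""

    def __init__(self, recognizer):
        # recognizer: a trained land-air voice recognition model exposing a hypothetical transcribe() method.
        self.recognizer = recognizer

    def acquire_voice_data(self, audio_source):
        """Data acquisition module: obtains land-air call voice data."""
        return audio_source.read()  # hypothetical audio source interface

    def recognize(self, voice_data):
        """Voice recognition module: voice data in, corresponding text data out."""
        return self.recognizer.transcribe(voice_data)

    def convert(self, audio_source):
        return self.recognize(self.acquire_voice_data(audio_source))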
CN202311399415.6A 2023-10-26 2023-10-26 Land-air communication voice conversion method and device, terminal equipment and storage medium Pending CN117912455A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311399415.6A CN117912455A (en) 2023-10-26 2023-10-26 Land-air communication voice conversion method and device, terminal equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117912455A true CN117912455A (en) 2024-04-19

Family

ID=90682663

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311399415.6A Pending CN117912455A (en) 2023-10-26 2023-10-26 Land-air communication voice conversion method and device, terminal equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117912455A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination