CN116450771A - Multilingual speech translation model construction method and device - Google Patents

Multilingual speech translation model construction method and device

Info

Publication number
CN116450771A
CN116450771A
Authority
CN
China
Prior art keywords
multilingual
data
speech translation
model
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211625664.8A
Other languages
Chinese (zh)
Inventor
孟庆梁 (Meng Qingliang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mgjia Beijing Technology Co ltd
Original Assignee
Mgjia Beijing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mgjia Beijing Technology Co ltd filed Critical Mgjia Beijing Technology Co ltd
Priority to CN202211625664.8A priority Critical patent/CN116450771A/en
Publication of CN116450771A publication Critical patent/CN116450771A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/3332 Query translation
    • G06F 16/3337 Translation of the query language, e.g. Chinese to English
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/334 Query execution
    • G06F 16/3343 Query execution using phonetics
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/40 Processing or translation of natural language
    • G06F 40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Acoustics & Sound (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a multilingual speech translation model construction method, a speech translation method and a speech translation device, and relates to the technical field of speech translation. The method comprises the following steps: generating sample data, where the sample data comprises multilingual audio data, multilingual text data and target text data; screening the sample data based on a preset quality evaluation standard to obtain training data; and training a preset encoding-decoding model with the training data to obtain a multilingual speech translation model. The invention solves the problems that the vehicle-mounted operation instruction set supports only a single language and that data sources in the field of vehicle-mounted speech translation are scarce, lets a user wake up the vehicle-mounted system in multiple languages, and trains the speech translation system with a pseudo-label scheme while overcoming the errors introduced by pseudo-label data, so that model training succeeds and good training results are obtained.

Description

Multilingual speech translation model construction method and device
Technical Field
The invention relates to the technical field of speech translation, in particular to a multilingual speech translation model construction method and device.
Background
In the field of vehicle-control voice, the customers served form a vast and varied group, so drivers inevitably come from many different countries and the problem of multiple languages arises. However, wake-up and control of a vehicle-mounted system are generally defined for a single language, so building a multilingual vehicle-control instruction set is resource-intensive, and as the number of customer countries grows, the subsequent maintenance cost becomes unacceptable.
At present, speech translation (Audio Speech Translation, abbreviated as AST) is an active research field in academia. Its aim is to translate speech input in multiple languages into text in a single language, and it can be used to alleviate the high maintenance cost of building a multilingual instruction set in the vehicle-mounted field: speech from every country can be translated into text of a single language, so that only a single-language instruction set needs to be maintained.
Despite the solutions proposed by academic research, building an AST system still faces the following problems: multilingual audio-to-target-text data is low-resource and manual construction is too expensive; open-source vehicle-mounted-domain data pairing multilingual audio with target text is nonexistent; a single model supports only a limited set of specific languages; and academic technical solutions are difficult to deploy at an industrial level, mainly because a single inference pass takes too long, oversized models train slowly, and open-source models perform poorly in the vehicle-mounted field.
Disclosure of Invention
In order to solve the problem of construction of an AST system in the prior art, the invention provides a multilingual speech translation model construction method and a speech translation method and device.
To achieve the above object, a first aspect of the embodiments of the present invention provides a method for constructing a multilingual speech translation model, which may include, but is not limited to, at least one of the following steps.
Generating sample data; the sample data includes multilingual audio data, multilingual text data, and target text data.
And screening the sample data based on a preset quality evaluation standard to obtain training data.
Training the preset coding-decoding model by adopting training data to obtain a multilingual speech translation model.
According to the invention, sample data is generated without supervision through a multi-dimensional generation flow, and corresponding data quality evaluation standards are constructed to ensure data quality. The sample data is evaluated against these standards so that data meeting the requirements is screened out as training data, and the model is then trained with this training data to obtain the multilingual speech translation model.
Optionally, generating the sample data includes: obtaining multilingual audio data; converting the multilingual audio data into text data of the corresponding languages with a speech-to-text transcription system; and converting the text data of the corresponding languages into target text data with a translation system.
According to the method, the acquired multilingual audio data is converted into text data of the corresponding languages by the speech-to-text transcription system. Because translated text data is a high-resource setting, a corresponding translation system can be constructed; the text is translated by this translation system, converting the text data of the corresponding languages into the target text data.
Optionally, generating the sample data further includes: acquiring target text data; converting the target text data into multilingual text data by adopting a translation system; and converting and synthesizing the multilingual text data by adopting a voice synthesis technology to obtain multilingual audio data.
The target text data used in the invention is readily available. The target text data is converted into multilingual text data by a translation system, and the multilingual text data is then converted and synthesized into multilingual audio data by machine speech synthesis.
In this alternative generation path, the text pairs (multilingual text data to target text data) already exist, and what is missing is the multilingual audio corresponding to the multilingual text; the missing audio can therefore be synthesized with speech synthesis technology or produced by manual recording.
Optionally, screening the sample data based on a preset quality evaluation standard to obtain training data includes: constructing the preset quality evaluation standard from a first index, a second index, a third index, a fourth index and a fifth index, where the first index is manual labeling, the second index is the speech-to-text transcription accuracy, the third index is a translation index, the fourth index is the audio data quality, and the fifth index is a category precision label; and grading the sample data according to the preset quality evaluation standard and screening it to obtain the training data.
According to the invention, the data quality evaluation standard is constructed from five indexes (manual labeling, speech-to-text transcription accuracy, translation index, audio data quality and category precision label), and the sample data is graded by quality so that usable data is screened out for training.
Optionally, training the preset encoding-decoding model by using training data to obtain a multilingual speech translation model, including: extracting multilingual audio features; constructing a transcription system from multilingual audio to multilingual text based on the audio characteristics, and performing first coding pre-training on the transcription system from multilingual audio to multilingual text to obtain a first parameter of a first code; constructing a text system from the multilingual text to the target text, and performing second decoding pre-training on the text system from the multilingual text to the target text to obtain a second parameter of the second decoding; a multilingual speech translation model is constructed through a pre-trained multilingual audio-to-multilingual text transcription system and a multilingual text-to-target text system.
The invention trains the model with the training data to obtain the multilingual speech translation model. Multilingual audio features are first extracted to capture useful information; the multilingual-audio-to-multilingual-text transcription system built on these audio features is then given a first encoding pre-training, so that the model gains the ability to convert speech into text; the constructed multilingual-text-to-target-text system is given a second decoding pre-training, so that the model acquires a sense of the language; and the multilingual speech translation model is then constructed from the two.
Optionally, constructing the multilingual speech translation model from the pre-trained multilingual-audio-to-multilingual-text transcription system and the multilingual-text-to-target-text system includes: loading the first parameters and the second parameters into the preset encoding-decoding model and performing a first round of training; increasing the model inactivation rate and performing a second round of training on the preset encoding-decoding model, the inactivation rate reaching a first value when training finishes; and continuing to increase the model inactivation rate while applying an online moving average to the preset encoding-decoding model, so that the inactivation rate reaches a second value after training and the model converges, obtaining the multilingual speech translation model.
By training the constructed multilingual speech translation model multiple times and continuously increasing the model inactivation (dropout) rate, the invention increases the generalization of the model and further reduces the errors caused by the unsupervised data, until the model converges and the multilingual speech translation model is obtained.
Based on the above process, a second aspect of the embodiment of the present invention provides an apparatus for multi-language speech translation model construction, where the apparatus may include, but is not limited to, a sample data generating unit, a training data obtaining unit, and a model construction unit.
A sample data generation unit configured to generate sample data; the sample data comprises multilingual audio data, multilingual text data and target text data;
the training data obtaining unit is used for screening the sample data based on a preset quality evaluation standard to obtain training data;
the model construction unit is used for training the preset encoding-decoding model by adopting training data to obtain a multilingual speech translation model.
A third aspect of the embodiment of the present invention provides a multilingual speech translation method, including: acquiring the audio of the language to be translated; inputting the language audio to be translated into the multilingual speech translation model constructed by the multilingual speech translation model construction method in the first aspect of the embodiment of the invention and any embodiment of the first aspect of the invention to obtain the target text.
According to the method, the audio of the language to be translated is acquired and input into the constructed multilingual speech translation model to obtain the target text, which alleviates the problems of scarce AST data resources, high manual construction cost, and the absence of vehicle-mounted-domain data.
A fourth aspect of embodiments of the present invention provides a multilingual speech translation apparatus that may include, but is not limited to, an audio acquisition unit and a translation unit.
And the audio acquisition unit is used for acquiring the voice audio to be translated.
The translation unit is used for inputting the language audio to be translated into the multilingual speech translation model constructed by the multilingual speech translation model construction method in the first aspect of the embodiment of the invention and any embodiment of the first aspect of the embodiment of the invention to obtain the target text.
To achieve the above object, a fifth aspect of the present invention provides an electronic device, which may include a memory and a processor, where the memory stores computer readable instructions, where the computer readable instructions, when executed by the processor, cause the processor to execute the steps of the multilingual speech translation model construction method according to the first aspect of the present invention and any one of the embodiments of the first aspect of the present invention, and the steps of the multilingual speech translation method according to the third aspect of the present invention.
To achieve the above object, a sixth aspect of the present invention provides a storage medium storing computer readable instructions, where the computer readable instructions, when executed by one or more processors, cause the one or more processors to perform the steps of the multilingual speech translation model construction method in the first aspect of the present invention and any one of its embodiments, and the steps of the multilingual speech translation method in the third aspect of the present invention.
The beneficial effects of the invention are as follows:
according to the invention, the multilingual audio data, multilingual text data and target text data are generated through a multi-dimensional generation process, which solves the problem that the vehicle-mounted operation instruction set supports only a single language, lets a user wake up the vehicle-mounted system in multiple languages, and alleviates the scarcity of data sources in the field of vehicle-mounted speech translation; screening the sample data with a preset quality evaluation standard ensures the quality of the training data, and training the speech translation system with the screened training data overcomes the errors introduced by the sample data, so that model training succeeds, good training results are obtained, and the effect is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the present invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a specific example of a multilingual speech translation model construction method according to an embodiment of the present invention;
FIG. 2 is a flowchart of sample data unsupervised system generation in multilingual speech translation model construction in accordance with an embodiment of the present invention;
FIG. 3 is a flowchart of a specific example of multilingual speech translation model construction and training in accordance with an embodiment of the present invention;
FIG. 4 is a schematic block diagram of a specific example of a multilingual speech translation model construction apparatus according to an embodiment of the present invention;
FIG. 5 is a schematic block diagram of a specific example of a multilingual speech translation apparatus in accordance with an embodiment of the present invention;
fig. 6 is a diagram illustrating an embodiment of an electronic device according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings, in which some, but not all, embodiments of the invention are shown. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
As shown in fig. 1, an embodiment of the present invention provides a multilingual speech translation model construction method, which includes, but is not limited to, steps S100 to S300.
Step S100, generating sample data; the sample data includes multilingual audio data, multilingual text data, and target text data.
The sample data includes multilingual audio data, multilingual text data and target text data, and may be generated in various ways.
And step S200, screening the sample data based on a preset quality evaluation standard to obtain training data.
Illustratively, the quality evaluation criteria include a plurality of indicators, grading the data quality by corresponding criteria of the plurality of indicators, and filtering the sample data according to different grades to obtain training data for training.
And step S300, training a preset coding-decoding model by using training data to obtain a multilingual speech translation model.
Illustratively, the coding-decoding model is a pre-constructed model structure, and the multi-language speech translation model is obtained by training the pre-set model structure by using training data.
According to the invention, the multilingual audio data, multilingual text data and target text data are generated through a multi-dimensional generation process, which solves the problem that the vehicle-mounted operation instruction set supports only a single language, lets a user wake up the vehicle-mounted system in multiple languages, and alleviates the scarcity of data sources in the field of vehicle-mounted speech translation. Screening the sample data with a preset quality evaluation standard ensures the quality of the training data, and training the speech translation system with the screened training data overcomes the errors introduced by the sample data, so that model training succeeds, good training results are obtained, and the effect is improved.
As shown in fig. 2, a flow is generated for the sample data unsupervised system. As an alternative embodiment of the present invention, generating sample data includes: obtaining multilingual audio data; a speech-to-text transcription system (Automatic Speech Recognition, ASR for short) is adopted to convert multilingual audio data into text data of corresponding languages; and converting the text data of the corresponding language into target text data by adopting a translation system.
Illustratively, the speech-to-text transcription (ASR) system has relatively abundant data resources compared with an AST system, but its sample data lacks the corresponding translations. In this embodiment, a translation system is therefore employed to convert the text data of the corresponding languages into target text data. The translation system adopts a text-translation framework; the text-translation data corresponding to such a system belongs to a high-resource data set, i.e. there is plenty of open-source data as well as open-source frameworks, so the text data of the corresponding languages can be converted into target text data.
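As an illustration only, the forward generation path just described (multilingual audio, then ASR transcription, then text translation) could be organized as in the following Python sketch; the `transcribe` and `translate` callables are hypothetical stand-ins for whatever ASR and translation systems are actually deployed, and the sample layout is merely one possible choice.

```python
from typing import Callable, List, Tuple

Sample = Tuple[str, str, str]   # (audio_path, source_language_text, target_text)

def generate_forward(audio_paths: List[str],
                     transcribe: Callable[[str], str],
                     translate: Callable[[str], str]) -> List[Sample]:
    """Forward generation path: audio -> ASR -> source text -> MT -> target text.

    `transcribe` and `translate` are hypothetical stand-ins for the deployed
    ASR and text-translation systems; both outputs are pseudo labels, so the
    resulting samples still go through the quality screening described later.
    """
    samples: List[Sample] = []
    for path in audio_paths:
        source_text = transcribe(path)        # speech-to-text transcription
        target_text = translate(source_text)  # high-resource text translation
        samples.append((path, source_text, target_text))
    return samples
```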
As an optional embodiment of the invention, generating the sample data further comprises: acquiring target text data; converting the target text data into multilingual text data by adopting a translation system; and converting and synthesizing the multilingual text data by adopting a voice synthesis technology to obtain multilingual audio data.
Illustratively, the target text may be Chinese text, but is not limited thereto. Chinese text is very easy to obtain: Chinese text data can be converted into multilingual text data through a translation system, and machine generation of multilingual audio data is then realized through text-to-speech (TTS) synthesis. A large number of open-source TTS data sources exist, and text-to-speech synthesis can be trained on many of them; therefore, mature speech synthesis technology can be adopted to convert and synthesize the multilingual text data into multilingual audio data, thereby forming sample data for the speech translation system. This method generates sample data at low cost, can be applied to the vehicle-mounted field, and addresses the fact that open-source vehicle-mounted-domain data pairing multilingual audio with Chinese text is zero.
Illustratively, because translation pairs of text are high-resource data and are relatively easy to construct, i.e. open-source multilingual-text-to-Chinese-text data exists, the missing audio portion can be synthesized by machine using speech synthesis technology or recorded manually. When generating sample data, the existing multilingual text data and the corresponding target text data can be obtained directly, and the corresponding multilingual audio data is then produced with speech synthesis technology, forming the sample data of the speech translation system. Sample data generated in this way is of higher quality.
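The reverse generation path of this embodiment (target text, then translation into each language, then speech synthesis) might be sketched as follows; `translate_to` and `synthesize` are again hypothetical placeholders for the translation and TTS systems, and `synthesize` is assumed to return the path of the generated audio file.

```python
from typing import Callable, List, Tuple

Sample = Tuple[str, str, str]   # (audio_path, source_language_text, target_text)

def generate_reverse(target_texts: List[str],
                     languages: List[str],
                     translate_to: Callable[[str, str], str],
                     synthesize: Callable[[str, str], str]) -> List[Sample]:
    """Reverse generation path: target text -> MT -> multilingual text -> TTS -> audio.

    `translate_to(text, lang)` and `synthesize(text, lang)` are hypothetical
    stand-ins for the translation and speech-synthesis systems; `synthesize`
    is assumed to return the path of the generated audio file.
    """
    samples: List[Sample] = []
    for target_text in target_texts:
        for lang in languages:
            source_text = translate_to(target_text, lang)   # target -> source language
            audio_path = synthesize(source_text, lang)      # synthesized audio in that language
            samples.append((audio_path, source_text, target_text))
    return samples
```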
For example, the sample data required for building a speech translation system is expensive to obtain, so a multi-dimensional generation flow is adopted to generate it: generation starts from comparatively high-resource data sources, and the sample data is expanded through the corresponding deep-learning techniques.
As an optional embodiment of the present invention, screening sample data based on a preset quality evaluation criterion to obtain training data includes: constructing a preset quality evaluation standard through the first index, the second index, the third index, the fourth index and the fifth index; the first index is marked manually, the second index is the transcription accuracy of the voice to the characters, the third index is the translation index, the fourth index is the audio data quality, and the fifth index is the category precision mark; and grading the sample data according to a preset quality evaluation standard, and screening to obtain training data.
Generating various kinds of unsupervised data as sample data with the above three schemes greatly reduces labor cost and alleviates the problems that speech-translation data is low-resource and hard to acquire. Data evaluation criteria are constructed to ensure data quality: the generated unsupervised sample data is given a star-level rating against the data-quality-evaluation indexes to judge whether the screened data is usable, and data rated three stars and above can be used, i.e. only data at or above three stars is used as training data for model training. The data quality evaluation criteria are shown in the following table.
For these five indexes, manual labeling of data means that a person annotates the data with a labeling tool, for example through classification, bounding boxes, notes or tags; a precision label is one whose mark and position are both accurate, and data that has gone through multiple rounds of manual labeling is generally treated as precisely labeled. The ASR precision label means that the ASR accuracy is greater than 90%, where the accuracy is 1 minus the word error rate (WER), and the word error rate is the number of inserted, substituted and deleted words as a percentage of the number of words in the reference word sequence. The translation BLEU refers to the bilingual evaluation understudy (BLEU) index, which is calculated as follows:
BLEU = BP · exp(Σ_{n=1}^{N} ω_n · log P_n), where BP is the brevity penalty factor (BP = 1 when the candidate length c exceeds the reference length r, and BP = e^(1 − r/c) otherwise), ω_n is the weight of the n-gram order, the upper limit of N is 4, and P_n is the n-gram-based precision.
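For concreteness, the formula above can be evaluated for a single candidate-reference pair with uniform weights ω_n = 1/N and N = 4 as in the following illustrative sketch; this is not the evaluation code of the patented system.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Single-reference BLEU with uniform weights w_n = 1/N and N = 4."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        precisions.append(overlap / total if total else 0.0)
    if min(precisions) == 0.0:
        return 0.0                      # any zero n-gram precision zeroes the geometric mean
    c, r = len(cand), len(ref)
    bp = 1.0 if c > r else math.exp(1.0 - r / c)   # brevity penalty
    return bp * math.exp(sum(math.log(p) / max_n for p in precisions))

# Example: bleu("I want to go to school", "I want to go to school") == 1.0
```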
The quality of the audio data means that the audio has no obvious noise, defects or repetition, can be played back, and has clear sound quality. Audio with no obvious noise, clear sound quality and no repeated defects is rated excellent; audio with some noise or some repetition that does not affect its quality is rated good; audio with loud noise or long repetitions that make the content hard to hear is rated defective. The category precision index can be understood as the richness of the translation results. For example, when translating a sentence meaning "I want to go to school": if there is only one translation result, "I want to go to school", the category is single; if there are two translations, "I want to go to school" and "I wanna go to school", the category coverage is richer.
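Putting the five indexes together, the screening of step S200 can be pictured with the sketch below; only the rule that data rated three stars and above is usable and the 90% ASR-accuracy bound come from the text, while the remaining thresholds and star increments are illustrative placeholders because the original grading table is not reproduced here.

```python
from typing import Dict, Iterable, List

Metrics = Dict[str, float]

def star_rating(m: Metrics) -> int:
    """Map per-sample quality metrics to a 1-5 star rating.

    The five indexes follow the text (manual labeling, ASR accuracy, translation
    BLEU, audio quality, category precision); the increments and the BLEU /
    audio / category thresholds below are illustrative placeholders, since the
    original grading table is not reproduced here.
    """
    stars = 1
    if m.get("manual_label", 0.0) >= 1.0:      # multi-round manual (precision) labeling
        stars += 1
    if m.get("asr_accuracy", 0.0) > 0.90:      # ASR precision label: accuracy > 90%
        stars += 1
    if m.get("bleu", 0.0) > 0.30:              # translation BLEU (placeholder threshold)
        stars += 1
    if m.get("audio_quality", 0.0) >= 1.0 and m.get("category_richness", 0.0) >= 1.0:
        stars += 1                             # clean audio and rich translation categories
    return stars

def screen_training_data(samples: Iterable[Metrics]) -> List[Metrics]:
    """Keep only samples rated three stars and above, as required by the text."""
    return [s for s in samples if star_rating(s) >= 3]
```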
As an optional implementation manner of the present invention, training the preset encoding-decoding model by using the training data to obtain a multilingual speech translation model, including: extracting multilingual audio features; constructing a transcription system from multilingual audio to multilingual text based on the audio characteristics, and performing first coding pre-training on the transcription system from multilingual audio to multilingual text to obtain a first parameter of a first code; constructing a text system from the multilingual text to the target text, and performing second decoding pre-training on the text system from the multilingual text to the target text to obtain second decoded parameters; a multilingual speech translation model is constructed through a pre-trained multilingual audio-to-multilingual text transcription system and a multilingual text-to-target text system.
FIG. 3 is a flowchart of constructing and training the multilingual speech translation model. By way of example, training is performed with a self-developed framework and a staged training method: useful information is extracted by extracting language features from the multilingual audio, and the audio is length-sampled so that it is shortened to within 6 seconds, corresponding to a limit of about 15 Chinese characters.
Specifically, the multilingual-audio-to-multilingual-text transcription system can adopt a 12-layer Conformer and is trained with the extracted audio features, so that its parameters are determined and it gains the function of converting multilingual audio into multilingual text; the multilingual-text-to-target-text system adopts a 6-layer decoder and is trained on the sample data, so that its parameters are determined and it gains the function of converting multilingual text into target text. The pre-trained multilingual-audio-to-multilingual-text transcription system and the multilingual-text-to-target-text system together constitute the multilingual speech translation model.
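A rough sketch of such an encoder-decoder composition is given below in PyTorch. Only the 12 encoder layers and the 6 decoder layers follow the text; the Conformer blocks are approximated by standard Transformer encoder layers, and the feature dimension, number of attention heads and vocabulary size are placeholder assumptions.

```python
import torch
import torch.nn as nn

class MultilingualASTModel(nn.Module):
    """Encoder-decoder sketch: 12 encoder layers (standing in for the Conformer
    described in the text) and a 6-layer decoder over target-text tokens.
    Positional encodings and feature subsampling are omitted for brevity;
    dimensions and vocabulary size are illustrative assumptions."""

    def __init__(self, feat_dim: int = 80, d_model: int = 256,
                 n_heads: int = 4, vocab_size: int = 8000, dropout: float = 0.1):
        super().__init__()
        self.input_proj = nn.Linear(feat_dim, d_model)       # project audio features
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                               dropout=dropout, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=12)   # 12 layers (per the text)
        self.token_emb = nn.Embedding(vocab_size, d_model)
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, 4 * d_model,
                                               dropout=dropout, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=6)    # 6-layer decoder
        self.output_proj = nn.Linear(d_model, vocab_size)    # logits over the target-text vocabulary

    def forward(self, audio_feats: torch.Tensor, target_tokens: torch.Tensor) -> torch.Tensor:
        memory = self.encoder(self.input_proj(audio_feats))  # speech encoding
        tgt = self.token_emb(target_tokens)
        tgt_len = target_tokens.size(1)
        causal = torch.triu(torch.full((tgt_len, tgt_len), float("-inf")), diagonal=1)
        dec_out = self.decoder(tgt, memory, tgt_mask=causal)  # autoregressive decoding
        return self.output_proj(dec_out)
```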
As an alternative embodiment of the invention, constructing the multilingual speech translation model from the pre-trained multilingual-audio-to-multilingual-text transcription system and the multilingual-text-to-target-text system includes: loading the first parameters and the second parameters into the preset encoding-decoding model and performing a first round of training; increasing the model inactivation rate and performing a second round of training on the preset encoding-decoding model, the inactivation rate reaching a first value when training finishes; and continuing to increase the model inactivation rate while applying an online moving average to the preset encoding-decoding model, so that the inactivation rate reaches a second value after training and the model converges, obtaining the multilingual speech translation model.
By way of example, the multilingual-audio-to-multilingual-text transcription system and the multilingual-text-to-target-text system are loaded with the pre-trained first (encoding) parameters and second (decoding) parameters and a first round of training is performed; extensive experiments show that good results can be obtained within 100 steps. The model that has completed the first round is then loaded with the first-round parameters, the model inactivation (dropout) rate is increased to a first value of 0.3 to increase model generalization, and a second round of about 100 steps is trained; the improved generalization further reduces the errors caused by unlabeled data. After the second round, an online moving average of the model is applied, the inactivation rate is further increased to a second value of 0.4 to add generalization, and the third round of training is completed, at which point the model converges and the multilingual speech translation model is obtained. The multilingual speech translation model is then quantized, reducing the encoding-decoding model from 32-bit to 16-bit precision so that inference is faster.
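One possible organization of this three-round schedule is sketched below. The dropout values 0.3 and 0.4, the online moving average in the final round and the 32-bit to 16-bit precision reduction follow the text; the `train_one_round` callback, the `encoder`/`decoder` attribute names and the EMA decay are assumptions made for illustration.

```python
import copy
import torch
import torch.nn as nn

def set_dropout(model: nn.Module, p: float) -> None:
    """Raise the inactivation (dropout) rate of every dropout layer in the model."""
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.p = p

def ema_update(model: nn.Module, shadow: nn.Module, decay: float = 0.999) -> None:
    """Online moving average of parameters; the decay value is an assumption."""
    with torch.no_grad():
        for p, sp in zip(model.parameters(), shadow.parameters()):
            sp.mul_(decay).add_(p, alpha=1.0 - decay)

def staged_training(model: nn.Module, encoder_state: dict, decoder_state: dict,
                    train_one_round) -> nn.Module:
    """Three-round schedule: load pre-trained parameters, then raise dropout to 0.3
    and 0.4 across rounds, applying the moving average in the final round.
    `train_one_round(model, ema_hook)` is a placeholder for the actual ~100-step loop."""
    model.encoder.load_state_dict(encoder_state)      # first (encoding) parameters
    model.decoder.load_state_dict(decoder_state)      # second (decoding) parameters
    train_one_round(model, ema_hook=None)             # round one

    set_dropout(model, 0.3)                           # round two: inactivation rate 0.3
    train_one_round(model, ema_hook=None)

    set_dropout(model, 0.4)                           # round three: inactivation rate 0.4 + EMA
    shadow = copy.deepcopy(model)
    train_one_round(model, ema_hook=lambda: ema_update(model, shadow))

    return shadow.half()                              # 32-bit to 16-bit precision reduction
```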
As shown in fig. 4, the present invention provides a device for constructing a multilingual speech translation model, which includes, but is not limited to, a sample data generating unit, a training data obtaining unit, and a model constructing unit, and is specifically described below.
A sample data generation unit configured to generate sample data; the sample data includes multilingual audio data, multilingual text data, and target text data.
The training data obtaining unit is used for screening the sample data based on a preset quality evaluation standard to obtain training data.
The model construction unit is used for training the preset encoding-decoding model by adopting training data to obtain a multilingual speech translation model.
According to the invention, the multilingual audio data, multilingual text data and target text data are generated through a multi-dimensional generation process, which solves the problem that the vehicle-mounted operation instruction set supports only a single language, lets a user wake up the vehicle-mounted system in multiple languages, and alleviates the scarcity of data sources in the field of vehicle-mounted speech translation. Screening the sample data with a preset quality evaluation standard ensures the quality of the training data, and training the speech translation system with the screened training data overcomes the errors introduced by the sample data, so that model training succeeds, good training results are obtained, and the effect is improved.
Based on a method for constructing a multilingual speech translation model, one or more embodiments of the present invention can also provide a multilingual speech translation method, which includes obtaining audio of a language to be translated; inputting the language audio to be translated into the multilingual speech translation model constructed by the multilingual speech translation model construction method of any embodiment to obtain the target text.
The target text is obtained by inputting the language audio to be translated into the constructed multilingual speech translation model.
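As a usage illustration, inference with a trained model could follow a simple greedy decoding loop like the sketch below; the `extract_features` helper, the special token ids and the model interface are assumptions carried over from the earlier sketches rather than details given in the text.

```python
import torch

def translate_audio(model, extract_features, audio_path: str,
                    bos_id: int = 1, eos_id: int = 2, max_len: int = 32) -> list:
    """Greedy decoding sketch: audio in, target-language token ids out.

    `model` is assumed to follow the encoder-decoder interface sketched earlier,
    and `extract_features(audio_path)` is a placeholder returning a
    (1, time, feat_dim) tensor of audio features.
    """
    model.eval()
    feats = extract_features(audio_path)
    tokens = torch.tensor([[bos_id]], dtype=torch.long)
    with torch.no_grad():
        for _ in range(max_len):
            logits = model(feats, tokens)              # (1, len, vocab)
            next_id = int(logits[0, -1].argmax())
            tokens = torch.cat([tokens, torch.tensor([[next_id]])], dim=1)
            if next_id == eos_id:
                break
    return tokens[0, 1:].tolist()                      # drop BOS; map ids to text externally
```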
As shown in fig. 5, the present invention provides a multilingual speech translation apparatus, which includes, but is not limited to, an audio acquisition unit and a translation unit, as described in detail below.
And the audio acquisition unit is used for acquiring the voice audio to be translated.
The translation unit is used for inputting the audio to be translated into the multilingual speech translation model constructed by the multilingual speech translation model construction method of any embodiment to obtain the target text.
As shown in fig. 6, the method for constructing a multilingual speech translation model is based on the same technical concept, and one or more embodiments of the present invention can also provide an electronic device including a memory and a processor, where the memory stores computer readable instructions that, when executed by the processor, cause the processor to perform the steps of the method for constructing a multilingual speech translation model in any one of the embodiments of the present invention.
As shown in fig. 6, the method of constructing a multilingual speech translation model is based on the same technical concept, and one or more embodiments of the present invention can also provide a storage medium storing computer readable instructions that when executed by one or more processors cause the one or more processors to perform the steps of the method of constructing a multilingual speech translation model in any of the embodiments of the present invention.
Logic and/or steps represented in the flowcharts or otherwise described herein, for example an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable storage medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, a processor-containing system, or another system that can fetch the instructions from the instruction execution system, apparatus, or device and execute them. For the purposes of this description, a "computer-readable storage medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable storage medium include the following: an electrical connection (electronic device) with one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). In addition, the computer-readable storage medium may even be paper or another suitable medium on which the program is printed, as the program can be electronically captured, for instance by optical scanning of the paper or other medium, then compiled, interpreted, or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, may be implemented using any one or combination of the following techniques, as is well known in the art: discrete logic circuits with logic gates for implementing logic functions on data signals, application specific integrated circuits with appropriate combinational logic gates, programmable gate arrays (PGA, programmable Gate Array), field programmable gate arrays (FPGA, field Programmable Gate Array), and the like.
In the description of the present specification, a description referring to the terms "present embodiment," "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present invention, the meaning of "plurality" means at least two, for example, two, three, etc., unless specifically defined otherwise.
The above description is only of the preferred embodiments of the present invention, and is not intended to limit the invention, but any modifications, equivalents, and simple improvements made within the spirit of the present invention should be included in the scope of the present invention.

Claims (11)

1. A multilingual speech translation model construction method is characterized by comprising the following steps:
generating sample data; the sample data comprises multilingual audio data, multilingual text data and target text data;
screening the sample data based on a preset quality evaluation standard to obtain training data;
and training a preset coding-decoding model by adopting the training data to obtain a multilingual speech translation model.
2. The method for constructing a multilingual speech translation model according to claim 1, wherein the generating sample data comprises:
obtaining multilingual audio data;
the multi-language audio data are converted into text data of corresponding languages by adopting a voice-to-text transcription system;
and converting the text data of the corresponding language into target text data by adopting a translation system.
3. The method for constructing a multilingual speech translation model according to claim 1, wherein the generating sample data further comprises:
acquiring target text data;
converting the target text data into multilingual text data by adopting a translation system;
and converting and synthesizing the multilingual text data by adopting a voice synthesis technology to obtain multilingual audio data.
4. The method for constructing a multilingual speech translation model according to claim 1, wherein the step of screening the sample data based on a preset quality evaluation criterion to obtain training data comprises:
constructing the preset quality evaluation standard through the first index, the second index, the third index, the fourth index and the fifth index; the first index is marked manually, the second index is the transcription accuracy rate from voice to words, the third index is a translation index, the fourth index is the audio data quality, and the fifth index is a category precision mark;
and grading the sample data according to the preset quality evaluation standard, and screening to obtain training data.
5. The method for constructing a multilingual speech translation model according to claim 1, wherein training the preset encoding-decoding model with the training data to obtain the multilingual speech translation model comprises:
extracting multilingual audio features;
constructing a transcription system from multilingual audio to multilingual text based on the audio characteristics, and performing first coding pre-training on the transcription system from multilingual audio to multilingual text to obtain first parameters of the first coding;
constructing a text system from the multilingual text to the target text, and performing second decoding pre-training on the text system from the multilingual text to the target text to obtain second parameters of the second decoding;
a multilingual speech translation model is constructed through a pre-trained multilingual audio-to-multilingual text transcription system and a multilingual text-to-target text system.
6. The method for constructing a multilingual speech translation model according to claim 5, wherein the constructing the multilingual speech translation model by the pretrained multilingual audio-to-multilingual text transcription system and the multilingual text-to-target text system includes:
loading the first parameter and the second parameter into a preset coding-decoding model for training once;
the model inactivation rate is improved, the preset encoding-decoding model is trained for the second time, and the model inactivation rate reaches a first value after training is completed;
and continuously improving the model inactivation rate, carrying out an online moving average of the preset encoding-decoding model, enabling the model inactivation rate to reach a second value after training, and converging the model to obtain the multilingual speech translation model.
7. A device for constructing a multilingual speech translation model, comprising:
a sample data generation unit configured to generate sample data; the sample data comprises multilingual audio data, multilingual text data and target text data;
the training data obtaining unit is used for screening the sample data based on a preset quality evaluation standard to obtain training data;
and the model construction unit is used for training a preset coding-decoding model by adopting the training data to obtain a multilingual speech translation model.
8. A method for multilingual speech translation, comprising:
acquiring the audio of the language to be translated;
inputting the language audio to be translated into the multilingual speech translation model constructed by the multilingual speech translation model construction method according to any one of claims 1 to 6 to obtain a target text.
9. A multilingual speech translation apparatus comprising:
the audio acquisition unit is used for acquiring voice audio to be translated;
the translation unit is used for inputting the audio to be translated into the multilingual speech translation model constructed by the multilingual speech translation model construction method according to any one of claims 1-6 to obtain the target text.
10. An electronic device comprising a memory and a processor, the memory having stored therein computer readable instructions that, when executed by the processor, cause the processor to perform the steps of the multilingual speech translation model construction method of any one of claims 1-6 and the steps of the multilingual speech translation method of claim 8.
11. A storage medium storing computer readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the multilingual speech translation model construction method of any one of claims 1-6 and the steps of the multilingual speech translation method of claim 8.
CN202211625664.8A 2022-12-16 2022-12-16 Multilingual speech translation model construction method and device Pending CN116450771A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211625664.8A CN116450771A (en) 2022-12-16 2022-12-16 Multilingual speech translation model construction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211625664.8A CN116450771A (en) 2022-12-16 2022-12-16 Multilingual speech translation model construction method and device

Publications (1)

Publication Number Publication Date
CN116450771A true CN116450771A (en) 2023-07-18

Family

ID=87130846

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211625664.8A Pending CN116450771A (en) 2022-12-16 2022-12-16 Multilingual speech translation model construction method and device

Country Status (1)

Country Link
CN (1) CN116450771A (en)


Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109785824A (en) * 2019-03-15 2019-05-21 科大讯飞股份有限公司 A kind of training method and device of voiced translation model
CN110503945A (en) * 2019-09-06 2019-11-26 北京金山数字娱乐科技有限公司 A kind of training method and device of speech processes model
CN112699690A (en) * 2020-12-29 2021-04-23 科大讯飞股份有限公司 Translation model training method, translation method, electronic device, and storage medium
CN113408305A (en) * 2021-06-30 2021-09-17 北京百度网讯科技有限公司 Model training method, device, equipment and storage medium
CN113569562A (en) * 2021-07-02 2021-10-29 中译语通科技股份有限公司 Method and system for reducing cross-modal and cross-language barrier of end-to-end voice translation
CN113505610A (en) * 2021-07-09 2021-10-15 中国人民解放军战略支援部队信息工程大学 Model enhancement-based speech translation model training method and system, and speech translation method and equipment
CN113378586A (en) * 2021-07-15 2021-09-10 北京有竹居网络技术有限公司 Speech translation method, translation model training method, device, medium, and apparatus
CN113763937A (en) * 2021-10-27 2021-12-07 北京百度网讯科技有限公司 Method, device and equipment for generating voice processing model and storage medium
CN114550693A (en) * 2022-03-02 2022-05-27 郑州科技学院 Multilingual voice translation method and system

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116844523A (en) * 2023-08-31 2023-10-03 深圳市声扬科技有限公司 Voice data generation method and device, electronic equipment and readable storage medium
CN116844523B (en) * 2023-08-31 2023-11-10 深圳市声扬科技有限公司 Voice data generation method and device, electronic equipment and readable storage medium

Similar Documents

Publication Publication Date Title
CN111883110B (en) Acoustic model training method, system, equipment and medium for speech recognition
WO2017067206A1 (en) Training method for multiple personalized acoustic models, and voice synthesis method and device
CN106601228B (en) Sample labeling method and device based on artificial intelligence rhythm prediction
US7979280B2 (en) Text to speech synthesis
JP4536323B2 (en) Speech-speech generation system and method
US8352270B2 (en) Interactive TTS optimization tool
CN107516509B (en) Voice database construction method and system for news broadcast voice synthesis
US20080177543A1 (en) Stochastic Syllable Accent Recognition
EP4273854A1 (en) Information synthesis method and apparatus, electronic device, and computer readable storage medium
US8626510B2 (en) Speech synthesizing device, computer program product, and method
CN110390928B (en) Method and system for training speech synthesis model of automatic expansion corpus
CN112818089B (en) Text phonetic notation method, electronic equipment and storage medium
CN112259083B (en) Audio processing method and device
WO2023245389A1 (en) Song generation method, apparatus, electronic device, and storage medium
CN111341293A (en) Text voice front-end conversion method, device, equipment and storage medium
CN113327574A (en) Speech synthesis method, device, computer equipment and storage medium
CN116450771A (en) Multilingual speech translation model construction method and device
JP2022548574A (en) Sequence-Structure Preservation Attention Mechanisms in Sequence Neural Models
Panda et al. An efficient model for text-to-speech synthesis in Indian languages
CN110728133B (en) Individual corpus acquisition method and individual corpus acquisition device
Sreeram et al. Joint language identification of code-switching speech using attention-based E2E network
CN116504223A (en) Speech translation method and device, electronic equipment and storage medium
JP2006178334A (en) Language learning system
CN110310620B (en) Speech fusion method based on native pronunciation reinforcement learning
CN114822489A (en) Text transfer method and text transfer device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination