CN116229994B - Method and device for constructing a diacritic prediction model for Arabic - Google Patents


Info

Publication number
CN116229994B
CN116229994B (application CN202310505137.1A)
Authority
CN
China
Prior art keywords
text
training
data
prediction model
identifier
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310505137.1A
Other languages
Chinese (zh)
Other versions
CN116229994A (en)
Inventor
林一侃
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qingshu Intelligent Technology Co ltd
Original Assignee
Beijing Aishu Wisdom Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Aishu Wisdom Technology Co ltd filed Critical Beijing Aishu Wisdom Technology Co ltd
Priority to CN202310505137.1A priority Critical patent/CN116229994B/en
Publication of CN116229994A publication Critical patent/CN116229994A/en
Application granted granted Critical
Publication of CN116229994B publication Critical patent/CN116229994B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/16 Vocoder architecture
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G06F40/117 Tagging; Marking up; Designating a block; Setting of attributes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training

Abstract

The application discloses a method and a device for constructing a diacritic prediction model for Arabic. The method comprises the following steps: performing single-modal training on an acoustic encoder and a text encoder separately; training a first diacritic prediction model based on first training data, and updating the parameters of the acoustic encoder, the text encoder, and a multimodal joint network in the first diacritic prediction model; and fine-tuning the first diacritic prediction model based on the first training data and second training data to obtain a second diacritic prediction model. In the embodiments of the application, multimodal Arabic data participate in one or more stages of training the diacritic prediction model, so the model can learn richer language variants and stylistic scenarios and therefore has greater potential room for improvement.

Description

Method and device for constructing a diacritic prediction model for Arabic
Technical Field
The application belongs to the technical field of natural language processing, and in particular relates to a method and a device for constructing a diacritic prediction model for Arabic.
Background
The Arabic writing system consists mainly of letters and diacritical marks (sometimes called "vocalization marks"; hereinafter "diacritics"). Phonetically, the letters mainly represent consonants, while the diacritics represent pronunciation phenomena that the letters cannot express, such as short vowels, nunation, the absence of a vowel, and consonant gemination; a diacritic cannot stand apart from a letter. For a passage of Arabic text to fully encode its pronunciation, it must be written out with both letters and diacritics. In practice, however, outside of religious texts and language-teaching materials, Arabic text omits its diacritics almost entirely, or uses only a few of them.
A single undiacritized word form can correspond to many possible diacritized forms, depending on its meaning and syntactic role in a specific context. For Modern Standard Arabic and Classical Arabic, the rules for diacritization are complete and standardized. However, these two variants are used only in religious texts, books, news, and other formal settings. In daily life, Arabic speakers everywhere use local dialects, and dialect may also appear in informal writing. The pronunciation rules, vocabulary, and even grammar of a dialect can differ considerably from Standard Arabic.
The task of diacritic prediction, sometimes also called diacritic restoration or vowel restoration, generates fully diacritized text from existing text that has no diacritics or only partial diacritics. For Arabic, diacritic prediction generally means predicting the diacritics of undiacritized Arabic text to obtain diacritized Arabic text. The prior art generally adopts text-only diacritic prediction, whose scope of application is limited and which can rely only on single-modal information. In real business, a large proportion of scenarios involve dialectal, spoken language; schemes relying on standard lexical-analysis rules or on Standard Arabic data struggle to adapt to such scenarios, so obtaining diacritized text for them is very difficult.
Disclosure of Invention
The embodiments of the application aim to provide a method and a device for constructing a diacritic prediction model for Arabic, so as to overcome the difficulty in the prior art of obtaining diacritized text in the many dialectal and spoken-language scenarios.
In order to solve the technical problems, the application is realized as follows:
In a first aspect, a method for constructing a diacritic prediction model for Arabic is provided, comprising the following steps:
performing single-modal training on an acoustic encoder and a text encoder separately;
training a first diacritic prediction model based on first training data, and updating the parameters of the acoustic encoder, the text encoder, and a multimodal joint network in the first diacritic prediction model, wherein the first training data comprise Arabic speech data, diacritized text corresponding to the speech data, and undiacritized text corresponding to the speech data;
and fine-tuning the first diacritic prediction model based on the first training data and second training data to obtain a second diacritic prediction model, wherein the second training data comprise Arabic dialect speech data and corresponding undiacritized text.
In a second aspect, a device for constructing a diacritic prediction model for Arabic is provided, comprising:
a first training module, configured to perform single-modal training on an acoustic encoder and a text encoder separately;
a second training module, configured to train a first diacritic prediction model based on first training data, and to update the parameters of the acoustic encoder, the text encoder, and a multimodal joint network in the first diacritic prediction model, wherein the first training data comprise Arabic speech data, diacritized text corresponding to the speech data, and undiacritized text corresponding to the speech data;
and a fine-tuning module, configured to fine-tune the first diacritic prediction model based on the first training data and second training data to obtain a second diacritic prediction model, wherein the second training data comprise Arabic dialect speech data and corresponding undiacritized text.
In the embodiments of the application, multimodal Arabic data participate in one or more stages of training the diacritic prediction model, so the model can learn richer language variants and stylistic scenarios, has greater potential room for improvement, and can obtain the corresponding diacritized text from Arabic speech data and corresponding undiacritized text in the many dialectal and spoken-language scenarios.
Drawings
FIG. 1 is a flowchart of a method for constructing a diacritic prediction model for Arabic according to an embodiment of the present application;
FIG. 2 is a schematic diagram of the diacritic prediction model structure provided in an embodiment of the present application;
FIG. 3 is a diagram of a specific implementation of the diacritic prediction model structure provided in an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a device for constructing a diacritic prediction model for Arabic according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
Existing diacritic prediction methods are generally applied in two scenarios:
scenario one: and analyzing the text by using a label prediction tool only for the Arabic text without the label, so as to obtain the label text. This may occur during the pre-processing stages of text-to-speech tasks (e.g., speech synthesis, timbre cloning, etc.).
Scenario two: only Arabic audio is available, and diacritized text is desired. Speech recognition is first used to obtain a transcript; but if the recognition model was trained on undiacritized text, the transcript is likewise undiacritized, and it must be analyzed by a diacritic prediction tool in a post-processing stage to obtain diacritized text.
The techniques relied upon by text-based diacritic prediction tools fall into two categories:
(1) Morphological analysis: based on Arabic word-formation rules, syntax rules, and the like, the input text is morphologically analyzed, and the specific word sense, part of speech, and syntactic role of each undiacritized word form at its position in the text are determined from the context, so that the fully diacritized word form is predicted.
This approach suffers from two disadvantages:
A. The effectiveness of a morphological analyzer depends on whether the Arabic variant of the text to be analyzed matches the variant the analyzer supports. Typically, such tools are suitable only for Modern Standard Arabic (hereinafter "Standard Arabic"), because Standard Arabic has relatively complete and standardized diacritization rules. The diacritization rules of dialect variants are generally not sufficiently complete or standardized. Morphological analyzers that support dialect variants are very rare and cover very few dialects, while Arabic has a great many dialect variants.
B. Morphological analyzers have difficulty handling new vocabulary in a corpus flexibly, such as loanwords and foreign names. Because these words are usually spelled with Arabic letters according to their foreign pronunciation, their exact diacritized forms are also hard to derive from Arabic morphological rules.
(2) Feature engineering: a quantity of undiacritized text and the corresponding diacritized text are obtained, and high-dimensional feature representations at various levels are extracted from the text to train a diacritic prediction model. Whether the extracted features are at the character, subword, word, phrase, or sentence level, such methods can all be classified as feature-engineering-based. At inference time, the model processes the undiacritized text to obtain the required high-dimensional feature vectors, performs certain vector computations to convert them into the feature vectors corresponding to diacritized text, and finally outputs the diacritized text.
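As an illustration of the feature-engineering idea above, the following sketch extracts a character window around each letter position of an undiacritized word, the kind of character-level feature such a model might consume. The Latin transliteration, window radius, and padding symbol are assumptions for the example, not details from the application.

```python
# Hypothetical character-window feature extraction for a feature-engineering
# diacritic predictor. "ktb" stands in for an undiacritized Arabic word.
def char_window_features(text, i, radius=2):
    """Return the characters surrounding position i, padded with '_'."""
    padded = "_" * radius + text + "_" * radius
    center = i + radius
    return tuple(padded[center - radius:center + radius + 1])

# Features for the middle letter of "ktb": ('_', 'k', 't', 'b', '_').
feats = char_window_features("ktb", 1)
```

A classifier trained on such windows (plus word- or sentence-level features) would then predict the diacritic attached to the center character.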
This approach has the following disadvantages:
A. Developing such tools requires a large amount of diacritized text. Such text exists mainly in Modern Standard Arabic and Classical Arabic. This means the model achieves good predictions only on undiacritized text similar to its training text, while for dialectal Arabic text the predictions are poor.
B. Diacritized text data typically come from religious classics, books, or news, and differ from dialectal spoken language in content, vocabulary, and syntax. Such methods may therefore struggle to extract good context-dependent features when they encounter text whose subject, style, and wording differ markedly from the training data.
Furthermore, for the task in "scenario two" above, i.e., obtaining diacritized text from speech with no accompanying text, if the two-stage "speech recognition + diacritic prediction" pipeline described above uses a text-based diacritic prediction tool in the post-processing stage, the result is affected by the recognition quality in addition to the performance of the diacritic prediction tool itself. There is therefore also an approach that builds a speech recognition system capable of recognizing diacritized text directly from speech. Compared with conventional speech recognition followed by text-based prediction, this can improve performance. However, it requires speech paired with the corresponding diacritized text, which raises several problems:
A. If speech data are recorded from existing diacritized text, or only the easily available speech already paired with diacritized text is used, the amount and style of the data are limited, and dialectal, spoken-style training data are hard to obtain. The result is the problems described above: the method suits only Standard and Classical Arabic and only formal, read-speech scenarios, not the many dialectal, spoken, conversational scenarios.
B. If diacritized transcripts are annotated directly from existing untranscribed speech data, the main problem is that the annotation work is very difficult: very few annotators are qualified for the task, and even when suitable people are found, the labor cost of annotation and verification is very high.
C. If diacritized text is predicted from existing speech data and the corresponding undiacritized text, and then manually corrected and proofread, the text-based diacritic prediction task is simply moved into the data preprocessing stage. On the one hand, this returns to the performance problems of text-based diacritic prediction tools; on the other hand, even with correction and proofreading, the labor cost of this work remains considerable.
In addition, there is another scenario: various kinds of speech data and their corresponding undiacritized text are available, and the corresponding diacritized text is needed. With existing single-modal techniques, a text-based prediction tool is generally used to predict diacritics for the undiacritized text, and in dialectal and spoken scenarios the results are not ideal. Although the presence of the corresponding audio means the exact pronunciation of the undiacritized text is known, which should make obtaining diacritized text easier, existing techniques have difficulty fusing the information of the two modalities for diacritic prediction and cannot fully exploit the potential of the data.
Based on the above analysis of the technical background and the shortcomings of the prior art, the embodiments of the application provide a method for constructing a diacritic prediction model for Arabic, which builds a multimodal diacritic prediction model applicable to the various variants of Arabic and predicts the corresponding diacritized text from dialect speech data and the corresponding undiacritized text. Compared with existing single-modal techniques, combining the information of the two modalities can fully exploit the potential of multimodal data and overcomes the prior art's inability to handle dialect diacritic prediction well.
The method for constructing a diacritic prediction model for Arabic provided in the embodiments of the present application is described in detail below through specific embodiments and their application scenarios, with reference to the accompanying drawings.
As shown in FIG. 1, a flowchart of a method for constructing a diacritic prediction model for Arabic according to an embodiment of the present application, the method comprises the following steps:
Step 101: perform single-modal training on an acoustic encoder and a text encoder separately.
Specifically, the acoustic encoder may be trained by masking local spans of the audio signal of Arabic speech data, using the masked data as input and the unmasked data as the target;
the undiacritized Arabic text and the diacritized Arabic text in third training data are mapped to their corresponding symbol sequences, and the text encoder is trained with the symbol sequence of the undiacritized text as input and the symbol sequence of the diacritized text as the target;
the trained text encoder converts the symbol sequence of undiacritized text into the corresponding high-dimensional feature vector.
Step 102: train a first diacritic prediction model based on the first training data, and update the parameters of the acoustic encoder, the text encoder, and the multimodal joint network in the first diacritic prediction model.
The first training data comprise Arabic speech data, diacritized text corresponding to the speech data, and undiacritized text corresponding to the speech data.
Specifically, the Arabic speech data in the first training data may be used as input to the acoustic encoder to obtain a first feature vector output by the acoustic encoder; the undiacritized text corresponding to the speech data in the first training data is used as input to the text encoder to obtain a second feature vector output by the text encoder; the first and second feature vectors are fused into a joint feature, which is fed into the multimodal joint network to predict diacritized text; and the loss is computed against the diacritized text corresponding to the speech data in the first training data as the reference, to train the multimodal diacritic prediction model.
Step 103: fine-tune the first diacritic prediction model based on the first training data and the second training data to obtain a second diacritic prediction model.
The second training data comprise Arabic dialect speech data and corresponding undiacritized text.
Specifically, part of the speech data and undiacritized text may be sampled from the second training data and used as input to the first diacritic prediction model to generate corresponding pseudo-labels; the first diacritic prediction model is then fine-tuned on the pseudo-labels together with the diacritized text in the first training data.
In this embodiment, after the second diacritic prediction model is obtained by fine-tuning the first diacritic prediction model on the first and second training data, Arabic speech data and the corresponding undiacritized text may be used as input to the second diacritic prediction model to predict the corresponding diacritized text.
In the embodiments of the application, multimodal Arabic data (speech only; speech with corresponding undiacritized text; undiacritized text with corresponding diacritized text; speech with corresponding diacritized text) participate in one or more stages of training the diacritic prediction model, so the model can learn richer language variants and stylistic scenarios, has greater potential room for improvement, and can obtain diacritized text in the many dialectal and spoken-language scenarios.
In the embodiments of the present application, the structure of the multimodal Arabic diacritic prediction model can be divided into three main modules: an acoustic encoder, a text encoder, and a multimodal joint network, as shown in FIG. 2. The training method for the diacritic prediction model comprises a data preparation stage and a model training stage.
The data preparation stage comprises preparing the following data:
(1) Training data D0: Arabic speech data. These may be Arabic speech data of various styles and variants. Since no corresponding text is required, sources can be chosen relatively broadly.
(2) Training data D1: Arabic speech data, the corresponding diacritized text, and the corresponding undiacritized text. The data used here may be easily obtainable data and need not include dialect data. The text should be cleaned and normalized, without punctuation marks, special characters, and the like.
(3) Training data D2: Arabic dialect speech data and the corresponding undiacritized text. The text should be cleaned and normalized, without punctuation marks, special characters, and the like. This portion of the data requires no corresponding diacritized text and should cover dialect variants and styles as rich as possible.
(4) Training data D3: undiacritized text and the corresponding diacritized text. This portion is plain-text data that is easy to obtain; it need not contain dialectal or spoken-style text.
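The four corpora above can be pictured as records of the following shapes; the field names and the Latin stand-in strings are assumptions for the sketch, not mandated by the application.

```python
# Illustrative layout of the four training corpora D0-D3.
D0 = [{"audio": "clip_0001.wav"}]                             # speech only
D1 = [{"audio": "read_0001.wav",
       "undiacritized": "ktb",      # Latin stand-ins for Arabic strings
       "diacritized": "kataba"}]
D2 = [{"audio": "dialect_0001.wav", "undiacritized": "ktb"}]  # dialect, no diacritized text
D3 = [{"undiacritized": "ktb", "diacritized": "kataba"}]      # text pairs only

# D2 deliberately lacks a "diacritized" field: the M2 phase below fills it
# with pseudo-labels.
has_pseudo_target = "diacritized" in D2[0]
```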
(5) Dictionary L0: the vocabulary of the available diacritized text is mapped to grapheme sequences of smaller granularity, e.g., symbol sequences at the [letter + diacritic] level, and a null symbol eps is added, forming a dictionary.
(6) Dictionary L1: the vocabulary of the available undiacritized text is mapped to grapheme sequences of smaller granularity, and a null symbol eps is added, forming a dictionary.
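The two dictionaries can be sketched as follows. The Latin transliterations ("kataba" / "ktb"), the two-character [letter + diacritic] split, and the spelling of the null symbol are illustrative assumptions, since the application's own examples are not reproduced here.

```python
# Hypothetical grapheme dictionaries for diacritized (L0) and
# undiacritized (L1) vocabulary, with a null symbol added.
EPS = "<eps>"

def build_dict(words, split):
    """Map each word to a grapheme sequence and add the null symbol."""
    d = {w: split(w) for w in words}
    d[EPS] = [EPS]
    return d

# L0: diacritized vocabulary at [letter + diacritic] granularity
# (each consonant is followed by its vowel in the transliteration).
L0 = build_dict(["kataba"], lambda w: [w[i:i + 2] for i in range(0, len(w), 2)])
# L1: undiacritized vocabulary at letter granularity.
L1 = build_dict(["ktb"], list)
```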
The model training stage comprises three phases (M0, M1, M2); the purpose and steps of each phase are described in detail below:
(1) In the M0 phase, the acoustic encoder and the text encoder each undergo single-modal training. Single-modal Arabic data (audio only, or text only) are far more plentiful than multimodal data; performing separate single-modal training allows the acoustic encoder and the text encoder to have better initial parameters in the subsequent multimodal training phase.
Acoustic encoder training: the Arabic speech data of D0 are masked, i.e., local spans of the audio signal are masked out; the masked data serve as input and the unmasked data as output, and the training objective is to restore the input to the original data. The acoustic encoder so trained can convert raw speech data into corresponding high-dimensional feature vectors.
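The masking step can be sketched as follows. The span positions and widths are illustrative, and a reconstruction loss over the masked positions is the assumed training objective; the application does not fix these details.

```python
import numpy as np

# Minimal sketch of span masking for acoustic-encoder pretraining:
# contiguous frame spans are zeroed out, and the training target is to
# reconstruct the original frames at the masked positions.
def mask_spans(frames, starts, width):
    """Zero out `width` frames beginning at each start index."""
    masked = frames.copy()
    is_masked = np.zeros(len(frames), dtype=bool)
    for s in starts:
        masked[s:s + width] = 0.0
        is_masked[s:s + width] = True
    return masked, is_masked

frames = np.arange(10, dtype=np.float64).reshape(10, 1)   # 10 dummy frames
masked, is_masked = mask_spans(frames, starts=[2, 7], width=2)
# A reconstruction loss would compare model output to `frames`
# only where `is_masked` is True.
```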
Text encoder training: using the Arabic text pairs of D3 and the dictionaries L0 and L1, the symbol sequence of the undiacritized text serves as input and the symbol sequence of the diacritized text as output; the training objective is to predict the corresponding diacritized symbol sequence from the input undiacritized symbol sequence. The text encoder so trained can convert the symbol sequence of undiacritized text into a corresponding high-dimensional feature vector. This step does not require an overly computation-heavy network structure; a light recurrent structure such as two LSTM layers plus a one-dimensional convolutional layer can be chosen.
(2) In the M1 phase, training data D1 are used: the speech data serve as input to the acoustic encoder obtained in the previous phase, yielding feature vector V1; the corresponding undiacritized text serves as input to the text encoder, yielding feature vector V2. V1 and V2 are then fused into a joint feature V3. Many feature-fusion methods exist: for example, two three-dimensional features may be combined into a four-dimensional feature, and the construction can be designed to reduce computation (the specific method is not limited by the present invention).
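One possible fusion, consistent with the description but not mandated by it, is to align the two feature sequences in time and concatenate them along the feature axis:

```python
import numpy as np

# Simple fusion sketch: truncate the acoustic features V1 and text features
# V2 to a common time axis, then concatenate per frame. Shapes are
# illustrative assumptions.
def fuse(v1, v2):
    """Fuse v1 of shape (T1, D1) and v2 of shape (T2, D2) into (min(T1, T2), D1 + D2)."""
    t = min(v1.shape[0], v2.shape[0])
    return np.concatenate([v1[:t], v2[:t]], axis=-1)

v1 = np.ones((50, 256))   # acoustic features V1
v2 = np.ones((40, 128))   # text features V2
v3 = fuse(v1, v2)         # joint feature V3
```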
The multimodal joint network then predicts diacritized text from the input V3; the loss is computed against the diacritized text of D1 as the reference, and the first-stage multimodal diacritic prediction model is trained. The model produced in this phase is abbreviated M1. During training, the parameters of the acoustic encoder, the text encoder, and the multimodal joint network are all updated.
The text encoder may use a structure such as two LSTM layers plus a one-dimensional convolutional layer, and the acoustic encoder may use a structure such as a Conformer encoder. The multimodal joint network, i.e., the prediction module, may use a hybrid CTC/attention architecture, because in this task the CTC loss function can constrain the monotonic alignment between input and output, while the attention architecture can focus on particular context, avoiding the potential shortcomings of the CTC algorithm's independence assumption. The neural network structures and loss functions usable by the three modules are not fixed; FIG. 3 shows a network built from the examples mentioned above. The model obtained in this phase can predict diacritized text from the joint information of audio and undiacritized text, but because the D1 training data are generally Standard Arabic read speech, it does not generalize well to cross-domain scenarios.
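To make the hybrid CTC/attention objective concrete, the following sketch computes the CTC probability of a label sequence with the standard forward algorithm and interpolates its negative log with a stand-in attention loss. The interpolation weight and the tiny vocabulary are assumptions for the example, not values from the application.

```python
import numpy as np

def ctc_forward(probs, target, blank=0):
    """p(target | probs) via the CTC forward algorithm.
    probs: (T, V) per-frame symbol probabilities; target: list of label ids."""
    ext = [blank]
    for y in target:
        ext += [y, blank]                      # interleave blanks
    T, S = probs.shape[0], len(ext)
    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, blank]
    if S > 1:
        alpha[0, 1] = probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]
            if s > 0:
                a += alpha[t - 1, s - 1]
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]       # skip transition
            alpha[t, s] = a * probs[t, ext[s]]
    return alpha[T - 1, S - 1] + (alpha[T - 1, S - 2] if S > 1 else 0.0)

probs = np.full((2, 2), 0.5)                   # 2 frames, vocab {blank, 'a'}
p = ctc_forward(probs, [1])                    # paths aa, -a, a-
loss_ctc = -np.log(p)
loss_att = 0.3                                 # stand-in attention loss
lam = 0.5                                      # interpolation weight (assumed)
loss = lam * loss_ctc + (1 - lam) * loss_att   # hybrid CTC/attention objective
```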
(3) The M2 phase comprises the following steps:
(1) Pseudo-label initialization: part of the D2 data is sampled and predicted with M1, i.e., pseudo-labels are generated from that portion's speech data and undiacritized text. The sampled D2 data thereby also obtain corresponding diacritized text.
(2) The D1 data (with genuine diacritized text) are added to the pseudo-labeled D2 data obtained in step (1) to fine-tune the M1 model, yielding M1+. M1 is then replaced with M1+.
(3) After a certain number of training rounds, part of the D2 data is sampled again, and steps (1) (pseudo-label generation) and (2) (fine-tuning) are repeated.
In this way, model M2 (the final M1+) is obtained by optimizing model M1. Because D2 participates in the fine-tuning, M2 generalizes better than M1, while its performance on D1-type data is not degraded.
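The control flow of the M2 phase can be sketched as follows. `sample`, `predict`, and `finetune` are stand-in functions, and only the loop structure follows the description above.

```python
# Skeleton of the M2 self-training loop: repeatedly pseudo-label part of D2
# with the current model and fine-tune on D1 plus the pseudo-labeled data.
def m2_stage(m1, d1, d2, sample, predict, finetune, rounds=3):
    model = m1
    for _ in range(rounds):
        batch = sample(d2)                                   # (1) draw part of D2
        pseudo = [predict(model, x) for x in batch]          # pseudo-diacritized text
        model = finetune(model, d1 + list(zip(batch, pseudo)))  # (2) M1 -> M1+
    return model                                             # final M1+ is M2
```

With trivial stubs (e.g., a model represented by a counter that `finetune` increments), the function simply runs the sample/pseudo-label/fine-tune cycle the requested number of rounds.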
In addition, when speech data are input to the acoustic encoder at any stage, the input may take the form of fbank features, MFCC features, or the like. Moreover, since Arabic data are generally at low-resource-language magnitudes, a data augmentation step can be added when processing the speech input. Data augmentation is a common operation in speech tasks and is not detailed in the flow above. It can include spectral augmentation, speed perturbation, volume perturbation, and the like, which enrich the data well and improve the generalization performance of the model.
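Two of the augmentations listed above, volume perturbation and speed perturbation, can be sketched at the waveform level as follows; the gain, speed factor, and clip length are illustrative, not values from the application.

```python
import numpy as np

# Simple waveform-level augmentations in the spirit of the ones listed above.
def volume_perturb(wav, gain):
    """Scale amplitude by a constant gain."""
    return wav * gain

def speed_perturb(wav, factor):
    """Resample by linear interpolation; factor > 1 shortens the clip."""
    n_out = int(round(len(wav) / factor))
    src = np.linspace(0, len(wav) - 1, n_out)
    return np.interp(src, np.arange(len(wav)), wav)

wav = np.sin(np.linspace(0, 2 * np.pi, 160))   # dummy 10 ms clip at 16 kHz
louder = volume_perturb(wav, 1.5)
faster = speed_perturb(wav, 1.1)
```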
The Arabic diacritic prediction scheme provided in the embodiments of the application combines the information of the two modalities, speech and undiacritized text, so that the corresponding diacritized text can be generated from existing Arabic speech and the corresponding undiacritized text. The training method of the multimodal diacritic prediction model makes better use of data from diverse sources. Multimodal Arabic data (audio only; audio with undiacritized text; undiacritized text with diacritized text; audio with diacritized text) can participate in one or more stages of model training, and the model can learn more language variants and stylistic scenarios, so that compared with a single-modal prediction scheme it has richer potential room for improvement.
In addition, the multimodal method was compared with single-modal methods on a test set of conversational-style dialect data, with performance evaluated by diacritic error rate (the lower the error rate, the better). Experiments found that the diacritic error rate of the multimodal method was about 20% lower than that of a text-based prediction tool and about 15% lower than that of a speech-based prediction tool.
As shown in FIG. 4, a schematic structural diagram of a device for constructing a diacritic prediction model for Arabic according to an embodiment of the present application, the device comprises:
a first training module 410, configured to perform single-modal training on the acoustic encoder and the text encoder separately.
Specifically, the first training module 410 is configured to mask local audio signals in the Arabic speech data, take the masked data as input and the unmasked data as the reconstruction target, and train the acoustic encoder; and to map the undiacritized Arabic text and the diacritized Arabic text in the third training data to corresponding symbol sequences, take the symbol sequence corresponding to the undiacritized text as input and the symbol sequence corresponding to the diacritized text as output, and train the text encoder.
The trained text encoder converts the symbol sequence corresponding to an undiacritized text into a corresponding high-dimensional feature vector.
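The mapping of text to symbol sequences can be sketched as follows. The Unicode range used for the Arabic diacritic marks (U+064B through U+0652, fathatan through sukun) is standard; the character-level vocabulary scheme is an assumption, since the embodiment does not fix a tokenizer:

```python
# Arabic diacritic marks (tashkil): fathatan .. sukun, U+064B .. U+0652
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)}

def strip_diacritics(text: str) -> str:
    """Remove diacritic marks, yielding the undiacritized variant of a text.
    This is how a (diacritized, undiacritized) training pair can be derived
    from diacritized text alone."""
    return "".join(ch for ch in text if ch not in DIACRITICS)

def to_symbol_sequence(text: str, vocab: dict) -> list:
    """Map each character to an integer symbol ID, growing the vocabulary
    on first sight (a stand-in for a fixed, pre-built symbol inventory)."""
    seq = []
    for ch in text:
        if ch not in vocab:
            vocab[ch] = len(vocab)
        seq.append(vocab[ch])
    return seq
```

For example, stripping the fatha marks from the diacritized word كَتَبَ yields the undiacritized كتب, whose symbol sequence is then fed to the text encoder.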
A second training module 420, configured to train a first diacritic prediction model based on first training data, and to update the parameters of the acoustic encoder, the text encoder, and the multi-modal joint network in the first diacritic prediction model.
The first training data comprises Arabic speech data, the diacritized text corresponding to the speech data, and the undiacritized text corresponding to the speech data.
Specifically, the second training module 420 is configured to: take the Arabic speech data in the first training data as the input of the acoustic encoder to obtain a first feature vector output by the acoustic encoder; take the undiacritized text corresponding to the speech data as the input of the text encoder to obtain a second feature vector output by the text encoder; fuse the first feature vector and the second feature vector into a combined feature and feed the combined feature to the multi-modal joint network to predict the diacritized text; and compute the loss against the diacritized text corresponding to the speech data, thereby training the multi-modal diacritic prediction model.
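The fusion and joint-network step can be sketched as follows, assuming simple per-position concatenation of the two feature vectors and a single linear projection with a softmax over diacritic classes; these are assumptions, since the embodiment does not specify the fusion operator or the joint network's internal architecture:

```python
import numpy as np

def fuse_and_predict(acoustic_feats, text_feats, w):
    """Concatenate the acoustic and text feature vectors per position,
    project with a joint-network weight matrix w of shape (Da + Dt, C),
    and return per-position diacritic class probabilities."""
    joint = np.concatenate([acoustic_feats, text_feats], axis=-1)  # (T, Da+Dt)
    logits = joint @ w                                             # (T, C)
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))        # stable softmax
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, targets):
    """Loss against the class labels derived from the reference
    diacritized text."""
    return float(-np.mean(np.log(probs[np.arange(len(targets)), targets] + 1e-12)))
```

In training, the gradient of this loss would flow back through the joint network into both encoders, which is consistent with the parameter update described for module 420.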
A fine-tuning module 430, configured to fine-tune the first diacritic prediction model based on the first training data and second training data to obtain a second diacritic prediction model.
The second training data comprises Arabic dialect speech data and the corresponding undiacritized text.
Specifically, the fine-tuning module 430 is configured to extract a portion of the speech data and undiacritized text from the second training data as input to the first diacritic prediction model to generate corresponding pseudo labels, and to fine-tune the first diacritic prediction model using both the pseudo labels and the diacritized text in the first training data.
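The pseudo-label fine-tuning procedure can be sketched as a generic loop; `model_predict` and `model_update` are hypothetical stand-ins for the first diacritic prediction model's inference step and parameter-update step, which the embodiment does not expose by name:

```python
import random

def fine_tune_with_pseudo_labels(model_predict, model_update,
                                 dialect_data, gold_data,
                                 sample_size, epochs=1, seed=0):
    """Generate pseudo diacritized targets for a sample of dialect
    (speech, undiacritized-text) pairs, then fine-tune on the mix of
    pseudo-labelled and gold diacritized triples."""
    rng = random.Random(seed)
    sample = rng.sample(dialect_data, min(sample_size, len(dialect_data)))
    # Pseudo labels: the model's own predictions on unlabelled dialect data.
    pseudo = [(speech, text, model_predict(speech, text))
              for speech, text in sample]
    mixed = pseudo + list(gold_data)  # gold: (speech, text, diacritized) triples
    for _ in range(epochs):
        rng.shuffle(mixed)
        for speech, text, target in mixed:
            model_update(speech, text, target)
    return len(pseudo)
```

Mixing the gold diacritized first-training-data pairs back in, as the embodiment describes, is the usual safeguard against the model drifting onto its own noisy pseudo labels.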
In this embodiment, the apparatus further includes:
A prediction module, configured to take Arabic speech data and the corresponding undiacritized text as the input of the second diacritic prediction model, and to predict the corresponding diacritized text.
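One way to materialize the prediction module's output, assuming a per-character diacritic classification scheme (an assumption; the embodiment only states that diacritized text is predicted), is to interleave the predicted marks with the undiacritized characters:

```python
def apply_predicted_diacritics(base_text: str, marks: list) -> str:
    """Interleave each base character with its predicted diacritic mark
    (the empty string means no mark), reconstructing the diacritized text."""
    assert len(base_text) == len(marks), "one (possibly empty) mark per character"
    return "".join(ch + mark for ch, mark in zip(base_text, marks))
```

For example, attaching a predicted fatha (U+064E) to each character of the undiacritized كتب reconstructs the diacritized كَتَبَ.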
According to the method and device of the present application, multi-modal Arabic data (speech only; speech with corresponding undiacritized text; undiacritized text with corresponding diacritized text; speech with corresponding diacritized text) are used to train the diacritic prediction model in one or more stages, so that the model can learn richer language variants and stylistic scenarios and thus has a larger potential for improvement; the corresponding diacritized text can then be obtained from Arabic speech and its undiacritized transcript across a wide range of language variants and spoken scenarios.
The embodiments of the present application further provide a computer-readable storage medium storing a computer program which, when executed by a processor, implements each process of the above embodiments of the method for constructing a diacritic prediction model of the Arabic language, with the same technical effects; to avoid repetition, details are not repeated here. The computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
It should be noted that, in this document, the terms "comprises," "comprising," and any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by software plus a necessary general-purpose hardware platform, or by hardware alone, though in many cases the former is the preferred implementation. Based on this understanding, the technical solution of the present application, or the part of it contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disk) and including several instructions for causing a terminal (which may be a mobile phone, a computer, a server, an air conditioner, a network device, etc.) to perform the methods described in the embodiments of the present application.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive; in light of this application, those of ordinary skill in the art may devise many further forms without departing from the spirit of the present application and the scope of the claims, all of which fall within the protection of the present application.

Claims (4)

1. A method for constructing a diacritic prediction model of the Arabic language, characterized by comprising the following steps:
performing single-modal training on an acoustic encoder and a text encoder, respectively;
training a first diacritic prediction model based on first training data, and updating parameters of the acoustic encoder, the text encoder, and a multi-modal joint network in the first diacritic prediction model, wherein the first training data comprises Arabic speech data, diacritized text corresponding to the speech data, and undiacritized text corresponding to the speech data;
fine-tuning the first diacritic prediction model based on the first training data and second training data to obtain a second diacritic prediction model, wherein the second training data comprises Arabic dialect speech data and corresponding undiacritized text;
wherein performing single-modal training on the acoustic encoder and the text encoder respectively comprises:
masking local audio signals in the Arabic speech data, taking the masked data as input and the unmasked data as output, and training the acoustic encoder;
mapping undiacritized Arabic text and diacritized Arabic text in third training data to corresponding symbol sequences, taking the symbol sequence corresponding to the undiacritized text as input and the symbol sequence corresponding to the diacritized text as output, and training the text encoder;
wherein the trained text encoder converts the symbol sequence corresponding to an undiacritized text into a corresponding high-dimensional feature vector;
wherein training the first diacritic prediction model based on the first training data comprises:
taking the Arabic speech data in the first training data as input of the acoustic encoder to obtain a first feature vector output by the acoustic encoder; taking the undiacritized text corresponding to the speech data in the first training data as input of the text encoder to obtain a second feature vector output by the text encoder;
fusing the first feature vector and the second feature vector to obtain a combined feature, and predicting diacritized text by taking the combined feature as input of the multi-modal joint network;
calculating a loss with the diacritized text corresponding to the speech data in the first training data as reference, and training the multi-modal diacritic prediction model;
wherein fine-tuning the first diacritic prediction model based on the first training data and the second training data comprises:
extracting part of the speech data and undiacritized text from the second training data as input of the first diacritic prediction model to generate corresponding pseudo labels; and
fine-tuning the first diacritic prediction model according to the pseudo labels and the diacritized text in the first training data.
2. The method of claim 1, characterized in that, after fine-tuning the first diacritic prediction model based on the first training data and the second training data to obtain the second diacritic prediction model, the method further comprises:
taking Arabic speech data and the corresponding undiacritized text as input of the second diacritic prediction model, and predicting the corresponding diacritized text.
3. A device for constructing a diacritic prediction model of the Arabic language, characterized by comprising:
a first training module, configured to perform single-modal training on an acoustic encoder and a text encoder, respectively;
a second training module, configured to train a first diacritic prediction model based on first training data, and to update parameters of the acoustic encoder, the text encoder, and the multi-modal joint network in the first diacritic prediction model, wherein the first training data comprises Arabic speech data, diacritized text corresponding to the speech data, and undiacritized text corresponding to the speech data;
a fine-tuning module, configured to fine-tune the first diacritic prediction model based on the first training data and second training data to obtain a second diacritic prediction model, wherein the second training data comprises Arabic dialect speech data and corresponding undiacritized text;
wherein the first training module is specifically configured to mask local audio signals in the Arabic speech data, take the masked data as input and the unmasked data as output, and train the acoustic encoder; and to map undiacritized Arabic text and diacritized Arabic text in third training data to corresponding symbol sequences, take the symbol sequence corresponding to the undiacritized text as input and the symbol sequence corresponding to the diacritized text as output, and train the text encoder;
wherein the trained text encoder converts the symbol sequence corresponding to an undiacritized text into a corresponding high-dimensional feature vector;
wherein the second training module is specifically configured to take the Arabic speech data in the first training data as input of the acoustic encoder to obtain a first feature vector output by the acoustic encoder; take the undiacritized text corresponding to the speech data in the first training data as input of the text encoder to obtain a second feature vector output by the text encoder; fuse the first feature vector and the second feature vector to obtain a combined feature, and predict diacritized text by taking the combined feature as input of the multi-modal joint network; and calculate a loss with the diacritized text corresponding to the speech data in the first training data as reference, and train the multi-modal diacritic prediction model;
wherein the fine-tuning module is specifically configured to extract part of the speech data and undiacritized text from the second training data as input of the first diacritic prediction model to generate corresponding pseudo labels, and to fine-tune the first diacritic prediction model according to the pseudo labels and the diacritized text in the first training data.
4. The device according to claim 3, characterized by further comprising:
a prediction module, configured to take Arabic speech data and the corresponding undiacritized text as input of the second diacritic prediction model, and to predict the corresponding diacritized text.
CN202310505137.1A 2023-05-08 2023-05-08 Construction method and device of label prediction model of Arabic language Active CN116229994B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310505137.1A CN116229994B (en) 2023-05-08 2023-05-08 Construction method and device of label prediction model of Arabic language


Publications (2)

Publication Number Publication Date
CN116229994A (en) 2023-06-06
CN116229994B (en) 2023-07-21

Family

ID=86587613





Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: 411, 4th floor, building 4, No.44, Middle North Third Ring Road, Haidian District, Beijing 100088

Patentee after: Beijing Qingshu Intelligent Technology Co.,Ltd.

Address before: Building G, 4th Floor, Cultural and Educational Industrial Park, No. 44 North Third Ring Middle Road, Haidian District, Beijing, 100000

Patentee before: BEIJING AISHU WISDOM TECHNOLOGY CO.,LTD.