CN113257238B - Training method of pre-training model, coding feature acquisition method and related device - Google Patents

Training method of pre-training model, coding feature acquisition method and related device

Info

Publication number
CN113257238B
CN113257238B (application CN202110791198.XA)
Authority
CN
China
Prior art keywords
coding
audio
text
audio frame
mask
Prior art date
Legal status
Active
Application number
CN202110791198.XA
Other languages
Chinese (zh)
Other versions
CN113257238A (en)
Inventor
李航
康昱
丁文彪
刘子韬
Current Assignee
Beijing Century TAL Education Technology Co Ltd
Original Assignee
Beijing Century TAL Education Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Century TAL Education Technology Co Ltd
Priority to CN202110791198.XA
Publication of CN113257238A
Application granted
Publication of CN113257238B
Current legal status: Active

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/10 - Text processing
    • G06F40/12 - Use of codes for handling textual entities
    • G06F40/126 - Character encoding
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The embodiments of the present disclosure disclose a training method for a pre-training model, a coding feature acquisition method and a related device. The method includes: acquiring each audio frame feature of an audio to be recognized and the text features of its corresponding text; coding the text features through a text coding module to obtain text coding features; randomly selecting audio frame features for mask processing to obtain a mask audio frame feature sequence; coding each mask audio frame feature through an audio coding module in combination with the text coding features to obtain a mask audio frame coding feature sequence; and acquiring each training audio frame feature from each mask audio frame coding feature, obtaining an audio loss, and adjusting the parameters of the pre-training model until the audio loss meets an audio loss threshold, so as to obtain a trained text coding module and audio coding module. The method and device for training the pre-training model and acquiring coding features can improve the accuracy of model training while requiring only a small amount of labeled data.

Description

Training method of pre-training model, coding feature acquisition method and related device
Technical Field
The present disclosure relates to the field of computers, and in particular, to a training method for a pre-training model, a coding feature obtaining method, and a related apparatus.
Background
In fields that involve speech data, such as teaching analysis, intelligent customer service and automatic language translation, data processing models are trained with deep neural network techniques; such models have strong learning ability and are widely used for data processing.
However, existing training methods need to train a different model for each application, so the data must be labeled differently for each task; the labeling workload is therefore large and the training cost is high.
Therefore, how to improve the accuracy of model training with a smaller amount of labeled data has become a technical problem that urgently needs to be solved.
Disclosure of Invention
The embodiments of the present disclosure provide a training method for a pre-training model, a coding feature acquisition method and a related device, so as to improve the accuracy of model training with a smaller amount of labeled data.
According to an aspect of the present disclosure, there is provided a training method of a pre-training model, including:
acquiring the characteristics of each audio frame of the audio to be recognized and the text characteristics of the text corresponding to the audio to be recognized;
coding the text features through a text coding module of the pre-training model to obtain text coding features;
randomly selecting audio frame features with a first preset proportion from the audio frame features to perform mask processing to obtain a mask audio frame feature sequence;
coding each mask audio frame feature in the mask audio frame feature sequence by combining the text coding feature through an audio coding module of the pre-training model to obtain a mask audio frame coding feature sequence;
and acquiring each training audio frame feature according to each mask audio frame coding feature in the mask audio frame coding feature sequence, and adjusting the parameters of the pre-training model according to the audio loss obtained from each training audio frame feature and its corresponding audio frame feature, until the audio loss meets an audio loss threshold, so as to obtain the trained pre-training model.
According to another aspect of the present disclosure, there is provided an encoding characteristic obtaining method, including:
acquiring each audio frame characteristic to be coded of audio to be coded and a text characteristic to be coded of a text to be coded corresponding to the audio to be coded;
the text coding module obtained by training by using the training method of the pre-training model encodes the text features to be encoded to obtain encoded text encoding features;
and the audio coding module obtained by training by using the pre-training model training method encodes each audio frame feature to be encoded by combining the encoding text encoding feature to obtain an audio frame encoding feature sequence.
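Purely as an illustration, the following is a minimal sketch of this coding feature acquisition flow, assuming trained text and audio coding modules shaped like the sketches given later in this description; the function name and tensor shapes are assumptions, and note that no mask processing is applied when acquiring coding features.

```python
import torch

def acquire_coding_features(text_module, audio_module, frame_feats, text_feats):
    """frame_feats: (num_frames, model_dim) audio frame features to be coded;
    text_feats: (num_tokens, model_dim) text features to be coded.
    The trained modules are applied directly, without any masking."""
    with torch.no_grad():
        text_coding = text_module(text_feats.unsqueeze(0))       # coded text coding features
        frame_coding = audio_module(frame_feats.unsqueeze(0),
                                    text_coding)                 # audio frame coding feature sequence
    return frame_coding.squeeze(0), text_coding.squeeze(0)
```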
According to another aspect of the present disclosure, there is provided a training apparatus for pre-training a model, including:
the audio frame characteristic and text characteristic acquisition unit is used for acquiring each audio frame characteristic of the audio to be recognized and the text characteristic of the text corresponding to the audio to be recognized;
the text coding feature acquisition unit is used for coding the text features through a text coding module of the pre-training model to obtain text coding features;
the mask audio frame feature sequence acquisition unit is used for randomly selecting audio frame features with a first preset proportion from the audio frame features to carry out mask processing to obtain a mask audio frame feature sequence;
an audio coding feature obtaining unit, configured to encode, by using the audio coding module of the pre-training model and in combination with the text coding feature, each mask audio frame feature in the mask audio frame feature sequence to obtain a mask audio frame coding feature sequence;
and the parameter adjusting unit is used for acquiring each training audio frame characteristic according to each mask audio frame coding characteristic in the mask audio frame coding characteristic sequence, and adjusting the parameters of the pre-training model according to the audio loss obtained from each training audio frame characteristic and its corresponding audio frame characteristic, until the audio loss meets an audio loss threshold, so as to obtain the trained pre-training model.
According to another aspect of the present disclosure, there is provided an encoding characteristic obtaining apparatus including:
the device comprises a to-be-coded feature acquisition unit, a to-be-coded feature acquisition unit and a to-be-coded feature acquisition unit, wherein the to-be-coded feature acquisition unit is used for acquiring each to-be-coded audio frame feature of an audio to be coded and a to-be-coded text feature of a to-be-coded text corresponding to the audio to be coded;
the text coding unit is used for coding the text features to be coded to obtain coded text coding features;
and the audio coding unit is used for coding the characteristics of the audio frames to be coded by combining the coding text coding characteristics to obtain an audio frame coding characteristic sequence.
According to another aspect of the present disclosure, a computer-readable storage medium is provided, having stored thereon computer instructions which, when executed, perform the training method of the pre-trained model as described above.
According to another aspect of the present disclosure, there is provided a terminal comprising a memory and a processor, the memory having stored thereon computer instructions capable of running on the processor, wherein the processor executes the computer instructions to perform the aforementioned training method of the pre-training model.
Compared with the prior art, the technical solution of the present disclosure has the following advantages:
the training method of the pre-training model provided by the embodiment of the disclosure obtains the audio frame characteristics of the audio to be recognized and the text characteristics of the text corresponding to the audio to be recognized respectively, then randomly selects the audio frame characteristics of the preset proportion in the audio frame characteristics to perform mask processing to obtain a mask audio frame characteristic sequence, encodes the text characteristics through a text encoding module of the pre-training model to obtain text encoding characteristics, encodes each mask audio frame characteristic in the mask audio frame characteristic sequence through an audio encoding module of the pre-training model in combination with the text encoding characteristics to obtain a mask audio frame encoding characteristic sequence, obtains each training audio frame characteristic through each mask audio frame encoding characteristic in the mask audio frame encoding characteristic sequence, and obtains audio loss according to each training audio frame characteristic and audio frame characteristic corresponding to each other, and adjusting parameters of the pre-training model until the audio loss meets the audio loss threshold, so as to obtain the trained pre-training model. Therefore, according to the training method of the pre-training model provided by the embodiment of the disclosure, when the pre-training model to be trained is trained, the text features are encoded through the text encoding module of the pre-training model to obtain the text encoding features, and the audio encoding module of the pre-training model is combined with the text encoding features to encode each mask audio frame feature in the mask audio frame feature sequence, so that the audio frame features and the text features can be fully fused during encoding, so that the model can be more accurately extracted into the audio frame encoding features and the text features through training, and the accuracy of model training is improved; in addition, when the pre-training model is trained, the audio frame features with the preset proportion in the audio frame features are randomly selected to carry out mask processing, and then the training is realized in a reduction mode without indexing training data, so that the training cost of the pre-training model can be reduced; on the other hand, because the audio coding module and the text coding module obtained by the training method of the pre-training model provided by the embodiment of the disclosure have higher accuracy, when the audio coding module is used for audio coding and the text coding module is used for text coding, accurate audio coding characteristics and text coding characteristics can be obtained, so that the training difficulty of each model (such as a speaker identity authentication model and a speaker emotion recognition model) which needs to be further processed based on the audio coding characteristics and the text coding characteristics is reduced, better training effect can be achieved by using less labeled data, and the model training cost for further processing can be reduced; meanwhile, the audio coding module and the text coding module can be applied to models of different application scenes, so that the audio coding module and the text coding module have better mobility and expandability.
In an alternative of the training method provided by the embodiments of the present disclosure, the audio coding module of the pre-training model includes at least two coding layers. When the audio coding module codes each mask audio frame feature in the mask audio frame feature sequence in combination with the text coding features, the first coding layer of the audio coding module first codes each mask audio frame feature in combination with the text coding features to obtain a first mask audio frame coding feature sequence; the second coding layer then codes each first mask audio frame coding feature in that sequence, again in combination with the text coding features, to obtain a second mask audio frame coding feature sequence; and the mask audio frame coding feature sequence is acquired from the second mask audio frame coding feature sequence. In this way, when the pre-training model is trained, every coding layer of the audio coding module codes in combination with the text coding features, and the coding output of the previous coding layer together with the text coding features serves as the coding input of the next coding layer, so the pre-training model extracts audio frame coding features and text coding features more accurately during training, further improving the accuracy of model training.
Drawings
FIG. 1 is a schematic flow chart diagram illustrating a training method for a pre-training model according to an embodiment of the present disclosure;
FIG. 2 is a schematic flow chart of obtaining features of each audio frame and text features of a training method for a pre-training model according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of obtaining an audio to be recognized in a training method of a pre-training model provided in an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of obtaining audio frame features of a training method of a pre-training model according to an embodiment of the present disclosure;
FIG. 5 is a diagram illustrating an alternative structure of a text encoding module of a training method for a pre-training model according to an embodiment of the present disclosure;
FIG. 6 is a schematic flowchart of obtaining mask audio frame coding features of a training method of a pre-training model according to an embodiment of the present disclosure;
FIG. 7 is a schematic flowchart of another method for training a pre-training model to obtain a mask audio frame coding feature according to an embodiment of the present disclosure;
FIG. 8 is a diagram illustrating an alternative structure of an audio coding module of a training method for a pre-training model according to an embodiment of the present disclosure;
FIG. 9 is a schematic flow chart diagram illustrating a training method for a pre-training model according to an embodiment of the present disclosure;
fig. 10 is a flowchart illustrating a method for acquiring coding characteristics according to an embodiment of the present disclosure;
FIG. 11 is a schematic structural diagram of a training apparatus for pre-training a model according to an embodiment of the present disclosure;
fig. 12 is a schematic structural diagram of an encoding characteristic obtaining apparatus according to an embodiment of the present disclosure;
fig. 13 is a schematic structural diagram of an alternative hardware device architecture according to an embodiment of the present disclosure.
Detailed Description
In the prior art, it is difficult to improve the accuracy of model training with a smaller amount of labeled data.
In order to improve the accuracy of model training with a smaller amount of labeled data, the present disclosure provides a training method of a pre-training model, which comprises the following steps:
acquiring the characteristics of each audio frame of the audio to be recognized and the text characteristics of the text corresponding to the audio to be recognized;
coding the text features through a text coding module of the pre-training model to obtain text coding features;
randomly selecting audio frame features with a first preset proportion from the audio frame features to perform mask processing to obtain a mask audio frame feature sequence;
coding each mask audio frame feature in the mask audio frame feature sequence by combining the text coding feature through an audio coding module of the pre-training model to obtain a mask audio frame coding feature sequence;
and acquiring each training audio frame feature according to each mask audio frame coding feature in the mask audio frame coding feature sequence, and adjusting the parameters of the pre-training model according to the audio loss obtained from each training audio frame feature and its corresponding audio frame feature, until the audio loss meets an audio loss threshold, so as to obtain the trained pre-training model.
It can be seen that, in the training method of the pre-training model provided in the embodiments of the present disclosure, the audio frame features of the audio to be recognized and the text features of its corresponding text are acquired respectively; audio frame features at a preset proportion are randomly selected from the audio frame features and masked to obtain a mask audio frame feature sequence; the text features are coded by the text coding module of the pre-training model to obtain text coding features; each mask audio frame feature in the mask audio frame feature sequence is coded by the audio coding module of the pre-training model in combination with the text coding features to obtain a mask audio frame coding feature sequence; each training audio frame feature is obtained from each mask audio frame coding feature in that sequence; the audio loss is computed from each training audio frame feature and its corresponding audio frame feature; and the parameters of the pre-training model are adjusted until the audio loss meets the audio loss threshold, giving the trained pre-training model.
In this training method, when the pre-training model to be trained is trained, the text features are coded by the text coding module of the pre-training model to obtain text coding features, and the audio coding module of the pre-training model codes each mask audio frame feature in the mask audio frame feature sequence in combination with those text coding features, so the audio frame features and text features can be fully fused during coding; through training, the model therefore learns to extract audio frame coding features and text coding features more accurately, which improves the accuracy of model training. In addition, when the pre-training model is trained, a preset proportion of the audio frame features is randomly selected for mask processing and training is realized by restoring the masked features, so the training data do not need to be labeled, which reduces the training cost of the pre-training model. Furthermore, because the audio coding module and the text coding module obtained with the training method provided by the embodiments of the present disclosure have high accuracy, accurate audio coding features and text coding features can be obtained when the audio coding module is used for audio coding and the text coding module is used for text coding; this lowers the training difficulty of the models that need to further process these coding features (such as a speaker identity authentication model or a speaker emotion recognition model), so a good training effect can be achieved with less labeled data and the cost of training those models is reduced. Meanwhile, the audio coding module and the text coding module can be applied to models for different application scenarios, and therefore have good transferability and extensibility.
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description. It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It should be noted that references to "a", "an" and "the" in this disclosure are intended to be illustrative rather than limiting, and those skilled in the art will understand that they mean "one or more" unless the context clearly indicates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Specific embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
Referring to fig. 1, fig. 1 is a schematic flow chart illustrating a training method of a pre-training model according to an embodiment of the present disclosure.
As shown in the drawings, the training method of the pre-training model provided by the embodiment of the disclosure includes the following steps:
step S10, obtaining each audio frame characteristic of the audio to be recognized and the text characteristic of the text corresponding to the audio to be recognized.
In order to train the model to be trained, the audio frame features of the audio to be recognized and the text features of its corresponding text are extracted from the audio to be recognized and used as the training data of this training method.
It is easy to understand that, in order to train the pre-training model, more than one segment of audio to be recognized is used; the audio to be recognized described herein therefore refers to multiple segments of audio to be recognized.
Specifically, the audio to be recognized may be obtained from recorded audio, and after the audio to be recognized is obtained, the audio frame features and text features are further acquired.
In a specific embodiment, in order to improve the quality of the obtained audio frame features and text features, please refer to fig. 2, and fig. 2 is a schematic flow chart illustrating the process of obtaining the audio frame features and text features by the training method of the pre-training model according to an embodiment of the present disclosure.
As shown in the figure, the step of acquiring the features of each audio frame of the audio to be recognized in the training method of the pre-training model provided by the embodiment of the present disclosure may include:
and S100, acquiring the voice audio in the original audio to obtain the audio to be identified.
Specifically, the audio to be identified may be directly obtained from the recorded original audio.
In a specific embodiment, if the original audio can be directly used, the original audio is directly used as the audio to be recognized.
However, the original audio obtained from an actual scene usually contains a lot of random noise, silence and other unwanted audio, so the unwanted audio first needs to be removed from the original audio, or the voice audio required for training needs to be extracted from it. Acquiring the voice audio in the original audio to obtain the audio to be recognized may specifically include:
identifying and marking voice audio in the original audio;
and extracting the voice audio from the original audio according to the marks to obtain the audio to be recognized.
Specifically, reference may be made to fig. 3, where fig. 3 is a schematic diagram of obtaining an audio to be recognized in a training method of a pre-training model provided in an embodiment of the present disclosure.
As shown in fig. 3, an endpoint detection technique may be applied to recognize and mark the voice audio. When there is little environmental interference and the signal-to-noise ratio of the original audio is high, the original audio can be recognized and marked automatically using preset time-domain or frequency-domain parameters, or combined time-domain and frequency-domain parameters. If the environment is noisy and the signal-to-noise ratio of the original audio is low, then to ensure the accuracy of recognition and marking, endpoint detection needs to be performed with models such as a Hidden Markov Model (HMM), a multilayer perceptron (MLP) or a deep neural network (DNN). Endpoint detection yields accurate marks of the valid speech parts in the original audio.
After the marks of the voice audio are obtained, the original audio is segmented according to the marks to obtain a series of voice segments, and these voice segments can be used as the audio to be recognized for model training.
Obtaining the audio to be recognized from the original audio in this way effectively removes invalid audio such as noise and silence, so that the training data of the pre-training model contain only valid voice audio, and the influence of invalid audio on the training precision of the model is avoided.
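By way of illustration only, the following sketch mimics the endpoint-detection-and-segmentation flow described above with a naive short-time-energy detector (a stand-in for the time-domain or frequency-domain parameters, or the HMM/MLP/DNN detectors, mentioned in the disclosure); the function name, frame size and threshold are assumptions.

```python
import numpy as np

def detect_speech_segments(original_audio: np.ndarray,
                           sample_rate: int,
                           frame_ms: float = 25.0,
                           energy_threshold: float = 1e-4):
    """Mark voice audio in the original audio with a short-time-energy detector
    and return the extracted speech segments as the audio to be recognized."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(original_audio) // frame_len
    # Per-frame energy as the time-domain parameter used for marking.
    energies = np.array([
        np.mean(original_audio[i * frame_len:(i + 1) * frame_len] ** 2)
        for i in range(n_frames)
    ])
    is_speech = energies > energy_threshold

    # Merge consecutive speech frames into (start_sample, end_sample) marks.
    segments, start = [], None
    for i, flag in enumerate(is_speech):
        if flag and start is None:
            start = i * frame_len
        elif not flag and start is not None:
            segments.append((start, i * frame_len))
            start = None
    if start is not None:
        segments.append((start, n_frames * frame_len))

    # Extract the marked voice audio: each segment is one audio to be recognized.
    return [original_audio[s:e] for s, e in segments]
```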
Step S101: and sequentially acquiring each audio frame of the audio to be identified according to a preset frame length and a preset sliding step length, wherein the preset frame length is greater than the preset sliding step length.
After the audio to be recognized is obtained, and before the audio features are extracted or the corresponding text is obtained through speech recognition, the audio needs to be divided into frames because of its short-time stationarity: framing reduces the influence of the non-stationarity and time variation of the audio as a whole. To make the frames transition smoothly and remain continuous, framing generally uses overlapping segmentation, which ensures that two adjacent frames partly overlap; that is, the time difference between the starting positions of two adjacent frames, namely the preset sliding step length, is smaller than the preset frame length. For example, the preset frame length may be set to 50 milliseconds and the sliding step length to 12.5 milliseconds, and the audio to be recognized is framed accordingly to obtain a series of audio frames. This series of audio frames contains the complete information of the audio to be recognized, so the audio training data and the text training data of the pre-training model contain consistent, one-to-one corresponding information, which ensures the accuracy of model training.
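A minimal sketch of the overlapping framing described above, using the example 50 ms frame length and 12.5 ms sliding step; the function name and the use of NumPy are assumptions.

```python
import numpy as np

def split_into_frames(audio: np.ndarray, sample_rate: int,
                      frame_ms: float = 50.0, step_ms: float = 12.5):
    """Split the audio to be recognized into overlapping frames: the sliding
    step (12.5 ms) is smaller than the frame length (50 ms), so adjacent
    frames overlap and transition smoothly."""
    frame_len = int(sample_rate * frame_ms / 1000)
    step_len = int(sample_rate * step_ms / 1000)
    frames = [audio[start:start + frame_len]
              for start in range(0, len(audio) - frame_len + 1, step_len)]
    return np.stack(frames) if frames else np.empty((0, frame_len))
```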
Step S102: and extracting the characteristics of each audio frame to obtain the characteristics of the audio frames.
As shown in fig. 4, fig. 4 is a schematic diagram of obtaining audio frame features of a training method of a pre-training model according to an embodiment of the present disclosure.
As shown in the figure, after each audio frame corresponding to the audio to be recognized is obtained, the features of each audio frame are extracted and concatenated. The audio frame features obtained here may be mel-frequency cepstral coefficient (MFCC) features; MFCC features characterize the frequency bands to which human hearing is most sensitive and can therefore fully reflect the voice characteristics in the audio. The extracted feature dimension may be 80, and the first-order difference of the features may then be further extracted and concatenated with them, so that each audio frame feature corresponding to the audio to be recognized has a dimension of 160 after concatenation.
The audio frame features obtained in this way not only reduce the influence of invalid audio such as noise in the original audio, but also contain the complete information of the audio to be recognized, which avoids losing audio frame features through an improper frame acquisition method and improves the integrity of the audio frame features.
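Again purely as an illustration, a sketch of the per-frame feature extraction described above; the use of librosa is an assumption, while the 80-dimensional MFCC features concatenated with their first-order difference into 160-dimensional audio frame features follow the description.

```python
import numpy as np
import librosa

def extract_audio_frame_features(audio: np.ndarray, sample_rate: int):
    """Return a (num_frames, 160) matrix: 80-dim MFCC features concatenated
    with their first-order difference, one row per audio frame."""
    win_length = int(sample_rate * 0.050)    # 50 ms preset frame length
    hop_length = int(sample_rate * 0.0125)   # 12.5 ms preset sliding step
    mfcc = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=80,
                                n_fft=win_length, win_length=win_length,
                                hop_length=hop_length)      # (80, num_frames)
    delta = librosa.feature.delta(mfcc)                      # first-order difference
    features = np.concatenate([mfcc, delta], axis=0)         # (160, num_frames)
    return features.T                                         # (num_frames, 160)
```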
Step S103: acquiring the text corresponding to the audio to be recognized.
After the audio to be recognized is obtained, in order to obtain the text features, a text corresponding to the audio to be recognized needs to be further obtained.
In a specific implementation manner, the audio to be recognized, together with each audio frame feature obtained from it, may be recognized by a speech recognition model, so as to obtain the text corresponding to the audio to be recognized.
In other embodiments, the corresponding text may be obtained in other ways.
It is easy to understand that there is no strict order between the text acquisition step (i.e., step S103) and the audio frame feature acquisition steps (i.e., steps S101 and S102); the text or the audio frame features can be acquired as soon as the audio to be recognized is obtained.
Step S104: and acquiring text characteristics of the text.
And after the text is obtained, further processing the text to obtain text characteristics.
Therefore, through the processing, the audio frame characteristics of the audio to be recognized and the text characteristics of the text corresponding to the audio to be recognized can be obtained.
Step S11: and coding the text features through a text coding module of the pre-training model to obtain text coding features.
After the text features are obtained, in order to further train the pre-training model, the text features may be coded by the text coding module of the pre-training model.
In a specific implementation manner, please refer to fig. 5, where fig. 5 is a schematic diagram illustrating an optional structure of a text encoding module of a training method for a pre-training model according to an embodiment of the present disclosure.
As shown in the figure, the text coding module may include two text coding layers with the same structure. Each text coding layer may include four sub-layers, and one arrangement of the sub-layers, from input to output, is: a multi-head self-attention layer, a residual normalization layer, a feedforward neural network layer and a residual normalization layer.
When coding starts, the text features are input into the first text coding layer and coded sub-layer by sub-layer, from the first sub-layer of that layer, namely the multi-head self-attention layer, until the first text coding features are obtained from its last sub-layer, namely the residual normalization layer. The first text coding features are then input into the second text coding layer for coding, again sub-layer by sub-layer from its multi-head self-attention layer, until the second text coding features are obtained from its last residual normalization layer.
When the text coding module includes only two text coding layers, the second text coding features output by the second text coding layer are the text coding features output by the text coding module.
Of course, in other embodiments, the text encoding module may include more text encoding layers, and the specific situation of the sub-layer included in each text encoding layer may also be determined as needed.
It is easy to understand that, if the text coding module includes more text coding layers, the obtained second text coding feature is input to the next text coding layer, and the coding is continued according to the coding sequence of the first two text coding layers until the output of the last text coding layer is obtained, so as to obtain the text coding feature.
The arrangement of the plurality of text coding layers can enable the text coding module to more fully acquire information in text characteristics, and improve the accuracy of the acquired information, thereby improving the accuracy of model training.
Furthermore, the multi-head self-attention layer in each text coding layer allows the text coding module to fully capture the contextual relations between the features of the words in the text features, avoiding the feature loss that might otherwise occur during text coding; and the residual normalization layers and the feedforward neural network layer in the coding layer allow parameter adjustments to propagate effectively layer by layer during training, which ensures the accuracy of the training result.
In addition, when the pre-training model is trained, the structure of the text coding module, including the number of layers of the text coding layer, the composition of sub-layers in the text coding layer, or the stacking order of sub-layers, may be adjusted accordingly in order to achieve a preset training target, which is not limited in this disclosure.
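Purely for concreteness, the following is a minimal PyTorch sketch of the two-layer text coding module with the example sub-layer arrangement described above; the hidden size, number of attention heads, feed-forward width and class names are assumptions rather than values fixed by the disclosure.

```python
import torch
import torch.nn as nn

class TextCodingLayer(nn.Module):
    """One text coding layer: multi-head self-attention -> residual
    normalization -> feedforward neural network -> residual normalization."""
    def __init__(self, d_model=256, n_heads=4, d_ff=1024):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.self_attn(x, x, x)   # context of each word feature
        x = self.norm1(x + attn_out)            # residual + normalization
        x = self.norm2(x + self.ffn(x))         # residual + normalization
        return x

class TextCodingModule(nn.Module):
    """Two identically structured text coding layers stacked in sequence."""
    def __init__(self, d_model=256, n_layers=2):
        super().__init__()
        self.layers = nn.ModuleList([TextCodingLayer(d_model) for _ in range(n_layers)])

    def forward(self, text_features):           # (batch, num_tokens, d_model)
        for layer in self.layers:
            text_features = layer(text_features)
        return text_features                    # text coding features
```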
Step S12: and randomly selecting audio frame characteristics with a first preset proportion from the audio frame characteristics to perform mask processing to obtain a mask audio frame characteristic sequence.
In order to train the pre-training model while reducing the amount of labeling required for training, training can be implemented by masking audio frames and then restoring them.
In the training process of the pre-training model, the audio frame features to be masked are first randomly selected from the audio frame features. Specifically, the audio frame features to be masked can be selected according to a first preset proportion; the first preset proportion can be, for example, 15% or 20%, and other values can be used according to actual needs.
Specifically, the masking audio frame features or any audio frame features may be used to perform masking processing on the randomly selected audio frame features of the audio frame features at the first preset proportion, so as to obtain a masking audio frame feature sequence. Therefore, different audio frame characteristics are utilized to carry out mask processing, the effect of carrying out mask processing on the audio frames can be further improved, and the subsequent effect of training the audio coding module is further improved.
In one embodiment, when masking the randomly selected audio frame features, how to mask each one may also be determined according to probabilities, for example: each selected audio frame feature is replaced by the mask audio frame feature with a probability of 80%, replaced by another random audio frame feature with a probability of 10%, and kept unchanged with a probability of 10%. This further increases the randomness of the mask and improves the training effect of the audio coding module. Of course, the masking operation may also be performed on the randomly selected audio frame features in other ways.
In this way, during training, the same proportion of audio frame features is selected from the audio frame features each time for mask processing, and with different probabilities the selected audio frame features are replaced by the mask audio frame feature, replaced by other random audio frame features, or kept as the original audio frame features. The mask audio frame feature sequence obtained after mask processing is coded by the audio coding module, and after audio coding the original audio frame features corresponding to the mask audio frame features are identified. By continuously converging, over many rounds of audio coding, the errors between the mask audio frame features and the corresponding original audio frames, the feature recognition capability of the model can be trained.
It is easy to understand that, because some of the audio frame features are turned into mask audio frame features by the masking described above while the others remain the original audio frame features, the mask audio frame feature sequence obtained at this point refers to a sequence containing both the audio frame features that were selected and masked and the audio frame features that were not. For convenience of description, the audio frame features that were selected for masking are referred to herein as mask audio frame features regardless of whether they were actually replaced.
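As an illustration of the mask processing described above, a minimal sketch assuming the example 15% selection ratio and the 80%/10%/10% replacement probabilities; the learned mask feature vector, tensor shapes and function name are assumptions.

```python
import torch

def mask_audio_frame_features(frame_features: torch.Tensor,
                              mask_feature: torch.Tensor,
                              select_ratio: float = 0.15):
    """frame_features: (num_frames, feat_dim); mask_feature: (feat_dim,) assumed
    learned mask vector. Returns the mask audio frame feature sequence plus the
    boolean selection mask, which can be used later when computing the loss."""
    num_frames = frame_features.size(0)
    masked = frame_features.clone()
    selected = torch.rand(num_frames) < select_ratio          # ~15% of the frames
    for i in torch.nonzero(selected).flatten().tolist():
        r = torch.rand(1).item()
        if r < 0.8:                                           # 80%: use the mask feature
            masked[i] = mask_feature
        elif r < 0.9:                                         # 10%: another random frame
            masked[i] = frame_features[torch.randint(num_frames, (1,)).item()]
        # remaining 10%: keep the original audio frame feature unchanged
    return masked, selected
```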
Step S13: and coding each mask audio frame feature in the mask audio frame feature sequence by combining the text coding features through an audio coding module of the pre-training model to obtain a mask audio frame coding feature sequence.
After the mask audio frame feature sequence is obtained through the masking operation, it is coded by the audio coding module in combination with the text coding features.
Specifically, the number of audio coding layers included in the audio coding module may be determined according to needs, such as: one or more layers.
In order to more accurately obtain information of audio frame characteristics, in a specific embodiment, the audio coding module may include at least two audio coding layers, that is, at least a first coding layer and a second coding layer, a specific coding process is as shown in fig. 6, and fig. 6 is a schematic flowchart of a method for training a pre-training model to obtain mask audio frame coding characteristics according to an embodiment of the present disclosure.
As shown in fig. 6, the step of coding the mask audio frame feature by combining the text coding feature through the audio coding layer in the audio coding module in the training method of the pre-training model provided in the embodiment of the present disclosure may include:
step S130: and coding the mask audio frame characteristics by combining the text coding characteristics through a first coding layer of the audio coding module to obtain a first mask audio frame coding characteristic sequence.
In the coding process, in order to enrich the information obtained about the audio to be recognized, the text coding features can be combined at the same time, and each mask audio frame feature is coded by the first coding layer of the audio coding module.
It is readily understood that when encoding with the first encoding layer, the individual masked audio frame features of the sequence of masked audio frame features are actually processed in conjunction with the text encoding features.
It should be noted that obtaining the first mask audio frame coding feature sequence described herein covers two cases: coding every mask audio frame feature of the mask audio frame feature sequence with the first coding layer to obtain the whole first mask audio frame coding feature sequence and then inputting that sequence into the second coding layer; or processing one mask audio frame feature with the first coding layer and immediately processing the resulting first mask audio frame coding feature with the second coding layer, and repeating until all first mask audio frame coding features are obtained, thereby obtaining the first mask audio frame coding feature sequence.
Step S131: and coding each first mask audio frame coding feature by combining the text coding feature through a second coding layer of the audio coding module to obtain a second mask audio frame coding feature sequence.
After at least one first mask audio frame coding feature is obtained, it is input into the second coding layer of the audio coding module and coded in combination with the text coding features to obtain a second mask audio frame coding feature.
Step S132: acquiring the mask audio frame coding feature sequence according to the second mask audio frame coding feature sequence.
It is easy to understand that, when the audio coding module only includes two audio coding layers, the second masked audio frame coding feature sequence encoded and output by the second audio coding layer is the masked audio frame coding feature sequence encoded and output by the audio coding module.
If the audio coding module comprises more audio coding layers, the obtained second mask audio frame coding feature sequence is input into the next audio coding layer, and coding is continued according to the coding sequence of the first two audio coding layers and the text coding feature until the mask audio frame coding feature sequence is obtained through the output of the last audio coding layer.
Therefore, when the pre-training model is trained, each audio coding layer of the audio coding module is coded by combining the text coding features, the coding output and the text coding features of the previous audio coding layer are used as the coding input of the next audio coding layer, the pre-training model can more accurately extract the audio frame coding features and the text features during training, and the accuracy of model training can be further improved.
In a specific embodiment, in order to enable each audio coding layer to better implement fusion of text coding features, the present disclosure further provides a training method of a pre-training model, as shown in fig. 7, fig. 7 is another schematic flow diagram of acquiring mask audio frame coding features of the training method of the pre-training model provided in an embodiment of the present disclosure.
As shown in the figure, the first coding layer and the second coding layer each include a feature fusion sub-layer, and in step S130 the step of coding each mask audio frame feature, by the feature fusion sub-layer in each coding layer of the audio coding module in combination with the text coding features, may include:
s1300: when the first coding layer of the audio coding module is used for coding, the feature fusion sublayer in the first coding layer combines text coding features to code the features of each mask audio frame to obtain a first mask audio frame coding feature sequence.
Specifically, when the coding is performed through the first coding layer, the feature fusion sublayer in the first coding layer is combined with the text coding features and performs coding to obtain a first mask audio frame coding feature sequence.
For convenience of understanding, reference is now made to fig. 8, where fig. 8 is a schematic diagram illustrating an alternative structure of an audio coding module of a training method for a pre-training model according to an embodiment of the present disclosure.
As shown in the figure, the first coding layer and the second coding layer included in the audio coding module have the same structure (and if there are further audio coding layers, they may have the same structure as well), and each coding layer includes a feature fusion sub-layer. In a specific embodiment, each coding layer may include six sub-layers, and one arrangement of the sub-layers, from input to output, is: a multi-head self-attention layer, a residual normalization layer, a cross-modal multi-head self-attention layer, a residual normalization layer, a feedforward neural network layer and a residual normalization layer.
When coding is performed by the audio coding module, the feature fusion sub-layer in each coding layer codes in combination with the text coding features; specifically, the feature fusion sub-layer may be the cross-modal multi-head self-attention layer.
When coding starts, the mask audio frame features in the mask audio frame feature sequence are first input into the first coding layer and coded sub-layer by sub-layer, starting from the first sub-layer of that coding layer, namely the multi-head self-attention layer; when the third sub-layer, namely the cross-modal multi-head self-attention layer, is reached, coding continues in combination with the text coding features output by the text coding module, until the first mask audio frame coding feature sequence is obtained from the last sub-layer of the coding layer, namely the residual normalization layer.
After the first mask audio frame coding feature sequence output by the first coding layer is obtained, it is input into the second coding layer for coding, again sub-layer by sub-layer from the multi-head self-attention layer; when the cross-modal multi-head self-attention layer is reached, coding continues in combination with the text coding features output by the text coding module, until the second mask audio frame coding feature sequence is obtained from the last residual normalization layer of the coding layer.
The specific way in which the cross-modal multi-head self-attention layer codes in combination with the text coding features can be as follows:
the Query input of the cross-modal multi-head attention layer receives the coded output of the preceding residual normalization layer, while the Key input and the Value input both receive the text coding features output by the text coding module; the cross-modal multi-head attention layer codes by combining these three inputs, and its coded output is input into the next residual normalization layer for further coding.
Of course, in the above case, step S131: when encoding is performed through the second encoding layer of the audio encoding module, the step of encoding each first mask audio frame encoding feature in the first mask audio frame encoding feature sequence by combining the feature fusion sublayer in the second encoding layer with the text encoding feature may include:
step 1310: when the second coding layer of the audio coding module is used for coding, the feature fusion sublayer in the second coding layer combines text coding features to code each first mask audio frame coding feature in the first mask audio frame coding feature sequence to obtain a second mask audio frame coding feature sequence.
When the second coding layer is used for coding, since the first coding layer and the second coding layer have the same structure, the detailed content may refer to the description of step S1300, and is not described herein again.
It should be noted that, when coding is performed by the second coding layer, the feature fusion sub-layer (specifically, the cross-modal multi-head attention layer) codes the first mask audio frame coding features in combination with the text coding features.
Therefore, through the cross-modal multi-head self-attention layer in the two or more audio coding layers of the audio coding module, the feature relations between each mask audio frame feature and the text coding features of the corresponding text can be fully captured during audio coding, and the model's ability to recognize and jointly process features of different modalities is thereby fully improved.
Furthermore, the multi-head self-attention layer in two or more audio coding layers of the audio coding module can enable the audio coding module to fully acquire the context information of each mask audio frame feature in the input mask audio frame feature sequence, so that the original audio frame feature corresponding to each mask audio frame feature can be more accurately predicted after the audio coding is finished, and the recognition capability of the model to the audio features is improved.
Of course, in other embodiments, the audio coding module may include more audio coding layers, and the specific situation of the sub-layers included in each audio coding layer may also be determined as needed.
In addition, when the pre-training model is trained, the structure of the audio coding module, including the number of layers of the audio coding layer, the composition of sub-layers in the audio coding layer, or the stacking order of sub-layers, may be adjusted accordingly in order to achieve a preset training target, which is not limited in this disclosure.
Step S14: and acquiring each training audio frame characteristic according to each mask audio frame coding characteristic in the mask audio frame coding characteristic sequence, and acquiring audio loss according to each corresponding training audio frame characteristic and audio frame characteristic.
In order to train the pre-training model, the mask audio frame coding features obtained after mask processing need to be identified, and the parameters of the pre-training model are then adjusted, so that the text coding module and the audio coding module of the pre-training model can extract accurate information from the audio to be recognized.
After the mask audio frame coding features are obtained, the audio frame features that corresponded to the mask audio frame features before coding are further restored from them; the restoration results are the training audio frame features.
The loss between each training audio frame feature and its corresponding audio frame feature is then calculated to obtain the audio loss.
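As a sketch only: the disclosure does not specify the restoration head or the form of the loss, so the linear projection and the L1 distance below are assumptions used to illustrate how the audio loss between each training audio frame feature and its corresponding audio frame feature might be computed.

```python
import torch
import torch.nn as nn

# Assumed restoration head: projects each mask audio frame coding feature back to
# the 160-dimensional audio frame feature space (its form is not fixed by the disclosure).
restore_head = nn.Linear(256, 160)

def compute_audio_loss(mask_frame_coding_feats: torch.Tensor,
                       original_frame_feats: torch.Tensor) -> torch.Tensor:
    """mask_frame_coding_feats: (num_frames, 256) mask audio frame coding features;
    original_frame_feats: (num_frames, 160) original audio frame features.
    The restored frames are the training audio frame features, and the audio loss
    compares them with the original features (an L1 distance is assumed here)."""
    training_frame_feats = restore_head(mask_frame_coding_feats)
    return torch.nn.functional.l1_loss(training_frame_feats, original_frame_feats)
```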
Step S15: and judging whether the audio loss meets the audio loss threshold, if so, executing the step S17, and if not, executing the step S16.
If the current audio loss does not meet the preset audio loss threshold, the recognition capability of the current model does not meet the training requirement, so the parameters must continue to be adjusted and the adjusted model trained further, and step S16 is executed. If the current audio loss meets the preset audio loss threshold, the recognition capability of the model meets the training requirement, the training of the pre-training model is complete, and step S17 is executed.
It should be noted here that, since a series of audios to be recognized and their corresponding texts are obtained from the original audio in steps S100 to S103, each segment of audio to be recognized and its corresponding text may be coded multiple times as training data, and the audio losses obtained after each coding may differ. The criterion for judging whether the audio loss satisfies the audio loss threshold may require that the audio loss obtained each time satisfies the threshold, or may first compute the average of the audio losses over multiple codings and then judge whether that average satisfies the threshold; this can be set flexibly according to the training target, and the present disclosure is not limited in this respect.
Step S16: adjusting the parameters of the pre-training model according to the audio loss, and turning to execute step S13.
The parameters of the audio coding module of the pre-training model are adjusted according to the currently obtained audio loss, and then audio coding is performed again, that is, step S13 is executed, starting a new cycle of coding and adjustment. When step S13 is executed again, the audio coding module with updated parameters can be regarded as a new audio coding module, so that by repeatedly updating the audio coding module in this way, the training accuracy of the pre-training model can be improved.
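The cycle of steps S13 to S17 can be pictured with the following hedged sketch, which reuses the audio coding and audio_loss sketches above; the optimizer, learning rate, threshold value and the average-loss stopping rule are assumptions rather than the disclosed procedure.

```python
import torch

def pretrain_audio(audio_encoder, pred_head, make_batches, loss_threshold=0.05):
    # make_batches: callable returning an iterable of
    # (masked_feats, original_feats, mask_pos, text_enc) tuples
    params = list(audio_encoder.parameters()) + list(pred_head.parameters())
    optimizer = torch.optim.Adam(params, lr=1e-4)
    while True:
        losses = []
        for masked_feats, original_feats, mask_pos, text_enc in make_batches():
            encodings = audio_encoder(masked_feats, text_enc)                # step S13: audio coding
            loss = audio_loss(encodings, original_feats, mask_pos, pred_head)  # step S14: audio loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                                                 # step S16: adjust parameters
            losses.append(loss.item())
        if sum(losses) / len(losses) <= loss_threshold:                      # step S15: average-loss variant
            return audio_encoder, pred_head                                  # step S17: training finished
```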
Step S17: and obtaining the trained pre-training model after the training is finished.
It can be seen that, in the training method of the pre-training model provided by the embodiment of the present disclosure, when the pre-training model to be trained is trained, the text features are encoded by the text coding module of the pre-training model to obtain the text coding features, and the audio coding module of the pre-training model encodes each mask audio frame feature in the mask audio frame feature sequence in combination with the text coding features, so that the audio frame features and the text features can be fully fused during coding and the model can learn, through training, to extract the audio frame coding features and the text features more accurately, improving the accuracy of model training; in addition, when the pre-training model is trained, audio frame features of a preset proportion are randomly selected from the audio frame features for mask processing, and training is realized by restoring them, without annotating the training data, so that the training cost of the pre-training model can be reduced; on the other hand, because the audio coding module and the text coding module obtained by the training method of the pre-training model provided by the embodiment of the present disclosure have higher accuracy, accurate audio coding features and text coding features can be obtained when the audio coding module is used for audio coding and the text coding module is used for text coding, which reduces the training difficulty of the models that further process the audio coding features and the text coding features (such as a speaker identity authentication model and a speaker emotion recognition model), allows a better training effect to be achieved with less labeled data, and lowers the cost of training those models; meanwhile, the audio coding module and the text coding module can be applied to models of different application scenarios, and therefore have better transferability and expandability.
In another specific implementation manner, in order to improve training efficiency, an embodiment of the present disclosure further provides a training method of a pre-training model, please refer to fig. 9, and fig. 9 is another schematic flow chart of the training method of the pre-training model provided in an embodiment of the present disclosure.
As shown in the figure, the training method of the pre-training model provided by the embodiment of the present disclosure includes:
Step S70: acquiring the audio to be recognized.
For details of step S70, please refer to the description of step S100 shown in fig. 2, which is not repeated herein.
Step S71: and acquiring the characteristics of each audio frame corresponding to the audio to be identified.
For details of step S71, please refer to the description of steps S101 to S102 shown in fig. 2, which is not repeated herein.
Step S72: and acquiring word segmentation text characteristics of the text corresponding to the audio to be recognized.
Please refer to the description of step S103 shown in fig. 2 for obtaining the specific content of the text corresponding to the audio to be recognized, which is not described herein again.
After the text is obtained, in a specific implementation manner, a word segmentation algorithm may be applied to perform word segmentation on the obtained text, so as to obtain a word segmentation text of the text corresponding to the audio to be recognized. And then, obtaining a word vector of the word segmentation text to obtain the word segmentation text characteristics.
In another specific embodiment, in order to improve the accuracy of the obtained word segmentation text features, after the word vector is obtained, a position vector corresponding to the word segmentation text is further obtained, and the obtained word vector and position vector are spliced to obtain the feature vector of the word segmentation text, that is, the word segmentation text feature of the text corresponding to the audio to be recognized.
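A minimal sketch of this splicing, assuming learned word and position embeddings of illustrative sizes (the actual vectorization method is not specified here), might look as follows:

```python
import torch
import torch.nn as nn

word_emb = nn.Embedding(num_embeddings=30000, embedding_dim=128)   # word vectors (vocab size assumed)
pos_emb = nn.Embedding(num_embeddings=512, embedding_dim=128)      # position vectors (max length assumed)

def segmented_text_features(token_ids):
    # token_ids: (batch, seq_len) indices of the word segmentation texts
    positions = torch.arange(token_ids.size(1), device=token_ids.device).unsqueeze(0)
    # splice word vector and position vector -> (batch, seq_len, 256) word segmentation text features
    return torch.cat([word_emb(token_ids),
                      pos_emb(positions).expand(token_ids.size(0), -1, -1)], dim=-1)
```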
Furthermore, considering that a conventional word segmentation method needs a segmentation vocabulary, and the text corresponding to the audio to be recognized may contain words that are not included in that vocabulary, a BPE (byte pair encoding) segmentation algorithm may be used to segment the text corresponding to the audio to be recognized, so that the obtained word segmentation text covers the complete text content.
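For intuition only, the following self-contained sketch shows the core BPE idea of learning merges of frequent adjacent symbol pairs so that unseen words can still be segmented into known sub-word units; it is not the tokenizer used by the disclosure, and the corpus and merge count are placeholders.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges=10):
    # represent each word as a tuple of characters plus an end-of-word marker
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)          # most frequent adjacent pair
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges   # ordered merge rules that define the sub-word vocabulary
```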
Step S73: and performing mask processing on the word segmentation text characteristics of the text corresponding to the audio to be recognized to obtain mask text characteristics.
In order to train the pre-training model while reducing the amount of labeling required for training, this can be realized by masking the word segmentation text features and then restoring them.
In the training process of the pre-training model, the word segmentation text features for mask operation are randomly selected from the word segmentation text features, and the word segmentation text features for mask operation can be selected according to a first preset proportion, specifically, the first preset proportion can be 15% or 20%, and can also be modified into other numerical values according to actual needs.
Specifically, a mask text feature or any other word segmentation text feature may be used to mask the randomly selected word segmentation text features of the first preset proportion, so as to obtain the mask text features. In this way, different word segmentation text features are used for the mask processing, which can further improve the effect of masking the word segmentation text features and thus the subsequent effect of training the text coding module.
In a specific embodiment, when the randomly selected word segmentation text features are masked, how the mask operation is performed may also be determined according to probabilities, for example: each selected word segmentation text feature is replaced by the mask text feature with a probability of 80%, replaced by another random word segmentation text feature with a probability of 10%, and kept unchanged with a probability of 10%. This further increases the randomness of the mask and improves the training effect of the text coding module; of course, the mask operation may also be performed on the randomly selected word segmentation text features in other manners.
Therefore, in the training process, a part of the word segmentation text features is selected each time for mask processing, and each selected feature is, with different probabilities, replaced by the mask text feature, replaced by another random word segmentation text feature, or kept unchanged; the mask text features obtained after the mask processing are then encoded by the text coding module, the original word segmentation text features corresponding to the mask text features are recognized after the text coding is finished, and, through multiple rounds of text coding, the error between the recognized features and the original word segmentation text features is continuously converged, thereby training the feature recognition capability of the model.
It is easy to understand that, because the mask processing turns some word segmentation text features into mask text features while others remain the original word segmentation text features, the obtained mask text features include both the features that were actually masked and the features that went through the mask processing without being masked; for convenience of description, any word segmentation text feature that has gone through the mask processing is referred to as a mask text feature, regardless of whether it was actually masked.
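A hedged sketch of this selection-and-replacement scheme, with the 15% selection ratio and the 80%/10%/10% replacement probabilities taken from the example above and everything else (shapes, names) assumed, is given below:

```python
import torch

def mask_text_features(features, mask_feature, select_ratio=0.15):
    # features: (seq_len, dim) word segmentation text features; mask_feature: (dim,) learned mask feature
    seq_len = features.size(0)
    masked = features.clone()
    selected = torch.rand(seq_len) < select_ratio           # positions chosen for mask processing
    for i in torch.nonzero(selected).flatten().tolist():
        r = torch.rand(1).item()
        if r < 0.8:
            masked[i] = mask_feature                         # replace with the mask text feature
        elif r < 0.9:
            j = torch.randint(seq_len, (1,)).item()
            masked[i] = features[j]                          # replace with another random token's feature
        # else: keep the original word segmentation text feature unchanged
    return masked, selected                                  # 'selected' marks features to restore later
```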
Step S74: and performing mask processing on the audio frame characteristics corresponding to the audio to be identified to obtain a mask audio frame characteristic sequence.
For details of step S74, please refer to the description of step S12 shown in fig. 1, which is not repeated herein.
Step S75: and coding the mask text characteristics through a text coding module to obtain the mask text coding characteristics.
The specific content of step S75 can refer to the description of step S11 shown in fig. 1 for encoding the text features, and is not described herein again.
Step S76: and coding each mask audio frame feature in the mask audio frame feature sequence by combining the mask text coding feature through an audio coding module of the pre-training model to obtain a mask audio frame coding feature sequence.
The specific content of step S76 can refer to the description of step S13 shown in fig. 1 or steps S1300 to S1310 shown in fig. 7, and is not described herein again.
It should be noted that, since the text encoding process described above results in mask text encoding features, when combining text encoding features, mask text encoding features need to be combined.
Step S77: and acquiring training recognition word segmentation text characteristics by using a text recognition module of a pre-training model according to the mask text coding characteristics.
After the mask text coding features are output by the text coding module, they are input to the audio coding module to participate in the audio coding; at the same time, the text recognition module of the pre-training model performs feature recognition on the mask text coding features to restore the mask text features before coding, predicting the original word segmentation text features corresponding to each masked word segmentation text feature before the mask processing, thereby obtaining the training recognized word segmentation text features. As can be seen from the description of step S53, since only a part of the word segmentation text features is selected for mask processing before text coding, the text coding module can capture, during coding, information about the original word segmentation text feature corresponding to each mask text feature from the non-masked word segmentation text features near it, and the text recognition module can use this information to try to restore the original word segmentation text feature, that is, to obtain the training recognized word segmentation text feature.
It is easy to understand that the step of obtaining the training recognized word segmentation text features only requires the mask text coding features, so it has no strict ordering relative to obtaining the mask audio frame coding feature sequence.
Step S78: and acquiring the training audio frame characteristics by using an audio recognition module of the pre-training model according to the mask audio frame coding characteristic sequence.
After the mask audio frame coding feature sequence is output by the audio coding module, the audio recognition module of the pre-training model recognizes and restores each mask audio frame coding feature in the sequence, predicting the original audio frame feature of each mask audio frame feature before the mask processing, thereby obtaining the training audio frame features. As can be seen from the descriptions of steps S54 and S12, since only a part of the audio frame features is selected for masking before audio coding, the audio coding module can capture, during coding, information about the original audio frame feature corresponding to each mask audio frame feature from the non-masked audio frame features near it, and the audio recognition module can use this information to try to restore the original audio frame feature, obtaining the training audio frame feature.
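As an illustrative sketch only, the text recognition module and the audio recognition module could be simple prediction heads that map the coding features back to the original feature spaces; the layer sizes and the activation function are assumptions.

```python
import torch.nn as nn

# assumed: 256-dim coding features, 256-dim original text features, 80-dim original audio frame features
text_recognizer = nn.Sequential(nn.Linear(256, 256), nn.GELU(), nn.Linear(256, 256))
audio_recognizer = nn.Sequential(nn.Linear(256, 256), nn.GELU(), nn.Linear(256, 80))

def restore(mask_text_encodings, mask_audio_encodings):
    # returns the training recognized word segmentation text features
    # and the training audio frame features
    return text_recognizer(mask_text_encodings), audio_recognizer(mask_audio_encodings)
```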
Step S79: obtaining the text loss according to the training recognized word segmentation text features and the word segmentation text features that correspond to each other.
As can be seen from step S73, because the input word segmentation text features are masked before text coding, the mask text coding features obtained by the text coding can not only be combined by the audio coding module with each mask audio frame feature for coding, but can also be recognized by the model, which tries to restore the original word segmentation text features corresponding to the mask text features before coding; the restored results are the training recognized word segmentation text features, and the loss between each training recognized word segmentation text feature and the corresponding word segmentation text feature is computed to obtain the text loss.
Step S710: obtaining the audio loss according to the training audio frame features and the audio frame features that correspond to each other.
The specific content of step S710 may refer to the description of step S14 in fig. 1 regarding the obtaining of the audio loss, and is not described herein again.
Step S711: judging whether the text loss meets the text loss threshold and the audio loss meets the audio loss threshold; step S713 is executed when both are met, and otherwise step S712 is executed.
Referring to the description of step S79, since the current text loss and audio loss respectively represent the recognition capabilities of the current model for the word segmentation text features and the audio frame features, only when the text loss satisfies the text loss threshold and the audio loss also satisfies the audio loss threshold do the model's coding information extraction and recognition capabilities for both reach the training expectation; if only one of the text loss and the audio loss satisfies its corresponding threshold, parameter adjustment and training of the adjusted model still need to continue.
Step S712: and adjusting parameters of the text coding module and the text recognition module according to the text loss, and adjusting parameters of the audio coding module, the text coding module and the audio recognition module of the pre-training model according to the audio loss.
As described above, when the text loss and the audio loss cannot both satisfy the text loss threshold and the audio loss threshold at the same time, the parameters of the text coding module and the text recognition module, and the parameters of the audio coding module and the audio recognition module, are adjusted according to the text loss and the audio loss.
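A hedged sketch of this joint check and update, assuming two optimizers and a single combined backward pass (the disclosure does not fix these details), is shown below:

```python
import torch

def joint_update(text_loss, audio_loss, text_optimizer, audio_optimizer,
                 text_threshold=0.05, audio_threshold=0.05):
    # step S711: both losses must meet their thresholds before training stops
    if text_loss.item() <= text_threshold and audio_loss.item() <= audio_threshold:
        return True                        # step S713: trained pre-training model obtained
    text_optimizer.zero_grad()
    audio_optimizer.zero_grad()
    (text_loss + audio_loss).backward()    # step S712: adjust parameters from both losses
    text_optimizer.step()
    audio_optimizer.step()
    return False                           # keep training
```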
Step S713: and obtaining the trained pre-training model.
Therefore, the training method of the pre-training model disclosed in the embodiment of the present disclosure not only performs mask processing on the audio frame features, but also performs mask processing and restoration recognition on the word segmentation text features, and adjusts the parameters of the text coding module using the text loss; this improves the text information acquisition capability of the text coding module and the training efficiency, and improves the coding accuracy of the text coding module, thereby improving the accuracy of the mask text coding features, which in turn improves the training efficiency and the training effect of the audio coding module.
In order to facilitate the training of models for implementing downstream tasks (such as speaker authentication and emotion classification) in the subsequent use process, an embodiment of the present disclosure further provides a method for obtaining coding features to implement the obtaining of coding features, please refer to fig. 10, where fig. 10 is a schematic flowchart of a method for obtaining coding features according to an embodiment of the present disclosure.
As shown in the figure, the method for acquiring coding characteristics provided by the embodiment of the present disclosure includes:
step S80: the method comprises the steps of obtaining each audio frame feature to be coded of audio to be coded and a text feature to be coded of a text to be coded corresponding to the audio to be coded.
In order to obtain the coding characteristics, firstly, audio to be coded is obtained, and then, the audio frame characteristics to be coded, the text to be coded and the text characteristics to be coded are obtained based on the audio to be coded.
Specifically, the specific content of obtaining the characteristics of each audio frame to be encoded of the audio to be encoded may refer to the descriptions of steps S100 to S102 shown in fig. 2, and is not described herein again.
The specific content of the text to be coded corresponding to the audio to be coded may refer to the description of steps S103 to S105 shown in fig. 2, and is not described herein again.
Step S81: and coding the text features to be coded by the text coding module trained by the pre-training model training method to obtain the coded text features.
The text coding module trained by the pre-training model training method is used for coding the text features to be coded to obtain the specific content of the coded text features, which may refer to the description of step S11 shown in fig. 1 and is not described herein again.
It should be noted here that, the method for encoding the text features to be encoded by the trained text encoding module is the same as the method for encoding the text features by the text encoding module of the pre-training model in step S11, except that the text encoding module is obtained by parameter adjustment, and the accuracy of the obtained encoded text features is higher, so that the training efficiency of the downstream module can be improved.
Step S82: and coding the characteristics of the audio frame to be coded by the audio coding module trained by the pre-training model training method in combination with the characteristics of the coded text to obtain an audio frame coding characteristic sequence.
The specific content of the audio frame coding feature sequence obtained by coding the features of the audio frame to be coded by the audio coding module trained by the pre-training model training method in combination with the coded text features obtained in step S81 may refer to the description of step S13 shown in fig. 1, and is not described herein again.
It should be noted that, although the method for encoding the audio frame features to be encoded by using the trained audio encoding module in combination with the encoded text features is the same as the method for encoding each mask audio frame feature in the mask audio frame feature sequence by using the audio encoding module of the pre-training model and combining the text encoding features described in step S13, the accuracy of the obtained audio frame encoding feature sequence is higher due to the adjustment of the audio encoding module during training, so that the training efficiency of the downstream module can be improved.
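Putting steps S80 to S82 together, a minimal usage sketch with the trained modules (function and argument names are assumptions) could look like this:

```python
import torch

@torch.no_grad()
def acquire_coding_features(text_encoder, audio_encoder,
                            text_feats_to_encode, audio_frame_feats_to_encode):
    # step S81: encode the text features to be encoded with the trained text coding module
    encoded_text_features = text_encoder(text_feats_to_encode)
    # step S82: encode the audio frame features to be encoded in combination with the coded text features
    audio_frame_coding_sequence = audio_encoder(audio_frame_feats_to_encode, encoded_text_features)
    return encoded_text_features, audio_frame_coding_sequence   # inputs for downstream models
```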
It can be seen that the coding feature obtaining method provided by the embodiment of the present disclosure can obtain the coding text feature and the audio frame coding feature sequence more accurately through the trained text coding module and audio coding module, thereby reducing the difficulty of downstream model training, reducing the required labeled data amount, reducing the training cost, and improving the efficiency of downstream model training.
Referring to fig. 11, fig. 11 is a schematic structural diagram of a training apparatus for pre-training a model according to an embodiment of the present disclosure.
As shown in the drawings, the training apparatus for pre-training a model provided by the embodiment of the present disclosure includes:
the audio frame feature and text feature acquiring unit 90 is adapted to acquire each audio frame feature of the audio to be recognized and a text feature of a text corresponding to the audio to be recognized.
The text coding feature obtaining unit 91 is adapted to code the text features through a text coding module of the pre-training model to obtain text coding features.
The mask audio frame feature sequence obtaining unit 92 is adapted to randomly select an audio frame feature with a first preset proportion from the audio frame features to perform a mask processing, so as to obtain a mask audio frame feature sequence.
The audio coding feature obtaining unit 93 is adapted to code, through the audio coding module of the pre-training model, each mask audio frame feature in the mask audio frame feature sequence in combination with the text coding feature, so as to obtain a mask audio frame coding feature sequence.
The parameter adjusting unit 94 is adapted to obtain each training audio frame feature according to each mask audio frame coding feature in the mask audio frame coding feature sequence, and adjust the parameters of the pre-training model according to the audio loss obtained from each corresponding training audio frame feature and audio frame feature until the audio loss meets the audio loss threshold, so as to obtain the trained pre-training model.
Optionally, the audio frame feature and text feature obtaining unit 90, adapted to obtain each audio frame feature of the audio to be recognized and the text feature of the text corresponding to the audio to be recognized, may include:
acquiring a voice audio in an original audio to obtain the audio to be identified;
sequentially acquiring each audio frame of the audio to be identified according to a preset frame length and a preset sliding step length, wherein the preset frame length is greater than the preset sliding step length;
extracting the characteristics of each audio frame to obtain the characteristics of the audio frames;
wherein obtaining the voice audio in the original audio to obtain the audio to be recognized may include:
identifying and marking voice audio in the original audio;
and extracting the voice audio frequency in the original audio frequency according to the mark to obtain the audio frequency to be identified.
Optionally, the text coding feature obtaining unit 91 is adapted to code the text features through the text coding module of the pre-training model, and obtaining the text coding features may include:
performing word segmentation on the text corresponding to the audio to be recognized to obtain each word segmentation text, and obtaining the word segmentation text characteristics of each word segmentation text;
randomly selecting word segmentation text features of a second preset proportion from the word segmentation text features to perform mask processing to obtain mask text features, and coding the mask text features through the text coding module to obtain mask text coding features;
the randomly selecting word segmentation text features of a second preset proportion from the word segmentation text features to perform mask processing to obtain mask text features, and the obtaining mask text coding features by coding the mask text features through the text coding module may include:
and carrying out mask processing on the word segmentation text features with the second preset proportion in the randomly selected word segmentation text features by using mask text features or any text features to obtain mask text features.
Optionally, the mask audio frame feature sequence obtaining unit 92 is adapted to randomly select an audio frame feature with a first preset proportion from among the audio frame features to perform a masking process, and obtaining a mask audio frame feature sequence may include:
and masking the audio frame characteristics with a first preset proportion in the randomly selected audio frame characteristics by using the masked audio frame characteristics or any audio frame characteristics to obtain a masked audio frame characteristic sequence.
Optionally, the audio coding feature obtaining unit 93 is adapted to code, by using the audio coding module of the pre-training model and combining the text coding feature, each mask audio frame feature in the mask audio frame feature sequence to obtain a mask audio frame coding feature sequence, and may include:
coding each mask audio frame feature in the mask audio frame feature sequence by the first coding layer of the audio coding module in combination with the text coding feature to obtain a first mask audio frame coding feature sequence;
coding each first mask audio frame coding feature in the first mask audio frame coding feature sequence by the second coding layer of the audio coding module in combination with the text coding feature to obtain a second mask audio frame coding feature sequence;
acquiring the mask audio frame coding feature sequence according to the second mask audio frame coding feature sequence;
wherein, by the first coding layer of the audio coding module, in combination with the text coding features, coding each mask audio frame feature in the mask audio frame feature sequence to obtain a first mask audio frame coding feature sequence may include:
coding each mask audio frame feature in the mask audio frame feature sequence by a feature fusion sublayer of the first coding layer of the audio coding module in combination with the text coding feature to obtain a first mask audio frame coding feature sequence;
wherein, by the second coding layer of the audio coding module, in combination with the text coding features, coding each first mask audio frame coding feature in the first mask audio frame coding feature sequence to obtain a second mask audio frame coding feature sequence may include:
and combining the text coding features and the coding features of each first mask audio frame in the first mask audio frame coding feature sequence through the feature fusion sublayer of the second coding layer, and coding to obtain the second mask audio frame coding feature sequence.
Optionally, the parameter adjusting unit 94 is adapted to obtain each training audio frame feature according to each mask audio frame coding feature in the mask audio frame coding feature sequence, and adjust parameters of the pre-training model according to the audio loss obtained from each corresponding training audio frame feature and audio frame feature until the audio loss meets the audio loss threshold, where obtaining the trained pre-training model may include:
acquiring training recognition word segmentation text characteristics by using a text recognition module of the pre-training model according to the mask text coding characteristics, and acquiring text loss according to the training recognition word segmentation text characteristics and the word segmentation text characteristics which correspond to each other;
and adjusting parameters of the text coding module and the text recognition module according to the text loss, and adjusting parameters of the audio coding module, the text coding module and the audio recognition module of the pre-training model according to the audio loss until the text loss meets a text loss threshold and the audio loss meets an audio loss threshold, so as to obtain the trained pre-training model.
Therefore, when the training device of the pre-training model provided by the embodiment of the present disclosure trains the pre-training model to be trained, the text coding module of the pre-training model encodes the text features to obtain the text coding features, and the audio coding module of the pre-training model encodes each mask audio frame feature in the mask audio frame feature sequence in combination with the text coding features, so that the audio frame features and the text features can be fully fused during coding and the model can learn, through training, to extract the audio frame coding features and the text features more accurately, improving the accuracy of model training; in addition, when the pre-training model is trained, audio frame features of a preset proportion are randomly selected from the audio frame features for mask processing, and training is realized by restoring them, without annotating the training data, so that the training cost of the pre-training model can be reduced; on the other hand, because the audio coding module and the text coding module obtained by the training device have higher accuracy, accurate audio coding features and text coding features can be obtained when the audio coding module is used for audio coding and the text coding module is used for text coding, which reduces the training difficulty of the models that further process the audio coding features and the text coding features (such as a speaker identity authentication model and a speaker emotion recognition model), allows a better training effect to be achieved with less labeled data, and lowers the cost of training those models; meanwhile, the audio coding module and the text coding module can be applied to models of different application scenarios, and therefore have better transferability and expandability.
Referring to fig. 12, fig. 12 is a schematic structural diagram of an encoding characteristic obtaining apparatus according to an embodiment of the present disclosure.
As shown in the figure, the encoding characteristic obtaining apparatus provided by the embodiment of the present disclosure includes:
the to-be-encoded feature obtaining unit 100 is adapted to obtain each to-be-encoded audio frame feature of the to-be-encoded audio and a to-be-encoded text feature of a to-be-encoded text corresponding to the to-be-encoded audio.
The text encoding unit 101 is adapted to encode the text feature to be encoded to obtain an encoded text encoding feature.
And the audio encoding unit 102 is adapted to encode each audio frame feature to be encoded in combination with the encoded text encoding feature to obtain an audio frame encoding feature sequence.
In this way, according to the coding feature obtaining apparatus provided by the embodiment of the present disclosure, the text coding module in the text coding unit and the audio coding module in the audio coding unit of the apparatus are obtained by parameter adjustment in the training process, and the text coding feature and the audio frame coding feature sequence with higher accuracy can be obtained respectively, so that the difficulty of downstream model training can be reduced, the required labeled data amount is reduced, the training cost is reduced, and the efficiency of downstream model training can be improved.
An exemplary embodiment of the present disclosure also provides an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor. The memory stores a computer program executable by the at least one processor, the computer program, when executed by the at least one processor, is for causing the electronic device to perform a method according to an embodiment of the disclosure.
The disclosed exemplary embodiments also provide a non-transitory computer readable storage medium storing a computer program, wherein the computer program, when executed by a processor of a computer, is adapted to cause the computer to perform a method according to an embodiment of the present disclosure.
The exemplary embodiments of the present disclosure also provide a computer program product comprising a computer program, wherein the computer program, when executed by a processor of a computer, is adapted to cause the computer to perform a method according to an embodiment of the present disclosure.
Referring to fig. 13, a block diagram of the structure of an electronic device 1100, which may be a server or a client of the present disclosure and is an example of a hardware device that may be applied to aspects of the present disclosure, will now be described. The electronic device is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 13, the electronic device 1100 includes a computing unit 1101, which can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 1102 or a computer program loaded from a storage unit 1108 into a Random Access Memory (RAM) 1103. In the RAM 1103, various programs and data necessary for the operation of the device 1100 may also be stored. The calculation unit 1101, the ROM 1102, and the RAM 1103 are connected to each other by a bus 1104. An input/output (I/O) interface 1105 is also connected to bus 1104.
A number of components in the electronic device 1100 are connected to the I/O interface 1105, including: an input unit 1106, an output unit 1107, a storage unit 1108, and a communication unit 1109. The input unit 1106 may be any type of device capable of inputting information to the electronic device 1100; it may receive input numeric or character information and generate key signal inputs related to user settings and/or function controls of the electronic device. The output unit 1107 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, a video/audio output terminal, a vibrator, and/or a printer. The storage unit 1108 may include, but is not limited to, a magnetic disk and an optical disk. The communication unit 1109 allows the electronic device 1100 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers and/or chipsets, such as Bluetooth (TM) devices, WiFi devices, WiMax devices, cellular communication devices, and/or the like.
The computing unit 1101 can be a variety of general purpose and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 1101 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and the like. The computing unit 1101 performs the respective methods and processes described above. For example, in some embodiments, step S11, i.e., encoding the text features by the text coding module of the pre-training model, may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 1108. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 1100 via the ROM 1102 and/or the communication unit 1109. In some embodiments, the computing unit 1101 may be configured by any other suitable means (e.g., by means of firmware) to perform step S74, i.e., performing mask processing on each audio frame feature corresponding to the audio to be recognized to obtain the mask audio frame feature sequence.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
As used in this disclosure, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Although the disclosed embodiments are disclosed above, the disclosure is not limited thereto. Various changes and modifications may be effected therein by one skilled in the art without departing from the spirit and scope of the disclosure, and it is intended that the scope of the disclosure be limited only by the claims appended hereto.

Claims (12)

1. A training method of a pre-training model is characterized in that:
acquiring the characteristics of each audio frame of the audio to be recognized and the text characteristics of the text corresponding to the audio to be recognized;
coding the text features through a text coding module of the pre-training model to obtain text coding features;
randomly selecting audio frame features with a first preset proportion from the audio frame features to perform mask processing to obtain a mask audio frame feature sequence;
coding each mask audio frame feature in the mask audio frame feature sequence by combining the text coding feature through an audio coding module of the pre-training model to obtain a mask audio frame coding feature sequence;
acquiring each training audio frame feature according to each mask audio frame coding feature in the mask audio frame coding feature sequence, and adjusting parameters of the pre-training model according to an audio loss obtained from each corresponding training audio frame feature and audio frame feature, until the audio loss meets an audio loss threshold value, to obtain the trained pre-training model;
the audio coding module at least comprises a first coding layer and a second coding layer;
the step of obtaining a mask audio frame coding feature sequence by coding each mask audio frame feature in the mask audio frame feature sequence through the audio coding module of the pre-training model in combination with the text coding feature includes:
coding each mask audio frame feature in the mask audio frame feature sequence by the first coding layer of the audio coding module in combination with the text coding feature to obtain a first mask audio frame coding feature sequence;
coding each first mask audio frame coding feature in the first mask audio frame coding feature sequence through a second coding layer in combination with the text coding feature to obtain a second mask audio frame coding feature sequence;
and acquiring the coding feature sequence of the mask audio frame according to the coding feature sequence of the second mask audio frame.
2. The method of claim 1, wherein the first coding layer and the second coding layer each comprise a feature fusion sublayer;
the step of obtaining a first mask audio frame coding feature sequence by coding each mask audio frame feature in the mask audio frame feature sequence through the first coding layer of the audio coding module in combination with the text coding feature includes:
combining the text coding features and all mask audio frame features in the mask audio frame feature sequence through the feature fusion sublayer of the first coding layer, and coding to obtain a first mask audio frame coding feature sequence;
the step of obtaining the second mask audio frame coding feature sequence by coding each first mask audio frame coding feature in the first mask audio frame coding feature sequence through the second coding layer in combination with the text coding feature includes:
and combining the text coding features and the coding features of each first mask audio frame in the first mask audio frame coding feature sequence through the feature fusion sublayer of the second coding layer, and coding to obtain the second mask audio frame coding feature sequence.
3. The method for training a pre-training model according to claim 1, wherein the step of obtaining the features of each audio frame of the audio to be recognized comprises:
acquiring a voice audio in an original audio to obtain the audio to be identified;
sequentially acquiring each audio frame of the audio to be identified according to a preset frame length and a preset sliding step length, wherein the preset frame length is greater than the preset sliding step length;
and extracting the characteristics of each audio frame to obtain the characteristics of the audio frames.
4. The method for training a pre-training model according to claim 3, wherein the step of obtaining the speech audio in the original audio to obtain the audio to be recognized comprises:
identifying and marking voice audio in the original audio;
and extracting the voice audio frequency in the original audio frequency according to the mark to obtain the audio frequency to be identified.
5. The method for training a pre-training model according to claim 1, wherein the step of randomly selecting the audio frame features with the first preset proportion from among the audio frame features to perform masking processing to obtain a masked audio frame feature sequence comprises:
and masking the audio frame characteristics with a first preset proportion in the randomly selected audio frame characteristics by using the masked audio frame characteristics or any audio frame characteristics to obtain a masked audio frame characteristic sequence.
6. The method of training a pre-trained model according to any one of claims 1-5,
the step of obtaining the text features of the text corresponding to the audio to be recognized comprises the following steps: performing word segmentation on the text corresponding to the audio to be recognized to obtain each word segmentation text, and obtaining the word segmentation text characteristics of each word segmentation text;
the step of coding the text features through the text coding module of the pre-training model to obtain the text coding features comprises:
randomly selecting word segmentation text features of a second preset proportion from the word segmentation text features to perform mask processing to obtain mask text features, and coding the mask text features through the text coding module to obtain mask text coding features;
the step of coding each mask audio frame feature in the mask audio frame feature sequence by the audio coding module of the pre-training model in combination with the text coding feature to obtain a mask audio frame coding feature sequence includes:
coding each mask audio frame feature in the mask audio frame feature sequence by combining the mask text coding feature through an audio coding module of the pre-training model to obtain a mask audio frame coding feature sequence;
the training method further comprises the following steps:
acquiring training recognition word segmentation text characteristics by using a text recognition module of the pre-training model according to the mask text coding characteristics, and acquiring text loss according to the training recognition word segmentation text characteristics and the word segmentation text characteristics which correspond to each other;
the step of adjusting the parameters of the pre-training model according to the audio loss obtained by the training audio frame characteristics and the audio frame characteristics corresponding to each other until the audio loss meets an audio loss threshold value to obtain the trained pre-training model comprises:
and adjusting parameters of the text coding module and the text recognition module according to the text loss, and adjusting parameters of the audio coding module, the text coding module and the audio recognition module of the pre-training model according to the audio loss until the text loss meets a text loss threshold and the audio loss meets an audio loss threshold, so as to obtain the trained pre-training model.
7. The method for training a pre-training model according to claim 6, wherein the step of randomly selecting a second preset proportion of the participle text features in each of the participle text features to perform mask processing to obtain mask text features, and the step of encoding the mask text features by the text encoding module to obtain mask text encoding features comprises:
and carrying out mask processing on the word segmentation text features with the second preset proportion in the randomly selected word segmentation text features by using mask text features or any text features to obtain mask text features.
8. A method for acquiring coding characteristics, comprising:
acquiring each audio frame characteristic to be coded of audio to be coded and a text characteristic to be coded of a text to be coded corresponding to the audio to be coded;
the text coding module obtained by training with the training method of the pre-training model according to any one of claims 1 to 7 is used for coding the text features to be coded to obtain coded text coding features;
the audio coding module obtained by training with the training method of the pre-training model according to any one of claims 1 to 7 encodes each of the audio frame features to be encoded in combination with the encoded text encoding features to obtain an audio frame encoding feature sequence.
9. A training apparatus for pre-training a model, comprising:
the audio frame characteristic and text characteristic acquisition unit is used for acquiring each audio frame characteristic of the audio to be identified and the text characteristic of the text corresponding to the audio to be identified;
the text coding feature acquisition unit is used for coding the text features through a text coding module of the pre-training model to obtain text coding features;
the mask audio frame feature sequence acquisition unit is used for randomly selecting audio frame features with a first preset proportion from the audio frame features to carry out mask processing to obtain a mask audio frame feature sequence;
an audio coding feature obtaining unit, configured to encode, by using the audio coding module of the pre-training model and in combination with the text coding feature, each mask audio frame feature in the mask audio frame feature sequence to obtain a mask audio frame coding feature sequence;
a parameter adjusting unit, configured to obtain each training audio frame feature according to each mask audio frame coding feature in the mask audio frame coding feature sequence, and adjust parameters of the pre-training model according to an audio loss obtained from each corresponding training audio frame feature and audio frame feature, until the audio loss meets an audio loss threshold, so as to obtain the trained pre-training model;
the audio coding module at least comprises a first coding layer and a second coding layer;
the audio coding feature obtaining unit is configured to encode, by using the audio coding module of the pre-training model and in combination with the text coding feature, each mask audio frame feature in the mask audio frame feature sequence to obtain a mask audio frame coding feature sequence, where the obtaining of the mask audio frame coding feature sequence includes:
coding each mask audio frame feature in the mask audio frame feature sequence by the first coding layer of the audio coding module in combination with the text coding feature to obtain a first mask audio frame coding feature sequence;
coding each first mask audio frame coding feature in the first mask audio frame coding feature sequence through a second coding layer in combination with the text coding feature to obtain a second mask audio frame coding feature sequence;
and acquiring the coding feature sequence of the mask audio frame according to the coding feature sequence of the second mask audio frame.
10. An encoding characteristic acquisition apparatus, comprising:
the device comprises a to-be-coded feature acquisition unit, a to-be-coded feature acquisition unit and a to-be-coded feature acquisition unit, wherein the to-be-coded feature acquisition unit is used for acquiring each to-be-coded audio frame feature of an audio to be coded and a to-be-coded text feature of a to-be-coded text corresponding to the audio to be coded;
a text coding unit, configured to code the text features to be coded by using the text coding module obtained by training through the training method of the pre-training model according to any one of claims 1 to 7, so as to obtain coded text coding features;
an audio encoding unit, configured to encode, by using the audio encoding module obtained through training by the training method of the pre-training model according to any one of claims 1 to 7, each of the audio frame features to be encoded in combination with the encoded text encoding feature, so as to obtain an audio frame encoding feature sequence.
11. A computer-readable storage medium having stored thereon computer instructions, wherein the computer instructions are executable to perform a method of training a pre-trained model according to any one of claims 1 to 7 or a method of obtaining coding features according to claim 8.
12. A terminal comprising a memory and a processor, the memory having stored thereon computer instructions capable of being executed on a computer, wherein the processor executes the computer instructions to perform a method of training a pre-trained model according to any one of claims 1 to 7 or a method of obtaining coding features according to claim 8.
