CN114999532A - Model obtaining method, device, system, electronic equipment and storage medium

Info

Publication number: CN114999532A
Application number: CN202210423425.8A
Authority: CN (China)
Other languages: Chinese (zh)
Inventor: 赵情恩
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Prior art keywords: model, global, data, training data, module
Legal status: Withdrawn (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, specially adapted for particular use, for comparison or discrimination
    • G10L25/63: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, specially adapted for particular use, for comparison or discrimination, for estimating an emotional state


Abstract

The disclosure provides a model acquisition method, apparatus, system, electronic device, and storage medium, relating to artificial intelligence fields such as deep learning and natural language processing. The method includes: obtaining the global model most recently produced by a cloud end, the global model being sent to at least two device ends when the cloud end determines that a predetermined end condition is not met; generating pseudo labels for unlabeled data using the global model to obtain first-class training data carrying pseudo labels; training the global model with the first-class training data and with second-class training data carrying manually annotated labels to obtain an updated model; and returning the updated model to the cloud end so that the cloud end can update the global model by combining the updated models it obtains. Applying the disclosed scheme can improve model accuracy.

Description

Model obtaining method, device, system, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, a system, an electronic device, and a storage medium for model acquisition in the fields of deep learning and natural language processing.
Background
Speech emotion recognition is widely applied in psychological assessment, robotic assistants, and mobile services. In practice, it is usually performed with a speech emotion recognition model, and the accuracy of that model directly determines the accuracy of the recognition result.
Disclosure of Invention
The disclosure provides a model acquisition method, a model acquisition device, a model acquisition system, an electronic device and a storage medium.
A model acquisition method, comprising:
obtaining the global model most recently produced by a cloud end, the global model being sent to at least two device ends when the cloud end determines that a predetermined end condition is not met;
generating pseudo labels for unlabeled data using the global model to obtain first-class training data carrying pseudo labels;
training the global model with the first-class training data and with second-class training data carrying manually annotated labels to obtain an updated model;
and returning the updated model to the cloud end for the cloud end to update the global model by combining the obtained updated models.
A model acquisition method, comprising:
obtaining a global model obtained by pre-training, and performing the following first processing:
sending the global model to at least two device ends and obtaining the updated models returned by the device ends, wherein an updated model is obtained after a device end trains the global model with first-class training data and second-class training data, the first-class training data being training data carrying pseudo labels generated for unlabeled data using the global model, and the second-class training data being training data carrying manually annotated labels;
updating the global model by combining the obtained updated models;
and, in response to determining that a predetermined end condition is met, taking the most recently obtained global model as the finally required model, otherwise repeating the first processing based on the most recently obtained global model.
A model acquisition apparatus comprising: the device comprises a first acquisition module, a generation module, a training module and a sending module;
the first acquisition module is configured to acquire the global model most recently produced by a cloud end, the global model being sent to at least two device ends when the cloud end determines that a predetermined end condition is not met;
the generation module is configured to generate pseudo labels for unlabeled data using the global model, obtaining first-class training data carrying pseudo labels;
the training module is configured to train the global model with the first-class training data and with second-class training data carrying manually annotated labels, obtaining an updated model;
and the sending module is configured to return the updated model to the cloud end for the cloud end to update the global model by combining the obtained updated models.
A model acquisition apparatus comprising: a second obtaining module and an updating module;
the second obtaining module is used for obtaining a global model obtained by pre-training;
the update module is configured to perform the following first processing: sending the global model to at least two device ends and obtaining the updated models returned by the device ends, wherein an updated model is obtained after a device end trains the global model with first-class training data and second-class training data, the first-class training data being training data carrying pseudo labels generated for unlabeled data using the global model, and the second-class training data being training data carrying manually annotated labels; updating the global model by combining the obtained updated models; and, in response to determining that a predetermined end condition is met, taking the most recently obtained global model as the finally required model, otherwise repeating the first processing based on the most recently obtained global model.
A model acquisition system comprising two apparatus as described above.
An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method as described above.
A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method as described above.
A computer program product comprising computer programs/instructions which, when executed by a processor, implement a method as described above.
One embodiment of the above disclosure has the following advantages or benefits: the unlabeled data on each device end can be labeled using the capability of the global model, expanding the training data, and the global model can be updated by combining the model-update results of multiple device ends, further improving model accuracy. Correspondingly, when the model is used as a speech emotion recognition model for speech emotion recognition, the accuracy of the recognition result can be improved.
It should be understood that the statements in this section are not intended to identify key or critical features of the embodiments of the present disclosure, nor are they intended to limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flow chart of a first embodiment of a model acquisition method according to the present disclosure;
FIG. 2 is a schematic structural diagram of a global model according to the present disclosure;
FIG. 3 is a schematic diagram of the processing in each block according to the present disclosure;
FIG. 4 is a flow chart of a second embodiment of a model acquisition method according to the present disclosure;
fig. 5 is a schematic structural diagram illustrating a first embodiment 500 of a model obtaining apparatus according to the present disclosure;
FIG. 6 is a schematic diagram illustrating a second embodiment 600 of a model obtaining apparatus according to the present disclosure;
FIG. 7 is a schematic diagram of a component structure of an embodiment 700 of a model acquisition system according to the present disclosure;
FIG. 8 illustrates a schematic block diagram of an electronic device 800 that may be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In addition, it should be understood that the term "and/or" herein is merely one type of association relationship that describes an associated object, meaning that three relationships may exist, e.g., a and/or B may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship.
Fig. 1 is a flowchart of a first embodiment of a model acquisition method according to the present disclosure. As shown in fig. 1, the following detailed implementation is included.
In step 101, the global model most recently produced by the cloud end is obtained; the cloud end sends the global model to at least two device ends when it determines that a predetermined end condition is not met.
In step 102, pseudo labels are generated for unlabeled data using the global model, yielding first-class training data carrying pseudo labels.
In step 103, the global model is trained with the first-class training data and with second-class training data carrying manually annotated labels, yielding an updated model.
In step 104, the updated model is returned to the cloud end for the cloud end to update the global model in combination with the updated models it obtains.
The key to improving model accuracy is having enough training data. In practice, however, for reasons such as user privacy, business confidentiality, and legal and regulatory supervision, data silos are common, and large amounts of training data cannot be pooled together to train an effective model.
With the scheme of this method embodiment, the unlabeled data on each device end can be labeled using the capability of the global model, expanding the training data, and the global model can be updated by combining the model-update results of multiple device ends; that is, a joint training mode can be adopted to improve model accuracy. Correspondingly, when the model is used as a speech emotion recognition model for speech emotion recognition, the accuracy of the recognition result can be improved.
The initial global model may be pre-trained by the cloud end. Taking a speech emotion recognition scenario as an example, the speech data may be split into frames, for example with a frame length of 40 ms and a frame shift of 20 ms, and features may be extracted from each frame; the features may be Mel-Frequency Cepstral Coefficients (MFCC), Perceptual Linear Prediction (PLP) coefficients, Filter Bank (FBank) features, or the like. The extracted speech feature data and the corresponding labels may be used as training data to train the global model: forward calculation of the model is performed, the loss is computed with Cross Entropy (CE), the gradient is back-propagated under a stochastic gradient descent criterion, the model parameters are updated, and this is iterated until convergence, yielding a global model with a certain speech emotion recognition capability, i.e., a base model.
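As an illustration only, the pre-training loop described above (forward calculation, cross-entropy loss, gradient step, iterate to convergence) can be sketched in a few lines. The linear softmax classifier here is a hypothetical stand-in for the unspecified network, and feature extraction is assumed to have been done already:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def pretrain_global_model(features, labels, num_classes, lr=0.1, epochs=200):
    """Minimal stand-in for the cloud-side pre-training loop:
    forward pass, cross-entropy loss, gradient step, repeat."""
    rng = np.random.default_rng(0)
    n, d = features.shape
    W = rng.normal(0.0, 0.01, size=(d, num_classes))  # model parameters
    b = np.zeros(num_classes)
    one_hot = np.eye(num_classes)[labels]
    losses = []
    for _ in range(epochs):
        probs = softmax(features @ W + b)              # forward calculation
        loss = -np.mean(np.sum(one_hot * np.log(probs + 1e-12), axis=1))
        losses.append(loss)
        grad = (probs - one_hot) / n                   # CE gradient w.r.t. logits
        W -= lr * (features.T @ grad)                  # gradient-descent update
        b -= lr * grad.sum(axis=0)
    return (W, b), losses
```

On any reasonably separable feature set the recorded losses decrease over the iterations, which is the convergence behaviour the text relies on.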
The cloud may send the global model obtained by the pre-training to at least two device sides, and accordingly, the execution subject of the embodiment of the method shown in fig. 1 may be the device side.
After a device end obtains the global model from the cloud end, it can use the global model to generate pseudo labels for unlabeled data, thereby obtaining first-class training data carrying pseudo labels. Because these labels are not produced by manual annotation, they are called pseudo labels.
In an embodiment of the disclosure, for any piece of unlabeled data, M pieces of enhanced data may be obtained, M being a positive integer greater than one; each piece of enhanced data may be obtained by superimposing random noise on the unlabeled data. The M pieces of enhanced data may then each be used as input to the global model to obtain M output results, and the pseudo label of the unlabeled data may be determined by combining the M output results.
Through the processing, the automatic labeling of the unlabeled data can be realized by means of the global model, so that the training data is expanded, and moreover, the robustness of the model and the adaptability to the complex environment can be improved by performing random noise superposition on the unlabeled data.
The random noise superposition, i.e., random feature enhancement, can be realized by introducing coefficient factors, as shown below:

φ(x) = x ⊙ α + r; (1)

where φ(x) denotes the enhanced data, x denotes the unlabeled data, α obeys a Gaussian distribution with mean 1 and variance σ1², i.e., α ~ N(1, σ1²), and r obeys a Gaussian distribution with mean 0 and variance σ2² (σ2 is usually taken to be 0.1), i.e., r ~ N(0, σ2²).

The strength of the enhancement can be adjusted by adjusting the magnitude of σ1 (e.g., weak: 0.1, strong: 0.25).
The specific value of M may be determined according to actual needs, and may be 10, for example.
Correspondingly, random noise superposition can be performed 10 times on any piece of unlabeled data, yielding 10 pieces of enhanced data, with the coefficients (α, r) regenerated randomly each time. The 10 pieces of enhanced data can then be input to the global model separately to obtain 10 output results, and the pseudo label of the unlabeled data can be determined by combining the 10 output results.
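A minimal sketch of the random feature enhancement of formula (1), applied M times to one feature vector. Treating σ1 and σ2 as the noise scales passed to the Gaussian sampler, and the function and parameter names, are illustrative assumptions:

```python
import numpy as np

def random_feature_enhancement(x, sigma1, sigma2=0.1, m=10, rng=None):
    """Produce M enhanced copies of one unlabeled feature vector x via
    phi(x) = x * alpha + r, with alpha drawn around 1 (element-wise scale,
    spread sigma1) and r drawn around 0 (additive noise, spread sigma2);
    the coefficients are redrawn independently for each copy."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    alpha = rng.normal(1.0, sigma1, size=(m,) + x.shape)
    r = rng.normal(0.0, sigma2, size=(m,) + x.shape)
    return x * alpha + r   # shape: (m, *x.shape)
```

With sigma1 = 0.1 this gives the "weak" enhancement and with sigma1 = 0.25 the "strong" one mentioned above.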
In one embodiment of the disclosure, each output result may be a vector of N elements, N being a positive integer, each element representing the probability value of one label. Correspondingly, when determining the pseudo label of the unlabeled data by combining the M output results, the mean of the M vectors may be computed to obtain a mean vector, and the label corresponding to the largest-valued element in the mean vector may be used as the pseudo label of the unlabeled data.
Taking a speech emotion recognition scenario as an example, suppose there are 4 emotion recognition results: angry, happy, neutral, and sad. Then N may be 4 and each output result may be a vector of 4 elements, where the 1st element represents the probability of the "angry" label (emotion category), the 2nd the probability of "happy", the 3rd the probability of "neutral", and the 4th the probability of "sad". For the 10 obtained vectors, the mean may be computed to obtain a mean vector, and the label corresponding to the largest-valued element of the mean vector (e.g., "happy") may be used as the pseudo label of the unlabeled data.
Through the processing, the pseudo label of the unmarked data can be finally determined by combining a plurality of pieces of enhanced data corresponding to the same unmarked data, so that the accuracy of the determination result is improved.
In an embodiment of the disclosure, the label corresponding to the largest-valued element in the mean vector is used as the pseudo label only in response to determining that this element is greater than a first threshold. That is, the largest-valued element of the mean vector is compared with the first threshold, and only if it exceeds the threshold is its label used as the pseudo label of the unlabeled data.
The specific value of the first threshold can be determined according to actual needs and adjusted at any time; for example, it may be 0.5 early in training and be raised to 0.9 in the middle and later stages.
By setting the first threshold value and comparing, the accuracy of the generated pseudo label can be further improved.
In one embodiment of the disclosure, the variance of the M vectors may also be computed; in response to determining that the variance is less than a second threshold, the unlabeled data and its pseudo label are retained, i.e., the pseudo label is deemed usable; otherwise, they are discarded.
The specific value of the second threshold may also be determined according to actual needs, for example, may be 0.005.
Through the processing, more accurate pseudo labels can be selected by fully utilizing the uncertainty of network prediction, and the quality of the pseudo labels is further improved.
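The selection logic above (mean vector, arg-max, first threshold on confidence, second threshold on prediction variance) might be sketched as follows. Note the text does not specify whether the variance is taken per element or only for the winning label; taking it over the winning label's M probabilities is an assumption here, as are the function and parameter names:

```python
import numpy as np

def select_pseudo_label(output_vectors, tau1=0.5, tau2=0.005):
    """Combine the M model outputs for one unlabeled sample:
    average the M probability vectors, take the arg-max label, and keep it
    only if (a) the largest mean probability exceeds tau1 and (b) the
    variance of the M predictions for that label is below tau2."""
    outputs = np.asarray(output_vectors)        # shape (M, N)
    mean_vec = outputs.mean(axis=0)
    label = int(mean_vec.argmax())
    confident = mean_vec[label] > tau1          # first-threshold check
    stable = outputs[:, label].var() < tau2     # uncertainty (variance) check
    return label, bool(confident and stable)
```

The second return value says whether the sample and its pseudo label should be retained or discarded.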
After the first-class training data are obtained, the global model can be trained with the first-class training data and with the second-class training data carrying manually annotated labels to obtain an updated model.
In an embodiment of the disclosure, random noise superposition may also be performed on the second-class training data before training the global model; superimposing random noise improves the robustness of the model and its adaptability to complex environments.
In one embodiment of the disclosure, the noise superimposed on the first-class training data may be stronger than that superimposed on the second-class training data, and/or the amount of first-class training data may be smaller than that of second-class training data.
That is, weak enhancement may be applied to the second-class training data and strong enhancement to the first-class training data; as mentioned above, the strength of the enhancement can be adjusted via σ1 (e.g., weak: 0.1, strong: 0.25), further improving model robustness. In addition, the amount of first-class training data is typically smaller than that of second-class training data; for example, its share of all training data typically does not exceed 20%, and may gradually increase up to 20% as the number of training rounds increases. Because the annotations of the second-class training data are more accurate, their share can be larger, improving the training effect; as the performance of the global model improves over training rounds, the pseudo labels become more and more accurate, so the share of first-class training data can be appropriately increased, allowing the model to be trained with more unlabeled data.
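As one hypothetical reading of the schedule above, the share of pseudo-labeled data could grow linearly with the training round and be capped at 20%; only the 20% cap comes from the text, the linear form and the names below are assumptions:

```python
def pseudo_ratio(round_idx, total_rounds, cap=0.20):
    """Fraction of pseudo-labeled (first-class) data in the training mix:
    grows with the round number and never exceeds the 20% cap."""
    return min(cap, cap * (round_idx + 1) / total_rounds)

# Enhancement strengths from the text: weak for manually labeled data,
# strong for pseudo-labeled data.
SIGMA_WEAK, SIGMA_STRONG = 0.1, 0.25
```

For example, with 10 rounds the share starts at 2% and reaches the cap in the final round, then stays there.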
Taking a speech emotion recognition scenario as an example, when any training data is input to the global model, a certain amount of context may be provided: for example, the training data corresponding to the 4 adjacent frames of speech data (2 before and 2 after) are input together, improving the accuracy of the recognition result.
In an embodiment of the disclosure, when training the global model with the first-class and second-class training data, the device end may further update the model parameters of the global model according to global control variables obtained from the cloud end.
In one embodiment of the disclosure, the global control variables may include one global control variable per model parameter. For any model parameter, the corresponding global control variable may be the mean of that parameter's update amounts across the updated models most recently obtained by the cloud end. Correspondingly, updating the model parameters of the global model according to the global control variables obtained from the cloud end may include: for any model parameter, obtaining the difference between that parameter's update amount in the updated model most recently generated at the device end and the corresponding global control variable, and updating the parameter according to the difference.
In the disclosed method, the global control variables can be used to guide the device ends' updates of the model parameters; that is, introducing global control variables effectively steers the training direction of each device end, making full use of the knowledge of the global model to constrain the parameter updates of a local model (i.e., a device-end model). This prevents a local model that differs greatly from the global model from deviating from the training direction of the whole system and improves the training effect.
For example, suppose 10 device ends participate in the joint training. In each training round, the cloud end may randomly select some or all of them, send the global model to the selected device ends, and simultaneously send the global control variables corresponding to the model parameters. For any model parameter, the corresponding global control variable may be the mean of that parameter's update amounts across the updated models most recently obtained by the cloud end (i.e., the updated models returned by the previously selected device ends), where an update amount is the change relative to the pre-update value.
For a device end, when updating the global model, for any model parameter the difference between that parameter's update amount in the most recently generated updated model and the corresponding global control variable may be obtained, and the parameter may be updated according to the difference; that is, the difference is introduced into the parameter update to prevent the update direction from deviating from the global optimum, for example by increasing or decreasing the original update amplitude based on the difference.
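The device-side correction can be sketched as follows. The text only says the parameter is updated "according to the difference", so subtracting the drift from the step, in the spirit of SCAFFOLD-style control variates, is one plausible form rather than the patent's definitive rule; all names are illustrative:

```python
import numpy as np

def local_update_with_control(param, local_step, local_update_amount, global_control):
    """Correct a device-side parameter step using the cloud's global control
    variable: the difference between this device's latest update amount and
    the global mean update amount (the 'drift') nudges the step back toward
    the global optimisation direction."""
    drift = local_update_amount - global_control   # device deviation from the global trend
    return param + local_step - drift
```

When a device's recent updates match the global mean, the drift is zero and the step is unchanged; the further they diverge, the stronger the correction.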
A device end can return the generated updated model to the cloud end; the cloud end can then update the global model by combining the updated models it obtains, i.e., comprehensively balancing the updated model parameters from the device ends, and can also update the global control variables corresponding to the model parameters. Further, the cloud end can determine whether the end condition is met; if so, processing ends, otherwise the newly obtained global model is sent to at least two device ends and the above processing is repeated.
In other words, the cloud end integrates the models of multiple device ends and updates them jointly, improving the cloud-side model's capability, and then redistributes the model to the device ends, so each device end benefits from the performance gains brought by the other device ends' data.
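The cloud-side round (averaging the returned models and refreshing the control variables as mean update amounts) might look like the sketch below, treating a model as a single parameter array for simplicity; the exact aggregation rule is not spelled out in the text, so plain averaging is an assumption:

```python
import numpy as np

def cloud_aggregate(global_params, device_params):
    """Cloud-side round: average the updated parameters returned by the
    selected devices to form the new global model, and recompute each global
    control variable as the mean update amount across those devices."""
    device_params = [np.asarray(p, dtype=float) for p in device_params]
    new_global = np.mean(device_params, axis=0)
    update_amounts = [p - global_params for p in device_params]  # change vs. pre-update value
    control = np.mean(update_amounts, axis=0)
    return new_global, control
```

The returned control array is what would be sent to the device ends selected in the next round.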
In practical application, if a device end determines that its global model has converged, it can actively exit, and the cloud end will no longer send the global model to it. For example, a device end may use 80% of all its training data for each training run and reserve the remaining 20% for verifying the model's effect, i.e., for checking whether the trained model has converged. In addition, the unlabeled data can be reused: each time a device end acquires a new global model, data that was pseudo-labeled before can be labeled again.
In one embodiment of the disclosure, the global model may include P blocks, an attention module, and a regression output module, P being a positive integer. Each block performs separate convolution operations in the time domain and the frequency domain and outputs the two results after concatenation; the attention module processes the output of the last block based on a time-frequency-domain attention mechanism and/or a channel attention mechanism; and the regression output module generates the output of the global model based on the attention module's output and a predetermined constant factor.
The specific value of P may be determined according to actual needs, and may be, for example, 4.
Fig. 2 is a schematic structural diagram of the global model according to the present disclosure. As shown in fig. 2, the first half of the model may consist of 4 blocks; each block may perform two independent convolution operations, one in the time (Temporal) dimension and one in the frequency (Spectral) dimension, and output the two results after concatenation, so that fine-grained characteristics can be captured in each dimension and high-level characteristics learned from their shared output.
Fig. 3 is a schematic diagram of the processing in each block according to the present disclosure. As shown in fig. 3, the processing in the time-domain and frequency-domain dimensions may each include: convolution (Conv), group normalization, and a Rectified Linear Unit (ReLU); the combined output of the two dimensions may then be passed through, in order: convolution, group normalization, a rectified linear unit, max pooling (MaxPooling), and spatial dropout.
In addition, the attention module shown in fig. 2 may preferably employ a time-frequency-domain attention mechanism and a channel attention mechanism simultaneously. Taking a speech emotion recognition scenario as an example, the time-frequency-domain attention mechanism can capture prosodic and spectral features such as rhythm, pitch, intonation, formants, and harmonics, while the channel attention mechanism can discover the mutual influence between different convolution channels; emotional characteristics are sparsely distributed in speech and need to be captured from multiple angles to be captured accurately.
When calculating the regression (softmax) output, the regression output module shown in fig. 2 may incorporate a constant factor T (usually taking the value 2) to make the output of the model more stable, i.e., each e^* involved in the calculation is modified to e^(*/T), where * denotes arbitrary content.
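The temperature trick in the regression output module can be written directly; dividing the logits by T before exponentiation is equivalent to replacing e^* with e^(*/T). A minimal sketch, with illustrative names:

```python
import numpy as np

def softmax_with_temperature(logits, T=2.0):
    """Regression-output module: replace e^z with e^(z/T) (T usually 2),
    which flattens the distribution and stabilises the model's output."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()                 # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()
```

A higher T softens the peak of the distribution while leaving the arg-max label unchanged.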
Fig. 4 is a flowchart of a second embodiment of the model obtaining method according to the present disclosure. As shown in fig. 4, the following detailed implementation is included.
In step 401, a global model obtained by pre-training is obtained.
In step 402, the following first processing is performed: sending the global model to at least two device ends and obtaining the updated models returned by the device ends, wherein an updated model is a model obtained after a device end trains the global model with first-class training data and second-class training data, the first-class training data being training data carrying pseudo labels generated for unlabeled data using the global model, and the second-class training data being training data carrying manually annotated labels; updating the global model by combining the obtained updated models; and, in response to determining that a predetermined end condition is met, taking the most recently obtained global model as the finally required model, otherwise repeating the first processing based on the most recently obtained global model.
By adopting the scheme of this method embodiment, the unlabeled data at each device end can be labeled using the capability of the global model, thereby expanding the training data, and the global model can be updated by combining the model update results of multiple device ends, i.e. a joint training mode can be adopted, thereby improving the precision of the model.
In practical applications, the execution subject in the embodiment shown in fig. 4 may be a cloud.
In an embodiment of the disclosure, the cloud may further send global control variables to the device ends, for a device end to update the model parameters of the global model according to the global control variables, thereby obtaining the updated model.
In one embodiment of the present disclosure, the global control variables may include: global control variables respectively corresponding to the model parameters. Correspondingly, for any model parameter, the mean value of the update amounts of that model parameter in the most recently obtained updated models may be calculated and taken as the global control variable corresponding to the model parameter, for use by the device end in updating the model parameter according to its corresponding global control variable.
The cloud can obtain an initial global model through pre-training; since the initial global model does not meet the end condition, it may be sent to at least two device ends. For example, assuming that 10 device ends participate in the joint training, 5 of them may be randomly selected, and the initial global model may be sent to each of the 5 device ends. In addition, the global control variable corresponding to each model parameter may be sent to the 5 device ends; initially each global control variable may be 0, in which case the global control variables need not actually be sent, so as to save resources. Each device end may then use the received global model to generate pseudo labels for its unlabeled data, obtaining first-class training data, train the global model using the first-class and second-class training data to obtain an updated model, and return the obtained updated model to the cloud.
The cloud can then update the global model by combining the obtained updated models, and can update the global control variables corresponding to the model parameters. If the end condition is still not met, 6 device ends may, for example, be randomly selected from the 10 device ends participating in the joint training, and the latest obtained global model, together with the updated global control variables corresponding to the model parameters, may be sent to each of the 6 device ends. Each of these device ends may use the received global model to generate pseudo labels for its unlabeled data, obtaining first-class training data, train the global model using the first-class and second-class training data to obtain an updated model (the received global control variables guiding the update of the model parameters), and return the obtained updated model to the cloud. The above process may then be repeated until the end condition is met.
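The cloud-side aggregation described above can be sketched as follows, assuming for illustration that models are dictionaries of numpy arrays; the averaging rule and all names are assumptions, not the disclosed implementation:

```python
import numpy as np

def aggregate_round(global_params, returned_models):
    """One cloud-side round: combine the updated models returned by the
    selected device ends into a new global model, and recompute each
    global control variable as the mean update amount of the
    corresponding model parameter across those updated models."""
    new_global, control_variables = {}, {}
    for name, old_value in global_params.items():
        update_amounts = [m[name] - old_value for m in returned_models]
        mean_update = np.mean(update_amounts, axis=0)
        control_variables[name] = mean_update       # per-parameter control variable
        new_global[name] = old_value + mean_update  # combined global parameter
    return new_global, control_variables
```

The outer loop would repeat this round, reselecting device ends each time, until the end condition is met.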
For a device end, if it determines that its global model has converged, it may actively quit; the cloud will subsequently no longer send the global model to a device end that has quit. Correspondingly, if it is determined that all or most of the device ends participating in the joint training have converged, the cloud may consider the end condition met, and may then take the latest obtained global model as the finally required model.
Taking a speech emotion recognition scenario as an example: when speech emotion recognition is subsequently needed at a device end, the global model finally obtained by the cloud may be adopted, or, if preferred, the global model the device end itself trained to convergence may be adopted.
It is noted that, while for simplicity of explanation the foregoing method embodiments are described as a series of acts, those skilled in the art will appreciate that the present disclosure is not limited by the order of acts described, as in accordance with the present disclosure some steps may occur in other orders or concurrently. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments, and that the acts and modules involved are not necessarily required by the disclosure. In addition, for parts not described in detail in a certain embodiment, reference may be made to the relevant descriptions in other embodiments.
In short, according to the scheme of the embodiment of the method, the accuracy of the obtained model can be improved, and the accuracy of the processing result of the processing (such as speech emotion recognition) based on the model can be improved.
The above is a description of embodiments of the method, and the embodiments of the apparatus are described below to further illustrate the aspects of the disclosure.
Fig. 5 is a schematic structural diagram illustrating a first embodiment 500 of a model obtaining apparatus according to the present disclosure. The device can be applied to the equipment side. As shown in fig. 5, includes: a first obtaining module 501, a generating module 502, a training module 503, and a sending module 504.
The first obtaining module 501 is configured to obtain a global model obtained by the cloud end most recently, where the global model is sent to at least two device ends when the cloud end determines that a predetermined end condition is not met.
A generating module 502, configured to generate a pseudo label for the unlabeled data by using the global model, so as to obtain a first class of training data with the pseudo label.
The training module 503 is configured to train the global model by using the first type of training data and the second type of training data with the artificial labeling labels, so as to obtain an updated model.
And a sending module 504, configured to return the update model to the cloud, and use the cloud to update the global model in combination with the obtained update models.
By adopting the scheme of this apparatus embodiment, the unlabeled data at each device end can be labeled using the capability of the global model, thereby expanding the training data, and the global model can be updated by combining the model update results of multiple device ends, i.e. a joint training mode can be adopted, thereby improving the precision of the model, for example a speech emotion recognition model.
In an embodiment of the present disclosure, for any unlabeled data, the generating module 502 may obtain M pieces of enhanced data, where M is a positive integer greater than one and each piece of enhanced data is obtained by superimposing random noise on the unlabeled data. The M pieces of enhanced data may be used respectively as inputs of the global model to obtain M output results, and the pseudo label of the unlabeled data may then be determined by combining the M output results.
Wherein the random noise superposition, i.e. the random feature enhancement, can be achieved by introducing coefficient factors.
In one embodiment of the present disclosure, each output result may include: a vector consisting of N elements, N being a positive integer greater than one, each element representing the probability value of belonging to the corresponding label. Accordingly, when determining the pseudo label of the unlabeled data by combining the M output results, the generation module 502 may calculate the mean of the M vectors to obtain a mean vector, and may take the label corresponding to the element with the largest value in the mean vector as the pseudo label of the unlabeled data.
In an embodiment of the present disclosure, the generating module 502 may take the label corresponding to the element with the largest value in the mean vector as the pseudo label of the unlabeled data only in response to determining that this element is greater than a first threshold. That is, the largest element in the mean vector is compared with the first threshold, and only if it exceeds the first threshold is its corresponding label used as the pseudo label of the unlabeled data.
In one embodiment of the present disclosure, the generation module 502 may also calculate the variance of the M vectors; in response to determining that the variance is less than a second threshold, the unlabeled data and the corresponding pseudo label may be retained, i.e. the pseudo label is determined to be usable; otherwise, the unlabeled data and the corresponding pseudo label may be discarded.
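The pseudo-labelling steps above (M noisy copies, mean vector, confidence threshold, variance filter) can be sketched as follows; the noise scale, threshold values and all names are illustrative assumptions:

```python
import numpy as np

def generate_pseudo_label(sample, model, M=4, noise_scale=0.1,
                          prob_threshold=0.7, var_threshold=0.05,
                          rng=None):
    """Pseudo-labelling sketch for one piece of unlabeled data.

    `model` maps a feature vector to an N-element probability vector.
    M noisy copies of the sample are scored; the mean vector's largest
    element decides the pseudo label, which is kept only if (a) that
    element exceeds the first threshold and (b) the variance across the
    M vectors is below the second threshold; otherwise the sample and
    its pseudo label are discarded (returns None)."""
    if rng is None:
        rng = np.random.default_rng(0)
    outputs = []
    for _ in range(M):
        noisy = sample + rng.normal(0.0, noise_scale, size=sample.shape)
        outputs.append(model(noisy))          # N-dim probability vector
    outputs = np.stack(outputs)               # shape (M, N)
    mean_vec = outputs.mean(axis=0)
    label = int(mean_vec.argmax())
    confident = mean_vec[label] > prob_threshold
    stable = outputs.var(axis=0).mean() < var_threshold
    return label if (confident and stable) else None
```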
After the first type of training data is obtained, the training module 503 may train the obtained global model by using the first type of training data and the second type of training data with the artificial labeling labels, so as to obtain an updated model.
In one embodiment of the present disclosure, the training module 503 may also perform random noise superposition on the second class of training data before training the global model.
In one embodiment of the present disclosure, the noise superimposed for the first type of training data may be stronger than the noise superimposed for the second type of training data, and/or the number of the first type of training data may be smaller than the number of the second type of training data.
In an embodiment of the disclosure, when the training module 503 trains the global model by using the first type of training data and the second type of training data, the model parameters of the global model may be updated according to the global control variables obtained from the cloud.
In one embodiment of the present disclosure, the global control variables may include: global control variables respectively corresponding to the model parameters. For any model parameter, the corresponding global control variable may be the mean value of the update amounts of that model parameter in the updated models most recently obtained by the cloud. Accordingly, updating the model parameters of the global model according to the global control variables obtained from the cloud may include: for any model parameter, obtaining the difference between the update amount of that model parameter in the updated model most recently generated at the device end and the global control variable corresponding to the model parameter, and updating the model parameter according to the difference.
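A literal, minimal reading of this per-parameter update might look as follows; the exact rule for applying the difference is not spelled out in the disclosure, so this sketch and its names are assumptions:

```python
import numpy as np

def apply_corrected_updates(params, local_updates, global_controls):
    """Device-side update sketch: for each model parameter, take the
    difference between the update amount in the most recently generated
    local updated model and the cloud's global control variable for that
    parameter, and update the parameter according to that difference
    (here simply by adding it; the precise rule is an assumption)."""
    new_params = {}
    for name, value in params.items():
        diff = local_updates[name] - global_controls[name]
        new_params[name] = value + diff   # update according to the difference
    return new_params
```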
In addition, in one embodiment of the present disclosure, the global model may include: P blocks, an attention module and a regression output module, P being a positive integer. Each block performs convolution operations in the time domain and in the frequency domain respectively, and outputs the two operation results after splicing; the attention module processes the output of the last block based on a time-frequency domain attention mechanism and/or a channel attention mechanism; and the regression output module generates the output of the global model based on the output of the attention module and a preset constant factor.
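The block / attention / regression-output structure above can be illustrated with a framework-free sketch; the kernel sizes, the channel-attention rule, and all names are assumptions for illustration only, not the patented implementation:

```python
import numpy as np

def tf_block(x, k_time, k_freq):
    """One of the P blocks: a convolution along the time axis and a
    convolution along the frequency axis of a (freq, time) input, the
    two operation results spliced together as two channels."""
    conv_t = np.apply_along_axis(lambda row: np.convolve(row, k_time, "same"), 1, x)
    conv_f = np.apply_along_axis(lambda col: np.convolve(col, k_freq, "same"), 0, x)
    return np.stack([conv_t, conv_f])               # shape (2, freq, time)

def channel_attention(feats):
    """Channel attention: weight each channel by a softmax over its
    global average, modelling the mutual influence between channels."""
    scores = feats.mean(axis=(1, 2))
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return (feats * w[:, None, None]).sum(axis=0)   # shape (freq, time)

def regression_output(feats, weights, T=2.0):
    """Regression (softmax) output incorporating the constant factor T."""
    logits = weights @ feats.ravel()
    z = logits / T
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()
```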
Fig. 6 is a schematic structural diagram illustrating a second embodiment 600 of a model obtaining apparatus according to the present disclosure. The device can be applied to the cloud. As shown in fig. 6, includes: a second obtaining module 601 and an updating module 602.
The second obtaining module 601 is configured to obtain a global model obtained by pre-training.
An update module 602, configured to perform the following first processing: sending the global model to at least two equipment ends, acquiring an updated model returned by the equipment ends, wherein the updated model is a model obtained by training the global model by the equipment ends by using first-class training data and second-class training data, the first-class training data are training data with pseudo labels, the pseudo labels are labels generated for unlabeled data by using the global model, and the second-class training data are training data with artificially labeled labels; updating the global model by combining the obtained updating models; in response to determining that a predetermined termination condition is met, the latest obtained global model is taken as a finally required model, otherwise, the first processing is repeatedly executed based on the latest obtained global model.
By adopting the scheme of this apparatus embodiment, the unlabeled data at each device end can be labeled using the capability of the global model, thereby expanding the training data, and the global model can be updated by combining the model update results of multiple device ends, i.e. a joint training mode can be adopted, thereby improving the precision of the model, for example a speech emotion recognition model.
In an embodiment of the present disclosure, the updating module 602 may further send a global control variable to the device side, so that the device side updates the model parameter of the global model according to the global control variable to obtain the updated model.
In one embodiment of the present disclosure, the global control variables may include: global control variables respectively corresponding to the model parameters. Correspondingly, for any model parameter, the updating module 602 may calculate the mean value of the update amounts of that model parameter in the most recently obtained updated models, and may take the mean value as the global control variable corresponding to the model parameter, for use by the device end in updating the model parameter according to its corresponding global control variable.
Fig. 7 is a schematic diagram illustrating a structure of a model obtaining system 700 according to an embodiment of the present disclosure. As shown in fig. 7, includes: a first model acquiring means 701 and a second model acquiring means 702.
Here, the first model obtaining device 701 may be the model obtaining device shown in fig. 5, and the second model obtaining device 702 may be the model obtaining device shown in fig. 6.
The specific work flow of the above device and system embodiments may refer to the related description in the foregoing method embodiments, and is not repeated.
In short, the schemes of the apparatus and system embodiments of the present disclosure can improve the accuracy of the obtained model and the accuracy of the processing results of processing (such as speech emotion recognition) performed based on the model. In addition, the model in these schemes is not limited to a speech emotion recognition model; it can be any other model, so the schemes have wide applicability.
The scheme of the present disclosure can be applied in the field of artificial intelligence, in particular to fields such as deep learning and natural language processing. Artificial intelligence is the subject of studying how to make computers simulate certain human thinking processes and intelligent behaviors (such as learning, reasoning, thinking and planning), and covers both hardware and software technologies. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, and the like; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, knowledge graph technologies, and the like.
The data and the like in the embodiments of the present disclosure are not specific to a particular user, and cannot reflect personal information of a particular user. In the technical scheme of the disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other processing of the personal information of the related user are all in accordance with the regulations of related laws and regulations and do not violate the good customs of the public order.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 8 shows a schematic block diagram of an electronic device 800 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the device 800 includes a computing unit 801, which can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. The RAM 803 can also store the various programs and data required for the operation of the device 800. The computing unit 801, the ROM 802 and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
A number of components in the device 800 are connected to the I/O interface 805, including: an input unit 806 such as a keyboard, a mouse, or the like; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, or the like; and a communication unit 809 such as a network card, modem, wireless communication transceiver, etc. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 801 may be any of various general and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and the like. The computing unit 801 performs the various methods and processes described above, such as the methods described in this disclosure. For example, in some embodiments, the methods described in this disclosure may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 800 via the ROM 802 and/or the communication unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the methods described in the present disclosure may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured by any other suitable means (e.g., by means of firmware) to perform the methods described by the present disclosure.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server with a combined blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (30)

1. A model acquisition method, comprising:
obtaining a global model most recently obtained by a cloud end, the global model being sent to at least two equipment ends when the cloud end determines that a predetermined end condition is not met;
generating a pseudo label for the data which are not marked by the global model to obtain first class training data with the pseudo label;
training the global model by using the first type of training data and second type of training data with artificial labeling labels to obtain an updated model;
and returning the updating model to the cloud end, and using the cloud end to update the global model by combining the obtained updating models.
2. The method of claim 1, wherein said generating pseudo-labels for unlabeled data using the global model comprises:
respectively acquiring M pieces of enhanced data aiming at any unmarked data, wherein M is a positive integer greater than one, and each piece of enhanced data is data obtained by performing random noise superposition on the unmarked data;
respectively taking the M pieces of enhanced data as the input of the global model to obtain M output results;
and determining the pseudo label of the unmarked data by combining the M output results.
3. The method of claim 2, wherein,
the output result comprises: a vector consisting of N elements, N being a positive integer greater than one, each element representing a probability value belonging to a corresponding label, respectively;
the determining the pseudo label of the unlabeled data according to the M output results includes: and calculating the mean value of the M vectors to obtain a mean value vector, and taking the label corresponding to the element with the largest value in the mean value vector as the pseudo label of the unmarked data.
4. The method of claim 3, wherein the step of using the label corresponding to the element with the largest value in the mean vector as the pseudo label of the unlabeled data comprises:
and in response to the fact that the element with the largest value in the mean vector is larger than a first threshold value, taking a label corresponding to the element with the largest value in the mean vector as a pseudo label of the un-labeled data.
5. The method of claim 3, further comprising:
after the label corresponding to the element with the largest value in the mean vector is used as the pseudo label of the un-labeled data, calculating the variance of the M vectors;
in response to determining that the variance is less than a second threshold, retaining the unlabeled data and the pseudo-label, otherwise, discarding the unlabeled data and the pseudo-label.
6. The method of claim 2, further comprising:
and before the global model is trained, performing random noise superposition on the second class of training data.
7. The method of claim 6, wherein,
the noise superimposed for the first type of training data is stronger than the noise superimposed for the second type of training data, and/or the number of the first type of training data is smaller than the number of the second type of training data.
8. The method of any one of claims 1-7,
the training the global model comprises: and updating the model parameters of the global model according to the global control variables obtained from the cloud.
9. The method of claim 8, wherein,
the global control variables include: the model parameters respectively correspond to global control variables; for any model parameter, the global control variable corresponding to the model parameter is the mean value of the update quantity of the model parameter in each update model which is obtained by the cloud at the latest time;
the updating of the model parameters of the global model according to the global control variables obtained from the cloud comprises: and aiming at any model parameter, respectively obtaining the difference between the update quantity of the model parameter in the updated model generated at the latest time and the global control variable corresponding to the model parameter, and updating the model parameter according to the difference.
10. The method of any one of claims 1-7,
the global model comprises: p blocks, an attention module and a regression output module, wherein P is a positive integer; each block is used for performing convolution operation of a time domain and a frequency domain respectively, and outputting two operation results after splicing, the attention module is used for processing the output result of the last block based on a time-frequency domain attention mechanism and/or a channel attention mechanism, and the regression output module is used for generating the output result of the global model based on the output result of the attention module and a preset constant factor.
11. A model acquisition method, comprising:
obtaining a global model obtained by pre-training, and executing the following first processing:
sending the global model to at least two equipment ends, and obtaining an updated model returned by the equipment ends, wherein the updated model is obtained after the equipment ends train the global model by using first-class training data and second-class training data, the first-class training data are training data with pseudo labels, the pseudo labels are labels generated for unlabeled data by using the global model, and the second-class training data are training data with artificially labeled labels;
updating the global model by combining the obtained updating models;
in response to determining that a predetermined termination condition is met, the latest global model is taken as a finally required model, otherwise, the first process is repeatedly executed based on the latest global model.
12. The method of claim 11, further comprising:
and sending a global control variable to the equipment end, wherein the global control variable is used for updating the model parameters of the global model by the equipment end according to the global control variable to obtain the updated model.
13. The method of claim 12, wherein,
the global control variables include: the model parameters respectively correspond to global control variables;
the method further comprises the following steps: and respectively obtaining the mean value of the update quantity of the model parameter in each update model obtained at the latest time aiming at any model parameter, taking the mean value as the global control variable corresponding to the model parameter, and using the mean value as the global control variable corresponding to the model parameter to update the model parameter by the equipment end according to the global control variable corresponding to the model parameter.
14. A model acquisition apparatus comprising: the device comprises a first acquisition module, a generation module, a training module and a sending module;
the first acquisition module is used for acquiring the latest global model obtained by a cloud end, the global model being sent to at least two equipment ends when the cloud end determines that a predetermined termination condition is not met;
the generation module is used for generating pseudo labels for unlabeled data by using the global model, to obtain first-type training data with pseudo labels;
the training module is used for training the global model by using the first-type training data and second-type training data with manually annotated labels, to obtain an updated model;
and the sending module is used for returning the updated model to the cloud end, for the cloud end to update the global model by combining the obtained updated models.
15. The apparatus of claim 14, wherein,
for any piece of unlabeled data, the generation module acquires M pieces of augmented data, M being a positive integer greater than one, each piece of augmented data being obtained by superimposing random noise on the unlabeled data; the M pieces of augmented data are respectively used as input to the global model to obtain M output results, and the pseudo label of the unlabeled data is determined by combining the M output results.
16. The apparatus of claim 15, wherein,
the output result includes: a vector consisting of N elements, N being a positive integer greater than one, each element representing the probability value of belonging to the corresponding label;
the generation module calculates the mean of the M vectors to obtain a mean vector, and takes the label corresponding to the largest element of the mean vector as the pseudo label of the unlabeled data.
17. The apparatus of claim 16, wherein,
in response to determining that the largest element of the mean vector is greater than a first threshold, the generation module takes the label corresponding to that element as the pseudo label of the unlabeled data.
18. The apparatus of claim 16, wherein,
the generation module is further configured to calculate the variance of the M vectors, retain the unlabeled data and its pseudo label in response to determining that the variance is less than a second threshold, and otherwise discard them.
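Claims 15-18 together describe one pseudo-labeling pipeline: make M noisy copies of an unlabeled sample, average the M probability vectors from the global model, accept the arg-max label only if the largest mean element clears a first (confidence) threshold and the variance stays below a second (stability) threshold. A sketch under assumed threshold and noise values:

```python
import numpy as np

def pseudo_label(x, model, m=5, tau1=0.7, tau2=0.05, noise=0.1, rng=None):
    """Return a pseudo-label index for unlabeled data x, or None if the
    sample is discarded. tau1, tau2, noise and m are illustrative values;
    `model` maps an input to an N-element probability vector."""
    if rng is None:
        rng = np.random.default_rng(0)
    # M augmented copies, each with random noise superimposed
    outs = np.stack([model(x + noise * rng.normal(size=x.shape))
                     for _ in range(m)])
    mean_vec = outs.mean(axis=0)               # mean of the M vectors
    if mean_vec.max() <= tau1:                 # first threshold: confidence
        return None                            # discard: not confident enough
    if outs.var(axis=0).max() >= tau2:         # second threshold: stability
        return None                            # discard: predictions too noisy
    return int(mean_vec.argmax())              # arg-max label as pseudo label
```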
19. The apparatus of claim 15, wherein,
the training module is further configured to superimpose random noise on the second-type training data before training the global model.
20. The apparatus of claim 19, wherein,
the noise superimposed on the first-type training data is stronger than the noise superimposed on the second-type training data, and/or the number of first-type training data is smaller than the number of second-type training data.
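Claims 19-20 constrain the device-side batch: noise on both kinds of data, stronger on the pseudo-labeled (first-type) samples, and first-type samples in the minority. The noise scales and the minority rule below are assumptions for illustration:

```python
import numpy as np

def build_batch(pseudo, labeled, rng=None, s1=0.2, s2=0.05):
    """Assemble noisy training inputs from (x, label) pairs. s1 > s2
    makes the first-type noise stronger, and n1 keeps first-type samples
    fewer than second-type ones; both choices are assumptions."""
    if rng is None:
        rng = np.random.default_rng(0)
    n1 = min(len(pseudo), len(labeled) - 1)      # first type stays the minority
    x1 = [x + s1 * rng.normal(size=x.shape) for x, _ in pseudo[:n1]]
    x2 = [x + s2 * rng.normal(size=x.shape) for x, _ in labeled]
    return x1 + x2

# toy usage: 3 pseudo-labeled and 4 manually labeled samples
batch = build_batch([(np.zeros(2), 0)] * 3, [(np.zeros(2), 1)] * 4)
```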
21. The apparatus of any one of claims 14-20,
the training module updates the model parameters of the global model according to the global control variable obtained from the cloud end.
22. The apparatus of claim 21, wherein,
the global control variables include: a global control variable corresponding to each of the model parameters; for any model parameter, the global control variable corresponding to the model parameter is the mean value of the update amounts of the model parameter in the update models obtained most recently by the cloud end;
for any model parameter, the training module obtains the difference between the update amount of the model parameter in the updated model generated most recently and the global control variable corresponding to the model parameter, and updates the model parameter according to the difference.
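Claim 22 only says the parameter is updated "according to the difference" between the device's latest update amount and the global control variable; applying that difference as a correction to the global value (with an assumed step size) is one plausible reading, sketched here:

```python
import numpy as np

def device_update(global_w, local_w, c_global, lr=1.0):
    """Correct each parameter by how much the device's latest update
    deviates from the global control variable. The exact rule and the
    step size lr are assumptions -- the claim only specifies that the
    update uses the difference."""
    local_delta = local_w - global_w           # device's latest update amount
    correction = local_delta - c_global        # difference vs. control variable
    return global_w + lr * correction

# toy usage: one parameter vector
w = device_update(np.array([0.0, 0.0]),        # global model parameters
                  np.array([1.0, 2.0]),        # device's updated parameters
                  np.array([0.5, 0.5]))        # global control variables
```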
23. The apparatus of any one of claims 14-20,
the global model comprises: P blocks, an attention module and a regression output module, wherein P is a positive integer; each block is used for performing a time-domain convolution operation and a frequency-domain convolution operation respectively and outputting the two operation results after splicing; the attention module is used for processing the output result of the last block based on a time-frequency-domain attention mechanism and/or a channel attention mechanism; and the regression output module is used for generating the output result of the global model based on the output result of the attention module and a preset constant factor.
24. A model acquisition apparatus comprising: a second obtaining module and an updating module;
the second obtaining module is used for obtaining a global model obtained by pre-training;
the update module is configured to perform the following first processing: sending the global model to at least two equipment ends, and obtaining updated models returned by the equipment ends, wherein each updated model is obtained after the equipment end trains the global model using first-type training data and second-type training data, the first-type training data being training data with pseudo labels, the pseudo labels being labels generated for unlabeled data by using the global model, and the second-type training data being training data with manually annotated labels; updating the global model by combining the obtained updated models; and in response to determining that a predetermined termination condition is met, taking the latest global model as the final model, otherwise repeatedly executing the first processing based on the latest global model.
25. The apparatus of claim 24, wherein,
the updating module is further configured to send a global control variable to the equipment end, the global control variable being used by the equipment end to update the model parameters of the global model to obtain the updated model.
26. The apparatus of claim 25, wherein,
the global control variables include: a global control variable corresponding to each of the model parameters;
the updating module is further configured to, for any model parameter, obtain the mean value of the update amounts of the model parameter in the update models obtained most recently, and take the mean value as the global control variable corresponding to the model parameter, for the equipment end to update the model parameter according to the global control variable corresponding to the model parameter.
27. A model acquisition system, comprising: the apparatus of any one of claims 14-23, and the apparatus of any one of claims 24-26.
28. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-13.
29. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-13.
30. A computer program product comprising a computer program/instructions which, when executed by a processor, implements the method of any one of claims 1-13.
CN202210423425.8A 2022-04-21 2022-04-21 Model obtaining method, device, system, electronic equipment and storage medium Withdrawn CN114999532A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210423425.8A CN114999532A (en) 2022-04-21 2022-04-21 Model obtaining method, device, system, electronic equipment and storage medium


Publications (1)

Publication Number Publication Date
CN114999532A true CN114999532A (en) 2022-09-02

Family

ID=83024599


Country Status (1)

Country Link
CN (1) CN114999532A (en)

Similar Documents

Publication Publication Date Title
US11195521B2 (en) Generating target sequences from input sequences using partial conditioning
TWI767000B (en) Method and computer storage medium of generating waveform
CN105513589B (en) Speech recognition method and device
CN109545192B (en) Method and apparatus for generating a model
CN112466288B (en) Voice recognition method and device, electronic equipment and storage medium
CN109545193B (en) Method and apparatus for generating a model
US11922281B2 (en) Training machine learning models using teacher annealing
CN113326767A (en) Video recognition model training method, device, equipment and storage medium
US20210089909A1 (en) High fidelity speech synthesis with adversarial networks
CN109697978B (en) Method and apparatus for generating a model
CN114186681A (en) Method, apparatus and computer program product for generating model clusters
CN112949818A (en) Model distillation method, device, equipment and storage medium
CN112634413A (en) Method, apparatus, device and storage medium for generating model and generating 3D animation
CN114141236B (en) Language model updating method and device, electronic equipment and storage medium
CN114999532A (en) Model obtaining method, device, system, electronic equipment and storage medium
CN115080739A (en) Method for training dialogue model and method for outputting response information
CN113361574A (en) Training method and device of data processing model, electronic equipment and storage medium
CN113408632A (en) Method and device for improving image classification accuracy, electronic equipment and storage medium
CN115169549B (en) Artificial intelligent model updating method and device, electronic equipment and storage medium
CN116416500B (en) Image recognition model training method, image recognition device and electronic equipment
CN113963433B (en) Motion search method, motion search device, electronic equipment and storage medium
CN116797829B (en) Model generation method, image classification method, device, equipment and medium
CN109658920B (en) Method and apparatus for generating a model
CN113255332B (en) Training and text error correction method and device for text error correction model
WO2022051548A1 (en) Conditional output generation through data density gradient estimation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20220902