CN112927714B - Data processing method and device - Google Patents

Data processing method and device

Info

Publication number
CN112927714B
Authority
CN
China
Prior art keywords
low
dimensional
voice
feature
mapping
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110098039.1A
Other languages
Chinese (zh)
Other versions
CN112927714A (en)
Inventor
陈颖 (Chen Ying)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Music Entertainment Technology Shenzhen Co Ltd
Original Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority to CN202110098039.1A
Publication of CN112927714A
Application granted
Publication of CN112927714B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/12: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being prediction coefficients
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use, for comparison or discrimination
    • G10L25/63: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for comparison or discrimination, for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the application discloses a data processing method and device. The method comprises the following steps: acquiring a first voice sample from a training voice set and a second voice sample from a target voice set; obtaining, by principal component analysis, a first low-dimensional feature corresponding to the first voice sample and a second low-dimensional feature corresponding to the second voice sample; mapping the second low-dimensional feature to the first low-dimensional space corresponding to the first low-dimensional feature to generate a second mapping feature of the second low-dimensional feature in the first low-dimensional space; generating, according to the second mapping feature and the second low-dimensional feature, a first mapping feature of the first low-dimensional feature in the second low-dimensional space; and generating an emotion recognition model according to the first mapping feature, the emotion recognition model being used to predict the emotion type of speech to be recognized whose language type is the same as that of the second voice sample. The method and device save resources and development cost and improve the accuracy of speech emotion classification.

Description

Data processing method and device
Technical Field
The present application relates to the field of internet technologies, and in particular, to a data processing method and device.
Background
A speech emotion classification task requires preparing a large number of labeled speech emotion samples, training a classification model on them, and then using the trained model to classify speech emotion. In practice, however, the available labeled samples are typically Chinese emotion samples, while the speech to be recognized spans many language types. Different language types express the same emotions somewhat differently, and re-labeling emotion samples for other languages consumes a great deal of time and resources. To make reasonable use of an existing Chinese training database for cross-corpus recognition (for example, training on a Chinese emotion database to recognize the emotion types of English samples), the prior art adopts a denoising auto-encoding approach; however, auto-encoding involves hyperparameter settings and easily fails to converge, which degrades the accuracy of speech emotion classification.
Disclosure of Invention
The embodiment of the application provides a data processing method and device, which can save resources and development cost and improve the accuracy of voice emotion classification.
In one aspect, a data processing method is provided, which may include:
acquiring a first voice sample from a training voice set and acquiring a second voice sample from a target voice set; the first speech sample is a speech sample of a known emotion type; the language type corresponding to the first voice sample is different from the language type corresponding to the second voice sample;
acquiring a first low-dimensional feature corresponding to a first voice sample and a second low-dimensional feature corresponding to a second voice sample by adopting a principal component analysis method;
mapping the second low-dimensional feature to a first low-dimensional space corresponding to the first low-dimensional feature, and generating a second mapping feature of the second low-dimensional feature in the first low-dimensional space;
generating a first mapping feature corresponding to the first low-dimensional feature in a second low-dimensional space according to the second mapping feature and the second low-dimensional feature;
and generating an emotion recognition model according to the first mapping characteristics, wherein the emotion recognition model is used for predicting the emotion type of the voice to be recognized, which is the same as the language type of the second voice sample.
Optionally, the method further includes:
marking emotion types of the first voice samples of the target language types, and storing the marked first voice samples into a training voice set;
a second speech sample of the non-target language type is stored to the target speech set.
The method for obtaining the first voice sample from the training voice set and obtaining the second voice sample from the target voice set comprises the following steps:
acquiring a first voice sample from a training voice set, detecting a first language type of the first voice sample, and determining a second language type different from the first language type;
and acquiring a voice sample with a second language type from the target voice set as a second voice sample.
The method for obtaining the first low-dimensional feature corresponding to the first voice sample and the second low-dimensional feature corresponding to the second voice sample by adopting the principal component analysis method comprises the following steps:
extracting a first characteristic parameter in a first voice sample and a second characteristic parameter in a second voice sample;
performing dimensionality reduction on the first characteristic parameters by principal component analysis to obtain first low-dimensional features corresponding to the first characteristic parameters and a first low-dimensional space corresponding to the first low-dimensional features;
and performing dimensionality reduction on the second characteristic parameters by principal component analysis to obtain second low-dimensional features corresponding to the second characteristic parameters and a second low-dimensional space corresponding to the second low-dimensional features.
Generating a first mapping feature corresponding to the first low-dimensional feature in the second low-dimensional space according to the second mapping feature and the second low-dimensional feature, including:
acquiring a conversion function between the second low-dimensional space corresponding to the second low-dimensional feature and the first low-dimensional space corresponding to the first low-dimensional feature; the conversion function minimizes the difference between the second low-dimensional feature in the second low-dimensional space and the second mapping feature of the second low-dimensional feature in the first low-dimensional space;
and generating a first mapping feature corresponding to the first low-dimensional feature in the second low-dimensional space according to the conversion function.
The method for obtaining the conversion function between the second low-dimensional space corresponding to the second low-dimensional feature and the first low-dimensional space corresponding to the first low-dimensional feature comprises the following steps:
inputting the second mapping characteristic into the initial mapping model, and obtaining the output of the initial mapping model; the initial parameters of the initial mapping model include an initial conversion function;
obtaining a difference value between the output of the initial mapping model and the second low-dimensional feature, and adjusting an initial conversion function in the initial mapping model according to the difference value;
and when the difference value is smaller than the threshold value, determining the adjusted initial conversion function as a conversion function between a second low-dimensional space corresponding to the second low-dimensional feature and a first low-dimensional space corresponding to the first low-dimensional feature.
Optionally, the method further includes:
obtaining a voice to be recognized, the language type of which is the same as that of the second voice sample;
obtaining low-dimensional characteristics of the voice to be recognized by adopting a principal component analysis method;
and inputting the low-dimensional characteristics of the voice to be recognized into the emotion recognition model to obtain the emotion type of the voice to be recognized, which is output by the emotion recognition model.
In one aspect, a data processing apparatus is provided, which may include:
the sample acquisition unit is used for acquiring a first voice sample from the training voice set and acquiring a second voice sample from the target voice set; the first speech sample is a speech sample of a known emotion type; the language type corresponding to the first voice sample is different from the language type corresponding to the second voice sample;
the low-dimensional feature acquisition unit is used for acquiring a first low-dimensional feature corresponding to the first voice sample and a second low-dimensional feature corresponding to the second voice sample by adopting a principal component analysis method;
the second mapping feature acquisition unit is used for mapping the second low-dimensional feature to a first low-dimensional space corresponding to the first low-dimensional feature and generating a second mapping feature of the second low-dimensional feature in the first low-dimensional space;
the first mapping feature acquisition unit is used for generating a first mapping feature corresponding to the first low-dimensional feature in the second low-dimensional space according to the second mapping feature and the second low-dimensional feature;
the model generation unit is used for generating an emotion recognition model according to the first mapping characteristics, and the emotion recognition model is used for predicting the emotion type of the voice to be recognized, which is the same as the language type of the second voice sample.
Optionally, the device further includes:
the sample marking unit is used for marking the emotion type of the first voice sample of the target language type and storing the marked first voice sample into the training voice set;
a second speech sample of the non-target language type is stored to the target speech set.
The sample acquisition unit is specifically configured to:
acquiring a first voice sample from a training voice set, detecting a first language type of the first voice sample, and determining a second language type different from the first language type;
and acquiring a voice sample with a second language type from the target voice set as a second voice sample.
The low-dimensional feature acquisition unit is specifically configured to:
extracting a first characteristic parameter in a first voice sample and a second characteristic parameter in a second voice sample;
performing dimensionality reduction on the first characteristic parameters by principal component analysis to obtain first low-dimensional features corresponding to the first characteristic parameters and a first low-dimensional space corresponding to the first low-dimensional features;
and performing dimensionality reduction on the second characteristic parameters by principal component analysis to obtain second low-dimensional features corresponding to the second characteristic parameters and a second low-dimensional space corresponding to the second low-dimensional features.
Wherein the first mapping feature acquisition unit includes:
a conversion function obtaining subunit, configured to obtain a conversion function between the second low-dimensional space corresponding to the second low-dimensional feature and the first low-dimensional space corresponding to the first low-dimensional feature; the conversion function minimizes the difference between the second low-dimensional feature in the second low-dimensional space and the second mapping feature of the second low-dimensional feature in the first low-dimensional space;
and the feature acquisition subunit is used for generating a first mapping feature of the first low-dimensional feature in the second low-dimensional space according to the conversion function.
The conversion function obtaining subunit is specifically configured to:
inputting the second mapping characteristic into the initial mapping model, and obtaining the output of the initial mapping model; the initial parameters of the initial mapping model include an initial conversion function;
obtaining a difference value between the output of the initial mapping model and the second low-dimensional feature, and adjusting an initial conversion function in the initial mapping model according to the difference value;
and when the difference value is smaller than the threshold value, determining the adjusted initial conversion function as a conversion function between a second low-dimensional space corresponding to the second low-dimensional feature and a first low-dimensional space corresponding to the first low-dimensional feature.
Optionally, the device further includes:
the emotion type prediction unit is used for obtaining a voice to be recognized, the language type of which is the same as that of the second voice sample;
obtaining low-dimensional characteristics of the voice to be recognized by adopting a principal component analysis method;
and inputting the low-dimensional characteristics of the voice to be recognized into the emotion recognition model to obtain the emotion type of the voice to be recognized, which is output by the emotion recognition model.
In one aspect, embodiments of the present application provide a computer-readable storage medium having stored thereon a plurality of instructions adapted to be loaded by a processor and to perform the above-described method steps.
In one aspect, a computer device is provided, including a processor and a memory; wherein the memory stores a computer program adapted to be loaded by the processor and to perform the method steps described above.
In one aspect, the present application provides a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions are read from a computer-readable storage medium by a processor of a computer device, and executed by the processor, cause the computer device to perform the method steps described above.
In the embodiment of the application, a first voice sample is obtained from a training voice set, and a second voice sample is obtained from a target voice set; a first low-dimensional feature corresponding to the first voice sample and a second low-dimensional feature corresponding to the second voice sample are obtained by principal component analysis; the second low-dimensional feature is mapped to the first low-dimensional space corresponding to the first low-dimensional feature to generate a second mapping feature of the second low-dimensional feature in the first low-dimensional space; a first mapping feature of the first low-dimensional feature in the second low-dimensional space is generated according to the second mapping feature and the second low-dimensional feature; and an emotion recognition model is generated according to the first mapping feature and used to predict the emotion type of speech to be recognized whose language type is the same as that of the second voice sample. When classifying speech emotion across language types, this avoids re-labeling speech emotion samples; moreover, the emotion recognition model involves no hyperparameter settings, avoiding the network non-convergence that hyperparameter settings easily cause.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a system architecture diagram for data processing according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of a data processing method according to an embodiment of the present application;
FIG. 3 is a schematic flow chart of a data processing method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a data processing apparatus according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
Fig. 1 is a system architecture diagram for data processing according to an embodiment of the present application. The server 10f establishes connections with a cluster of user terminals through the switch 10e and the communication bus 10d; the cluster may include user terminal 10a, user terminal 10b and user terminal 10c. The database 10g stores a plurality of voice samples, including a first voice sample and a second voice sample; the first voice sample is a voice sample whose emotion type is known, and the language type corresponding to the first voice sample is different from the language type corresponding to the second voice sample. The server 10f extracts the first voice sample and the second voice sample from the database 10g, obtains by principal component analysis a first low-dimensional feature corresponding to the first voice sample and a second low-dimensional feature corresponding to the second voice sample, maps the second low-dimensional feature to the first low-dimensional space corresponding to the first low-dimensional feature to generate a second mapping feature of the second low-dimensional feature in the first low-dimensional space, generates according to the second mapping feature and the second low-dimensional feature a first mapping feature of the first low-dimensional feature in the second low-dimensional space, and generates an emotion recognition model according to the first mapping feature; the emotion recognition model can then predict the emotion type of speech to be recognized whose language type is the same as that of the second voice sample.
The user terminals in the embodiments of the application include terminal devices such as tablet computers, smart phones, personal computers (PCs), notebook computers and palmtop computers.
Referring to fig. 2, a flow chart of a data processing method is provided in an embodiment of the present application. As shown in fig. 2, the method of the embodiment of the present application may include the following steps S101 to S105.
S101, acquiring a first voice sample from a training voice set and acquiring a second voice sample from a target voice set;
Specifically, the data processing device obtains a first voice sample from the training voice set and a second voice sample from the target voice set. It can be understood that the first voice sample is a voice sample of a known emotion type, and the language type corresponding to the first voice sample is different from the language type corresponding to the second voice sample. The training voice set stores voice samples of known emotion types; typically the first voice sample is a Chinese voice sample. The emotion type is the emotional state of the user who produced the voice sample and may include positive, negative and neutral; for example, the emotion type corresponding to audio data produced by a user speaking in a happy mood is positive. The target voice set stores voice samples of unknown emotion types, that is, the emotion type of the second voice sample is unknown. The language types of the two samples differ; for example, if the language type of the first voice sample is Chinese, the language type of the second voice sample can be English.
S102, acquiring a first low-dimensional feature corresponding to a first voice sample and a second low-dimensional feature corresponding to a second voice sample by adopting a principal component analysis method;
Specifically, the data processing device extracts first characteristic parameters from the first voice sample and second characteristic parameters from the second voice sample. It can be understood that the device performs feature extraction on the first voice sample to generate the first characteristic parameters and on the second voice sample to generate the second characteristic parameters. Characteristic parameter extraction derives a corresponding multidimensional feature vector from a voice sample; various extraction methods exist, including Perceptual Linear Predictive (PLP) analysis, Linear Prediction Coefficients (LPC) and Mel-Frequency Cepstral Coefficients (MFCC);
performing dimensionality reduction on the first characteristic parameters by principal component analysis yields the first low-dimensional features corresponding to the first characteristic parameters and the first low-dimensional space corresponding to the first low-dimensional features; the first low-dimensional space is the dimensional space to which the first low-dimensional features belong and can be represented by a set of basis vectors. Likewise, principal component analysis reduces the second characteristic parameters to the second low-dimensional features; the second low-dimensional space corresponding to the second low-dimensional features is the dimensional space to which they belong and can also be represented by a set of basis vectors.
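As an illustration of steps S101 and S102, the following sketch extracts MFCC characteristic parameters and reduces them by principal component analysis. It assumes the librosa and scikit-learn libraries, and every name in it (extract_features, reduce_dimensions, source_feature_matrix, target_feature_matrix) is illustrative rather than taken from the patent:

    import librosa
    import numpy as np
    from sklearn.decomposition import PCA

    def extract_features(wav_path, n_mfcc=39):
        # One per-utterance characteristic parameter vector: MFCCs averaged over frames.
        y, sr = librosa.load(wav_path, sr=16000)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # shape (n_mfcc, frames)
        return mfcc.mean(axis=1)                                 # shape (n_mfcc,)

    def reduce_dimensions(feature_matrix, n_components=12):
        # PCA dimensionality reduction; returns the low-dimensional features and the
        # basis vectors that represent the low-dimensional space.
        pca = PCA(n_components=n_components).fit(feature_matrix)
        return pca.transform(feature_matrix), pca.components_, pca

    # Feature matrices stack the per-utterance vectors, one row per sample:
    # Xs, Ws, pca_s = reduce_dimensions(source_feature_matrix)   # first samples
    # Xt, Wt, pca_t = reduce_dimensions(target_feature_matrix)   # second samples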
S103, mapping the second low-dimensional feature to a first low-dimensional space corresponding to the first low-dimensional feature, and generating a second mapping feature of the second low-dimensional feature in the first low-dimensional space;
specifically, the data processing device maps the second low-dimensional feature to a first low-dimensional space corresponding to the first low-dimensional feature, and generates a second mapping feature of the second low-dimensional feature in the first low-dimensional space.
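The patent does not spell out how this mapping is computed. One natural reading, continuing the sketch above, is to project the second characteristic parameters with the PCA fitted on the first samples, which expresses them in the first low-dimensional space:

    # Assumption of this sketch: the second mapping feature Xts is obtained by
    # projecting the target feature matrix onto the source basis Ws via pca_s.
    Xts = pca_s.transform(target_feature_matrix)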
S104, generating a first mapping feature corresponding to the first low-dimensional feature in a second low-dimensional space according to the second mapping feature and the second low-dimensional feature;
Specifically, the data processing device generates the first mapping feature of the first low-dimensional feature in the second low-dimensional space according to the second mapping feature and the second low-dimensional feature. It can be understood that the second mapping feature and the second low-dimensional feature are two representations of the same second feature parameters, in the first and second low-dimensional spaces respectively, so the relation between them allows the first mapping feature of the first low-dimensional feature in the second low-dimensional space to be obtained. Specifically, the data processing device acquires a conversion function between the second low-dimensional space corresponding to the second low-dimensional feature and the first low-dimensional space corresponding to the first low-dimensional feature; the conversion function minimizes the difference between the second low-dimensional feature in the second low-dimensional space and the second mapping feature of the second low-dimensional feature in the first low-dimensional space. The device then generates, according to the conversion function, the first mapping feature of the first low-dimensional feature in the second low-dimensional space. It should be noted that the conversion function may be a recurrent neural network.
For example, let the first low-dimensional feature obtained by reducing the first feature parameters of the first voice sample be Xs, with corresponding first low-dimensional subspace Ws, and let the second low-dimensional feature obtained by reducing the second feature parameters of the second voice sample be Xt, with corresponding second low-dimensional subspace Wt. Mapping Xt to the first low-dimensional space yields the second mapping feature Xts. The difference H between the first low-dimensional subspace Ws and the second low-dimensional subspace Wt can be measured from Xt and Xts and expressed as H = ||Xt − g(Xts)||², where the function g is adjustable; when H attains its minimum value, g is the conversion function between the second low-dimensional space and the first low-dimensional space. The conversion function can convert features from the first low-dimensional space into the second low-dimensional space, that is, migrate the features of the first voice sample to the second voice sample: for the first low-dimensional feature Xs in the first low-dimensional subspace Ws, the conversion function g generates the first mapping feature g(Xs) in the second low-dimensional space. A recognition model may then be trained on the first mapping features to recognize the emotion type of the second voice sample.
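A minimal sketch of learning g by minimizing H follows. The patent notes that the conversion function may be a recurrent neural network; a small feed-forward network is substituted here purely to keep the example short, and the feature dimension continues the sketches above:

    import torch
    import torch.nn as nn

    g = nn.Sequential(nn.Linear(12, 64), nn.Tanh(), nn.Linear(64, 12))
    optimizer = torch.optim.Adam(g.parameters(), lr=1e-3)
    threshold = 1e-4   # illustrative convergence threshold

    Xt_t = torch.tensor(Xt, dtype=torch.float32)
    Xts_t = torch.tensor(Xts, dtype=torch.float32)

    for step in range(10000):
        H = ((Xt_t - g(Xts_t)) ** 2).mean()   # difference H = ||Xt - g(Xts)||^2
        optimizer.zero_grad()
        H.backward()
        optimizer.step()
        if H.item() < threshold:              # stop once the difference falls below the threshold
            break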
S105, generating an emotion recognition model according to the first mapping characteristics.
Specifically, the data processing device generates the emotion recognition model according to the first mapping features. It can be understood that the device inputs the first mapping features into an initial emotion recognition model, obtains the output of the initial emotion recognition model, and trains the initial model according to that output and the first mapping features; when the parameters of the initial emotion recognition model satisfy the convergence condition, the initial model is determined to be the emotion recognition model. The convergence condition is judged as follows: a difference between the output of the model and the first mapping features is calculated, and the parameters satisfy the convergence condition when this difference is smaller than a preset threshold or the number of training iterations of the initial emotion recognition model exceeds a preset count threshold. It should be noted that the present application does not limit the specific structure of the emotion recognition model, which may be any neural network model that those skilled in the art would expect to use for emotion type recognition.
After the emotion recognition model is trained, the second low-dimensional features corresponding to the second voice sample are input into the emotion recognition model, and the emotion type of the second voice sample is predicted according to the model's output; the emotion recognition model can likewise predict the emotion type of any speech to be recognized whose language type is the same as that of the second voice sample.
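As a conventional supervised sketch of this training step, the known emotion labels of the first voice samples are assumed to supply the training targets (the patent leaves the model structure open), and the classifier consumes the first mapping features g(Xs):

    # Continues the tensors above; source_labels (0/1/2 for positive/negative/
    # neutral) is an assumed name for the known emotion types of the first samples.
    emotion_model = nn.Sequential(nn.Linear(12, 32), nn.ReLU(), nn.Linear(32, 3))
    opt = torch.optim.Adam(emotion_model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    Xs_t = torch.tensor(Xs, dtype=torch.float32)
    ys_t = torch.tensor(source_labels, dtype=torch.long)

    for epoch in range(200):
        with torch.no_grad():
            mapped = g(Xs_t)              # first mapping features g(Xs) in the second space
        loss = loss_fn(emotion_model(mapped), ys_t)
        opt.zero_grad()
        loss.backward()
        opt.step()

    # The second (unlabeled) samples already live in the second low-dimensional
    # space, so prediction uses Xt directly:
    # predicted_types = emotion_model(torch.tensor(Xt, dtype=torch.float32)).argmax(dim=1)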
In the embodiment of the application, a first voice sample is obtained from a training voice set, and a second voice sample is obtained from a target voice set; a first low-dimensional feature corresponding to the first voice sample and a second low-dimensional feature corresponding to the second voice sample are obtained by principal component analysis; the second low-dimensional feature is mapped to the first low-dimensional space corresponding to the first low-dimensional feature to generate a second mapping feature of the second low-dimensional feature in the first low-dimensional space; a first mapping feature of the first low-dimensional feature in the second low-dimensional space is generated according to the second mapping feature and the second low-dimensional feature; and an emotion recognition model is generated according to the first mapping feature and used to predict the emotion type of speech to be recognized whose language type is the same as that of the second voice sample. When classifying speech emotion across language types, this avoids re-labeling speech emotion samples; moreover, the emotion recognition model involves no hyperparameter settings, avoiding the network non-convergence that hyperparameter settings easily cause.
Referring to fig. 3, a flow chart of a data processing method is provided in an embodiment of the present application. As shown in fig. 3, the method of the embodiment of the present application may include the following step S201 to step S207.
S201, marking emotion types of first voice samples of target language types, and storing the marked first voice samples into a training voice set; a second speech sample of the non-target language type is stored to the target speech set.
Specifically, the data processing device marks the emotion types of first voice samples of the target language type, stores the marked first voice samples into the training voice set, and stores second voice samples of non-target language types into the target voice set. It can be understood that the target language type is a preset language type, and a non-target language type is any language type different from it; for example, the target language type can be Chinese, meaning that Chinese voice samples are labeled with emotion types, and a non-target language type can be English;
the training voice set is used to store labeled voice samples, while the target voice set is used to store voice samples whose language types differ from the target language type; the voice samples in the target voice set are unlabeled, and the target voice set may store voice samples of multiple language types.
S202, acquiring a first voice sample from a training voice set, detecting a first language type of the first voice sample, and determining a second language type different from the first language type; and acquiring a voice sample with a second language type from the target voice set as a second voice sample.
Specifically, the data processing device acquires a first voice sample from the training voice set, detects the first language type of the first voice sample, and determines a second language type different from the first language type; the second language type can be understood as the language type of the voice samples whose emotion needs to be predicted. For example, if the language type of the first voice sample is Chinese, the first language type is Chinese and the second language type is any language type other than Chinese, such as English. The target voice set may store voice samples of multiple language types, and a voice sample of the second language type is acquired from it as the second voice sample.
S203, acquiring a first low-dimensional feature corresponding to the first voice sample and a second low-dimensional feature corresponding to the second voice sample by adopting a principal component analysis method;
Specifically, the data processing device extracts a first characteristic parameter in the first voice sample and a second characteristic parameter in the second voice sample, and it can be understood that the data processing device performs characteristic extraction on the first voice sample to generate the first characteristic parameter, performs characteristic extraction on the second voice sample to generate the second characteristic parameter, and the characteristic parameter extraction method is to extract a corresponding multidimensional characteristic vector from the voice sample, and the characteristic parameter extraction methods are various and include PLP, LPC and MFCC;
performing dimensionality reduction on the first characteristic parameters by principal component analysis yields the first low-dimensional features corresponding to the first characteristic parameters and the first low-dimensional space corresponding to the first low-dimensional features; the first low-dimensional space is the dimensional space to which the first low-dimensional features belong, can be represented by a set of basis vectors, and the first low-dimensional features can be expressed in terms of those basis vectors. Likewise, principal component analysis reduces the second characteristic parameters to the second low-dimensional features and the corresponding second low-dimensional space; the second low-dimensional space is the dimensional space to which the second low-dimensional features belong, can be represented by a set of basis vectors, and the second low-dimensional features can be expressed in terms of those basis vectors.
S204, mapping the second low-dimensional feature to a first low-dimensional space corresponding to the first low-dimensional feature, and generating a second mapping feature of the second low-dimensional feature in the first low-dimensional space;
specifically, the data processing device maps the second low-dimensional feature to a first low-dimensional space corresponding to the first low-dimensional feature, and generates a second mapping feature of the second low-dimensional feature in the first low-dimensional space.
S205, obtaining a conversion function between a second low-dimensional space corresponding to the second low-dimensional feature and a first low-dimensional space corresponding to the first low-dimensional feature;
Specifically, the data processing device obtains a conversion function between the second low-dimensional space corresponding to the second low-dimensional feature and the first low-dimensional space corresponding to the first low-dimensional feature. It can be understood that the conversion function minimizes the difference between the second low-dimensional feature in the second low-dimensional space and the second mapping feature of the second low-dimensional feature in the first low-dimensional space. For example, let the first low-dimensional feature obtained by reducing the first feature parameters of the first voice sample be Xs, with corresponding first low-dimensional subspace Ws, and let the second low-dimensional feature obtained by reducing the second feature parameters of the second voice sample be Xt, with corresponding second low-dimensional subspace Wt. Mapping Xt to the first low-dimensional space yields the second mapping feature Xts, and the difference H between the first and second low-dimensional subspaces can be expressed as H = ||Xt − g(Xts)||², where the function g is adjustable; when H attains its minimum value, g is the conversion function between the second low-dimensional space and the first low-dimensional space.
The conversion function g can be obtained by adopting a mapping model to minimize the difference H, the mapping model can be a neural network model, the second mapping feature is input into an initial mapping model to obtain the output of the initial mapping model, the initial parameters of the initial mapping model comprise the initial conversion function, the difference between the output of the initial mapping model and the second low-dimensional feature is obtained, and the initial conversion function in the initial mapping model is adjusted according to the difference; when the difference is smaller than the threshold, the initial mapping model converges, and the adjusted initial conversion function is determined as a conversion function between a second low-dimensional space corresponding to the second low-dimensional feature and a first low-dimensional space corresponding to the first low-dimensional feature.
It should be noted that the conversion function g may be obtained not only by using a mapping model to minimize the difference H but also by matrix calculation. Specifically, a relation matrix between the second low-dimensional feature and the second mapping feature is computed: the relation between Xt and Xts can be represented by a matrix W satisfying Xts·W = Xt, so that W can be calculated from Xts⁻¹ and Xt as W = Xts⁻¹·Xt (in practice a pseudo-inverse, since Xts is generally not square). The matrix W can then serve as the conversion function g between the second low-dimensional space and the first low-dimensional space; it represents the relation between the two spaces and thereby realizes the migration of features.
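A short sketch of this matrix route solves Xts·W = Xt for W. np.linalg.pinv is used here because Xts is generally not square; that choice is an assumption of the sketch, since the patent writes only the inverse of Xts:

    W = np.linalg.pinv(Xts) @ Xt             # relation matrix between the two spaces
    Xs_mapped = Xs @ W                       # first mapping features: Xs migrated into the second space
    residual = np.linalg.norm(Xt - Xts @ W)  # how well W captures the relation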
S206, generating a first mapping feature of the first low-dimensional feature in the second low-dimensional space according to the conversion function.
Specifically, the data processing device generates the first mapping features of the first low-dimensional features in the second low-dimensional space according to the conversion function. It can be understood that the conversion function converts features from the first low-dimensional space into the second low-dimensional space, that is, it migrates the features of the first voice sample to the second voice sample: for the first low-dimensional features Xs in the first low-dimensional subspace Ws, the conversion function g generates the first mapping features g(Xs) in the second low-dimensional space. The first mapping features may be used to train a recognition model to recognize the emotion type of the second voice sample.
S207, generating an emotion recognition model according to the first mapping features; obtaining a voice to be recognized whose language type is the same as that of the second voice sample; obtaining the low-dimensional features of the voice to be recognized by principal component analysis; and inputting the low-dimensional features of the voice to be recognized into the emotion recognition model to obtain the emotion type output by the emotion recognition model.
Specifically, the data processing device generates the emotion recognition model according to the first mapping features and uses it to predict the emotion type of the second voice sample. It can be understood that the device inputs the first mapping features into an initial emotion recognition model, obtains its output, and trains the initial model according to that output and the first mapping features; when the parameters of the initial emotion recognition model satisfy the convergence condition, the initial model is determined to be the emotion recognition model. The convergence condition is judged as follows: a difference between the output of the model and the first mapping features is calculated, and the parameters satisfy the convergence condition when this difference is smaller than a preset threshold or the number of training iterations of the initial emotion recognition model exceeds a preset count threshold.
After the emotion recognition model is trained, it can be used to predict the emotion type of speech to be recognized whose language type is the same as that of the second voice sample. Specifically, the voice to be recognized is obtained, its low-dimensional features are obtained by principal component analysis, those features are input into the trained emotion recognition model, and the emotion type of the voice is predicted according to the model's output.
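An end-to-end inference sketch for step S207, reusing the illustrative helpers defined above (extract_features, pca_t and emotion_model are names assumed by these sketches, not by the patent):

    def predict_emotion(wav_path):
        # Predict the emotion type of speech whose language type matches the second sample.
        feats = extract_features(wav_path).reshape(1, -1)
        low_dim = pca_t.transform(feats)                # PCA low-dimensional features
        logits = emotion_model(torch.tensor(low_dim, dtype=torch.float32))
        return ["positive", "negative", "neutral"][logits.argmax(dim=1).item()]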
In the embodiment of the application, a first voice sample is obtained from a training voice set, and a second voice sample is obtained from a target voice set; a first low-dimensional feature corresponding to the first voice sample and a second low-dimensional feature corresponding to the second voice sample are obtained by principal component analysis; the second low-dimensional feature is mapped to the first low-dimensional space corresponding to the first low-dimensional feature to generate a second mapping feature of the second low-dimensional feature in the first low-dimensional space; a first mapping feature of the first low-dimensional feature in the second low-dimensional space is generated according to the second mapping feature and the second low-dimensional feature; and an emotion recognition model is generated according to the first mapping feature and used to predict the emotion type of speech to be recognized whose language type is the same as that of the second voice sample. When classifying speech emotion across language types, this avoids re-labeling speech emotion samples; moreover, the emotion recognition model involves no hyperparameter settings, avoiding the network non-convergence that hyperparameter settings easily cause.
Referring to fig. 4, a schematic structural diagram of a data processing apparatus is provided in an embodiment of the present application. The data processing device may be a computer program (including program code) running in a computer device, for example the data processing device is an application software; the device may be used to perform the respective steps in the methods provided by the embodiments of the present application. As shown in fig. 4, the data processing apparatus 1 of the embodiment of the present application may include: a sample acquisition unit 11, a low-dimensional feature acquisition unit 12, a second mapping feature acquisition unit 13, a first mapping feature acquisition unit 14, and a model generation unit 15.
A sample acquiring unit 11, configured to acquire a first voice sample from a training voice set and acquire a second voice sample from a target voice set; the first speech sample is a speech sample of a known emotion type; the language type corresponding to the first voice sample is different from the language type corresponding to the second voice sample;
a low-dimensional feature acquiring unit 12, configured to acquire a first low-dimensional feature corresponding to the first voice sample and a second low-dimensional feature corresponding to the second voice sample by using a principal component analysis method;
a second mapping feature obtaining unit 13, configured to map a second low-dimensional feature to a first low-dimensional space corresponding to the first low-dimensional feature, and generate a second mapping feature of the second low-dimensional feature in the first low-dimensional space;
A first mapping feature obtaining unit 14, configured to generate a first mapping feature corresponding to the first low-dimensional feature in the second low-dimensional space according to the second mapping feature and the second low-dimensional feature;
and the model generating unit 15 is configured to generate an emotion recognition model according to the first mapping feature, and predict the emotion type of the second speech sample by using the emotion recognition model.
Referring to fig. 4, the data processing apparatus 1 of the embodiment of the present application may further include: a sample marking unit 16;
a sample marking unit 16, configured to mark a first voice sample of a target language type with an emotion type, and store the marked first voice sample into a training voice set;
a second speech sample of the non-target language type is stored to the target speech set.
The sample acquisition unit 11 is specifically configured to:
acquiring a first voice sample from a training voice set, detecting a first language type of the first voice sample, and determining a second language type different from the first language type;
and acquiring a voice sample with a second language type from the target voice set as a second voice sample.
The low-dimensional feature acquisition unit 12 is specifically configured to:
extracting a first characteristic parameter in a first voice sample and a second characteristic parameter in a second voice sample;
performing dimensionality reduction on the first characteristic parameters by principal component analysis to obtain first low-dimensional features corresponding to the first characteristic parameters and a first low-dimensional space corresponding to the first low-dimensional features;
and performing dimensionality reduction on the second characteristic parameters by principal component analysis to obtain second low-dimensional features corresponding to the second characteristic parameters and a second low-dimensional space corresponding to the second low-dimensional features.
Referring to fig. 4, the first mapping feature acquiring unit 14 of the embodiment of the present application may include: a conversion function acquisition subunit 141, a feature acquisition subunit 142;
a conversion function obtaining subunit 141, configured to obtain a conversion function between the second low-dimensional space corresponding to the second low-dimensional feature and the first low-dimensional space corresponding to the first low-dimensional feature; the conversion function minimizes the difference between the second low-dimensional feature in the second low-dimensional space and the second mapping feature of the second low-dimensional feature in the first low-dimensional space;
the feature acquisition subunit 142 is configured to generate a first mapping feature corresponding to the first low-dimensional feature in the second low-dimensional space according to the conversion function.
The conversion function acquisition subunit 141 is specifically configured to:
inputting the second mapping characteristic into the initial mapping model, and obtaining the output of the initial mapping model; the initial parameters of the initial mapping model include an initial conversion function;
obtaining a difference value between the output of the initial mapping model and the second low-dimensional feature, and adjusting an initial conversion function in the initial mapping model according to the difference value;
and when the difference value is smaller than the threshold value, determining the adjusted initial conversion function as a conversion function between a second low-dimensional space corresponding to the second low-dimensional feature and a first low-dimensional space corresponding to the first low-dimensional feature.
Referring to fig. 4, the data processing apparatus 1 of the embodiment of the present application may further include: an emotion type prediction unit 17;
an emotion type prediction unit 17, configured to obtain a voice to be recognized, which has the same language type as that of the second voice sample;
obtaining low-dimensional characteristics of the voice to be recognized by adopting a principal component analysis method;
and inputting the low-dimensional characteristics of the voice to be recognized into the emotion recognition model to obtain the emotion type of the voice to be recognized, which is output by the emotion recognition model.
In the embodiment of the application, a first voice sample is obtained from a training voice set, and a second voice sample is obtained from a target voice set; a first low-dimensional feature corresponding to the first voice sample and a second low-dimensional feature corresponding to the second voice sample are obtained by principal component analysis; the second low-dimensional feature is mapped to the first low-dimensional space corresponding to the first low-dimensional feature to generate a second mapping feature of the second low-dimensional feature in the first low-dimensional space; a first mapping feature of the first low-dimensional feature in the second low-dimensional space is generated according to the second mapping feature and the second low-dimensional feature; and an emotion recognition model is generated according to the first mapping feature and used to predict the emotion type of speech to be recognized whose language type is the same as that of the second voice sample. When classifying speech emotion across language types, this avoids re-labeling speech emotion samples; moreover, the emotion recognition model involves no hyperparameter settings, avoiding the network non-convergence that hyperparameter settings easily cause.
Referring to fig. 5, a schematic structural diagram of a computer device is provided in an embodiment of the present application. As shown in fig. 5, the computer device 1000 may include: at least one processor 1001, such as a CPU, at least one network interface 1004, a user interface 1003, a memory 1005, at least one communication bus 1002. Wherein the communication bus 1002 is used to enable connected communication between these components. The user interface 1003 may include a Display (Display), and the optional user interface 1003 may further include a standard wired interface, a wireless interface, among others. The network interface 1004 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface). The memory 1005 may be a random access memory (Random Access Memory, RAM) or a nonvolatile memory (NVM), such as at least one magnetic disk memory. The memory 1005 may also optionally be at least one storage device located remotely from the processor 1001. As shown in FIG. 5, an operating system, network communications modules, user interface modules, and data processing applications may be included in memory 1005, which is a type of computer storage medium.
In the computer device 1000 shown in fig. 5, the network interface 1004 may provide a network communication function, and the user interface 1003 is mainly used as an interface for providing input to a user; the processor 1001 may be configured to invoke the data processing application stored in the memory 1005 to implement the description of the data processing method in any of the embodiments corresponding to fig. 2 to 3, which is not described herein.
It should be understood that the computer device 1000 described in the embodiments of the present application may perform the data processing method described in any of the embodiments corresponding to fig. 2 to 3, and may also perform the functions of the data processing device described in the embodiment corresponding to fig. 4; details are not repeated here. Likewise, the description of the beneficial effects of the same method is not repeated.
Furthermore, it should be noted that the embodiments of the present application also provide a computer-readable storage medium storing the computer program executed by the aforementioned data processing device. The computer program includes program instructions which, when executed by a processor, perform the data processing method described in any of the embodiments corresponding to fig. 2 to 3; a detailed description is therefore not given here, and the description of the beneficial effects of the same method is likewise omitted. For technical details not disclosed in the storage-medium embodiments of the present application, please refer to the description of the method embodiments. As an example, the program instructions may be deployed and executed on one computing device, on multiple computing devices at one site, or on multiple computing devices distributed across multiple sites and interconnected by a communication network; multiple computing devices distributed across multiple sites and interconnected by a communication network may constitute a blockchain system.
Those skilled in the art will appreciate that all or part of the processes in the above method embodiments may be implemented by a computer program instructing related hardware. The program may be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disk, an NVM, a RAM, or the like.
The foregoing disclosure is merely a preferred embodiment of the present application and is not intended to limit the scope of the claims; equivalent variations made according to the claims of the present application therefore still fall within the scope of the present application.

Claims (10)

1. A method of data processing, comprising:
acquiring a first voice sample from a training voice set and acquiring a second voice sample from a target voice set; the first voice sample is a voice sample of a known emotion type; the language type corresponding to the first voice sample is different from the language type corresponding to the second voice sample;
acquiring a first low-dimensional feature corresponding to the first voice sample and a second low-dimensional feature corresponding to the second voice sample by principal component analysis;
mapping the second low-dimensional feature to a first low-dimensional space corresponding to the first low-dimensional feature, and generating a second mapping feature of the second low-dimensional feature in the first low-dimensional space;
generating a first mapping feature corresponding to the first low-dimensional feature in a second low-dimensional space according to the second mapping feature and the second low-dimensional feature;
and generating an emotion recognition model according to the first mapping characteristics, wherein the emotion recognition model is used for predicting the emotion type of the voice to be recognized, which is the same as the language type of the second voice sample.
2. The method of claim 1, further comprising:
marking emotion types of first voice samples of a target language type, and storing the marked first voice samples into the training voice set;
and storing second voice samples of a non-target language type into the target voice set.
3. The method of claim 1, wherein acquiring the first voice sample from the training voice set and the second voice sample from the target voice set comprises:
acquiring a first voice sample from a training voice set, detecting a first language type of the first voice sample, and determining a second language type different from the first language type;
and acquiring a voice sample with a second language type from the target voice set as a second voice sample.
4. The method of claim 1, wherein obtaining the first low-dimensional feature corresponding to the first speech sample and the second low-dimensional feature corresponding to the second speech sample using principal component analysis comprises:
extracting a first characteristic parameter from the first voice sample and a second characteristic parameter from the second voice sample;
performing dimensionality reduction on the first characteristic parameter by principal component analysis to obtain a first low-dimensional feature corresponding to the first characteristic parameter and a first low-dimensional space corresponding to the first low-dimensional feature;
and performing dimensionality reduction on the second characteristic parameter by principal component analysis to obtain a second low-dimensional feature corresponding to the second characteristic parameter and a second low-dimensional space corresponding to the second low-dimensional feature.
5. The method of claim 1, wherein generating a first mapping feature of the first low-dimensional feature in the second low-dimensional space based on the second mapping feature and the second low-dimensional feature comprises:
acquiring a conversion function between the second low-dimensional space corresponding to the second low-dimensional feature and the first low-dimensional space corresponding to the first low-dimensional feature; the conversion function minimizes a difference between the second low-dimensional feature in the second low-dimensional space and the second mapping feature of the second low-dimensional feature in the first low-dimensional space;
and generating a first mapping feature corresponding to the first low-dimensional feature in the second low-dimensional space according to the conversion function.
6. The method of claim 5, wherein acquiring the conversion function between the second low-dimensional space corresponding to the second low-dimensional feature and the first low-dimensional space corresponding to the first low-dimensional feature comprises:
inputting the second mapping feature into an initial mapping model and obtaining an output of the initial mapping model; initial parameters of the initial mapping model include an initial conversion function;
obtaining a difference value between the output of the initial mapping model and the second low-dimensional feature, and adjusting an initial conversion function in the initial mapping model according to the difference value;
and when the difference value is smaller than the threshold value, determining the adjusted initial conversion function as a conversion function between a second low-dimensional space corresponding to the second low-dimensional feature and a first low-dimensional space corresponding to the first low-dimensional feature.
7. The method of claim 1, further comprising:
obtaining a voice to be recognized, the language type of which is the same as that of the second voice sample;
obtaining low-dimensional features of the voice to be recognized by principal component analysis;
and inputting the low-dimensional features of the voice to be recognized into the emotion recognition model to obtain the emotion type of the voice to be recognized output by the emotion recognition model.
8. A data processing apparatus, comprising:
the sample acquisition unit is used for acquiring a first voice sample from the training voice set and acquiring a second voice sample from the target voice set; the first voice sample is a voice sample of a known emotion type; the language type corresponding to the first voice sample is different from the language type corresponding to the second voice sample;
the low-dimensional feature acquisition unit is used for acquiring a first low-dimensional feature corresponding to the first voice sample and a second low-dimensional feature corresponding to the second voice sample by principal component analysis;
the second mapping feature acquisition unit is used for mapping the second low-dimensional feature to a first low-dimensional space corresponding to the first low-dimensional feature and generating a second mapping feature of the second low-dimensional feature in the first low-dimensional space;
the first mapping feature acquisition unit is used for generating a first mapping feature corresponding to the first low-dimensional feature in the second low-dimensional space according to the second mapping feature and the second low-dimensional feature;
the model generation unit is used for generating an emotion recognition model according to the first mapping characteristics, and the emotion recognition model is used for predicting the emotion type of the voice to be recognized, which is the same as the language type of the second voice sample.
9. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions which, when executed by a processor, perform the method according to any of claims 1-7.
10. A computer device, comprising: a processor and a memory; wherein the memory stores a computer program adapted to be loaded by the processor and to perform the method according to any of claims 1-7.
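Claims 5 and 6 describe the conversion function as the result of an iterative adjustment: feed the second mapping features through an initial mapping model, measure the difference between its output and the second low-dimensional features, adjust the conversion function, and stop once the difference falls below a threshold. The sketch below is one hedged realisation of that loop under assumed choices (a linear map, a mean-squared difference, plain gradient descent); the claims fix the stopping criterion, not the optimiser or the functional form.

import numpy as np

def learn_conversion(Z_mapped, Z_low, lr=0.1, threshold=1e-4, max_steps=10_000):
    """Z_mapped: second mapping features, in the first low-dimensional space.
    Z_low: second low-dimensional features, in the second space.
    The linear form, loss, and optimiser are assumptions, not claim language."""
    n, d_in = Z_mapped.shape
    W = np.zeros((d_in, Z_low.shape[1]))       # initial conversion function
    for _ in range(max_steps):
        residual = Z_mapped @ W - Z_low        # model output minus second low-dim feature
        diff = np.mean(residual ** 2)          # the "difference value" of claim 6
        if diff < threshold:                   # stop once below the threshold
            break
        W -= lr * (Z_mapped.T @ residual) / n  # adjust the conversion function
    return W

# Hypothetical usage: the learned W then carries first low-dimensional
# features into the second space, i.e. Z_first_mapped = Z_first @ W.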
CN202110098039.1A 2021-01-25 2021-01-25 Data processing method and device Active CN112927714B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110098039.1A CN112927714B (en) 2021-01-25 2021-01-25 Data processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110098039.1A CN112927714B (en) 2021-01-25 2021-01-25 Data processing method and device

Publications (2)

Publication Number Publication Date
CN112927714A CN112927714A (en) 2021-06-08
CN112927714B (en) 2024-01-12

Family

ID=76167461

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110098039.1A Active CN112927714B (en) 2021-01-25 2021-01-25 Data processing method and device

Country Status (1)

Country Link
CN (1) CN112927714B (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9240188B2 (en) * 2004-09-16 2016-01-19 Lena Foundation System and method for expressive language, developmental disorder, and emotion assessment

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2010264087A (en) * 2009-05-15 2010-11-25 Tokyo Univ Of Agriculture & Technology Social emotional behavior evaluation system, social emotional behavior evaluation method and social emotional behavior evaluation program, and computer readable recording medium on which the same is recorded
CN103578481A (en) * 2012-07-24 2014-02-12 东南大学 Method for recognizing cross-linguistic voice emotion
CN108899049A (en) * 2018-05-31 2018-11-27 中国地质大学(武汉) A kind of speech-emotion recognition method and system based on convolutional neural networks

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Hierarchical Dirichlet Process Mixture Model for Music Emotion Recognition; Jia-Ching Wang et al.; IEEE Transactions on Affective Computing; Vol. 6, No. 3; full text *
Speech emotion feature dimensionality reduction method based on incremental manifold learning; Wang Haihe; Lu Jierong; Zhan Yongzhao; Mao Qirong; Computer Engineering (12); full text *
Research on speech emotion recognition algorithms based on ensemble learning; Li Tiangang; Ye Shuo; Ye Guangming; Chu Yu; Computer Technology and Development (06); full text *
Analysis of prosodic features and emotion recognition for multilingual emotional speech; Jiang Xiaoqing; Tian Lan; Cui Guohui; Acta Acustica (Chinese edition) (03); full text *
A survey of dimensional speech emotion recognition; Li Haifeng; Chen Jing; Ma Lin; Bo Hongjian; Xu Cong; Li Hongwei; Journal of Software (08); full text *

Also Published As

Publication number Publication date
CN112927714A (en) 2021-06-08

Similar Documents

Publication Publication Date Title
US11100921B2 (en) Pinyin-based method and apparatus for semantic recognition, and system for human-machine dialog
CN110444198B (en) Retrieval method, retrieval device, computer equipment and storage medium
JP5901001B1 (en) Method and device for acoustic language model training
US9390711B2 (en) Information recognition method and apparatus
CN112685565A (en) Text classification method based on multi-mode information fusion and related equipment thereof
CN112100349A (en) Multi-turn dialogue method and device, electronic equipment and storage medium
CN111930940A (en) Text emotion classification method and device, electronic equipment and storage medium
CN109961041B (en) Video identification method and device and storage medium
CN107104994B (en) Voice recognition method, electronic device and voice recognition system
CN114120978A (en) Emotion recognition model training and voice interaction method, device, equipment and medium
CN111583911B (en) Speech recognition method, device, terminal and medium based on label smoothing
CN113434683B (en) Text classification method, device, medium and electronic equipment
US11393490B2 (en) Method, apparatus, device and computer-readable storage medium for voice interaction
JP7178394B2 (en) Methods, apparatus, apparatus, and media for processing audio signals
CN110717021A (en) Input text and related device for obtaining artificial intelligence interview
CN111126084B (en) Data processing method, device, electronic equipment and storage medium
KR102608867B1 (en) Method for industry text increment, apparatus thereof, and computer program stored in medium
CN112735479B (en) Speech emotion recognition method and device, computer equipment and storage medium
CN114186681A (en) Method, apparatus and computer program product for generating model clusters
CN112927714B (en) Data processing method and device
CN111951790A (en) Voice processing method, device, terminal and storage medium
CN109002498B (en) Man-machine conversation method, device, equipment and storage medium
CN113555005B (en) Model training method, model training device, confidence determining method, confidence determining device, electronic equipment and storage medium
CN116129881A (en) Voice task processing method and device, electronic equipment and storage medium
CN115512682A (en) Polyphone pronunciation prediction method and device, electronic equipment and storage medium

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant