CN114023313A - Training of speech processing model, speech processing method, apparatus, device and medium

Info

Publication number: CN114023313A
Application number: CN202210000504.8A
Authority: CN (China)
Prior art keywords: voice, speech, processing, sequence, current
Legal status: Granted; Active
Other languages: Chinese (zh)
Other versions: CN114023313B (en)
Inventor: 迟耀明
Current and original assignee: Beijing Century TAL Education Technology Co Ltd
Priority: CN202210000504.8A
Events: application filed by Beijing Century TAL Education Technology Co Ltd; publication of CN114023313A; application granted; publication of CN114023313B


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks


Abstract

Embodiments of the disclosure relate to training of a speech processing model and to a speech processing method, apparatus, device and medium. The training method includes: obtaining an original speech sequence; adding a preset number of white Gaussian noise sequences to the original speech sequence to obtain a preset number of speech sequences to be trained; performing empirical mode decomposition on each speech sequence to be trained to obtain speech modal components of different frequencies and a target speech trend term; and training an initial neural network model based on the speech modal components and the target speech trend term to obtain a speech processing model. Adding independently distributed white Gaussian noise to the original speech sequence improves decomposition efficiency, and building the speech processing model from the decomposed modal components and trend term improves the model's speech processing accuracy while also improving training efficiency.

Description

Training of speech processing model, speech processing method, apparatus, device and medium
Technical Field
The present disclosure relates to the field of speech processing technologies, and in particular, to methods, apparatuses, devices, and media for training a speech processing model.
Background
At present, with the continuous development of communication technologies and intelligent terminals, numerous devices need to perform voice interaction; it is therefore important to process speech accurately and effectively during voice interaction.
In the related art, when a speech model is trained, features of different scales are generally fused; these fused features interfere with one another during training, so the accuracy of the trained model is low.
Disclosure of Invention
To solve the above technical problem, or at least partially solve it, the present disclosure provides a training method for a speech processing model, and a speech processing method, apparatus, device and medium.
According to an aspect of the embodiments of the present disclosure, there is provided a method for training a speech processing model, including:
acquiring an original speech sequence;
adding a preset number of white Gaussian noise sequences to the original speech sequence to obtain a preset number of speech sequences to be trained;
performing empirical mode decomposition on each speech sequence to be trained to obtain speech modal components of different frequencies and a target speech trend term;
and training an initial neural network model based on the speech modal components and the target speech trend term to obtain a speech processing model.
According to another aspect of the embodiments of the present disclosure, there is provided a speech processing method, including:
acquiring a speech sequence to be processed;
performing empirical mode decomposition based on the speech sequence to be processed to obtain a current speech modal component and a current speech trend term;
and inputting the current speech modal component and the current speech trend term into a speech processing model for processing, and acquiring a speech processing result.
According to another aspect of the embodiments of the present disclosure, there is provided a training apparatus for a speech processing model, including:
a first acquisition module, configured to acquire an original speech sequence;
an adding module, configured to add a preset number of white Gaussian noise sequences to the original speech sequence to obtain the preset number of speech sequences to be trained;
a first decomposition module, configured to perform empirical mode decomposition on each speech sequence to be trained to obtain speech modal components and a target speech trend term;
and a training acquisition module, configured to train an initial neural network model based on the speech modal components and the target speech trend term to obtain a speech processing model.
According to another aspect of the embodiments of the present disclosure, there is provided a speech processing apparatus including:
a second acquisition module, configured to acquire a speech sequence to be processed;
a second decomposition module, configured to perform empirical mode decomposition based on the speech sequence to be processed to obtain a current speech modal component and a current speech trend term;
and a processing module, configured to input the current speech modal component and the current speech trend term into a speech processing model for processing to obtain a speech processing result.
According to another aspect of the disclosed embodiments, there is provided an electronic device, comprising: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to read the executable instructions from the memory and execute them to implement the training method of the speech processing model or the speech processing method provided by the embodiments of the present disclosure.
According to another aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium storing a computer program for executing the training method of the speech processing model or the speech processing method provided by the embodiments of the present disclosure.
According to another aspect of the embodiments of the present disclosure, there is provided a computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the training method of the speech processing model or the speech processing method provided by the embodiments of the present disclosure.
According to the technical scheme provided by the embodiments of the present disclosure, an original speech sequence is obtained; a preset number of white Gaussian noise sequences are added to it to obtain a preset number of speech sequences to be trained; empirical mode decomposition is performed on each speech sequence to be trained to obtain speech modal components of different frequencies and a target speech trend term; and an initial neural network model is trained on the modal components and trend term to obtain a speech processing model. Adding independently distributed white Gaussian noise to the original speech sequence improves decomposition efficiency, and obtaining the model from the decomposed modal components and trend term improves the model's speech processing accuracy while also improving training efficiency.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below; it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a schematic flow chart of a method for training a speech processing model according to an embodiment of the present disclosure;
FIG. 2 is a schematic flow chart of another method for training a speech processing model according to an embodiment of the present disclosure;
FIG. 3 is a schematic structural diagram of a speech processing model according to an embodiment of the present disclosure;
FIG. 4 is a schematic flow chart of yet another method for training a speech processing model according to an embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of another speech processing model according to an embodiment of the present disclosure;
FIG. 6 is a schematic flow chart of a speech processing method according to an embodiment of the present disclosure;
FIG. 7 is a schematic flow chart of another speech processing method according to an embodiment of the present disclosure;
FIG. 8 is a schematic structural diagram of a training apparatus for a speech processing model according to an embodiment of the present disclosure;
FIG. 9 is a schematic structural diagram of a speech processing apparatus according to an embodiment of the present disclosure;
FIG. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
In order that the above objects, features and advantages of the present disclosure may be more clearly understood, aspects of the present disclosure will be further described below. It should be noted that the embodiments and features of the embodiments of the present disclosure may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure, but the present disclosure may be practiced in other ways than those described herein; it is to be understood that the embodiments disclosed in the specification are only a few embodiments of the present disclosure, and not all embodiments.
In practical applications, the related art extracts features from a speech sequence before training a speech model, which reduces the efficiency and accuracy of the speech processing model. This matters especially in the field of education, where students frequently answer questions posed by teachers aloud or put spoken questions to them. In view of these problems, embodiments of the present disclosure provide a training method for a speech processing model, and a speech processing method, apparatus, device, and medium, which improve decomposition efficiency by adding independently distributed white Gaussian noise to the original speech sequence and obtain the speech processing model from the decomposed speech modal components and target speech trend term, improving the model's speech processing accuracy while also improving training efficiency.
First, an embodiment of the present disclosure provides a method for training a speech processing model. FIG. 1 is a schematic flow chart of the method, which can be executed by a training apparatus for a speech processing model; the apparatus can be implemented in software and/or hardware and can generally be integrated in an electronic device. As shown in fig. 1, the method mainly includes the following steps S102 to S108:
step 102, obtaining an original voice sequence.
Step 104, adding a preset number of white Gaussian noise sequences to the original speech sequence to obtain a preset number of speech sequences to be trained.
The original speech sequence may be obtained by preprocessing speech data acquired by a sound acquisition device, such as a microphone, in the electronic device, or it may be a preprocessed speech sequence received by the electronic device executing the training method of the speech processing model; the acquisition mode is not limited here. The speech data can be preprocessed in various ways. For example, a child speech sequence is Fourier-transformed to obtain its frequency-domain information; in view of the frequency characteristics of children's speech, the low-frequency part (for example, below 300 Hz) of the transformed frequency-domain information is removed; and the original speech sequence is then obtained by inverse Fourier transform.
White Gaussian noise is noise whose instantaneous values follow a Gaussian distribution and whose power spectral density is uniformly distributed; its probability distribution is the normal density function.
In some embodiments, a preset number of white Gaussian noise sequences are added to the original speech sequence to obtain the preset number of speech sequences to be trained.
The preset number can be selected according to the needs of the application scene and is determined by the user gender, user age, voice scene, calculation mode, and the like associated with the original speech sequence.
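As a concrete illustration of steps 102 and 104, a minimal sketch follows, assuming numpy; the function and parameter names are illustrative, not part of the disclosure.

```python
import numpy as np

def make_training_sequences(x: np.ndarray, t: int, noise_std: float) -> np.ndarray:
    """Add t independent white Gaussian noise sequences to the original
    speech sequence x, yielding t speech sequences to be trained."""
    rng = np.random.default_rng()
    # Row i holds x(n) + w_i(n), i = 1..t (see formula (1) below).
    return x[None, :] + rng.normal(0.0, noise_std, size=(t, x.size))
```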
Step 106, performing empirical mode decomposition on each speech sequence to be trained to obtain speech modal components of different frequencies and a target speech trend term.
A speech modal component is obtained by taking all maximum points and all minimum points of a speech sequence to be trained, fitting each set to obtain an upper and a lower envelope, and averaging the two envelopes. The speech trend term is the residual between the original speech sequence and the speech modal component, i.e., the value obtained by subtracting the computed modal component from the original speech sequence.
In some embodiments, performing empirical mode decomposition on each speech sequence to be trained can be understood as follows: take all maximum and minimum points of each sequence, fit them to obtain upper and lower envelopes, and average the envelopes to obtain the preset number of first-order speech modal components; derive a first-order speech trend term from these components; if the amplitude of the first-order trend term is not less than a preset amplitude threshold, treat the trend term as a new original speech sequence, add the preset number of white Gaussian noise sequences, and continue the empirical mode decomposition; the decomposition stops when the amplitude of the resulting target speech trend term falls below the threshold, yielding speech modal components of different frequencies and the target speech trend term.
The preset amplitude threshold may be selected and set according to the application scene needs, and illustratively, the preset amplitude threshold is inversely proportional to the number of classification categories of the voice scene.
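The envelope construction just described can be sketched as follows. This is a minimal single-pass illustration assuming scipy, following the definition above in which the modal component is the mean of the two envelopes; the function name is illustrative.

```python
import numpy as np
from scipy.signal import argrelextrema
from scipy.interpolate import CubicSpline

def decompose_once(x: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """One decomposition pass: fit upper/lower envelopes through all
    maximum/minimum points and average them (the modal component);
    the trend term is the remainder of the sequence."""
    n = np.arange(x.size)
    max_idx = argrelextrema(x, np.greater)[0]    # all maximum points
    min_idx = argrelextrema(x, np.less)[0]       # all minimum points
    upper = CubicSpline(max_idx, x[max_idx])(n)  # upper envelope
    lower = CubicSpline(min_idx, x[min_idx])(n)  # lower envelope
    component = (upper + lower) / 2.0            # speech modal component
    trend = x - component                        # speech trend term
    return component, trend
```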
Step 108, training an initial neural network model based on the speech modal components and the target speech trend term to obtain a speech processing model.
In some embodiments, the speech modal components of different frequencies are grouped by frequency and, together with the target speech trend term, used to train an initial neural network model (such as a long short-term memory (LSTM) model or a convolutional network model), yielding a plurality of speech processing models.
In other embodiments, a data matrix is constructed from the speech modal components of different frequencies and the target speech trend term, and the initial neural network model is trained on the speech sequence of the data matrix to obtain a speech processing model.
To sum up, the training method of the embodiments of the present disclosure obtains an original speech sequence (with labeled text) and adds a preset number of white Gaussian noise sequences to obtain a preset number of speech sequences to be trained; performs empirical mode decomposition on each of them to obtain speech modal components of different frequencies and a target speech trend term; and trains a neural network on those components and the trend term to obtain the speech processing model. Adding independently distributed white Gaussian noise to the original speech sequence improves decomposition efficiency, and obtaining the model from the decomposed modal components and trend term improves the model's speech processing accuracy while also improving training efficiency.
Based on the above description of the embodiments, and in order to further meet the requirements of the scene and improve the accuracy of the speech processing model, a detailed description follows with reference to fig. 2. As shown in fig. 2, the method mainly includes the following steps S202 to S212:
step 202, obtaining an original voice sequence and corresponding user basic information, a voice scene and a calculation mode, and performing parameter table query processing based on the user basic information, the voice scene and the calculation mode to obtain a standard deviation of white gaussian noise.
Step 204, acquiring the sequence length, voice scene and calculation mode of the original speech sequence, and determining the preset number based on the sequence length, voice scene and calculation mode.
Step 206, adding the preset number of white Gaussian noise sequences to the original speech sequence to obtain the preset number of speech sequences to be trained.
The user basic information refers to basic information such as the gender and age of the user. The voice scene refers to scenes such as an interaction scene, a reading-aloud scene, or a recognition scene. As an example, taking an online education scene: a scene in which a child answers a question posed by a teacher by voice is an interaction scene; a scene in which a child reads aloud the content shown on a display device is a reading-aloud scene; and when a child speaks the answer "XXX" by voice while answering a question, the scene of recognizing the answer "XXX" is a recognition scene. Interaction scenes derived from the original speech sequence differ, and the standard deviation of the white Gaussian noise differs accordingly. The calculation mode refers to a fast mode, a simple mode, a full mode, and the like. As an example, again taking an online education scene: acquiring only the speech data cached by the terminal for analysis, a mode that obtains the original speech sequence with fast calculation, is the fast mode; acquiring all the speech data stored on the cloud platform for analysis, a mode that obtains the original speech sequence more accurately, is the full mode; and acquiring the terminal-cached speech data while randomly drawing partial data from the cloud platform for analysis, a mode that balances efficiency and accuracy to some extent, is the simple mode. Calculation modes derived from the original speech sequence differ, and the standard deviation of the white Gaussian noise differs accordingly.
The standard deviations of the white Gaussian noise for different user basic information (such as age and gender), voice scenes, and calculation modes can be obtained by analyzing historical speech data and compiled into a parameter table; subsequently, the table is queried directly with the user basic information, voice scene, and calculation mode to obtain the standard deviation.
Similarly, the preset numbers for different sequence lengths, voice scenes, and calculation modes can be obtained from historical speech data analysis and compiled into a parameter table, which is then queried directly with the sequence length, voice scene, and calculation mode to obtain the preset number.
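A minimal sketch of the two table lookups follows; the table keys and entries are illustrative assumptions (apart from the 30% fast-mode interaction example cited later), since the patent derives them from historical speech data.

```python
# Illustrative parameter tables; real entries come from analysis of
# historical speech data as described above.
NOISE_STD_TABLE = {
    # (age band, gender, voice scene, calculation mode) -> noise std
    ("child", "female", "interaction", "fast"): 0.30,
    ("child", "male", "interaction", "fast"): 0.30,
    ("adult", "female", "recognition", "full"): 0.10,
}

PRESET_NUMBER_TABLE = {
    # (sequence-length bucket, voice scene, calculation mode) -> t
    ("short", "interaction", "fast"): 8,
    ("long", "interaction", "fast"): 16,
}

def lookup_noise_std(age, gender, scene, mode, default=0.2) -> float:
    return NOISE_STD_TABLE.get((age, gender, scene, mode), default)

def lookup_preset_number(seq_len: int, scene: str, mode: str) -> int:
    bucket = "short" if seq_len < 16000 else "long"  # assumed bucketing
    return PRESET_NUMBER_TABLE.get((bucket, scene, mode), 10)
```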
For example, let the original speech sequence be $x(n)$, where $n$ denotes time and $x(n)$ the value varying with time. Adding a preset number $t$ of normally distributed white Gaussian noise sequences to the child speech sequence $x(n)$ yields $t$ speech sequences to be trained, as shown in formula (1):

$$x_i(n) = x(n) + w_i(n), \quad i = 1, 2, \ldots, t \qquad (1)$$

where $x(n)$ is the original speech sequence and $w_i(n)$, $i = 1, \ldots, t$, are the white Gaussian noise sequences. The standard deviation of the noise is an adjustable parameter that can be obtained from the parameter table query described above; for example, the standard deviation for the fast mode of the interaction scene is 30%, and the white noise added to each copy of the original speech sequence uses the same standard deviation. In addition, the parameter table can be built according to user age, gender, and the like, i.e., users of the same age and gender use the same standard deviation, further improving the accuracy of subsequent processing.

Here $t$, the number of added white Gaussian noise sequences, is also a controllable parameter with some dependence on the user environment and the electronic device; its value depends mainly on the sequence length. For example, based on the parameter table described above, the fast mode of the interaction scene may take $t$ as the integer part of a preset function of the sequence length.
Step 208, performing the N-th empirical mode decomposition on each speech sequence to be trained to obtain the preset number of N-order speech modal components; averaging the preset number of N-order speech modal components to obtain a target N-order speech modal component; and computing the difference between the original speech sequence and the target N-order speech modal component to obtain an N-order speech trend term.
Step 210, when the amplitude of the N-order speech trend term is not less than the preset amplitude threshold, taking the N-order speech trend term as the original speech sequence, adding the preset number of white Gaussian noise sequences, and performing the (N+1)-th empirical mode decomposition; the decomposition stops when the amplitude of the resulting target speech trend term falls below the preset amplitude threshold, yielding speech modal components of different frequencies and the target speech trend term.
Continuing with the above example: for the $t$ speech sequences to be trained, the empirical mode method yields a first-order speech modal component $c_{1,i}(n)$ for each sequence, and the average of the $t$ first-order components is taken as the first component of the original speech sequence (i.e., the target first-order speech modal component). Subtracting it from the original speech sequence gives a residual value, the speech residual of the first round, i.e., the first-order speech trend term. The specific calculation is shown in formulas (2) and (3):

$$c_1(n) = \frac{1}{t} \sum_{i=1}^{t} c_{1,i}(n) \qquad (2)$$

$$r_1(n) = x(n) - c_1(n) \qquad (3)$$

where $t$ is the number of added white Gaussian noise sequences. Formula (2) averages all the first-order speech modal components obtained from the $t$ noise-augmented speech sequences to be trained, and the mean $c_1(n)$ serves as the target first-order speech modal component of the original child speech sequence. The subscript 1 denotes the target first-order speech modal component; the other-order modal components of the original speech sequence are calculated analogously later.

Subtracting the computed target first-order speech modal component from the original child speech sequence yields the residual value $r_1(n)$ in formula (3), the first-order speech residual corresponding to the target first-order speech modal component, i.e., the first-order speech trend term.

Taking the first-order speech trend term as the original speech sequence, adding adaptive white Gaussian noise, and performing empirical mode decomposition again yields the speech modal component and speech trend term of each order. This is repeated until the speech trend term can no longer be decomposed, i.e., its amplitude is smaller than the preset amplitude threshold, indicating that the trend term is a monotonic function or a constant. Finally, $k$ orthogonal speech modal components and a target trend term (i.e., the final speech residual) $r_k(n)$ are obtained, so the original speech sequence decomposes as shown in formula (4):

$$x(n) = \sum_{j=1}^{k} c_j(n) + r_k(n) \qquad (4)$$

where $k$ is the number of speech modal components formed when the above process can no longer continue, the corresponding speech residual $r_k(n)$ is the target speech trend term, and $c_j(n)$ denotes the speech modal components of each order obtained by repeating the process.
In the embodiments of the present disclosure, the preset amplitude threshold is inversely proportional to the number of classification categories of the voice scene; that is, the threshold is an external parameter set according to the specific scene and required precision. For example, when the final classification categories are few, such as a yes/no click by the user, a handful of options, or user emotion recognition, the amplitude threshold is set larger; when there are more categories, such as recognizing spoken characters, the amplitude threshold is set smaller. Understandably, a larger amplitude threshold produces fewer speech modal components and faster subsequent neural network computation; a smaller threshold produces more components and slower computation.
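Putting the loop together, a minimal sketch of the ensemble decomposition of steps 208 and 210 follows, reusing the make_training_sequences and decompose_once helpers from the sketches above; the stopping test on the residual amplitude and the function name are illustrative.

```python
import numpy as np

def eemd(x: np.ndarray, t: int, noise_std: float, amp_thresh: float):
    """Return the speech modal components of each order and the target
    speech trend term (formulas (1)-(4))."""
    components, residual = [], x.astype(float).copy()
    while np.max(np.abs(residual)) >= amp_thresh:  # can still be decomposed
        ensemble = make_training_sequences(residual, t, noise_std)
        # Average the first-order components of the t noisy copies
        # (formula (2)), then form the next residual (formula (3)).
        c = np.mean([decompose_once(seq)[0] for seq in ensemble], axis=0)
        components.append(c)
        residual = residual - c
    return components, residual  # residual is the target trend term
```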
Step 212, dividing the speech modal components by frequency to obtain a first-frequency speech modal component and a second-frequency speech modal component, and training the initial neural network model on the first-frequency component, the second-frequency component and the target speech trend term to obtain a plurality of speech processing models.
In some embodiments, the speech modal components of different frequencies may be divided into three parts: a first-frequency speech modal component (e.g., a high-frequency term), a second-frequency speech modal component (e.g., a low-frequency term), and the target speech trend term. Training the initial neural network model, e.g., an LSTM model, on each of the three parts yields trained LSTM models, i.e., a plurality of speech processing models; this improves training efficiency and reduces the number of LSTM models while preserving speech processing accuracy. For example, as shown in fig. 3, suppose there are 10 speech modal components, sorted by frequency as components 1 to 10. The 10 components are reconstructed into three inputs: the first 40% (components 1 to 4) are combined into a high-frequency speech modal component, the remaining 60% (components 5 to 10) are combined into a low-frequency speech modal component, and the target speech trend term is the third input. An LSTM model is then established for each of the three inputs, which are fed to the long short-term memory network for training, producing three speech processing models. This greatly reduces training and computation time, and is particularly suitable when efficiency is required and training data is scarce.
The LSTM neural network comprises an input gate, an output gate, a forget gate, and a memory cell. The training parameters of the LSTM model can be set according to the application scene; illustratively: 200 hidden units in the LSTM layer, a maximum of 200 training iterations, a gradient threshold of 1, an initial learning rate of 0.005 multiplied by a factor of 0.2 after 125 iterations, a prediction step of 1, and mean absolute error or root mean square error as the loss function. The training data are fed into the constructed LSTM model, the model with the smallest error is taken as the speech processing model, and combining the obtained eigenmode functions with the long short-term memory network allows the speech data to be analyzed and processed better.
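A minimal PyTorch sketch of this training setup follows, using the hyperparameters listed above (200 hidden units, up to 200 iterations, gradient threshold 1, learning rate 0.005 decayed by a factor of 0.2 after 125 iterations, mean absolute error loss); the model class, data shapes, and the choice of Adam as optimizer are assumptions.

```python
import torch
import torch.nn as nn

class SpeechLSTM(nn.Module):
    def __init__(self, n_features: int, n_outputs: int):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size=200, batch_first=True)
        self.head = nn.Linear(200, n_outputs)

    def forward(self, x):             # x: (batch, time, features)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])  # predict from the last time step

def train(model, batches, epochs=200):
    opt = torch.optim.Adam(model.parameters(), lr=0.005)
    sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[125], gamma=0.2)
    loss_fn = nn.L1Loss()             # mean absolute error
    for _ in range(epochs):
        for x, y in batches:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            # Gradient threshold of 1, as in the parameters above.
            nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            opt.step()
        sched.step()
```

One such model would be trained per input (high-frequency component, low-frequency component, trend term), keeping the instance with the smallest error as the speech processing model.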
Note that training is not limited to the LSTM model over the three subdivided parts. Alternatively, a t-test (Student's t-test) may be performed on the speech modal components of different frequencies: components whose significance result is >0.05 are summed into the high-frequency speech modal component, the rest form the low-frequency speech modal component, and the target speech trend term is the third part. These three parts may be trained, for example, with 3 convolutional layers + 1 fully connected layer, with convolution kernels of length 17, 7, and 5, to obtain a speech processing model. For children's speech, a model trained this way can quickly confirm the option a child expresses by voice (that is, which answer or button the child chose when several are available).
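The t-test grouping can be sketched as follows; the text does not specify the reference for the test, so a one-sample test against a zero mean is assumed here, as is the function name.

```python
import numpy as np
from scipy import stats

def split_by_ttest(components: list[np.ndarray]):
    """Sum components with significance result > 0.05 into the
    high-frequency part; the rest form the low-frequency part."""
    high, low = [], []
    for c in components:
        _, p = stats.ttest_1samp(c, popmean=0.0)
        (high if p > 0.05 else low).append(c)
    return np.sum(high, axis=0), np.sum(low, axis=0)
```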
The above method provided by the embodiments of the present disclosure determines the standard deviation of the white Gaussian noise from the specific scene, adds the preset number of white Gaussian noise sequences to the original speech sequence to obtain the preset number of speech sequences to be trained, determines the decomposed speech modal components of different frequencies and the target speech trend term from the specific scene, and trains on the components grouped by frequency to obtain a plurality of speech processing models. This further satisfies the requirements of the scene, improves the efficiency of speech processing model training, meets the speech processing needs of different scenes, and improves the accuracy of personalized scene speech processing.
Based on the above description of the embodiments, the present disclosure may also perform speech processing model training based on a convolutional neural network, which is described in detail below with reference to fig. 4. As shown in fig. 4, the method mainly includes the following steps S402 to S404:
step 402, constructing a data matrix based on the voice modality components and the target voice trend item.
Step 404, inputting the speech sequence of the data matrix into a convolutional neural network for training to obtain a speech processing model.
In some embodiments, the speech modal components and the target speech trend term are arranged as a two-dimensional data matrix and placed in different channels; features of the modal components and trend term in the data matrix are extracted by convolution kernels. The speech sequence of the data matrix can be split into a training set and a test set; the training set is fed into the convolutional neural network model, which is trained by gradient descent via the back-propagation algorithm to obtain the speech processing model.
For example, as shown in fig. 5, white Gaussian noise is added to the original speech sequence to obtain the speech sequences to be trained, and empirical mode decomposition of those sequences yields speech modal components 1 to 10 and the target speech trend term, from which a data matrix is constructed:

$$D = \begin{bmatrix} c_1(1) & c_1(2) & \cdots & c_1(n) \\ c_2(1) & c_2(2) & \cdots & c_2(n) \\ \vdots & \vdots & \ddots & \vdots \\ c_{10}(1) & c_{10}(2) & \cdots & c_{10}(n) \\ r(1) & r(2) & \cdots & r(n) \end{bmatrix}$$

The speech modal components 1 to 10 and the target speech trend term correspond to channels 1 to 11, respectively, and are input to the convolutional neural network, for example with 2 convolutional layers + 2 pooling layers + 1 fully connected layer, convolution kernels of 5 x 5 and 3 x 3, and 2 x 2 average pooling for both pooling layers; the speech processing model is obtained by gradient-descent training.
Note that when the speech modal components 1 to 10 differ in length, the shorter components may be padded with the maximum value (the maximum over all the speech modal components).
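A minimal PyTorch sketch of this convolutional variant follows (2 convolutional + 2 average-pooling + 1 fully connected layer, 5 x 5 and 3 x 3 kernels, 2 x 2 pooling); treating the 11 matrix rows as input channels of a two-dimensional reshaped signal, and the channel widths and class count, are assumptions.

```python
import torch
import torch.nn as nn

class SpeechCNN(nn.Module):
    def __init__(self, n_channels: int = 11, n_classes: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(n_channels, 16, kernel_size=5, padding=2),  # 5 x 5 kernel
            nn.ReLU(),
            nn.AvgPool2d(2),                                      # 2 x 2 average pooling
            nn.Conv2d(16, 32, kernel_size=3, padding=1),          # 3 x 3 kernel
            nn.ReLU(),
            nn.AvgPool2d(2),
        )
        self.classify = nn.LazyLinear(n_classes)                  # fully connected layer

    def forward(self, x):          # x: (batch, 11, height, width)
        return self.classify(self.features(x).flatten(1))
```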
According to the method provided by the embodiment of the disclosure, the data matrix is constructed based on the voice modal component and the target voice trend item, and the voice processing model is obtained by training through the convolutional neural network, so that the training efficiency and accuracy of the voice processing model are improved, and the subsequent voice processing precision is improved.
Based on the description of the above embodiments, the speech processing model trained in the embodiments of the present disclosure can improve the accuracy and efficiency of subsequent speech processing, which is described in detail below with reference to fig. 6. As shown in fig. 6, the method mainly includes the following steps S602 to S606:
step 602, a to-be-processed speech sequence is obtained.
Step 604, performing empirical mode decomposition based on the speech sequence to be processed to obtain a current speech modal component and a current speech trend term.
The voice sequence to be processed may be a voice sequence obtained by preprocessing voice data acquired by a sound acquisition device such as a microphone in the electronic device, or a voice sequence obtained by preprocessing the voice data received by the electronic device executing the voice processing method, and the acquisition mode is not limited here.
The current speech modal component is obtained by taking all maximum points and all minimum points of the speech sequence to be processed, fitting each set into an upper and a lower envelope, and averaging the two envelopes; the current speech trend term is the residual value obtained by subtracting the computed current speech modal component from the processed speech sequence.
In some embodiments, performing empirical mode decomposition on the speech sequence to be processed can be understood as follows: take all maximum and minimum points of the sequence, fit them to obtain upper and lower envelopes, and average the envelopes to obtain a first-order speech modal component; derive a first-order speech trend term from it; if the amplitude of the first-order trend term is not less than the preset amplitude threshold, treat the trend term as the speech sequence to be processed and continue the empirical mode decomposition; the decomposition stops when the amplitude of the resulting current speech trend term falls below the threshold, yielding current speech modal components of different frequencies and a current speech trend term.
The preset amplitude threshold may be selected and set according to the application scene needs, and illustratively, the preset amplitude threshold is inversely proportional to the number of classification categories of the voice scene.
Step 606, inputting the current speech modal component and the current speech trend term into the speech processing model for processing, and obtaining a speech processing result.
In some embodiments, the current speech modal components are divided by frequency to obtain a first current-frequency speech modal component and a second current-frequency speech modal component; the first component, the second component, and the current speech trend term are each input into the corresponding speech processing model, producing a plurality of speech processing results, and the target speech processing result is selected from them based on the current voice scene.
In other embodiments, a current data matrix is constructed based on the current speech modal component and the current speech trend term, and the current data matrix is input into the speech processing model for processing to obtain a speech processing result.
In order to make it more clear to those skilled in the art how to perform speech processing for a specific scene and ensure speech processing accuracy, the following is described in detail with reference to fig. 7. As shown in fig. 7, the method mainly includes the following steps S702 to S706:
step 702, acquiring a voice sequence to be processed, and performing empirical mode decomposition based on the voice sequence to be processed to obtain a current voice mode component and a current voice trend item.
The voice sequence to be processed may be a voice sequence obtained by preprocessing voice data acquired by a sound acquisition device such as a microphone in the electronic device, or a voice sequence obtained by preprocessing the voice data received by the electronic device executing the voice processing method, and the acquisition mode is not limited here.
The current speech modal component is obtained by taking all maximum points and all minimum points of the speech sequence to be processed, fitting each set into an upper and a lower envelope, and averaging the two envelopes; the current speech trend term is the residual value obtained by subtracting the computed current speech modal component from the processed speech sequence.
In some embodiments, performing empirical mode decomposition on the speech sequence to be processed can be understood as follows: take all maximum and minimum points of the sequence, fit them to obtain upper and lower envelopes, and average the envelopes to obtain a first-order speech modal component; derive a first-order speech trend term from it; if the amplitude of the first-order trend term is not less than the preset amplitude threshold, treat the trend term as the speech sequence to be processed and continue the empirical mode decomposition; the decomposition stops when the amplitude of the resulting current speech trend term falls below the threshold, yielding current speech modal components of different frequencies and a current speech trend term.
Step 704, dividing the current speech modal components by frequency to obtain a first current-frequency speech modal component and a second current-frequency speech modal component.
Step 706, inputting the first current-frequency speech modal component, the second current-frequency speech modal component and the current speech trend term into the respective speech processing models, obtaining a plurality of speech processing results, and obtaining a target speech processing result from them based on the current voice scene.
Illustratively, the original speech sequence is a child speech sequence with a labeled text Y. The modally decomposed speech components are divided into three parts (high frequency, low frequency, and trend); each part corresponds to one LSTM, each trained against the label Y, so that a plurality of speech processing models are obtained (a high-frequency, a low-frequency, and a trend speech processing model).
As a scene example, at inference time a child speech sequence to be processed is fed through the above steps into the plurality of speech processing models, producing a plurality of speech processing results; for example, the prediction of the high-frequency speech processing model is taken as the target result in a recognition scene, and the prediction of the low-frequency speech processing model is taken as the target result in an interaction scene, further improving speech processing accuracy.
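A minimal sketch of this scene-based selection follows; the scene names and model keys are illustrative assumptions.

```python
def pick_target_result(results: dict, scene: str):
    """results maps a model name ('high', 'low', 'trend') to its
    prediction; pick the one suited to the current voice scene."""
    scene_to_model = {"recognition": "high", "interaction": "low"}
    return results[scene_to_model.get(scene, "trend")]

# e.g. pick_target_result({"high": "A", "low": "B", "trend": "C"},
#                         "recognition") returns "A"
```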
To sum up, the speech processing method provided by the embodiments of the present disclosure obtains a speech sequence to be processed, performs empirical mode decomposition on it to obtain a current speech modal component and a current speech trend term, and inputs them into the speech processing model for processing to obtain a speech processing result, improving the accuracy and efficiency of the speech processing model.
Corresponding to the foregoing method for training a speech processing model, an embodiment of the present disclosure provides a device for training a speech processing model, and fig. 8 is a schematic structural diagram of the device for training a speech processing model according to an embodiment of the present disclosure, which may be implemented by software and/or hardware and may be generally integrated in an electronic device, as shown in fig. 8, the device 800 for training a speech processing model includes the following modules:
a first obtaining module 802, configured to obtain an original speech sequence.
An adding module 804, configured to add a preset amount of white gaussian noise to the original voice sequence to obtain a preset amount of voice sequences to be trained.
The first decomposition module 806 is configured to perform empirical mode decomposition on each to-be-trained speech sequence to obtain a speech mode component and a target speech trend term.
And a training obtaining module 808, configured to perform neural network training based on the speech modal component and the target speech trend term, and obtain a speech processing model.
The device provided by the embodiment of the disclosure improves the decomposition efficiency by adding the independently distributed white gaussian noise in the original voice sequence, obtains the voice processing model based on the decomposed voice modal component and the target voice trend item, and improves the voice processing accuracy of the voice processing model on the basis of improving the model training efficiency.
In some embodiments, the above apparatus further comprises: the first acquisition processing module is used for acquiring the user basic information, the voice scene and the calculation mode of the original voice sequence, and performing parameter table query processing based on the user basic information, the voice scene and the calculation mode to obtain the standard deviation of Gaussian white noise.
In some embodiments, the above apparatus further comprises: and the second acquisition processing module is used for acquiring the sequence length, the voice scene and the calculation mode of the original voice sequence and determining the number of the preset number based on the sequence length, the voice scene and the calculation mode.
In some embodiments, the first decomposition module 806 is specifically configured to: perform the N-th empirical mode decomposition on each speech sequence to be trained to obtain the preset number of N-order speech modal components, where N is a positive integer; average the preset number of N-order speech modal components to obtain a target N-order speech modal component; compute the difference between the original speech sequence and the target N-order speech modal component to obtain an N-order speech trend term; and, when the amplitude of the N-order speech trend term is not less than the preset amplitude threshold, take the N-order trend term as the original speech sequence, add the preset number of white Gaussian noise sequences, and perform the (N+1)-th empirical mode decomposition, stopping when the amplitude of the resulting target speech trend term falls below the preset amplitude threshold, so as to obtain speech modal components of different frequencies and the target speech trend term.
In some embodiments, the preset amplitude threshold is inversely proportional to the number of classification classes of the speech scene.
In some embodiments, the training acquisition module 808 is specifically configured to: dividing the voice modal component based on the frequency of the voice modal component to obtain a first frequency voice modal component and a second frequency voice modal component; training the initial neural network model based on the first frequency speech modal component, the second frequency speech modal component and the target speech trend term to obtain a plurality of speech processing models.
In some embodiments, the training acquisition module 808 is specifically configured to: and constructing a data matrix based on the voice modal component and the target voice trend item, and inputting the data matrix into a convolutional neural network for training to obtain a voice processing model.
The training device of the speech processing model provided by the embodiment of the disclosure can execute the training method of the speech processing model provided by any embodiment of the disclosure, and has corresponding functional modules and beneficial effects of the execution method.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatus embodiments may refer to corresponding processes in the method embodiments, and are not described herein again.
Corresponding to the foregoing speech processing method, an embodiment of the present disclosure provides a speech processing apparatus, and fig. 9 is a schematic structural diagram of a speech processing apparatus provided in an embodiment of the present disclosure, which may be implemented by software and/or hardware and may be generally integrated in an electronic device, as shown in fig. 9, the speech processing apparatus 900 includes the following modules:
a second obtaining module 902, configured to obtain a to-be-processed speech sequence.
And a second decomposition module 904, configured to perform empirical mode decomposition based on the to-be-processed speech sequence to obtain a current speech mode component and a current speech trend term.
And the processing module 906 is configured to input the speech processing model for processing based on the current speech modality component and the current speech trend item, and obtain a speech processing result.
According to the apparatus provided by the embodiments of the present disclosure, a speech sequence to be processed is obtained and empirical mode decomposition is performed on it to obtain a current speech modal component and a current speech trend term, which are input into the speech processing model for processing to obtain a speech processing result, improving the accuracy and efficiency of the speech processing model.
In some embodiments, the processing module 906 is specifically configured to: divide the current speech modal components by frequency to obtain a first current-frequency speech modal component and a second current-frequency speech modal component; input the first component, the second component, and the current speech trend term into the respective speech processing models to obtain a plurality of speech processing results; and obtain the target speech processing result from them based on the current voice scene.
In some embodiments, the processing module 906 is specifically configured to: and constructing a current data matrix based on the current voice modal component and the current voice trend item, inputting the current data matrix into a voice processing model for processing, and obtaining a voice processing result.
The voice processing device provided by the embodiment of the disclosure can execute the voice processing method provided by any embodiment of the disclosure, and has corresponding functional modules and beneficial effects of the execution method.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatus embodiments may refer to corresponding processes in the method embodiments, and are not described herein again.
An exemplary embodiment of the present disclosure also provides an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor. The memory stores a computer program executable by the at least one processor, the computer program, when executed by the at least one processor, is for causing the electronic device to perform a training method or a speech processing method of a speech processing model according to an embodiment of the present disclosure.
The exemplary embodiments of the present disclosure also provide a non-transitory computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor of a computer, is for causing the computer to perform a training method or a speech processing method of a speech processing model according to an embodiment of the present disclosure.
Exemplary embodiments of the present disclosure also provide a computer program product comprising a computer program, wherein the computer program, when being executed by a processor of a computer, is adapted to cause the computer to carry out a method of training a speech processing model or a method of speech processing according to an embodiment of the present disclosure.
Referring to fig. 10, a block diagram of an electronic device 1000 is now described; the device may be a server or a client of the present disclosure and is an example of a hardware device to which aspects of the present disclosure may be applied. The electronic device is intended to represent various forms of digital electronic computer devices, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. It may also represent various forms of mobile devices, such as personal digital processing devices, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown here, their connections and relationships, and their functions are meant as examples only and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 10, the electronic device 1000 includes a computing unit 1001 that can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 1002 or a computer program loaded from a storage unit 1008 into a random access memory (RAM) 1003. In the RAM 1003, various programs and data necessary for the operation of the electronic device 1000 can also be stored. The computing unit 1001, the ROM 1002, and the RAM 1003 are connected to each other by a bus 1004. An input/output (I/O) interface 1005 is also connected to the bus 1004.
A number of components in the electronic device 1000 are connected to the I/O interface 1005, including: an input unit 1006, an output unit 1007, a storage unit 1008, and a communication unit 1009. The input unit 1006 may be any type of device capable of inputting information to the electronic device 1000; it may receive input numeric or character information and generate key signal inputs related to user settings and/or function controls of the electronic device. The output unit 1007 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, a video/audio output terminal, a vibrator, and/or a printer. The storage unit 1008 may include, but is not limited to, a magnetic disk and an optical disk. The communication unit 1009 allows the electronic device 1000 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers, and/or chipsets, such as Bluetooth devices, WiFi devices, WiMax devices, cellular communication devices, and/or the like.
The computing unit 1001 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 1001 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and the like. The computing unit 1001 executes the respective methods and processes described above. For example, in some embodiments, the training method of a speech processing model or the speech processing method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 1008. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 1000 via the ROM 1002 and/or the communication unit 1009. In some embodiments, the computing unit 1001 may be configured in any other suitable way (e.g., by means of firmware) to perform the training method of a speech processing model or the speech processing method.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
As used in this disclosure, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic disks, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It is noted that, in this document, relational terms such as "first" and "second" may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The foregoing describes merely exemplary embodiments of the present disclosure and is intended to enable those skilled in the art to understand or practice the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (15)

1. A method of training a speech processing model, comprising:
acquiring an original speech sequence;
adding a preset number of Gaussian white noise sequences to the original speech sequence to obtain the preset number of to-be-trained speech sequences;
performing empirical mode decomposition on each to-be-trained speech sequence to obtain speech modal components of different frequencies and a target speech trend term;
training an initial neural network model based on the speech modal components and the target speech trend term to obtain a speech processing model.
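For illustration, a minimal Python sketch of the noise-addition step in claim 1: one independent white Gaussian noise realization is added per copy of the original sequence. The names make_noisy_ensemble, preset_count, and noise_std are hypothetical, chosen here for clarity rather than taken from the disclosure.

    import numpy as np

    def make_noisy_ensemble(original, preset_count, noise_std):
        # One independent white Gaussian noise realization per copy,
        # yielding `preset_count` to-be-trained sequences (EEMD-style ensemble).
        rng = np.random.default_rng(seed=0)
        return [original + rng.normal(0.0, noise_std, size=original.shape)
                for _ in range(preset_count)]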
2. The method of training a speech processing model of claim 1, further comprising:
acquiring basic user information, a speech scene, and a calculation mode of the original speech sequence;
determining the standard deviation of the preset number of Gaussian white noise sequences based on the basic user information, the speech scene, and the calculation mode.
3. The method of training a speech processing model of claim 1, further comprising:
acquiring the sequence length, the speech scene, and the calculation mode of the original speech sequence;
determining the preset number based on the sequence length, the speech scene, and the calculation mode.
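Claims 2 and 3 leave the exact mappings unspecified. Purely as a labeled-hypothetical heuristic (the tables, scalings, and scene names below are editorial assumptions, not part of the disclosure), the two hyperparameters might be derived like this:

    def choose_noise_params(seq_len, scene, calc_mode):
        # Hypothetical heuristic: the disclosure gives no concrete formulas,
        # so the lookup values and scaling factors here are assumptions.
        base_std = {"classroom": 0.01, "dialogue": 0.02}.get(scene, 0.02)
        std = base_std * (0.5 if calc_mode == "fast" else 1.0)
        count = max(8, min(64, seq_len // 1000))  # more copies for longer input
        return std, count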
4. The method of training a speech processing model according to claim 1, wherein performing empirical mode decomposition on each to-be-trained speech sequence to obtain speech modal components of different frequencies and a target speech trend term comprises:
performing an N-th empirical mode decomposition on each to-be-trained speech sequence to obtain the preset number of N-order speech modal components, wherein N is a positive integer;
averaging the preset number of N-order speech modal components to obtain a target N-order speech modal component;
computing the difference between the original speech sequence and the target N-order speech modal component to obtain an N-order speech trend term;
when the amplitude of the N-order speech trend term is not smaller than a preset amplitude threshold, taking the N-order speech trend term as the original speech sequence, adding the preset number of Gaussian white noise sequences, and performing the (N+1)-th empirical mode decomposition, stopping only when the amplitude of the obtained target speech trend term is smaller than the preset amplitude threshold, thereby obtaining the speech modal components of different frequencies and the target speech trend term.
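A hedged sketch of the iterative loop in claim 4, assuming a helper extract_first_imf that returns the highest-frequency intrinsic mode function of a sequence (an EMD library such as PyEMD could plausibly supply one). Each round peels off one averaged modal component until the residual trend's amplitude drops below the preset threshold:

    import numpy as np

    def iterative_decomposition(original, preset_count, noise_std,
                                amp_threshold, extract_first_imf):
        rng = np.random.default_rng(seed=0)
        components, trend = [], np.asarray(original, dtype=float)
        while True:
            # N-th round: add fresh noise to every copy and decompose each one
            noisy = [trend + rng.normal(0.0, noise_std, trend.shape)
                     for _ in range(preset_count)]
            # Average the per-copy components into the target N-order component
            avg_imf = np.mean([extract_first_imf(s) for s in noisy], axis=0)
            components.append(avg_imf)
            # Difference calculation yields the N-order trend term
            trend = trend - avg_imf
            # Stop once the trend's amplitude falls below the preset threshold
            if np.max(np.abs(trend)) < amp_threshold:
                return components, trend

Per claim 5, amp_threshold would shrink as the number of classification categories of the speech scene grows (e.g. a constant divided by the category count), though the disclosure gives no explicit formula.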
5. The method of training a speech processing model according to claim 4, wherein the preset amplitude threshold is inversely proportional to the number of classification categories of the speech scene.
6. The method of training a speech processing model according to claim 1, wherein training an initial neural network model based on the speech modal components and the target speech trend term to obtain a speech processing model comprises:
dividing the speech modal components based on their frequencies to obtain first-frequency speech modal components and second-frequency speech modal components;
training the initial neural network model based on the first-frequency speech modal components, the second-frequency speech modal components, and the target speech trend term to obtain a plurality of speech processing models.
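The disclosure does not fix how a component's frequency is measured for the split in claim 6. Under one plausible reading, a crude dominant-frequency estimate per component suffices; the sketch below uses zero-crossing rate, with cutoff_hz as a hypothetical hyperparameter:

    import numpy as np

    def split_by_frequency(components, sample_rate, cutoff_hz):
        first, second = [], []  # first = higher-frequency group
        for c in components:
            # Zero-crossing count gives a rough dominant-frequency estimate
            crossings = np.count_nonzero(np.diff(np.signbit(c).astype(int)))
            dominant_hz = crossings * sample_rate / (2 * len(c))
            (first if dominant_hz >= cutoff_hz else second).append(c)
        return first, second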
7. The method of training a speech processing model according to claim 1, wherein training an initial neural network model based on the speech modal components and the target speech trend term to obtain a speech processing model comprises:
constructing a data matrix based on the speech modal components and the target speech trend term;
training the initial neural network model based on the data matrix to obtain the speech processing model.
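For claim 7, one natural (though unconfirmed) layout stacks each modal component and the trend term as rows of a single matrix consumed by the network:

    import numpy as np

    def build_data_matrix(components, trend):
        # Shape (K + 1, T): K modal components plus one trend row
        return np.vstack(list(components) + [trend])

With this layout, one (K+1, T) matrix per utterance can be batched directly into a convolutional or recurrent front end; the actual network architecture is not specified by the claim.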
8. A method of speech processing, comprising:
acquiring a to-be-processed speech sequence;
performing empirical mode decomposition on the to-be-processed speech sequence to obtain current speech modal components and a current speech trend term;
inputting the current speech modal components and the current speech trend term into a speech processing model for processing to obtain a speech processing result.
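A hedged end-to-end sketch of the inference path in claim 8, assuming extract_imfs returns the modal components and the residual trend of the incoming sequence, and model is any trained callable (all names are illustrative):

    import numpy as np

    def process_speech(model, sequence, extract_imfs):
        components, trend = extract_imfs(np.asarray(sequence, dtype=float))
        features = np.vstack(list(components) + [trend])
        return model(features)  # the speech processing result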
9. The speech processing method according to claim 8, wherein inputting the current speech modal components and the current speech trend term into the speech processing model for processing to obtain a speech processing result comprises:
dividing the current speech modal components based on their frequencies to obtain first current-frequency speech modal components and second current-frequency speech modal components;
separately inputting the first current-frequency speech modal components, the second current-frequency speech modal components, and the current speech trend term into the speech processing model for processing to obtain a plurality of speech processing results;
acquiring a target speech processing result from the plurality of speech processing results based on the current speech scene.
10. The speech processing method according to claim 8, wherein inputting the current speech modal components and the current speech trend term into the speech processing model for processing to obtain a speech processing result comprises:
constructing a current data matrix based on the current speech modal components and the current speech trend term;
inputting the current data matrix into the speech processing model for processing to obtain the speech processing result.
11. The speech processing method according to claim 8, further comprising:
acquiring a current interaction scene of the to-be-processed speech sequence;
matching a target speech processing model from a plurality of speech processing models according to the current interaction scene;
inputting the current speech modal components and the current speech trend term into the target speech processing model for processing.
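The scene-to-model matching of claim 11 amounts to a registry lookup; a minimal sketch, with all names hypothetical:

    def match_target_model(models_by_scene, current_scene,
                           default_scene="general"):
        # Fall back to a default model when the scene is unregistered
        return models_by_scene.get(current_scene,
                                   models_by_scene[default_scene])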
12. An apparatus for training a speech processing model, comprising:
a first acquisition module configured to acquire an original speech sequence;
an adding module configured to add a preset number of Gaussian white noise sequences to the original speech sequence to obtain the preset number of to-be-trained speech sequences;
a first decomposition module configured to perform empirical mode decomposition on each to-be-trained speech sequence to obtain speech modal components and a target speech trend term;
a training acquisition module configured to train an initial neural network model based on the speech modal components and the target speech trend term to obtain a speech processing model.
13. A speech processing apparatus comprising:
a second acquisition module configured to acquire a to-be-processed speech sequence;
a second decomposition module configured to perform empirical mode decomposition on the to-be-processed speech sequence to obtain current speech modal components and a current speech trend term;
a processing module configured to input the current speech modal components and the current speech trend term into a speech processing model for processing to obtain a speech processing result.
14. An electronic device, characterized in that the electronic device comprises:
a processor;
a memory for storing the processor-executable instructions;
the processor is configured to read the executable instructions from the memory and execute the instructions to implement the method for training a speech processing model according to any one of claims 1 to 7 or the method for speech processing according to any one of claims 8 to 11.
15. A computer-readable storage medium, characterized in that the storage medium stores a computer program for executing the method of training a speech processing model according to any one of claims 1 to 7 or the speech processing method according to any one of claims 8 to 11.
CN202210000504.8A 2022-01-04 2022-01-04 Training of speech processing model, speech processing method, apparatus, device and medium Active CN114023313B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210000504.8A CN114023313B (en) 2022-01-04 2022-01-04 Training of speech processing model, speech processing method, apparatus, device and medium

Publications (2)

Publication Number Publication Date
CN114023313A true CN114023313A (en) 2022-02-08
CN114023313B CN114023313B (en) 2022-04-08

Family

ID=80069519

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210000504.8A Active CN114023313B (en) 2022-01-04 2022-01-04 Training of speech processing model, speech processing method, apparatus, device and medium

Country Status (1)

Country Link
CN (1) CN114023313B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103117059A (en) * 2012-12-27 2013-05-22 北京理工大学 Voice signal characteristics extracting method based on tensor decomposition
CN105788603A (en) * 2016-02-25 2016-07-20 深圳创维数字技术有限公司 Audio identification method and system based on empirical mode decomposition
WO2017205382A1 (en) * 2016-05-23 2017-11-30 The University Of New Hampshire Techniques for empirical mode decomposition (emd)-based signal de-noising using statistical properties of intrinsic mode functions (imfs)
CN109785854A (en) * 2019-01-21 2019-05-21 福州大学 The sound enhancement method that a kind of empirical mode decomposition and wavelet threshold denoising combine
CN113253624A (en) * 2021-05-20 2021-08-13 金陵科技学院 Scene personalized service method based on Internet of things home furnishing
CN113851144A (en) * 2021-09-30 2021-12-28 山东大学 Voice signal denoising method based on improved variational modal decomposition and principal component analysis

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114882873A (en) * 2022-07-12 2022-08-09 深圳比特微电子科技有限公司 Speech recognition model training method and device and readable storage medium
CN114882873B (en) * 2022-07-12 2022-09-23 深圳比特微电子科技有限公司 Speech recognition model training method and device and readable storage medium

Also Published As

Publication number Publication date
CN114023313B (en) 2022-04-08

Similar Documents

Publication Publication Date Title
JP6393730B2 (en) Voice identification method and apparatus
CN110555095B (en) Man-machine conversation method and device
CN110766142A (en) Model generation method and device
CN111428010B (en) Man-machine intelligent question-answering method and device
CN108197652B (en) Method and apparatus for generating information
CN110415687A (en) Method of speech processing, device, medium, electronic equipment
JP2014142627A (en) Voice identification method and device
CN112289299B (en) Training method and device of speech synthesis model, storage medium and electronic equipment
US20210020160A1 (en) Sample-efficient adaptive text-to-speech
CN109960650B (en) Big data-based application program evaluation method, device, medium and electronic equipment
CN107705782B (en) Method and device for determining phoneme pronunciation duration
CN110942779A (en) Noise processing method, device and system
US20220238098A1 (en) Voice recognition method and device
CN110858226A (en) Conversation management method and device
CN114023313B (en) Training of speech processing model, speech processing method, apparatus, device and medium
CN113160819A (en) Method, apparatus, device, medium and product for outputting animation
CN108256632A (en) Information processing method and device
CN113327594B (en) Speech recognition model training method, device, equipment and storage medium
CN110502752A (en) A kind of text handling method, device, equipment and computer storage medium
CN110931040A (en) Filtering sound signals acquired by a speech recognition system
CN114171043B (en) Echo determination method, device, equipment and storage medium
CN115188389B (en) End-to-end voice enhancement method and device based on neural network
CN113408702A (en) Music neural network model pre-training method, electronic device and storage medium
CN109119089B (en) Method and equipment for performing transparent processing on music
CN112365046A (en) User information generation method and device, electronic equipment and computer readable medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant