CN118038887A - Mixed voice processing method, device, computer equipment and storage medium - Google Patents

Mixed voice processing method, device, computer equipment and storage medium Download PDF

Info

Publication number: CN118038887A
Authority: CN (China)
Prior art keywords: voice, speech, mixed, target, signal
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Application number: CN202410335647.3A
Other languages: Chinese (zh)
Inventor: 马士乾
Current Assignee: Chongqing Changan Automobile Co Ltd (the listed assignee may be inaccurate)
Original Assignee: Chongqing Changan Automobile Co Ltd
Application filed by: Chongqing Changan Automobile Co Ltd
Priority to: CN202410335647.3A
Publication of: CN118038887A

Abstract

The invention provides a mixed speech processing method and apparatus, a computer device and a storage medium, wherein the method comprises the following steps: acquiring a mixed voice signal, wherein the mixed voice signal comprises initial voice characteristics corresponding to a plurality of voice objects; encoding the initial voice characteristics corresponding to each voice object in the mixed voice signal to obtain a target mixed voice signal; performing voice separation on the target mixed voice signal to obtain a voice separation result, and decoding the voice separation result to obtain target voice characteristics corresponding to the voice objects; recognizing the target voice characteristics to obtain text information, and performing a text classification operation based on the text information. The invention solves the problems that existing mixed voice processing techniques have insufficient voice separation performance in complex environments and lack generality.

Description

Mixed voice processing method, device, computer equipment and storage medium
Technical Field
The present invention relates to the field of speech processing, and in particular, to a method and apparatus for processing mixed speech, a computer device, and a storage medium.
Background
With the continuous development of speech technology, mixed speech processing has become an important issue in complex speech environments. The core of mixed speech processing is to extract the speech of each individual target from the mixed speech data, i.e. speech separation. Speech separation is particularly important in complex environments that contain noise, reverberation and the like, and serves as a key front-end technology for downstream applications such as speech recognition.
Traditional frequency-domain processing methods have limitations when handling non-stationary signals and struggle to guarantee both the quality and the real-time performance of speech separation. Time-domain speech separation methods, although computationally more efficient, still face challenges in complex speech environments, such as noise interference and reverberation. In addition, existing speech separation algorithms often require customization and optimization for specific scenarios in order to cope with different scenes and requirements, and therefore lack generality and flexibility.
Disclosure of Invention
In view of the above, the embodiments of the present invention provide a method, an apparatus, a computer device, and a storage medium for processing mixed speech, so as to solve the problems of insufficient speech separation performance and lack of versatility of the existing mixed speech processing technology in a complex environment.
In a first aspect, an embodiment of the present invention provides a method for processing mixed speech, where the method includes:
acquiring a mixed voice signal, wherein the mixed voice signal comprises initial voice characteristics corresponding to a plurality of voice objects;
encoding initial voice characteristics corresponding to each voice object in the mixed voice signal to obtain a target mixed voice signal;
Performing voice separation on the target mixed voice signal to obtain a voice separation result, and decoding the voice separation result to obtain target voice characteristics corresponding to the voice object;
and identifying the target voice characteristics to obtain text information, and executing text classification operation based on the text information.
Further, the encoding the initial voice feature corresponding to each voice object in the mixed voice signal to obtain a target mixed voice signal includes:
Performing Fourier transform on the mixed voice signal to obtain a transformed mixed voice signal;
inputting the transformed mixed voice signal into an encoder;
mapping initial voice characteristics corresponding to each voice object in the transformed mixed voice signal into voice characteristics of a target dimension through the encoder;
The target mixed speech signal is constructed based on the speech characteristics of the individual speech objects.
Further, the performing the voice separation on the target mixed voice signal to obtain a voice separation result includes:
inputting the target mixed voice signal into a voice separation network;
And recognizing the voice characteristics of each voice object through the voice separation network, and separating the voice characteristics of the voice objects to obtain the voice separation result.
Further, the decoding the voice separation result to obtain the target voice feature corresponding to the voice object includes:
calculating voice mask features corresponding to the voice objects based on the voice features of the voice objects;
and decoding the voice mask features corresponding to the voice objects to obtain target voice features corresponding to the voice objects.
Further, the decoding the voice mask feature corresponding to the voice object to obtain a target voice feature corresponding to the voice object includes:
and performing dot multiplication on the voice mask characteristics and the voice characteristics corresponding to the voice objects in the target mixed voice signals to obtain target voice characteristics corresponding to the voice objects.
Further, before performing a text classification operation based on the text information, the method further includes:
acquiring a training data set, wherein the training data set comprises a text sample and a real classification label;
Inputting the text sample into an initial language model to obtain sample characteristics;
Processing the sample characteristics by using the full connection layer to obtain corresponding classification results;
and updating model parameters of the initial language model by using the classification result and the real classification label to obtain a text classification model.
Further, updating the model parameters of the initial language model by using the classification result and the real classification label to obtain a text classification model, including:
Calculating a loss value between the classification result and the real classification label;
calculating a model gradient of the loss value to the initial language model by using a back propagation algorithm;
And updating model parameters of the initial language model according to the model gradient and an optimization algorithm until the model gradient of the initial language model reaches a preset condition, and taking the initial language model as the text classification model.
In a second aspect, an embodiment of the present invention provides a processing apparatus for mixed speech, where the apparatus includes:
An acquisition module, configured to acquire a mixed voice signal, wherein the mixed voice signal comprises initial voice characteristics corresponding to a plurality of voice objects;
the coding module is used for coding initial voice characteristics corresponding to each voice object in the mixed voice signal to obtain a target mixed voice signal;
The separation module is used for carrying out voice separation on the target mixed voice signal to obtain a voice separation result, and decoding the voice separation result to obtain target voice characteristics corresponding to the voice object;
and the classification module is used for identifying the target voice characteristics to obtain text information and executing text classification operation based on the text information.
In a third aspect, an embodiment of the present invention provides a computer apparatus, including: the memory and the processor are in communication connection, the memory stores computer instructions, and the processor executes the computer instructions to perform the method of the first aspect or any implementation manner corresponding to the first aspect.
In a fourth aspect, an embodiment of the present invention provides a computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of the first aspect or any of its corresponding embodiments.
The method provided by the embodiment of the application has the following beneficial effects:
The method provided by the embodiment of the application provides basic data for subsequent steps (such as voice separation, feature extraction, recognition and the like) by acquiring the mixed voice signal. The mixed speech signal contains initial speech characteristics of a plurality of speech objects, from which speech information of a plurality of persons can be extracted. The method has obvious scene effects on multi-person conversations, conference records, multi-person voice interactions and the like, and can realize independent analysis and processing of a plurality of voice objects. By acquiring and processing the mixed voice signal, the efficiency of voice processing is improved, and the computing resource and time are saved.
The method provided by the embodiment of the application converts the mixed voice signal from the time domain to the frequency domain by carrying out Fourier transform, which is helpful for extracting and analyzing the frequency domain characteristics of voice in the subsequent steps. The speech features in the frequency domain are mapped to the speech features in the target dimension by the encoder in order to efficiently learn and extract complex features in the speech signal. By constructing the target mixed speech signal based on the speech characteristics of the individual speech objects, a new signal is generated that is representative of the original mixed speech signal, which is more efficient and effective in subsequent processing, such as speech separation.
The method provided by the embodiment of the application utilizes the special voice separation network to accurately identify and separate each voice object characteristic in the mixed voice signal, thereby effectively reducing noise and interference and enabling the target voice to be clearer. By calculating and decoding the voice mask characteristics, the original voice signal is accurately restored, and the efficiency and accuracy of voice processing are improved. The voice mask features are decoded through the dot multiplication operation, so that high-dimension voice features are obtained, and high-quality data support is provided for subsequent voice recognition and text classification.
The method provided by the embodiment of the application can accurately extract the text information by identifying the target voice characteristics. The efficient conversion of voice features and text information is realized. And the text classification operation is executed based on the extracted text information, so that automatic text processing and analysis can be realized, the text processing efficiency is improved, and the manual intervention requirement is reduced. By classifying text information, powerful data support can be provided for subsequent decisions and applications.
The method provided by the embodiment of the application can train the initial language model in a targeted way by using the training data set which comprises the text sample and the corresponding real classification label. The text sample is input into the initial language model to obtain sample characteristics, so that the characteristic extraction and representation of text data are realized, and effective data input is provided for subsequent classification tasks. And processing the sample characteristics by using the full connection layer to obtain corresponding classification results, and realizing weighting and activating the extracted characteristics so as to generate classification decisions. By continuously adjusting the parameters of the full connection layer, the classification performance of the model can be optimized. The method realizes the adjustment and optimization of the initial language model parameters by calculating the loss value between the classification result and the real classification label and calculating the model gradient by using a back propagation algorithm.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the present invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method of processing mixed speech according to some embodiments of the invention;
FIG. 2 is a model framework diagram of an encoder and decoder according to some embodiments of the present invention;
FIG. 3 is a schematic diagram of a voice separation network according to some embodiments of the present invention;
FIG. 4 is a schematic diagram of a text classification model according to some embodiments of the invention;
fig. 5 is a block diagram of a processing apparatus for mixed speech according to an embodiment of the present invention;
fig. 6 is a schematic diagram of a hardware structure of a computer device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
According to embodiments of the present invention, there is provided a method, apparatus, computer device and storage medium for processing mixed speech, it should be noted that the steps illustrated in the flowcharts of the drawings may be performed in a computer system such as a set of computer executable instructions, and that although a logical order is illustrated in the flowcharts, in some cases, the steps illustrated or described may be performed in an order different from that herein.
In this embodiment, a method for processing mixed speech is provided, fig. 1 is a flowchart of a method for processing mixed speech according to an embodiment of the present invention, and as shown in fig. 1, the flowchart includes the following steps:
step S11, a mixed voice signal is obtained, wherein the mixed voice signal comprises initial voice characteristics corresponding to a plurality of voice objects.
It should be noted that, sound signals generated simultaneously by a plurality of voice objects are captured or received from an actual environment. These speech objects may be different persons, different sound sources, or different speech segments of the same person. The mixed speech signal is characterized by speech features comprising a plurality of speech objects, which features may overlap and interleave each other in the time and frequency domain. Specifically, the mixed voice signal can be derived from various scenes, such as a conference room where multiple persons are simultaneously talking, a telephone conversation where multiple persons participate, or an intelligent home environment where multiple persons are simultaneously giving instructions, etc. In these scenarios, the speech signals may interfere with each other and overlap due to factors such as different sound sources, different distances, different directions, etc., forming a mixed speech signal.
In embodiments of the present application, the process of acquiring the mixed speech signal typically involves the use of an audio acquisition device, such as a microphone, recording device, or the like. These devices are capable of converting sound signals in the environment into electrical signals and digitally processing them with an appropriate sampling rate and quantization bit number to obtain digital signals that can be processed on a computer. In the field of digital signal processing, a mixed speech signal is generally regarded as a complex signal containing a plurality of speech object information. This signal may contain various components of speech characteristics, background noise, echoes, etc. of different speech objects.
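As an illustrative, non-limiting sketch of how such a digitized mixture could be assembled from two single-speaker recordings, the following Python snippet superimposes and normalizes two waveforms; the file names, the soundfile library and the normalization step are assumptions for illustration only, not part of the embodiment.

```python
import numpy as np
import soundfile as sf  # assumed audio I/O library; any WAV reader would do

# Build a two-speaker mixture from separate digitized recordings.
s1, sr = sf.read("speaker1.wav")    # speech of the first speech object (assumed file)
s2, _ = sf.read("speaker2.wav")     # speech of the second speech object (assumed file)

n = min(len(s1), len(s2))           # align lengths before superposition
mixture = s1[:n] + s2[:n]           # overlapping speech signals form the mixed signal
mixture /= np.max(np.abs(mixture))  # normalize to avoid clipping

sf.write("mixture.wav", mixture, sr)
```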
The method provided by the embodiment of the application provides basic data for subsequent steps (such as voice separation, feature extraction, recognition and the like) by acquiring the mixed voice signal. The mixed speech signal contains initial speech characteristics of a plurality of speech objects, from which speech information of a plurality of persons can be extracted. The method has obvious scene effects on multi-person conversations, conference records, multi-person voice interactions and the like, and can realize independent analysis and processing of a plurality of voice objects. By acquiring and processing the mixed voice signal, the efficiency of voice processing is improved, and the computing resource and time are saved.
Step S12, the initial voice characteristics corresponding to each voice object in the mixed voice signal are encoded to obtain a target mixed voice signal.
In the embodiment of the present application, first, the mixed speech signal is fourier transformed to be converted from the time domain to the frequency domain. The transformed signal is then input into an encoder, and the initial speech features of each speech object are mapped to the speech features of the target dimension by the network structure and parameters of the encoder. And finally, constructing a target mixed voice signal based on the mapped voice characteristics, and providing data support for subsequent voice processing tasks. The whole process combines the technology of signal processing and deep learning, and realizes the effective processing and analysis of the mixed voice signal.
In an embodiment of the present application, step S12 includes the following steps A1-A4:
And A1, carrying out Fourier transform on the mixed voice signal to obtain a transformed mixed voice signal.
In an embodiment of the present application, fourier transform is a method of converting a signal in the time domain (or spatial domain) into a signal in the frequency domain. In speech signal processing, the frequency content of the speech signal and its variation over time are often of interest. The mixed speech signal can be represented as a superposition of a series of sine and cosine waves, each having a different frequency and phase, by fourier transforming it. The purpose of fourier transforming the mixed speech signal is to better understand and analyze the mixed speech signal, and in the frequency domain, to identify the relative relationship between the individual frequency components. The transformed mixed speech signal thus contains information of the individual frequency components of the original signal.
As one example, the mixed speech signal is expressed as the following formula:

x(t) = Σ_{i=1}^{C} x_i(t)

where x(t) is the time-domain representation of the mixed speech signal, C is the number of overlaid speech layers, and x_i(t) is the speech signal of the i-th overlaid layer.

x(t) is a time-domain signal representing a time-varying speech waveform. Samples of x(t) are taken at a series of discrete time points to form a discrete-time signal x(n), where n is the sample index. This discrete-time signal is then converted from the time domain to the frequency domain X(k) by a Fast Fourier Transform (FFT).
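For illustration, a minimal Python sketch of the transform just described is given below: the sampled mixture is framed, windowed and converted to the frequency-domain representation X(k) with an FFT. The frame length, hop size and Hann window are assumptions, not values prescribed by the embodiment.

```python
import numpy as np

def stft(x, frame_len=512, hop=256):
    """Frame the sampled mixture x(n) and apply an FFT per frame to obtain X(k)."""
    window = np.hanning(frame_len)
    frames = [x[i:i + frame_len] * window
              for i in range(0, len(x) - frame_len + 1, hop)]
    # rows are time frames, columns are frequency bins k
    return np.fft.rfft(np.stack(frames), axis=-1)

# X = stft(mixture)  # complex frequency-domain representation of the mixed speech signal
```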
And step A2, inputting the converted mixed voice signal into an encoder.
In an embodiment of the application, the encoder (Encoder) is an important component of the speech signal processing flow in complex speech tasks such as speech separation. The main function of the encoder is to convert the input signal (the Fourier-transformed mixed speech signal) into a more representative form, which involves extracting the key features of the signal. The encoder may be a complex neural network structure, and may specifically include multiple convolutional layers, pooling layers, fully connected layers, and the like. These network layers gradually extract and transform the characteristics of the input signal through a series of mathematical operations (e.g., convolution, pooling, activation functions, etc.). In the speech separation task, the encoder focuses on extracting features that are able to distinguish between different speakers. The result output by the encoder is typically a feature representation that contains the critical information of the original input signal but has lower dimensionality and is easier to handle in later stages. In the speech separation task, this feature representation can be used directly to generate a separate signal for each speaker or as input to a subsequent Decoder.
And step A3, mapping initial voice characteristics corresponding to each voice object in the transformed mixed voice signal into voice characteristics of a target dimension through an encoder.
In an embodiment of the application, the core task of the encoder is to map the mixed speech signal from the original feature space to a higher-dimensional feature space. This mapping process can be expressed with a mathematical formula. For example, the transformed mixed signal is X(k), where k represents the different frequency bins. The mixed signal is mapped by the encoder to a higher-dimensional feature space, yielding the speech features Y of the initial speech features of each speech object in the target dimension; Y contains the key information of the original mixed speech signal, but in a form better suited to subsequent processing. The feature representation Y may be expressed as Y = {y_1, …, y_c, …, y_C}, where C is the number of overlaid speech layers and y_c represents the latent-space representation of the c-th overlaid layer.
Specifically, the encoder is a neural network composed of multiple network layers. As shown in FIG. 2, it may include two 2-D convolutional layers, two normalization layers, and ReLU activation functions. DenseNet is a specific structure within the encoder that contains four dilated convolutional layers with dilation factors of 1, 2, 4 and 8, respectively. This structure enables the model to capture speech features at different scales and map those features to a higher-dimensional feature space.
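A non-authoritative PyTorch sketch of such an encoder is given below: 2-D convolutions, normalization, activation and four dilated convolutions with dilation factors 1, 2, 4 and 8. Channel counts and kernel sizes are assumptions, and the dense connections are approximated with residual additions for brevity rather than true DenseNet concatenation.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Sketch of the encoder in Fig. 2 (sizes are illustrative assumptions)."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv_in = nn.Conv2d(1, channels, kernel_size=3, padding=1)   # first 2-D convolution
        self.norm_in = nn.GroupNorm(1, channels)                          # normalization layer
        self.act_in = nn.PReLU()
        # dilated convolutions capture speech features at different time scales
        self.dense = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=3, padding=d, dilation=d)
            for d in (1, 2, 4, 8)
        ])
        self.conv_out = nn.Conv2d(channels, channels, kernel_size=3, padding=1)  # second 2-D convolution
        self.norm_out = nn.GroupNorm(1, channels)
        self.act_out = nn.PReLU()

    def forward(self, spec):                       # spec: (batch, 1, freq, time)
        y = self.act_in(self.norm_in(self.conv_in(spec)))
        for conv in self.dense:
            y = y + torch.relu(conv(y))            # dense-style feature reuse, simplified as residuals
        return self.act_out(self.norm_out(self.conv_out(y)))  # latent features Y
```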
And step A4, constructing a target mixed voice signal based on the voice characteristics of each voice object.
In an embodiment of the present application, a target mixed speech signal is reconstructed from the speech features of the individual speech objects extracted from the mixed speech signal. This is effectively the reverse of feature extraction, i.e., regressing from speech features back to speech waveforms. The "target mixed speech signal" does not refer to the original mixed speech signal, but to a mixed speech signal synthesized again based on the speech characteristics of the separated speech objects. Specifically, the speech features of each speech object are inversely transformed from the target dimension back to the original feature space; after the inversely transformed speech features are obtained, these features are converted back into speech waveforms. The speech waveforms of the individual speech objects are then re-mixed to obtain the target mixed speech signal. In summary, constructing the target mixed speech signal is a reconstruction process from speech features to speech waveforms, whose aim is to re-synthesize, based on the speech features of the separated speech objects, a speech signal that serves as the target mixed speech signal. The process combines signal processing, deep learning, speech synthesis and other technologies, and realizes the effective processing and analysis of the mixed speech signal.
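A minimal sketch of the waveform reconstruction described here, paired with the stft sketch above: each object's frequency-domain features are converted back to a waveform by inverse FFT with overlap-add, and the per-object waveforms are re-mixed. Window compensation is simplified, and the variable names per_object_features and target_mixture are assumptions.

```python
import numpy as np

def istft(X, frame_len=512, hop=256):
    """Inverse of the stft sketch above: per-frame inverse FFT plus overlap-add."""
    frames = np.fft.irfft(X, n=frame_len, axis=-1)
    out = np.zeros(hop * (len(frames) - 1) + frame_len)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + frame_len] += frame
    return out

# waveforms = [istft(Y_c) for Y_c in per_object_features]  # one waveform per speech object
# target_mixture = np.sum(waveforms, axis=0)               # re-mixed target mixed speech signal
```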
The method provided by the embodiment of the application converts the mixed voice signal from the time domain to the frequency domain by carrying out Fourier transform, which is helpful for extracting and analyzing the frequency domain characteristics of voice in the subsequent steps. The speech features in the frequency domain are mapped to the speech features in the target dimension by the encoder in order to efficiently learn and extract complex features in the speech signal. By constructing the target mixed speech signal based on the speech characteristics of the individual speech objects, a new signal is generated that is representative of the original mixed speech signal, which is more efficient and effective in subsequent processing, such as speech separation.
And S13, performing voice separation on the target mixed voice signal to obtain a voice separation result, and decoding the voice separation result to obtain target voice characteristics corresponding to the voice object.
In an embodiment of the application, the target mixed speech signal is input into a trained speech separation network through which speech characteristics of individual speech objects are identified and separated. Then, a voice mask is calculated based on the separated voice features, and the mask is subjected to dot multiplication with the features of the corresponding voice objects in the target mixed voice signal, so that the target voice features corresponding to each voice object are obtained through decoding. This process aims to accurately extract the original speech information of each speech object from the mixed speech signal.
In the embodiment of the application, the target mixed voice signal is subjected to voice separation to obtain a voice separation result, which comprises the following steps of:
and step B1, inputting the target mixed voice signal into a voice separation network.
It should be noted that the speech separation network is a deep learning model, which may specifically adopt a complex network structure such as a recurrent neural network (RNN), a convolutional neural network (CNN), or an autoencoder-based structure. These networks learn how to distinguish and separate the different sounds in mixed speech through training and optimization on large amounts of training data. In this example, the speech separation network framework is shown in fig. 3, and its main part is an RNN network. RNNs perform well in processing sequence data because of their ability to capture time dependencies in a sequence. By inputting the mixed speech signal into such a network, the model can learn to recognize the unique features of each speaker and attempt to separate out the individual speech of each speaker.
In the embodiment of the application, after the target mixed speech signal is input into the speech separation network, the network will start to process the signal. The process may include multiple stages such as feature extraction, sequence modeling, and separation of outputs. At each stage, the network uses its learned weights and parameters to convert and manipulate the input signal to progressively separate out the individual speech components. Wherein the output of the speech separation network is a set of separated speech signals, each of which may represent an individual speaker of the original mixed speech. These signals are represented in the time or frequency domain and can be used directly for further analysis or processing, such as speech recognition, text conversion, etc.
And step B2, recognizing the voice characteristics of each voice object through a voice separation network, and separating the voice characteristics of the voice objects to obtain a voice separation result.
In an embodiment of the application, the target mixed speech signal is input into the speech separation network. The main part of the speech separation network is an RNN network, and the speech features of each speech object are identified and separated by processing the latent features Y. In this process, the speech separation network learns the unique characteristics of each speaker and separates them. To achieve this separation, the speech separation network generates a Mask for each speech object, which is used to identify the location and characteristics of that speech object in the mixed speech. By performing a dot-product (element-wise) operation between the mask and the output of the encoder, the latent representation d_i of the corresponding target speech object, i.e. the speech mask feature, can be obtained. Finally, the speech mask features of each separated speech object are decoded and restored into time-domain signals to obtain the speech separation result. This result contains the original speech information of the individual speech objects, and the sound of each object can be clearly recognized.
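A minimal PyTorch sketch mirroring this description and Fig. 3: an RNN over the encoder output, a ReLU, and a Sigmoid that produces one mask per speech object; each mask is multiplied element-wise with the encoder output to give that object's latent representation d_i. The choice of an LSTM, the hidden size and the two-speaker setting are assumptions.

```python
import torch
import torch.nn as nn

class SeparationNet(nn.Module):
    """Sketch of the separation network in Fig. 3 (sizes are illustrative)."""
    def __init__(self, feat_dim=64, hidden=128, num_speakers=2):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, feat_dim * num_speakers)
        self.num_speakers = num_speakers

    def forward(self, Y):                          # Y: (batch, time, feat_dim) encoder output
        h, _ = self.rnn(Y)                         # sequence modeling over the latent features
        h = torch.relu(self.proj(h))               # ReLU stage of Fig. 3
        masks = torch.sigmoid(h)                   # values in (0, 1), one mask per speech object
        masks = masks.view(Y.size(0), Y.size(1), self.num_speakers, -1)
        # element-wise (dot) multiplication of each mask with the encoder output -> d_i
        return [masks[:, :, i, :] * Y for i in range(self.num_speakers)]
```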
In the embodiment of the application, the voice separation result is decoded to obtain the target voice characteristics corresponding to the voice object, which comprises the steps of B21-B22:
step B21, calculating the voice mask characteristics corresponding to the voice object based on the voice characteristics of the voice object.
In the embodiment of the application, the speech mask feature corresponding to each speech object is calculated using the speech features of that object extracted from the mixed speech signal. The speech mask feature is an important tool for identifying the location and characteristics of a speech object in the mixed speech. It is a vector of the same length as the mixed speech signal, with each element corresponding to a point in time of the mixed speech signal. The value of this vector is higher at the time points where the speech object is present and lower at the time points where it is absent. With the mask features, the location and duration of each speech object in the mixed speech can be accurately identified. Computing the speech mask features may be based on a deep learning model, such as a recurrent neural network (RNN) or a convolutional neural network (CNN). These models can automatically extract the features of a speech object by learning from large amounts of training data and generate the corresponding mask features.
And step B22, decoding the voice mask features corresponding to the voice objects to obtain target voice features corresponding to the voice objects.
In the embodiment of the present application, step B22 includes: and performing dot multiplication on the voice mask characteristics and the voice characteristics corresponding to the voice objects in the target mixed voice signals to obtain target voice characteristics corresponding to the voice objects.
In the embodiment of the application, point multiplication is an operation of multiplying element by element, and mask features can be combined with original voice features, so that pure features of a target voice object in mixed voice can be extracted. In the dot multiplication operation, the value of each voice mask feature is multiplied by the original voice feature value of the corresponding position. Since the voice mask feature has a higher value at the point in time when the voice object appears and a lower value at the point in time when no voice object appears, the dot product operation can effectively separate the feature of the target voice object from the mixed voice. The target voice feature corresponding to the voice object can be obtained by performing the dot multiplication operation, wherein the target voice feature is a result obtained by decoding the original voice feature and does not contain interference of other voice objects. Therefore, the target voice characteristics corresponding to the voice objects output by the voice separation network can be used for subsequent voice recognition and text classification operations.
The method provided by the embodiment of the application utilizes the special voice separation network to accurately identify and separate each voice object characteristic in the mixed voice signal, thereby effectively reducing noise and interference and enabling the target voice to be clearer. By calculating and decoding the voice mask characteristics, the original voice signal is accurately restored, and the efficiency and accuracy of voice processing are improved. The voice mask features are decoded through the dot multiplication operation, so that high-dimension voice features are obtained, and high-quality data support is provided for subsequent voice recognition and text classification.
Step S14, identifying the target voice characteristics to obtain text information, and executing text classification operation based on the text information.
In the embodiment of the application, after the voice feature extraction and the decoding of the voice mask feature of the voice object are completed, the decoded target voice feature is input into a voice recognition model, so that corresponding text information is obtained, and text classification operation is performed based on the text information. First, the decoded target speech features are input into a speech recognition model for recognition. The speech recognition model may be an offline model, such as Kaldi, or other models suitable for a particular scenario. The speech recognition model maps these features to corresponding text information by processing them. The process may involve multiple steps of matching acoustic models, searching dictionaries, decoding language models, and the like, and finally outputting the recognized text.
In the embodiment of the application, after the text information is obtained, a text classification operation is performed based on the text information, which can be realized by a deep learning model, for example using BERT as a pre-trained model. The text is first processed by the BERT tokenizer, which converts the text into tokens and masks; these are then input into the BERT pre-trained model. After the model has processed them, it outputs a semantic representation of the text. These representations are then fed into a fully connected layer and a softmax layer to obtain the final classification result. The classification result is the labeling of the text information, which helps us understand and organize it. For example, in a dialog system, text classification may be used to identify the intent, emotion, or subject of a dialog, etc.
The method provided by the embodiment of the application can accurately extract the text information by identifying the target voice characteristics. The efficient conversion of voice features and text information is realized. And the text classification operation is executed based on the extracted text information, so that automatic text processing and analysis can be realized, the text processing efficiency is improved, and the manual intervention requirement is reduced. By classifying text information, powerful data support can be provided for subsequent decisions and applications.
In an embodiment of the present application, before performing the text classification operation based on the text information, the method further comprises the following steps C1-C4:
And step C1, acquiring a training data set, wherein the training data set comprises a text sample and a real classification label.
In the embodiment of the application, the training data set comprises a large number of text samples and real classification labels corresponding to each text sample. These text samples may be collected from various sources, and the true classification labels are key to accurately classifying the text samples, typically by manual labeling or by other reliable means.
And step C2, inputting the text sample into the initial language model to obtain sample characteristics.
In an embodiment of the application, the collected text samples are input into an initial language model. The initial language model is typically a pre-trained language model, such as Bert, that has been trained on a large amount of text data to understand and process the semantic information of the text. As the text sample passes through the initial language model, the initial language model generates a series of feature representations that capture the semantics and other important information of the text sample.
And step C3, processing the sample characteristics by using the full connection layer to obtain a corresponding classification result.
In the embodiment of the application, after the feature representation of the text sample is obtained, the features are input into a fully connected layer. The full connection layer is a neural network layer, and performs linear transformation and nonlinear activation function processing on the input features, so as to obtain the score or probability of each classification. And mapping the text sample into a specific classification space through the processing of the full connection layer, so as to obtain a classification result of the text sample.
And step C4, updating model parameters of the initial language model by using the classification result and the real classification label to obtain a text classification model.
In the embodiment of the present application, step C4 specifically includes: calculating a loss value between the classification result and the real classification label; calculating a model gradient of the loss value to the initial language model by using a back propagation algorithm; and updating model parameters of the initial language model according to the model gradient and the optimization algorithm until the model gradient of the initial language model reaches a preset condition, and taking the initial language model as a text classification model.
In the embodiment of the application, the loss value between the classification result and the real classification label is calculated, wherein the loss value is a quantization index, and the deviation between the model prediction result and the actual result is measured. In the text classification task, cross entropy loss or other similar loss functions may be utilized. These functions enable calculation of the degree of mismatch between the probability assigned by the model to each text sample and the true class labels. Higher loss values mean inaccuracy of model prediction, while lower loss values mean model prediction is closer to reality.
In an embodiment of the application, the model gradients of the loss value with respect to the initial language model are calculated using a back propagation algorithm, which is the core of training a neural network. The partial derivatives with respect to the model parameters, i.e. the gradients, are calculated from the chain rule by back propagation from the output layer. The gradient characterizes how the loss value changes when the model parameters are adjusted, which is critical information for optimizing the model. With the gradient information, optimization algorithms (e.g., gradient descent, Adam, etc.) can be used to update the parameters of the model. In each iteration, the parameters of the model are adjusted according to the calculated gradient, and the process is repeated until a certain preset condition is met. These conditions may include the loss value falling below a certain threshold, the gradients becoming very small, or the performance of the model on the validation set no longer improving significantly. When these conditions are met, the model is considered to have converged and to have learned enough information to accurately classify new data.

The method provided by the embodiment of the application can train the initial language model in a targeted way by using a training data set that comprises text samples and the corresponding real classification labels. The text samples are input into the initial language model to obtain sample features, realizing feature extraction and representation of the text data and providing effective data input for the subsequent classification task. The sample features are processed by the fully connected layer to obtain the corresponding classification results, weighting and activating the extracted features so as to generate classification decisions. By continuously adjusting the parameters of the fully connected layer, the classification performance of the model can be optimized. The method realizes the adjustment and optimization of the initial language model parameters by calculating the loss value between the classification result and the real classification label and calculating the model gradient using a back propagation algorithm.
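The following PyTorch sketch summarizes this training procedure: cross-entropy loss, back propagation and parameter updates with an optimizer until a preset condition is met. The names model, train_loader and num_epochs, the Adam optimizer, the learning rate and the stopping threshold are assumptions for illustration only.

```python
import torch
import torch.nn as nn

# Assumed: `model` maps tokenized text samples to class logits (e.g. a BERT encoder
# followed by a fully connected layer), `train_loader` yields (tokens, labels) batches.
criterion = nn.CrossEntropyLoss()                       # loss between prediction and real label
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)

for epoch in range(num_epochs):
    for tokens, labels in train_loader:                 # text samples and real classification labels
        logits = model(tokens)                          # classification result (score per class)
        loss = criterion(logits, labels)                # loss value
        optimizer.zero_grad()
        loss.backward()                                 # back propagation: model gradients
        optimizer.step()                                # update model parameters
    if loss.item() < 1e-3:                              # preset condition, e.g. a small-enough loss
        break                                           # model taken as the text classification model
```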
Fig. 2 is a model framework diagram of an encoder and decoder according to an embodiment of the present invention. As shown in fig. 2, the encoder (Encoder) model framework includes: an input layer, a 2-D convolution layer, a fully connected layer, a PReLU activation function, a densely connected network, a normalization layer, and a PReLU activation function, explained in detail as follows:
an input layer, which refers to a starting point of an encoder model, for receiving raw data, the raw data including audio or text, etc.;
A 2-D convolution layer, referred to as the first layer of the encoder, is used to perform two-dimensional convolution operations, which are operations commonly used in image processing, capable of capturing local features of an image. The convolution kernel (or filter) slides over the input image, generating a feature map that contains some local feature of the input data;
The full-connection layer is used for extracting high-level features or classifying, and integrating the features extracted by the two-dimensional convolution layer into a global feature vector;
A PReLU activation function, used to introduce nonlinearity so that the network can learn more complex patterns;
A densely connected network (DenseNet), a deep convolutional neural network architecture in which each layer is directly connected to all previous layers, enabling feature reuse, enhancing feature propagation and alleviating the gradient vanishing problem;
A 2-D convolution layer: the two-dimensional convolution layer is used again to further extract or transform the features output by the densely connected network;
A normalization Layer (LN) for reducing internal covariate offset by normalizing the activation values of the layer, improving training speed and stability of the network.
The Decoder (Decoder) model framework includes: the dense connection network, the sub-pixel two-dimensional convolution, the normalization layer, the 2-D convolution layer, the Tanh activation function and the Sigmoid activation function are explained specifically as follows:
Dense connectivity network (DenseNet): the initial end of the decoder model framework further extracts and converts features based on the feature representation learned in the encoding stage;
Sub-pixel two-dimensional convolution, also known as upsampling convolution or deconvolution, is used to convert low-dimensional features to high-dimensional features;
Normalization layer: also, the normalization layer is used to improve training of the network;
2-D convolution layer: for further processing and converting features.
A Tanh activation function, referred to as a hyperbolic tangent function, for compressing the output value of the neural network between-1 and 1;
A Sigmoid activation function, a logistic function used to map the output values of the neural network to between 0 and 1, typically in the output layer of a classification problem. A minimal code sketch of this decoder structure is given after this list.
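The sketch below assembles the decoder components just listed: a dilated (DenseNet-style) convolution stage, a sub-pixel convolution (a 2-D convolution followed by PixelShuffle) for upsampling, normalization, a final 2-D convolution and an output activation. Channel counts, the upscale factor and showing only the Tanh output branch are assumptions.

```python
import torch.nn as nn

# Illustrative decoder sketch corresponding to the component list above.
decoder = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2),  # dense-network stage (simplified)
    nn.PReLU(),
    nn.Conv2d(64, 64 * 4, kernel_size=3, padding=1),          # sub-pixel convolution ...
    nn.PixelShuffle(2),                                        # ... rearranges channels into a 2x larger map
    nn.GroupNorm(1, 64),                                       # normalization layer
    nn.Conv2d(64, 1, kernel_size=3, padding=1),                # final 2-D convolution
    nn.Tanh(),                                                 # compress outputs to (-1, 1)
)
```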
Fig. 3 is a schematic structural diagram of a voice separation network according to an embodiment of the present invention. As shown in fig. 3, the voice separation network includes: the encoder output, a recurrent neural network, a ReLU activation function, a Sigmoid function, a Mask-w mask generation layer, and the target speech feature output, explained in detail as follows:
the encoder output, the start of the speech separation network, is used for receiving the speech signal or audio data. The task of the encoder is to convert the input audio signal into an internal representation, typically a sequence of feature vectors, containing key information of the audio signal;
A recurrent neural network (RNN), used to receive the output from the encoder and process the sequence of feature vectors. Because RNNs can handle sequence data, they further extract and transform these features to facilitate the subsequent separation task;
A ReLU activation function, typically used after the RNN layer to introduce nonlinearity and help the network learn more complex patterns. ReLU (Rectified Linear Unit) is a commonly used activation function that sets all negative values to 0 while leaving positive values unchanged;
sigmoid function: for mapping the output values between 0 and 1, generating a Mask (Mask) indicating which parts of the audio signal should be preserved (near 1) and which parts should be suppressed (near 0);
Mask-w Mask: a mask generated based on the output processed by the Sigmoid function. The mask is a vector of the same time length as the input audio signal, each element of which corresponds to a time step of the input signal, and the mask is generated for separating the target speech from the mixed speech signal;
The target speech feature, the output of the speech separation network, refers to the feature of speech separated from the mixed speech signal using the generated mask for speech recognition, speech classification, or further restoration to a target speech waveform by a decoder.
FIG. 4 is a schematic diagram of the structure of a text classification model according to an embodiment of the invention. As shown in FIG. 4, the model comprises: a Token encoder, a BERT model, a Dropout layer, a linear layer, and a Softmax function.
In an embodiment of the present application, first, the text classification model receives the input text information and feeds it into a Token encoder, which splits the original text into a series of tokens (e.g., words, subwords, or characters), each token being converted into a numeric vector. Next, the token vector sequence produced by the Token encoder is fed into the BERT model. The BERT model is a pre-trained Transformer model that captures the context information of the text and generates a vector representation containing richer semantic information. In the BERT model, these vectors are processed by multiple Transformer encoder layers to further extract and refine the features of the text. To prevent over-fitting, a Dropout layer is applied to the output of the BERT model. During training, the Dropout layer randomly sets the outputs of a portion of the neurons to 0, which reduces the model's dependence on specific features and improves its generalization ability. The feature vectors processed by the Dropout layer are fed into a linear layer, which applies a linear transformation and maps them to the output space required by the classification task; by adjusting its weight and bias parameters, the model learns how to turn the feature vectors into classification results. Finally, the output of the linear layer is fed into a Softmax function, which converts it into a probability distribution in which each class has a probability between 0 and 1 and the probabilities of all classes sum to 1. In this way, the model assigns to each input text a probability of belonging to each category and outputs a classification label, thereby completing the text classification task.
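A minimal sketch of this classification model, assuming the Hugging Face transformers package (any BERT implementation would work); the checkpoint name, Dropout rate, number of classes and the example sentence are illustrative assumptions.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer  # assumed library for the BERT backbone

class TextClassifier(nn.Module):
    """Sketch of Fig. 4: BERT -> Dropout -> linear layer -> Softmax."""
    def __init__(self, num_classes=3, pretrained="bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(pretrained)
        self.dropout = nn.Dropout(0.1)                       # reduces over-fitting
        self.fc = nn.Linear(self.bert.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled = self.dropout(out.pooler_output)             # semantic representation of the text
        return torch.softmax(self.fc(pooled), dim=-1)        # probability per class

# tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
# batch = tokenizer(["please schedule a meeting for tomorrow"], padding=True, return_tensors="pt")
# probs = TextClassifier()(batch["input_ids"], batch["attention_mask"])
```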
The present embodiment provides a processing apparatus for mixed speech, as shown in fig. 5, including:
An obtaining module 51, configured to obtain a mixed voice signal, where the mixed voice signal includes initial voice features corresponding to a plurality of voice objects;
the encoding module 52 is configured to encode initial speech features corresponding to each speech object in the mixed speech signal to obtain a target mixed speech signal;
the separation module 53 is configured to perform voice separation on the target mixed voice signal to obtain a voice separation result, and decode the voice separation result to obtain a target voice feature corresponding to the voice object;
the classification module 54 is configured to identify the target speech feature to obtain text information, and perform a text classification operation based on the text information.
In an alternative embodiment of the present application, the apparatus further comprises: the training module is used for acquiring a training data set, wherein the training data set comprises a text sample and a real classification label; inputting a text sample into an initial language model to obtain sample characteristics; processing the sample characteristics by using the full connection layer to obtain corresponding classification results; and updating model parameters of the initial language model by using the classification result and the real classification label to obtain a text classification model.
In an optional embodiment of the present application, the training module is configured to calculate a loss value between the classification result and the real classification label; calculating a model gradient of the loss value to the initial language model by using a back propagation algorithm; and updating model parameters of the initial language model according to the model gradient and the optimization algorithm until the model gradient of the initial language model reaches a preset condition, and taking the initial language model as a text classification model.
In an alternative embodiment of the present application, the encoding module 52 is configured to perform Fourier transform on the mixed speech signal to obtain a transformed mixed speech signal; input the transformed mixed voice signal into an encoder; map the initial voice characteristics corresponding to each voice object in the transformed mixed voice signal into voice characteristics of a target dimension through the encoder; and construct a target mixed speech signal based on the speech characteristics of the individual speech objects.

In an alternative embodiment of the present application, the separation module 53 is configured to input the target mixed speech signal into the speech separation network; and recognize the voice characteristics of each voice object through the voice separation network, and separate the voice characteristics of the voice objects to obtain a voice separation result.
In an optional embodiment of the present application, the separation module 53 is configured to calculate a voice mask feature corresponding to the voice object based on the voice feature of the voice object; and decoding the voice mask features corresponding to the voice objects to obtain target voice features corresponding to the voice objects.
In an alternative embodiment of the present application, the separation module 53 is configured to dot multiply the speech mask feature with the speech feature corresponding to the speech object in the target mixed speech signal to obtain the target speech feature corresponding to the speech object.
Referring to fig. 6, fig. 6 is a schematic structural diagram of a computer device according to an alternative embodiment of the present invention. As shown in fig. 6, the computer device includes: one or more processors 10, a memory 20, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The various components are communicatively coupled to each other using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executed within the computer device, including instructions stored in or on the memory, to display graphical information of a GUI on an external input/output device, such as a display device coupled to the interface. In some alternative embodiments, multiple processors and/or multiple buses may be used, if desired, along with multiple memories. Also, multiple computer devices may be connected, each providing a portion of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system).
The processor 10 may be a central processor, a network processor, or a combination thereof. The processor 10 may further include a hardware chip, among others. The hardware chip may be an application specific integrated circuit, a programmable logic device, or a combination thereof. The programmable logic device may be a complex programmable logic device, a field programmable gate array, a general-purpose array logic, or any combination thereof.
Wherein the memory 20 stores instructions executable by the at least one processor 10 to cause the at least one processor 10 to perform a method for implementing the embodiments described above.
The memory 20 may include a storage program area and a storage data area, wherein the storage program area may store an operating system and at least one application program required for functions, and the storage data area may store data created according to the use of the computer device, and the like. In addition, the memory 20 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some alternative embodiments, the memory 20 may optionally include memory located remotely from the processor 10, which may be connected to the computer device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Memory 20 may include volatile memory, such as random access memory; the memory may also include non-volatile memory, such as flash memory, hard disk, or solid state disk; the memory 20 may also comprise a combination of the above types of memories.
The computer device also includes a communication interface 30 for the computer device to communicate with other devices or communication networks.
The embodiments of the present invention also provide a computer-readable storage medium. The methods according to the embodiments described above may be implemented in hardware or firmware, or as computer code that can be recorded on a storage medium, or as computer code originally stored on a remote storage medium or a non-transitory machine-readable storage medium and downloaded over a network to be stored on a local storage medium, so that the methods described herein can be processed by such software stored on a storage medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware. The storage medium can be a magnetic disk, an optical disk, a read-only memory, a random access memory, a flash memory, a hard disk, a solid state disk, or the like; further, the storage medium may also comprise a combination of the above types of memories. It will be appreciated that a computer, processor, microprocessor controller or programmable hardware includes a storage element that can store or receive software or computer code that, when accessed and executed by the computer, processor or hardware, implements the methods illustrated by the above embodiments.
Although embodiments of the present invention have been described in connection with the accompanying drawings, various modifications and variations may be made by those skilled in the art without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope of the invention as defined by the appended claims.

Claims (10)

1. A method of processing mixed speech, the method comprising:
acquiring a mixed voice signal, wherein the mixed voice signal comprises initial voice characteristics corresponding to a plurality of voice objects;
encoding initial voice characteristics corresponding to each voice object in the mixed voice signal to obtain a target mixed voice signal;
performing voice separation on the target mixed voice signal to obtain a voice separation result, and decoding the voice separation result to obtain target voice characteristics corresponding to the voice object;
and identifying the target voice characteristics to obtain text information, and executing text classification operation based on the text information.
2. The method of claim 1, wherein encoding the initial voice characteristics corresponding to each voice object in the mixed voice signal to obtain the target mixed voice signal comprises:
performing a Fourier transform on the mixed voice signal to obtain a transformed mixed voice signal;
inputting the transformed mixed voice signal into an encoder;
mapping initial voice characteristics corresponding to each voice object in the transformed mixed voice signal into voice characteristics of a target dimension through the encoder;
and constructing the target mixed voice signal based on the voice characteristics of each voice object.
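As a concrete illustration of claim 2, the sketch below shows one possible realization in PyTorch: the mixed signal is Fourier-transformed, and a learned encoder maps each frame of the transformed mixture to features of a target dimension. The names (N_FFT, TARGET_DIM, Encoder) and the choice of a linear projection are assumptions for illustration; the claim does not fix the encoder architecture.

```python
import torch
import torch.nn as nn

N_FFT = 512          # assumed STFT size
TARGET_DIM = 256     # assumed "target dimension" of the encoded voice features

class Encoder(nn.Module):
    def __init__(self, in_dim: int = N_FFT // 2 + 1, out_dim: int = TARGET_DIM):
        super().__init__()
        # map each frame of the transformed mixture to the target dimension
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, spec_mag: torch.Tensor) -> torch.Tensor:
        # spec_mag: (batch, frames, freq_bins)
        return self.proj(spec_mag)

# Fourier-transform the mixed voice signal, then encode it.
mixture = torch.randn(1, 16000)                                   # dummy 1 s mixture at 16 kHz
window = torch.hann_window(N_FFT)
spec = torch.stft(mixture, n_fft=N_FFT, window=window, return_complex=True)
spec_mag = spec.abs().transpose(1, 2)                             # (batch, frames, freq_bins)
target_mixed = Encoder()(spec_mag)                                # (batch, frames, TARGET_DIM)
print(target_mixed.shape)
```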
3. The method of claim 1, wherein performing the voice separation on the target mixed voice signal to obtain the voice separation result comprises:
inputting the target mixed voice signal into a voice separation network;
and recognizing the voice characteristics of each voice object through the voice separation network, and separating the voice characteristics of the voice objects to obtain the voice separation result.
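A minimal sketch of the voice separation network in claim 3, assuming two voice objects and a small convolutional backbone as a stand-in; the claim itself does not specify the network architecture.

```python
import torch
import torch.nn as nn

class SeparationNet(nn.Module):
    """Predicts per-speaker features from the encoded mixture (architecture assumed)."""
    def __init__(self, feat_dim: int = 256, num_speakers: int = 2):
        super().__init__()
        self.num_speakers = num_speakers
        self.backbone = nn.Sequential(
            nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(feat_dim, feat_dim * num_speakers, kernel_size=3, padding=1),
        )

    def forward(self, target_mixed: torch.Tensor) -> torch.Tensor:
        # target_mixed: (batch, frames, feat_dim)
        x = target_mixed.transpose(1, 2)                   # (batch, feat_dim, frames)
        y = self.backbone(x)                               # (batch, feat_dim * S, frames)
        b, _, t = y.shape
        # (batch, num_speakers, frames, feat_dim): one feature stream per voice object
        return y.view(b, self.num_speakers, -1, t).transpose(2, 3)

separated = SeparationNet()(torch.randn(1, 126, 256))
print(separated.shape)   # torch.Size([1, 2, 126, 256])
```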
4. The method of claim 3, wherein decoding the voice separation result to obtain the target voice feature corresponding to the voice object comprises:
calculating voice mask features corresponding to the voice objects based on the voice features of the voice objects;
and decoding the voice mask features corresponding to the voice objects to obtain target voice features corresponding to the voice objects.
5. The method of claim 4, wherein decoding the voice mask feature corresponding to the voice object to obtain the target voice feature corresponding to the voice object comprises:
and performing dot multiplication on the voice mask characteristics and the voice characteristics corresponding to the voice objects in the target mixed voice signal to obtain target voice characteristics corresponding to the voice objects.
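Claims 4 and 5 amount to estimating a mask per voice object and applying it to the encoded mixture by element-wise (dot) multiplication before decoding. The sketch below illustrates this under assumed shapes and module names (mask_head, decoder); the sigmoid mask and linear decoder are illustrative choices, not taken from the patent.

```python
import torch
import torch.nn as nn

feat_dim, num_speakers, freq_bins = 256, 2, 257
mask_head = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.Sigmoid())   # mask values in [0, 1]
decoder = nn.Linear(feat_dim, freq_bins)   # back to (assumed) spectral features

def decode_speakers(separated: torch.Tensor, target_mixed: torch.Tensor) -> torch.Tensor:
    # separated: (batch, S, frames, feat_dim); target_mixed: (batch, frames, feat_dim)
    masks = mask_head(separated)                       # voice mask features per voice object
    masked = masks * target_mixed.unsqueeze(1)         # element-wise ("dot") multiplication
    return decoder(masked)                             # target voice features per voice object

out = decode_speakers(torch.rand(1, num_speakers, 126, feat_dim),
                      torch.rand(1, 126, feat_dim))
print(out.shape)   # torch.Size([1, 2, 126, 257])
```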
6. The method of claim 1, wherein prior to performing a text classification operation based on the text information, the method further comprises:
acquiring a training data set, wherein the training data set comprises a text sample and a real classification label;
inputting the text sample into an initial language model to obtain sample characteristics;
processing the sample characteristics by using a fully connected layer to obtain a corresponding classification result;
and updating model parameters of the initial language model by using the classification result and the real classification label to obtain a text classification model.
7. The method of claim 6, wherein updating model parameters of the initial language model with the classification result and the real classification label to obtain a text classification model comprises:
calculating a loss value between the classification result and the real classification label;
calculating a model gradient of the loss value with respect to the initial language model by using a back-propagation algorithm;
and updating model parameters of the initial language model according to the model gradient and an optimization algorithm until the model gradient of the initial language model reaches a preset condition, and taking the initial language model as the text classification model.
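Claims 6 and 7 describe a conventional supervised fine-tuning loop: the initial language model produces sample features, a fully connected layer produces the classification result, a loss against the real label is back-propagated, and an optimizer updates the parameters. The sketch below assumes PyTorch and stands in a simple linear module for the language model; the stopping condition on the gradient is omitted.

```python
import torch
import torch.nn as nn

class TextClassifier(nn.Module):
    def __init__(self, language_model: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.language_model = language_model           # the "initial language model"
        self.fc = nn.Linear(feat_dim, num_classes)     # fully connected classification layer

    def forward(self, text_batch: torch.Tensor) -> torch.Tensor:
        sample_features = self.language_model(text_batch)   # sample characteristics
        return self.fc(sample_features)                      # classification logits

def train_step(model, optimizer, text_batch, labels):
    logits = model(text_batch)
    loss = nn.functional.cross_entropy(logits, labels)   # loss vs. the real classification labels
    optimizer.zero_grad()
    loss.backward()                                       # back-propagate the model gradient
    optimizer.step()                                      # parameter update via the optimizer
    return loss.item()

# Toy run with a linear stand-in for the language model and dummy text features.
model = TextClassifier(nn.Linear(32, 128), feat_dim=128, num_classes=4)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
print(train_step(model, optimizer, torch.randn(8, 32), torch.randint(0, 4, (8,))))
```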
8. A device for processing mixed speech, the device comprising:
the acquisition module is used for acquiring a mixed voice signal, wherein the mixed voice signal comprises initial voice characteristics corresponding to a plurality of voice objects;
the coding module is used for coding initial voice characteristics corresponding to each voice object in the mixed voice signal to obtain a target mixed voice signal;
the separation module is used for carrying out voice separation on the target mixed voice signal to obtain a voice separation result, and decoding the voice separation result to obtain target voice characteristics corresponding to the voice object;
and the classification module is used for identifying the target voice characteristics to obtain text information and executing text classification operation based on the text information.
9. A computer device, comprising:
a memory and a processor in communication with each other, the memory having stored therein computer instructions which, upon execution, cause the processor to perform the method of any of claims 1 to 7.
10. A computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1 to 7.
CN202410335647.3A 2024-03-22 2024-03-22 Mixed voice processing method, device, computer equipment and storage medium Pending CN118038887A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410335647.3A CN118038887A (en) 2024-03-22 2024-03-22 Mixed voice processing method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410335647.3A CN118038887A (en) 2024-03-22 2024-03-22 Mixed voice processing method, device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN118038887A true CN118038887A (en) 2024-05-14

Family

ID=91002394

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410335647.3A Pending CN118038887A (en) 2024-03-22 2024-03-22 Mixed voice processing method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN118038887A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination