CN114360575A - Single-channel human-computer interaction voice separation method and device, computer equipment and medium - Google Patents


Info

Publication number
CN114360575A
CN114360575A (application CN202210038787.5A)
Authority
CN
China
Prior art keywords: voice, time, spectrum, human, frequency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210038787.5A
Other languages
Chinese (zh)
Inventor
王天圣
Current Assignee
Shanghai Shuhe Information Technology Co Ltd
Original Assignee
Shanghai Shuhe Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Shuhe Information Technology Co Ltd filed Critical Shanghai Shuhe Information Technology Co Ltd
Priority to CN202210038787.5A
Publication of CN114360575A
Legal status: Pending

Abstract

The application relates to a single-channel human-computer interaction voice separation method and device, computer equipment and a storage medium. The method comprises the following steps: training, through a time sequence recurrent neural network, a first time-frequency mask for extracting a machine voice spectrum and a second time-frequency mask for extracting a real human voice spectrum; and performing voice separation processing on a target human-computer interaction voice to be separated according to the first and second time-frequency masks trained through the time sequence recurrent neural network, to obtain the machine voice and the real human voice separated from the target human-computer interaction voice. According to the embodiments of the application, without losing separation precision for single-channel human-computer interaction voice, the training data required by deep learning is significantly reduced, the training time is shortened, and the voice separation speed is improved.

Description

Single-channel human-computer interaction voice separation method and device, computer equipment and medium
Technical Field
The present application relates to the field of monophonic voice separation technologies, and in particular, to a method and an apparatus for separating a single-channel human-computer interaction voice, a computer device, and a storage medium.
Background
The single-channel speech separation problem, also known in the speech industry as the 'cocktail party' problem, aims to separate a target sound, or to extract background music, from a noisy environment. Its difficulty lies in the number of input sound sources and the speech targets to be separated: speech separation in a two-person conversation, in a three-person conversation, and even in a multi-person gathering are scenes of different difficulty, and the technical schemes used for these problems differ.
From the perspective of traditional techniques, such as spectral subtraction and Wiener filtering, the methods estimate the power spectrum of the noise. Wiener filtering is optimal, in the minimum mean square error sense, for the scene of separating clean speech: assuming prior distributions for speech and noise, the spectral coefficients of the speech can be inferred given the noise. Voice separation based on deep learning, trained under the supervision of a deep neural network, can also achieve good results.
However, the a priori conditions assumed by the above traditional methods are difficult to satisfy in real situations, especially at low signal-to-noise ratios, so the speech separation performance of the traditional methods degrades seriously in many scenes, greatly reducing their practicality for industrial deployment. General deep learning methods, meanwhile, demand large amounts of data and long training times, usually taking more than 1000 hours.
Disclosure of Invention
To address the above defects or shortcomings, the present application provides a single-channel human-computer interaction voice separation method and apparatus, a computer device, and a storage medium.
The application provides a single-channel human-computer interaction voice separation method according to a first aspect, and in one embodiment, the method includes:
training a first time-frequency mask used for extracting a machine voice spectrum and a second time-frequency mask used for extracting a real person voice spectrum through a time sequence cyclic neural network;
performing voice separation processing on the target human-computer interaction voice to be separated according to the first time-frequency mask and the second time-frequency mask trained through the time sequence cyclic neural network to obtain machine voice and real human voice separated from the target human-computer interaction voice;
the process of training each time through the time sequence recurrent neural network comprises the following steps:
acquiring human-computer interaction voice for the training;
carrying out short-time Fourier transform on the acquired human-computer interaction voice to obtain a frequency spectrum of the human-computer interaction voice;
carrying out cepstrum mean variance normalization and spectral clustering on the frequency spectrum of the human-computer interaction voice to obtain a first machine voice frequency spectrum and a first real human voice frequency spectrum;
inputting the frequency spectrum of the human-computer interaction voice into a time sequence cyclic neural network to obtain a first time-frequency mask and a second time-frequency mask output by the time sequence cyclic neural network;
performing matrix decomposition, respectively, on the first time-frequency mask and the second time-frequency mask output by the time sequence recurrent neural network in this training round and the frequency spectrum of the human-computer interaction voice, to obtain a second machine voice frequency spectrum and a second real human voice frequency spectrum;
respectively calculating a first error between the first machine voice frequency spectrum and the second machine voice frequency spectrum and a second error between the first real human voice frequency spectrum and the second real human voice frequency spectrum;
judging whether the training stopping condition is met or not according to the first error and the second error;
if not, updating the network parameters of the time sequence cyclic neural network according to the first error and the second error, and carrying out next training through the updated time sequence cyclic neural network;
and if so, stopping training, and taking the first time-frequency mask and the second time-frequency mask output by the time sequence cyclic neural network in the training as the trained first time-frequency mask and second time-frequency mask.
In one embodiment, performing voice separation processing on a target human-computer interaction voice to be separated according to a first time-frequency mask and a second time-frequency mask trained by a time sequence recurrent neural network to obtain a machine voice and a real human voice separated from the target human-computer interaction voice, includes:
acquiring target human-computer interaction voice to be separated;
carrying out short-time Fourier transform on the target human-computer interaction voice to obtain a frequency spectrum of the target human-computer interaction voice;
performing matrix decomposition on a first time-frequency mask and a second time-frequency mask which are trained through a time sequence cyclic neural network and a frequency spectrum of target human-computer interaction voice respectively to obtain a machine voice frequency spectrum and a real human voice frequency spectrum;
and respectively carrying out inverse Fourier transform on the machine voice frequency spectrum and the real human voice frequency spectrum to obtain machine voice and real human voice.
In one embodiment, cepstrum mean variance normalization and spectral clustering of a spectrum of human-computer interaction speech to obtain a first machine speech spectrum and a first real human speech spectrum, comprising:
carrying out cepstrum mean variance normalization processing on the frequency spectrum of the human-computer interaction voice to obtain a processed frequency spectrum;
and carrying out frequency spectrum clustering on the processed frequency spectrum, and obtaining a first machine voice frequency spectrum and a first real person voice frequency spectrum according to the processing result of the frequency spectrum clustering.
In one embodiment, the spectral clustering of the processed spectrum, and obtaining the first machine speech spectrum and the first human speech spectrum according to the processing result of the spectral clustering, includes:
dividing the spectrum units included in the processed spectrum into a machine voice cluster and a non-machine voice cluster by using a first clustering algorithm;
using a second clustering algorithm to separate a cluster from a spectrum unit corresponding to the non-machine voice cluster as a real-person voice cluster;
obtaining a first machine voice frequency spectrum according to the machine voice cluster;
and obtaining a first real person voice frequency spectrum according to the real person voice cluster.
In one embodiment, the time sequence recurrent neural network is a long short-term memory (LSTM) recurrent neural network; the first clustering algorithm is the K-means clustering algorithm; and the second clustering algorithm is the DBSCAN clustering algorithm.
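As an illustrative aid (not part of the claims), the two-stage clustering described above can be sketched with scikit-learn. The feature representation of the time-frequency units, the heuristic for deciding which K-means cluster is the machine voice, and the DBSCAN parameters are all assumptions introduced for this sketch:

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

def two_stage_clustering(units, dbscan_eps=0.5, dbscan_min_samples=5):
    """Split time-frequency units into a machine-voice cluster and a
    real-person-voice cluster in two stages.

    units: (n_units, n_features) array, one row per time-frequency unit.
    Returns boolean masks (machine, real_person) over the rows of `units`.
    """
    # Stage 1: K-means with K=2 separates machine voice from everything else.
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(units)
    # Heuristic (an assumption): machine voice is acoustically more regular,
    # so the cluster with lower within-cluster variance is taken as machine.
    variances = [units[km.labels_ == k].var() for k in (0, 1)]
    machine = km.labels_ == int(np.argmin(variances))

    # Stage 2: DBSCAN on the non-machine units; the largest density cluster
    # is taken as the real-person voice, label -1 points are treated as noise.
    non_machine = units[~machine]
    db = DBSCAN(eps=dbscan_eps, min_samples=dbscan_min_samples).fit(non_machine)
    real_person = np.zeros(len(units), dtype=bool)
    labels, counts = np.unique(db.labels_[db.labels_ >= 0], return_counts=True)
    if len(labels) > 0:
        biggest = labels[np.argmax(counts)]
        idx = np.where(~machine)[0][db.labels_ == biggest]
        real_person[idx] = True
    return machine, real_person
```

The two stages mirror the embodiment: K-means first carves out the machine-voice cluster, then DBSCAN separates one cluster from the remaining units as the real-person voice.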
In one embodiment, performing matrix decomposition, respectively, on the first time-frequency mask and the second time-frequency mask output by the time sequence recurrent neural network in this training round and the spectrum of the human-computer interaction speech, to obtain a second machine speech spectrum and a second real human speech spectrum, includes:
performing point multiplication on a first time-frequency mask output by the time sequence cyclic neural network in the training and a frequency spectrum of the human-computer interaction voice to obtain a second machine voice frequency spectrum;
and performing point multiplication on a second time-frequency mask output by the time sequence cyclic neural network in the training and the frequency spectrum of the human-computer interaction voice to obtain a second real human voice frequency spectrum.
In one embodiment, the first error is indicative of a degree of match between the first machine speech spectrum and the second machine speech spectrum; the second error is used to represent a degree of match between the first human voice spectrum and the second human voice spectrum.
The present application provides a single-channel human-computer interaction voice separation apparatus according to a second aspect, and in one embodiment, the apparatus includes:
the time-frequency mask training module is used for training a first time-frequency mask used for extracting a machine voice spectrum and a second time-frequency mask used for extracting a real human voice spectrum through a time sequence cyclic neural network;
the voice separation module is used for carrying out voice separation processing on the target human-computer interaction voice to be separated according to the first time-frequency mask and the second time-frequency mask trained through the time sequence cyclic neural network to obtain machine voice and real human voice separated from the target human-computer interaction voice;
the time-frequency mask training module is specifically configured to, when used for performing each training through the time sequence recurrent neural network:
acquiring human-computer interaction voice for the training;
carrying out short-time Fourier transform on the acquired human-computer interaction voice to obtain a frequency spectrum of the human-computer interaction voice;
carrying out cepstrum mean variance normalization and spectral clustering on the frequency spectrum of the human-computer interaction voice to obtain a first machine voice frequency spectrum and a first real human voice frequency spectrum;
inputting the frequency spectrum of the human-computer interaction voice into a time sequence cyclic neural network to obtain a first time-frequency mask and a second time-frequency mask output by the time sequence cyclic neural network;
performing matrix decomposition, respectively, on the first time-frequency mask and the second time-frequency mask output by the time sequence recurrent neural network in this training round and the frequency spectrum of the human-computer interaction voice, to obtain a second machine voice frequency spectrum and a second real human voice frequency spectrum;
respectively calculating a first error between the first machine voice frequency spectrum and the second machine voice frequency spectrum and a second error between the first real human voice frequency spectrum and the second real human voice frequency spectrum;
judging whether the training stopping condition is met or not according to the first error and the second error;
if not, updating the network parameters of the time sequence cyclic neural network according to the first error and the second error, and carrying out next training through the updated time sequence cyclic neural network;
and if so, stopping training, and taking the first time-frequency mask and the second time-frequency mask output by the time sequence cyclic neural network in the training as the trained first time-frequency mask and second time-frequency mask.
According to a third aspect, the present application provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of an embodiment of any of the methods described above when executing the computer program.
The present application provides according to a fourth aspect a computer-readable storage medium having stored thereon a computer program which, when being executed by a processor, carries out the steps of the embodiments of the method of any of the above.
In the embodiments of the application, a first time-frequency mask for extracting a machine voice spectrum and a second time-frequency mask for extracting a real human voice spectrum are trained through a time sequence recurrent neural network, and voice separation processing is performed on the target human-computer interaction voice to be separated according to the trained first and second time-frequency masks, to obtain the machine voice and the real human voice separated from the target human-computer interaction voice. During training through the time sequence recurrent neural network, cepstrum mean variance normalization and spectral clustering are performed on the frequency spectrum of the human-computer interaction voice, so the machine voice spectrum and the real human voice spectrum can be obtained quickly. Because the spectra obtained by cepstrum mean variance normalization and spectral clustering are of high quality, they can be used directly for error analysis of the output of the time sequence recurrent neural network. Therefore, only the human-computer interaction voice needs to be prepared for training, without preparing corresponding clean machine voice and clean real human voice for error analysis. Thus, without losing separation precision for single-channel human-computer interaction voice, the training data required by deep learning can be significantly reduced, the training time shortened, and the voice separation speed improved.
Drawings
FIG. 1 is a flow chart of a single-channel human-computer interaction voice separation method according to an embodiment;
FIG. 2 is a diagram illustrating the overall process of speech separation in one embodiment;
FIG. 3 is a schematic diagram of a process for training by a sequential recurrent neural network in one embodiment;
FIG. 4 is a flowchart illustrating speech separation of target human-computer interaction speech according to an embodiment;
FIG. 5 is a block diagram of a single-channel human-computer interaction voice separation apparatus according to an embodiment;
FIG. 6 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The application provides a single-channel human-computer interaction voice separation method for scenes in which an AI (Artificial Intelligence) device (such as a smart speaker or smartphone that can perform voice interaction with a real person) interacts with a human. In one embodiment, the single-channel human-computer interaction voice separation method comprises the steps shown in fig. 1, explained below.
S110: and training a first time-frequency mask for extracting a machine voice spectrum and a second time-frequency mask for extracting a real human voice spectrum through a time sequence recurrent neural network.
The time sequence recurrent neural network is an LSTM (Long Short-Term Memory) recurrent neural network. The LSTM is selected in this embodiment because its long- and short-term memory capability captures time sequence characteristics well, which helps with the permutation invariance of the voices, so that the time-frequency mask (also simply called a mask) of each voice can be trained well.
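As an illustrative aid, a mask-estimating LSTM of the kind described can be sketched in PyTorch. The layer sizes, the use of a bidirectional network, and the sigmoid output are assumptions made for this sketch and are not fixed by the embodiment:

```python
import torch
import torch.nn as nn

class MaskLSTM(nn.Module):
    """Sketch of an LSTM mask estimator: the input is a magnitude
    spectrogram (batch, frames, freq_bins); the output is two
    time-frequency masks of the same shape, one intended for the
    machine voice and one for the real human voice."""

    def __init__(self, freq_bins=129, hidden=256, layers=2):
        super().__init__()
        self.lstm = nn.LSTM(freq_bins, hidden, num_layers=layers,
                            batch_first=True, bidirectional=True)
        # One linear head producing both masks; sigmoid keeps values in [0, 1].
        self.proj = nn.Linear(2 * hidden, 2 * freq_bins)

    def forward(self, spec):
        h, _ = self.lstm(spec)
        masks = torch.sigmoid(self.proj(h))
        m1, m2 = masks.chunk(2, dim=-1)  # first / second time-frequency mask
        return m1, m2
```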
Specifically, in this embodiment, multiple pieces of human-computer interaction speech are prepared as training samples (to obtain better time-frequency masks, more human-computer interaction speech may be prepared as training samples), and iterative training is performed through the time sequence recurrent neural network on these training samples.
S120: and performing voice separation processing on the target human-computer interaction voice to be separated according to the first time-frequency mask and the second time-frequency mask trained through the time sequence cyclic neural network to obtain the machine voice and the real human voice separated from the target human-computer interaction voice.
The target human-computer interaction voice refers to human-computer interaction voice to be separated. The human-computer interaction voice is a mixed voice in which a machine voice and a real human voice are mixed. It should be noted that, in the above-mentioned human-computer interaction, each interaction is usually performed by one real person and the AI device, and certainly, in the whole human-computer interaction process, a plurality of real persons may interact with the AI device.
In this embodiment, the process of training the first and second time-frequency masks through the time sequence recurrent neural network, and of performing speech separation using the trained masks, may be as shown in fig. 2.
The process of performing each training through the time sequence recurrent neural network, as shown in fig. 3, includes the following steps:
s210: and acquiring the man-machine interaction voice for the training.
In this embodiment, the human-computer interaction speech for training does not need to be labeled manually.
S220: and carrying out short-time Fourier transform (STFT) on the acquired human-computer interaction voice to obtain the frequency spectrum of the human-computer interaction voice.
The man-machine interactive voice is time domain information, and before voice separation processing is carried out on the man-machine interactive voice, the man-machine interactive voice is converted from the time domain information into frequency domain information through short-time Fourier transform, namely the frequency spectrum of the man-machine interactive voice.
Before the short-time Fourier transform, the human-computer interaction voice may be framed and windowed. Exemplarily, 20 ms of sampling points are taken as one frame signal (this embodiment does not specifically limit this; 32 ms of sampling points may also be taken as one frame), and the human-computer interaction speech is divided into J frames. At a sampling rate of 8 kHz one frame is 256 sampling points, and at a sampling rate of 16 kHz one frame is 512 sampling points; a frame shorter than 20 ms is first zero-padded to 256 or 512 sampling points. Each frame signal is then windowed, where a Hamming window or a Hanning window may be adopted as the windowing function. Finally, a short-time Fourier transform is performed on each windowed frame, so as to obtain the frequency spectrum of the human-computer interaction voice. It should be noted that the short-time Fourier transform yields both a magnitude spectrum and a phase spectrum of the human-computer interaction voice; hereinafter, unless otherwise specified, the spectrum of the human-computer interaction voice refers to its magnitude spectrum. The phase spectrum also needs to be retained: when the inverse Fourier transform is later performed on the machine voice spectrum or the real human voice spectrum, the phase spectrum must be combined with it to transform that spectrum back into time domain information. The short-time Fourier transform and inverse Fourier transform used in this embodiment are common techniques and are not described in detail here.
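The framing, windowing, and transform steps above can be sketched with scipy as follows. The Hamming window and the 8 kHz / 256-sample frame follow the text; the hop length of half a frame is an assumption, since the embodiment does not fix it:

```python
import numpy as np
from scipy.signal import stft, istft

def analyze(wave, sr=8000, frame_len=256):
    """Short-time Fourier transform of a mono waveform, returning the
    magnitude spectrum (used for separation) and the phase spectrum
    (retained for the later inverse transform)."""
    _, _, Z = stft(wave, fs=sr, window="hamming",
                   nperseg=frame_len, noverlap=frame_len // 2)
    return np.abs(Z), np.angle(Z)

def synthesize(mag, phase, sr=8000, frame_len=256):
    """Inverse STFT: recombine a (possibly masked) magnitude spectrum
    with the saved phase to recover a time-domain signal."""
    Z = mag * np.exp(1j * phase)
    _, wave = istft(Z, fs=sr, window="hamming",
                    nperseg=frame_len, noverlap=frame_len // 2)
    return wave
```

With an unmodified magnitude and the original phase, `synthesize(analyze(wave))` reconstructs the waveform up to numerical precision, which is what makes the mask-then-invert pipeline workable.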
S230: and (3) carrying out Cepstrum Mean Variance Normalization (CMVN) and spectral clustering on the frequency spectrum of the human-computer interaction voice to obtain a first machine voice frequency spectrum and a first real human voice frequency spectrum.
S240: and inputting the frequency spectrum of the human-computer interaction voice into the time sequence recurrent neural network to obtain a first time-frequency mask and a second time-frequency mask output by the time sequence recurrent neural network.
The time sequence cyclic neural network in this embodiment is used to train two time-frequency masks, namely, a first time-frequency mask and a second time-frequency mask.
The time-frequency mask can be understood as a filter screen and is used for filtering specific frequency domain information from the frequency domain information of the man-machine interaction voice. For example, the first time-frequency mask is trained to filter spectral information of machine speech (i.e., machine speech spectrum), and the second time-frequency mask is trained to filter spectral information of human speech (i.e., human speech spectrum).
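The filtering just described amounts, in the matrix-decomposition step, to an element-wise (point) multiplication of each time-frequency mask with the mixture's magnitude spectrum, which can be sketched as:

```python
import numpy as np

def apply_masks(mixture_mag, mask_machine, mask_person):
    """Extract the two source spectra by point multiplication of each
    time-frequency mask with the mixture's magnitude spectrum.
    All three arrays share the same (freq_bins, frames) shape."""
    machine_spec = mask_machine * mixture_mag
    person_spec = mask_person * mixture_mag
    return machine_spec, person_spec
```

If the two masks sum to one at every time-frequency unit (a common but here assumed design), the two extracted spectra add back up to the mixture spectrum.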
S250: and respectively carrying out matrix decomposition on the first time-frequency mask and the second time-frequency mask output by the time sequence cyclic neural network in the training and the frequency spectrum of the human-computer interaction voice to obtain a second machine voice frequency spectrum and a second real human voice frequency spectrum.
S260: a first error between the first machine speech spectrum and the second machine speech spectrum, and a second error between the first human-real speech spectrum and the second human-real speech spectrum are calculated, respectively.
Wherein the first error is indicative of a degree of match between the first machine speech spectrum and the second machine speech spectrum; the second error is used to represent a degree of match between the first human voice spectrum and the second human voice spectrum.
In this embodiment, the first machine voice spectrum and the first real human voice spectrum serve as the real (reference) machine voice spectrum and real human voice spectrum, while the second machine voice spectrum and the second real human voice spectrum are the predicted ones. To make the predicted spectra approximate the real spectra more closely, in each training round a first error between the first and second machine voice spectra and a second error between the first and second real human voice spectra are calculated, and whether to continue training the time sequence recurrent neural network is determined according to the calculated first and second errors.
S270: and judging whether the training stopping condition is met or not according to the first error and the second error.
The first error and the second error are summed to obtain a total error, and whether a preset training stopping condition is satisfied is judged according to the total error. The training stopping condition may be flexibly adjusted to the specific application scene, which this embodiment does not specifically limit; for example, the number of training rounds may reach a preset threshold, or it may be determined from the total error of the current round and that of the previous round whether the loss of the time sequence recurrent neural network has decreased to a minimum, in which case the current round is judged to satisfy the training stopping condition.
S280: if not, updating the network parameters of the time sequence recurrent neural network according to the first error and the second error, and carrying out next training through the updated time sequence recurrent neural network.
The first error and the second error are summed to obtain a total error, and back propagation is performed according to the total error to optimize the network parameters of the time sequence recurrent neural network. The next training round is then performed through the network with the updated parameters; for that round's process, refer again to the process shown in fig. 3, which is not repeated here.
S290: and if so, stopping training, and taking the first time-frequency mask and the second time-frequency mask output by the time sequence cyclic neural network in the training as the trained first time-frequency mask and second time-frequency mask.
This embodiment targets the characteristics of the human-computer interaction scene: the acoustic features and dialogue of machine voice are often regular, the machine usually interacts alternately with the real human voice (also called human voice), and the two overlap little. Global CMVN is added to the mask-based neural network to enhance robustness (the performance improvement is especially obvious at low signal-to-noise ratios), so good acoustic features of the machine voice can be extracted quickly from the human-computer interaction voice through a single simple deep clustering pass, and deep clustering is then performed again on the non-machine voice to conveniently extract the acoustic features of the real human voice. A conventional neural network scheme would require, besides the mixed voice input to the network, corresponding clean voice (specifically, clean machine voice and clean real human voice) for error analysis. In this embodiment, because global CMVN and deep clustering are added to the mask-based neural network, high-quality acoustic features of the machine voice and of the real human voice can be extracted and used directly for error analysis of the network output, so only the mixed voice input to the network needs to be prepared for training. The training data and training time required can thus be greatly reduced without loss of precision.
After the first and second time-frequency masks are trained, voice separation processing can be performed directly on the target human-computer interaction voice according to them. Compared with general separation algorithms such as Independent Component Analysis (ICA), Sparse Principal Component Analysis (Sparse PCA), and Non-negative Matrix Factorization (NMF), this embodiment therefore has a faster voice separation speed: exemplarily, when a general separation algorithm is used to separate human-computer interaction voice, separating a voice of 1 minute duration takes about 10 seconds, while this embodiment needs only about 6 seconds, a speed increase of 40%.
In an embodiment, as shown in fig. 4, performing voice separation processing on the target human-computer interaction voice to be separated according to the first and second time-frequency masks trained through the time sequence recurrent neural network, to obtain the machine voice and the real human voice separated from the target human-computer interaction voice, includes:
s310: acquiring target human-computer interaction voice to be separated;
s320: carrying out short-time Fourier transform on the target human-computer interaction voice to obtain a frequency spectrum of the target human-computer interaction voice;
s330: performing matrix decomposition on a first time-frequency mask and a second time-frequency mask which are trained through a time sequence cyclic neural network and a frequency spectrum of target human-computer interaction voice respectively to obtain a machine voice frequency spectrum and a real human voice frequency spectrum;
s340: and respectively carrying out inverse Fourier transform on the machine voice frequency spectrum and the real human voice frequency spectrum to obtain machine voice and real human voice.
In this embodiment, referring to fig. 2, after the first and second time-frequency masks have been trained, the target human-computer interaction voice to be separated is subjected to a short-time Fourier transform to obtain its frequency domain information (i.e., its frequency spectrum), and the trained first and second time-frequency masks can then be used to perform voice separation processing on it directly, without subsequent operations such as CMVN and spectral clustering, so the voice separation speed can be increased.
Specifically, matrix decomposition is carried out on the trained first time-frequency mask and the frequency spectrum of the target human-computer interaction voice to obtain a predicted machine voice frequency spectrum, and matrix decomposition is carried out on the trained second time-frequency mask and the frequency spectrum of the target human-computer interaction voice to obtain a predicted real human voice frequency spectrum. Then, the predicted machine voice spectrum and the real human voice spectrum are subjected to inverse Fourier transform respectively to transform the predicted machine voice spectrum into time domain information (i.e., machine voice) and transform the predicted real human voice spectrum into time domain information (i.e., real human voice).
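The inference pipeline of steps S310–S340 can be sketched as follows. This is a minimal illustration, assuming pre-trained masks shaped like the mixture spectrum; the function name `separate` and its parameters are hypothetical, not from the patent text:

```python
import numpy as np
from scipy.signal import stft, istft

def separate(mixture, mask_machine, mask_human, fs=8000, nperseg=256):
    """Separate a mono human-computer interaction recording using two
    pre-trained time-frequency masks (illustrative helper)."""
    # S320: short-time Fourier transform of the target voice
    _, _, spec = stft(mixture, fs=fs, nperseg=nperseg)
    # S330: apply each mask to the mixture spectrum (element-wise product)
    machine_spec = mask_machine * spec
    human_spec = mask_human * spec
    # S340: inverse transform each masked spectrum back to a waveform
    _, machine = istft(machine_spec, fs=fs, nperseg=nperseg)
    _, human = istft(human_spec, fs=fs, nperseg=nperseg)
    return machine, human
```

With an all-ones machine mask and an all-zeros human mask, the machine output reproduces the mixture and the human output is silent, which is a convenient sanity check of the STFT/inverse-STFT round trip.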
In one embodiment, the performing cepstrum mean variance normalization and spectral clustering on the spectrum of the human-computer interaction speech to obtain a first machine speech spectrum and a first human-real speech spectrum includes: carrying out cepstrum mean variance normalization processing on the frequency spectrum of the human-computer interaction voice to obtain a processed frequency spectrum; and carrying out frequency spectrum clustering on the processed frequency spectrum, and obtaining a first machine voice frequency spectrum and a first real person voice frequency spectrum according to the processing result of the frequency spectrum clustering.
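The cepstrum mean-variance normalization step could look like the following sketch. The patent does not give the exact variant, so this assumes the common per-frequency-bin formulation (subtract the mean and divide by the standard deviation across frames); the function name `cmvn` is illustrative:

```python
import numpy as np

def cmvn(log_spec, eps=1e-8):
    """Mean-variance normalization of a (bins, frames) spectral feature
    matrix: per bin, zero-mean and unit-variance across frames."""
    mean = log_spec.mean(axis=1, keepdims=True)  # mean over frames
    std = log_spec.std(axis=1, keepdims=True)    # std over frames
    return (log_spec - mean) / (std + eps)       # eps avoids division by 0
```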
The method for clustering the frequency spectrum after processing and obtaining the first machine voice frequency spectrum and the first real person voice frequency spectrum according to the processing result of the frequency spectrum clustering comprises the following steps:
dividing the spectrum units included in the processed spectrum into a machine voice cluster and a non-machine voice cluster by using a first clustering algorithm;
using a second clustering algorithm to separate a cluster from a spectrum unit corresponding to the non-machine voice cluster as a real-person voice cluster;
obtaining a first machine voice frequency spectrum according to the machine voice cluster;
and obtaining a first real person voice frequency spectrum according to the real person voice cluster.
In a human-computer interaction scene, because the timbre of the voice robot is fixed in advance, the acoustic features of the robot voice are often regular, while human voice is often irregular. Based on this, a K-means clustering algorithm is selected to perform initial clustering on the processed spectrum; K-means can distinguish the robot voice from the real human voice well. Specifically, the algorithm divides the time-frequency units included in the processed spectrum into two clusters, one regular (namely the machine voice cluster) and one irregular (namely the non-machine voice cluster). Because human voices are often irregular and noisy, and a third voice may also appear in a human-computer interaction scene, the DBSCAN clustering algorithm is selected to process the irregular signal data. Specifically, DBSCAN further clusters the time-frequency units corresponding to the non-machine voice cluster and separates out one cluster, which is taken as the real human voice cluster. The spectrum obtained from the time-frequency units corresponding to the machine voice cluster is the first machine voice spectrum, and the spectrum obtained from the time-frequency units corresponding to the real human voice cluster is the first real human voice spectrum. By combining K-means and DBSCAN for spectral clustering, this embodiment achieves a voice separation accuracy far higher than that of common single-channel separation.
It should be noted that the K-means and DBSCAN algorithms used in the spectral clustering process are only examples. In implementation, another clustering algorithm suitable for distinguishing regular features may be selected as the first clustering algorithm (of which K-means is one instance) to perform the first clustering, and another clustering algorithm suitable for distinguishing irregular features may be selected as the second clustering algorithm (of which DBSCAN is one instance) to perform the second clustering; the same effect of making the voice separation accuracy exceed that of common single-channel separation can still be achieved.
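A sketch of the two-stage clustering with scikit-learn follows. It is illustrative only: the feature design, the `eps`/`min_samples` values, and the heuristic of treating the tighter K-means cluster as the regular (machine) cluster are all assumptions, not details from the patent:

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

def cluster_tf_units(features):
    """First stage: K-means (k=2) splits time-frequency units into a
    regular (machine) cluster and a residual cluster. Second stage:
    DBSCAN carves one dense cluster of human speech out of the residual,
    leaving noise and possible third voices marked as outliers."""
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
    # Heuristic: the lower-variance (more regular) cluster is machine voice
    spreads = [features[km.labels_ == c].var() for c in (0, 1)]
    machine_label = int(np.argmin(spreads))
    machine_idx = np.where(km.labels_ == machine_label)[0]
    rest_idx = np.where(km.labels_ != machine_label)[0]
    # DBSCAN on the non-machine units; label -1 marks noise points
    db = DBSCAN(eps=0.5, min_samples=5).fit(features[rest_idx])
    human_idx = rest_idx[db.labels_ != -1]
    return machine_idx, human_idx
```

Taking every non-noise DBSCAN point as the human cluster is a simplification; an implementation following the patent's wording would separate out a single cluster, e.g. the largest one.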
In an embodiment, the matrix decomposition of the first time-frequency mask and the second time-frequency mask output by the time-sequence recurrent neural network in the training respectively with the spectrum of the human-computer interaction speech to obtain the second machine speech spectrum and the second real-person speech spectrum includes: performing point multiplication on a first time-frequency mask output by the time sequence cyclic neural network in the training and a frequency spectrum of the human-computer interaction voice to obtain a second machine voice frequency spectrum; and performing point multiplication on a second time-frequency mask output by the time sequence cyclic neural network in the training and the frequency spectrum of the human-computer interaction voice to obtain a second real human voice frequency spectrum.
The first time-frequency mask and the second time-frequency mask output by the time-sequence recurrent neural network are both binarized matrices, and the spectrum of the human-computer interaction voice is likewise a matrix, so the matrix decomposition is performed by point-multiplying the two corresponding matrices (the first time-frequency mask with the spectrum of the human-computer interaction voice, and the second time-frequency mask with the spectrum of the human-computer interaction voice) to obtain the corresponding voice spectra (namely the second machine voice spectrum and the second real human voice spectrum).
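The point multiplication described above is an element-wise (Hadamard) product between mask and spectrum. A toy numeric illustration, with made-up matrix values and complementary binary masks:

```python
import numpy as np

spec = np.array([[1.0, 2.0],
                 [3.0, 4.0]])            # mixture spectrum (toy values)
mask_machine = np.array([[1.0, 0.0],
                         [0.0, 1.0]])    # binarized first (machine) mask
mask_human = 1.0 - mask_machine          # complementary second (human) mask

machine_spec = mask_machine * spec       # second machine voice spectrum
human_spec = mask_human * spec           # second real human voice spectrum
```

Because the two masks are complementary here, the two extracted spectra sum back to the mixture spectrum, i.e. every time-frequency unit is assigned to exactly one source.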
Fig. 1, fig. 3 and fig. 4 are schematic flowcharts of a single-channel human-computer interaction voice separation method in an embodiment. It should be understood that although the steps in the flowcharts of fig. 1, 3 and 4 are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in fig. 1, 3 and 4 may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
Based on the same inventive concept, the application also provides a single-channel human-computer interaction voice separation device. In this embodiment, as shown in fig. 5, the single-channel human-computer interaction voice separation apparatus includes the following modules:
the time-frequency mask training module 110 is used for training a first time-frequency mask used for extracting a machine voice spectrum and a second time-frequency mask used for extracting a real human voice spectrum through a time sequence recurrent neural network;
the voice separation module 120 is configured to perform voice separation processing on the target human-computer interaction voice to be separated according to the first time-frequency mask and the second time-frequency mask trained by the time sequence recurrent neural network, so as to obtain a machine voice and a real human voice separated from the target human-computer interaction voice;
the time-frequency mask training module 110 is specifically configured to, when performing each training through the timing cycle neural network:
acquiring human-computer interaction voice for the training;
carrying out short-time Fourier transform on the acquired human-computer interaction voice to obtain a frequency spectrum of the human-computer interaction voice;
carrying out cepstrum mean variance normalization and spectral clustering on the frequency spectrum of the human-computer interaction voice to obtain a first machine voice frequency spectrum and a first real human voice frequency spectrum;
inputting the frequency spectrum of the human-computer interaction voice into a time sequence cyclic neural network to obtain a first time-frequency mask and a second time-frequency mask output by the time sequence cyclic neural network;
performing matrix decomposition on a first time-frequency mask and a second time-frequency mask output by the time sequence cyclic neural network in the training and a frequency spectrum of the human-computer interaction voice respectively to obtain a second machine voice frequency spectrum and a second real human voice frequency spectrum;
respectively calculating a first error between the first machine voice frequency spectrum and the second machine voice frequency spectrum and a second error between the first real human voice frequency spectrum and the second real human voice frequency spectrum;
judging whether the training stopping condition is met or not according to the first error and the second error;
if not, updating the network parameters of the time sequence cyclic neural network according to the first error and the second error, and carrying out next training through the updated time sequence cyclic neural network;
and if so, stopping training, and taking the first time-frequency mask and the second time-frequency mask output by the time sequence cyclic neural network in the training as the trained first time-frequency mask and second time-frequency mask.
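The training procedure above could be sketched as follows. This is a hypothetical PyTorch implementation, not the patent's: the network sizes, sigmoid mask outputs, MSE loss, and helper names (`MaskNet`, `train_step`) are all illustrative assumptions. The two reference spectra regressed against correspond to the clustering-derived first machine and first real human voice spectra:

```python
import torch
import torch.nn as nn

class MaskNet(nn.Module):
    """Minimal LSTM mask estimator: maps a (batch, frames, bins)
    magnitude spectrum to two time-frequency masks in (0, 1)."""
    def __init__(self, bins=64, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(bins, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, 2 * bins)
        self.bins = bins

    def forward(self, spec):
        h, _ = self.lstm(spec)
        masks = torch.sigmoid(self.proj(h))          # soft masks in (0, 1)
        return masks[..., :self.bins], masks[..., self.bins:]

def train_step(net, opt, spec, target_machine, target_human):
    """One iteration: apply the estimated masks to the mixture spectrum
    and compute the two errors against the reference spectra, then
    update the network parameters."""
    m1, m2 = net(spec)
    loss = nn.functional.mse_loss(m1 * spec, target_machine) \
         + nn.functional.mse_loss(m2 * spec, target_human)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

Iterating `train_step` until the combined error falls below a threshold mirrors the stop-condition check; the masks produced by the final pass are then kept as the trained first and second time-frequency masks.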
In an embodiment, the speech separation module is specifically configured to:
acquiring target human-computer interaction voice to be separated;
carrying out short-time Fourier transform on the target human-computer interaction voice to obtain a frequency spectrum of the target human-computer interaction voice;
performing matrix decomposition on a first time-frequency mask and a second time-frequency mask which are trained through a time sequence cyclic neural network and a frequency spectrum of target human-computer interaction voice respectively to obtain a machine voice frequency spectrum and a real human voice frequency spectrum;
and respectively carrying out inverse Fourier transform on the machine voice frequency spectrum and the real human voice frequency spectrum to obtain machine voice and real human voice.
In one embodiment, the time-frequency mask training module is specifically configured to, when configured to perform cepstrum-mean-variance normalization and spectral clustering on a spectrum of human-computer interaction speech to obtain a first machine speech spectrum and a first real-person speech spectrum:
carrying out cepstrum mean variance normalization processing on the frequency spectrum of the human-computer interaction voice to obtain a processed frequency spectrum;
and carrying out frequency spectrum clustering on the processed frequency spectrum, and obtaining a first machine voice frequency spectrum and a first real person voice frequency spectrum according to the processing result of the frequency spectrum clustering.
In one embodiment, the time-frequency mask training module is specifically configured to, when performing spectrum clustering on the processed spectrum and obtaining the first machine voice spectrum and the first real person voice spectrum according to a processing result of the spectrum clustering:
dividing the spectrum units included in the processed spectrum into a machine voice cluster and a non-machine voice cluster by using a first clustering algorithm;
using a second clustering algorithm to separate a cluster from a spectrum unit corresponding to the non-machine voice cluster as a real-person voice cluster;
obtaining a first machine voice frequency spectrum according to the machine voice cluster;
and obtaining a first real person voice frequency spectrum according to the real person voice cluster.
In one embodiment, the time-sequence recurrent neural network is a long short-term memory (LSTM) recurrent neural network; the first clustering algorithm is a K-means clustering algorithm; and the second clustering algorithm is a DBSCAN clustering algorithm.
In one embodiment, the time-frequency mask training module is specifically configured to, when performing matrix decomposition on a first time-frequency mask and a second time-frequency mask output by the time-sequence recurrent neural network in the training respectively with the spectrum of the human-computer interaction voice to obtain a second machine voice spectrum and a second real human voice spectrum:
performing point multiplication on a first time-frequency mask output by the time sequence cyclic neural network in the training and a frequency spectrum of the human-computer interaction voice to obtain a second machine voice frequency spectrum;
and performing point multiplication on a second time-frequency mask output by the time sequence cyclic neural network in the training and the frequency spectrum of the human-computer interaction voice to obtain a second real human voice frequency spectrum.
In one embodiment, the first error is indicative of a degree of match between the first machine speech spectrum and the second machine speech spectrum; the second error is used to represent a degree of match between the first human voice spectrum and the second human voice spectrum.
For specific limitations of the single-channel human-computer interaction voice separation apparatus, reference may be made to the above limitations of the single-channel human-computer interaction voice separation method, which is not described herein again. All modules in the single-channel human-computer interaction voice separation device can be completely or partially realized through software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, the internal structure of which may be as shown in FIG. 6.
The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing data such as the first time-frequency mask, the second time-frequency mask, and the like, and the specific stored data may also refer to the limitations in the above method embodiments. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a single-channel human-computer interaction voice separation method.
Those skilled in the art will appreciate that the architecture shown in fig. 6 is merely a block diagram of part of the structure related to the solution of the present application, and does not limit the computer devices to which the solution of the present application applies; a particular computer device may include more or fewer components than shown, or combine certain components, or have a different arrangement of components.
The embodiment also provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the following steps are implemented:
training a first time-frequency mask used for extracting a machine voice spectrum and a second time-frequency mask used for extracting a real person voice spectrum through a time sequence cyclic neural network; performing voice separation processing on the target human-computer interaction voice to be separated according to the first time-frequency mask and the second time-frequency mask trained through the time sequence cyclic neural network to obtain machine voice and real human voice separated from the target human-computer interaction voice; the process of training each time through the time sequence recurrent neural network comprises the following steps: acquiring human-computer interaction voice for the training; carrying out short-time Fourier transform on the acquired human-computer interaction voice to obtain a frequency spectrum of the human-computer interaction voice; carrying out cepstrum mean variance normalization and spectral clustering on the frequency spectrum of the human-computer interaction voice to obtain a first machine voice frequency spectrum and a first real human voice frequency spectrum; inputting the frequency spectrum of the human-computer interaction voice into a time sequence cyclic neural network to obtain a first time-frequency mask and a second time-frequency mask output by the time sequence cyclic neural network; performing matrix decomposition on a first time-frequency mask and a second time-frequency mask output by the time sequence cyclic neural network in the training and a frequency spectrum of the human-computer interaction voice respectively to obtain a second machine voice frequency spectrum and a second real human voice frequency spectrum; respectively calculating a first error between the first machine voice frequency spectrum and the second machine voice frequency spectrum and a second error between the first real human voice frequency spectrum and the second real human voice frequency spectrum; judging whether the 
training stopping condition is met or not according to the first error and the second error; if not, updating the network parameters of the time sequence cyclic neural network according to the first error and the second error, and carrying out next training through the updated time sequence cyclic neural network; and if so, stopping training, and taking the first time-frequency mask and the second time-frequency mask output by the time sequence cyclic neural network in the training as the trained first time-frequency mask and second time-frequency mask.
In one embodiment, the processor executes a computer program to perform voice separation processing on a target human-computer interaction voice to be separated according to a first time-frequency mask and a second time-frequency mask trained by a time sequence recurrent neural network, and when a machine voice and a real human voice separated from the target human-computer interaction voice are obtained, the following steps are further implemented:
acquiring target human-computer interaction voice to be separated; carrying out short-time Fourier transform on the target human-computer interaction voice to obtain a frequency spectrum of the target human-computer interaction voice; performing matrix decomposition on a first time-frequency mask and a second time-frequency mask which are trained through a time sequence cyclic neural network and a frequency spectrum of target human-computer interaction voice respectively to obtain a machine voice frequency spectrum and a real human voice frequency spectrum; and respectively carrying out inverse Fourier transform on the machine voice frequency spectrum and the real human voice frequency spectrum to obtain machine voice and real human voice.
In one embodiment, when the processor executes the computer program to perform cepstrum mean variance normalization and spectral clustering on the frequency spectrum of the human-computer interaction voice to obtain a first machine voice frequency spectrum and a first real human voice frequency spectrum, the following steps are further performed:
carrying out cepstrum mean variance normalization processing on the frequency spectrum of the human-computer interaction voice to obtain a processed frequency spectrum; and carrying out frequency spectrum clustering on the processed frequency spectrum, and obtaining a first machine voice frequency spectrum and a first real person voice frequency spectrum according to the processing result of the frequency spectrum clustering.
In one embodiment, the processor executes the computer program to perform spectrum clustering on the processed spectrum, and when the first machine voice frequency spectrum and the first real person voice frequency spectrum are obtained according to the processing result of the spectrum clustering, the following steps are further performed:
dividing the spectrum units included in the processed spectrum into a machine voice cluster and a non-machine voice cluster by using a first clustering algorithm; using a second clustering algorithm to separate a cluster from a spectrum unit corresponding to the non-machine voice cluster as a real-person voice cluster; obtaining a first machine voice frequency spectrum according to the machine voice cluster; and obtaining a first real person voice frequency spectrum according to the real person voice cluster.
In one embodiment, when the processor executes a computer program to perform matrix decomposition on a first time-frequency mask and a second time-frequency mask output by the time-series recurrent neural network in the training and a spectrum of the human-computer interaction speech respectively to obtain a second machine speech spectrum and a second real human speech spectrum, the following steps are further implemented:
performing point multiplication on a first time-frequency mask output by the time sequence cyclic neural network in the training and a frequency spectrum of the human-computer interaction voice to obtain a second machine voice frequency spectrum; and performing point multiplication on a second time-frequency mask output by the time sequence cyclic neural network in the training and the frequency spectrum of the human-computer interaction voice to obtain a second real human voice frequency spectrum.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:
training a first time-frequency mask used for extracting a machine voice spectrum and a second time-frequency mask used for extracting a real person voice spectrum through a time sequence cyclic neural network; performing voice separation processing on the target human-computer interaction voice to be separated according to the first time-frequency mask and the second time-frequency mask trained through the time sequence cyclic neural network to obtain machine voice and real human voice separated from the target human-computer interaction voice; the process of training each time through the time sequence recurrent neural network comprises the following steps: acquiring human-computer interaction voice for the training; carrying out short-time Fourier transform on the acquired human-computer interaction voice to obtain a frequency spectrum of the human-computer interaction voice; carrying out cepstrum mean variance normalization and spectral clustering on the frequency spectrum of the human-computer interaction voice to obtain a first machine voice frequency spectrum and a first real human voice frequency spectrum; inputting the frequency spectrum of the human-computer interaction voice into a time sequence cyclic neural network to obtain a first time-frequency mask and a second time-frequency mask output by the time sequence cyclic neural network; performing matrix decomposition on a first time-frequency mask and a second time-frequency mask output by the time sequence cyclic neural network in the training and a frequency spectrum of the human-computer interaction voice respectively to obtain a second machine voice frequency spectrum and a second real human voice frequency spectrum; respectively calculating a first error between the first machine voice frequency spectrum and the second machine voice frequency spectrum and a second error between the first real human voice frequency spectrum and the second real human voice frequency spectrum; judging whether the 
training stopping condition is met or not according to the first error and the second error; if not, updating the network parameters of the time sequence cyclic neural network according to the first error and the second error, and carrying out next training through the updated time sequence cyclic neural network; and if so, stopping training, and taking the first time-frequency mask and the second time-frequency mask output by the time sequence cyclic neural network in the training as the trained first time-frequency mask and second time-frequency mask.
In one embodiment, when the computer program is executed by the processor, and performs voice separation processing on a target human-computer interaction voice to be separated according to a first time-frequency mask and a second time-frequency mask trained by a time-sequence recurrent neural network to obtain a machine voice and a real human voice separated from the target human-computer interaction voice, the following steps are further implemented:
acquiring target human-computer interaction voice to be separated; carrying out short-time Fourier transform on the target human-computer interaction voice to obtain a frequency spectrum of the target human-computer interaction voice; performing matrix decomposition on a first time-frequency mask and a second time-frequency mask which are trained through a time sequence cyclic neural network and a frequency spectrum of target human-computer interaction voice respectively to obtain a machine voice frequency spectrum and a real human voice frequency spectrum; and respectively carrying out inverse Fourier transform on the machine voice frequency spectrum and the real human voice frequency spectrum to obtain machine voice and real human voice.
In one embodiment, the computer program, when executed by a processor, performs cepstral mean variance normalization and spectral clustering on a spectrum of human-computer interaction speech to obtain a first machine speech spectrum and a first real human speech spectrum, further performs the following steps:
carrying out cepstrum mean variance normalization processing on the frequency spectrum of the human-computer interaction voice to obtain a processed frequency spectrum; and carrying out frequency spectrum clustering on the processed frequency spectrum, and obtaining a first machine voice frequency spectrum and a first real person voice frequency spectrum according to the processing result of the frequency spectrum clustering.
In one embodiment, when the computer program is executed by the processor to perform spectrum clustering on the processed spectrum and obtain the first machine voice spectrum and the first real person voice spectrum according to the processing result of the spectrum clustering, the following steps are further implemented:
dividing the spectrum units included in the processed spectrum into a machine voice cluster and a non-machine voice cluster by using a first clustering algorithm; using a second clustering algorithm to separate a cluster from a spectrum unit corresponding to the non-machine voice cluster as a real-person voice cluster; obtaining a first machine voice frequency spectrum according to the machine voice cluster; and obtaining a first real person voice frequency spectrum according to the real person voice cluster.
In one embodiment, when the computer program is executed by the processor, and performs matrix decomposition on the first time-frequency mask and the second time-frequency mask output by the time-series recurrent neural network in the training and the frequency spectrum of the human-computer interaction speech respectively to obtain a second machine speech frequency spectrum and a second real human speech frequency spectrum, the following steps are further implemented:
performing point multiplication on a first time-frequency mask output by the time sequence cyclic neural network in the training and a frequency spectrum of the human-computer interaction voice to obtain a second machine voice frequency spectrum; and performing point multiplication on a second time-frequency mask output by the time sequence cyclic neural network in the training and the frequency spectrum of the human-computer interaction voice to obtain a second real human voice frequency spectrum.
It will be understood by those skilled in the art that all or part of the processes of the method embodiments described above can be implemented by a computer program instructing relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium, and when executed, may include the processes of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A single-channel human-computer interaction voice separation method, characterized by comprising:
training, through a time-series recurrent neural network, a first time-frequency mask for extracting a machine voice spectrum and a second time-frequency mask for extracting a real human voice spectrum;
performing voice separation processing on a target human-computer interaction voice to be separated according to the first time-frequency mask and the second time-frequency mask trained through the time-series recurrent neural network, to obtain the machine voice and the real human voice separated from the target human-computer interaction voice;
wherein each training pass through the time-series recurrent neural network comprises:
acquiring the human-computer interaction voice for this training pass;
performing a short-time Fourier transform on the acquired human-computer interaction voice to obtain a spectrum of the human-computer interaction voice;
performing cepstral mean and variance normalization and spectral clustering on the spectrum of the human-computer interaction voice to obtain a first machine voice spectrum and a first real human voice spectrum;
inputting the spectrum of the human-computer interaction voice into the time-series recurrent neural network to obtain the first time-frequency mask and the second time-frequency mask output by the network;
performing matrix decomposition of the first time-frequency mask and the second time-frequency mask output in this training pass with the spectrum of the human-computer interaction voice, respectively, to obtain a second machine voice spectrum and a second real human voice spectrum;
calculating a first error between the first machine voice spectrum and the second machine voice spectrum, and a second error between the first real human voice spectrum and the second real human voice spectrum;
determining, according to the first error and the second error, whether a training-stop condition is satisfied;
if not, updating the network parameters of the time-series recurrent neural network according to the first error and the second error, and performing the next training pass with the updated network;
if so, stopping training, and taking the first time-frequency mask and the second time-frequency mask output in this training pass as the trained first time-frequency mask and second time-frequency mask.
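The training pass described in claim 1 can be sketched as follows. This is a minimal, hedged illustration in NumPy only: the recurrent network is replaced by a hypothetical placeholder `mask_network`, and the clustering-derived reference spectra are stand-ins; none of these names or values come from the patent itself.

```python
# Sketch of one training pass: STFT magnitudes in, two masks out,
# point-products against the mixture, then error vs. the clustering targets.
import numpy as np

rng = np.random.default_rng(0)
F, T = 129, 50                      # frequency bins x time frames (illustrative)

mixture = rng.random((F, T))        # |STFT| of the human-computer interaction voice
machine_ref = 0.6 * mixture         # stand-in first machine voice spectrum (clustering target)
human_ref = 0.4 * mixture           # stand-in first real human voice spectrum

def mask_network(spectrum):
    """Placeholder for the time-series recurrent network: two time-frequency masks."""
    m1 = np.full_like(spectrum, 0.5)
    return m1, 1.0 - m1             # complementary masks, one value per T-F unit

mask1, mask2 = mask_network(mixture)
machine_est = mask1 * mixture       # second machine voice spectrum (point product)
human_est = mask2 * mixture         # second real human voice spectrum

err1 = np.mean((machine_ref - machine_est) ** 2)   # first error
err2 = np.mean((human_ref - human_est) ** 2)       # second error
stop = (err1 + err2) < 1e-4         # example training-stop condition
```

In a real implementation the errors would drive a gradient update of the network parameters before the next pass; the stopping threshold here is purely illustrative.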
2. The method of claim 1, wherein performing voice separation processing on the target human-computer interaction voice to be separated according to the first time-frequency mask and the second time-frequency mask trained through the time-series recurrent neural network, to obtain the machine voice and the real human voice separated from the target human-computer interaction voice, comprises:
acquiring the target human-computer interaction voice to be separated;
performing a short-time Fourier transform on the target human-computer interaction voice to obtain a spectrum of the target human-computer interaction voice;
performing matrix decomposition of the trained first time-frequency mask and second time-frequency mask with the spectrum of the target human-computer interaction voice, respectively, to obtain a machine voice spectrum and a real human voice spectrum;
performing an inverse Fourier transform on the machine voice spectrum and the real human voice spectrum respectively, to obtain the machine voice and the real human voice.
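The separation pipeline of claim 2 can be sketched with SciPy's STFT/iSTFT pair. This is an assumption-laden illustration: the "trained" masks below are fixed arrays standing in for the network outputs, and the test signal is synthetic.

```python
# Sketch of claim 2: STFT -> apply masks element-wise -> inverse STFT.
import numpy as np
from scipy.signal import stft, istft

fs = 8000
t = np.arange(fs) / fs
mixed = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 220 * t)

f, frames, Z = stft(mixed, fs=fs, nperseg=256)   # spectrum of the target voice

mask_machine = np.full(Z.shape, 0.7)             # stand-in trained first mask
mask_human = 1.0 - mask_machine                  # stand-in trained second mask

machine_spec = mask_machine * Z                  # element-wise "matrix decomposition"
human_spec = mask_human * Z

_, machine_voice = istft(machine_spec, fs=fs, nperseg=256)  # inverse transform
_, human_voice = istft(human_spec, fs=fs, nperseg=256)
```

Because the iSTFT is linear and the two masks sum to one at every time-frequency unit, the two separated signals sum back to the original mixture, which is a convenient sanity check on any masking implementation.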
3. The method of claim 1, wherein performing cepstral mean and variance normalization and spectral clustering on the spectrum of the human-computer interaction voice to obtain the first machine voice spectrum and the first real human voice spectrum comprises:
performing cepstral mean and variance normalization on the spectrum of the human-computer interaction voice to obtain a normalized spectrum;
performing spectral clustering on the normalized spectrum, and obtaining the first machine voice spectrum and the first real human voice spectrum according to the clustering result.
4. The method of claim 3, wherein performing spectral clustering on the normalized spectrum and obtaining the first machine voice spectrum and the first real human voice spectrum according to the clustering result comprises:
dividing the spectrum units of the normalized spectrum into a machine voice cluster and a non-machine voice cluster using a first clustering algorithm;
separating, using a second clustering algorithm, one cluster from the spectrum units of the non-machine voice cluster as a real human voice cluster;
obtaining the first machine voice spectrum according to the machine voice cluster;
obtaining the first real human voice spectrum according to the real human voice cluster.
5. The method of claim 4, wherein the time-series recurrent neural network is a long short-term memory (LSTM) recurrent neural network, the first clustering algorithm is the K-means clustering algorithm, and the second clustering algorithm is the DBSCAN clustering algorithm.
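The two-stage clustering of claims 4 and 5 can be sketched with scikit-learn. This is a hedged illustration: the 2-D synthetic feature vectors below stand in for the patent's normalized spectrum units, and the cluster parameters (`eps`, `min_samples`) are assumptions, not values from the patent.

```python
# Sketch of claims 4-5: K-means splits machine vs. non-machine units,
# then DBSCAN isolates a dense real-human-voice cluster from the remainder.
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

rng = np.random.default_rng(1)
machine_units = rng.normal(loc=0.0, scale=0.3, size=(100, 2))  # tight machine cluster
human_units = rng.normal(loc=5.0, scale=0.3, size=(80, 2))     # tight human cluster
units = np.vstack([machine_units, human_units])                # stand-in spectrum units

# Stage 1: K-means divides units into machine and non-machine clusters.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(units)
machine_cluster = units[km_labels == km_labels[0]]             # cluster containing unit 0
non_machine = units[km_labels != km_labels[0]]

# Stage 2: DBSCAN separates one dense cluster as the real human voice cluster.
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(non_machine)
human_cluster = non_machine[db_labels == 0]                    # first dense cluster
```

DBSCAN is a natural choice for the second stage because, unlike K-means, it does not need the number of clusters in advance and labels sparse leftovers as noise (`-1`) rather than forcing them into a voice cluster.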
6. The method of claim 1, wherein performing matrix decomposition of the first time-frequency mask and the second time-frequency mask output in the training pass with the spectrum of the human-computer interaction voice, respectively, to obtain the second machine voice spectrum and the second real human voice spectrum, comprises:
point-multiplying the first time-frequency mask output in the training pass with the spectrum of the human-computer interaction voice to obtain the second machine voice spectrum;
point-multiplying the second time-frequency mask output in the training pass with the spectrum of the human-computer interaction voice to obtain the second real human voice spectrum.
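The "point multiplication" of claim 6 is an element-wise (Hadamard) product between each mask and the mixture spectrum. A minimal sketch, with arbitrary illustrative shapes and random values:

```python
# Sketch of claim 6: each time-frequency mask scales every spectrum unit
# independently; complementary masks decompose the mixture exactly.
import numpy as np

rng = np.random.default_rng(2)
spectrum = rng.random((4, 6))        # spectrum of the interaction voice (toy size)
mask1 = rng.random((4, 6))           # first time-frequency mask, values in [0, 1]
mask2 = 1.0 - mask1                  # second mask, complementary per unit

machine_spec = mask1 * spectrum      # second machine voice spectrum
human_spec = mask2 * spectrum        # second real human voice spectrum
```

Since `mask1 + mask2` equals one at every unit, `machine_spec + human_spec` recovers the original spectrum, which is why this decomposition loses no energy from the mixture.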
7. The method of claim 1, wherein the first error characterizes the degree of match between the first machine voice spectrum and the second machine voice spectrum, and the second error characterizes the degree of match between the first real human voice spectrum and the second real human voice spectrum.
8. A single-channel human-computer interaction voice separation device, characterized in that the device comprises:
a time-frequency mask training module, configured to train, through a time-series recurrent neural network, a first time-frequency mask for extracting a machine voice spectrum and a second time-frequency mask for extracting a real human voice spectrum;
a voice separation module, configured to perform voice separation processing on a target human-computer interaction voice to be separated according to the first time-frequency mask and the second time-frequency mask trained through the time-series recurrent neural network, to obtain the machine voice and the real human voice separated from the target human-computer interaction voice;
wherein, in each training pass through the time-series recurrent neural network, the time-frequency mask training module is specifically configured to:
acquire the human-computer interaction voice for this training pass;
perform a short-time Fourier transform on the acquired human-computer interaction voice to obtain a spectrum of the human-computer interaction voice;
perform cepstral mean and variance normalization and spectral clustering on the spectrum of the human-computer interaction voice to obtain a first machine voice spectrum and a first real human voice spectrum;
input the spectrum of the human-computer interaction voice into the time-series recurrent neural network to obtain the first time-frequency mask and the second time-frequency mask output by the network;
perform matrix decomposition of the first time-frequency mask and the second time-frequency mask output in this training pass with the spectrum of the human-computer interaction voice, respectively, to obtain a second machine voice spectrum and a second real human voice spectrum;
calculate a first error between the first machine voice spectrum and the second machine voice spectrum, and a second error between the first real human voice spectrum and the second real human voice spectrum;
determine, according to the first error and the second error, whether a training-stop condition is satisfied;
if not, update the network parameters of the time-series recurrent neural network according to the first error and the second error, and perform the next training pass with the updated network;
if so, stop training, and take the first time-frequency mask and the second time-frequency mask output in this training pass as the trained first time-frequency mask and second time-frequency mask.
9. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 7.
10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 7.
CN202210038787.5A 2022-01-13 2022-01-13 Single-channel human-computer interaction voice separation method and device, computer equipment and medium Pending CN114360575A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210038787.5A CN114360575A (en) 2022-01-13 2022-01-13 Single-channel human-computer interaction voice separation method and device, computer equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210038787.5A CN114360575A (en) 2022-01-13 2022-01-13 Single-channel human-computer interaction voice separation method and device, computer equipment and medium

Publications (1)

Publication Number Publication Date
CN114360575A true CN114360575A (en) 2022-04-15

Family

ID=81109215

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210038787.5A Pending CN114360575A (en) 2022-01-13 2022-01-13 Single-channel human-computer interaction voice separation method and device, computer equipment and medium

Country Status (1)

Country Link
CN (1) CN114360575A (en)

Similar Documents

Publication Publication Date Title
CN108597496B (en) Voice generation method and device based on generation type countermeasure network
US11948552B2 (en) Speech processing method, apparatus, electronic device, and computer-readable storage medium
CN112289323B (en) Voice data processing method and device, computer equipment and storage medium
CN111179911B (en) Target voice extraction method, device, equipment, medium and joint training method
Xiang et al. A nested u-net with self-attention and dense connectivity for monaural speech enhancement
Lin et al. Speech enhancement using multi-stage self-attentive temporal convolutional networks
Shah et al. Time-frequency mask-based speech enhancement using convolutional generative adversarial network
CN111145786A (en) Speech emotion recognition method and device, server and computer readable storage medium
TW201935464A (en) Method and device for voiceprint recognition based on memorability bottleneck features
WO2022107393A1 (en) A neural-network-based approach for speech denoising statement regarding federally sponsored research
CN111261145B (en) Voice processing device, equipment and training method thereof
CN111179910A (en) Speed of speech recognition method and apparatus, server, computer readable storage medium
CN113870892A (en) Conference recording method, device, equipment and storage medium based on voice recognition
US20220392485A1 (en) System and Method For Identifying Sentiment (Emotions) In A Speech Audio Input
Hou et al. Domain adversarial training for speech enhancement
CN112802498B (en) Voice detection method, device, computer equipment and storage medium
Saleem et al. Low rank sparse decomposition model based speech enhancement using gammatone filterbank and Kullback–Leibler divergence
Kamble et al. Emotion recognition for instantaneous Marathi spoken words
CN114360575A (en) Single-channel human-computer interaction voice separation method and device, computer equipment and medium
Xu et al. MFFCN: multi-layer feature fusion convolution network for audio-visual speech enhancement
CN114283832A (en) Processing method and device for multi-channel audio signal
WO2020250220A1 (en) Sound analysis for determination of sound sources and sound isolation
CN111785302A (en) Speaker separation method and device and electronic equipment
Rana et al. Effect of windowing on the calculation of MFCC statistical parameter for different gender in Hindi speech
CN112750469A (en) Method for detecting music in voice, voice communication optimization method and corresponding device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination