CN113345461A - Voice processing method and device for voice processing

Info

Publication number
CN113345461A
Authority
CN
China
Prior art keywords
voice
registered
user
feature
target
Prior art date
Legal status
Pending
Application number
CN202110454916.4A
Other languages
Chinese (zh)
Inventor
崔国辉
Current Assignee
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd
Priority to CN202110454916.4A
Publication of CN113345461A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0224 Processing in the time domain
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Abstract

Embodiments of the invention provide a voice processing method, a voice processing apparatus, and a device for voice processing, applied to terminal equipment. The method comprises the following steps: receiving a voice to be processed, wherein the voice to be processed comprises noise and the sound of one or more target users; acquiring the registered voice characteristics of the target user; and inputting the voice to be processed and the registered voice characteristics of the target user into a speaker extraction model, wherein the speaker extraction model extracts the target voice of the target user from the voice to be processed according to the registered voice characteristics of the target user and outputs the target voice. Embodiments of the invention can improve the quality of call voice and protect user privacy.

Description

Voice processing method and device for voice processing
Technical Field
The present invention relates to the field of speech processing technologies, and in particular, to a speech processing method and apparatus, and an apparatus for speech processing.
Background
With the development of communication technology, voice communication has become a primary mode of communication, but noise and interference from the surrounding environment during a voice call remain important factors affecting the user's communication experience.
For example, when a user makes a voice call on a communication device, noise and interference from the surrounding environment are transmitted to the device along with the user's voice. This may prevent the other party from hearing the user clearly, or allow the other party to hear sounds the user does not wish to share (such as the voices of nearby speakers), which not only degrades the voice call but may also expose the user's personal privacy.
Disclosure of Invention
The embodiment of the invention provides a voice processing method, a voice processing apparatus, and a device for voice processing, which can improve the quality of call voice and protect the privacy of users.
In order to solve the above problem, an embodiment of the present invention discloses a speech processing method, where the method includes:
receiving a voice to be processed, wherein the voice to be processed comprises noise and target user sound, and the number of the target users is greater than or equal to 1;
acquiring the registered voice characteristics of the target user;
and inputting the voice to be processed and the registered voice characteristics of the target user into a speaker extraction model, wherein the speaker extraction model extracts the target voice of the target user from the voice to be processed according to the registered voice characteristics of the target user and outputs the target voice.
Optionally, the method further comprises:
collecting a user voice sample of a registered user;
acquiring the registered voice characteristics and pure voice of the registered user;
inputting the user voice sample, the registered voice characteristics and the pure voice into an initial speaker extraction model, wherein the speaker extraction model extracts the target voice of the registered user in the user voice sample according to the registered voice characteristics of the registered user;
and iteratively optimizing the model parameters of the speaker extraction model according to the feature difference between the extracted target voice of the registered user and the pure voice of the registered user, and achieving a preset convergence condition to obtain the trained speaker extraction model.
Optionally, the speaker extraction model includes a first processing network and a second processing network, and the speaker extraction model extracts the target voice of the registered user in the user voice sample according to the registered voice feature of the registered user, including:
carrying out short-time Fourier transform on the user voice sample to obtain a sample voice magnitude spectrum;
carrying out short-time Fourier transform on the pure voice to obtain a pure voice magnitude spectrum;
extracting noisy speech features of the sample speech magnitude spectrum through the first processing network;
multiplying the registered voice feature of the registered user with the noisy voice feature element-wise to obtain a modulated voice feature;
performing feature extraction processing on the modulated voice features through the second processing network to obtain a magnitude spectrum mask;
multiplying the sample voice amplitude spectrum element-wise by the amplitude spectrum mask to obtain a noise reduction voice amplitude spectrum;
iteratively optimizing model parameters of the speaker extraction model according to feature differences between the extracted target speech of the registered user and the clean speech of the registered user, comprising:
calculating the characteristic difference between the noise reduction voice amplitude spectrum and the pure voice amplitude spectrum according to a preset loss function;
and iteratively optimizing the model parameters of the speaker extraction model according to the characteristic difference.
Optionally, the acquiring the registered voice feature of the target user includes:
acquiring the registered voice of the target user;
inputting the registered voice of the target user into a feature extraction model, and performing feature extraction on the registered voice of the target user to obtain the registered voice feature of the target user.
Optionally, the method further comprises:
collecting a registered voice sample of a registered user;
inputting the registered voice sample into an initial feature extraction model, and extracting to obtain a feature vector of the registered voice sample;
extracting the characteristics of the pure voice of the registered user to obtain the characteristic vector of the pure voice;
and iteratively optimizing the model parameters of the feature extraction model according to the feature difference between the feature vector of the registered voice sample and the feature vector of the pure voice to reach a preset convergence condition to obtain the trained feature extraction model.
Optionally, the inputting the registered voice sample into an initial feature extraction model, and extracting to obtain a feature vector of the registered voice sample includes:
carrying out voice activation detection on the registered voice sample, and filtering a non-voice section in the registered voice sample to obtain filtered voice;
segmenting the filtered voice according to a preset frame length to obtain a voice frame sequence corresponding to the filtered voice;
performing short-time Fourier transform on each voice frame in the voice frame sequence to obtain a voice frame magnitude spectrum corresponding to each voice frame;
inputting the voice frame magnitude spectrum into a feature extraction network of a feature extraction model, and outputting a feature vector of each voice frame magnitude spectrum;
and carrying out averaging calculation on the feature vectors of the voice frame magnitude spectrums corresponding to the voice frames in the voice frame sequence to obtain the feature vectors of the registered voice samples.
Optionally, the number of the registered users is greater than 1, the performing voice activation detection on the registered voice sample, and filtering the non-voice segment in the registered voice sample to obtain a filtered voice includes:
carrying out voice activation detection on the registered voice sample of each registered user to obtain the filtered voice of each registered voice sample;
the segmenting the filtered voice according to the preset frame length to obtain the voice frame sequence corresponding to the filtered voice comprises the following steps:
segmenting the filtered voice of each registered voice sample according to a preset frame length to obtain a voice frame sequence corresponding to each registered voice sample;
the short-time fourier transform is performed on each speech frame in the speech frame sequence to obtain a speech frame magnitude spectrum corresponding to each speech frame, and the method includes:
performing short-time Fourier transform on each voice frame in the voice frame sequence of each registered voice sample to obtain a voice frame magnitude spectrum corresponding to each voice frame in the voice frame sequence of each registered voice sample;
the inputting the voice frame magnitude spectrum into a feature extraction network of a feature extraction model and outputting a feature vector of each voice frame magnitude spectrum comprises the following steps:
inputting the voice frame magnitude spectrum corresponding to each voice frame in the voice frame sequence of each registered voice sample into a feature extraction network of a feature extraction model, and outputting a feature vector of each voice frame magnitude spectrum;
the averaging calculation of the feature vectors of the voice frame magnitude spectrums corresponding to the voice frames in the voice frame sequence to obtain the feature vectors of the registered voice samples includes:
and averaging the feature vectors of the voice frame magnitude spectrums corresponding to the voice frames in the voice frame sequence of each registered voice sample to obtain the feature vector of each registered voice sample.
Optionally, the method further comprises:
establishing a voice feature library of the registered user, wherein a mapping relation between the registered voice feature of the registered user and the user identifier of the registered user is stored in the voice feature library;
the acquiring of the registered voice feature of the target user includes:
and querying the voice feature library according to the user identification of the target user to obtain the registered voice feature of the target user.
Optionally, the number of the target users is greater than 1, and the inputting the to-be-processed speech and the registered speech feature of the target user into the speaker extraction model includes:
inputting the voice to be processed and the registered voice characteristics of each target user into a speaker extraction model;
the outputting the target voice includes:
and respectively outputting the target voice corresponding to each target user, or outputting a mixed voice containing the target voices of all the target users.
In another aspect, an embodiment of the present invention discloses a speech processing apparatus, including:
the voice acquisition module is used for receiving voice to be processed, wherein the voice to be processed comprises noise and target user voice, and the number of the target users is greater than or equal to 1;
the characteristic acquisition module is used for acquiring the registered voice characteristics of the target user;
and the voice processing module is used for inputting the voice to be processed and the registered voice characteristics of the target user into a speaker extraction model, wherein the speaker extraction model extracts the target voice of the target user from the voice to be processed according to the registered voice characteristics of the target user and outputs the target voice.
Optionally, the apparatus further comprises:
the first collection module is used for collecting a user voice sample of a registered user;
the first acquisition module is used for acquiring the registered voice characteristics and the pure voice of the registered user;
the characteristic extraction module is used for inputting the user voice sample, the registered voice characteristics and the pure voice into an initial speaker extraction model, and the speaker extraction model extracts the target voice of the registered user in the user voice sample according to the registered voice characteristics of the registered user;
and the first iteration module is used for iteratively optimizing the model parameters of the speaker extraction model according to the feature difference between the extracted target voice of the registered user and the extracted pure voice of the registered user, and achieving a preset convergence condition to obtain the trained speaker extraction model.
Optionally, the speaker extraction model includes a first processing network and a second processing network, and the feature extraction module includes:
the first Fourier transform submodule is used for carrying out short-time Fourier transform on the user voice sample to obtain a sample voice magnitude spectrum;
the second Fourier transform submodule is used for carrying out short-time Fourier transform on the pure voice to obtain a pure voice magnitude spectrum;
the first network processing submodule is used for extracting the noise-containing voice characteristics of the sample voice amplitude spectrum through the first processing network;
the characteristic modulation submodule is used for multiplying the registered voice characteristic of the registered user with the noisy voice characteristic element-wise to obtain a modulated voice characteristic;
the second network processing submodule is used for carrying out feature extraction processing on the modulated voice features through the second processing network to obtain an amplitude spectrum mask;
the multiplication submodule is used for carrying out matrix element multiplication on the sample voice amplitude spectrum and the amplitude spectrum mask to obtain a noise reduction voice amplitude spectrum;
the first iteration module includes:
the difference calculation submodule is used for calculating the characteristic difference between the noise reduction voice amplitude spectrum and the pure voice amplitude spectrum according to a preset loss function;
and the iterative optimization submodule is used for iteratively optimizing the model parameters of the speaker extraction model according to the characteristic difference.
Optionally, the feature obtaining module includes:
a registered voice obtaining sub-module, configured to obtain a registered voice of the target user;
and the model extraction submodule is used for inputting the registered voice of the target user into a feature extraction model, and performing feature extraction on the registered voice of the target user to obtain the registered voice feature of the target user.
Optionally, the apparatus further comprises:
the second collection module is used for collecting a registered voice sample of the registered user;
the first extraction module is used for inputting the registered voice sample into an initial feature extraction model and extracting to obtain a feature vector of the registered voice sample;
the second extraction module is used for extracting the characteristics of the pure voice of the registered user to obtain the characteristic vector of the pure voice;
and the second iteration module is used for iteratively optimizing the model parameters of the feature extraction model according to the feature difference between the feature vector of the registered voice sample and the feature vector of the pure voice, and obtaining the trained feature extraction model when a preset convergence condition is reached.
Optionally, the number of the target users is greater than 1, and the speech processing module is specifically configured to input the speech to be processed and the registered speech features of each target user into the speaker extraction model, and output the target speech corresponding to each target user respectively, or output a mixed speech including the target speech of all the target users.
In yet another aspect, an embodiment of the present invention discloses an apparatus for speech processing, the apparatus comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, and comprise instructions for performing the speech processing method according to any one of claims 1 to 9.
In yet another aspect, embodiments of the invention disclose a machine-readable medium having instructions stored thereon, which when executed by one or more processors, cause an apparatus to perform a speech processing method as described in one or more of the preceding.
The embodiment of the invention has the following advantages:
the embodiment of the invention can perform denoising processing on the voice to be processed and extract the target voice of the target user in the voice to be processed. When the target user is in a noisy environment, the embodiment of the invention can acquire the registered voice characteristics of the target user, input the to-be-processed voice of the target user and the registered voice characteristics of the target user into the speaker extraction model, and filter out the voice except the voice of the target user as noise through the speaker extraction model. The number of target users may be greater than or equal to 1. Thus, when the number of target users is 1, all sounds (including sounds of other users) except the sound of the target user can be filtered out as noise, and only the target voice of the target user is retained. When the number of target users is greater than 1, the target voices of a plurality of target users can be reserved. By the embodiment of the invention, the voice to be processed can be denoised, only the target voice specified by the user is reserved, the quality of the call voice can be improved, and the privacy of the user can be protected.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.
FIG. 1 is a flow chart of the steps of one embodiment of a speech processing method of the present invention;
FIG. 2 is a schematic process flow diagram of a feature extraction model of the present invention;
FIG. 3 is a schematic flow chart of feature extraction for registered voice samples of 3 registered users according to the present invention;
FIG. 4 is a diagram illustrating an inter-user cosine similarity matrix according to the present invention;
FIG. 5 is a schematic flow chart of a method for training a speaker extraction model according to the present invention;
FIG. 6 is a schematic flow chart of the present invention for extracting a target voice of a target user online using a speaker extraction model;
FIG. 7 is a block diagram of a speech processing apparatus according to an embodiment of the present invention;
FIG. 8 is a block diagram of an apparatus 800 for speech processing of the present invention;
fig. 9 is a schematic diagram of a server in some embodiments of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Method embodiment
Referring to fig. 1, a flowchart illustrating steps of an embodiment of a speech processing method according to the present invention is shown, where the method specifically includes the following steps:
step 101, receiving a voice to be processed, wherein the voice to be processed comprises noise and target user sound, and the number of the target users is greater than or equal to 1;
step 102, acquiring the registered voice characteristics of the target user;
step 103, inputting the voice to be processed and the registered voice characteristics of the target user into a speaker extraction model, wherein the speaker extraction model extracts the target voice of the target user from the voice to be processed according to the registered voice characteristics of the target user and outputs the target voice.
The voice processing method provided by the embodiment of the invention can be applied to terminal equipment, including but not limited to: earphones, recording pens, household intelligent terminals (such as air conditioners, refrigerators, electric cookers, and water heaters), business intelligent terminals (such as video telephones and conference desktop intelligent terminals), wearable devices (such as smart watches and smart glasses), financial intelligent terminals, smart phones, tablet computers, personal digital assistants (PDAs), vehicle-mounted devices, computers, and the like.
The embodiment of the invention can be used for denoising the voice to be processed so as to filter the noise in the voice to be processed and extract the target voice of the target user in the voice to be processed. The filtered noise includes, but is not limited to, background noise, interfering tones, and other speaker (non-target user) sounds.
The voice to be processed can be the voice sent by the target user through the instant messaging terminal or the received voice. The voice to be processed can also be a voice instruction sent to the intelligent terminal device by the target user. The voice to be processed can also be a voice segment recorded by the terminal equipment or any voice segment downloaded through the network. It is understood that the embodiment of the present invention does not limit the source of the to-be-processed speech.
In the embodiment of the invention, the target user is a user who has registered a voice. The registered voice is a voice recorded by the user through a recording device (any device with a recording function, such as a mobile phone); this recording serves as the user's registered voice and may contain noise. The present invention does not limit the specific content of the registered voice, which may be any voice content recorded for the registered user. The embodiment of the invention refers to a user who has registered a voice as a registered user.
For the registered voice of the registered user, the embodiment of the invention can perform feature extraction to obtain the registered voice feature of the registered user, and can extract the voice of the target user from the voice to be processed to obtain the target voice based on the registered voice feature. Specifically, the embodiment of the present invention inputs the voice to be processed and the registered voice feature of the target user into the speaker extraction model, and the speaker extraction model extracts the target voice of the target user from the voice to be processed according to the registered voice feature of the target user and outputs the target voice.
In one example, the target user uses an instant messaging terminal for voice communication in a noisy environment, with background noise such as car noise and wind noise as well as the voices of other speakers. The embodiment of the invention can take the call voice of the target user as the voice to be processed, input it into the speaker extraction model together with the registered voice characteristics of the target user, filter out all sound other than the target user's voice as noise through the speaker extraction model, and keep only the call voice of the target user. Thus, the quality of the call voice can be improved.
The voice processing method of the embodiment of the invention can be applied to an instant communication scene, carries out noise reduction processing on the call voice, only keeps the voice of a caller and improves the call quality. Further, for an instant messaging scene, the to-be-processed voice may be a voice in a voice call or a voice extracted from a video call. In addition, the embodiment of the invention can also be used in a voice recognition scene, carries out noise reduction processing on the voice command, only keeps the sound of a command sender, and improves the accuracy rate of voice command recognition.
In the embodiment of the invention, before the target voice of the target user is extracted, the registered voice feature of the target user needs to be acquired. Based on the registered voice characteristics of the target user, the voice except the voice of the target user can be used as noise for filtering, and the target voice only retaining the voice of the target user is obtained.
In an optional embodiment of the present invention, the number of the target users may be greater than 1, and the inputting the to-be-processed speech and the registered speech feature of the target user into the speaker extraction model specifically may include: inputting the voice to be processed and the registered voice characteristics of each target user into a speaker extraction model;
the outputting the target voice may specifically include: and respectively outputting the target voice corresponding to each target user, or outputting a mixed voice containing the target voices of all the target users.
Specifically, the speech to be processed and the registered speech features of each target user in n (n is larger than 1) target users are input into the speaker extraction model. In one example, the target users include user 1, user 2, and user 3, and according to the setting of the users, the target voice corresponding to each target user may be output, such as outputting the target voice of user 1, the target voice of user 2, and the target voice of user 3, respectively. Alternatively, a mixed voice including the target voices of all the target users may be output, such as outputting a mixed voice including the target voice of user 1, the target voice of user 2, and the target voice of user 3.
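For illustration only, the following Python sketch shows these two output options for the example of three target users; the waveform lengths, sampling rate, and variable names are assumptions rather than part of this disclosure, and the separated waveforms are assumed to have already been produced by the speaker extraction model.
```python
import numpy as np

# Hypothetical separated waveforms for user 1, user 2 and user 3 (1 s at 16 kHz each).
target_voices = {f"user_{k}": np.random.randn(16000).astype(np.float32) for k in (1, 2, 3)}

# Option 1: output the target voice corresponding to each target user separately.
for user_id, voice in target_voices.items():
    print(user_id, voice.shape)

# Option 2: output a single mixed voice containing the target voices of all target users.
mixed_voice = np.sum(list(target_voices.values()), axis=0)
```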
The embodiment of the invention does not limit the specific mode for acquiring the registered voice characteristics of the target user.
Further, the embodiment of the invention can pre-train the feature extraction model, and extract the registered voice feature of the target user through the feature extraction model. In addition, the embodiment of the invention also trains the speaker extraction model in advance, and extracts the target voice of the target user in the voice to be processed through the speaker extraction model.
In an optional embodiment of the present invention, the acquiring the registered voice feature of the target user specifically may include:
step S11, acquiring the registered voice of the target user;
and step S12, inputting the registered voice of the target user into a feature extraction model, and performing feature extraction on the registered voice of the target user to obtain the registered voice feature of the target user.
The embodiment of the invention trains the feature extraction model in advance, and extracts the registered voice feature of the target user through the feature extraction model. The feature extraction model may be used off-line or on-line. When the method is used off line, the registered voice of each registered user can be pre-recorded, and the trained feature extraction model is used for carrying out feature extraction on the registered voice of each registered user to obtain and store the registered voice features of each registered user. And querying the stored registered voice characteristics of the registered user to obtain the registered voice characteristics of the target user. When the voice denoising method is used online, the registered voice of the target user can be recorded in real time, the registered voice feature of the target user is extracted online, and then denoising processing is carried out on the voice to be processed according to the registered voice feature of the target user extracted in real time.
In an optional embodiment of the invention, the method may further comprise: establishing a voice feature library of the registered user, wherein a mapping relation between the registered voice feature of the registered user and the user identifier of the registered user is stored in the voice feature library;
the obtaining of the registered voice feature of the target user may specifically include: and querying the voice feature library according to the user identification of the target user to obtain the registered voice feature of the target user.
After the training of the feature extraction model is completed, feature extraction can be performed on the registered voice of each registered user to obtain the registered voice feature of each registered user, and then the mapping relationship between the registered voice feature of each registered user and the user identifier of the registered user can be stored to establish a voice feature library.
Therefore, when the speaker extraction model is used for extracting the target voice of the target user from the voice to be processed, the user identification of the target user can be used for directly inquiring in the voice feature library to obtain the registered voice feature of the target user, and the efficiency of extracting the target voice of the target user can be improved.
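A minimal sketch of such a voice feature library is given below. The dictionary structure, the 128-dimensional feature size, and the function names are illustrative assumptions, not limitations of this disclosure.
```python
import numpy as np

# Voice feature library: mapping from user identifier to registered voice feature.
voice_feature_library = {}

def register_user(user_id, registered_voice_feature):
    """Store the mapping between a registered user's identifier and voice feature."""
    voice_feature_library[user_id] = registered_voice_feature

def query_registered_feature(user_id):
    """Query the library by the target user's identifier before speaker extraction."""
    return voice_feature_library[user_id]

register_user("user_1", np.random.randn(128).astype(np.float32))
target_feature = query_registered_feature("user_1")
```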
In an optional embodiment of the invention, the method may further comprise:
step S21, collecting a registered voice sample of the registered user;
step S22, inputting the registered voice sample into an initial feature extraction model, and extracting to obtain a feature vector of the registered voice sample;
step S23, extracting the characteristics of the pure voice of the registered user to obtain the characteristic vector of the pure voice;
and step S24, iteratively optimizing the model parameters of the feature extraction model according to the feature difference between the feature vector of the registered voice sample and the feature vector of the pure voice, and obtaining the trained feature extraction model when a preset convergence condition is reached.
The feature extraction model can be obtained by performing supervised training on an existing neural network according to a large number of training samples and a machine learning method. It should be noted that the embodiment of the present disclosure does not limit the model structure or the training method of the feature extraction model. The feature extraction model may fuse multiple neural networks. The neural network includes, but is not limited to, at least one of the following, or a combination, superposition, or nesting of at least two of the following: CNN (Convolutional Neural Network), LSTM (Long Short-Term Memory) network, RNN (Recurrent Neural Network), attention neural network, and the like.
The registered voice samples are collected registered voices input by a large number of registered users, the registered voice samples are input into an initial feature extraction model, and feature vectors of the registered voice samples are extracted and obtained, and if the feature vectors are recorded as x.
In addition, for each registered voice sample, the embodiment of the present invention obtains the clean voice recorded by the registered user corresponding to the registered voice sample, that is, each registered voice sample has a corresponding clean voice, and the clean voice refers to the voice recorded by the registered user in an environment without noise. And extracting the characteristics of the pure voice to obtain the characteristic vector of the pure voice, wherein the characteristic vector is marked as y.
It should be noted that, the embodiment of the present invention does not limit the content of the registered voice sample and the content of the clean voice.
And calculating the characteristic difference between the characteristic vector x of the registered voice sample and the characteristic vector y of the pure voice by using a preset loss function, and then iteratively optimizing the model parameters of the characteristic extraction model according to the characteristic difference until a preset convergence condition is reached, so that the trained characteristic extraction model can be obtained.
In an optional embodiment of the present invention, the inputting the registered voice sample into an initial feature extraction model, and extracting to obtain the feature vector of the registered voice sample may specifically include:
step S31, carrying out voice activation detection on the registered voice sample, and filtering the non-voice section in the registered voice sample to obtain filtered voice;
step S32, segmenting the filtered voice according to a preset frame length to obtain a voice frame sequence corresponding to the filtered voice;
step S33, performing short-time Fourier transform on each voice frame in the voice frame sequence to obtain a voice frame magnitude spectrum corresponding to each voice frame;
step S34, inputting the voice frame magnitude spectrum into a feature extraction network of a feature extraction model, and outputting a feature vector of each voice frame magnitude spectrum;
step S35, performing averaging calculation on the feature vectors of the speech frame magnitude spectrum corresponding to each speech frame in the speech frame sequence to obtain the feature vectors of the registered speech sample.
Voice Activity Detection (VAD) aims to detect whether a Voice signal is contained in a current Voice signal. In order to enhance the robustness of the feature extraction model to environmental changes, the embodiment of the invention firstly performs voice activation detection on the input registered voice sample, and filters out non-voice sections therein, so as to reduce the interference of noise to the feature extraction model.
After voice activation detection is carried out on the registered voice sample to obtain filtered voice, the filtered voice is segmented according to a preset frame length to obtain a voice frame sequence corresponding to the filtered voice. The specific length of the preset frame length is not limited in the embodiment of the present invention, for example, the preset frame length may be 20ms to 30 ms. Through the frame level processing, the streaming processing of the voice can be realized, so that the feature extraction model can be suitable for a scene of online real-time feature extraction, the online processing efficiency can be improved, and the processing delay is reduced.
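For illustration only, a simple non-overlapping framing sketch is given below; the 16 kHz sampling rate and the 25 ms frame length (chosen within the 20 ms to 30 ms range mentioned above) are assumptions, and a practical implementation may use overlapping frames.
```python
import numpy as np

def split_into_frames(filtered_voice, sample_rate=16000, frame_ms=25):
    # Segment the VAD-filtered voice into frames of a preset length.
    frame_len = int(sample_rate * frame_ms / 1000)      # 400 samples at 16 kHz / 25 ms
    n_frames = len(filtered_voice) // frame_len
    return filtered_voice[:n_frames * frame_len].reshape(n_frames, frame_len)

speech_frames = split_into_frames(np.random.randn(16000).astype(np.float32))
print(speech_frames.shape)   # (40, 400): the speech frame sequence
```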
After the frame-level segmentation is performed on the filtered speech to obtain a speech frame sequence corresponding to the filtered speech, Short-Time Fourier Transform (STFT) is performed on each speech frame in the speech frame sequence to obtain a speech frame magnitude spectrum corresponding to each speech frame.
And inputting the voice frame magnitude spectrum into a feature extraction network of a feature extraction model, and outputting a feature vector of each voice frame magnitude spectrum. In order to enable the feature extraction model to adapt to changes in the user environment, the feature extraction network of the embodiment of the invention adopts an LSTM-based network structure. In specific implementation, the number of LSTM layers may be set flexibly according to actual needs, for example, 1 to 3 layers. In one example, the feature extraction network may include 3 LSTM layers and 1 Linear layer. The Linear layer converts input features into output features using matrix multiplication.
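A non-limiting PyTorch sketch of such a feature extraction network is given below. The 257 frequency bins, the hidden size of 256, and the 128-dimensional output are assumptions chosen for illustration; the disclosure itself only describes 1 to 3 LSTM layers followed by a Linear layer.
```python
import torch
import torch.nn as nn

class FeatureExtractionNet(nn.Module):
    """Sketch of the LSTM-based feature extraction network (3 LSTM layers + 1 Linear layer)."""
    def __init__(self, n_freq_bins=257, hidden_size=256, embed_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(n_freq_bins, hidden_size, num_layers=3, batch_first=True)
        self.linear = nn.Linear(hidden_size, embed_dim)

    def forward(self, frame_magnitudes):
        # frame_magnitudes: (batch, num_frames, n_freq_bins) speech frame magnitude spectra
        out, _ = self.lstm(frame_magnitudes)
        return self.linear(out)        # (batch, num_frames, embed_dim) per-frame feature vectors

net = FeatureExtractionNet()
frame_vectors = net(torch.randn(1, 5, 257))   # one registered voice sample of 5 frames
```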
After the voice frame magnitude spectrum passes through the LSTM layer and the Linear layer, a feature vector of each voice frame magnitude spectrum is obtained, and the feature vectors of the voice frame magnitude spectrums corresponding to the voice frames in the voice frame sequence are subjected to averaging calculation to obtain the feature vector of the registered voice sample. Specifically, an arithmetic mean is calculated for the feature vector of the magnitude spectrum of each speech frame in the dimension of the frame to obtain the feature vector of the registered speech sample. By averaging the feature vectors of multiple speech frames, the effect of noise in the registered speech samples can be reduced.
In an optional embodiment of the present invention, after obtaining the feature vector of each registered speech sample, the method may further include:
and carrying out normalization calculation on the feature vector of the registered voice sample to obtain the final feature vector of the registered voice sample.
The volume at which users speak in a real environment may differ, and in order to further reduce the influence of the registered user's volume on the performance of the feature extraction model, the embodiment of the present invention performs a normalization calculation on the obtained feature vector x of the registered voice sample. Specifically, the 2-norm of the feature vector x may be calculated and denoted x_norm, and the normalized vector is eigen_vector = x / x_norm. The eigen_vector is the final feature vector of the registered voice sample of a single registered user.
The dimension of the feature vector of the final output can be generally selected according to actual needs, for example, the dimension of the output can be selected to be greater than or equal to 64.
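The averaging and normalization steps described above can be sketched as follows; the 5 frames and the 128-dimensional vectors are illustrative assumptions.
```python
import numpy as np

# Hypothetical per-frame feature vectors of one registered voice sample: 5 frames x 128 dims.
frame_vectors = np.random.randn(5, 128).astype(np.float32)

# Arithmetic mean over the frame dimension gives the sample-level feature vector x.
x = frame_vectors.mean(axis=0)

# 2-norm normalization reduces the influence of the registered user's speaking volume.
x_norm = np.linalg.norm(x, ord=2)
eigen_vector = x / x_norm      # final feature vector of the registered voice sample
```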
Referring to fig. 2, a schematic processing flow diagram of a feature extraction model of the present invention is shown. In the training stage of the feature extraction model, the input of the model is a registered voice sample, and the output is a feature vector of the registered voice sample. In the online use stage of the feature extraction model, the input of the model is the registered voice of the target user, and the output is the registered voice feature of the target user.
In an optional embodiment of the present invention, the number of the registered users may be greater than 1, and the performing voice activation detection on the registered voice sample, and filtering the non-voice segments in the registered voice sample to obtain filtered voice includes:
carrying out voice activation detection on the registered voice sample of each registered user to obtain the filtered voice of each registered voice sample;
the segmenting the filtered voice according to the preset frame length to obtain the voice frame sequence corresponding to the filtered voice comprises the following steps:
segmenting the filtered voice of each registered voice sample according to a preset frame length to obtain a voice frame sequence corresponding to each registered voice sample;
the short-time fourier transform is performed on each speech frame in the speech frame sequence to obtain a speech frame magnitude spectrum corresponding to each speech frame, and the method includes:
performing short-time Fourier transform on each voice frame in the voice frame sequence of each registered voice sample to obtain a voice frame magnitude spectrum corresponding to each voice frame in the voice frame sequence of each registered voice sample;
the inputting the voice frame magnitude spectrum into a feature extraction network of a feature extraction model and outputting a feature vector of each voice frame magnitude spectrum comprises the following steps:
inputting the voice frame magnitude spectrum corresponding to each voice frame in the voice frame sequence of each registered voice sample into a feature extraction network of a feature extraction model, and outputting a feature vector of each voice frame magnitude spectrum;
the averaging calculation of the feature vectors of the voice frame magnitude spectrums corresponding to the voice frames in the voice frame sequence to obtain the feature vectors of the registered voice samples includes:
and averaging the feature vectors of the voice frame magnitude spectrums corresponding to the voice frames in the voice frame sequence of each registered voice sample to obtain the feature vector of each registered voice sample.
In the embodiment of the invention, the registered voice samples of n (n is greater than 1) registered users can be trained simultaneously. In the following example, n is 3. Referring to fig. 3, a schematic flow chart of feature extraction performed on the registered voice samples of 3 registered users in the embodiment of the present invention is shown.
As shown in fig. 3, assume that there are 3 registered users, user 1, user 2, and user 3. And carrying out voice activation detection on the registered voice sample of each registered user to obtain the filtered voice of each registered voice sample. And segmenting the filtered voice of each registered voice sample according to the length of a preset frame to obtain a voice frame sequence corresponding to each registered voice sample. For example, the registered voice sample of each registered user is divided into 5 voice frames, and each voice frame has a length of 2 seconds to 3 seconds. That is, the registered speech sample of user 1 is segmented into a sequence of speech frames comprising 5 speech frames. The registered speech sample of user 2 is segmented into a sequence of speech frames comprising 5 speech frames. The registered speech samples of user 3 are segmented into a sequence of speech frames comprising 5 speech frames. The three registered speech samples contain a total of 15 speech frames. And carrying out short-time Fourier transform on each voice frame in the voice frame sequence of each registered voice sample to obtain a voice frame magnitude spectrum corresponding to each voice frame in the voice frame sequence of each registered voice sample. Inputting the voice frame magnitude spectrum corresponding to each voice frame in the voice frame sequence of each registered voice sample into a feature extraction network of the feature extraction model, and outputting a feature vector of each voice frame magnitude spectrum. That is, 15 feature vectors are extracted for the 15 speech frames, respectively. Finally, the feature vector of the voice frame magnitude spectrum corresponding to each voice frame in the voice frame sequence of each registered voice sample (hereinafter referred to as the feature vector of the voice frame) is subjected to averaging calculation to obtain the feature vector of each registered voice sample. For example, the feature vector of the registered voice sample of the user 1 can be obtained by averaging the feature vectors of the voice frame magnitude spectra corresponding to 5 voice frames in the voice frame sequence of the user 1. Similarly, the feature vector of the registered voice sample of the user 2 and the feature vector of the registered voice sample of the user 3 can be calculated.
The embodiment of the invention calculates the arithmetic mean of the feature vectors of 5 voice frames of each registered voice sample, and can obtain 3 final feature vectors for the 3 registered users, such as the three feature vectors at the rightmost side in fig. 3.
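For the example of fig. 3, the per-user averaging can be written as the following sketch; the shapes follow the example above and the 128-dimensional vectors are an assumption.
```python
import numpy as np

# 15 hypothetical frame-level feature vectors: 3 registered users x 5 speech frames each.
frame_vectors = np.random.randn(15, 128).astype(np.float32)

# Group the frames by user and average over the 5 frames of each user.
per_user = frame_vectors.reshape(3, 5, 128)
centroid_vectors = per_user.mean(axis=1)   # shape (3, 128): one final vector per registered user
```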
In one example, the loss function of the feature extraction model may be calculated by using cosine similarity, specifically, the cosine similarity between the feature vector x and the feature vector y may be calculated, where the cosine similarity calculation formula is as follows:
s((u,i),j) = (x(u,i) · y(j)) / (||x(u,i)|| ||y(j)||) (1)
wherein x(u,i) is the feature vector of the i-th voice frame in the registered voice sample of the u-th registered user, and y(j) is the centroid feature vector of the j-th registered user; the value range of u and j is 1 to n, and the value range of i is determined by the number of segmented voice frames. s((u,i),j) expresses the cosine similarity between the feature vector of the i-th voice frame of the u-th registered user and the centroid feature vector of the j-th user.
When training the registered voice samples of n (n is greater than 1) registered users at the same time, a Loss function Loss1 of the feature extraction model can be defined as follows:
Loss1=-1*(sum1-sum2) (2)
the sum1 is used for representing the sum of cosine similarities of the same registered user, and the sum2 is used for representing the sum of cosine similarities of different registered users.
Taking the three registered users, i.e. user 1, user 2, and user 3, as an example, assuming that the registered voice sample of each registered user is segmented into 5 voice frames, then:
sum1 = Σ(u=1..3) Σ(i=1..5) s((u,i),u)
sum2 = Σ(u=1..3) Σ(i=1..5) Σ(j=1..3, j≠u) s((u,i),j)
In this example, i takes the values 1, 2, 3, 4, 5.
Referring to fig. 4, a cosine similarity matrix between user 1, user 2, and user 3, calculated using formula (1) in this example, is shown. The cosine similarity matrix is a 15 × 3 matrix. Based on the cosine similarity matrix, the value of the Loss function Loss1 of the feature extraction model can be calculated using formula (2). As shown in fig. 4, with iterative training of the feature extraction model, the Loss function Loss1 drives the similarity between frames of the same user (sum1) to keep increasing and the similarity between different users (sum2) to become smaller and smaller. After the feature extraction model converges, the trained feature extraction model is obtained.
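A non-limiting sketch of formulas (1) and (2) for this example is given below; random vectors stand in for real frame-level feature vectors.
```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

n_users, n_frames, dim = 3, 5, 128                       # sizes from the example above
frame_vecs = np.random.randn(n_users, n_frames, dim)     # x(u,i)
centroids = frame_vecs.mean(axis=1)                      # y(j), centroid per registered user

# 15 x 3 cosine similarity matrix s((u,i),j) as in formula (1).
S = np.array([[cosine_similarity(frame_vecs[u, i], centroids[j]) for j in range(n_users)]
              for u in range(n_users) for i in range(n_frames)])

# Loss1 = -1 * (sum1 - sum2), formula (2).
same_user_mask = np.repeat(np.eye(n_users, dtype=bool), n_frames, axis=0)   # 15 x 3
sum1 = S[same_user_mask].sum()     # similarities of frames to their own user's centroid
sum2 = S[~same_user_mask].sum()    # similarities of frames to other users' centroids
loss1 = -1.0 * (sum1 - sum2)
```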
After obtaining the registered voice feature of the target user, the to-be-processed voice and the registered voice feature of the target user can be input into a speaker extraction model, and the speaker extraction model extracts the target voice of the target user in the to-be-processed voice according to the registered voice feature of the target user and outputs the target voice. The embodiment of the invention trains a speaker extraction model used for extracting the voice of a target user in advance.
In an optional embodiment of the invention, the method may further comprise:
step S41, collecting a user voice sample of a registered user;
step S42, obtaining the registered voice feature and the pure voice of the registered user;
step S43, inputting the user voice sample, the registered voice characteristics and the pure voice into an initial speaker extraction model, wherein the speaker extraction model extracts the target voice of the registered user in the user voice sample according to the registered voice characteristics of the registered user;
and step S44, iteratively optimizing the model parameters of the speaker extraction model according to the feature difference between the extracted target voice of the registered user and the pure voice of the registered user, and obtaining the trained speaker extraction model when reaching the preset convergence condition.
The speaker extraction model can be obtained by performing supervised training on an existing neural network according to a large number of training samples and a machine learning method. It should be noted that the embodiment of the present disclosure does not limit the model structure or the training method of the speaker extraction model. The speaker extraction model may fuse multiple neural networks. The neural network includes, but is not limited to, at least one of the following, or a combination, superposition, or nesting of at least two of the following: CNN, LSTM, RNN, attention neural network, and the like.
The user voice samples are a large amount of collected registered user voices and can be historical voices, such as historical call voices, historical voice instructions and the like. It can be understood that the embodiment of the present invention does not limit the source of the user voice sample. The clean voice refers to a voice which is recorded by a registered user in an environment without noise. It should be noted that, the embodiment of the present invention does not limit the contents of the registered user voice and the pure voice.
For the speaker extraction model, a mode of combined training with the feature extraction model can be adopted, and a mode of independently training the speaker extraction model can also be adopted.
In an optional embodiment of the present invention, the speaker extraction model includes a first processing network and a second processing network, and the speaker extraction model extracts a target voice of a registered user in the user voice sample according to a registered voice feature of the registered user, which may specifically include:
step S51, carrying out short-time Fourier transform on the user voice sample to obtain a sample voice magnitude spectrum;
step S52, carrying out short-time Fourier transform on the pure voice to obtain a pure voice magnitude spectrum;
step S53, extracting noisy speech characteristics of the sample speech amplitude spectrum through the first processing network;
step S54, multiplying the registered voice feature of the registered user with the noisy voice feature element-wise to obtain a modulated voice feature;
step S55, performing feature extraction processing on the modulated voice features through the second processing network to obtain a magnitude spectrum mask;
step S56, multiplying the sample voice amplitude spectrum element-wise by the amplitude spectrum mask to obtain a noise reduction voice amplitude spectrum;
the iteratively optimizing the model parameters of the speaker extraction model according to the feature difference between the extracted target speech of the registered user and the extracted clean speech of the registered user may specifically include:
step S61, calculating the characteristic difference between the noise reduction voice amplitude spectrum and the pure voice amplitude spectrum according to a preset loss function;
and step S62, iteratively optimizing the model parameters of the speaker extraction model according to the feature difference.
Referring to FIG. 5, a schematic flow chart of training a speaker extraction model according to the present invention is shown. As shown in fig. 5, first, a short-time Fourier transform is performed on the user voice sample to obtain a sample voice magnitude spectrum, and a short-time Fourier transform is performed on the pure voice to obtain a pure voice magnitude spectrum, which is denoted as y. The user voice samples are noisy speech (containing noise, interference, or other speakers' voices) and are transformed to the time-frequency domain by the short-time Fourier transform (STFT); the same is done with the pure voice.
The speaker extraction model shown in FIG. 5 includes a first processing network and a second processing network. The first processing network includes: a convolutional neural network layer, a unidirectional LSTM layer and a Linear layer. In order to ensure the real-time performance of the speaker extraction model, the embodiment of the invention adopts a streaming processing mode. The Convolutional Neural Network (CNN) performs one-dimensional feature extraction on input time-frequency domain data (sample speech magnitude spectrum), that is, performs feature extraction on input data in a frequency domain. The input of the convolutional neural network layer is a sample voice amplitude spectrum, and the output is the characteristic of the sample voice amplitude spectrum. The output of the convolutional neural network layer is further processed by a unidirectional LSTM layer and a Linear layer in sequence, and the characteristics of the noisy speech are output.
It should be noted that the number of layers of the LSTM is not limited in the embodiment of the present invention. In the speaker extraction model, the present embodiment preferably employs two single-layer unidirectional LSTM networks, taking into account the real-time causal characteristics of the networks.
After the sample voice amplitude spectrum is sequentially processed by a convolutional neural network layer, a one-way LSTM layer and a Linear layer, matrix elements of the output noisy voice features and the registered voice features are multiplied to obtain modulated voice features. Specifically, element-wise multiplication is performed on each element included in the vector of the noisy speech feature and the vector of the registered speech feature. The step is used for carrying out amplitude modulation on the noisy speech features by utilizing the vectors of the registered speech features to obtain modulated speech features. However, the result after modulation may not be ideal, and therefore, in order to improve the denoising effect, the embodiment of the present invention inputs the modulated speech features into the second processing network for further feature extraction processing.
The second processing network comprises a unidirectional LSTM layer and a Linear layer. The modulated speech features are processed by this second unidirectional LSTM layer and Linear layer to output the final mask (magnitude spectrum mask).
Next, the sample speech magnitude spectrum is multiplied element-wise by the magnitude spectrum mask to obtain the final noise-reduced speech magnitude spectrum, denoted as y'.
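To make the data flow of FIG. 5 concrete, the following is a minimal PyTorch sketch of the two processing networks, the modulation step, and the final mask application. The layer sizes, the 1x1 convolution kernel, and the sigmoid activation on the mask output are illustrative assumptions that the patent does not specify; only the overall structure (CNN, unidirectional LSTM and Linear layers, element-wise modulation by the enrollment feature, and element-wise masking) follows the description above.

import torch
import torch.nn as nn

class SpeakerExtractionModel(nn.Module):
    def __init__(self, n_freq=257, feat_dim=256, embed_dim=256):
        super().__init__()
        # First processing network: CNN -> unidirectional LSTM -> Linear.
        self.conv = nn.Conv1d(n_freq, feat_dim, kernel_size=1)
        self.lstm1 = nn.LSTM(feat_dim, feat_dim, batch_first=True)  # single-layer, unidirectional
        self.linear1 = nn.Linear(feat_dim, embed_dim)
        # Second processing network: unidirectional LSTM -> Linear -> magnitude spectrum mask.
        self.lstm2 = nn.LSTM(embed_dim, feat_dim, batch_first=True)
        self.linear2 = nn.Linear(feat_dim, n_freq)

    def forward(self, noisy_mag, enroll_embed):
        # noisy_mag: (batch, frames, n_freq) sample speech magnitude spectrum
        # enroll_embed: (batch, embed_dim) registered speech feature of the target speaker
        x = self.conv(noisy_mag.transpose(1, 2)).transpose(1, 2)  # per-frame feature extraction
        x, _ = self.lstm1(x)
        noisy_feat = self.linear1(x)                              # noisy speech features
        # Modulation: element-wise multiply each frame's features by the enrollment vector.
        modulated = noisy_feat * enroll_embed.unsqueeze(1)
        y, _ = self.lstm2(modulated)
        mask = torch.sigmoid(self.linear2(y))                     # magnitude spectrum mask
        # Apply the mask element-wise to the input magnitude spectrum.
        return noisy_mag * mask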
The feature difference between the noise-reduced speech magnitude spectrum and the clean speech magnitude spectrum is then calculated according to a preset loss function, and the model parameters of the speaker extraction model are iteratively optimized according to the feature difference until a preset convergence condition is reached, yielding the trained speaker extraction model. It should be noted that the embodiment of the present invention does not limit the specific form of the preset loss function of the speaker extraction model. In one example, the loss function Loss2 of the speaker extraction model is defined as follows:
Loss2=f(y-y') (3)
in practical applications, excessive suppression of noise or interference may occur due to the speaker extraction model. For example, when the input speech is pure speech, the speaker extraction model may damage the input speech, and reduce the speech recognition performance. To avoid this problem, the embodiment of the present invention introduces a protection measure in the stage of training the speaker extraction model. Specifically, when y-y' is 0, it represents that the speaker extraction model is just right to suppress noise; when y-y' <0, it represents that the speaker extraction model does not sufficiently suppress noise; when y-y '> 0, it represents the situation that the speaker extraction model has over-suppressed noise, and at this time, the penalty for the speaker extraction model can be increased, for example, the (y-y') is multiplied by a factor beta, which is usually selected as: 1< beta < 20. And training the speaker extraction model by using the calculated Loss2 value until the model converges to obtain the trained speaker extraction model.
After the training of the speaker extraction model is completed, the speaker extraction model can be used online to extract the target voice of the target user from the voice to be processed. Referring to FIG. 6, a flow chart of extracting the target voice of a target user online using the speaker extraction model according to the present invention is shown.
As shown in FIG. 6, the speech to be processed is transformed by a short-time Fourier transform and then processed layer by layer through the convolutional neural network layer, the unidirectional LSTM layer, and the Linear layer; the output is multiplied element-wise with the registered speech feature of the target user, that is, the elements of the two vectors are multiplied one by one. The multiplied result is then passed through the second unidirectional LSTM layer and Linear layer for feature extraction to obtain a magnitude spectrum mask. The magnitude spectrum mask is multiplied element-wise with the short-time Fourier transform result of the speech to be processed, and finally an inverse short-time Fourier transform is applied to output the target voice of the target user.
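The following sketch strings these inference steps together, reusing the hypothetical SpeakerExtractionModel and STFT parameters from the earlier sketches. Reconstructing the waveform with the noisy phase is an assumption; the patent does not state how the phase is handled.

import torch

def extract_target_speech(model, noisy_wav, enroll_embed, n_fft=512, hop=128):
    window = torch.hann_window(n_fft)
    spec = torch.stft(noisy_wav, n_fft, hop_length=hop, window=window, return_complex=True)
    mag, phase = spec.abs(), spec.angle()                  # (n_fft // 2 + 1, frames)
    # The model expects (batch, frames, freq) magnitudes and applies the mask internally;
    # enroll_embed is the (1, embed_dim) registered speech feature of the target user.
    denoised_mag = model(mag.T.unsqueeze(0), enroll_embed).squeeze(0).T
    # Keep the original (noisy) phase and invert the STFT to get the target speech waveform.
    denoised_spec = torch.polar(denoised_mag, phase)
    return torch.istft(denoised_spec, n_fft, hop_length=hop, window=window,
                       length=noisy_wav.shape[-1])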
In summary, the embodiment of the present invention can perform denoising processing on the voice to be processed, and extract the target voice of the target user in the voice to be processed. When the target user is in a noisy environment, the embodiment of the invention can acquire the registered voice characteristics of the target user, input the to-be-processed voice of the target user and the registered voice characteristics of the target user into the speaker extraction model, and filter out the voice except the voice of the target user as noise through the speaker extraction model. The number of target users may be greater than or equal to 1. Thus, when the number of target users is 1, all sounds (including sounds of other users) except the sound of the target user can be filtered out as noise, and only the target voice of the target user is retained. When the number of target users is greater than 1, the target voices of a plurality of target users can be reserved. By the embodiment of the invention, the voice to be processed can be denoised, only the target voice specified by the user is reserved, the quality of the call voice can be improved, and the privacy of the user is protected.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
Device embodiment
Referring to fig. 7, a block diagram of a speech processing apparatus according to an embodiment of the present invention is shown, where the apparatus may include:
the voice acquiring module 701 is configured to receive a voice to be processed, where the voice to be processed includes noise and target user sound, and the number of target users is greater than or equal to 1;
a feature obtaining module 702, configured to obtain a registered voice feature of the target user;
the voice processing module 703 is configured to input the to-be-processed voice and the registered voice feature of the target user into a speaker extraction model, where the speaker extraction model extracts the target voice of the target user from the to-be-processed voice according to the registered voice feature of the target user, and outputs the target voice.
Optionally, the apparatus further comprises:
the first collection module is used for collecting a user voice sample of a registered user;
the first acquisition module is used for acquiring the registered voice characteristics and the pure voice of the registered user;
the characteristic extraction module is used for inputting the user voice sample, the registered voice characteristics and the pure voice into an initial speaker extraction model, and the speaker extraction model extracts the target voice of the registered user in the user voice sample according to the registered voice characteristics of the registered user;
and the first iteration module is used for iteratively optimizing the model parameters of the speaker extraction model according to the feature difference between the extracted target voice of the registered user and the extracted pure voice of the registered user, and achieving a preset convergence condition to obtain the trained speaker extraction model.
Optionally, the speaker extraction model includes a first processing network and a second processing network, and the feature extraction module includes:
the first Fourier transform submodule is used for carrying out short-time Fourier transform on the user voice sample to obtain a sample voice magnitude spectrum;
the second Fourier transform submodule is used for carrying out short-time Fourier transform on the pure voice to obtain a pure voice magnitude spectrum;
the first network processing submodule is used for extracting the noise-containing voice characteristics of the sample voice amplitude spectrum through the first processing network;
the characteristic modulation submodule is used for performing element-wise multiplication of the registered voice feature of the registered user and the noisy voice feature to obtain a modulated voice feature;
the second network processing submodule is used for carrying out feature extraction processing on the modulated voice features through the second processing network to obtain an amplitude spectrum mask;
the multiplication submodule is used for performing element-wise multiplication of the sample voice amplitude spectrum and the amplitude spectrum mask to obtain a noise reduction voice amplitude spectrum;
the first iteration module includes:
the difference calculation submodule is used for calculating the characteristic difference between the noise reduction voice amplitude spectrum and the pure voice amplitude spectrum according to a preset loss function;
and the iterative optimization submodule is used for iteratively optimizing the model parameters of the speaker extraction model according to the characteristic difference.
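As a rough sketch of how the difference calculation and iterative optimization sub-modules could work together during training, the loop below reuses the hypothetical SpeakerExtractionModel and speaker_extraction_loss sketched earlier; the Adam optimizer, the learning rate, and the fixed epoch count (standing in for the preset convergence condition) are assumptions.

import torch

def train_speaker_extraction(model, loader, epochs=50, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):  # stand-in for the preset convergence condition
        for noisy_mag, clean_mag, enroll_embed in loader:
            denoised_mag = model(noisy_mag, enroll_embed)
            loss = speaker_extraction_loss(clean_mag, denoised_mag)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model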
Optionally, the feature obtaining module includes:
a registered voice obtaining sub-module, configured to obtain a registered voice of the target user;
and the model extraction submodule is used for inputting the registered voice of the target user into a feature extraction model, and performing feature extraction on the registered voice of the target user to obtain the registered voice feature of the target user.
Optionally, the apparatus further comprises:
the second collection module is used for collecting a registered voice sample of the registered user;
the first extraction module is used for inputting the registered voice sample into an initial feature extraction model and extracting to obtain a feature vector of the registered voice sample;
the second extraction module is used for extracting the characteristics of the pure voice of the registered user to obtain the characteristic vector of the pure voice;
and the second iteration module is used for iteratively optimizing the model parameters of the feature extraction model according to the feature difference between the feature vector of the registered voice sample and the feature vector of the pure voice, and obtaining the trained feature extraction model when a preset convergence condition is reached.
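As a small illustration of the second iteration module's objective, the sketch below measures the feature difference as a mean-squared error between the feature vector extracted from the registered voice sample and the feature vector of the pure voice; the choice of mean-squared error is an assumption, since the patent does not fix the feature-difference measure.

import torch.nn.functional as F

def feature_extraction_loss(sample_vec, clean_vec):
    # Pull the registered-sample feature vector toward the pure-voice feature vector.
    return F.mse_loss(sample_vec, clean_vec)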
Optionally, the first extraction module includes:
the voice detection sub-module is used for carrying out voice activation detection on the registered voice sample and filtering a non-voice section in the registered voice sample to obtain filtered voice;
the voice segmentation submodule is used for segmenting the filtered voice according to a preset frame length to obtain a voice frame sequence corresponding to the filtered voice;
the third Fourier transform submodule is used for carrying out short-time Fourier transform on each voice frame in the voice frame sequence to obtain a voice frame magnitude spectrum corresponding to each voice frame;
the model processing submodule is used for inputting the voice frame magnitude spectrum into a feature extraction network of a feature extraction model and outputting a feature vector of each voice frame magnitude spectrum;
and the average calculation submodule is used for carrying out average calculation on the feature vector of the voice frame amplitude spectrum corresponding to each voice frame in the voice frame sequence to obtain the feature vector of the registered voice sample.
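As a rough illustration of these sub-modules working together, the following sketch assumes an energy-threshold voice activation detection, a 512-sample frame with a 256-sample hop, and a placeholder feature_net standing in for the feature extraction network; all of these are assumptions for illustration only.

import torch

def extract_enrollment_feature(feature_net, wav, frame_len=512, hop=256, vad_thresh=1e-3):
    # Crude energy-based voice activation detection: keep frames above an energy threshold.
    frames = wav.unfold(0, frame_len, hop)                    # (n_frames, frame_len)
    voiced = frames[frames.pow(2).mean(dim=1) > vad_thresh]   # filtered speech frames
    # Per-frame magnitude spectrum (a stand-in for the short-time Fourier transform).
    mags = torch.fft.rfft(voiced * torch.hann_window(frame_len), dim=1).abs()
    # Feature vector for each speech frame magnitude spectrum, then average over frames.
    vecs = feature_net(mags)                                   # (n_voiced_frames, embed_dim)
    return vecs.mean(dim=0)                                    # feature vector of the registered voice sample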
Optionally, the number of the registered users is greater than 1, and the voice detection sub-module is specifically configured to perform voice activation detection on the registered voice sample of each registered user to obtain filtered voice of each registered voice sample;
the voice segmentation submodule is specifically used for segmenting the filtered voice of each registered voice sample according to a preset frame length to obtain a voice frame sequence corresponding to each registered voice sample;
the third fourier transform submodule is specifically configured to perform short-time fourier transform on each voice frame in the voice frame sequence of each registered voice sample to obtain a voice frame magnitude spectrum corresponding to each voice frame in the voice frame sequence of each registered voice sample;
the model processing submodule is specifically used for inputting the voice frame magnitude spectrum corresponding to each voice frame in the voice frame sequence of each registered voice sample into the feature extraction network of the feature extraction model and outputting the feature vector of each voice frame magnitude spectrum;
the average calculation submodule is specifically configured to perform average calculation on the feature vector of the voice frame magnitude spectrum corresponding to each voice frame in the voice frame sequence of each registered voice sample to obtain the feature vector of each registered voice sample.
Optionally, the apparatus further comprises:
the feature library establishing module is used for establishing a voice feature library of the registered user, and the voice feature library stores the mapping relation between the registered voice feature of the registered user and the user identifier of the registered user;
the feature obtaining module is specifically configured to query the voice feature library according to a user identifier of a target user, so as to obtain a registered voice feature of the target user.
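A minimal sketch of such a voice feature library is shown below; the in-memory dictionary and the function names register_user and get_registered_feature are illustrative stand-ins for whatever storage the actual implementation uses.

from typing import Dict
import torch

voice_feature_library: Dict[str, torch.Tensor] = {}

def register_user(user_id: str, enrollment_feature: torch.Tensor) -> None:
    # Store the mapping between the registered user's identifier and registered voice feature.
    voice_feature_library[user_id] = enrollment_feature

def get_registered_feature(user_id: str) -> torch.Tensor:
    # Query the library by the target user's identifier to obtain the registered voice feature.
    return voice_feature_library[user_id]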
Optionally, the number of the target users is greater than 1, and the speech processing module is specifically configured to input the speech to be processed and the registered speech features of each target user into the speaker extraction model, and output the target speech corresponding to each target user respectively, or output a mixed speech including the target speech of all the target users.
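As a small illustration of this alternative, the sketch below runs the hypothetical extract_target_speech helper once per target user and returns either the individual target voices or a simple mixture of them; the helper name and the plain summation are assumptions.

def extract_multiple_targets(model, noisy_wav, enroll_embeds, mix=False):
    # Run speaker extraction once per target user's registered voice feature.
    outputs = [extract_target_speech(model, noisy_wav, e) for e in enroll_embeds]
    # Either return each target voice separately, or one mixed voice containing all of them.
    return sum(outputs) if mix else outputs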
The embodiment of the invention can perform denoising processing on the voice to be processed and extract the target voice of the target user in the voice to be processed. When the target user is in a noisy environment, the embodiment of the invention can acquire the registered voice characteristics of the target user, input the to-be-processed voice of the target user and the registered voice characteristics of the target user into the speaker extraction model, and filter out the voice except the voice of the target user as noise through the speaker extraction model. The number of target users may be greater than or equal to 1. Thus, when the number of target users is 1, all sounds (including sounds of other users) except the sound of the target user can be filtered out as noise, and only the target voice of the target user is retained. When the number of target users is greater than 1, the target voices of a plurality of target users can be reserved. By the embodiment of the invention, the voice to be processed can be denoised, only the target voice specified by the user is reserved, the quality of the call voice can be improved, and the privacy of the user is protected.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
An embodiment of the present invention provides an apparatus for speech processing, the apparatus comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs comprising instructions for: receiving a voice to be processed, wherein the voice to be processed comprises noise and target user sound, and the number of the target users is greater than or equal to 1; acquiring the registered voice characteristics of the target user; and inputting the voice to be processed and the registered voice characteristics of the target user into a speaker extraction model, wherein the speaker extraction model extracts the target voice of the target user from the voice to be processed according to the registered voice characteristics of the target user and outputs the target voice.
Fig. 8 is a block diagram illustrating an apparatus 800 for speech processing according to an example embodiment. For example, the apparatus 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 8, the apparatus 800 may include one or more of the following components: processing component 802, memory 804, power component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and communication component 816.
The processing component 802 generally controls overall operation of the device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing elements 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operation at the device 800. Examples of such data include instructions for any application or method operating on device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
Power components 806 provide power to the various components of device 800. The power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the apparatus 800.
The multimedia component 808 includes a screen that provides an output interface between the device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the device 800 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the apparatus 800 is in an operational mode, such as a call mode, a recording mode, and a voice information processing mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the device 800. For example, the sensor assembly 814 may detect the open/closed status of the device 800, the relative positioning of components, such as a display and keypad of the apparatus 800, the change in position of the device 800 or a component of the device 800, the presence or absence of user contact with the device 800, the orientation or acceleration/deceleration of the device 800, and the change in temperature of the device 800. Sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate communications between the apparatus 800 and other devices in a wired or wireless manner. The device 800 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 804 comprising instructions, executable by the processor 820 of the device 800 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Fig. 9 is a schematic diagram of a server in some embodiments of the invention. The server 1900 may vary widely by configuration or performance and may include one or more Central Processing Units (CPUs) 1922 (e.g., one or more processors) and memory 1932, one or more storage media 1930 (e.g., one or more mass storage devices) storing applications 1942 or data 1944. Memory 1932 and storage medium 1930 can be, among other things, transient or persistent storage. The program stored in the storage medium 1930 may include one or more modules (not shown), each of which may include a series of instructions operating on a server. Still further, a central processor 1922 may be provided in communication with the storage medium 1930 to execute a series of instruction operations in the storage medium 1930 on the server 1900.
The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input-output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
A non-transitory computer-readable storage medium in which instructions, when executed by a processor of an apparatus (server or terminal), enable the apparatus to perform the voice processing method shown in fig. 1.
A non-transitory computer readable storage medium in which instructions, when executed by a processor of an apparatus (server or terminal), enable the apparatus to perform a speech processing method, the method comprising: receiving a voice to be processed, wherein the voice to be processed comprises noise and target user sound, and the number of the target users is more than or equal to 1; acquiring the registered voice characteristics of the target user; and inputting the voice to be processed and the registered voice characteristics of the target user into a speaker extraction model, wherein the speaker extraction model extracts the target voice of the target user from the voice to be processed according to the registered voice characteristics of the target user and outputs the target voice.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims. The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
The foregoing has described in detail a voice processing method, a voice processing apparatus, and an apparatus for voice processing provided by the present invention. Specific examples have been applied herein to explain the principles and embodiments of the present invention, and the descriptions of the foregoing examples are only used to help understand the method and the core ideas of the present invention; meanwhile, for a person skilled in the art, there may be variations in the specific embodiments and the application scope according to the idea of the present invention. In summary, the content of this specification should not be construed as a limitation of the present invention.

Claims (17)

1. A method of speech processing, the method comprising:
receiving a voice to be processed, wherein the voice to be processed comprises noise and target user sound, and the number of the target users is more than or equal to 1;
acquiring the registered voice characteristics of the target user;
and inputting the voice to be processed and the registered voice characteristics of the target user into a speaker extraction model, wherein the speaker extraction model extracts the target voice of the target user from the voice to be processed according to the registered voice characteristics of the target user and outputs the target voice.
2. The method of claim 1, further comprising:
collecting a user voice sample of a registered user;
acquiring the registration voice characteristics and pure voice of the registered user;
inputting the user voice sample, the registered voice characteristics and the pure voice into an initial speaker extraction model, wherein the speaker extraction model extracts the target voice of the registered user in the user voice sample according to the registered voice characteristics of the registered user;
and iteratively optimizing the model parameters of the speaker extraction model according to the feature difference between the extracted target voice of the registered user and the pure voice of the registered user, and achieving a preset convergence condition to obtain the trained speaker extraction model.
3. The method of claim 2, wherein the speaker extraction model comprises a first processing network and a second processing network, and wherein the speaker extraction model extracts the target speech of the registered user in the user speech sample according to the registered speech features of the registered user, comprising:
carrying out short-time Fourier transform on the user voice sample to obtain a sample voice magnitude spectrum;
carrying out short-time Fourier transform on the pure voice to obtain a pure voice magnitude spectrum;
extracting noisy speech features of the sample speech magnitude spectrum through the first processing network;
performing element-wise multiplication of the registered voice feature of the registered user and the noisy voice feature to obtain a modulated voice feature;
performing feature extraction processing on the modulated voice features through the second processing network to obtain a magnitude spectrum mask;
performing element-wise multiplication of the sample voice magnitude spectrum and the magnitude spectrum mask to obtain a noise-reduced voice magnitude spectrum;
iteratively optimizing model parameters of the speaker extraction model according to feature differences between the extracted target speech of the registered user and the clean speech of the registered user, comprising:
calculating the feature difference between the noise-reduced voice magnitude spectrum and the pure voice magnitude spectrum according to a preset loss function;
and iteratively optimizing the model parameters of the speaker extraction model according to the characteristic difference.
4. The method of claim 1, wherein the obtaining the registered voice characteristics of the target user comprises:
acquiring the registration voice of the target user;
inputting the registered voice of the target user into a feature extraction model, and performing feature extraction on the registered voice of the target user to obtain the registered voice feature of the target user.
5. The method of claim 4, further comprising:
collecting a registered voice sample of a registered user;
inputting the registered voice sample into an initial feature extraction model, and extracting to obtain a feature vector of the registered voice sample;
extracting the characteristics of the pure voice of the registered user to obtain the characteristic vector of the pure voice;
and iteratively optimizing the model parameters of the feature extraction model according to the feature difference between the feature vector of the registered voice sample and the feature vector of the pure voice to reach a preset convergence condition to obtain the trained feature extraction model.
6. The method of claim 5, wherein inputting the registered voice sample into an initial feature extraction model, and extracting a feature vector of the registered voice sample comprises:
carrying out voice activation detection on the registered voice sample, and filtering a non-voice section in the registered voice sample to obtain filtered voice;
segmenting the filtered voice according to a preset frame length to obtain a voice frame sequence corresponding to the filtered voice;
performing short-time Fourier transform on each voice frame in the voice frame sequence to obtain a voice frame magnitude spectrum corresponding to each voice frame;
inputting the voice frame magnitude spectrum into a feature extraction network of a feature extraction model, and outputting a feature vector of each voice frame magnitude spectrum;
and carrying out averaging calculation on the feature vectors of the voice frame magnitude spectrums corresponding to the voice frames in the voice frame sequence to obtain the feature vectors of the registered voice samples.
7. The method of claim 6, wherein the number of registered users is greater than 1, and wherein performing voice activity detection on the registered voice samples and filtering non-voice segments in the registered voice samples to obtain filtered voice comprises:
carrying out voice activation detection on the registered voice sample of each registered user to obtain the filtered voice of each registered voice sample;
the segmenting the filtered voice according to the preset frame length to obtain the voice frame sequence corresponding to the filtered voice comprises the following steps:
segmenting the filtered voice of each registered voice sample according to a preset frame length to obtain a voice frame sequence corresponding to each registered voice sample;
the short-time fourier transform is performed on each speech frame in the speech frame sequence to obtain a speech frame magnitude spectrum corresponding to each speech frame, and the method includes:
performing short-time Fourier transform on each voice frame in the voice frame sequence of each registered voice sample to obtain a voice frame magnitude spectrum corresponding to each voice frame in the voice frame sequence of each registered voice sample;
the inputting the voice frame magnitude spectrum into a feature extraction network of a feature extraction model and outputting a feature vector of each voice frame magnitude spectrum comprises the following steps:
inputting the voice frame magnitude spectrum corresponding to each voice frame in the voice frame sequence of each registered voice sample into a feature extraction network of a feature extraction model, and outputting a feature vector of each voice frame magnitude spectrum;
the averaging calculation of the feature vectors of the voice frame magnitude spectrums corresponding to the voice frames in the voice frame sequence to obtain the feature vectors of the registered voice samples includes:
and averaging the feature vectors of the voice frame magnitude spectrums corresponding to the voice frames in the voice frame sequence of each registered voice sample to obtain the feature vector of each registered voice sample.
8. The method of claim 1, further comprising:
establishing a voice feature library of the registered user, wherein a mapping relation between the registered voice feature of the registered user and the user identifier of the registered user is stored in the voice feature library;
the acquiring of the registered voice feature of the target user includes:
and querying the voice feature library according to the user identification of the target user to obtain the registered voice feature of the target user.
9. The method of claim 1, wherein the number of target users is greater than 1, and the inputting the speech to be processed and the registered speech features of the target users into the speaker extraction model comprises:
inputting the voice to be processed and the registered voice characteristics of each target user into a speaker extraction model;
the outputting the target voice includes:
and respectively outputting the target voice corresponding to each target user, or outputting a mixed voice containing the target voices of all the target users.
10. A speech processing apparatus, characterized in that the apparatus comprises:
the voice acquisition module is used for receiving voice to be processed, wherein the voice to be processed comprises noise and target user voice, and the number of the target users is more than or equal to 1;
the characteristic acquisition module is used for acquiring the registered voice characteristics of the target user;
and the voice processing module is used for inputting the voice to be processed and the registered voice characteristics of the target user into a speaker extraction model, and the speaker extraction model extracts the target voice of the target user from the voice to be processed and outputs the target voice according to the registered voice characteristics of the target user.
11. The apparatus of claim 10, further comprising:
the first collection module is used for collecting a user voice sample of a registered user;
the first acquisition module is used for acquiring the registered voice characteristics and the pure voice of the registered user;
the characteristic extraction module is used for inputting the user voice sample, the registered voice characteristics and the pure voice into an initial speaker extraction model, and the speaker extraction model extracts the target voice of the registered user in the user voice sample according to the registered voice characteristics of the registered user;
and the first iteration module is used for iteratively optimizing the model parameters of the speaker extraction model according to the feature difference between the extracted target voice of the registered user and the extracted pure voice of the registered user, and achieving a preset convergence condition to obtain the trained speaker extraction model.
12. The apparatus of claim 11, wherein the speaker extraction model comprises a first processing network and a second processing network, and wherein the feature extraction module comprises:
the first Fourier transform submodule is used for carrying out short-time Fourier transform on the user voice sample to obtain a sample voice magnitude spectrum;
the second Fourier transform submodule is used for carrying out short-time Fourier transform on the pure voice to obtain a pure voice magnitude spectrum;
the first network processing submodule is used for extracting the noise-containing voice characteristics of the sample voice amplitude spectrum through the first processing network;
the characteristic modulation submodule is used for multiplying the registered voice characteristic of the registered user and the noisy voice characteristic by the elements of the matrix to obtain a modulated voice characteristic;
the second network processing submodule is used for carrying out feature extraction processing on the modulated voice features through the second processing network to obtain an amplitude spectrum mask;
the multiplication submodule is used for carrying out matrix element multiplication on the sample voice amplitude spectrum and the amplitude spectrum mask to obtain a noise reduction voice amplitude spectrum;
the first iteration module includes:
the difference calculation submodule is used for calculating the characteristic difference between the noise reduction voice amplitude spectrum and the pure voice amplitude spectrum according to a preset loss function;
and the iterative optimization submodule is used for iteratively optimizing the model parameters of the speaker extraction model according to the characteristic difference.
13. The apparatus of claim 10, wherein the feature obtaining module comprises:
a registered voice obtaining sub-module, configured to obtain a registered voice of the target user;
and the model extraction submodule is used for inputting the registered voice of the target user into a feature extraction model, and performing feature extraction on the registered voice of the target user to obtain the registered voice feature of the target user.
14. The apparatus of claim 13, further comprising:
the second collection module is used for collecting a registered voice sample of the registered user;
the first extraction module is used for inputting the registered voice sample into an initial feature extraction model and extracting to obtain a feature vector of the registered voice sample;
the second extraction module is used for extracting the characteristics of the pure voice of the registered user to obtain the characteristic vector of the pure voice;
and the second iteration module is used for iteratively optimizing the model parameters of the feature extraction model according to the feature difference between the feature vector of the registered voice sample and the feature vector of the pure voice, and obtaining the trained feature extraction model when a preset convergence condition is reached.
15. The apparatus according to claim 10, wherein the number of the target users is greater than 1, and the speech processing module is specifically configured to input the speech to be processed and the registered speech features of each target user into the speaker extraction model, and output the target speech corresponding to each target user respectively, or output a mixed speech including the target speech of all the target users.
16. An apparatus for speech processing, comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory, and wherein the one or more programs configured to be executed by the one or more processors comprise instructions for performing the method of speech processing according to any one of claims 1-9.
17. A machine-readable medium having stored thereon instructions, which when executed by one or more processors, cause an apparatus to perform the speech processing method of any of claims 1 to 9.
CN202110454916.4A 2021-04-26 2021-04-26 Voice processing method and device for voice processing Pending CN113345461A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110454916.4A CN113345461A (en) 2021-04-26 2021-04-26 Voice processing method and device for voice processing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110454916.4A CN113345461A (en) 2021-04-26 2021-04-26 Voice processing method and device for voice processing

Publications (1)

Publication Number Publication Date
CN113345461A true CN113345461A (en) 2021-09-03

Family

ID=77468679

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110454916.4A Pending CN113345461A (en) 2021-04-26 2021-04-26 Voice processing method and device for voice processing

Country Status (1)

Country Link
CN (1) CN113345461A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024018429A1 (en) * 2022-07-22 2024-01-25 Samsung Electronics Co., Ltd. Audio signal processing method, audio signal processing apparatus, computer device and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109643554A (en) * 2018-11-28 2019-04-16 深圳市汇顶科技股份有限公司 Adaptive voice Enhancement Method and electronic equipment
CN110223680A (en) * 2019-05-21 2019-09-10 腾讯科技(深圳)有限公司 Method of speech processing, recognition methods and its device, system, electronic equipment
CN111223493A (en) * 2020-01-08 2020-06-02 北京声加科技有限公司 Voice signal noise reduction processing method, microphone and electronic equipment
CN111489760A (en) * 2020-04-01 2020-08-04 腾讯科技(深圳)有限公司 Speech signal dereverberation processing method, speech signal dereverberation processing device, computer equipment and storage medium
CN112289333A (en) * 2020-12-25 2021-01-29 北京达佳互联信息技术有限公司 Training method and device of voice enhancement model and voice enhancement method and device


Similar Documents

Publication Publication Date Title
CN108198569B (en) Audio processing method, device and equipment and readable storage medium
CN110808063A (en) Voice processing method and device for processing voice
CN110970057B (en) Sound processing method, device and equipment
CN111883164B (en) Model training method and device, electronic equipment and storage medium
CN106205628A (en) Acoustical signal optimization method and device
CN111128221A (en) Audio signal processing method and device, terminal and storage medium
CN111968662A (en) Audio signal processing method and device and storage medium
CN111986693A (en) Audio signal processing method and device, terminal equipment and storage medium
CN111009257A (en) Audio signal processing method and device, terminal and storage medium
CN113707134A (en) Model training method and device for model training
CN110121106A (en) Video broadcasting method and device
CN109256145B (en) Terminal-based audio processing method and device, terminal and readable storage medium
CN113345461A (en) Voice processing method and device for voice processing
CN111933171B (en) Noise reduction method and device, electronic equipment and storage medium
CN112201267A (en) Audio processing method and device, electronic equipment and storage medium
CN113506582A (en) Sound signal identification method, device and system
CN113053406A (en) Sound signal identification method and device
CN113113044A (en) Audio processing method and device, terminal and storage medium
CN112447184A (en) Voice signal processing method and device, electronic equipment and storage medium
CN111667842B (en) Audio signal processing method and device
CN113223553B (en) Method, apparatus and medium for separating voice signal
CN113810828A (en) Audio signal processing method and device, readable storage medium and earphone
CN113077808B (en) Voice processing method and device for voice processing
CN113113036B (en) Audio signal processing method and device, terminal and storage medium
CN110931028B (en) Voice processing method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination