CN113782006A - Voice extraction method, device and equipment - Google Patents

Voice extraction method, device and equipment Download PDF

Info

Publication number
CN113782006A
Authority
CN
China
Prior art keywords
voice
signal
sample data
mixed
guide
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111033767.0A
Other languages
Chinese (zh)
Inventor
史慧宇
尹首一
韩慧明
刘雷波
魏少军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202111033767.0A priority Critical patent/CN113782006A/en
Publication of CN113782006A publication Critical patent/CN113782006A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272Voice signal separating

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The embodiment of the specification provides a voice extraction method, a voice extraction device and voice extraction equipment. The method comprises the following steps: acquiring mixed voice sample data; the mixed voice sample data is a single-channel voice signal; the mixed voice sample data comprises at least one of a noise signal, an interference voice signal and a reverberation signal and a guide voice; the guide voice includes a voice corresponding to a target object; constructing a voice separation network; the voice separation network comprises an encoder, a global encoder, a guide module, a separation module and a decoder; updating the voice separation network based on a preset loss function and the predicted target voice to obtain a voice extraction model; extracting a target object voice signal from voice data to be processed by utilizing the voice extraction model; the voice data to be processed comprises a single-channel voice signal. The method accurately and effectively extracts the voice from the single-channel voice, and meets the related requirements of subsequent voice processing.

Description

Voice extraction method, device and equipment
Technical Field
The embodiment of the specification relates to the technical field of voice signal processing, in particular to a voice extraction method, a device and equipment.
Background
With the development of computer and artificial intelligence technologies, automatic speech recognition based on intelligent devices has also been widely used. In practical application, the intelligent device collects the voice of the target object and often receives interference signals such as the voice of other objects and noise in the environment. Therefore, before performing speech recognition, a speech signal corresponding to a target object is first extracted from the acquired speech signal.
At present, when a multi-channel speech signal is processed, speech extraction can be performed by comparing the signals of different channels. Single-channel speech separation, however, relies only on the signal collected by a single microphone and must model the differences in time-frequency-domain acoustic and statistical characteristics between the target speech and the interference, so extracting the corresponding sound source from a noisy, reverberant environment is considerably more difficult. If the mainstream methods for processing multi-channel speech signals are applied directly to single-channel speech signals, the separation accuracy drops sharply in practical application scenarios because of interference factors such as background noise and reverberation, which seriously degrades the user experience. Therefore, a technical solution capable of accurately and effectively extracting speech from a single-channel signal is needed.
Disclosure of Invention
An object of the embodiments of the present specification is to provide a method, an apparatus, and a device for extracting speech, so as to solve a problem of how to accurately and effectively extract a single-channel speech.
In order to solve the above technical problem, an embodiment of the present specification provides a speech extraction method, including: acquiring mixed voice sample data; the mixed voice sample data is a single-channel voice signal; the mixed voice sample data comprises at least one of a noise signal, an interference voice signal and a reverberation signal and a guide voice; the guide voice includes a voice corresponding to a target object; constructing a voice separation network; the voice separation network comprises an encoder, a global encoder, a guide module, a separation module and a decoder; the encoder and the global encoder are used for outputting the characteristics of the voice signal; the guide module is used for outputting a weight value according to a comparison result of guide voice and mixed voice sample data; the separation module is used for obtaining high-dimensional mapping of the target voice; the decoder is used for decoding the data to obtain target voice; inputting the mixed voice sample data into a voice separation network to obtain a predicted target voice; updating the voice separation network based on a preset loss function and the predicted target voice to obtain a voice extraction model; extracting a target object voice signal from voice data to be processed by utilizing the voice extraction model; the voice data to be processed comprises a single-channel voice signal.
An embodiment of this specification further provides a speech extraction apparatus, including: the mixed voice sample data acquisition module is used for acquiring mixed voice sample data; the mixed voice sample data is a single-channel voice signal; the mixed voice sample data comprises at least one of a noise signal, an interference voice signal and a reverberation signal and a guide voice; the guide voice includes a voice corresponding to a target object; the voice separation network construction module is used for constructing a voice separation network; the voice separation network comprises an encoder, a global encoder, a guide module, a separation module and a decoder; the encoder and the global encoder are used for outputting the characteristics of the voice signal; the guide module is used for outputting a weight value according to a comparison result of guide voice and mixed voice sample data; the separation module is used for obtaining high-dimensional mapping of the target voice; the decoder is used for decoding the data to obtain target voice; the mixed voice sample data input module is used for inputting the mixed voice sample data into a voice separation network to obtain predicted target voice; the voice separation network updating module is used for updating the voice separation network based on a preset loss function and the predicted target voice to obtain a voice extraction model; the target object voice signal extraction module is used for extracting a target object voice signal from voice data to be processed by utilizing the voice extraction model; the voice data to be processed comprises a single-channel voice signal.
The embodiment of the present specification further provides a speech extraction device, including a memory and a processor; the memory to store computer program instructions; the processor to execute the computer program instructions to implement the steps of: acquiring mixed voice sample data; the mixed voice sample data is a single-channel voice signal; the mixed voice sample data comprises at least one of a noise signal, an interference voice signal and a reverberation signal and a guide voice; the guide voice includes a voice corresponding to a target object; constructing a voice separation network; the voice separation network comprises an encoder, a global encoder, a guide module, a separation module and a decoder; the encoder and the global encoder are used for outputting the characteristics of the voice signal; the guide module is used for outputting a weight value according to a comparison result of guide voice and mixed voice sample data; the separation module is used for obtaining high-dimensional mapping of the target voice; the decoder is used for decoding the data to obtain target voice; inputting the mixed voice sample data into a voice separation network to obtain a predicted target voice; updating the voice separation network based on a preset loss function and the predicted target voice to obtain a voice extraction model; extracting a target object voice signal from voice data to be processed by utilizing the voice extraction model; the voice data to be processed comprises a single-channel voice signal.
As can be seen from the technical solutions provided in the embodiments of the present specification, a voice separation network including an encoder, a global encoder, a guidance module, a separation module, and a decoder is constructed, a predicted target voice corresponding to mixed voice sample data is obtained through the voice separation network of the above structure, and then the voice separation network is updated by using a preset loss function in combination with the predicted target voice, so as to obtain a voice extraction model, thereby extracting a corresponding target object voice signal from voice data to be processed. The method ensures that the single-channel voice signal is accurately and effectively extracted, and further can perform voice recognition and other utilization on the extracted voice in the subsequent process, thereby improving the use experience of users.
Drawings
In order to more clearly illustrate the embodiments of the present specification or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments described in the specification, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flow chart of a method for extracting speech according to an embodiment of the present disclosure;
fig. 2 is a schematic structural diagram of a voice separation network according to an embodiment of the present disclosure;
FIG. 3 is a schematic structural diagram of an encoder according to an embodiment of the present disclosure;
FIG. 4 is a schematic structural diagram of a guide module according to an embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of a separation module according to an embodiment of the present disclosure;
FIG. 6 is a block diagram of a speech extraction apparatus according to an embodiment of the present disclosure;
fig. 7 is a block diagram of a speech extraction device according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, and not all of the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present specification without any creative effort shall fall within the protection scope of the present specification.
In order to solve the above technical problem, a speech extraction method according to an embodiment of the present specification is introduced. The speech extraction method is executed by a speech extraction device, which includes but is not limited to a server, an industrial personal computer, a personal computer (PC), and the like. As shown in fig. 1, the speech extraction method may include the following implementation steps.
S110: acquiring mixed voice sample data; the mixed voice sample data is a single-channel voice signal; the mixed voice sample data comprises at least one of a noise signal, an interference voice signal and a reverberation signal and a guide voice; the guide voice includes a voice corresponding to a target object.
In this embodiment, in order to achieve the purpose of extracting the voices of one or more objects from a single-channel voice, a corresponding model needs to be constructed and trained, and finally, the purpose of extracting the voices can be achieved by using the model.
The mixed speech sample data is the data used to train the model. For the purpose of separating the voice of the target object from the speech signal, the mixed voice sample data may include the target voice signal and at least one of a noise signal, an interference voice signal, and a reverberation signal.
The noise signal may be a signal that interferes with the original speech during acquisition, for example because the microphone captures the speech signal incompletely or because part of the signal is lost during transmission. The interfering speech signal may be speech produced by an object other than the target object: for example, if the speech acquisition region contains several objects emitting sound and only one of them is the target object whose speech is to be extracted, the acquired speech signals corresponding to the other objects are interfering speech signals. The reverberation signal may be the signal received by the sound collection device after the sound emitted by the target object has been reflected by surrounding obstacles or barriers. Because these reflections arrive with a certain delay compared with the sound emitted directly by the target object, they also interfere with speech extraction.
The guide voice may be a signal corresponding to a voice generated by the target object. The number of the target objects may be one or more. In order to perform processing such as voice recognition on the guide voice in a subsequent process, the guide voice needs to be separated from the mixed voice sample data.
Specifically, for the purpose of performing speech extraction on a single-channel speech in conjunction with the embodiments of the present specification, the mixed speech sample data may also be a single-channel speech signal. The single-channel speech signal may be a sound signal collected by only one microphone.
Specifically, under the condition that the process of training the model is realized based on supervised learning, the mixed voice sample data may further correspond to a corresponding mark for identifying the guide voice therein, so that when the voice separation network is used for processing the data, the guide voice therein can be directly used for specific training. The specific identification manner may be set based on the requirement of the actual application, which is not limited to this.
In some embodiments, the mixed voice sample data may be prepared as follows. First, at least two human voice signals are mixed within a first signal-to-noise ratio range to obtain a human-voice mixed speech signal; a human voice signal may be an independent speech signal corresponding to a human voice that was acquired or separated in advance, and the first signal-to-noise ratio range limits the signal-to-noise ratio interval used when mixing the human voice signals, for example 0 dB to 5 dB. Second, the human-voice mixed speech signal is mixed with a noise signal within a second signal-to-noise ratio range to obtain a comprehensive speech signal; the noise signal may be an additionally generated signal that interferes with the speech signal, and the second signal-to-noise ratio range limits the signal-to-noise ratio interval used when mixing the two signals, for example -6 dB to 3 dB. Finally, the comprehensive speech signal is processed with a speech signal generation function to obtain the mixed voice sample data. The speech signal generation function generates a corresponding speech signal from the given data so as to simulate the speech of a practical application, specifically the pyroomacoustics package, which can quickly construct simulation scenarios of single or multiple sound sources and microphones in a 2D/3D room and thereby help construct simulated speech sample data. A minimal mixing sketch is given below, and a pyroomacoustics sketch follows Table 1.
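The two-stage mixing described above can be illustrated with a short Python sketch. The helper name mix_at_snr and the stand-in waveforms are assumptions introduced here for illustration; only the two signal-to-noise-ratio ranges come from the description above.

```python
import numpy as np

def mix_at_snr(target, interference, snr_db):
    """Scale `interference` so the target-to-interference power ratio equals `snr_db` (in dB), then add."""
    p_target = np.mean(target ** 2)
    p_interference = np.mean(interference ** 2)
    scale = np.sqrt(p_target / (p_interference * 10.0 ** (snr_db / 10.0) + 1e-12))
    return target + scale * interference

rng = np.random.default_rng(0)
# Stand-ins for two pre-loaded 8 kHz human voice signals and one noise sample of equal length.
speaker_a, speaker_b, noise = (rng.standard_normal(8000 * 4) for _ in range(3))

speech_mix = mix_at_snr(speaker_a, speaker_b, rng.uniform(0.0, 5.0))   # first range: 0 dB to 5 dB
noisy_mix = mix_at_snr(speech_mix, noise, rng.uniform(-6.0, 3.0))      # second range: -6 dB to 3 dB
```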
The above process is explained in detail with a specific example. When preparing the mixed voice sample data, the time-domain signals of the WSJ0 speech samples and WHAM noise samples are first resampled at 8 kHz. The voices of two different speakers are mixed at a signal-to-noise ratio drawn randomly between 0 dB and 5 dB, and the resulting mixture is then mixed with a randomly drawn noise sample at a signal-to-noise ratio in the range of -6 dB to 3 dB. Room impulse responses are obtained with the pyroomacoustics package based on the room configuration parameters in Table 1, and the mixture is finally convolved with them to obtain the mixed voice sample data y, which contains noise, reverberation, and interference from other speakers.
TABLE 1
(Table 1, listing the room configuration parameters, is reproduced only as an image in the original publication.)
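As a hedged illustration of the reverberation step, the sketch below uses the pyroomacoustics package to simulate a shoebox room with a single microphone and convolve a source signal with the resulting room impulse response. The room size, absorption value, and source/microphone positions are placeholder assumptions, since the actual configuration parameters of Table 1 are available only as an image.

```python
import numpy as np
import pyroomacoustics as pra

fs = 8000                                    # resampling rate used for the WSJ0 / WHAM samples
signal = np.random.default_rng(1).standard_normal(fs * 4)   # stand-in for the noisy speech mixture

room = pra.ShoeBox([6.0, 5.0, 3.0], fs=fs,                  # placeholder room dimensions in meters
                   materials=pra.Material(0.3), max_order=10)
room.add_source([2.0, 3.5, 1.5], signal=signal)             # source playing the mixture
room.add_microphone_array(
    pra.MicrophoneArray(np.array([[4.0], [2.0], [1.5]]), fs)  # a single microphone -> single-channel data
)

room.simulate()                              # convolves the source with the simulated room impulse response
y = room.mic_array.signals[0, :]             # final single-channel mixed voice sample y
```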
Based on the above embodiment, after a certain number of mixed voice sample data is obtained, the mixed voice sample data may be further divided. Specifically, the mixed voice sample data may be divided into training sample data, verification sample data, and test sample data. Wherein the training sample data is used for training for a model in a subsequent step; the test sample data and the verification sample data can be used for respectively testing and verifying the model after the model training is finished so as to ensure the effect of the model.
To illustrate with a specific example, assuming that the total number of samples generated in the above steps is 28000, 20000 samples may be used as training sample data, 3000 as verification sample data, and 5000 as test sample data, to be used in the subsequent model training, verification, and testing processes respectively. In practical applications, the sample counts may be set to other ratios according to the application requirements and are not limited to the above example.
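A minimal sketch of such an index-based split, using the 20000/3000/5000 proportions mentioned above (the variable names are assumptions):

```python
import numpy as np

num_samples = 28000
perm = np.random.default_rng(42).permutation(num_samples)

train_idx = perm[:20000]            # training sample data
valid_idx = perm[20000:23000]       # verification sample data
test_idx = perm[23000:]             # test sample data (5000 samples)
```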
S120: constructing a voice separation network; the voice separation network comprises an encoder, a global encoder, a guide module, a separation module and a decoder; the encoder and the global encoder are used for outputting the characteristics of the voice signal; the guide module is used for outputting a weight value according to a comparison result of guide voice and mixed voice sample data; the separation module is used for obtaining high-dimensional mapping of the target voice; the decoder is used for decoding the data to obtain the target voice.
A speech separation network is a network for separating speech signals containing noise, reverberation, and target object speech. The voice separation network may be a deep neural network, and in particular, may be a single-guide and graph-convolution voice separation network.
The voice separation network comprises an encoder, a global encoder, a guide module, a separation module and a decoder, and the encoder, the global encoder, the guide module, the separation module and the decoder are used for processing voice signals based on corresponding data processing modes and data transmission sequences. Specifically, the structure of the voice separation network is described with reference to fig. 2. As shown in fig. 2, the voice separation network accepts mixed voice and guide voice, respectively, during a training phase. The mixed speech may include noise, reverberation, and other interference signals, and includes speech corresponding to a target speech signal of a target object, and the guide speech may be identified speech corresponding to the target object. For the description of the two speeches, reference may be made to the description in step S110, and details are not described here.
The mixed speech is passed to the encoder, whose output is passed to the global encoder; the guide module combines the output of the global encoder with the guide speech to generate the corresponding weight values. The separation module processes the speech features output by the global encoder, combines the result with the weight values generated by the guide module, and the combined result is input into the decoder to obtain the estimated target speech.
The structure of the voice separation network is described in detail by using a specific example. As shown in fig. 3, it is a schematic structural diagram of an encoder, wherein the encoder is composed of two layers of convolution networks; the global encoder is composed of a layer of graph convolution network; fig. 4 is a schematic structural diagram of a guide module, where the guide module includes a guide encoding part and an attention mechanism part, including two convolutional layers and a full connection layer, where a global feature may be a feature generated by a global encoder, and the guide encoding part is used to receive guide speech, and process data according to the data stream relationship shown in the diagram to output corresponding data; fig. 5 is a schematic structural diagram of a separation module, wherein the separation module includes four convolutional layers and one pooling layer; the decoder includes two layers of deconvolution networks.
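To make the module layout above concrete, the following is a minimal PyTorch sketch of the five components. The channel counts, kernel sizes, strides, the dense approximation of the graph-convolution layer, and the bilinear attention scoring are illustrative assumptions; they follow the layer counts stated above but do not reproduce the exact network of the patent.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Two 1-D convolution layers mapping the mixed waveform to feature frames G."""
    def __init__(self, channels=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=16, stride=8), nn.PReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1), nn.PReLU(),
        )

    def forward(self, x):                        # x: (batch, 1, time)
        return self.net(x)                       # (batch, C, N)


class GlobalEncoder(nn.Module):
    """One graph-convolution-style layer: every feature frame is connected to every other frame."""
    def __init__(self, channels=128):
        super().__init__()
        self.proj = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, g):                        # g: (batch, C, N)
        adj = torch.softmax(torch.bmm(g.transpose(1, 2), g), dim=-1)   # fully connected graph (batch, N, N)
        return self.proj(torch.bmm(g, adj))      # aggregated global features G_g


class GuideModule(nn.Module):
    """Guide encoding part (two conv layers) plus attention part (one fully connected scoring layer)."""
    def __init__(self, channels=128):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=16, stride=8), nn.PReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1), nn.PReLU(),
        )
        self.score = nn.Linear(channels, channels)

    def forward(self, p, g_global):              # p: guide speech (batch, 1, time)
        h = self.encode(p)                       # H: (batch, C, M)
        scores = torch.bmm(self.score(h.transpose(1, 2)), g_global)    # (batch, M, N)
        alpha = torch.softmax(scores, dim=1)     # attention weights over the guide frames
        return h, alpha


class Separator(nn.Module):
    """Four convolution layers and one pooling layer producing the high-dimensional mapping of the target."""
    def __init__(self, channels=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(2 * channels, channels, kernel_size=3, padding=1), nn.PReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1), nn.PReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1), nn.PReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1), nn.PReLU(),
            nn.AvgPool1d(kernel_size=3, stride=1, padding=1),
        )

    def forward(self, weighted_guide, g_global):  # both (batch, C, N)
        return self.net(torch.cat([weighted_guide, g_global], dim=1))


class Decoder(nn.Module):
    """Two transposed-convolution layers mapping the target features back to a waveform."""
    def __init__(self, channels=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose1d(channels, channels, kernel_size=3, padding=1), nn.PReLU(),
            nn.ConvTranspose1d(channels, 1, kernel_size=16, stride=8),
        )

    def forward(self, s):                        # s: (batch, C, N)
        return self.net(s)                       # estimated target waveform
```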
It should be noted that the above-mentioned structure is only a specific implementation manner provided in the embodiments of the present disclosure, and in practical applications, the adaptive adjustment may be performed based on the above-mentioned requirements, and the obtained model structure is still within the protection scope of the present application, which is not limited thereto.
In some embodiments, an activation function may be set between neuron nodes in the voice separation network. During forward propagation of the neural network, the activation function introduces non-linearity between layers, so that a non-linear mapping can ultimately be produced between the input and the output. The specific type of activation function may be chosen according to application requirements and may, for example, be a PReLU, which is not limited herein.
In some embodiments, after the voice separation network is constructed, network parameters in the voice separation network may be initialized. The network parameters may mainly include weight values and bias values between the neuron nodes, and the specific initialization process may be to initialize weights and biases between the neuron nodes in the voice separation network.
S130: and inputting the mixed voice sample data into a voice separation network to obtain a predicted target voice.
After the mixed voice sample data is input into the voice separation network, the data can be correspondingly processed based on the structures of all modules in the voice separation network and the data flow relation among different structures so as to obtain the final predicted target voice.
The process of obtaining the predicted target speech is further described in combination with the specific structure of the speech separation network. First, the encoder encodes the mixed voice sample data to obtain an encoded speech signal. The global encoder then obtains the encoded-signal features of the encoded speech signal. Next, the guide module obtains the guide-signal features corresponding to the guide speech signal, and the attention weights are determined from the comparison result of the encoded-signal features and the guide-signal features. The separation module then obtains the high-dimensional mapping features corresponding to the attention weights, the guide-signal features, and the encoded-signal features. Finally, the decoder decodes the high-dimensional mapping features to obtain the predicted target speech.
The above process is described in detail with an embodiment. For the encoder part, the mixed audio y is fed to the network input and encoded by the two-layer convolutional network to obtain G = {g_0, …, g_{N-1}}, where N is the output length of the second layer of the encoder.
For the global encoder part, the encoder output G is taken as the feature input of the module; a fully connected edge relation over all feature points is built as a graph, and graph-convolution feature extraction is performed to obtain G_g.
For the guide module, the guide encoding part receives the guide speech p corresponding to the target speech of each mixed audio, encodes it with a two-layer convolutional network, and extracts high-dimensional features to obtain H = {h_0, …, h_{M-1}}, where M is the output length of the second layer of the guide encoder. The attention mechanism part receives both the guide-encoded output H and G_g, and an internal scoring mechanism computes a score between every h_m and every g_n in G_g to obtain the attention weights α_{n,m}.
For the separation module part, the product of α_{n,m} and H, together with G_g, is input into the separation module to obtain the high-dimensional mapping feature S_m of the target.
For the decoder part, S_m is input into the decoder, and the estimated target speech ŝ is obtained after the two deconvolution layers.
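Tying these steps together, the following hedged sketch reuses the module classes from the structure sketch given earlier (and therefore inherits the same illustrative assumptions about layer sizes):

```python
import torch

class SpeechSeparationNet(torch.nn.Module):
    """Wrapper reproducing the data flow described above with the previously sketched modules."""
    def __init__(self):
        super().__init__()
        self.encoder = Encoder()           # mixed speech y -> G = {g_0, ..., g_(N-1)}
        self.global_enc = GlobalEncoder()  # G -> graph-convolution features G_g
        self.guide = GuideModule()         # guide speech p, G_g -> H and attention weights alpha_(n,m)
        self.separator = Separator()       # weighted H and G_g -> high-dimensional mapping S_m
        self.decoder = Decoder()           # S_m -> estimated target speech

    def forward(self, y, p):
        g = self.encoder(y)
        g_global = self.global_enc(g)
        h, alpha = self.guide(p, g_global)
        weighted = torch.bmm(h, alpha)     # multiply alpha_(n,m) with H, aligning it to the mixture frames
        s_m = self.separator(weighted, g_global)
        return self.decoder(s_m)

# Shape check with 2-second mixtures and guide utterances at 8 kHz.
y = torch.randn(4, 1, 16000)
p = torch.randn(4, 1, 16000)
s_hat = SpeechSeparationNet()(y, p)        # estimated target speech, shape (4, 1, 16000)
```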
The process of obtaining the predicted target speech is described based on the speech separation network in the embodiment in step S120, and in practical applications, the calculation process may be adaptively adjusted according to the specific structure of the speech separation network, which is not described herein again.
S140: and updating the voice separation network based on a preset loss function and the predicted target voice to obtain a voice extraction model.
After the predicted target voice is obtained, the voice separation network can be optimized by using a loss function according to the predicted target voice. The loss function may be a preset loss function corresponding to the voice separation network, and is used for evaluating the loss of the voice separation network according to the prediction result, and further correcting the voice separation network by combining the calculation result so as to obtain a more accurate prediction result.
Specifically, the preset loss function may be

L = -10 \log_{10} \left( \frac{\lVert s_{\mathrm{target}} \rVert_2^2}{\lVert e_{\mathrm{noise}} \rVert_2^2} \right)

wherein L is the preset loss function; s_{\mathrm{target}} = \frac{\langle \hat{s}, s \rangle \, s}{\lVert s \rVert_2^2} represents the effective (target) component of the speech signal, where s is the ideal target speech, which may be identified in advance by the labels attached to the mixed voice sample data, and \hat{s} is the predicted target speech; e_{\mathrm{noise}} = \hat{s} - s_{\mathrm{target}} represents the noise component of the speech signal; \langle \cdot , \cdot \rangle denotes the dot product between two vectors and \lVert \cdot \rVert_2 denotes the Euclidean norm. The loss is thus the negative scale-invariant signal-to-noise ratio (SNR) of the prediction.
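A hedged PyTorch sketch of this loss (the small eps terms are numerical-stability assumptions added here):

```python
import torch

def si_snr_loss(s_hat, s, eps=1e-8):
    """Negative scale-invariant SNR: L = -10 * log10(||s_target||^2 / ||e_noise||^2),
    with s_target = <s_hat, s> * s / ||s||^2 and e_noise = s_hat - s_target.
    Accepts waveforms of shape (batch, time) or (batch, 1, time)."""
    s_target = (torch.sum(s_hat * s, dim=-1, keepdim=True) * s) / (torch.sum(s ** 2, dim=-1, keepdim=True) + eps)
    e_noise = s_hat - s_target
    ratio = torch.sum(s_target ** 2, dim=-1) / (torch.sum(e_noise ** 2, dim=-1) + eps)
    return -10.0 * torch.log10(ratio + eps).mean()
```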
In some embodiments, the voice separation network may be optimized by using a gradient descent method in combination with the preset loss function. The optimization process may be to calculate a first gradient of the loss function at the output layer of the speech separation network, sequentially calculate the gradients corresponding to each layer in the speech separation network based on the first gradient, and finally update the weights and biases of the network in combination with the gradients of each layer.
Specifically, updating the parameters of the speech separation network with the gradient descent method may proceed as follows: the network parameters are held fixed for a certain period, the gradient of the loss function at the output layer is computed with the above formula, the output layer is taken as the L-th layer, and the gradients corresponding to layers L-1, L-2, …, 2 are computed in turn, where L is the number of layers of the neural network. After all gradients have been computed, the weights and biases of the whole network are updated according to the calculated gradients, thereby completing the optimization of the model and obtaining the pre-trained speech separation model.
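As an illustrative sketch of one training pass (the optimizer choice, learning rate, and data-loader format are assumptions; si_snr_loss and the network wrapper are taken from the sketches above):

```python
import torch

def train_epoch(model, train_loader, lr=1e-3):
    """One epoch of gradient-descent training for the speech separation network."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)   # plain gradient descent as described above
    model.train()
    for y, p, s in train_loader:           # mixed speech, guide speech, ideal target speech
        s_hat = model(y, p)                # forward propagation
        loss = si_snr_loss(s_hat, s)       # preset loss function defined earlier
        optimizer.zero_grad()
        loss.backward()                    # gradients of layers L, L-1, ..., 2 computed by back-propagation
        optimizer.step()                   # update the weights and biases of the whole network
    return model
```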
Accordingly, because the difference of the voice separation network is mainly embodied by the network parameters, the optimization process for the voice separation network can be mainly optimized for the network parameters. The specific optimization process may be adjusted based on the requirements of the actual application, and will not be described herein.
S150: extracting a target object voice signal from voice data to be processed by utilizing the voice extraction model; the voice data to be processed comprises a single-channel voice signal.
After the voice extraction model is obtained, the voice data to be processed is input into the trained network model (after any quantization and fine-tuning), and the separation result of the target speech can be obtained through the model's computation, meeting the requirements of subsequent speech recognition and other tasks. In some embodiments, after the speech extraction model is obtained by training, the model may be tested and verified to ensure the training effect of the model. Specifically, based on the implementation in step S110, after the mixed voice sample data is obtained, test sample data and verification sample data may be obtained from it.
The trained speech extraction model is used to extract the test target speech signal from the test sample data; the extracted test target speech signal is compared with the verification sample data, and the speech extraction model is optimized according to the comparison result. By analyzing the consistency between the predicted result and the original result, the prediction accuracy of the model can be judged effectively, so as to decide whether the model can be applied directly or needs to be trained again, thereby ensuring the training effect of the model.
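A possible sketch of this check, reporting the average scale-invariant SNR of the extracted test speech against the reference signals (reusing si_snr_loss from above; the loader format is an assumption):

```python
import torch

def evaluate(model, test_loader):
    """Average SI-SNR (in dB) of the extracted target speech over the test set."""
    model.eval()
    total, count = 0.0, 0
    with torch.no_grad():
        for y, p, s in test_loader:        # mixed speech, guide speech, reference target speech
            s_hat = model(y, p)
            total += -si_snr_loss(s_hat, s).item()   # negative loss equals the SI-SNR
            count += 1
    return total / count
```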
After the voice extraction model is obtained, the voice of the target object in the single-channel voice can be accurately and effectively extracted, so that the subsequent application process is effectively ensured. The specific process of extracting the speech may be set based on the requirements of the actual application, and is not described herein again.
Based on the introduction of the above embodiment, it can be seen that, in the above method, by constructing a voice separation network including an encoder, a global encoder, a guide module, a separation module, and a decoder, a predicted target voice corresponding to mixed voice sample data is obtained through the voice separation network of the above structure, and then a preset loss function is used to update the voice separation network in combination with the predicted target voice, so as to obtain a voice extraction model, thereby being capable of extracting a corresponding target object voice signal from voice data to be processed. The method ensures that the single-channel voice signal is accurately and effectively extracted, and further can perform voice recognition and other utilization on the extracted voice in the subsequent process, thereby improving the use experience of users.
A speech extraction device according to an embodiment of the present description is introduced based on the speech extraction method corresponding to fig. 1. The voice extraction device is arranged on the voice extraction equipment. As shown in fig. 6, the speech extraction apparatus includes the following modules.
A mixed voice sample data obtaining module 610, configured to obtain mixed voice sample data; the mixed voice sample data is a single-channel voice signal; the mixed voice sample data comprises at least one of a noise signal, an interference voice signal and a reverberation signal and a guide voice; the guide voice includes a voice corresponding to a target object.
A voice separation network construction module 620, configured to construct a voice separation network; the voice separation network comprises an encoder, a global encoder, a guide module, a separation module and a decoder; the encoder and the global encoder are used for outputting the characteristics of the voice signal; the guide module is used for outputting a weight value according to a comparison result of guide voice and mixed voice sample data; the separation module is used for obtaining high-dimensional mapping of the target voice; the decoder is used for decoding the data to obtain the target voice.
A mixed voice sample data input module 630, configured to input the mixed voice sample data into a voice separation network to obtain a predicted target voice.
And the voice separation network updating module 640 is configured to update the voice separation network based on a preset loss function and the predicted target voice to obtain a voice extraction model.
A target object voice signal extraction module 650, configured to extract a target object voice signal from the voice data to be processed by using the voice extraction model; the voice data to be processed comprises a single-channel voice signal.
Based on the speech extraction method corresponding to fig. 1, an embodiment of the present specification provides a speech extraction device. As shown in fig. 7, the speech extraction device may include a memory and a processor.
In this embodiment, the memory may be implemented in any suitable manner. For example, the memory may be a read-only memory, a mechanical hard disk, a solid state disk, a U disk, or the like. The memory may be used to store computer program instructions.
In this embodiment, the processor may be implemented in any suitable manner. For example, the processor may take the form of, for example, a microprocessor or processor and a computer-readable medium that stores computer-readable program code (e.g., software or firmware) executable by the (micro) processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, an embedded microcontroller, and so forth. The processor may execute the computer program instructions to perform the steps of: acquiring mixed voice sample data; the mixed voice sample data is a single-channel voice signal; the mixed voice sample data comprises at least one of a noise signal, an interference voice signal and a reverberation signal and a guide voice; the guide voice includes a voice corresponding to a target object; constructing a voice separation network; the voice separation network comprises an encoder, a global encoder, a guide module, a separation module and a decoder; the encoder and the global encoder are used for outputting the characteristics of the voice signal; the guide module is used for outputting a weight value according to a comparison result of guide voice and mixed voice sample data; the separation module is used for obtaining high-dimensional mapping of the target voice; the decoder is used for decoding the data to obtain target voice; inputting the mixed voice sample data into a voice separation network to obtain a predicted target voice; updating the voice separation network based on a preset loss function and the predicted target voice to obtain a voice extraction model; extracting a target object voice signal from voice data to be processed by utilizing the voice extraction model; the voice data to be processed comprises a single-channel voice signal.
While the process flows described above include operations that occur in a particular order, it should be appreciated that the processes may include more or less operations that are performed sequentially or in parallel (e.g., using parallel processors or a multi-threaded environment).
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the specification. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer readable media do not include transitory computer readable media such as modulated data signals and carrier waves.
As will be appreciated by one skilled in the art, embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, embodiments of the present description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present description may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.
The embodiments of this specification may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The described embodiments may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment. In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of an embodiment of the specification. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. A method of speech extraction, comprising:
acquiring mixed voice sample data; the mixed voice sample data is a single-channel voice signal; the mixed voice sample data comprises at least one of a noise signal, an interference voice signal and a reverberation signal and a guide voice; the guide voice includes a voice corresponding to a target object;
constructing a voice separation network; the voice separation network comprises an encoder, a global encoder, a guide module, a separation module and a decoder; the encoder and the global encoder are used for outputting the characteristics of the voice signal; the guide module is used for outputting a weight value according to a comparison result of guide voice and mixed voice sample data; the separation module is used for obtaining high-dimensional mapping of the target voice; the decoder is used for decoding the data to obtain target voice;
inputting the mixed voice sample data into a voice separation network to obtain a predicted target voice;
updating the voice separation network based on a preset loss function and the predicted target voice to obtain a voice extraction model;
extracting a target object voice signal from voice data to be processed by utilizing the voice extraction model; the voice data to be processed comprises a single-channel voice signal.
2. The method of claim 1, wherein the mixed voice sample data is obtained by:
mixing at least two voice signals in a first signal-to-noise ratio range to obtain a voice mixed voice signal;
mixing the voice mixed voice signal with a noise signal within a second signal-to-noise ratio range to obtain a comprehensive voice signal;
and processing the comprehensive voice signal by utilizing a voice signal generating function to obtain mixed voice sample data.
3. The method of claim 1, in which the mixed voice sample data comprises training sample data, verification sample data, and test sample data; inputting the mixed voice sample data into a voice separation network to obtain a predicted target voice, wherein the method comprises the following steps:
inputting the training sample data into a voice separation network to obtain a predicted target voice;
before extracting the target object voice signal from the voice data to be processed by using the voice extraction model, the method further comprises the following steps:
extracting a test target voice signal in the test sample data by using the voice extraction model;
optimizing the voice extraction model according to the comparison result of the test target voice signal and the verification sample data;
correspondingly, the extracting the target object voice signal from the voice data to be processed by using the voice extraction model includes:
and extracting the target object voice signal from the voice data to be processed by using the optimized voice extraction model.
4. The method of claim 1, wherein the encoder is configured to receive mixed speech sample data and to pass the processed mixed speech sample data to a global encoder; the global encoder transmits the processed data to the separation module and the guide module respectively; the guide module synthesizes guide voice and data received by the global encoder to obtain a weight value; after the separation module processes the data, the output data is combined with the weight value and then is input into a decoder;
the encoder comprises two layers of convolutional networks; the global encoder comprises a layer of graph convolution network; the guide module comprises two convolution layers and a full connection layer; the separation module comprises four convolution layers and a pooling layer; the decoder includes two layers of deconvolution networks.
5. The method of claim 1, wherein said inputting the mixed voice sample data into a voice separation network to obtain a predicted target voice comprises:
encoding the mixed voice sample data by using an encoder to obtain an encoded voice signal;
acquiring coded signal characteristics of the coded voice signal through a global coder;
acquiring a guide signal characteristic corresponding to a guide voice signal based on a guide module;
determining attention weight according to the comparison result of the coding signal characteristic and the guide signal characteristic;
obtaining, by a separation module, high-dimensional mapping features corresponding to the attention weights, guide signal features, and encoded signal features;
and decoding the high-dimensional mapping characteristics by using a decoder to obtain the predicted target voice.
6. The method of claim 1, wherein said updating the speech separation network based on the pre-set loss function and the predicted target speech results in a speech extraction model comprising:
initializing neural network parameters in the voice separation network so as to enable the neural network to carry out forward propagation; wherein, include: initializing weights and biases between neuron nodes in the voice separation network;
calculating a loss function of the voice separation network based on the mixed voice sample data;
and updating the neural network parameters by using a gradient descent method according to the loss function.
7. The method of claim 6, wherein activation functions are set between neuron nodes in the voice separation network; the activation function is used to generate a non-linear mapping between the inputs and outputs corresponding to the neuron nodes during the neural network forward propagation.
8. The method of claim 6, wherein said calculating a loss function for the voice separation network based on the mixed voice sample data comprises:
acquiring a predicted target voice corresponding to the mixed voice sample data by utilizing a voice separation network;
using the formula L = -10 \log_{10} \left( \lVert s_{\mathrm{target}} \rVert_2^2 / \lVert e_{\mathrm{noise}} \rVert_2^2 \right) to calculate the loss function, wherein L is the loss function, s_{\mathrm{target}} = \langle \hat{s}, s \rangle \, s / \lVert s \rVert_2^2, s is the ideal target speech, \hat{s} is the predicted target speech, and e_{\mathrm{noise}} = \hat{s} - s_{\mathrm{target}}.
9. a speech extraction device, comprising:
the mixed voice sample data acquisition module is used for acquiring mixed voice sample data; the mixed voice sample data is a single-channel voice signal; the mixed voice sample data comprises at least one of a noise signal, an interference voice signal and a reverberation signal and a guide voice; the guide voice includes a voice corresponding to a target object;
the voice separation network construction module is used for constructing a voice separation network; the voice separation network comprises an encoder, a global encoder, a guide module, a separation module and a decoder; the encoder and the global encoder are used for outputting the characteristics of the voice signal; the guide module is used for outputting a weight value according to a comparison result of guide voice and mixed voice sample data; the separation module is used for obtaining high-dimensional mapping of the target voice; the decoder is used for decoding the data to obtain target voice;
the mixed voice sample data input module is used for inputting the mixed voice sample data into a voice separation network to obtain predicted target voice;
the voice separation network updating module is used for updating the voice separation network based on a preset loss function and the predicted target voice to obtain a voice extraction model;
the target object voice signal extraction module is used for extracting a target object voice signal from voice data to be processed by utilizing the voice extraction model; the voice data to be processed comprises a single-channel voice signal.
10. A speech extraction device comprising a memory and a processor;
the memory to store computer program instructions;
the processor to execute the computer program instructions to implement the steps of: acquiring mixed voice sample data; the mixed voice sample data is a single-channel voice signal; the mixed voice sample data comprises at least one of a noise signal, an interference voice signal and a reverberation signal and a guide voice; the guide voice includes a voice corresponding to a target object; constructing a voice separation network; the voice separation network comprises an encoder, a global encoder, a guide module, a separation module and a decoder; the encoder and the global encoder are used for outputting the characteristics of the voice signal; the guide module is used for outputting a weight value according to a comparison result of guide voice and mixed voice sample data; the separation module is used for obtaining high-dimensional mapping of the target voice; the decoder is used for decoding the data to obtain target voice; inputting the mixed voice sample data into a voice separation network to obtain a predicted target voice; updating the voice separation network based on a preset loss function and the predicted target voice to obtain a voice extraction model; extracting a target object voice signal from voice data to be processed by utilizing the voice extraction model; the voice data to be processed comprises a single-channel voice signal.
CN202111033767.0A 2021-09-03 2021-09-03 Voice extraction method, device and equipment Pending CN113782006A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111033767.0A CN113782006A (en) 2021-09-03 2021-09-03 Voice extraction method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111033767.0A CN113782006A (en) 2021-09-03 2021-09-03 Voice extraction method, device and equipment

Publications (1)

Publication Number Publication Date
CN113782006A true CN113782006A (en) 2021-12-10

Family

ID=78840980

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111033767.0A Pending CN113782006A (en) 2021-09-03 2021-09-03 Voice extraction method, device and equipment

Country Status (1)

Country Link
CN (1) CN113782006A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190066713A1 (en) * 2016-06-14 2019-02-28 The Trustees Of Columbia University In The City Of New York Systems and methods for speech separation and neural decoding of attentional selection in multi-speaker environments
CN110246487A (en) * 2019-06-13 2019-09-17 苏州思必驰信息科技有限公司 Optimization method and system for single pass speech recognition modeling
CN111179911A (en) * 2020-01-02 2020-05-19 腾讯科技(深圳)有限公司 Target voice extraction method, device, equipment, medium and joint training method
CN111899757A (en) * 2020-09-29 2020-11-06 南京蕴智科技有限公司 Single-channel voice separation method and system for target speaker extraction
CN113053407A (en) * 2021-02-06 2021-06-29 南京蕴智科技有限公司 Single-channel voice separation method and system for multiple speakers
US20210256993A1 (en) * 2020-02-18 2021-08-19 Facebook, Inc. Voice Separation with An Unknown Number of Multiple Speakers

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190066713A1 (en) * 2016-06-14 2019-02-28 The Trustees Of Columbia University In The City Of New York Systems and methods for speech separation and neural decoding of attentional selection in multi-speaker environments
CN110246487A (en) * 2019-06-13 2019-09-17 苏州思必驰信息科技有限公司 Optimization method and system for single pass speech recognition modeling
CN111179911A (en) * 2020-01-02 2020-05-19 腾讯科技(深圳)有限公司 Target voice extraction method, device, equipment, medium and joint training method
US20210256993A1 (en) * 2020-02-18 2021-08-19 Facebook, Inc. Voice Separation with An Unknown Number of Multiple Speakers
CN111899757A (en) * 2020-09-29 2020-11-06 南京蕴智科技有限公司 Single-channel voice separation method and system for target speaker extraction
CN113053407A (en) * 2021-02-06 2021-06-29 南京蕴智科技有限公司 Single-channel voice separation method and system for multiple speakers

Similar Documents

Publication Publication Date Title
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
KR101807948B1 (en) Ensemble of Jointly Trained Deep Neural Network-based Acoustic Models for Reverberant Speech Recognition and Method for Recognizing Speech using the same
CN112071330B (en) Audio data processing method and device and computer readable storage medium
CN110706692A (en) Training method and system of child voice recognition model
WO2020166322A1 (en) Learning-data acquisition device, model learning device, methods for same, and program
CN112837669A (en) Voice synthesis method and device and server
JP5994639B2 (en) Sound section detection device, sound section detection method, and sound section detection program
CN113963715A (en) Voice signal separation method and device, electronic equipment and storage medium
CN113763936A (en) Model training method, device and equipment based on voice extraction
CN113744719A (en) Voice extraction method, device and equipment
Giacobello et al. Stable 1-norm error minimization based linear predictors for speech modeling
CN113782006A (en) Voice extraction method, device and equipment
CN110867191A (en) Voice processing method, information device and computer program product
Kim et al. A comparison of streaming models and data augmentation methods for robust speech recognition
Ivry et al. Evaluation of deep-learning-based voice activity detectors and room impulse response models in reverberant environments
CN116129856A (en) Training method of speech synthesis model, speech synthesis method and related equipment
KR20230124266A (en) Speech synthesis method and apparatus using adversarial learning technique
CN115273862A (en) Voice processing method, device, electronic equipment and medium
CN115881157A (en) Audio signal processing method and related equipment
JP7088796B2 (en) Learning equipment and programs for learning statistical models used in speech synthesis
CN115798453A (en) Voice reconstruction method and device, computer equipment and storage medium
Basov et al. Optimization of pitch tracking and quantization
WO2021062705A1 (en) Single-sound channel robustness speech keyword real-time detection method
CN112750469A (en) Method for detecting music in voice, voice communication optimization method and corresponding device
JP7205635B2 (en) Audio signal processing device, audio signal processing method, audio signal processing program, learning device, learning method and learning program

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination