CN115662463A - Voice separation method, device, equipment and storage medium

Voice separation method, device, equipment and storage medium

Info

Publication number: CN115662463A
Application number: CN202211313547.8A
Authority: CN (China)
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 姜彦吉, 邱友利, 郑四发
Assignee (current and original): Suzhou Automotive Research Institute of Tsinghua University
Application filed by: Suzhou Automotive Research Institute of Tsinghua University

Landscapes

  • Telephonic Communication Services (AREA)

Abstract

The invention discloses a voice separation method, device, equipment and storage medium. The method comprises the following steps: acquiring voice data to be separated, and determining audio features of the voice data to be separated and auxiliary features of the voice data to be separated, wherein the auxiliary features comprise voiceprint features, emotion features and deep features; determining fusion features according to the audio features and the auxiliary features; and determining a separation result of the voice data to be separated according to the fusion features and the audio features. The technical scheme solves the problem of low voice separation accuracy, and ensures the stability and anti-interference performance of the separation effect while improving the accuracy of voice separation.

Description

Voice separation method, device, equipment and storage medium
Technical Field
The present invention relates to the field of voice data processing technologies, and in particular, to a voice separation method, apparatus, device, and storage medium.
Background
Deep-learning-based speech separation systems typically employ an end-to-end architecture consisting of an encoder, a separator, and a decoder. However, the latent features that the encoder extracts from the mixed speech alone struggle to fully and accurately represent the necessary characteristics of the mixed speech. Incorporating multi-modal features into speech separation is therefore a promising solution.
At present, some voice separation systems assist separation by extracting visual features from images of a speaker's expression, mouth shape and the like. However, relying on visual features requires a large amount of supporting image data, which is costly. In addition, in real scenes, images acquired in real time easily become unusable due to occlusion or insufficient light, which can even degrade the quality of voice separation, greatly limiting the application of such systems in practice. Therefore, how to improve the accuracy of speech separation by adding targeted auxiliary features is an urgent problem to be solved.
Disclosure of Invention
The invention provides a voice separation method, device, equipment and storage medium, which solve the problem of low voice separation accuracy and ensure the stability and anti-interference performance of the separation effect while improving the accuracy of voice separation.
According to an aspect of the present invention, there is provided a voice separation method, the method including:
acquiring voice data to be separated, and determining audio features of the voice data to be separated and auxiliary features of the voice data to be separated; wherein the auxiliary features comprise voiceprint features, emotion features and deep features;
determining fusion characteristics according to the audio characteristics and the auxiliary characteristics;
and determining a separation result of the voice data to be separated according to the fusion characteristic and the audio characteristic.
According to another aspect of the present invention, there is provided a voice separating apparatus, including:
the feature determining module is used for acquiring the voice data to be separated and determining the audio features of the voice data to be separated and the auxiliary features of the voice data to be separated; wherein the auxiliary features comprise voiceprint features, emotion features and deep features;
the fusion characteristic determining module is used for determining fusion characteristics according to the audio characteristics and the auxiliary characteristics;
and the separation result determining module is used for determining the separation result of the voice data to be separated according to the fusion characteristic and the audio characteristic.
According to another aspect of the present invention, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to enable the at least one processor to perform a speech separation method according to any of the embodiments of the present invention.
According to another aspect of the present invention, there is provided a computer-readable storage medium storing computer instructions for causing a processor to implement the speech separation method according to any one of the embodiments of the present invention when the computer instructions are executed.
According to the technical scheme of the embodiment of the invention, the problem of low voice separation accuracy is solved by fusing auxiliary features such as voiceprint features, emotional features and deep features on the basis of the audio features, and the stability and the anti-interference performance of the separation effect can be ensured while the voice separation accuracy is improved.
It should be understood that the statements in this section are not intended to identify key or critical features of the embodiments of the present invention, nor are they intended to limit the scope of the invention. Other features of the present invention will become apparent from the following description.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a flowchart of a voice separation method according to an embodiment of the present invention;
Fig. 2 is a flowchart of a speech separation method according to a second embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a speech separation apparatus according to a third embodiment of the present invention;
Fig. 4 is a schematic structural diagram of an electronic device implementing the voice separation method according to the embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in other sequences than those illustrated or described herein. Moreover, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. According to the technical scheme, the data acquisition, storage, use, processing and the like meet relevant regulations of national laws and regulations.
Example one
Fig. 1 is a flowchart of a voice separation method according to an embodiment of the present invention, where the present embodiment is applicable to a hybrid voice separation scenario, and the method may be executed by a voice separation apparatus, which may be implemented in a form of hardware and/or software, and the apparatus may be configured in an electronic device. As shown in fig. 1, the method includes:
s110, voice data to be separated are obtained, and the audio features of the voice data to be separated and the auxiliary features of the voice data to be separated are determined.
The scheme can be executed by a voice separation system, which may include voice acquisition equipment such as a microphone for acquiring the voice data to be separated in the deployment scene. The voice separation system can also directly read voice data to be separated that has already been generated. The voice data to be separated may be mixed voice data containing the voices of at least two speakers. After obtaining the voice data to be separated, the voice separation system can perform operations such as pre-emphasis, framing, windowing, transformation and filtering on it, and extract its audio features. The voice separation system can also perform feature extraction on the voice data to be separated based on a deep learning algorithm to obtain the audio features. In a preferred approach, the voice separation system may encode the voice data to be separated into a frame-based multi-dimensional embedding sequence through an audio encoder.
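For illustration, a minimal sketch of such an audio encoder is given below. The one-dimensional convolutional structure and all parameter values (channel count, kernel size, stride) are assumptions for the example, not parameters fixed by the patent.

```python
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """Sketch of a 1-D convolutional audio encoder (hypothetical parameters).

    Maps a raw waveform of shape [B, 1, L] to a frame-based embedding sequence
    of shape [B, N, T], where N is the channel dimension and T the number of frames.
    """
    def __init__(self, n_channels: int = 256, kernel_size: int = 16, stride: int = 8):
        super().__init__()
        self.conv = nn.Conv1d(1, n_channels, kernel_size, stride=stride, bias=False)
        self.activation = nn.ReLU()

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        return self.activation(self.conv(waveform))

# Example: a batch of two 4-second mixtures sampled at 8 kHz.
mixture = torch.randn(2, 1, 32000)           # [B, 1, L]
audio_features = AudioEncoder()(mixture)     # [B, 256, T]
```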
The voice separation system may further extract an auxiliary feature of the voice data to be separated, where the auxiliary feature may be other features of the voice data to be separated besides the audio feature, and may include features such as a voiceprint feature, an emotion feature, a deep feature, and a timing feature. In the scheme, optionally, the voiceprint features are determined by performing feature extraction on the voice data to be separated through a voiceprint feature extractor; the emotional characteristics are determined based on frequency spectrum information of the voice data to be separated; the deep features are obtained by performing domain conversion based on the features of the voice data to be separated output by the deep feature extractor.
The voice separation system may employ the speaker diarization open-source toolkit pyannote as a voiceprint feature extractor to extract speaker embedding vectors as the voiceprint features of the voice data to be separated. After obtaining the voiceprint features, the voice separation system can encode them through a one-dimensional convolutional neural network to facilitate feature fusion.
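As a minimal sketch (the checkpoint name and the exact pyannote API usage are assumptions; the patent only states that the pyannote toolkit supplies the speaker embeddings), the voiceprint feature could be obtained and encoded roughly as follows:

```python
import torch
import torch.nn as nn
from pyannote.audio import Model, Inference

# Assumed pretrained speaker-embedding checkpoint (hypothetical choice).
embedding_model = Model.from_pretrained("pyannote/embedding")
inference = Inference(embedding_model, window="whole")

# One embedding vector per utterance, e.g. a 512-dimensional array.
speaker_embedding = torch.tensor(inference("mixture.wav")).float()

# Encode the voiceprint with a one-dimensional convolution so it can be fused later.
voiceprint_encoder = nn.Conv1d(speaker_embedding.numel(), 256, kernel_size=1)
voiceprint_feature = voiceprint_encoder(speaker_embedding.view(1, -1, 1))   # [1, 256, 1]
```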
Emotion is difficult to describe quantitatively, and the emotion features are used to characterize the emotional content of the voice data to be separated more comprehensively. The voice separation system may determine the spectrum information of the voice data to be separated after performing signal processing operations such as framing, windowing, transformation, and filtering on it. The spectrum information may include the average value of the spectral flatness, the average value of the zero crossing rate of the audio time series, the p-order spectral bandwidth, amplitude statistics, spectral centroid statistics, tuning offset statistics, root-mean-square statistics, mel-frequency cepstral coefficient statistics, and the like. The average value of the spectral flatness, the average value of the zero crossing rate of the audio time series and the p-order spectral bandwidth can be used to represent the spectral characteristics of the voice data to be separated. Information such as the spectral centroid statistics can be used to measure its brightness characteristics, and information such as the root-mean-square statistics can be used to describe its loudness characteristics. Besides static characteristics such as the spectral, brightness and loudness characteristics, the voice separation system can also describe the dynamic characteristics of the voice data to be separated through features such as the first-order and second-order differences of the mel-frequency cepstral coefficient features.
It is easily understood that the deep features may be features of the voice data to be separated extracted at a deeper level than the audio features. The voice separation system can take a pre-trained speech model such as WavLM as a deep feature extractor, extract deep features from the voice data to be separated, and then adaptively adjust the deep features extracted by the pre-trained model through a domain conversion network so that they fit the current voice separation scene. The voice separation system can also extract the temporal features of the voice data to be separated through a recurrent convolutional neural network to obtain information about the vocalization timing of different speakers, further assisting the accurate separation of the voice data to be separated.
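A minimal sketch of the deep-feature branch follows; the Hugging Face WavLM checkpoint and the simple linear domain-conversion layer are assumptions, since the patent only names WavLM-style pretrained models and a domain conversion network in general terms.

```python
import torch
import torch.nn as nn
from transformers import AutoFeatureExtractor, WavLMModel

# Assumed checkpoint; any WavLM-style pretrained speech model could play this role.
feature_extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-base-plus")
wavlm = WavLMModel.from_pretrained("microsoft/wavlm-base-plus")

waveform = torch.randn(16000 * 4)            # 4 seconds of audio at 16 kHz
inputs = feature_extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    hidden = wavlm(**inputs).last_hidden_state            # [1, T', 768]

# Domain conversion network: adapts the pretrained features to the separation task.
domain_converter = nn.Sequential(nn.Linear(768, 256), nn.ReLU())
deep_features = domain_converter(hidden).transpose(1, 2)  # [1, 256, T']
```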
S120, determining fusion features according to the audio features and the auxiliary features.
The audio features and the auxiliary features can have the same dimensionality, in which case the voice separation system can directly combine and splice them to obtain the fusion features. If the dimensionality of the audio features and that of the auxiliary features differ, the voice separation system can adjust them to a consistent dimensionality by means such as feature screening and feature copying, to facilitate combination and splicing.
S130, determining a separation result of the voice data to be separated according to the fusion features and the audio features.
After the fusion features are obtained, the voice separation system may input them into the separation network, and determine the separation result of the voice data to be separated according to the output of the separation network and the audio features. The separation network may be constructed based on a globally attentive, locally recurrent (GALR) network.
According to the technical scheme, the voiceprint feature, the emotion feature, the deep feature and other auxiliary features are fused on the basis of the audio features, so that the problem of low voice separation accuracy is solved, and the stability and the anti-interference performance of the separation effect can be guaranteed while the voice separation accuracy is improved.
Example two
Fig. 2 is a flowchart of a speech separation method according to a second embodiment of the present invention, which is a refinement of the above embodiment. As shown in fig. 2, the method includes:
s210, voice data to be separated are obtained, and the audio features of the voice data to be separated and the auxiliary features of the voice data to be separated are determined.
In this scheme, the auxiliary features may include voiceprint features, emotion features, and deep features. The voiceprint features are determined by extracting features of the voice data to be separated through a voiceprint feature extractor; the emotion features are determined based on spectrum information of the voice data to be separated; the deep features are obtained by domain conversion based on the features of the voice data to be separated output by a deep feature extractor. In one possible approach, the emotion features include static features and dynamic features, wherein the static features include spectral characteristic features, brightness features, and loudness features.
The spectral characteristic features may include the average value of the spectral flatness, the average value of the zero crossing rate of the audio time series, and the p-order spectral bandwidth. The brightness features may include the mean, standard deviation, and maximum of the spectral centroid. The loudness features may include the mean, standard deviation, and maximum of the root mean square. The dynamic features may include the mel-frequency cepstral coefficient (MFCC) features, the average of the MFCC features, the standard deviation of the MFCC features, the maximum of the MFCC features, the first-order difference of the MFCC features, the second-order difference of the MFCC features, and the like.
In a specific example, the emotional characteristics may be characteristics formed by 276-dimensional parameters, and the parameter content of each dimension may be as shown in the following table 1:
Table 1:
Feature number    Feature name
0                 Average value of the spectral flatness
1                 Average value of the zero crossing rate of the audio time series
2~4               Mean, standard deviation and maximum of the amplitude
5~7               Mean, standard deviation and maximum of the spectral centroid
8~11              Tuning offset and its mean, standard deviation and maximum
12~14             Mean, standard deviation and maximum of the root mean square (RMS)
15~86             0-24 order MFCC features and their mean, standard deviation and maximum
87~134            First-order and second-order differences of the 0-24 order MFCCs
135~146           Chromagram
147~274           Mel frequencies
275               p-order spectral bandwidth (default p = 2)
According to the scheme, the multidimensional characteristics of the voice data to be separated can be acquired, and accurate voice separation can be realized according to the multidimensional characteristics.
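For illustration, the sketch below shows how the building blocks of Table 1 could be computed. The librosa-based implementation, the parameter values and the exact ordering are assumptions not fixed by the patent, so the resulting dimensionality may differ slightly from 276.

```python
import numpy as np
import librosa

def emotion_features(y: np.ndarray, sr: int, p: int = 2) -> np.ndarray:
    """Illustrative per-utterance emotion features in the spirit of Table 1."""
    stats = lambda x: [float(np.mean(x)), float(np.std(x)), float(np.max(x))]

    feats = [float(np.mean(librosa.feature.spectral_flatness(y=y)))]          # average spectral flatness
    feats += [float(np.mean(librosa.feature.zero_crossing_rate(y)))]          # average zero crossing rate
    feats += stats(np.abs(y))                                                 # amplitude statistics
    feats += stats(librosa.feature.spectral_centroid(y=y, sr=sr))             # spectral centroid statistics
    feats += [float(librosa.estimate_tuning(y=y, sr=sr))]                     # tuning offset
    feats += stats(librosa.feature.rms(y=y))                                  # loudness (RMS) statistics

    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=25)                        # 0-24 order MFCCs
    feats += list(np.mean(mfcc, axis=1)) + stats(mfcc)                        # MFCCs and their statistics
    feats += list(np.mean(librosa.feature.delta(mfcc), axis=1))               # first-order differences
    feats += list(np.mean(librosa.feature.delta(mfcc, order=2), axis=1))      # second-order differences
    feats += list(np.mean(librosa.feature.chroma_stft(y=y, sr=sr), axis=1))   # chromagram (12 bins)
    feats += list(np.mean(librosa.feature.melspectrogram(y=y, sr=sr), axis=1))  # mel bands (128 by default)
    feats += [float(np.mean(librosa.feature.spectral_bandwidth(y=y, sr=sr, p=p)))]  # p-order spectral bandwidth

    return np.asarray(feats, dtype=np.float32)

# Example usage with a placeholder file name.
y, sr = librosa.load("mixture.wav", sr=16000)
print(emotion_features(y, sr).shape)
```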
It should be noted that, in the present scheme, the audio features and the auxiliary features may each include three dimensions: frame, time, and channel. To ensure correspondence between features, the audio features and the auxiliary features may have the same frame dimension. Because they are extracted in different ways, the audio features and the auxiliary features may differ in the time and channel dimensions. Typically, to cover more comprehensive time-span information, the time dimension of the audio features may be greater than or equal to that of the auxiliary features.
S220, adjusting the auxiliary features to have the same time dimension as the audio features.
If the time dimension of the audio features is larger than that of the auxiliary features, the voice separation system can replicate each auxiliary feature to obtain features consistent with the time dimension of the audio features.
S230, splicing the audio features and the auxiliary features in the time dimension or the channel dimension, and determining the fusion features according to the splicing result.
After the audio features and the auxiliary features with uniform time dimension are obtained, the voice separation system can splice the audio features and the auxiliary features in the time dimension or in the channel dimension, and the splicing result is used as the fusion feature.
In a possible solution, the splicing the audio features and the auxiliary features in the time dimension, and determining the fusion features according to the splicing result includes:
and splicing the audio features and the auxiliary features in the time dimension, and performing remodeling operation on a splicing result to obtain fusion features matched with the time dimension of the audio features.
It should be noted that, if the audio features and the auxiliary features are spliced in the time dimension, the speech separation system may perform a reshaping operation on the splicing result so that the time dimension remains unchanged after the features are fused, obtaining a fusion feature consistent with the time dimension of the audio features. For example, suppose the audio feature, the voiceprint feature, the emotion feature and the deep feature after the time-dimension adjustment all have dimensions [B, N, T], where the first element represents the frame dimension, the second the channel dimension, and the third the time dimension. After splicing them in the time dimension, the splicing result has dimensions [B, N, T+T+T+T]. The voice separation system can then reshape this result into a fusion feature of dimensions [B, N+N+N+N, T].
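A minimal PyTorch sketch of this replicate-splice-reshape step follows; all tensor sizes and the replication strategy are illustrative assumptions.

```python
import torch

B, N, T = 4, 256, 500                  # frame (batch), channel and time dimensions
audio = torch.randn(B, N, T)           # audio features
voiceprint = torch.randn(B, N, 1)      # auxiliary features with shorter time dimensions
emotion = torch.randn(B, N, 1)
deep = torch.randn(B, N, 125)

def match_time(x: torch.Tensor, t: int) -> torch.Tensor:
    """Replicate an auxiliary feature along time until it matches the audio features."""
    reps = -(-t // x.shape[-1])                    # ceiling division
    return x.repeat(1, 1, reps)[..., :t]

aux = [match_time(f, T) for f in (voiceprint, emotion, deep)]

# Splice on the time dimension, then reshape back to the original time dimension.
spliced = torch.cat([audio, *aux], dim=-1)         # [B, N, T+T+T+T]
fused = spliced.reshape(B, 4 * N, T)               # [B, N+N+N+N, T]
print(fused.shape)                                 # torch.Size([4, 1024, 500])
```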
S240, inputting the fusion characteristics into a voice separation network, and determining the separated voice prediction characteristics.
The speech separation system may input the fused features into a speech separation network to obtain separated speech prediction features, where the speech separation network may employ the scale-invariant signal-to-noise ratio (SI-SNR) as its loss function. In its standard form, the loss can be expressed as

$$\mathcal{L}_{\text{SI-SNR}} = -10 \log_{10} \frac{\lVert s_{\text{target}} \rVert^2}{\lVert e_{\text{noise}} \rVert^2},$$

where

$$s_{\text{target}} = \frac{\langle \hat{s}, s \rangle}{\lVert s \rVert^2}\, s, \qquad e_{\text{noise}} = \hat{s} - s_{\text{target}},$$

$\hat{s}$ represents the separated speech prediction feature, and $s$ represents the audio feature used as the reference.
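For reference, a minimal PyTorch sketch of this SI-SNR loss follows; it is the standard formulation, and details such as the zero-mean normalization are assumptions rather than steps taken from the patent.

```python
import torch

def si_snr_loss(s_hat: torch.Tensor, s: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Negative scale-invariant SNR between a prediction s_hat and a reference s, both [B, L]."""
    # Zero-mean both signals (a common, assumed preprocessing step).
    s_hat = s_hat - s_hat.mean(dim=-1, keepdim=True)
    s = s - s.mean(dim=-1, keepdim=True)

    # Project the prediction onto the reference to obtain the target component.
    s_target = (torch.sum(s_hat * s, dim=-1, keepdim=True)
                / (torch.sum(s ** 2, dim=-1, keepdim=True) + eps)) * s
    e_noise = s_hat - s_target

    si_snr = 10 * torch.log10(torch.sum(s_target ** 2, dim=-1)
                              / (torch.sum(e_noise ** 2, dim=-1) + eps) + eps)
    return -si_snr.mean()

# Example
prediction = torch.randn(2, 32000)
reference = torch.randn(2, 32000)
print(si_snr_loss(prediction, reference))
```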
S250, determining a separation result of the audio data to be detected according to the separated speech prediction features and the audio features.
Optionally, determining a separation result of the audio data to be detected according to the separated speech prediction feature and the audio feature includes:
determining a dot product result of the separated speech prediction feature and the audio feature;
and taking the dot product result as the input of an audio decoder, and determining the separation result of the audio data to be detected according to the output of the audio decoder.
The voice separation system can perform a dot product (element-wise multiplication) of the separated speech prediction features and the audio features, and input the result into the audio decoder to obtain the clean speech of each speaker. The audio decoder may have a structure inverse to that of the feature extractor, and is used to restore the features to the separated voice data; it may include deconvolution, unpooling and similar structures. It should be noted that, to ensure consistent restoration, parameter settings in the audio decoder, such as the convolution kernel size and convolution stride, are generally kept consistent with those in the feature extractor.
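A minimal sketch of this masking-and-decoding step follows; the transposed-convolution parameters simply mirror the hypothetical encoder sketched earlier and are assumptions, as is the sigmoid masking.

```python
import torch
import torch.nn as nn

B, N, T, n_speakers = 2, 256, 3999, 2

audio_features = torch.randn(B, N, T)                       # encoder output
masks = torch.sigmoid(torch.randn(B, n_speakers, N, T))     # separated speech prediction features

# Dot product (element-wise multiplication) of each prediction with the audio features.
masked = masks * audio_features.unsqueeze(1)                # [B, n_speakers, N, T]

# Audio decoder: mirrors the encoder (same kernel size and stride) with a transposed convolution.
decoder = nn.ConvTranspose1d(N, 1, kernel_size=16, stride=8, bias=False)
separated = decoder(masked.view(B * n_speakers, N, T)).view(B, n_speakers, -1)
print(separated.shape)                                      # one waveform per speaker: [B, n_speakers, L]
```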
According to the technical scheme, the voiceprint features, the emotion features, the deep features and other auxiliary features are fused on the basis of the audio features, so that the problem of low voice separation accuracy is solved, and the stability and the anti-interference performance of a separation effect can be guaranteed while the voice separation accuracy is improved.
EXAMPLE III
Fig. 3 is a schematic structural diagram of a speech separation apparatus according to a third embodiment of the present invention. As shown in fig. 3, the apparatus includes:
the feature determining module 310 is configured to obtain the voice data to be separated, and determine an audio feature of the voice data to be separated and an auxiliary feature of the voice data to be separated; wherein the auxiliary features comprise voiceprint features, emotion features and deep features;
a fusion feature determination module 320, configured to determine a fusion feature according to the audio feature and the auxiliary feature;
and a separation result determining module 330, configured to determine a separation result of the to-be-separated voice data according to the fusion feature and the audio feature.
In this scheme, optionally, the voiceprint feature is determined by performing feature extraction on the voice data to be separated through a voiceprint feature extractor; the emotional characteristics are determined based on frequency spectrum information of the voice data to be separated; the deep features are obtained by performing domain conversion based on the features of the voice data to be separated output by the deep feature extractor.
On the basis of the scheme, the emotion features comprise static features and dynamic features; wherein the static features include spectral characteristic features, brightness features, and loudness features.
In one possible approach, the time dimension of the audio features is greater than or equal to the time dimension of the auxiliary features;
the fused feature determination module 320 includes:
a dimension adjustment unit for adjusting the auxiliary feature to have the same time dimension as the audio feature;
and the fusion characteristic determining unit is used for splicing the audio characteristic and the auxiliary characteristic in a time dimension or a channel dimension and determining the fusion characteristic according to a splicing result.
On the basis of the above scheme, the fusion feature determining unit is specifically configured to:
and splicing the audio features and the auxiliary features in the time dimension, and performing remodeling operation on a splicing result to obtain fusion features matched with the time dimension of the audio features.
In this embodiment, optionally, the separation result determining module 330 includes:
a prediction feature determination unit for inputting the fusion feature into a speech separation network to determine a separated speech prediction feature;
the separation result determining unit is used for determining the separation result of the audio data to be detected according to the separated speech prediction feature and the audio feature.
on the basis of the above scheme, the separation result determining unit is specifically configured to:
determining a dot product result of the separated speech prediction feature and the audio feature;
and taking the dot product result as the input of an audio decoder, and determining the separation result of the audio data to be detected according to the output of the audio decoder.
The voice separation device provided by the embodiment of the invention can execute the voice separation method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.
Example four
FIG. 4 illustrates a block diagram of an electronic device 410 that may be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 4, the electronic device 410 includes at least one processor 411, and a memory communicatively connected to the at least one processor 411, such as a Read Only Memory (ROM) 412, a Random Access Memory (RAM) 413, and the like, wherein the memory stores computer programs executable by the at least one processor, and the processor 411 may perform various appropriate actions and processes according to the computer programs stored in the Read Only Memory (ROM) 412 or the computer programs loaded from the storage unit 418 into the Random Access Memory (RAM) 413. In the RAM 413, various programs and data required for the operation of the electronic device 410 can also be stored. The processor 411, the ROM 412, and the RAM 413 are connected to each other through a bus 414. An input/output (I/O) interface 415 is also connected to bus 414.
A number of components in the electronic device 410 are connected to the I/O interface 415, including: an input unit 416 such as a keyboard, a mouse, or the like; an output unit 417 such as various types of displays, speakers, and the like; a storage unit 418, such as a magnetic disk, optical disk, or the like; and a communication unit 419 such as a network card, modem, wireless communication transceiver, or the like. The communication unit 419 allows the electronic device 410 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
Processor 411 may be a variety of general and/or special purpose processing components with processing and computing capabilities. Some examples of processor 411 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The processor 411 performs the various methods and processes described above, such as the voice separation method.
In some embodiments, the speech separation method may be implemented as a computer program tangibly embodied in a computer-readable storage medium, such as storage unit 418. In some embodiments, part or all of the computer program may be loaded and/or installed onto electronic device 410 via ROM 412 and/or communications unit 419. When the computer program is loaded into RAM 413 and executed by processor 411, one or more steps of the speech separation method described above may be performed. Alternatively, in other embodiments, the processor 411 may be configured to perform the voice separation method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for implementing the methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be performed. A computer program can execute entirely on a machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. A computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user may provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the Internet.
The computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical host and VPS service are overcome.
It should be understood that various forms of the flows shown above, reordering, adding or deleting steps, may be used. For example, the steps described in the present invention may be executed in parallel, sequentially, or in different orders, and are not limited herein as long as the desired results of the technical solution of the present invention can be achieved.
The above-described embodiments should not be construed as limiting the scope of the invention. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method of speech separation, the method comprising:
acquiring voice data to be separated, and determining audio features of the voice data to be separated and auxiliary features of the voice data to be separated; wherein the auxiliary features comprise voiceprint features, emotion features and deep features;
determining fusion characteristics according to the audio characteristics and the auxiliary characteristics;
and determining a separation result of the voice data to be separated according to the fusion characteristic and the audio characteristic.
2. The method according to claim 1, wherein the voiceprint features are determined by performing feature extraction on the voice data to be separated through a voiceprint feature extractor; the emotional characteristics are determined based on frequency spectrum information of voice data to be separated; the deep features are obtained by domain conversion based on the features of the voice data to be separated output by the deep feature extractor.
3. The method of claim 2, wherein the emotion features comprise static features and dynamic features; wherein the static features include spectral characteristic features, brightness features, and loudness features.
4. The method of claim 1, wherein the time dimension of the audio feature is greater than or equal to the time dimension of the auxiliary feature;
determining a fusion feature according to the audio feature and the auxiliary feature includes:
adjusting the assist features to have the same time dimension as the audio features;
and splicing the audio features and the auxiliary features in a time dimension or a channel dimension, and determining fusion features according to a splicing result.
5. The method according to claim 4, wherein the splicing the audio feature and the assistant feature in the time dimension and determining the fusion feature according to the splicing result comprises:
and splicing the audio features and the auxiliary features in the time dimension, and performing remodeling operation on a splicing result to obtain fusion features matched with the time dimension of the audio features.
6. The method of claim 1, wherein determining a separation result of the voice data to be separated according to the fusion feature and the audio feature comprises:
inputting the fusion characteristics into a voice separation network, and determining separated voice prediction characteristics;
and determining a separation result of the audio data to be detected according to the separated voice prediction characteristic and the audio characteristic.
7. The method as claimed in claim 6, wherein the determining the separation result of the audio data to be detected according to the separated speech prediction feature and the audio feature comprises:
determining a dot product result of the separated speech prediction feature and the audio feature;
and taking the dot product result as the input of an audio decoder, and determining the separation result of the audio data to be detected according to the output of the audio decoder.
8. A speech separation apparatus, comprising:
the feature determining module is used for acquiring voice data to be separated and determining audio features of the voice data to be separated and auxiliary features of the voice data to be separated; wherein the auxiliary features comprise voiceprint features, emotion features and deep features;
the fusion characteristic determining module is used for determining fusion characteristics according to the audio characteristics and the auxiliary characteristics;
and the separation result determining module is used for determining the separation result of the voice data to be separated according to the fusion characteristic and the audio characteristic.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the speech separation method of any of claims 1-7.
10. A computer-readable storage medium storing computer instructions for causing a processor to perform the speech separation method of any one of claims 1-7 when executed.
Priority Applications (1)

Application number: CN202211313547.8A; priority date: 2022-10-25; filing date: 2022-10-25; title: Voice separation method, device, equipment and storage medium

Publications (1)

Publication number: CN115662463A; publication date: 2023-01-31

Family ID: 84990435

Country: CN


Legal Events

Code    Title
PB01    Publication
SE01    Entry into force of request for substantive examination