CN111863020A - Voice signal processing method, device, equipment and storage medium

Voice signal processing method, device, equipment and storage medium

Info

Publication number: CN111863020A (application); CN111863020B (granted patent)
Authority: CN (China)
Prior art keywords: signal, target, separation, parameters, target signal
Legal status: Granted; Active
Application number: CN202010754241.0A
Other languages: Chinese (zh)
Other versions: CN111863020B (en)
Inventors: 陈日林, 汤欣钰
Assignee (current and original): Tencent Technology Shenzhen Co Ltd
Legal events: application filed by Tencent Technology Shenzhen Co Ltd; priority to CN202010754241.0A; publication of CN111863020A; application granted; publication of CN111863020B

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating
    • G10L21/028 Voice signal separating using properties of sound source
    • G10L21/0208 Noise filtering
    • G10L2021/02087 Noise filtering, the noise being separate speech, e.g. cocktail party


Abstract

The application discloses a voice signal processing method, apparatus, device, and storage medium, belonging to the field of voice technology. In the method, different shape parameters correspond to different probability distributions, so that multiple probability distributions are available for selection. When a voice signal is processed, the appropriate probability distribution can be flexibly selected according to the shape parameter, and separation parameters that make the target signals mutually independent are then determined to separate the voice signal. The method can therefore adapt to acoustic scenes containing several different sound sources and accurately separate the target voice signal of each sound source in variable scenes where multiple sound sources are mixed.

Description

Voice signal processing method, device, equipment and storage medium
Technical Field
The present application relates to the field of speech technologies, and in particular, to a method, an apparatus, a device, and a storage medium for processing a speech signal.
Background
With the development of voice technology, its applications have become increasingly widespread. In more and more scenarios, voice signals can be collected and processed, and the content or intention they contain can be analyzed in order to execute a corresponding control command or to realize functions such as machine translation.
The cocktail party effect reveals the masking effect of the human ear, i.e., the natural ability to extract a desired sound source from a complex, noisy auditory scene (an acoustic scene in which multiple sound sources are present simultaneously). In such a complex acoustic scene, the mixed speech signal must first be separated into the speech signal of each sound source before the signal of each source can be processed.
At present, speech signal processing methods generally separate speech signals using a single sound-source probability model, such as a multivariate Laplacian distribution or a complex Gaussian distribution. However, acoustic scenes are complex, and different signals follow different probability distributions: regular mechanical sounds, such as those of a rotating spindle, tend toward a sub-Gaussian distribution, while human speech tends toward a super-Gaussian distribution. Processing the signals of all sound sources with the same probability model therefore cannot adapt to variable scenes in which multiple sound sources are mixed, and the target speech signal of each sound source cannot be separated accurately in such scenes, so this approach has poor accuracy and adaptability.
Disclosure of Invention
The embodiments of the present application provide a voice signal processing method, apparatus, device, and storage medium, which improve the accuracy and adaptability of voice signal processing. The technical scheme is as follows:
in one aspect, a method for processing a speech signal is provided, the method comprising:
separating the voice signals based on the separation parameters to obtain at least one target signal;
determining corresponding shape parameters for the at least one target signal, respectively, different shape parameters corresponding to different probability distributions;
obtaining the probability density distribution of the at least one target signal according to the shape parameter corresponding to the at least one target signal, wherein the probability density distribution of one target signal obeys the probability distribution corresponding to the shape parameter of the target signal;
determining independence between the at least one target signal according to the probability density distribution and the separation parameter;
and updating the shape parameters and the separation parameters according to the independence until the shape parameters and the separation parameters meet target conditions, and obtaining at least one target voice signal separated based on the updated separation parameters.
In one aspect, a speech signal processing apparatus is provided, the apparatus comprising:
the separation module is used for separating the voice signals based on the separation parameters to obtain at least one target signal;
a determining module for determining corresponding shape parameters for the at least one target signal, respectively, different shape parameters corresponding to different probability distributions;
an obtaining module, configured to obtain a probability density distribution of the at least one target signal according to a shape parameter corresponding to the at least one target signal, where the probability density distribution of one target signal obeys a probability distribution corresponding to the shape parameter of the target signal;
the determining module is further configured to determine independence between the at least one target signal according to the probability density distribution and the separation parameter;
and the updating module is used for updating the shape parameters and the separation parameters according to the independence until the shape parameters and the separation parameters meet target conditions, and obtaining at least one target voice signal separated based on the updated separation parameters.
In one possible implementation manner, the obtaining module includes a first obtaining unit, a second obtaining unit, a third obtaining unit, and a weighting unit;
the first obtaining unit is used for obtaining a mixing coefficient of at least two signal components of at least one frame in a target signal;
the second obtaining unit is used for obtaining the mixing coefficient and the variance information of at least two signal components of the target signal according to the mixing coefficient of at least two signal components of at least one frame in the target signal;
the third obtaining unit is used for obtaining the probability density distribution of the at least two signal components according to the variance information of the at least two signal components;
the weighting unit is configured to weight the probability density distribution of the at least two signal components according to a mixing coefficient of the at least two signal components of the target signal, so as to obtain the probability density distribution of the target signal.
In one possible implementation manner, the first obtaining unit is configured to:
for any signal component of any frame in the target signal, obtain the product of the mixing coefficient and the probability density distribution determined for the signal component in the last iteration;
and obtain the ratio of that product to the sum of the products of the at least two signal components of the frame, and use the ratio as the mixing coefficient of the signal component.
In a possible implementation manner, the second obtaining unit is configured to obtain, for any signal component of the target signal, an average value of a mixing coefficient of the any signal component in at least one frame, and use the average value as the mixing coefficient of the any signal component.
In a possible implementation manner, the variance information of any one of the at least two signal components includes a variance corresponding to the signal component at any frequency point and a variance weight corresponding to the signal component in any frame.
In one possible implementation, the separation parameter is at least one separation matrix;
the update module is to:
updating a weight matrix of the at least one separation matrix according to the at least one target signal obtained from the last iteration, a mixing coefficient of at least two signal components of at least one frame in the at least one target signal, and the variance information of the at least two components of the at least one target signal;
and updating the at least one separation matrix according to the updated weight matrix.
In a possible implementation manner, the determining module is configured to randomly select a shape parameter corresponding to the at least one target signal from at least two preset shape parameters.
In one possible implementation, the update module is configured to:
in response to not meeting the target condition, updating the shape parameter and the separation parameter;
and repeating the steps of separating the signals, obtaining a probability density distribution and determining independence based on the updated shape parameters and separation parameters until the target condition is met, and stopping the updating step.
In a possible implementation manner, the target condition is that the independence converges, or the target condition is that a difference between the independence and the independence obtained in the last iteration is smaller than a target difference, or the target condition is that the number of iterations reaches a target number.
In one possible implementation, the at least one target signal is at least one target frequency domain signal;
the separation module is configured to:
carrying out short-time Fourier transform on a voice signal to obtain a frequency domain signal of the voice signal;
separating the frequency domain signals based on the separation parameters to obtain the at least one target frequency domain signal;
the update module is to:
updating the shape parameters and the separation parameters according to the independence between the at least one target signal determined by the probability density distribution and the separation parameters until the independence meets a target condition, and obtaining at least one target frequency domain signal separated based on the updated separation parameters;
and performing short-time Fourier inverse transformation on the at least one target frequency domain signal to obtain the at least one target voice signal.
In one aspect, an electronic device is provided, which includes one or more processors and one or more memories, wherein at least one program code is stored in the one or more memories, and the at least one program code is loaded into and executed by the one or more processors to implement the above-mentioned voice signal processing method.
In one aspect, a computer-readable storage medium is provided, in which at least one program code is stored, the at least one program code being loaded and executed by a processor to implement the above-mentioned speech signal processing method.
In one aspect, a computer program product or computer program is provided that includes one or more program codes stored in a computer-readable storage medium. One or more processors of a computer device can read the one or more program codes from the computer-readable storage medium and execute them, so that the computer device performs the voice signal processing method of any of the above possible embodiments.
In the method and apparatus provided by the embodiments of the present application, different shape parameters correspond to different probability distributions, so that multiple probability distributions are available for selection. When a voice signal is processed, the appropriate probability distribution can be flexibly selected according to the shape parameter, and separation parameters that make the target signals mutually independent are then determined to separate the voice signal. The method can therefore adapt to acoustic scenes containing several different sound sources and accurately separate the target voice signal of each sound source in variable scenes where multiple sound sources are mixed, giving it good accuracy and adaptability.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and those skilled in the art can obtain other drawings based on these drawings without creative effort.
Fig. 1 is a schematic diagram of an implementation environment of a speech signal processing method according to an embodiment of the present application;
fig. 2 is a schematic diagram of a speech signal processing flow of a speech signal processing system according to an embodiment of the present application;
fig. 3 is a schematic diagram of an intelligent home system provided in an embodiment of the present application;
fig. 4 is a schematic diagram of a video conference system provided in an embodiment of the present application;
fig. 5 is a flowchart of a speech signal processing method according to an embodiment of the present application;
fig. 6 is a flow chart of an IVA-based blind source separation method provided in the related art;
fig. 7 is a flowchart of a speech signal processing method according to an embodiment of the present application;
fig. 8 is a flowchart of a speech signal processing method according to an embodiment of the present application;
fig. 9 is a spectrogram of a mixed speech signal of two sound sources before processing, according to an embodiment of the present application;
fig. 10 is a spectrogram of the mixed speech signal of two sound sources after processing, according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of a speech signal processing apparatus according to an embodiment of the present application;
fig. 12 is a block diagram of a terminal according to an embodiment of the present application;
fig. 13 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The terms "first," "second," and the like in this application are used for distinguishing between similar items and items that have substantially the same function or similar functionality, and it should be understood that "first," "second," and "nth" do not have any logical or temporal dependency or limitation on the number or order of execution. It will be further understood that, although the following description uses the terms first, second, etc. to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first image may be referred to as a second image, and similarly, a second image may be referred to as a first image, without departing from the scope of the various examples. Both the first image and the second image may be images, and in some cases, may be separate and distinct images.
The term "at least one" is used herein to mean one or more, and the term "plurality" is used herein to mean two or more, e.g., a plurality of packets means two or more packets.
It is to be understood that the terminology used in the description of the various examples herein is for the purpose of describing particular examples only and is not intended to be limiting. As used in the description of the various examples and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. The term "and/or" is an associative relationship that describes an associated object, meaning that three relationships may exist, e.g., A and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" in the present application generally indicates that the former and latter related objects are in an "or" relationship.
It should also be understood that, in the embodiments of the present application, the size of the serial number of each process does not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
It should also be understood that determining B from A does not mean determining B from A alone; B may also be determined from A and/or other information.
It will be further understood that the terms "comprises," "comprising," "includes," and/or "including," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also understood that the term "if" may be interpreted to mean "when" ("where" or "upon") or "in response to a determination" or "in response to a detection". Similarly, the phrase "if it is determined." or "if [ a stated condition or event ] is detected" may be interpreted to mean "upon determining.. or" in response to determining. "or" upon detecting [ a stated condition or event ] or "in response to detecting [ a stated condition or event ]" depending on the context.
Blind signal separation, which may also be referred to as blind source separation, sound source separation, or speech signal separation, refers to separating the source signals originating from individual sound sources out of multiple observed mixed signals. The mixed signals are typically observed at the outputs of multiple sensors: multiple sensors acquire multiple observation signals, and these observation signals are mixtures of the mutually independent (linearly uncorrelated) source signals. The word "blind" emphasizes two points: 1) the original source signals are unknown; 2) the way in which the signals were mixed is also unknown.
Specifically, blind source separation refers to separating sound from multiple sound sources. In the case where the sound of the sound source is a voice signal uttered by the user, the sound source separation is also called voice signal separation. The voice signal separation means that the mixed voice signal is separated into n voice signals by using a blind source separation technology.
Blind source separation is a technique for separating independent source signals from the mixed signals measured by a group of sensors, using only the weak prior condition that the source signals are mutually independent, in the situation where the transfer function of the system, the mixing coefficients of the source signals, and the probability distributions of the source signals are all unknown.
The blind signal separation technique may be implemented as a blind acoustic signal separation system that may mimic the human auditory system for identifying and enhancing sounds from a particular sound source.
Independent Vector Analysis (IVA) is an effective blind source separation method. It jointly considers all frequency points and minimizes the statistical dependence between output signals, thereby avoiding the permutation ambiguity problem of the Independent Component Analysis (ICA) method.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
The artificial intelligence technology is a comprehensive subject and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
The key technologies of Speech Technology are automatic speech recognition (ASR), speech synthesis (Text To Speech, TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the development direction of future human-computer interaction, and voice has become one of its most promising modes.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies the theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics; research in this field involves natural language, i.e., the language people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.
Machine Learning (ML) is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It specializes in studying how computers can simulate or implement human learning behavior so as to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve their performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
With the research and progress of artificial intelligence technology, the artificial intelligence technology is developed and applied in a plurality of fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical care, smart customer service, and the like.
The scheme provided by the embodiment of the application relates to technologies such as an artificial intelligence voice technology and a natural language processing technology, and is specifically explained by the following embodiment.
The following describes an embodiment of the present application.
Fig. 1 is a schematic diagram of an implementation environment of a speech signal processing method according to an embodiment of the present application. The implementation environment comprises a terminal 101 or the implementation environment comprises a terminal 101 and a speech signal processing platform 102. The terminal 101 is connected to the voice signal processing platform 102 through a wireless network or a wired network.
The terminal 101 may be at least one of a smart phone, a game console, a desktop computer, a tablet computer, an e-book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, and a laptop computer. The terminal 101 has installed and runs an application program that supports voice signal processing; for example, the application may be a system application, an instant messaging application, a news push application, a shopping application, an online video application, a social application, a music application, and the like.
Illustratively, the terminal 101 can have a voice signal collecting function and a voice signal processing function, and the terminal 101 can collect a voice signal and process the collected voice signal. The terminal 101 can independently complete the work, and can also provide a background service for voice signal processing through the voice signal processing platform 102, which is not limited in the embodiment of the present application.
The voice signal processing platform 102 includes at least one of a server, a plurality of servers, a cloud computing platform, and a virtualization center. The voice signal processing platform 102 is used to provide background services for applications that support voice signal processing. Optionally, the voice signal processing platform 102 undertakes the primary processing work and the terminal 101 the secondary processing work, or vice versa; alternatively, either the voice signal processing platform 102 or the terminal 101 can undertake the processing alone, or the two adopt a distributed computing architecture for collaborative computing.
Optionally, the speech signal processing platform 102 includes a plurality of servers 1021, and the plurality of servers 1021 can cooperate to provide background services for speech signal processing for the terminal 101.
The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud service, a cloud database, cloud computing, a cloud function, cloud storage, network service, cloud communication, middleware service, domain name service, security service, CDN, and a big data and artificial intelligence platform. The terminal may be, but is not limited to, a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, and the like.
Those skilled in the art will appreciate that the number of the terminals 101 and the servers 1021 may be greater or smaller. For example, there may be only one terminal 101 or one server 1021, or several tens or hundreds of the terminals 101 or the servers 1021, or a larger number, and the number of the terminals or the servers and the device types are not limited in the embodiment of the present application.
A specific application scenario is provided below.
The embodiments of the present application provide a front-end voice enhancement technology that can enhance a desired signal and suppress interference. It can be applied to various systems, such as an intelligent home system, a video conference system, and a voice recognition system. These three application scenarios are explained below with reference to figs. 2, 3 and 4. Of course, the method provided in the embodiments of the present application can also be applied to other scenarios; the embodiments do not limit the application scenarios or fields.
Fig. 2 is a schematic diagram of the speech signal processing flow of a speech signal processing system according to an embodiment of the present application; it shows the front-end signal processing that precedes a speech recognition system. Specifically, one or more users 201 speak, and one or more microphones 202 collect the target speech and interfering speech and output one or more observation signals 203. After echo cancellation 204, dereverberation 205, sound source separation 206, and post-processing 207, an enhanced clean target speech signal 208 is obtained and input into the recognition system 209 for speech recognition, yielding the text content corresponding to the clean target speech signal 208. The speech signal processing method provided by the embodiments of the present application corresponds to sound source separation 206.
Specifically, after echo cancellation 204 and dereverberation 205 are performed on the one or more observation signals 203, one or more preprocessed observation signals are obtained. The sound source separation step 206 can iterate multiple times to determine a separation parameter with a good separation effect, separate the mixture of the one or more observation signals, and obtain one or more clean target speech signals, each corresponding to one sound source. The clean target speech signal of each sound source is post-processed and then input into the recognition system 209, which performs speech recognition on each clean speech signal to determine the text content corresponding to the speech of each sound source. Sound source separation 206 thus improves the accuracy with which the subsequent recognition system recognizes the text content of each user's speech.
Regarding echo cancellation 204 and dereverberation 205: when a voice signal is acquired or recorded, the microphone receives not only the directly arriving part of the sound wave emitted by the desired sound source, but also sound waves from that source arriving via other paths, as well as unwanted sound waves (i.e., background noise) produced by other sources in the environment. Acoustically, reflected waves with a delay of about 50 milliseconds (ms) or more are called echoes, and the effect of the remaining reflected waves is called reverberation.
Echo cancellation 204 is generally performed by adaptively estimating the magnitude of the echo signal, called the estimated value, and subtracting this estimate from the received signal to cancel the echo.
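The patent does not specify which adaptive method is used; the normalized LMS (NLMS) filter is one common choice, so the following is only a minimal sketch under that assumption, with illustrative parameter values.

```python
import numpy as np

def nlms_echo_cancel(far_end, mic, taps=128, mu=0.5, eps=1e-8):
    """Estimate the echo of the far-end reference adaptively and
    subtract the estimate from the microphone signal (NLMS)."""
    w = np.zeros(taps)                        # adaptive filter weights
    out = np.zeros_like(mic)
    for n in range(taps, len(mic)):
        x = far_end[n - taps:n][::-1]         # most recent reference samples
        e = mic[n] - w @ x                    # received minus estimated echo
        w += mu * e * x / (x @ x + eps)       # normalized LMS weight update
        out[n] = e                            # residual = echo-cancelled signal
    return out
```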
For dereverberation 205, a single-microphone system can exploit transform-domain characteristics and the temporal characteristics of the acoustic signal at the receiving position in the sound field to remove reverberation. A multi-microphone array system can exploit the spatial characteristics of the sound field: using the receiving directivity of the array, it can directly improve the ratio of signal energy to reverberation energy, and it also significantly suppresses background noise (equivalent noise level, i.e., electrical noise generated by the microphones). Dereverberation 205 may be implemented by a dereverberation algorithm or model, such as WPE (Weighted Prediction Error) for speech dereverberation or the Schroeder reverberation model.
Post-processing 207 performs further noise reduction and similar processing on the clean target speech signal 208 obtained by sound source separation, so that the speech signal entering the recognition system 209 is clean enough to improve the accuracy of speech recognition.
As for the recognition system 209, it uses speech technology to convert speech into text or commands; it may be implemented by a speech recognition model. A sketch of the overall front-end chain follows.
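The chain above can be summarized as the following minimal sketch; every stage is a trivial stand-in (a real system plugs in actual echo cancellation, dereverberation, separation, post-processing, and recognition components), so only the processing order is meaningful here.

```python
# Trivial stand-ins so the sketch runs; only the order of the stages
# (204 -> 205 -> 206 -> 207 -> 209) reflects the system described above.
def echo_cancellation(x): return x
def dereverberation(x): return x
def sound_source_separation(x): return [x]   # one clean signal per source
def post_process(s): return s
def recognize(s): return "<text>"

def front_end(observations):
    x = echo_cancellation(observations)          # 204
    x = dereverberation(x)                       # 205
    sources = sound_source_separation(x)         # 206: this application's method
    return [recognize(post_process(s)) for s in sources]  # 207, 209

print(front_end([0.0, 0.1, -0.2]))
```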
The speech signal processing system can be applied to any speech recognition scenario. For example, when applied to a speech recognition scenario it may be called a speech recognition system, and in a multi-person scenario a "voice input" function is used: the user speaks, the device collects the user's voice signal and converts it into text content, and the text content is entered into an input box. The multi-person scenario may of course also include other functions that rely on voice recognition, such as voice search, voice wake-up, and machine translation. In a court, for example, the voice signals emitted by several sound sources can be separated and recognized individually to obtain the text content of each source's speech, which can be recorded as the court-trial record of each speaker.
In a specific application scenario, the speech signal processing system may be applied to an intelligent home scenario, and in the scenario, the speech signal processing system may be referred to as an intelligent home system. Fig. 3 is a schematic diagram of an intelligent home system according to an embodiment of the present application, as shown in fig. 3, in the intelligent home system, a plurality of users 301 speak, where one or more users 301 issue a voice instruction to a device 302 of the intelligent home system to control the device 302 to execute a function 303 corresponding to the voice instruction.
The device 302 may be an intelligent household appliance such as a toilet lamp, a desk lamp, an air conditioner, and a television, and only a few intelligent household appliances are listed here as an example for description, and the embodiment of the present application does not limit the intelligent household appliances.
The functions 303 performed by different intelligent appliances differ. For example, a toilet lamp or a desk lamp can be turned on or off, or have its brightness adjusted; an air conditioner can be turned on or off, or have its temperature, fan power, or louver direction adjusted; a television can be turned on or off, change channels, or adjust its volume. For example, when users A, B, C, and D are all speaking, the device 302 of the smart home system collects a mixed signal of the four users' speech signals. After echo cancellation, dereverberation, and other processing, the mixed signal is separated by the method provided in this application into the speech signals of the four sound sources, and the speech signal of each source is then recognized to determine its text content. By analyzing the semantics of each voice signal's text content, it can be determined that user A and user B are chatting, user C issued a voice instruction to turn on the air conditioner, and user D issued a voice instruction to turn down the television volume. After this analysis, the air conditioner and the television in the smart home system execute the corresponding functions: the air conditioner turns on, and the television volume is reduced.
In another particular application scenario, the speech signal processing system may be applied in a video conferencing scenario, in which it may be referred to as a video conference system. Fig. 4 is a schematic diagram of a video conference system provided in an embodiment of the present application. As shown in fig. 4 (a), a plurality of users 401 are conducting video calls through their respective devices; each device may display a video call interface 402, on which the video feed of each user 401 may be shown.
When the multiple users 401 speak during the video call, the device collects or receives the voice signals they emit, performs echo cancellation, dereverberation, and other processing on the mixture of those voice signals, and then separates the mixed signal by the method provided in this application into the voice signals of the four sound sources (i.e., the users 401). The voice signal of each source is then recognized, and the text content of each voice signal is determined and used as the speaking content of each user 401.
Alternatively, as shown in fig. 4 (b), after the recognition is completed, the speech content of each user 401 may be displayed as a subtitle 403 in the video screen of each user 401.
Alternatively, as shown in fig. 4 (c), after the recognition is completed, the device may display the speech contents of the plurality of users 401 in the chat box 404.
Fig. 5 is a flowchart of a speech signal processing method provided in an embodiment of the present application. The method is applied to an electronic device, which is a terminal or a server. Referring to fig. 5 and taking application to a terminal as an example, the method includes the following steps.
501. And the terminal separates the voice signals based on the separation parameters to obtain at least one target signal.
The voice signal is a mixed signal acquired based on a plurality of sensors, and can be called as a mixed observation signal or an observed multi-sound-source mixed signal.
The separation parameter is used to separate the voice signal into at least one target signal, each target signal being the signal emitted by one sound source. The separation parameter may initially take an initial value; the terminal then performs the subsequent steps to update it based on subsequent data, improving its separation effect on the voice signal and obtaining accurate, clean target signals.
502. The terminal determines corresponding shape parameters for the at least one target signal, respectively, with different shape parameters corresponding to different probability distributions.
The shape parameter is used to determine the probability distribution of a target signal. Since the probability distributions of the signals of different sound sources may differ, the method provides several candidate probability distributions; by determining the shape parameter, the probability distribution of each target signal can be flexibly chosen from these candidates, so that the method can adapt to complex multi-source acoustic scenes, improving the adaptability of the speech signal processing method.
Specifically, the shape parameters can be updated in the subsequent steps so that an optimal shape parameter is selected for each target signal. An accurate probability distribution can then be determined for each target signal, and the fitted probability distribution better matches the target signal's true probability distribution, improving the accuracy of the speech signal processing method.
503. And the terminal acquires the probability density distribution of the at least one target signal according to the shape parameter corresponding to the at least one target signal, and the probability density distribution of one target signal obeys the probability distribution corresponding to the shape parameter of the target signal.
After the terminal determines the shape parameter for each target signal, the probability density distribution of that target signal can be determined from the shape parameter; the probability density distribution is governed by the shape parameter. For example, in one specific example, two probability distributions may be provided: the Gaussian distribution and super-Gaussian distributions. Suppose β is the shape parameter: for β < 2, the corresponding target signal obeys a super-Gaussian distribution, and for β = 2, the corresponding target signal obeys a Gaussian distribution. In particular, for β = 1, the corresponding target signal obeys a Laplacian distribution. Of course, in another specific example, a sub-Gaussian distribution may also be provided, with a corresponding shape parameter value set for it.
504. The terminal determines independence between the at least one target signal based on the probability density distribution and the separation parameter.
After the terminal determines the probability density distribution of each target signal, it can construct an objective function and use its value to measure how well the at least one target signal has been separated by the current separation parameter. The objective function represents the independence between the at least one target signal. Understandably, better independence indicates better separation, while poor independence indicates a poor separation effect.
505. And the terminal updates the shape parameters and the separation parameters according to the independence until the shape parameters and the separation parameters meet target conditions, and obtains at least one target voice signal separated based on the updated separation parameters.
If the independence is not good, the terminal can update the separation parameter and the shape parameter and then repeat the above steps; each repetition is called an iteration. Through these iterations, the optimal separation parameter and optimal shape parameter can be found, and the optimal target speech signals can then be separated using the optimal separation parameter.
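The loop of steps 501-505 can be sketched as follows. This is only an illustrative numpy sketch, not the patent's exact update rules: the generalized-Gaussian density, the frequency-joint norm, and the natural-gradient update below are standard IVA ingredients assumed here as stand-ins for the formulas given later.

```python
import numpy as np
from scipy.special import gammaln

def ggd_logpdf(r, beta):
    # Unit-scale generalized Gaussian log-density:
    # p(r) = beta / (2 * Gamma(1/beta)) * exp(-|r|^beta).
    return np.log(beta) - np.log(2.0) - gammaln(1.0 / beta) - np.abs(r) ** beta

def separate(X, betas=(1.0, 2.0), n_iter=50, tol=1e-6, lr=0.1):
    """X: mixed STFT observation of shape (F, T, M), with M = N assumed.
    Returns the estimated source STFTs of shape (F, T, N)."""
    F, T, M = X.shape
    W = np.tile(np.eye(M, dtype=complex), (F, 1, 1))    # step 501: init W_f
    prev = np.inf
    for _ in range(n_iter):
        Y = np.einsum('fnm,ftm->ftn', W, X)             # step 501: y = W x
        r = np.linalg.norm(Y, axis=0) + 1e-12           # (T, N) joint norms
        # Step 502: per source, keep the preset shape parameter giving
        # the highest total log-density on the current estimate.
        beta = np.array([max(betas, key=lambda b: ggd_logpdf(r[:, n], b).sum())
                         for n in range(M)])
        # Steps 503-504: IVA-style independence objective,
        # -E[log p(y)] - mean_f log |det W_f|.
        logp = np.stack([ggd_logpdf(r[:, n], beta[n]) for n in range(M)], axis=1)
        obj = -logp.mean() - np.mean(
            [np.log(abs(np.linalg.det(Wf)) + 1e-12) for Wf in W])
        if abs(prev - obj) < tol:                       # target condition met
            break
        prev = obj
        # Step 505: natural-gradient update, W <- W + lr (I - E[phi(y) y^H]) W.
        phi = (beta * r ** (beta - 2))[None, :, :] * Y
        G = np.einsum('ftn,ftm->fnm', phi, Y.conj()) / T
        W = W + lr * np.matmul(np.eye(M)[None] - G, W)
    return np.einsum('fnm,ftm->ftn', W, X)
```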
As shown in fig. 6, the related art adopts an IVA-based blind source separation method: assuming the convolutively mixed signals of N sources are collected by M sensors that output observation signals of M channels, an STFT (Short-Time Fourier Transform) converts the observation signals to the frequency domain, and a separation method based on Independent Vector Analysis (IVA) builds a source-signal model to obtain an objective function, which is then optimized with a nonlinear parameter optimization method such as the natural gradient method or an adaptive variable-step-size method. That method, however, uses a single probability model.
In contrast, the present application makes different shape parameters correspond to different probability distributions, providing multiple probability distributions for selection. When processing the voice signal, the appropriate probability distribution can be flexibly selected according to the shape parameter, and separation parameters that make the target signals mutually independent are then determined to separate the voice signal. The method can thus adapt to acoustic scenes containing several different sound sources and accurately separate the target voice signal of each source in variable scenes where multiple sound sources are mixed.
Fig. 7 is a flowchart of a speech signal processing method provided by an embodiment of the present application, and referring to fig. 7, the method includes the following steps.
701. The terminal collects voice signals.
The terminal may have a voice signal acquisition function and a voice signal processing function. The terminal can collect voice signals, process the voice signals and separate the voice signals into at least one target voice signal. The speech signal is usually a signal generated by one or more sound sources, and the speech signals generated by the respective sound sources can be separated by a separation step, and the target speech signal is a clean, low-noise or even noiseless, single-sound-source speech signal.
In one possible implementation, one or more microphones may be configured on the terminal; for example, the number of microphones may be 4, 6, 8, and so on. The terminal collects voice signals through the microphones. Each microphone is a sensor and outputs the voice signal of one channel; if there are multiple microphones, the voice signals of the multiple channels are mixed to obtain the voice signal.
The method is described here as applied to a terminal. Alternatively, it may be applied to a server, in which case the voice signal in step 701 may be collected by the terminal and sent to the server. Optionally, the terminal and the server may also cooperate to complete the voice signal processing; the embodiments of the present application do not limit which approach is adopted.
702. The terminal carries out short-time Fourier transform on the voice signal to obtain a frequency domain signal of the voice signal.
It can be understood that the speech signal is a time-domain signal. Processing the time-domain signal directly has high computational complexity and a large computation load, while converting it to the frequency domain reduces both. Therefore, the terminal can convert the speech signal into a frequency-domain signal and perform the subsequent processing on the frequency-domain signal.
Because the speech signal is non-stationary, it cannot be processed directly with the standard Fourier transform; since it has short-time (quasi-stationary) characteristics, a short-time Fourier transform is applied to each short segment of the speech signal to obtain its frequency-domain signal. The short-time Fourier transform may also be referred to as the finite-length Fourier transform, and the resulting spectrum of the frequency-domain signal may also be called the short-time spectrum.
Specifically, the process of converting the time domain signal into the frequency domain signal may include: and the terminal performs windowing, framing and Fourier transformation on the voice signal to obtain a frequency domain signal of the voice signal. Optionally, the terminal may further perform logarithm processing on the fourier transformed signal to obtain a corresponding frequency spectrum (e.g., mel frequency spectrum), and perform analysis processing based on the frequency spectrum.
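A minimal sketch of the transform step using scipy; the sampling rate, window, and frame sizes are assumptions, since the patent does not fix them.

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000                            # assumed sampling rate
x = np.random.randn(4, 2 * fs)        # stand-in for a 4-microphone recording

# Windowing, framing, and FFT in one call; X has shape
# (channels M, frequency points F, frames T).
freqs, frames, X = stft(x, fs=fs, window='hann', nperseg=512, noverlap=384)

# ... frequency-domain separation (step 703) happens here ...

# Inverse short-time Fourier transform back to the time domain.
_, y = istft(X, fs=fs, window='hann', nperseg=512, noverlap=384)
```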
703. And the terminal separates the frequency domain signal based on the separation parameter to obtain the at least one target frequency domain signal.
The frequency-domain signal obtained by the terminal is still a mixed signal. The terminal can separate it into at least one target frequency-domain signal, each of which is regarded as the frequency-domain signal corresponding to the voice signal emitted by one sound source.
The separation parameter can be regarded as a transform coefficient. The speech signal is a mixture of source signals, and naturally its frequency-domain signal is also a mixed signal. The mixing may take various forms, such as linear instantaneous mixing, linear convolutive mixing, and nonlinear mixing. The separation parameter is used to unmix the mixed signal.
In one possible implementation, the separation parameter is at least one separation matrix: the number of separation matrices is one or more, and each frequency point corresponds to one separation matrix. With one separation matrix, the terminal can separate the frequency-domain signal at a single frequency point; with multiple separation matrices, the terminal can separate the frequency-domain signals at multiple frequency points. Accordingly, the terminal separates the frequency-domain signal according to the at least one separation matrix, unmixing it to obtain the at least one target frequency-domain signal.
For example, suppose there are N source signals and M observation signals, and that the source signals, the observation signals, and the estimated N source signals are denoted s_ft, x_ft, and y_ft respectively, with frequency-point index f ∈ {1, ..., F} and frame index t ∈ {1, ..., T}, where M ≥ N is required. Let the M×N mixing matrix be A_f and the N×M separation matrix be W_f; M and N are both positive integers. Alternatively, it may be assumed that M equals N, the estimated source signals (i.e., the at least one target frequency-domain signal) are obtained by processing the observation signals (i.e., the frequency-domain signals), the number of source signals is analyzed, and redundant observation signals are discarded. The relationship of the three signals is given by equations (1) and (2):

x_ft = A_f s_ft    (1)

y_ft = W_f x_ft    (2)
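A small numeric check of equations (1) and (2), with assumed sizes and M = N: mixing with per-frequency-point matrices A_f, then unmixing with the ideal estimate W_f = A_f^{-1}, recovers the sources.

```python
import numpy as np

F, T, M, N = 257, 100, 2, 2           # assumed sizes, with M = N
rng = np.random.default_rng(0)

S = rng.standard_normal((F, T, N)) + 1j * rng.standard_normal((F, T, N))
A = rng.standard_normal((F, M, N)) + 1j * rng.standard_normal((F, M, N))

# Equation (1): per frequency point f, x_ft = A_f s_ft.
X = np.einsum('fmn,ftn->ftm', A, S)

# Equation (2): per frequency point f, y_ft = W_f x_ft.
W = np.linalg.inv(A)
Y = np.einsum('fnm,ftm->ftn', W, X)
print(np.max(np.abs(Y - S)))          # ~0 up to numerical error
```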
It should be noted that steps 702 and 703 together implement the process of separating the speech signal based on the separation parameter to obtain at least one target signal. In this process, the at least one target signal is at least one target frequency-domain signal, and converting the speech signal to the frequency domain reduces the computational complexity and the computation load. Optionally, the terminal may also process the time-domain signal directly, which is not limited in the embodiments of the present application.
704. The terminal determines corresponding shape parameters for the at least one target signal, respectively, with different shape parameters corresponding to different probability distributions.
After the terminal separates out at least one target signal, the probability density distribution of each signal can be determined, so that the independence between the target signals can be evaluated from the probability density distributions and the separation effect can be judged.
Optionally, the shape parameter may be determined in several ways: for example, the shape parameter that best matches the probability distribution of the target signal is chosen from preset shape parameters, or the waveform or energy distribution of the target signal is analyzed to determine its shape parameter.
In one possible implementation, the terminal may randomly select the shape parameter corresponding to the at least one target signal from at least two preset shape parameters. The terminal can randomly choose the shape parameters in each iteration; different shape parameters correspond to different probability distributions, the subsequently obtained probability density distributions differ, and naturally the independence between the at least one target signal also differs. Therefore, when the shape parameters are later adjusted according to the independence, they can be adjusted toward the optimum, so that the independence between the subsequently obtained target signals improves.
By presetting at least two shape parameters and determining the optimal shape parameter for each target signal during iteration, the method accounts for the fact that the speech signals emitted by different sound sources may follow different probability distributions, instead of analyzing all sources' signals with a single probability distribution; it can therefore fit target signals that better match the probability distribution of each source's signal. Moreover, choosing among preset shape parameters allows the shape parameters to be determined quickly, which is more efficient.
The preset shape parameters may be set by a relevant technician as required, which is not limited in the embodiments of the present application. For example, the preset shape parameter β may take a value in the range from 0 to 2. The process of processing the speech signal can then be regarded as separating the speech signal with a hybrid model: when β < 2 the model has super-Gaussian properties, and when β = 2 the model has Gaussian properties.
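To make the role of the shape parameter concrete, here is a minimal sketch of a complex generalized Gaussian log-density. It is an illustration rather than the patent's implementation, and the normalization constant used is one common convention, assumed here:

```python
# Minimal sketch of a complex generalized Gaussian log-density with shape
# parameter beta; illustrative only. The normalization
# p(y) = beta / (2*pi*sigma2*Gamma(2/beta)) * exp(-(|y|^2/sigma2)^(beta/2))
# is one common convention (an assumption, not taken from the patent).
import numpy as np
from scipy.special import gammaln

def cgg_logpdf(y, sigma2, beta):
    """Element-wise log p(y) for complex y with variance sigma2."""
    log_norm = np.log(beta) - np.log(2 * np.pi * sigma2) - gammaln(2.0 / beta)
    return log_norm - (np.abs(y) ** 2 / sigma2) ** (beta / 2.0)

y = np.array([0.1 + 0.2j, 1.5 - 0.5j])
print(cgg_logpdf(y, sigma2=1.0, beta=1.0))  # super-Gaussian (heavy-tailed)
print(cgg_logpdf(y, sigma2=1.0, beta=2.0))  # Gaussian case
```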
In another possible implementation, the terminal may determine the corresponding shape parameter for the at least one target signal according to information such as the waveform or energy distribution of each target signal; the embodiments of the present application do not limit which specific implementation is adopted.
705. And the terminal acquires the probability density distribution of the at least one target signal according to the shape parameter corresponding to the at least one target signal, and the probability density distribution of one target signal obeys the probability distribution corresponding to the shape parameter of the target signal.
After the terminal determines the shape parameter corresponding to each target signal, the probability distribution obeyed by each target signal is determined, and the probability density distribution of each target signal can be obtained.
Optionally, the probability density distribution may be determined by a target probability density distribution function, where the target probability density distribution function includes a shape parameter, and the probability density distribution functions of different probability distributions may be obtained by assigning different values to the shape parameter.
In one possible implementation, the target signal may be divided into at least two signal components, and the probability density distribution of the target signal is analyzed according to the condition that the signal components are mixed to obtain the target signal. Specifically, the step 705 can be implemented by the following steps one to four:
step one, for a target signal, acquiring a mixing coefficient of at least two signal components of at least one frame in the target signal.
The mixing coefficient represents the proportion or weight of each signal component of at least one frame of the target signal among all signal components. The mixing coefficient in step one is the per-frame mixing coefficient of each signal component: once the proportion or weight of each signal component among all components is determined, each frame of the target signal can be expressed as the signal components mixed according to those proportions or weights, and the probability density distribution of the target signal can then be determined.
The mixing coefficient can be determined in various ways. In one possible implementation, for any signal component of any frame in the target signal, the terminal obtains the product of the mixing coefficient determined for that component in the last iteration and its probability density, then obtains the ratio of that product to the sum of the corresponding products over the at least two signal components of the frame, and uses the ratio as the mixing coefficient of that signal component.
For example, the obtaining process of the mixing coefficient can be realized by the following formula (3):
γ_{it}^{[n]} = α_i^{[n]} p_i(y_t^{[n]}) / Σ_{j=1}^{I} α_j^{[n]} p_j(y_t^{[n]})    (3)

wherein γ_{it}^{[n]} is the posterior probability of the i-th mixed component of the t-th frame, i.e., the mixing coefficient of the i-th mixed component of the t-th frame; α_i^{[n]} is the mixing probability (mixing coefficient) of component i in the n-th source; p(y^{[n]}) is the probability density function of the n-th source signal, with p_i(·) the density of its i-th component; and i and j are used to identify the mixed components.
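A minimal sketch of formula (3), for illustration only; alpha and logp are hypothetical inputs holding the prior mixing probabilities and the per-component log-densities log p_i(y_t) produced elsewhere in the iteration:

```python
# Minimal sketch of formula (3); illustrative only. alpha (I,) holds the
# hypothetical prior mixing probabilities and logp (I, T) the per-component
# log-densities log p_i(y_t) computed elsewhere in the iteration.
import numpy as np

def e_step(alpha, logp):
    """Return gamma[i, t] = alpha_i p_i(y_t) / sum_j alpha_j p_j(y_t)."""
    log_joint = np.log(alpha)[:, None] + logp           # (I, T)
    log_joint -= log_joint.max(axis=0, keepdims=True)   # numerical stability
    gamma = np.exp(log_joint)
    return gamma / gamma.sum(axis=0, keepdims=True)
```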
Of course, the mixing coefficient can also be determined in other ways. For example, a plurality of mixing coefficients may be preset; the terminal selects one mixing coefficient for each signal component from the preset plurality, and in subsequent iterations the mixing coefficient of each signal component can be preferentially selected from the preset plurality of mixing coefficients. The embodiments of the present application do not specifically limit the manner of determining the mixing coefficient.
And step two, acquiring the mixing coefficient and the variance information of at least two signal components of the target signal according to the mixing coefficient of at least two signal components of at least one frame in the target signal.
After the per-frame mixing coefficient of each signal component is determined in step one, the mixing coefficient of each signal component over all frames of the target signal can be determined, and the variance information of the target signal can also be determined.
Optionally, the mixing coefficient of each of the at least two signal components of the target signal aggregates that component's per-frame mixing coefficients over all frames. For any signal component of the target signal, the terminal may obtain the average of the mixing coefficients of that component over at least one frame and use the average as the mixing coefficient of that component.
Alternatively, the terminal may determine the mixing coefficient of at least two signal components of the target signal in other manners, for example, a corresponding weight may be set for each frame, and the terminal may perform weighting according to the mixing coefficient of each signal component in at least one frame and the weight of each frame to obtain the mixing coefficient of at least two signal components of the target signal. Wherein the weight can be set by a person skilled in the art according to requirements or determined by analyzing each frame of the target signal. The embodiment of the present application does not limit what specific implementation manner is adopted.
For the variance information, in a possible implementation manner, the variance information of any one of the at least two signal components includes a variance corresponding to the signal component at any frequency point and a variance weight corresponding to the signal component in any frame. Correspondingly, in the second step, after the terminal acquires the mixing coefficient of at least two signal components of the target signal, the variance of the signal component at any frequency point and the variance weight of the signal component in any frame can be determined according to the mixing coefficient. Alternatively, the variance may be in the form of a variance matrix and the variance weight may be in the form of a weight coefficient matrix of the variance.
For example, the second step can be realized by the following formulas (4) to (6), wherein the formula (4) is used for determining the mixing coefficient of any one signal component, and the formulas (5) and (6) are used for determining the variance information.
α_i^{[n]} = (1/T) Σ_{t=1}^{T} γ_{it}^{[n]}    (4)

Formula (5) updates the variance (σ_{fi}^{[n]})² of the i-th component at each frequency point f, and formula (6) updates the weight coefficient of the variance for each frame; the weight coefficient of the variance serves to adapt to power fluctuations caused by the non-stationarity of the acoustic signal.
In a specific embodiment, steps one and two may be implemented with the expectation-maximization (EM) algorithm, with step one corresponding to the E-step and step two to the M-step. If the above processing of the speech signal is implemented by a model, the model is referred to as a complex generalized Gaussian mixture model. The EM algorithm has great advantages in estimating model parameters: through it, the model parameters, namely the mixing coefficients and the variance information, can be determined quickly and accurately.
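A minimal sketch of the M-step of step two, for illustration only. Formula (4) is implemented directly; the variance update shown is a standard gamma-weighted power average that stands in for formulas (5) and (6):

```python
# Minimal sketch of the M-step (step two); illustrative only. Formula (4)
# is the per-frame average of the posteriors; the variance update below is
# a standard gamma-weighted power average standing in for formulas (5)-(6).
import numpy as np

def m_step(gamma, y):
    """gamma: (I, T) posteriors from the E-step; y: (F, T) separated signal."""
    alpha = gamma.mean(axis=1)                                # formula (4)
    power = np.abs(y) ** 2                                    # (F, T)
    sigma2 = (power[None] * gamma[:, None, :]).sum(axis=2) \
             / gamma.sum(axis=1)[:, None]                     # (I, F)
    return alpha, sigma2
```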
And thirdly, acquiring the probability density distribution of the at least two signal components according to the variance information of the at least two signal components.
And step four, weighting the probability density distribution of the at least two signal components according to the mixing coefficient of the at least two signal components of the target signal to obtain the probability density distribution of the target signal.
In the third step and the fourth step, after the terminal determines the parameters required for obtaining the probability density distribution, the probability density distribution of each target signal can be determined based on the parameters. When the probability density distribution of the target signal is determined, the probability density distribution of each signal component can be determined, and then the probability density distributions of at least two signal components are mixed according to the mixing coefficient of each signal component to obtain the probability density distribution of the target signal.
Alternatively, the probability density distribution may be represented by a probability density distribution function. For example, suppose the frequency points of the same frame are mutually independent and the target signal has I mixed components, where the mixed components are the signal components and I is a positive integer. The number of signal components may be set by a relevant technician as required; the terminal may also select it at random from several preset values and determine the optimal number during iteration, which is not limited in the embodiments of the present application. Specifically, steps three and four can be realized by the following formula (7), in which the density of the n-th source signal is a complex generalized Gaussian mixture:

p(y^{[n]}) = Π_{t=1}^{T} Σ_{i=1}^{I} α_i^{[n]} Π_{f=1}^{F} [ β_i^{[n]} / (2π c_t^{[n]} (σ_{fi}^{[n]})² Γ(2/β_i^{[n]})) ] exp( -( |y_{ft}^{[n]}|² / (c_t^{[n]} (σ_{fi}^{[n]})²) )^{β_i^{[n]}/2} )    (7)

wherein p(y^{[n]}) is the probability density function of the n-th source signal (i.e., the n-th target signal), Γ(·) is the gamma function, α_i^{[n]} and β_i^{[n]} are respectively the mixing probability (i.e., mixing coefficient) and the shape parameter of the i-th signal component in the n-th source, (σ_{fi}^{[n]})² is the variance, and c_t^{[n]} is the weight coefficient of the variance, introduced to adapt to power fluctuations caused by the non-stationarity of the acoustic signal.
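A minimal sketch of formula (7), for illustration only, reusing the cgg_logpdf helper from the earlier sketch; alpha, sigma2 and beta are hypothetical mixture parameters, and bins within a frame are treated as independent:

```python
# Minimal sketch of formula (7); illustrative only, reusing cgg_logpdf from
# the earlier sketch. alpha (I,), sigma2 (I, F) and beta (I,) are hypothetical
# mixture parameters; bins within a frame are treated as independent.
import numpy as np

def frame_logpdf(y, alpha, sigma2, beta):
    """y: (F, T) complex. Returns log p(y_t) for each frame t, shape (T,)."""
    I = alpha.shape[0]
    comp = np.empty((I, y.shape[1]))
    for i in range(I):
        # Component i: sum the per-bin log-densities over frequency.
        comp[i] = cgg_logpdf(y, sigma2[i][:, None], beta[i]).sum(axis=0)
    m = comp.max(axis=0)                      # stable log-sum-exp mixture
    return m + np.log((alpha[:, None] * np.exp(comp - m)).sum(axis=0))
```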
706. The terminal determines independence between the at least one target signal based on the probability density distribution and the separation parameter.
In determining the above parameters, an objective function can be set, the objective being to maximize the independence between the channel signals. By means of the value of the objective function, the parameters determined or assigned in the above steps can be adjusted to find the optimal parameters, and the speech signal is then separated with the optimal parameters to obtain the optimal separation result.
For example, the objective function is shown in equation (8) below:
J(W) = KL( p(y^{[1]}, ..., y^{[N]}) || Π_{n=1}^{N} p(y^{[n]}) )    (8)

wherein p(y^{[1]}, ..., y^{[N]}) is the joint probability density function of the separated signals and p(y^{[n]}) is the probability density function of the n-th source signal; this Kullback-Leibler divergence vanishes exactly when the separated signals are mutually independent. Expanding it for y_{ft} = W_f x_{ft} yields, up to terms that do not depend on W, the negative log-likelihood -(1/T) Σ_t Σ_n log p(y_t^{[n]}) - Σ_f log |det W_f|, where det is a function that computes the Determinant of a square matrix (available, for example, in FreeMat and Matlab).
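A minimal sketch of the expanded form of equation (8), for illustration only; frame_loglikes is a hypothetical (N, T) array of log p(y_t^{[n]}) values such as frame_logpdf above would produce:

```python
# Minimal sketch of the expanded form of equation (8); illustrative only.
# frame_loglikes is a hypothetical (N, T) array of log p(y_t^[n]) values.
import numpy as np

def objective(W, frame_loglikes):
    """W: (F, N, N) separation matrices. Smaller J means better separation."""
    _, logdet = np.linalg.slogdet(W)          # (F,) values of log|det W_f|
    return -frame_loglikes.mean(axis=1).sum() - logdet.sum()
```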
707. In response to the target condition not being met, the terminal updates the shape parameter and the separation parameter, and repeatedly executes steps 703 to 706 based on the updated shape parameter and separation parameter until the target condition is met, at which point the updating stops and at least one target frequency domain signal separated based on the updated separation parameters is obtained.
Through the above steps, the terminal judges from the obtained independence whether the current separation result is accurate enough to meet the requirements. If not, a better separation parameter needs to be determined to separate the speech signal; therefore, when the target condition is not met, the terminal updates the shape parameter and the separation parameter, executes steps 703 to 706 again based on the updated parameters, and determines whether the updated parameters meet the target condition.
Each execution of steps 703 to 706 constitutes one iteration; repeating the process means performing multiple iterations, through which shape parameters and separation parameters that meet the target condition can be determined. Separation parameters that meet the target condition can accurately separate the speech signal to obtain clean target speech signals.
Optionally, the separation parameter is at least one separation matrix. The update process for the separation parameter may be: updating the weight matrix of the at least one separation matrix according to the at least one target signal obtained by the last iteration, the mixing coefficient of at least two signal components of at least one frame in the at least one target signal and the variance information of the at least two components of the at least one target signal, and updating the at least one separation matrix according to the updated weight matrix. That is, the terminal can update the weight matrix of the separation matrix as well as the separation matrix.
For example, in step 707 the terminal may obtain the separation matrix with the auxiliary-function method of AuxIVA. Compared with the natural-gradient method, adaptive variable-step-size methods and the like, the auxiliary function is not constrained by a step-size parameter, so the terminal can find approximate solutions of the separation matrix and the model parameters (e.g., the shape parameters and variance information) more accurately while reducing the amount of computation, and can obtain the separation matrix accurately without pre-whitening the observed signal (i.e., the collected speech signal). Specifically, with the auxiliary-function method, each iteration updates the following two kinds of information:
updating of the first type of information: the weighted variance matrix is updated. The weighted variance matrix can be expressed by the following equation (9):
V_f^{[n]} = (1/T) Σ_{t=1}^{T} w_t^{[n]} x_{ft} x_{ft}^H    (9)

wherein the per-frame weight w_t^{[n]} is computed from the estimate of y_t^{[n]} obtained in the last iteration.
Updating of the second type of information: the separation matrix is updated. The separation matrix can be expressed by the following equation (10):
w_f^{[n]} = (W_f V_f^{[n]})^{-1} e_n,   w_f^{[n]} ← w_f^{[n]} / sqrt( (w_f^{[n]})^H V_f^{[n]} w_f^{[n]} )    (10)

wherein e_n is a unit vector of order N×1 in which the n-th element is 1 and the other elements are 0, and w_f^{[n]} is the separation filter of the n-th source. After the separation filters are updated in turn for the N sources, the separation matrix W_f = [w_f^{[1]}, ..., w_f^{[N]}]^H is obtained; each frequency point corresponds to one separation matrix. Here n and N are positive integers.
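A minimal sketch of one round of the updates (9) and (10), for illustration only; the per-frame weights wgt are assumed to come from the current source-model estimate:

```python
# Minimal sketch of one round of updates (9) and (10); illustrative only.
# wgt[n, t] is the per-frame weight assumed to come from the current
# source-model estimate (the previous iteration's separated signals).
import numpy as np

def auxiva_update(W, x, wgt):
    """W: (F, N, N); x: (F, T, M) observations with M == N; wgt: (N, T)."""
    F, T, N = x.shape
    for n in range(N):
        # Equation (9): V_f = (1/T) sum_t w_t x_ft x_ft^H (weighted covariance).
        V = np.einsum('t,ftm,ftk->fmk', wgt[n], x, x.conj()) / T
        # Equation (10): w_f = (W_f V_f)^{-1} e_n, then normalize.
        e = np.zeros(N)
        e[n] = 1.0
        w = np.linalg.solve(W @ V, np.tile(e, (F, 1)))        # (F, N)
        denom = np.sqrt(np.einsum('fm,fmk,fk->f', w.conj(), V, w).real)
        W[:, n, :] = (w / denom[:, None]).conj()              # n-th row = w^H
    return W
```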
As for the target condition, it may be that the independence converges, that the difference between the independence and the independence obtained in the last iteration is smaller than a target difference, or that the number of iterations reaches a target number. Of course, the target condition may also be another condition and may be set by a relevant technician as required, which is not limited in the embodiments of the present application.
Step 707 is the process of updating the shape parameter and the separation parameter according to the independence between the at least one target signal, determined from the probability density distribution and the separation parameter, and stopping once the independence meets the target condition. Through multiple iterations, a separation parameter that makes the independence between the at least one target signal good can be determined; separating the speech signal with such a separation parameter yields at least one clean target speech signal.
The foregoing process is described taking the at least one target frequency domain signal obtained in the last iteration as an example. In another possible implementation, after obtaining the separation parameter through multiple iterations, the terminal may repeat step 703 to separate the frequency domain signal into at least one target frequency domain signal and perform the following step 708 based on it, which is not limited in the embodiments of the present application.
708. And the terminal performs short-time inverse Fourier transform on the at least one target frequency domain signal to obtain the at least one target voice signal.
Through the multiple iterations above, the terminal obtains at least one target frequency domain signal separated with separation parameters that give a good separation effect, and the target speech signal corresponding to each target frequency domain signal can be obtained through the short-time inverse Fourier transform.
It should be noted that steps 707 and 708 together implement the process of updating the shape parameter and the separation parameter according to the independence between the at least one target signal, determined from the probability density distribution and the separation parameter, stopping once the independence meets the target condition, and obtaining at least one target speech signal separated based on the updated separation parameters. In another possible implementation, the terminal may process the time-domain signal directly; without converting it into a frequency-domain signal, the at least one target speech signal can be obtained through step 707. The embodiments of the present application are not limited to a specific implementation.
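For the frequency-domain front end and back end (steps 702 and 708), a minimal sketch using scipy's STFT/ISTFT pair, for illustration only; the segment length of 512 samples is an arbitrary choice:

```python
# Minimal sketch of the frequency-domain front end/back end (steps 702, 708);
# illustrative only. Segment length 512 is an arbitrary choice.
import numpy as np
from scipy.signal import stft, istft

fs = 16000
x_time = np.random.default_rng(1).standard_normal(fs)        # 1 s of "audio"
_, _, X = stft(x_time, fs=fs, nperseg=512)                   # (F, T) complex
# ... frequency-domain separation would operate on X here ...
_, x_rec = istft(X, fs=fs, nperseg=512)
print(np.allclose(x_time, x_rec[:len(x_time)]))              # True
```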
A specific example is provided below, by which comparative analysis is performed with the related art.
As shown in fig. 6, the related art separates the speech signal with a single sound-source probability model, which cannot adapt to acoustic scenes with multiple kinds of sound sources; its separation performance is limited by the step-size parameter, and the gradient-descent algorithm involves a large amount of computation, so the efficiency of processing the speech signal is low. Moreover, the variance of the source signal is difficult to estimate and the observed signal must be pre-whitened, making real-time implementation in a product difficult.
In the specific example shown in fig. 8, the present application adopts the complex generalized Gaussian mixture distribution as the sound-source model. Step 801 performs an STFT on the collected mixed speech signal x(t) to obtain a frequency domain signal x_{ft}; step 802 initializes the hybrid model parameters; the frequency domain signal is then processed through the hybrid model with the IVA algorithm 803. After each processing round, step 804 updates the objective function and step 805 updates the hybrid model parameters through the EM algorithm, after which the IVA algorithm 803 continues based on the updated model parameters. Through multiple iterations a separation matrix W is obtained (806), at least one target frequency domain signal y_{ft} is separated based on W, and finally the ISTFT step 807 is performed on y_{ft} to obtain at least one target speech signal y(t).
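A skeleton of the fig. 8 flow, for illustration only, reusing auxiva_update from the earlier sketch. The Laplacian-style per-frame weight and the log-determinant progress proxy are simple stand-ins for the full mixture-model EM updates (805) and objective evaluation (804):

```python
# Skeleton of the fig. 8 flow; illustrative only, reusing auxiva_update from
# the earlier sketch. The Laplacian-style weight and log-determinant proxy
# stand in for the full mixture-model EM updates (805) and objective (804).
import numpy as np
from scipy.signal import stft, istft

def separate(x_time, fs, max_iter=50, tol=1e-6):
    """x_time: (M, samples) multichannel mixture; returns separated signals."""
    _, _, X = stft(x_time, fs=fs, nperseg=512)            # 801: (M, F, T)
    X = np.moveaxis(X, 0, -1)                             # (F, T, M)
    F, T, M = X.shape
    W = np.tile(np.eye(M, dtype=complex), (F, 1, 1))      # 802: W_f = I
    prev = np.inf
    for _ in range(max_iter):                             # 803-805 loop
        Y = np.einsum('fnm,ftm->ftn', W, X)
        r = np.sqrt((np.abs(Y) ** 2).sum(axis=0)).T       # (M, T) frame norms
        wgt = 1.0 / np.maximum(r, 1e-12)                  # stand-in weights
        W = auxiva_update(W, X, wgt)                      # 806
        obj = -np.linalg.slogdet(W)[1].sum()              # 804: crude proxy
        if abs(prev - obj) < tol:
            break
        prev = obj
    Y = np.einsum('fnm,ftm->ftn', W, X)
    _, y_time = istft(np.moveaxis(Y, -1, 0), fs=fs, nperseg=512)   # 807
    return y_time
```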
In this specific example, the statistical model of super-Gaussian and Gaussian signals can be flexibly selected by determining the shape parameter, and the introduction of the mixture mechanism realizes adaptive matching of the sound-source model. Second, the expectation-maximization (EM) algorithm conveniently estimates the hybrid model parameters (such as the variance), so more accurate model parameters can be obtained. Finally, an improved auxiliary-function optimization method based on the MM (majorization-minimization) framework is adopted, which avoids the constraint of a step-size parameter, reduces the amount of computation, accurately solves for approximate solutions of the separation matrix and the hybrid model parameters, requires no pre-whitening of the collected speech signal, and improves signal-separation efficiency.
The following experiment on a two-source mixed scene compares the sound-source separation of the present application with that of the related art. In a room 4.45 meters (m) long, 3.55 m wide and 2.5 m high, with a reverberation time of about 130 ms, two speakers speak simultaneously. Each speaker is 1 m from the microphones, at direction angles of 45° and 135° respectively, and two microphones spaced 0.1 m apart receive the mixed speech signal, i.e., a speech signal obtained by mixing the speech signals of the two sound sources, which may also be called a two-source mixed speech signal.
Fig. 9 and fig. 10 show the spectrograms before and after the embodiment of the present application processes the two-source mixed speech signal. Comparing the two, the speech signals of the two sound sources are mixed together in the spectrogram of fig. 9, while in the spectrogram of fig. 10 they appear separately on the left and right channels.
An experimental example is provided below, in which the method provided in the present application and other methods are respectively used to process a received voice Signal, and the processing effects of the different methods are compared and analyzed through a Signal-to-Interference Ratio (SIR) and a Signal-to-Distortion Ratio (SDR) of the processed Signal.
The complex generalized gaussian mixture distribution-based AuxIVA technique provided in the present application is represented by (1), the laplacian distribution-based AuxIVA technique by (2), and the complex generalized gaussian distribution-based AuxIVA technique by (3).
The dual-voice source speech signals are processed by three techniques, respectively, and the SIR of the processed signals may be as shown in table 1, and the SDR of the processed signals may be as shown in table 2.
TABLE 1
Sound source Technique (1) Technique (2) Technique (3)
45 degree sound source 23.28 16.49 21.40
135 degree sound source 23.21 18.10 23.01
The calculated SIR values for each sound source using each technique are shown in table 1 and are expressed in decibels (dB), and the values in table 1 are the average SIR values obtained after 50 experiments, which are only used for comparison.
As shown in table 1, it can be seen that the SIR value of the signal obtained by the technique (1) is greater than that obtained by other techniques, and thus the speech signal can be more accurately separated by the technique (1), and the separation effect is better.
TABLE 2
Sound source Technique (1) Technique (2) Technique (3)
45 degree sound source 15.98 11.26 14.76
135 degree sound source 16.00 15.17 11.75
The calculated SDR values in decibels (dB) for each sound source obtained using each technique are shown in table 2, and the values in table 2 are average SDR values obtained from 50 experiments, which are only used as an example for comparison.
As shown in table 2, it can be seen that the SDR value of the signal obtained by the technique (1) is greater than the SDR values obtained by other techniques, and thus, the speech signal can be separated more accurately by the technique (1), and the separation effect is better.
By combining the data in the above table 1 and table 2, the method provided by the present application can successfully separate the mixed speech of multiple sound sources, and compared with other methods, the quality of the front-end signal is significantly improved.
In the method provided by the embodiments of the application, the shape parameters correspond to different probability distributions, and multiple probability distributions are provided for selection. When processing a speech signal, the corresponding probability distribution can be flexibly selected through the shape parameter, and separation parameters that make the target signals mutually independent are then determined to separate the speech signal. The method can therefore adapt to acoustic scenes containing several different sound sources and can accurately separate the target speech signal of each source in changeable multi-source mixed scenes, so it has good accuracy and adaptability.
All the above optional technical solutions may be combined arbitrarily to form optional embodiments of the present application, and are not described herein again.
Fig. 11 is a schematic structural diagram of a speech signal processing apparatus according to an embodiment of the present application. Referring to fig. 11, the apparatus includes:
a separation module 1101, configured to separate the voice signal based on the separation parameter to obtain at least one target signal;
a determining module 1102, configured to determine corresponding shape parameters for the at least one target signal respectively, where different shape parameters correspond to different probability distributions;
an obtaining module 1103, configured to obtain a probability density distribution of the at least one target signal according to the shape parameter corresponding to the at least one target signal, where the probability density distribution of one target signal obeys the probability distribution corresponding to the shape parameter of the target signal;
the determining module 1102 is further configured to determine independence between the at least one target signal according to the probability density distribution and the separation parameter;
and an updating module 1104, configured to update the shape parameter and the separation parameter according to the independence, and stop the updating until a target condition is met, so as to obtain at least one target speech signal separated based on the updated separation parameter.
In one possible implementation, the obtaining module 1103 includes a first obtaining unit, a second obtaining unit, a third obtaining unit, and a weighting unit;
the first acquisition unit is used for acquiring a mixing coefficient of at least two signal components of at least one frame in a target signal;
the second obtaining unit is used for obtaining the mixing coefficient and the variance information of at least two signal components of the target signal according to the mixing coefficient of at least two signal components of at least one frame in the target signal;
the third obtaining unit is used for obtaining the probability density distribution of the at least two signal components according to the variance information of the at least two signal components;
the weighting unit is used for weighting the probability density distribution of the at least two signal components according to the mixing coefficient of the at least two signal components of the target signal to obtain the probability density distribution of the target signal.
In one possible implementation, the first obtaining unit is configured to:
for any signal component of any frame in the target signal, acquiring the product of a mixing coefficient and probability density distribution determined in the last iteration process of the signal component;
and acquiring the proportion of the product of any signal component and the sum of the products of at least two signal components of any frame, and taking the proportion as a mixing coefficient of any signal component.
In a possible implementation manner, the second obtaining unit is configured to obtain, for any signal component of the target signal, an average value of the mixing coefficients of the any signal component in at least one frame, and use the average value as the mixing coefficient of the any signal component.
In one possible implementation manner, the variance information of any one of the at least two signal components includes a variance corresponding to the signal component at any frequency point and a variance weight corresponding to the signal component in any frame.
In one possible implementation, the separation parameter is at least one separation matrix;
the update module 1104 is configured to:
updating a weight matrix of the at least one separation matrix according to the at least one target signal obtained from the last iteration, the mixing coefficient of at least two signal components of at least one frame in the at least one target signal, and the variance information of the at least two components of the at least one target signal;
and updating the at least one separation matrix according to the updated weight matrix.
In a possible implementation manner, the determining module 1102 is configured to randomly select a shape parameter corresponding to the at least one target signal from at least two preset shape parameters.
In one possible implementation, the update module 1104 is configured to:
in response to not meeting the target condition, updating the shape parameter and the separation parameter;
the steps of separating the signals, obtaining a probability density distribution, and determining independence are repeatedly performed based on the updated shape parameters and separation parameters until the target condition is met, and the updating step is stopped.
In one possible implementation, the target condition is that the independence converges, or the target condition is that a difference between the independence and the independence obtained in the last iteration is smaller than a target difference, or the target condition is that the number of iterations reaches a target number.
In one possible implementation, the at least one target signal is at least one target frequency domain signal;
the separation module 1101 is configured to:
carrying out short-time Fourier transform on a voice signal to obtain a frequency domain signal of the voice signal;
separating the frequency domain signal based on the separation parameter to obtain the at least one target frequency domain signal;
the update module 1104 is configured to:
updating the shape parameter and the separation parameter according to the independence between the at least one target signal determined by the probability density distribution and the separation parameter until the independence meets a target condition, and obtaining at least one target frequency domain signal separated based on the updated separation parameter;
and performing short-time Fourier inverse transformation on the at least one target frequency domain signal to obtain the at least one target voice signal.
In the device provided by the embodiments of the application, the shape parameters correspond to different probability distributions, and multiple probability distributions are provided for selection. When processing a speech signal, the device can flexibly select the corresponding probability distribution through the shape parameter and then determine separation parameters that make the target signals mutually independent, so it can adapt to acoustic scenes containing several different sound sources and can accurately separate the target speech signal of each source in changeable multi-source mixed scenes.
It should be noted that: in the voice signal processing apparatus provided in the foregoing embodiment, when processing a voice signal, only the division of the functional modules is exemplified, and in practical applications, the functions may be distributed by different functional modules according to needs, that is, the internal structure of the voice signal processing apparatus is divided into different functional modules to complete all or part of the functions described above. In addition, the speech signal processing apparatus and the speech signal processing method provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments and are not described herein again.
The electronic device in the above method embodiment may be implemented as a terminal. For example, fig. 12 is a block diagram of a terminal according to an embodiment of the present disclosure. The terminal 1200 may be a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. Terminal 1200 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, desktop terminal, and so forth.
In general, terminal 1200 includes: one or more processors 1201 and one or more memories 1202.
The processor 1201 may include one or more processing cores, such as a 4-core processor, an 8-core processor, or the like. The processor 1201 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 1201 may also include a main processor and a coprocessor, where the main processor is a processor for processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 1201 may be integrated with a GPU (Graphics Processing Unit) for rendering and drawing content required to be displayed by the display screen. In some embodiments, the processor 1201 may further include an AI (Artificial Intelligence) processor for processing a computing operation related to machine learning.
Memory 1202 may include one or more computer-readable storage media, which may be non-transitory. Memory 1202 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in the memory 1202 is used to store at least one program code for execution by the processor 1201 to implement the speech signal processing methods provided by the method embodiments herein.
In some embodiments, the terminal 1200 may further optionally include: a peripheral interface 1203 and at least one peripheral. The processor 1201, memory 1202, and peripheral interface 1203 may be connected by a bus or signal line. Various peripheral devices may be connected to peripheral interface 1203 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 1204, display 1205, camera assembly 1206, audio circuitry 1207, positioning assembly 1208, and power supply 1209.
The peripheral interface 1203 may be used to connect at least one peripheral associated with I/O (Input/Output) to the processor 1201 and the memory 1202. In some embodiments, the processor 1201, memory 1202, and peripheral interface 1203 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 1201, the memory 1202 and the peripheral device interface 1203 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The Radio Frequency circuit 1204 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 1204 communicates with a communication network and other communication devices by electromagnetic signals. The radio frequency circuit 1204 converts an electric signal into an electromagnetic signal to transmit, or converts a received electromagnetic signal into an electric signal. Optionally, the radio frequency circuit 1204 comprises: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 1204 may communicate with other terminals through at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: the world wide web, metropolitan area networks, intranets, generations of mobile communication networks (2G, 3G, 4G, and 5G), Wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the rf circuit 1204 may further include NFC (Near Field Communication) related circuits, which are not limited in this application.
The display screen 1205 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 1205 is a touch display screen, the display screen 1205 also has the ability to acquire touch signals on or over the surface of the display screen 1205. The touch signal may be input to the processor 1201 as a control signal for processing. At this point, the display 1205 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, the display 1205 may be one, providing the front panel of the terminal 1200; in other embodiments, the display 1205 can be at least two, respectively disposed on different surfaces of the terminal 1200 or in a folded design; in other embodiments, the display 1205 may be a flexible display disposed on a curved surface or on a folded surface of the terminal 1200. Even further, the display screen 1205 may be arranged in a non-rectangular irregular figure, i.e., a shaped screen. The Display panel 1205 can be made of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), or other materials.
Camera assembly 1206 is used to capture images or video. Optionally, camera assembly 1206 includes a front camera and a rear camera. Generally, a front camera is disposed at a front panel of the terminal, and a rear camera is disposed at a rear surface of the terminal. In some embodiments, the number of the rear cameras is at least two, and each rear camera is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, camera assembly 1206 may also include a flash. The flash lamp can be a monochrome temperature flash lamp or a bicolor temperature flash lamp. The double-color-temperature flash lamp is a combination of a warm-light flash lamp and a cold-light flash lamp, and can be used for light compensation at different color temperatures.
The audio circuitry 1207 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals into the processor 1201 for processing or inputting the electric signals into the radio frequency circuit 1204 to achieve voice communication. For stereo capture or noise reduction purposes, multiple microphones may be provided at different locations of terminal 1200. The microphone may also be an array microphone or an omni-directional pick-up microphone. The speaker is used to convert electrical signals from the processor 1201 or the radio frequency circuit 1204 into sound waves. The loudspeaker can be a traditional film loudspeaker or a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, the speaker can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In some embodiments, the audio circuitry 1207 may also include a headphone jack.
The positioning component 1208 is used to locate the current geographic location of the terminal 1200 to implement navigation or LBS (Location Based Service). The positioning component 1208 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.
The power supply 1209 is used to provide power to various components within the terminal 1200. The power source 1209 may be alternating current, direct current, disposable or rechargeable. When the power source 1209 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. The wired rechargeable battery is a battery charged through a wired line, and the wireless rechargeable battery is a battery charged through a wireless coil. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, terminal 1200 also includes one or more sensors 1210. The one or more sensors 1210 include, but are not limited to: acceleration sensor 1211, gyro sensor 1212, pressure sensor 1213, fingerprint sensor 1214, optical sensor 1215, and proximity sensor 1216.
The acceleration sensor 1211 can detect magnitudes of accelerations on three coordinate axes of the coordinate system established with the terminal 1200. For example, the acceleration sensor 1211 may be used to detect components of the gravitational acceleration in three coordinate axes. The processor 1201 may control the display screen 1205 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 1211. The acceleration sensor 1211 may also be used for acquisition of motion data of a game or a user.
The gyro sensor 1212 may detect a body direction and a rotation angle of the terminal 1200, and the gyro sensor 1212 may collect a 3D motion of the user on the terminal 1200 in cooperation with the acceleration sensor 1211. The processor 1201 can implement the following functions according to the data collected by the gyro sensor 1212: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.
Pressure sensors 1213 may be disposed on the side frames of terminal 1200 and/or underlying display 1205. When the pressure sensor 1213 is disposed on the side frame of the terminal 1200, the user's holding signal of the terminal 1200 can be detected, and the processor 1201 performs left-right hand recognition or shortcut operation according to the holding signal collected by the pressure sensor 1213. When the pressure sensor 1213 is disposed at a lower layer of the display screen 1205, the processor 1201 controls the operability control on the UI interface according to the pressure operation of the user on the display screen 1205. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.
The fingerprint sensor 1214 is used for collecting a fingerprint of the user, and the processor 1201 identifies the user according to the fingerprint collected by the fingerprint sensor 1214, or the fingerprint sensor 1214 identifies the user according to the collected fingerprint. When the user identity is identified as a trusted identity, the processor 1201 authorizes the user to perform relevant sensitive operations, including unlocking a screen, viewing encrypted information, downloading software, paying, changing settings, and the like. The fingerprint sensor 1214 may be provided on the front, back, or side of the terminal 1200. When a physical button or vendor Logo is provided on the terminal 1200, the fingerprint sensor 1214 may be integrated with the physical button or vendor Logo.
The optical sensor 1215 is used to collect the ambient light intensity. In one embodiment, the processor 1201 may control the display brightness of the display 1205 according to the ambient light intensity collected by the optical sensor 1215. Specifically, when the ambient light intensity is high, the display brightness of the display 1205 is increased; when the ambient light intensity is low, the display brightness of the display 1205 is turned down. In another embodiment, the processor 1201 may also dynamically adjust the shooting parameters of the camera assembly 1206 based on the ambient light intensity collected by the optical sensor 1215.
A proximity sensor 1216, also known as a distance sensor, is typically disposed on the front panel of the terminal 1200. The proximity sensor 1216 is used to collect the distance between the user and the front of the terminal 1200. In one embodiment, when the proximity sensor 1216 detects that the distance between the user and the front of the terminal 1200 gradually decreases, the processor 1201 controls the display 1205 to switch from the screen-on state to the screen-off state; when the proximity sensor 1216 detects that the distance gradually increases, the processor 1201 controls the display 1205 to switch from the screen-off state to the screen-on state.
Those skilled in the art will appreciate that the configuration shown in fig. 12 is not intended to be limiting of terminal 1200 and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components may be used.
The electronic device in the above method embodiment may be implemented as a server. For example, fig. 13 is a schematic structural diagram of a server according to an embodiment of the present application, where the server 1300 may generate a relatively large difference due to different configurations or performances, and may include one or more processors (CPUs) 1301 and one or more memories 1302, where the memory 1302 stores at least one program code, and the at least one program code is loaded and executed by the processor 1301 to implement the voice signal Processing method provided by each method embodiment. Of course, the server may also have a wired or wireless network interface, an input/output interface, and other components to facilitate input and output, and the server may also include other components for implementing the functions of the device, which are not described herein again.
In an exemplary embodiment, there is also provided a computer-readable storage medium, such as a memory, including at least one program code, the at least one program code being executable by a processor to perform the speech signal processing method in the above-described embodiments. For example, the computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a Compact Disc Read-Only Memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, a computer program product is also provided, the computer program product or the computer program comprising one or more program codes, the one or more program codes being stored in a computer readable storage medium. The one or more processors of the computer apparatus can read the one or more program codes from the computer-readable storage medium, and the one or more processors execute the one or more program codes, so that the computer apparatus can perform the voice signal processing method shown in the above-described embodiment.
It should be understood that, in the various embodiments of the present application, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
It should be understood that determining B from a does not mean determining B from a alone, but may also be determined from a and/or other information.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is intended only to be an alternative embodiment of the present application, and not to limit the present application, and any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (15)

1. A method of speech signal processing, the method comprising:
separating the voice signals based on the separation parameters to obtain at least one target signal;
determining corresponding shape parameters for the at least one target signal, respectively, different shape parameters corresponding to different probability distributions;
obtaining the probability density distribution of the at least one target signal according to the shape parameter corresponding to the at least one target signal, wherein the probability density distribution of one target signal obeys the probability distribution corresponding to the shape parameter of the target signal;
determining independence between the at least one target signal according to the probability density distribution and the separation parameter;
and updating the shape parameters and the separation parameters according to the independence until the shape parameters and the separation parameters meet target conditions, and obtaining at least one target voice signal separated based on the updated separation parameters.
2. The method according to claim 1, wherein the obtaining a probability density distribution of the at least one target signal according to a shape parameter corresponding to the at least one target signal comprises:
for a target signal, acquiring a mixing coefficient of at least two signal components of at least one frame in the target signal;
acquiring a mixing coefficient and variance information of at least two signal components of the target signal according to the mixing coefficient of at least two signal components of at least one frame in the target signal;
acquiring probability density distribution of the at least two signal components according to the variance information of the at least two signal components;
and weighting the probability density distribution of the at least two signal components according to the mixing coefficient of the at least two signal components of the target signal to obtain the probability density distribution of the target signal.
3. The method of claim 2, wherein obtaining the mixing coefficients of at least two signal components of at least one frame of the target signal comprises:
for any signal component of any frame in the target signal, acquiring the product of a mixing coefficient and probability density distribution determined in the last iteration process of the signal component;
and acquiring the proportion of the product of any signal component and the sum of the products of at least two signal components of any frame, and taking the proportion as a mixing coefficient of any signal component.
4. The method according to claim 2, wherein the obtaining the mixing coefficient of at least two signal components of the target signal according to the mixing coefficient of at least two signal components of at least one frame of the target signal comprises:
and for any signal component of the target signal, acquiring an average value of the mixing coefficient of the signal component in at least one frame, and taking the average value as the mixing coefficient of the signal component.
5. The method according to claim 2, wherein the variance information of any one of the at least two signal components comprises a variance of the signal component at any frequency point and a variance weight of the signal component in any frame.
6. The method of claim 2, wherein the separation parameter is at least one separation matrix;
the updating process of the separation parameter comprises the following steps:
updating a weight matrix of the at least one separation matrix according to the at least one target signal obtained from the last iteration, a mixing coefficient of at least two signal components of at least one frame in the at least one target signal, and the variance information of the at least two components of the at least one target signal;
and updating the at least one separation matrix according to the updated weight matrix.
7. The method of claim 1, wherein the determining the corresponding shape parameters for the at least one target signal comprises:
and randomly selecting the shape parameter corresponding to the at least one target signal from at least two preset shape parameters.
8. The method of claim 1, wherein said updating the shape parameter and the separation parameter according to the independence comprises:
in response to not meeting the target condition, updating the shape parameter and the separation parameter;
and repeating the steps of separating the signals, obtaining a probability density distribution and determining independence based on the updated shape parameters and separation parameters until the target condition is met, and stopping the updating step.
9. The method according to claim 1, wherein the target condition is that the independence converges, or the target condition is that a difference between the independence and an independence obtained from a previous iteration is smaller than a target difference, or the target condition is that a target number of iterations is reached.
10. The method of claim 1, wherein the at least one target signal is at least one target frequency domain signal;
the separating the voice signal based on the separation parameter to obtain at least one target signal includes:
carrying out short-time Fourier transform on a voice signal to obtain a frequency domain signal of the voice signal;
separating the frequency domain signals based on the separation parameters to obtain the at least one target frequency domain signal;
the updating the shape parameter and the separation parameter according to the independence between the at least one target signal determined by the probability density distribution and the separation parameter until the independence meets a target condition to obtain at least one target voice signal separated based on the updated separation parameter, includes:
updating the shape parameters and the separation parameters according to the independence between the at least one target signal determined by the probability density distribution and the separation parameters until the independence meets a target condition, and obtaining at least one target frequency domain signal separated based on the updated separation parameters;
and performing short-time Fourier inverse transformation on the at least one target frequency domain signal to obtain the at least one target voice signal.
11. A speech signal processing apparatus, characterized in that the apparatus comprises:
a separation module, configured to separate a voice signal based on a separation parameter to obtain at least one target signal;
a determining module, configured to determine corresponding shape parameters for the at least one target signal respectively, different shape parameters corresponding to different probability distributions;
an obtaining module, configured to obtain a probability density distribution of the at least one target signal according to the shape parameter corresponding to the at least one target signal, where the probability density distribution of one target signal obeys the probability distribution corresponding to the shape parameter of that target signal;
the determining module being further configured to determine the independence among the at least one target signal according to the probability density distribution and the separation parameter; and
an updating module, configured to update the shape parameter and the separation parameter according to the independence until the independence meets a target condition, to obtain at least one target voice signal separated based on the updated separation parameter.
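(Illustration: read as software, the claimed modules map onto a small class; the bodies are left unimplemented because claim 11 fixes only the division of responsibilities, not the algorithms.)

    class SpeechSignalProcessor:
        def separate(self, speech, separation_params):       # separation module
            raise NotImplementedError
        def determine_shapes(self, targets):                 # determining module
            raise NotImplementedError
        def density(self, targets, shapes):                  # obtaining module
            raise NotImplementedError
        def independence(self, pdf, separation_params):      # determining module (further configured)
            raise NotImplementedError
        def update(self, shapes, separation_params, indep):  # updating module
            raise NotImplementedError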
12. The apparatus of claim 11, wherein the obtaining module comprises a first obtaining unit, a second obtaining unit, a third obtaining unit, and a weighting unit;
the first obtaining unit is configured to obtain a mixing coefficient of at least two signal components of at least one frame in a target signal;
the second obtaining unit is configured to obtain the mixing coefficients and the variance information of the at least two signal components of the target signal according to the mixing coefficient of the at least two signal components of the at least one frame in the target signal;
the third obtaining unit is configured to obtain probability density distributions of the at least two signal components according to the variance information of the at least two signal components; and
the weighting unit is configured to weight the probability density distributions of the at least two signal components according to the mixing coefficients of the at least two signal components of the target signal, to obtain the probability density distribution of the target signal.
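(Illustration: the weighting in claim 12, assuming zero-mean complex Gaussian component densities; the claim does not fix the component family, so the Gaussian choice and all names here are assumptions.)

    import numpy as np

    def target_signal_density(y, alpha, var):
        # y: (T,) frames of one target signal (complex STFT values)
        # alpha: (C,) mixing coefficients of C signal components, summing to 1
        # var: (C,) variance information of each signal component
        comp = np.exp(-np.abs(y[:, None]) ** 2 / var) / (np.pi * var)  # (T, C) component densities
        return comp @ alpha                                            # weighted density per frame

For example, alpha = np.array([0.7, 0.3]) with var = np.array([1.0, 5.0]) models frames that are mostly low-variance with occasional high-variance bursts.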
13. The apparatus of claim 11, wherein the updating module is configured to:
in response to the target condition not being met, update the shape parameter and the separation parameter; and
repeat the steps of separating the signal, obtaining the probability density distribution, and determining the independence based on the updated shape parameter and separation parameter, stopping the updating when the target condition is met.
14. An electronic device, comprising one or more processors and one or more memories having at least one program code stored therein, the at least one program code being loaded and executed by the one or more processors to implement the speech signal processing method according to any one of claims 1 to 10.
15. A computer-readable storage medium, characterized in that at least one program code is stored in the storage medium, which is loaded and executed by a processor to implement the speech signal processing method according to any one of claims 1 to 10.
CN202010754241.0A 2020-07-30 2020-07-30 Voice signal processing method, device, equipment and storage medium Active CN111863020B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010754241.0A CN111863020B (en) 2020-07-30 2020-07-30 Voice signal processing method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010754241.0A CN111863020B (en) 2020-07-30 2020-07-30 Voice signal processing method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111863020A 2020-10-30
CN111863020B CN111863020B (en) 2022-09-20

Family

ID=72946368

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010754241.0A Active CN111863020B (en) 2020-07-30 2020-07-30 Voice signal processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111863020B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101819782A (en) * 2010-03-10 2010-09-01 Chongqing University of Posts and Telecommunications Variable-step adaptive blind source separation method and blind source separation system
CN106356075A (en) * 2016-09-29 2017-01-25 Hefei Hualing Co., Ltd. Blind voice separation method and structure, voice control system and electric appliance assembly
CN108766459A (en) * 2018-06-13 2018-11-06 Beijing Union University Target speaker estimation method and system for multi-person mixed speech
US20200082838A1 (en) * 2018-09-07 2020-03-12 Nanjing Horizon Robotics Technology Co., Ltd. Method, apparatus for blind signal separating and electronic device
CN110890098A (en) * 2018-09-07 2020-03-17 Nanjing Horizon Robotics Technology Co., Ltd. Blind signal separation method and device and electronic equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
NOBUTAKA ONO: "Auxiliary-function-based independent vector analysis with power of vector-norm type weighting functions", Proceedings of the 2012 Asia Pacific Signal and Information Processing Association Annual Summit and Conference *
SHI Xizhi et al.: "Blind Signal Processing: Theory and Practice", Shanghai Jiao Tong University Press, 31 March 2008 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112637742A (en) * 2020-12-29 2021-04-09 北京安声浩朗科技有限公司 Signal processing method and signal processing device, storage medium and earphone
CN113113044A (en) * 2021-03-23 2021-07-13 北京小米移动软件有限公司 Audio processing method and device, terminal and storage medium
CN113919374A (en) * 2021-09-08 2022-01-11 荣耀终端有限公司 Method for translating speech, electronic device, storage medium and program product
CN113919374B (en) * 2021-09-08 2022-06-24 荣耀终端有限公司 Method for translating voice, electronic equipment and storage medium
CN115086887A (en) * 2022-05-11 2022-09-20 山东工商学院 Instant messaging system and method based on 5G local area network
CN115086887B (en) * 2022-05-11 2023-11-24 山东工商学院 Instant messaging system based on 5G local area network

Also Published As

Publication number Publication date
CN111863020B (en) 2022-09-20

Similar Documents

Publication Publication Date Title
CN111179961B (en) Audio signal processing method and device, electronic equipment and storage medium
CN111863020B (en) Voice signal processing method, device, equipment and storage medium
CN109299315B (en) Multimedia resource classification method and device, computer equipment and storage medium
CN107464564B (en) Voice interaction method, device and equipment
WO2021135628A1 (en) Voice signal processing method and speech separation method
CN111063342B (en) Speech recognition method, speech recognition device, computer equipment and storage medium
CN111696570B (en) Voice signal processing method, device, equipment and storage medium
CN111696532B (en) Speech recognition method, device, electronic equipment and storage medium
CN111986691B (en) Audio processing method, device, computer equipment and storage medium
CN109887494B (en) Method and apparatus for reconstructing a speech signal
CN113763933B (en) Speech recognition method, training method, device and equipment of speech recognition model
CN112233689B (en) Audio noise reduction method, device, equipment and medium
CN111581958A (en) Conversation state determining method and device, computer equipment and storage medium
CN111613213B (en) Audio classification method, device, equipment and storage medium
CN114333774A (en) Speech recognition method, speech recognition device, computer equipment and storage medium
CN111816180B (en) Method, device, equipment, system and medium for controlling elevator based on voice
CN112291672B (en) Speaker control method, control device and electronic equipment
CN112233688B (en) Audio noise reduction method, device, equipment and medium
CN115168643B (en) Audio processing method, device, equipment and computer readable storage medium
CN111341307A (en) Voice recognition method and device, electronic equipment and storage medium
CN112750449B (en) Echo cancellation method, device, terminal, server and storage medium
CN113744736B (en) Command word recognition method and device, electronic equipment and storage medium
CN115331689A (en) Training method, device, equipment, storage medium and product of voice noise reduction model
CN112750425B (en) Speech recognition method, device, computer equipment and computer readable storage medium
CN113920979A (en) Voice data acquisition method, device, equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40030081

Country of ref document: HK

GR01 Patent grant