CN111554316A - Speech processing apparatus, method and medium - Google Patents

Speech processing apparatus, method and medium

Info

Publication number
CN111554316A
CN111554316A (application CN201910066430.6A)
Authority
CN
China
Prior art keywords
speech
discriminator
generator
separated
single speech
Prior art date
Legal status
Pending
Application number
CN201910066430.6A
Other languages
Chinese (zh)
Inventor
石自强
林慧镔
刘柳
刘汝杰
Current Assignee
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date
Filing date
Publication date
Application filed by Fujitsu Ltd filed Critical Fujitsu Ltd
Priority to CN201910066430.6A
Priority to JP2020004983A
Publication of CN111554316A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training

Abstract

Disclosed is a speech processing apparatus including: a generator configured to separate a mixed voice including two or more original single voices into two or more separated single voices; and a discriminator configured to distinguish whether the separated single speech is the original single speech, wherein the generator and the discriminator are trained until the discriminator is no longer able to distinguish whether the separated single speech is the original single speech. The apparatus according to the present disclosure not only aims to maximize the signal-to-distortion ratio for better speech quality but also integrates speech separation and improved speech quality into a single model. Furthermore, the apparatus according to the present disclosure performs generative adversarial training in this process, which makes the separated speech difficult to distinguish from the real speech.

Description

Speech processing apparatus, method and medium
Technical Field
The present disclosure relates to the field of speech processing, and in particular to speech processing apparatus and methods employing combined machine learning techniques.
Background
This section provides background information related to the present disclosure, which is not necessarily prior art.
Single-channel multi-speaker speech separation has wide application. For example, in a home or conference environment where many people speak at once, the human auditory system can easily track and follow the speech of a target speaker within the mixed speech of multiple speakers. In this case, if automatic speech recognition and speaker recognition are to be performed, a clean speech signal of the target speaker needs to be separated from the mixed speech before the subsequent recognition work can be completed. Therefore, in order to achieve satisfactory performance in speech or speaker recognition tasks, the speech separation problem must be solved first.
Disclosure of Invention
This section provides a general summary of the disclosure, and is not a comprehensive disclosure of its full scope or all of its features.
It is an object of the present disclosure to provide an efficient end-to-end arrangement for automatic speech separation. The apparatus according to the present disclosure is not only intended to maximize the signal-to-distortion ratio (SDR) to obtain better speech quality; it also integrates speech separation and improved speech quality into a single model. In this process, the technical solution according to the present disclosure performs generative adversarial training, which makes the separated speech difficult to distinguish from the real speech.
According to an aspect of the present disclosure, there is provided a voice processing apparatus including: a generator configured to separate a mixed voice including two or more original single voices into two or more separated single voices; and a discriminator configured to discriminate whether the separated single speech is the original single speech, wherein the generator and the discriminator are trained until the discriminator is no longer able to discriminate whether the separated single speech is the original single speech.
According to another aspect of the present disclosure, there is provided a speech processing method including: separating, by a generator, a mixed speech including two or more original single speeches into two or more separated single speeches; and distinguishing, by a discriminator, whether the separated single speech is the original single speech, wherein the generator and the discriminator are trained until the discriminator is no longer able to distinguish whether the separated single speech is the original single speech.
According to another aspect of the present disclosure, there is provided a program product comprising machine-readable instruction code stored therein, wherein the instruction code, when read and executed by a computer, is capable of causing the computer to perform a speech processing method according to the present disclosure.
According to another aspect of the present disclosure, a machine-readable storage medium is provided, having embodied thereon a program product according to the present disclosure.
By using the speech processing apparatus and method according to the present disclosure, the quality of the separated voices can be improved while the mixed voices are being separated.
Further areas of applicability will become apparent from the description provided herein. The description and specific examples in this summary are intended for purposes of illustration only and are not intended to limit the scope of the present disclosure.
Drawings
The drawings described herein are for illustrative purposes only of selected embodiments and not all possible implementations, and are not intended to limit the scope of the present disclosure. In the drawings:
FIG. 1 is a block diagram of a speech processing apparatus 100 according to one embodiment of the present disclosure;
FIG. 2 is a flow diagram of a method of speech processing according to one embodiment of the present disclosure; and
FIG. 3 is a block diagram of an exemplary structure of a general-purpose personal computer in which a speech processing apparatus and a speech processing method according to an embodiment of the present disclosure can be implemented.
While the disclosure is susceptible to various modifications and alternative forms, specific embodiments thereof have been shown by way of example in the drawings and are herein described in detail. It should be understood, however, that the description herein of specific embodiments is not intended to limit the disclosure to the particular forms disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the disclosure. It is noted that throughout the several views, corresponding reference numerals indicate corresponding parts.
Detailed Description
Examples of the present disclosure will now be described more fully with reference to the accompanying drawings. The following description is merely exemplary in nature and is not intended to limit the present disclosure, application, or uses.
Example embodiments are provided so that this disclosure will be thorough, and will fully convey the scope to those skilled in the art. Numerous specific details are set forth such as examples of specific components, devices, and methods to provide a thorough understanding of embodiments of the present disclosure. It will be apparent to those skilled in the art that specific details need not be employed, that example embodiments may be embodied in many different forms, and that neither should be construed to limit the scope of the disclosure. In certain example embodiments, well-known processes, well-known structures, and well-known technologies are not described in detail.
The apparatus according to the present disclosure not only aims to maximize SDR for better speech quality but also integrates speech separation and improved speech quality into a single model. In this process, the technical solution according to the present disclosure performs generative adversarial training, which makes the separated speech difficult to distinguish from the real speech.
According to one embodiment of the present disclosure, a speech processing apparatus is provided. The speech processing apparatus includes a generator and a discriminator. The generator may be configured to separate a mixed voice including two or more original single voices into two or more separated single voices. The discriminator may be configured to distinguish whether the separated single speech is the original single speech. The generator and the discriminator are trained until the discriminator is no longer able to distinguish whether the separated single speech is the original single speech.
As shown in fig. 1, a speech processing apparatus 100 according to the present disclosure may include a generator 101 and a discriminator 102.
The generator 101 may be configured to separate a mixed voice including two or more original single voices into two or more separated single voices. For example, in the context of a two-person conversation, when two persons, A and B, speak simultaneously, the generator 101 according to the present disclosure can separate the speech of A and B mixed together into a single speech of A and a single speech of B. Here, it should be clear to those skilled in the art that the above-described environment of a two-person conversation between A and B is merely exemplary, and the present disclosure is not limited thereto. For ease of understanding, embodiments of the present disclosure will be described in detail below in such an exemplary environment.
Next, the discriminator 102 may be configured to distinguish whether the separated single speech is the original single speech. For example, the discriminator 102 may be configured to distinguish whether the separated single speech of A is the real speech of A, and whether the separated single speech of B is the real speech of B.
The generator 101 and the discriminator 102 may be trained until the discriminator 102 is no longer able to distinguish whether the separated single speech is the original single speech. For example, the generator 101 and the discriminator 102 may be trained until the discriminator 102 is no longer able to distinguish whether the separated single speech of A is the real speech of A and whether the separated single speech of B is the real speech of B.
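For illustration only, the following sketch shows one way such a generator and discriminator could be realized. The disclosure does not fix a network architecture, so the fully connected layers, the frame length of 512 samples, and the two-speaker output layout used here are assumptions rather than part of the disclosed apparatus:

    # Illustrative sketch only: the architecture, frame length and
    # two-speaker layout are assumptions, not part of the disclosure.
    import torch
    import torch.nn as nn

    FRAME_LEN = 512  # assumed waveform frame length (hypothetical)

    class Generator(nn.Module):
        """Separates a mixed-speech frame into two single-speech frames."""
        def __init__(self, frame_len: int = FRAME_LEN, hidden: int = 1024):
            super().__init__()
            self.frame_len = frame_len
            self.net = nn.Sequential(
                nn.Linear(frame_len, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 2 * frame_len),  # two separated outputs
            )

        def forward(self, mix: torch.Tensor) -> torch.Tensor:
            out = self.net(mix)                     # (batch, 2 * frame_len)
            return out.view(-1, 2, self.frame_len)  # (batch, 2, frame_len)

    class Discriminator(nn.Module):
        """Scores a single-speech frame; once trained, the score is near 1
        for original speech and near 0 for separated speech."""
        def __init__(self, frame_len: int = FRAME_LEN, hidden: int = 512):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(frame_len, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            )

        def forward(self, speech: torch.Tensor) -> torch.Tensor:
            return self.net(speech)  # raw score compared against 1 or 0

The least-squares formulation that gives these target scores of 1 and 0 is defined in equations (4) and (5) below.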
According to one embodiment of the present disclosure, training the generator may include minimizing a loss function of a signal-to-distortion ratio of the separated single voices. For example, training the generator 101 may include minimizing a loss function of the signal-to-distortion ratios of the separated single speech of A and the separated single speech of B.
According to one embodiment of the present disclosure, training the generator may further include transforming the original single speech into the same mapping space as the separated single speech. For example, training the generator 101 may also include transforming the real speech of A and the real speech of B into the same mapping space as the separated single speech of A and the separated single speech of B.
According to an embodiment of the present disclosure, training the generator may further include calculating an error between the separated single speech and the transformed original single speech. For example, training the generator 101 may also include calculating an error between the separated single speech of A and the transformed real speech of A, and calculating an error between the separated single speech of B and the transformed real speech of B.
For example, according to an embodiment of the present disclosure, a negative signal-to-distortion ratio may be used as the training target of the loss function of the generator 101. The signal-to-distortion ratio can be calculated as:
s_target = (⟨s, t⟩ / ‖t‖²) · t    (1)
e_noise = s − s_target    (2)
SDR = 10 · log₁₀(‖s_target‖² / ‖e_noise‖²)    (3)
where t is the original single voice, i.e., the real voice of A or B according to the present embodiment, and s is the separated single voice output by the generator 101, i.e., the separated single voice of A or B according to the present embodiment. Here, it should be apparent to those skilled in the art that the loss function of the generator 101 described above is merely exemplary, and the present disclosure is not limited thereto.
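As a minimal sketch, the negative-SDR training target of equations (1) to (3) can be written as follows; the batching convention and the eps stabilizer are implementation assumptions:

    import torch

    def neg_sdr_loss(s: torch.Tensor, t: torch.Tensor,
                     eps: float = 1e-8) -> torch.Tensor:
        """Negative SDR of separated speech s against original speech t,
        per equations (1)-(3); tensors are batched over leading dims."""
        # (1) project s onto the original speech t
        dot = torch.sum(s * t, dim=-1, keepdim=True)
        t_energy = torch.sum(t * t, dim=-1, keepdim=True) + eps
        s_target = (dot / t_energy) * t
        # (2) the residual is treated as distortion
        e_noise = s - s_target
        # (3) SDR in dB; negated so minimizing the loss maximizes SDR
        sdr = 10.0 * torch.log10(
            torch.sum(s_target ** 2, dim=-1)
            / (torch.sum(e_noise ** 2, dim=-1) + eps) + eps)
        return -sdr.mean()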
Next, according to an embodiment of the present disclosure, training the discriminator may include maximizing the ability of the discriminator to distinguish the separated single speech from the original single speech. That is, according to the embodiment of the present disclosure, the discriminator 102 is trained so that it can maximally distinguish the real speech of A from the single speech of A separated by the generator 101, or maximally distinguish the real speech of B from the single speech of B separated by the generator 101.
According to an embodiment of the present disclosure, training the discriminator may further include bringing the result output by the discriminator for the original single speech close to a predetermined threshold, and minimizing the result output by the discriminator for the separated single speech. For example, for convenience of calculation, the predetermined threshold may be set to 1, and the minimum may be represented as 0. That is, according to the embodiment of the present disclosure, the discriminator 102 is trained such that its output is close to 1 for the real speech of A or the real speech of B, and close to 0 for the single speech of A or the single speech of B separated by the generator 101. Here, it should be apparent to those skilled in the art that the selection of the predetermined threshold is merely exemplary, and the present disclosure is not limited thereto, as long as the selection of the predetermined threshold reflects maximally distinguishing the original speech from the generated single speech.
According to an embodiment of the present disclosure, training the generator may further include making the result that the discriminator outputs for the single voice separated by the generator close to the predetermined threshold. For example, for convenience of calculation, the predetermined threshold may again be set to 1. That is, according to the embodiment of the present disclosure, training the generator 101 may make the decision result of the discriminator 102 for the separated single speech of A or the separated single speech of B generated by the generator 101 close to 1. Likewise, it should be clear to those skilled in the art that the selection of the predetermined threshold is merely exemplary, and the present disclosure is not limited thereto.
Thus, the loss functions for generative adversarial training can be defined as:
min_D L(D) = E[(D(t) − 1)²] + E[(D(G(m)))²]    (4)
min_G L(G) = E[(D(G(m)) − 1)²] + μ · L_SDR    (5)
where G denotes the generator, D denotes the discriminator, t is an original single voice, i.e., the real voice of A or the real voice of B according to the present embodiment, m is the mixed voice, i.e., the mixed voice of A and B according to the present embodiment, L_SDR is the loss function of the generator, and μ is a balance coefficient.
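A minimal sketch of these two loss functions, reusing neg_sdr_loss from above; treating the discriminator output as a raw score, detaching the generator output in the discriminator loss, and assuming a fixed speaker ordering between G(m) and t (no permutation-invariant matching) are implementation assumptions:

    import torch

    def d_loss(D, G, t: torch.Tensor, m: torch.Tensor) -> torch.Tensor:
        """Equation (4): push D(t) toward 1 and D(G(m)) toward 0."""
        fake = G(m).detach()  # no backpropagation into the generator here
        return ((D(t) - 1.0) ** 2).mean() + (D(fake) ** 2).mean()

    def g_loss(D, G, t: torch.Tensor, m: torch.Tensor,
               mu: float = 1.0) -> torch.Tensor:
        """Equation (5): push D(G(m)) toward 1, plus the weighted SDR loss."""
        sep = G(m)
        adv = ((D(sep) - 1.0) ** 2).mean()
        return adv + mu * neg_sdr_loss(sep, t)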
Using the speech processing apparatus according to the present disclosure can improve the quality of separated speech while separating mixed speech.
According to one embodiment of the present disclosure, the parameters of the generator and the parameters of the discriminator may be alternately updated when the generator and the discriminator are trained together. For example, the parameters of the generator may be kept constant while the parameters of the discriminator are trained m times. The parameters of the discriminator may then be kept constant while the parameters of the generator are trained k times. The generator and the discriminator are alternately trained in this way until the discriminator is no longer able to distinguish whether the separated single speech is the original single speech, as sketched below.
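A sketch of one round of this alternating update, reusing the losses sketched above; m_steps and k_steps correspond to m and k, and the use of two separate Adam optimizers is an assumption:

    import torch

    def train_round(G, D, opt_g, opt_d, mix, originals,
                    m_steps: int = 1, k_steps: int = 1, mu: float = 1.0):
        """Updates the discriminator m_steps times (generator output
        detached), then the generator k_steps times (only opt_g steps)."""
        for _ in range(m_steps):
            opt_d.zero_grad()
            d_loss(D, G, originals, mix).backward()
            opt_d.step()
        for _ in range(k_steps):
            opt_g.zero_grad()
            g_loss(D, G, originals, mix, mu).backward()
            opt_g.step()

    # Example wiring (hypothetical): rounds are repeated over the training
    # data until the discriminator can no longer tell the speeches apart.
    # G, D = Generator(), Discriminator()
    # opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
    # opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)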
The speech processing apparatus according to the present disclosure performs generative adversarial training, which makes the separated speech difficult to distinguish from the real speech. Furthermore, the speech processing apparatus according to the present disclosure not only aims to maximize SDR for better speech quality but also integrates speech separation and improved speech quality into a single model.
A speech processing method according to an embodiment of the present disclosure will be described below with reference to fig. 2. As shown in fig. 2, a speech processing method according to an embodiment of the present disclosure starts at step S210.
In step S210, a mixed voice including two or more original single voices is separated into two or more separated single voices by a generator.
Next, in step S220, it is discriminated by the discriminator whether the separated single voice is the original single voice.
Next, in step S230, the generator and the discriminator are trained until the discriminator can no longer distinguish whether the separated single speech is the original single speech. As long as the discriminator can still distinguish the separated single speech from the original single speech (yes in S230), the training of the generator and the discriminator is repeated; that is, the process returns from step S230 and re-executes step S210 and step S220. Once the discriminator can no longer distinguish whether the separated single speech is the original single speech (no in S230), the process ends.
The speech processing method according to an embodiment of the present disclosure further includes the step of minimizing a loss function of a signal-to-distortion ratio of the separated single speech.
The speech processing method according to an embodiment of the present disclosure further includes a step of transforming the original single speech into the same mapping space as the separated single speech.
The speech processing method according to an embodiment of the present disclosure further includes the step of calculating an error between the separated single speech and the transformed original single speech.
The speech processing method according to an embodiment of the present disclosure further includes a step of maximizing the ability of the discriminator to distinguish the separated single speech from the original single speech.
The speech processing method according to an embodiment of the present disclosure further includes the steps of bringing the result output by the discriminator for the original single speech close to a predetermined threshold, and minimizing the result output by the discriminator for the separated single speech.
The speech processing method according to an embodiment of the present disclosure further includes a step of making the result that the discriminator outputs for the single speech separated by the generator close to the predetermined threshold.
According to the speech processing method of one embodiment of the present disclosure, when the generator and the discriminator are trained together, the parameters of the generator and the parameters of the discriminator are alternately updated.
Various embodiments of the above steps of the voice processing method according to the embodiment of the present disclosure have been described in detail above, and a description thereof will not be repeated here.
It is apparent that the respective operational procedures of the voice processing method according to the present disclosure can be implemented in the form of computer-executable programs stored in various machine-readable storage media.
Moreover, the object of the present disclosure can also be achieved by directly or indirectly supplying a storage medium storing the above executable program code to a system or an apparatus, whose computer or central processing unit (CPU) then reads out and executes the program code. As long as the system or the apparatus has a function of executing a program, the embodiments of the present disclosure are not limited to the program, and the program may take any form, for example, an object program, a program executed by an interpreter, or a script program provided to an operating system.
Such machine-readable storage media include, but are not limited to: various memories and storage units; semiconductor devices; disk units such as optical disks, magnetic disks, and magneto-optical disks; and other media suitable for storing information.
In addition, the computer can also implement the technical solution of the present disclosure by connecting to a corresponding website on the internet, downloading and installing the computer program code according to the present disclosure into the computer and then executing the program.
Fig. 3 is a block diagram of an exemplary structure of a general-purpose personal computer 1300 in which a voice processing method according to an embodiment of the present disclosure can be implemented.
As shown in fig. 3, the CPU 1301 executes various processes in accordance with a program stored in a Read Only Memory (ROM) 1302 or a program loaded from a storage portion 1308 into a Random Access Memory (RAM) 1303. Data necessary for the CPU 1301 to execute various processes is also stored in the RAM 1303 as needed. The CPU 1301, the ROM 1302, and the RAM 1303 are connected to each other via a bus 1304. An input/output interface 1305 is also connected to the bus 1304.
The following components are connected to the input/output interface 1305: an input portion 1306 (including a keyboard, a mouse, and the like), an output portion 1307 (including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), a speaker, and the like), a storage portion 1308 (including a hard disk and the like), and a communication portion 1309 (including a network interface card such as a LAN card, a modem, and the like). The communication portion 1309 performs communication processing via a network such as the Internet. A drive 1310 may also be connected to the input/output interface 1305 as needed. A removable medium 1311 such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory may be mounted on the drive 1310 as needed, so that a computer program read out therefrom is installed into the storage portion 1308 as needed.
In the case where the above-described series of processes is realized by software, a program constituting the software is installed from a network such as the internet or a storage medium such as the removable medium 1311.
It should be understood by those skilled in the art that such a storage medium is not limited to the removable medium 1311 shown in fig. 3, which stores the program and is distributed separately from the apparatus so as to provide the program to the user. Examples of the removable medium 1311 include a magnetic disk (including a floppy disk (registered trademark)), an optical disk (including a compact disc read only memory (CD-ROM) and a Digital Versatile Disc (DVD)), a magneto-optical disk (including a Mini Disk (MD) (registered trademark)), and a semiconductor memory. Alternatively, the storage medium may be the ROM 1302, a hard disk contained in the storage portion 1308, or the like, in which a program is stored and which is distributed to the user together with the apparatus containing it.
In the systems and methods of the present disclosure, it is apparent that individual components or steps may be broken down and/or recombined. These decompositions and/or recombinations are to be considered equivalents of the present disclosure. Also, the steps of executing the series of processes described above may naturally be executed chronologically in the order described, but need not necessarily be executed chronologically. Some steps may be performed in parallel or independently of each other.
Although the embodiments of the present disclosure have been described in detail with reference to the accompanying drawings, it should be understood that the above-described embodiments are merely illustrative of the present disclosure and do not constitute a limitation of the present disclosure. It will be apparent to those skilled in the art that various modifications and variations can be made in the above-described embodiments without departing from the spirit and scope of the disclosure. Accordingly, the scope of the disclosure is to be defined only by the claims appended hereto, and by their equivalents.
With respect to the embodiments including the above embodiments, the following remarks are also disclosed:
supplementary note 1. a speech processing apparatus comprising:
a generator configured to separate a mixed voice including two or more original single voices into two or more separated single voices; and
a discriminator configured to discriminate whether the separated single speech is the original single speech,
wherein the generator and the discriminator are trained until the discriminator is no longer able to distinguish whether the separated single speech is the original single speech.
Supplementary note 2. The apparatus according to supplementary note 1, wherein training the generator comprises minimizing a loss function of a signal-to-distortion ratio of the separated single speech.
Supplementary note 3. The apparatus according to supplementary note 2, wherein training the generator further comprises transforming the original single speech into the same mapping space as the separated single speech.
Supplementary note 4. The apparatus according to supplementary note 3, wherein training the generator further comprises calculating an error between the separated single speech and the transformed original single speech.
Supplementary note 5. The apparatus according to supplementary note 1, wherein training the discriminator comprises maximizing the ability of the discriminator to distinguish the separated single speech from the original single speech.
Supplementary note 6. The apparatus according to supplementary note 5, wherein training the discriminator comprises bringing the result output by the discriminator for the original single speech close to a predetermined threshold, and minimizing the result output by the discriminator for the separated single speech.
Supplementary note 7. The apparatus according to supplementary note 6, wherein training the generator further comprises making the result that the discriminator outputs for the single speech separated by the generator close to the predetermined threshold.
Supplementary note 8. The apparatus according to supplementary note 1, wherein the parameters of the generator and the parameters of the discriminator are alternately updated while training the generator and the discriminator together.
Supplementary note 9. A speech processing method, comprising:
separating, by a generator, a mixed speech including two or more original single speeches into two or more separated single speeches; and
distinguishing, by a discriminator, whether the separated single speech is the original single speech,
wherein the generator and the discriminator are trained until the discriminator is no longer able to distinguish whether the separated single speech is the original single speech.
Supplementary note 10. The method according to supplementary note 9, wherein training the generator comprises minimizing a loss function of a signal-to-distortion ratio of the separated single speech.
Supplementary note 11. The method according to supplementary note 10, wherein training the generator further comprises transforming the original single speech into the same mapping space as the separated single speech.
Supplementary note 12. The method according to supplementary note 11, wherein training the generator further comprises calculating an error between the separated single speech and the transformed original single speech.
Supplementary note 13. The method according to supplementary note 9, wherein training the discriminator comprises maximizing the ability of the discriminator to distinguish the separated single speech from the original single speech.
Supplementary note 14. The method according to supplementary note 13, wherein training the discriminator comprises bringing the result output by the discriminator for the original single speech close to a predetermined threshold, and minimizing the result output by the discriminator for the separated single speech.
Supplementary note 15. The method according to supplementary note 14, wherein training the generator further comprises making the result that the discriminator outputs for the single speech separated by the generator close to the predetermined threshold.
Supplementary note 16. The method according to supplementary note 9, wherein the parameters of the generator and the parameters of the discriminator are alternately updated while training the generator and the discriminator together.
Supplementary note 17. A program product comprising machine-readable instruction code stored therein, wherein the instruction code, when read and executed by a computer, is capable of causing the computer to perform the method according to any one of supplementary notes 9-16.

Claims (10)

1. A speech processing apparatus comprising:
a generator configured to separate a mixed voice including two or more original single voices into two or more separated single voices; and
a discriminator configured to discriminate whether the separated single speech is the original single speech,
wherein the generator and the discriminator are trained until the discriminator is no longer able to distinguish whether the separated single speech is the original single speech.
2. The apparatus of claim 1, wherein training the generator comprises minimizing a loss function of a signal-to-distortion ratio of the separated single voices.
3. The apparatus of claim 2, wherein training the generator further comprises transforming the original single speech into the same mapping space as the separated single speech.
4. The apparatus of claim 3, wherein training the generator further comprises calculating an error between the separated single speech and the transformed original single speech.
5. The apparatus of claim 1, wherein training the discriminator comprises maximizing the ability of the discriminator to distinguish the separated single speech from the original single speech.
6. The apparatus of claim 5, wherein training the discriminator comprises bringing the result output by the discriminator for the original single speech close to a predetermined threshold; and minimizing the result output by the discriminator for the separated single speech.
7. The apparatus of claim 6, wherein training the generator further comprises making the result that the discriminator outputs for the single speech separated by the generator close to the predetermined threshold.
8. The apparatus of claim 1, wherein the parameters of the generator and the parameters of the discriminator are alternately updated while training the generator and the discriminator together.
9. A method of speech processing comprising:
separating, by a generator, a mixed speech including two or more original single speeches into two or more separated single speeches; and
distinguishing by a discriminator whether the separated single speech is the original single speech,
wherein the generator and the discriminator are trained until the discriminator is no longer able to distinguish whether the separated single speech is the original single speech.
10. A machine-readable storage medium having a program product embodied thereon, the program product comprising machine-readable instruction code stored therein, wherein the instruction code, when read and executed by a computer, is capable of causing the computer to perform the method of claim 9.
CN201910066430.6A 2019-01-24 2019-01-24 Speech processing apparatus, method and medium Pending CN111554316A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910066430.6A CN111554316A (en) 2019-01-24 2019-01-24 Speech processing apparatus, method and medium
JP2020004983A JP2020118967A (en) 2019-01-24 2020-01-16 Voice processing device, data processing method, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910066430.6A CN111554316A (en) 2019-01-24 2019-01-24 Speech processing apparatus, method and medium

Publications (1)

Publication Number Publication Date
CN111554316A 2020-08-18

Family

ID=71890712

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910066430.6A Pending CN111554316A (en) 2019-01-24 2019-01-24 Speech processing apparatus, method and medium

Country Status (2)

Country Link
JP (1) JP2020118967A (en)
CN (1) CN111554316A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6236862B1 (en) * 1996-12-16 2001-05-22 Intersignal Llc Continuously adaptive dynamic signal separation and recovery system
CA2513842A1 (en) * 1999-08-23 2001-03-01 Matsushita Electric Industrial Co., Ltd. Apparatus and method for speech coding
GB0524099D0 (en) * 2005-11-26 2006-01-04 Wolfson Microelectronics Plc Audio device
US20130185070A1 (en) * 2012-01-12 2013-07-18 Microsoft Corporation Normalization based discriminative training for continuous speech recognition
US20150255085A1 (en) * 2014-03-07 2015-09-10 JVC Kenwood Corporation Noise reduction device
US20180342257A1 (en) * 2017-05-24 2018-11-29 Modulate, LLC System and Method for Building a Voice Database
CN109147810A (en) * 2018-09-30 2019-01-04 百度在线网络技术(北京)有限公司 Establish the method, apparatus, equipment and computer storage medium of speech enhan-cement network

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
HYEONG-SEOK CHOI ET AL.: "Phase-Aware Speech Enhancement with Deep Complex U-Net", published as a conference paper at ICLR 2019, pages 1-20 *
JONATHAN LE ROUX: "SDR - Half-Baked or Well Done?", arXiv:1811.02508v1 [cs.SD], pages 1-5 *
SHRIKANT VENKATARAMANI: "Adaptive Front-ends for End-to-end Source Separation", 31st Conference on Neural Information Processing Systems, pages 1-5 *
ZHE-CHENG FAN ET AL.: "SVSGAN: Singing Voice Separation Via Generative Adversarial Network", pages 726-730 *

Also Published As

Publication number Publication date
JP2020118967A (en) 2020-08-06

Similar Documents

Publication Publication Date Title
CN110709924B (en) Audio-visual speech separation
CN108630193B (en) Voice recognition method and device
Afouras et al. The conversation: Deep audio-visual speech enhancement
US6959276B2 (en) Including the category of environmental noise when processing speech signals
US8010343B2 (en) Disambiguation systems and methods for use in generating grammars
CN107274906A (en) Voice information processing method, device, terminal and storage medium
CN111696572B (en) Voice separation device, method and medium
US8145486B2 (en) Indexing apparatus, indexing method, and computer program product
EP1705645A2 (en) Apparatus and method for analysis of language model changes
US20040254793A1 (en) System and method for providing an audio challenge to distinguish a human from a computer
JP2013167666A (en) Speech recognition device, speech recognition method, and program
EP1317749B1 (en) Method of and system for improving accuracy in a speech recognition system
JP7407190B2 (en) Speech analysis device, speech analysis method and program
CN111179903A (en) Voice recognition method and device, storage medium and electric appliance
Gogate et al. Av speech enhancement challenge using a real noisy corpus
US20010056345A1 (en) Method and system for speech recognition of the alphabet
JP3939955B2 (en) Noise reduction method using acoustic space segmentation, correction and scaling vectors in the domain of noisy speech
CN110265038B (en) Processing method and electronic equipment
CN110570838B (en) Voice stream processing method and device
CN111554316A (en) Speech processing apparatus, method and medium
CN108766429B (en) Voice interaction method and device
CN111508530A (en) Speech emotion recognition method, device and storage medium
CN115831125A (en) Speech recognition method, device, equipment, storage medium and product
WO2020195924A1 (en) Signal processing device, method, and program
KR101925253B1 (en) Apparatus and method for context independent speaker indentification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination