CN111554316A - Speech processing apparatus, method and medium - Google Patents

Speech processing apparatus, method and medium

Info

Publication number
CN111554316A
CN111554316A (application CN201910066430.6A)
Authority
CN
China
Prior art keywords
speech
discriminator
generator
separated
single speech
Prior art date
Legal status
Pending
Application number
CN201910066430.6A
Other languages
Chinese (zh)
Inventor
石自强
林慧镔
刘柳
刘汝杰
Current Assignee
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date
Filing date
Publication date
Application filed by Fujitsu Ltd filed Critical Fujitsu Ltd
Priority to CN201910066430.6A
Priority to JP2020004983A
Publication of CN111554316A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training

Abstract

Disclosed is a speech processing apparatus including: a generator configured to separate a mixed voice including two or more original single voices into two or more separated single voices; and a discriminator configured to distinguish whether the separated single speech is the original single speech, wherein the generator and the discriminator are trained until the discriminator is no longer able to distinguish whether the separated single speech is the original single speech. The apparatus according to the present disclosure not only aims to maximize the signal-to-distortion ratio for better speech quality but also integrates speech separation and improved speech quality into a single model. Furthermore, the apparatus according to the present disclosure performs generative adversarial training in this process, which makes the separated speech difficult to distinguish from the real speech.

Description

Speech processing apparatus, method and medium
Technical Field
The present disclosure relates to the field of speech processing, and in particular to speech processing apparatus and methods employing combined machine learning techniques.
Background
This section provides background information related to the present disclosure, which is not necessarily prior art.
Single-channel multi-speaker speech separation has wide application. For example, in a home or conference environment where many people speak at once, the human auditory system can easily track and follow the speech of a target speaker within the mixed speech of multiple speakers. In this case, if automatic speech recognition and speaker recognition are to be performed, a clean speech signal of the target speaker needs to be separated from the mixed speech before the subsequent recognition work can be completed. Therefore, in order to achieve satisfactory performance in speech or speaker recognition tasks, the speech separation problem must be solved first.
Disclosure of Invention
This section provides a general summary of the disclosure, and is not a comprehensive disclosure of its full scope or all of its features.
It is an object of the present disclosure to provide an efficient end-to-end arrangement for automatic speech separation. The apparatus according to the present disclosure is not only intended to maximize the signal-to-distortion ratio (SDR) to obtain better speech quality; it also integrates speech separation and improved speech quality into a single model. In this process, the technical solution according to the present disclosure performs generative adversarial training, which makes the separated speech difficult to distinguish from the real speech.
According to an aspect of the present disclosure, there is provided a voice processing apparatus including: a generator configured to separate a mixed voice including two or more original single voices into two or more separated single voices; and a discriminator configured to discriminate whether the separated single speech is the original single speech, wherein the generator and the discriminator are trained until the discriminator is no longer able to discriminate whether the separated single speech is the original single speech.
According to another aspect of the present disclosure, there is provided a speech processing method including: separating, by a generator, a mixed speech including two or more original single speeches into two or more separated single speeches; and distinguishing, by a discriminator, whether the separated single speech is the original single speech, wherein the generator and the discriminator are trained until the discriminator is no longer able to distinguish whether the separated single speech is the original single speech.
According to another aspect of the present disclosure, there is provided a program product comprising machine-readable instruction code stored therein, wherein the instruction code, when read and executed by a computer, is capable of causing the computer to perform a speech processing method according to the present disclosure.
According to another aspect of the present disclosure, a machine-readable storage medium is provided, having embodied thereon a program product according to the present disclosure.
By using the speech processing apparatus and method according to the present disclosure, the quality of the separated voices can be improved while the mixed voices are being separated.
Further areas of applicability will become apparent from the description provided herein. The description and specific examples in this summary are intended for purposes of illustration only and are not intended to limit the scope of the present disclosure.
Drawings
The drawings described herein are for illustrative purposes only of selected embodiments and not all possible implementations, and are not intended to limit the scope of the present disclosure. In the drawings:
FIG. 1 is a block diagram of a speech processing apparatus 100 according to one embodiment of the present disclosure;
FIG. 2 is a flow diagram of a method of speech processing according to one embodiment of the present disclosure; and
FIG. 3 is a block diagram of an exemplary structure of a general-purpose personal computer in which a speech processing apparatus and a speech processing method according to an embodiment of the present disclosure can be implemented.
While the disclosure is susceptible to various modifications and alternative forms, specific embodiments thereof have been shown by way of example in the drawings and are herein described in detail. It should be understood, however, that the description herein of specific embodiments is not intended to limit the disclosure to the particular forms disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the disclosure. It is noted that throughout the several views, corresponding reference numerals indicate corresponding parts.
Detailed Description
Examples of the present disclosure will now be described more fully with reference to the accompanying drawings. The following description is merely exemplary in nature and is not intended to limit the present disclosure, application, or uses.
Example embodiments are provided so that this disclosure will be thorough, and will fully convey the scope to those skilled in the art. Numerous specific details are set forth such as examples of specific components, devices, and methods to provide a thorough understanding of embodiments of the present disclosure. It will be apparent to those skilled in the art that specific details need not be employed, that example embodiments may be embodied in many different forms, and that neither should be construed to limit the scope of the disclosure. In certain example embodiments, well-known processes, well-known structures, and well-known technologies are not described in detail.
The apparatus according to the present disclosure not only aims to maximize SDR for better speech quality but also integrates speech separation and improved speech quality into a single model. In this process, the technical solution according to the present disclosure performs generative adversarial training, which makes the separated speech difficult to distinguish from the real speech.
According to one embodiment of the present disclosure, a speech processing apparatus is provided. The speech processing apparatus includes a generator and a discriminator. The generator may be configured to separate a mixed voice including two or more original single voices into two or more separated single voices. The discriminator may be configured to distinguish whether the separated single speech is the original single speech. The generator and the discriminator are trained until the discriminator is no longer able to distinguish whether the separated single speech is the original single speech.
As shown in fig. 1, a speech processing apparatus 100 according to the present disclosure may include a generator 101 and a discriminator 102.
The generator 101 may be configured to separate a mixed voice including two or more original single voices into two or more separated single voices. For example, in the context of a two-person conversation, when two persons, A and B, speak simultaneously, the generator 101 according to the present disclosure can separate the speech of A and B mixed together into a single speech of A and a single speech of B. Here, it should be clear to those skilled in the art that the above-described environment of a two-person conversation between A and B is merely exemplary, and the present disclosure is not limited thereto. For ease of understanding, embodiments of the present disclosure will be described in detail below in such an exemplary environment.
Next, the discriminator 102 may be configured to distinguish whether the separated single speech is the original single speech. For example, the discriminator 102 may be configured to distinguish whether the separated single speech of A is the real speech of A, and whether the separated single speech of B is the real speech of B.
The generator 101 and the discriminator 102 may be trained until the discriminator 102 is no longer able to distinguish whether the separated single speech is the original single speech. For example, the generator 101 and the discriminator 102 may be trained until the discriminator 102 is no longer able to distinguish whether the separated single speech of A is the real speech of A and whether the separated single speech of B is the real speech of B.
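For illustration only, the following sketch shows one way such a generator and discriminator could be realized. The disclosure does not fix a network architecture, so the fully connected layers, the frame length of 512 samples, and the two-speaker output layout used here are assumptions rather than part of the disclosed apparatus:

    # Illustrative sketch only: the architecture, frame length and
    # two-speaker layout are assumptions, not part of the disclosure.
    import torch
    import torch.nn as nn

    FRAME_LEN = 512  # assumed waveform frame length (hypothetical)

    class Generator(nn.Module):
        """Separates a mixed-speech frame into two single-speech frames."""
        def __init__(self, frame_len: int = FRAME_LEN, hidden: int = 1024):
            super().__init__()
            self.frame_len = frame_len
            self.net = nn.Sequential(
                nn.Linear(frame_len, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 2 * frame_len),  # two separated outputs
            )

        def forward(self, mix: torch.Tensor) -> torch.Tensor:
            out = self.net(mix)                     # (batch, 2 * frame_len)
            return out.view(-1, 2, self.frame_len)  # (batch, 2, frame_len)

    class Discriminator(nn.Module):
        """Scores a single-speech frame; once trained, the score is near 1
        for original speech and near 0 for separated speech."""
        def __init__(self, frame_len: int = FRAME_LEN, hidden: int = 512):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(frame_len, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            )

        def forward(self, speech: torch.Tensor) -> torch.Tensor:
            return self.net(speech)  # raw score compared against 1 or 0

The least-squares formulation that gives these target scores of 1 and 0 is defined in equations (4) and (5) below.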
According to one embodiment of the present disclosure, training the generator may include minimizing a loss function of a signal-to-distortion ratio of the separated single voices. For example, training the generator 101 may include minimizing a loss function of the signal-to-distortion ratios of the separated single speech of A and the separated single speech of B.
According to one embodiment of the present disclosure, training the generator may further include transforming the original single speech into the same mapping space as the separated single speech. For example, training the generator 101 may also include transforming the real speech of A and the real speech of B into the same mapping space as the separated single speech of A and the separated single speech of B.
According to an embodiment of the present disclosure, training the generator may further include calculating an error between the separated single speech and the transformed original single speech. For example, training the generator 101 may also include calculating an error between the separated single speech of A and the transformed real speech of A, and calculating an error between the separated single speech of B and the transformed real speech of B.
For example, according to an embodiment of the present disclosure, a negative signal-to-distortion ratio may be used as the training target of the loss function of the generator 101. The signal-to-distortion ratio can be calculated as:
s_target = (⟨s, t⟩ / ‖t‖²) · t    (1)
e_noise = s − s_target    (2)
SDR = 10 · log₁₀(‖s_target‖² / ‖e_noise‖²)    (3)
where t is the original single voice, i.e., the real voice of A or B according to the present embodiment, and s is the separated single voice output by the generator 101, i.e., the separated single voice of A or B according to the present embodiment. Here, it should be apparent to those skilled in the art that the loss function of the generator 101 described above is merely exemplary, and the present disclosure is not limited thereto.
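As a minimal sketch, the negative-SDR training target of equations (1) to (3) can be written as follows; the batching convention and the eps stabilizer are implementation assumptions:

    import torch

    def neg_sdr_loss(s: torch.Tensor, t: torch.Tensor,
                     eps: float = 1e-8) -> torch.Tensor:
        """Negative SDR of separated speech s against original speech t,
        per equations (1)-(3); tensors are batched over leading dims."""
        # (1) project s onto the original speech t
        dot = torch.sum(s * t, dim=-1, keepdim=True)
        t_energy = torch.sum(t * t, dim=-1, keepdim=True) + eps
        s_target = (dot / t_energy) * t
        # (2) the residual is treated as distortion
        e_noise = s - s_target
        # (3) SDR in dB; negated so minimizing the loss maximizes SDR
        sdr = 10.0 * torch.log10(
            torch.sum(s_target ** 2, dim=-1)
            / (torch.sum(e_noise ** 2, dim=-1) + eps) + eps)
        return -sdr.mean()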
Next, according to an embodiment of the present disclosure, training the discriminator may include maximizing the ability of the discriminator to distinguish the separated single speech from the original single speech. That is, according to the embodiment of the present disclosure, the discriminator 102 is trained so that it can maximally distinguish the real speech of A from the single speech of A separated by the generator 101, or maximally distinguish the real speech of B from the single speech of B separated by the generator 101.
According to an embodiment of the present disclosure, training the discriminator may further include bringing the result output by the discriminator for the original single speech close to a predetermined threshold, and minimizing the result output by the discriminator for the separated single speech. For example, for convenience of calculation, the predetermined threshold may be set to 1, and the minimum may be represented as 0. That is, according to the embodiment of the present disclosure, the discriminator 102 is trained such that its output is close to 1 for the real speech of A or the real speech of B, and close to 0 for the single speech of A or the single speech of B separated by the generator 101. Here, it should be apparent to those skilled in the art that the selection of the predetermined threshold is merely exemplary, and the present disclosure is not limited thereto, as long as the selection of the predetermined threshold reflects maximally distinguishing the original speech from the generated single speech.
According to an embodiment of the present disclosure, training the generator may further include making the result that the discriminator outputs for the single voice separated by the generator close to the predetermined threshold. For example, for convenience of calculation, the predetermined threshold may again be set to 1. That is, according to the embodiment of the present disclosure, training the generator 101 may make the decision result of the discriminator 102 for the separated single speech of A or the separated single speech of B generated by the generator 101 close to 1. Likewise, it should be clear to those skilled in the art that the selection of the predetermined threshold is merely exemplary, and the present disclosure is not limited thereto.
Thus, the loss functions for generative adversarial training can be defined as:
min_D L(D) = E[(D(t) − 1)²] + E[(D(G(m)))²]    (4)
min_G L(G) = E[(D(G(m)) − 1)²] + μ · L_SDR    (5)
where G denotes the generator, D denotes the discriminator, t is an original single voice, i.e., the real voice of A or the real voice of B according to the present embodiment, m is the mixed voice, i.e., the mixed voice of A and B according to the present embodiment, L_SDR is the loss function of the generator, and μ is a balance coefficient.
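A minimal sketch of these two loss functions, reusing neg_sdr_loss from above; treating the discriminator output as a raw score, detaching the generator output in the discriminator loss, and assuming a fixed speaker ordering between G(m) and t (no permutation-invariant matching) are implementation assumptions:

    import torch

    def d_loss(D, G, t: torch.Tensor, m: torch.Tensor) -> torch.Tensor:
        """Equation (4): push D(t) toward 1 and D(G(m)) toward 0."""
        fake = G(m).detach()  # no backpropagation into the generator here
        return ((D(t) - 1.0) ** 2).mean() + (D(fake) ** 2).mean()

    def g_loss(D, G, t: torch.Tensor, m: torch.Tensor,
               mu: float = 1.0) -> torch.Tensor:
        """Equation (5): push D(G(m)) toward 1, plus the weighted SDR loss."""
        sep = G(m)
        adv = ((D(sep) - 1.0) ** 2).mean()
        return adv + mu * neg_sdr_loss(sep, t)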
Using the speech processing apparatus according to the present disclosure can improve the quality of separated speech while separating mixed speech.
According to one embodiment of the present disclosure, the parameters of the generator and the parameters of the discriminator may be alternately updated when the generator and the discriminator are trained together. For example, the parameters of the generator may be kept constant while the parameters of the discriminator are trained m times. The parameters of the discriminator may then be kept constant while the parameters of the generator are trained k times. The generator and the discriminator are alternately trained in this way until the discriminator is no longer able to distinguish whether the separated single speech is the original single speech, as sketched below.
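A sketch of one round of this alternating update, reusing the losses sketched above; m_steps and k_steps correspond to m and k, and the use of two separate Adam optimizers is an assumption:

    import torch

    def train_round(G, D, opt_g, opt_d, mix, originals,
                    m_steps: int = 1, k_steps: int = 1, mu: float = 1.0):
        """Updates the discriminator m_steps times (generator output
        detached), then the generator k_steps times (only opt_g steps)."""
        for _ in range(m_steps):
            opt_d.zero_grad()
            d_loss(D, G, originals, mix).backward()
            opt_d.step()
        for _ in range(k_steps):
            opt_g.zero_grad()
            g_loss(D, G, originals, mix, mu).backward()
            opt_g.step()

    # Example wiring (hypothetical): rounds are repeated over the training
    # data until the discriminator can no longer tell the speeches apart.
    # G, D = Generator(), Discriminator()
    # opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
    # opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)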
The speech processing apparatus according to the present disclosure performs generative adversarial training, which makes the separated speech difficult to distinguish from the real speech. Furthermore, the speech processing apparatus according to the present disclosure not only aims to maximize SDR for better speech quality but also integrates speech separation and improved speech quality into a single model.
A speech processing method according to an embodiment of the present disclosure will be described below with reference to fig. 2. As shown in fig. 2, a speech processing method according to an embodiment of the present disclosure starts at step S210.
In step S210, a mixed voice including two or more original single voices is separated into two or more separated single voices by a generator.
Next, in step S220, it is discriminated by the discriminator whether the separated single voice is the original single voice.
Next, in step S230, the generator and the discriminator are trained until the discriminator can no longer distinguish whether the separated single speech is the original single speech. As long as the discriminator can still distinguish the separated single speech from the original single speech (yes in S230), the training of the generator and the discriminator is repeated; that is, the process returns from step S230 and re-executes step S210 and step S220. Once the discriminator can no longer distinguish whether the separated single speech is the original single speech (no in S230), the process ends.
The speech processing method according to an embodiment of the present disclosure further includes the step of minimizing a loss function of a signal-to-distortion ratio of the separated single speech.
The speech processing method according to an embodiment of the present disclosure further includes a step of transforming the original single speech into the same mapping space as the separated single speech.
The speech processing method according to an embodiment of the present disclosure further includes the step of calculating an error between the separated single speech and the transformed original single speech.
The speech processing method according to an embodiment of the present disclosure further includes a step of maximizing the ability of the discriminator to distinguish the separated single speech from the original single speech.
The speech processing method according to an embodiment of the present disclosure further includes the steps of bringing the result output by the discriminator for the original single speech close to a predetermined threshold, and minimizing the result output by the discriminator for the separated single speech.
The speech processing method according to an embodiment of the present disclosure further includes a step of making the result that the discriminator outputs for the single speech separated by the generator close to the predetermined threshold.
According to the speech processing method of one embodiment of the present disclosure, when the generator and the discriminator are trained together, the parameters of the generator and the parameters of the discriminator are alternately updated.
Various embodiments of the above steps of the voice processing method according to the embodiment of the present disclosure have been described in detail above, and a description thereof will not be repeated here.
It is apparent that the respective operational procedures of the voice processing method according to the present disclosure can be implemented in the form of computer-executable programs stored in various machine-readable storage media.
Moreover, the object of the present disclosure can also be achieved by directly or indirectly supplying a storage medium storing the above executable program code to a system or an apparatus, whose computer or central processing unit (CPU) then reads out and executes the program code. As long as the system or the apparatus has a function of executing a program, the embodiments of the present disclosure are not limited to the program, and the program may take any form, for example, an object program, a program executed by an interpreter, or a script program provided to an operating system.
Such machine-readable storage media include, but are not limited to: various memories and storage units; semiconductor devices; disk units such as optical disks, magnetic disks, and magneto-optical disks; and other media suitable for storing information.
In addition, the computer can also implement the technical solution of the present disclosure by connecting to a corresponding website on the internet, downloading and installing the computer program code according to the present disclosure into the computer and then executing the program.
Fig. 3 is a block diagram of an exemplary structure of a general-purpose personal computer 1300 in which a voice processing method according to an embodiment of the present disclosure can be implemented.
As shown in fig. 3, the CPU 1301 executes various processes in accordance with a program stored in a Read Only Memory (ROM) 1302 or a program loaded from a storage portion 1308 into a Random Access Memory (RAM) 1303. Data necessary for the CPU 1301 to execute various processes is also stored in the RAM 1303 as needed. The CPU 1301, the ROM 1302, and the RAM 1303 are connected to each other via a bus 1304. An input/output interface 1305 is also connected to the bus 1304.
The following components are connected to the input/output interface 1305: an input portion 1306 (including a keyboard, a mouse, and the like), an output portion 1307 (including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), a speaker, and the like), a storage portion 1308 (including a hard disk and the like), and a communication portion 1309 (including a network interface card such as a LAN card, a modem, and the like). The communication portion 1309 performs communication processing via a network such as the Internet. A drive 1310 may also be connected to the input/output interface 1305 as needed. A removable medium 1311 such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory may be mounted on the drive 1310 as needed, so that a computer program read out therefrom is installed into the storage portion 1308 as needed.
In the case where the above-described series of processes is realized by software, a program constituting the software is installed from a network such as the internet or a storage medium such as the removable medium 1311.
It should be understood by those skilled in the art that such a storage medium is not limited to the removable medium 1311 shown in fig. 3, which stores the program and is distributed separately from the apparatus so as to provide the program to the user. Examples of the removable medium 1311 include a magnetic disk (including a floppy disk (registered trademark)), an optical disk (including a compact disc read only memory (CD-ROM) and a Digital Versatile Disc (DVD)), a magneto-optical disk (including a Mini Disk (MD) (registered trademark)), and a semiconductor memory. Alternatively, the storage medium may be the ROM 1302, a hard disk contained in the storage portion 1308, or the like, in which a program is stored and which is distributed to the user together with the apparatus containing it.
In the systems and methods of the present disclosure, it is apparent that individual components or steps may be broken down and/or recombined. These decompositions and/or recombinations are to be considered equivalents of the present disclosure. Also, the steps of executing the series of processes described above may naturally be executed chronologically in the order described, but need not necessarily be executed chronologically. Some steps may be performed in parallel or independently of each other.
Although the embodiments of the present disclosure have been described in detail with reference to the accompanying drawings, it should be understood that the above-described embodiments are merely illustrative of the present disclosure and do not constitute a limitation of the present disclosure. It will be apparent to those skilled in the art that various modifications and variations can be made in the above-described embodiments without departing from the spirit and scope of the disclosure. Accordingly, the scope of the disclosure is to be defined only by the claims appended hereto, and by their equivalents.
With respect to the embodiments including the above embodiments, the following remarks are also disclosed:
supplementary note 1. a speech processing apparatus comprising:
a generator configured to separate a mixed voice including two or more original single voices into two or more separated single voices; and
a discriminator configured to discriminate whether the separated single speech is the original single speech,
wherein the generator and the discriminator are trained until the discriminator is no longer able to distinguish whether the separated single speech is the original single speech.
Supplementary note 2. The apparatus according to supplementary note 1, wherein training the generator comprises minimizing a loss function of a signal-to-distortion ratio of the separated single speech.
Supplementary note 3. The apparatus according to supplementary note 2, wherein training the generator further comprises transforming the original single speech into the same mapping space as the separated single speech.
Supplementary note 4. The apparatus according to supplementary note 3, wherein training the generator further comprises calculating an error between the separated single speech and the transformed original single speech.
Supplementary note 5. The apparatus according to supplementary note 1, wherein training the discriminator comprises maximizing the ability of the discriminator to distinguish the separated single speech from the original single speech.
Supplementary note 6. The apparatus according to supplementary note 5, wherein training the discriminator comprises bringing the result output by the discriminator for the original single speech close to a predetermined threshold, and minimizing the result output by the discriminator for the separated single speech.
Supplementary note 7. The apparatus according to supplementary note 6, wherein training the generator further comprises making the result that the discriminator outputs for the single speech separated by the generator close to the predetermined threshold.
Supplementary note 8. The apparatus according to supplementary note 1, wherein the parameters of the generator and the parameters of the discriminator are alternately updated while training the generator and the discriminator together.
Supplementary note 9. A speech processing method, comprising:
separating, by a generator, a mixed speech including two or more original single speeches into two or more separated single speeches; and
distinguishing, by a discriminator, whether the separated single speech is the original single speech,
wherein the generator and the discriminator are trained until the discriminator is no longer able to distinguish whether the separated single speech is the original single speech.
Supplementary note 10. The method according to supplementary note 9, wherein training the generator comprises minimizing a loss function of a signal-to-distortion ratio of the separated single speech.
Supplementary note 11. The method according to supplementary note 10, wherein training the generator further comprises transforming the original single speech into the same mapping space as the separated single speech.
Supplementary note 12. The method according to supplementary note 11, wherein training the generator further comprises calculating an error between the separated single speech and the transformed original single speech.
Supplementary note 13. The method according to supplementary note 9, wherein training the discriminator comprises maximizing the ability of the discriminator to distinguish the separated single speech from the original single speech.
Supplementary note 14. The method according to supplementary note 13, wherein training the discriminator comprises bringing the result output by the discriminator for the original single speech close to a predetermined threshold, and minimizing the result output by the discriminator for the separated single speech.
Supplementary note 15. The method according to supplementary note 14, wherein training the generator further comprises making the result that the discriminator outputs for the single speech separated by the generator close to the predetermined threshold.
Supplementary note 16. The method according to supplementary note 9, wherein the parameters of the generator and the parameters of the discriminator are alternately updated while training the generator and the discriminator together.
Supplementary note 17. A program product comprising machine-readable instruction code stored therein, wherein the instruction code, when read and executed by a computer, is capable of causing the computer to perform the method according to any one of supplementary notes 9-16.

Claims (10)

1. A speech processing apparatus comprising:
a generator configured to separate a mixed voice including two or more original single voices into two or more separated single voices; and
a discriminator configured to discriminate whether the separated single speech is the original single speech,
wherein the generator and the discriminator are trained until the discriminator is no longer able to distinguish whether the separated single speech is the original single speech.
2. The apparatus of claim 1, wherein training the generator comprises minimizing a loss function of a signal-to-distortion ratio of the separated single voices.
3. The apparatus of claim 2, wherein training the generator further comprises transforming the original single speech into the same mapping space as the separated single speech.
4. The apparatus of claim 3, wherein training the generator further comprises calculating an error between the separated single speech and the transformed original single speech.
5. The apparatus of claim 1, wherein training the discriminator comprises maximizing the ability of the discriminator to distinguish the separated single speech from the original single speech.
6. The apparatus of claim 5, wherein training the discriminator comprises bringing the result output by the discriminator for the original single speech close to a predetermined threshold; and minimizing the result output by the discriminator for the separated single speech.
7. The apparatus of claim 6, wherein training the generator further comprises making the result that the discriminator outputs for the single speech separated by the generator close to the predetermined threshold.
8. The apparatus of claim 1, wherein the parameters of the generator and the parameters of the discriminator are alternately updated while training the generator and the discriminator together.
9. A method of speech processing comprising:
separating, by a generator, a mixed speech including two or more original single speeches into two or more separated single speeches; and
distinguishing by a discriminator whether the separated single speech is the original single speech,
wherein the generator and the discriminator are trained until the discriminator is no longer able to distinguish whether the separated single speech is the original single speech.
10. A machine-readable storage medium having a program product embodied thereon, the program product comprising machine-readable instruction code stored therein, wherein the instruction code, when read and executed by a computer, is capable of causing the computer to perform the method of claim 9.
CN201910066430.6A 2019-01-24 2019-01-24 Speech processing apparatus, method and medium Pending CN111554316A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910066430.6A CN111554316A (en) 2019-01-24 2019-01-24 Speech processing apparatus, method and medium
JP2020004983A JP2020118967A (en) 2019-01-24 2020-01-16 Voice processing device, data processing method, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910066430.6A CN111554316A (en) 2019-01-24 2019-01-24 Speech processing apparatus, method and medium

Publications (1)

Publication Number Publication Date
CN111554316A 2020-08-18

Family

ID=71890712

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910066430.6A Pending CN111554316A (en) 2019-01-24 2019-01-24 Speech processing apparatus, method and medium

Country Status (2)

Country Link
JP (1) JP2020118967A (en)
CN (1) CN111554316A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6236862B1 (en) * 1996-12-16 2001-05-22 Intersignal Llc Continuously adaptive dynamic signal separation and recovery system
CA2513842A1 (en) * 1999-08-23 2001-03-01 Matsushita Electric Industrial Co., Ltd. Apparatus and method for speech coding
GB0524099D0 (en) * 2005-11-26 2006-01-04 Wolfson Microelectronics Plc Audio device
US20130185070A1 (en) * 2012-01-12 2013-07-18 Microsoft Corporation Normalization based discriminative training for continuous speech recognition
US20150255085A1 (en) * 2014-03-07 2015-09-10 JVC Kenwood Corporation Noise reduction device
US20180342257A1 (en) * 2017-05-24 2018-11-29 Modulate, LLC System and Method for Building a Voice Database
CN109147810A (en) * 2018-09-30 2019-01-04 百度在线网络技术(北京)有限公司 Establish the method, apparatus, equipment and computer storage medium of speech enhan-cement network

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
HYEONG-SEOK CHOI ET AL.: "Phase-Aware Speech Enhancement with Deep Complex U-Net", published as a conference paper at ICLR 2019, pages 1-20 *
JONATHAN LE ROUX: "SDR - Half-Baked or Well Done?", arXiv:1811.02508v1 [cs.SD], pages 1-5 *
SHRIKANT VENKATARAMANI: "Adaptive Front-ends for End-to-end Source Separation", 31st Conference on Neural Information Processing Systems, pages 1-5 *
ZHE-CHENG FAN ET AL.: "SVSGAN: Singing Voice Separation Via Generative Adversarial Network", pages 726-730 *

Also Published As

Publication number Publication date
JP2020118967A (en) 2020-08-06

Similar Documents

Publication Publication Date Title
CN110709924B (en) Audio-visual speech separation
CN108630193B (en) Voice recognition method and device
Afouras et al. The conversation: Deep audio-visual speech enhancement
US6959276B2 (en) Including the category of environmental noise when processing speech signals
US8010343B2 (en) Disambiguation systems and methods for use in generating grammars
CN107274906A (en) Voice information processing method, device, terminal and storage medium
CN111696572B (en) Voice separation device, method and medium
US8145486B2 (en) Indexing apparatus, indexing method, and computer program product
EP1705645A2 (en) Apparatus and method for analysis of language model changes
US20040254793A1 (en) System and method for providing an audio challenge to distinguish a human from a computer
JP2013167666A (en) Speech recognition device, speech recognition method, and program
EP1317749B1 (en) Method of and system for improving accuracy in a speech recognition system
JP7407190B2 (en) Speech analysis device, speech analysis method and program
CN111179903A (en) Voice recognition method and device, storage medium and electric appliance
Gogate et al. Av speech enhancement challenge using a real noisy corpus
US20010056345A1 (en) Method and system for speech recognition of the alphabet
JP3939955B2 (en) Noise reduction method using acoustic space segmentation, correction and scaling vectors in the domain of noisy speech
CN110265038B (en) Processing method and electronic equipment
CN110570838B (en) Voice stream processing method and device
CN111554316A (en) Speech processing apparatus, method and medium
CN108766429B (en) Voice interaction method and device
CN111508530A (en) Speech emotion recognition method, device and storage medium
CN115831125A (en) Speech recognition method, device, equipment, storage medium and product
WO2020195924A1 (en) Signal processing device, method, and program
KR101925253B1 (en) Apparatus and method for context independent speaker indentification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination