WO2020088153A1 - Speech processing method, apparatus, storage medium, and electronic device - Google Patents
Speech processing method, apparatus, storage medium, and electronic device
- Publication number
- WO2020088153A1 (PCT/CN2019/107578)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- speech
- model
- sub
- voice
- original
- Prior art date
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L2021/02082—Noise filtering the noise being echo, reverberation of the speech
Definitions
- Embodiments of the present application relate to the field of voice processing technology, and in particular, to a voice processing method, device, storage medium, and electronic equipment.
- when the user is far from the electronic device, the voice signal collected by the microphone of the electronic device contains reverberation, which reduces the clarity of the collected voice signal and lowers the recognition rate of voiceprint information.
- a commonly used dereverberation technique is WPE (weighted prediction error): in the frequency domain, the reverberation component is estimated from the first few frames of the reverberant speech, and the reverberant speech is differenced with the reverberation component to obtain the dereverberated speech.
- an embodiment of the present application provides a voice processing method, including: obtaining original speech; if the original speech is reverberant speech, inputting the original speech into a pre-trained generator sub-model of a generative adversarial network model, where the generator sub-model is used to perform dereverberation processing on the original speech; and determining the output speech of the generator sub-model as the dereverberated speech.
- an embodiment of the present application provides a voice processing device, including: a speech processing module, used to input the original speech into the pre-trained generator sub-model of the generative adversarial network model if the original speech is reverberant speech, where the generator sub-model is used to perform dereverberation processing on the original speech.
- an embodiment of the present application provides a computer-readable storage medium on which a computer program is stored which, when executed by a processor, implements the above method.
- FIG. 1 is a schematic flowchart of a voice processing method according to an embodiment of this application;
- FIG. 2 is a schematic flowchart of another voice processing method provided by an embodiment of the present application.
- FIG. 3 is a schematic flowchart of another voice processing method provided by an embodiment of the present application.
- FIG. 4 is a schematic flowchart of another voice processing method provided by an embodiment of the present application.
- FIG. 5 is a schematic structural diagram of a voice processing device according to an embodiment of the present application.
- FIG. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
- FIG. 7 is a schematic structural diagram of another electronic device according to an embodiment of the present application.
- An embodiment of the present application provides a voice processing method, including: obtaining original speech; if the original speech is reverberant speech, inputting it into a pre-trained generator sub-model of a generative adversarial network model for dereverberation processing; and determining the output speech of the generator sub-model as the dereverberated speech.
- the generative adversarial network model further includes a discriminant sub-model, and the discriminant sub-model is used to discriminate the speech type of the input voice;
- the original voice is input into the discriminant sub-model of the pre-trained generative adversarial network model, and it is determined whether the original voice is a reverberation voice according to the output result of the discriminant sub-model.
- the training method for the generator sub-model and the training method for the discriminant sub-model are described in the steps below;
- the method further includes masking the dereverberated speech to generate processed speech, as described below.
- Step 101: Obtain the original speech.
- the electronic devices in the embodiments of the present application may include smart devices equipped with voice collection devices, such as mobile phones, tablet computers, robots, and speakers.
- the original speech is collected by a voice collection device provided in the electronic device. For example, a voice signal input by a user can be collected through a microphone, the collected signal is converted by an analog-to-digital converter to obtain a digital voice signal, and the digital voice signal is amplified by an amplifier to generate the original speech.
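- A minimal Python sketch of this capture pipeline, assuming the third-party sounddevice package; the sample rate, duration, and gain value are illustrative, not specified by the embodiment:

```python
import sounddevice as sd

SR = 16000      # assumed sample rate (Hz)
DURATION_S = 3  # assumed recording length (s)
GAIN = 1.5      # illustrative amplification factor

# Record from the default microphone; sounddevice returns the already
# analog-to-digital-converted signal as float32 samples.
signal = sd.rec(int(DURATION_S * SR), samplerate=SR, channels=1, dtype="float32")
sd.wait()                                   # block until recording finishes
original_speech = GAIN * signal.squeeze()   # amplified digital voice signal
```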
- reverberant speech arises when the user is at a large distance from the electronic device: sound waves are reflected during propagation, and the reflected wave signals are collected by the device, overlapping the original voice signal so that the collected signal is unclear.
- for example, sound waves propagating indoors are reflected by obstacles such as walls, ceilings, and floors, and the resulting multiple reflected waves are collected by the electronic device at different times, forming reverberant speech.
- the generative adversarial network (Generative Adversarial Net, GAN) model is pre-trained to have the function of dereverberating reverberant speech and generating clean speech.
- the generative adversarial network model includes a generator sub-model and a discriminant sub-model.
- the generator sub-model is used to dereverberate the input original speech; the discriminant sub-model is used to discriminate the input speech. The output of the discriminant sub-model can be the speech type of the input speech together with the discrimination probability of that type; for example, the speech types of the input speech may be clean speech and reverberant speech.
- the generator sub-model and the discriminant sub-model are connected, that is, the output of the generator sub-model serves as the input of the discriminant sub-model: the generator sub-model dereverberates the original speech and inputs the generated speech into the discriminant sub-model, and the generator sub-model is verified according to the output result of the discriminant sub-model.
- the generative adversarial network model is pre-trained, with the generator sub-model and the discriminant sub-model trained separately.
- the discriminant sub-model is first trained on the training samples, and its discrimination accuracy is improved by adjusting its network parameters.
- after the discriminant sub-model is trained, its network parameters are fixed and the generator sub-model is trained: the generator's network parameters are adjusted so that the probability that its output speech is judged reverberant decreases. The above training process is cycled; when the output results of the discriminant and generator sub-models satisfy the preset error, training of the generative adversarial network model is determined to be complete.
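- The alternating scheme above can be sketched as follows in Python (PyTorch); the toy tensors, layer sizes, and optimizers are assumptions for illustration, not the architecture of the patent:

```python
import torch
import torch.nn as nn

feat_dim = 64  # toy stand-in for speech feature frames
G = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim))
D = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCELoss()

for step in range(1000):
    clean = torch.randn(32, feat_dim)                 # placeholder clean batch
    reverb = clean + 0.5 * torch.randn(32, feat_dim)  # placeholder reverberant batch

    # 1) Train the discriminator: clean -> 1, generated -> 0 (generator frozen).
    d_loss = bce(D(clean), torch.ones(32, 1)) + \
             bce(D(G(reverb).detach()), torch.zeros(32, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # 2) Train the generator: lower the probability that its output is judged
    #    reverberant, i.e. push D's verdict on generated speech toward "clean".
    g_loss = bce(D(G(reverb)), torch.ones(32, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```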
- after training is complete, the collected original speech is input directly into the generator sub-model of the generative adversarial network model, and the generated speech output by the generator sub-model is determined as the dereverberated speech, that is, clean speech.
- the method further includes: transmitting the dereverberated speech into the discriminant sub-model of the pre-trained generative adversarial network model and obtaining its output result; when the discrimination probability in the output result that the dereverberated speech is clean speech is less than a preset probability, inputting the dereverberated speech back into the generator sub-model for a second dereverberation pass.
- the discriminant sub-model thus checks the output of the generator sub-model: when the output does not meet the preset requirement, it is dereverberated again until it does.
- the preset probability of clean speech in the preset requirement may be set according to user needs, for example 80%. This improves the accuracy of dereverberation of the original speech and the clarity of the output speech, further improves the recognition rate of voiceprint recognition and speech matching performed on the output speech, avoids misoperation of the electronic device, and improves control precision.
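- The secondary-pass logic can be sketched as follows, reusing G and D from the previous sketch; the 0.8 threshold mirrors the 80% example, and the pass limit is an assumed safeguard the text does not specify:

```python
import torch

def dereverberate(original, G, D, p_clean=0.8, max_passes=3):
    """Pass speech through the generator until the discriminator judges it
    clean enough (D is assumed to output P(clean) as a scalar)."""
    speech = original
    with torch.no_grad():
        for _ in range(max_passes):
            speech = G(speech)                # one dereverberation pass
            if D(speech).item() >= p_clean:   # preset requirement met
                break
    return speech

# Usage: out = dereverberate(torch.randn(1, 64), G, D)
```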
- Step 201: Collect speech samples and set a type identifier according to the speech type of each sample, where the speech samples include clean speech samples and reverberant speech samples.
- Step 202: Input the speech samples into the discriminant sub-model to be trained to obtain the discrimination result of the discriminant sub-model.
- Step 203: Adjust the network parameters of the discriminant sub-model according to the discrimination result and the type identifiers of the speech samples.
- Step 204: Input the reverberant speech samples into the generator sub-model to be trained to obtain the generated speech output by the generator sub-model.
- Step 205: Input the generated speech into the pre-trained discriminant sub-model, and determine the discrimination probability that the generated speech is clean speech according to the output result of the discriminant sub-model.
- Step 206: Determine loss information according to the discrimination probability and the expected probability of the generated speech, and adjust the network parameters of the generator sub-model based on the loss information.
- Step 207: Obtain the original speech, input it into the discriminant sub-model of the pre-trained generative adversarial network model, and determine whether it is reverberant speech according to the output result of the discriminant sub-model.
- Step 208: If the original speech is reverberant speech, input it into the pre-trained generator sub-model of the generative adversarial network model, where the generator sub-model is used to perform dereverberation processing on the original speech.
- Step 209: Determine the output speech of the generator sub-model as the dereverberated speech.
- the discriminant sub-model in the generative adversarial network model is trained through steps 201 to 203.
- the clean speech may be collected through an electronic device, or may be obtained through a network search.
- the reverberant speech samples are generated by superimposing clean speech samples with different numbers of reverberation superpositions and/or different reverberation times. Illustratively, reverberant speech may be generated by superimposing clean speech twice or multiple times, where the interval between the superimposed speech signals may differ, producing different reverberant samples; this improves the diversity of the reverberant speech samples and further improves the training accuracy of the generative adversarial network model.
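- A sketch of this superposition in Python, with illustrative delay and gain values (the embodiment does not fix them):

```python
import numpy as np

def make_reverberant(clean, sr, delays_s=(0.03, 0.08), gains=(0.6, 0.3)):
    """Superimpose delayed, attenuated copies of a clean utterance onto
    itself; varying the delays, gains, and number of copies yields diverse
    reverberant training samples."""
    out = clean.astype(np.float64).copy()
    for delay, gain in zip(delays_s, gains):
        n = int(delay * sr)
        out[n:] += gain * clean[: len(clean) - n]  # one delayed reflection
    return out
```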
- the type identifier of the clean speech sample may be 1, and the type identifier of the reverberation speech sample may be 0, which is used to distinguish the speech samples.
- the discriminant result includes the voice type of the sample speech and the discrimination probability.
- the discrimination result may be 60% of clean speech and 40% of reverberation speech.
- the expected probability is determined according to the type identification of the voice sample. For example, when the type identification of the input voice sample is 1, the expected probability is 100% of clean voice.
- Steps 201 to 203 are iteratively executed until the discrimination result meets the preset accuracy, and it is determined that the discriminant sub-model training is completed.
- the generator sub-model in the generative adversarial network model is then trained against the trained discriminant sub-model: the reverberant speech samples are input into the generator sub-model to be trained to obtain the generated speech it outputs, the generated speech is input into the trained discriminant sub-model, and the speech type and discrimination probability of the generated speech are determined. For example, the discriminant sub-model may judge the generated speech to be reverberant with a discrimination probability of 60%, the discrimination probability of clean speech being 40%. In this embodiment, the expected probability for generated speech is 100% clean speech and 0% reverberant speech, so the loss information is 60%.
- the network parameters of the generator sub-model are adjusted in reverse according to the loss information, where the network parameters include, but are not limited to, weight values and bias values.
- Steps 204 to 206 are iteratively executed until the judgment result of the generated speech output by the generated sub-model meets the preset precision, and it is determined that the training of the generated sub-model is completed, that is, the trained generated sub-model has the function of dereverberation of the input speech.
- steps 201 to 203 and steps 204 to 206 can be executed cyclically, that is, the discriminating sub-model and the generating sub-model are sequentially trained multiple times until both the discriminating sub-model and the generating sub-model satisfy the training conditions.
- the discriminant sub-model and generator sub-model obtained after training satisfy the following formula:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

- where D is the discriminant sub-model, G is the generator sub-model, x is a clean speech signal with signal distribution p_data(x), and z is a reverberant speech signal with signal distribution p_z(z).
- the speech processing method provided in this embodiment trains the discriminant and generator sub-models of the generative adversarial network model separately, obtaining a discriminant sub-model with a reverberant-speech discrimination function and a generator sub-model with a dereverberation function, and performs dereverberation on the original speech collected by the electronic device to obtain clear dereverberated speech, with simple operation and high processing efficiency.
- Step 301: Obtain the original speech, input it into the discriminant sub-model of the pre-trained generative adversarial network model, and determine whether it is reverberant speech according to the output result of the discriminant sub-model.
- Step 302: If the original speech is reverberant speech, input it into the pre-trained generator sub-model of the generative adversarial network model, where the generator sub-model is used to perform dereverberation processing on the original speech.
- Step 303: Determine the output speech of the generator sub-model as the dereverberated speech.
- Step 304: Perform masking processing on the dereverberated speech to generate the processed speech.
- masking the dereverberated speech to generate the processed speech includes: performing a short-time Fourier transform on the dereverberated speech to generate its amplitude spectrum and phase spectrum; masking the amplitude spectrum, recombining the processed amplitude spectrum with the phase spectrum, and performing an inverse short-time Fourier transform to generate the processed speech.
- masking the amplitude spectrum of the dereverberated speech may be done as follows: for each distorted frequency bin in the amplitude spectrum of a signal frame, smoothing is performed according to the amplitude values of the bins adjacent to the distorted bin to obtain its amplitude value. The smoothing may take the amplitude value of an adjacent bin as the value of the distorted bin, or take the mean of the amplitude values of the preceding and following adjacent bins as the value of the distorted bin.
- masking the amplitude spectrum of the dereverberated speech may also be done by smoothing the amplitude value of each frequency bin of the current signal frame with the amplitude value of the corresponding bin of the previous, already-masked signal frame, generating the processed amplitude spectrum of the current frame, for example according to a frame-recursive smoothing formula.
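- A sketch of this masking pipeline in Python (SciPy); the frame length and the smoothing factor are assumptions, and the recursive smoothing stands in for the patent's masking formula, which is not fixed here:

```python
import numpy as np
from scipy.signal import stft, istft

def mask_dereverberated(x, sr, nperseg=512, alpha=0.7):
    """STFT -> split magnitude/phase -> frame-recursive magnitude smoothing
    -> recombine -> inverse STFT."""
    _, _, Z = stft(x, fs=sr, nperseg=nperseg)
    mag, phase = np.abs(Z), np.angle(Z)
    for t in range(1, mag.shape[1]):
        # Smooth each bin of the current frame with the already-masked
        # corresponding bin of the previous frame (the second variant above).
        mag[:, t] = alpha * mag[:, t - 1] + (1 - alpha) * mag[:, t]
    _, y = istft(mag * np.exp(1j * phase), fs=sr, nperseg=nperseg)
    return y
```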
- in the speech processing method provided by the embodiments of the present application, after the original speech is dereverberated based on the pre-trained generative adversarial network model, the obtained dereverberated speech is masked to eliminate signal distortion introduced during the dereverberation process, improving the signal quality of the processed speech and facilitating subsequent recognition accuracy of the processed speech.
- FIG. 4 is a schematic flowchart of another voice processing method provided by an embodiment of the present application. This embodiment is an optional solution of the foregoing embodiments. Correspondingly, as shown in FIG. 4, the method of this embodiment includes the following steps:
- Step 401: Obtain the original speech, input it into the discriminant sub-model of the pre-trained generative adversarial network model, and determine whether it is reverberant speech according to the output result of the discriminant sub-model.
- Step 402: If the original speech is reverberant speech, input it into the pre-trained generator sub-model of the generative adversarial network model, where the generator sub-model is used to perform dereverberation processing on the original speech.
- Step 403: Determine the output speech of the generator sub-model as the dereverberated speech.
- Step 404: Perform masking processing on the dereverberated speech to generate the processed speech.
- Step 405: Identify the voiceprint features of the processed speech, and compare the voiceprint features with preset voiceprint features.
- Step 406: When the comparison succeeds, wake up the device.
- illustratively, when the collected original speech is clean speech, step 404 is performed directly.
- in this embodiment, the voiceprint features of an authorized user and a wake-up keyword are preset in the electronic device. The voiceprint features and keywords in the processed speech are identified, the identified keyword is matched against the wake-up keyword, and the extracted voiceprint features are matched against the authorized user's voiceprint features; when both match successfully, the electronic device is woken. When the electronic device is a mobile phone, waking it may be switching from the lock-screen state to the working state and generating a corresponding control instruction according to the keywords in the processed speech. For example, the keyword recognized from the processed speech may be "Hey Siri, how is the weather today"; when the keyword "Hey Siri" matches the preset wake-up keyword and the extracted voiceprint features match those of the authorized user, a weather query instruction is generated according to "how is the weather today", the instruction is executed, and the query result is output through voice playback or graphic display.
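- The dual check can be sketched as follows; asr and extract_voiceprint are hypothetical helpers standing in for real keyword-recognition and voiceprint models, and the cosine threshold is an assumption:

```python
import numpy as np

def try_wake(processed_speech, asr, extract_voiceprint, enrolled_voiceprint,
             wake_word="Hey Siri", sim_threshold=0.8):
    """Wake the device only when both the wake-up keyword and the
    voiceprint of the authorized user match."""
    text = asr(processed_speech)  # e.g. "Hey Siri, how is the weather today"
    if not text.lower().startswith(wake_word.lower()):
        return None               # wake-up keyword did not match
    vp = extract_voiceprint(processed_speech)
    sim = np.dot(vp, enrolled_voiceprint) / (
        np.linalg.norm(vp) * np.linalg.norm(enrolled_voiceprint))
    if sim < sim_threshold:
        return None               # speaker is not the authorized user
    # Both matched: the remaining text becomes the control instruction,
    # e.g. "how is the weather today" -> a weather query.
    return text[len(wake_word):].strip(" ,")
```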
- the voice processing method provided in this embodiment wakes the electronic device by collecting the original speech input by the user, and performs high-precision dereverberation on the original speech based on the generator sub-model of the generative adversarial network model, improving the clarity of the dereverberated speech, further improving the accuracy and recognition rate of its voiceprint features, avoiding misoperation of the electronic device, and improving control precision.
- FIG. 5 is a structural block diagram of a voice processing device provided by an embodiment of the present application.
- the device may be implemented in software and/or hardware and is generally integrated in an electronic device; by executing the voice processing method of the electronic device, it performs dereverberation processing on the collected voice signal.
- the device includes: a voice acquisition module 501, a voice processing module 502 and a dereverberation voice determination module 503.
- the voice acquisition module 501 is used to obtain the original voice
- the speech processing module 502 is configured to input the original speech into a pre-trained generator sub-model of the generative adversarial network model if the original speech is reverberant speech, where the generator sub-model is used to perform dereverberation processing on the original speech;
- the dereverberation speech determination module 503 is used to determine the output speech of the generated sub-model as the dereverberation speech.
- the voice processing device provided in the embodiment of the present application performs dereverberation on the original speech input by the user based on the GAN network; without extracting speech features from the original speech, it quickly obtains high-precision dereverberated speech, improving the processing efficiency and accuracy of original speech signal processing.
- the generative adversarial network model further includes a discriminant sub-model, where the discriminant sub-model is used to discriminate the type of speech of the input speech.
- the reverberation speech discrimination module is used to input the original speech into the discriminant sub-model of the pre-trained generative adversarial network model after acquiring the original voice, and determine the original according to the output result of the discriminant sub-model Whether the speech is reverberation speech.
- the generator sub-model training module is used to input the reverberation speech samples to the generator sub-model to be trained to obtain the generated speech output by the generator sub-model; input the generated speech into the pre-trained discriminant sub-model, according to The output result of the discriminating sub-model determines the discriminating probability of the generated speech as clean voice; determining the loss information according to the discriminating probability and the expected probability of the generated speech; adjusting the network parameters of the generating sub-model based on the loss information.
- the reverberant speech samples are generated by superimposing clean speech samples with different numbers of reverberation superpositions and/or different reverberation times.
- the masking processing module is configured to perform masking processing on the dereverberated speech after determining the output speech of the generated sub-model as dereverberated speech to generate processed speech.
- the masking processing module is used to: perform a short-time Fourier transform on the dereverberated speech to generate its amplitude and phase spectra; mask the amplitude spectrum, recombine the processed amplitude spectrum with the phase spectrum, and perform an inverse short-time Fourier transform to generate the processed speech.
- a voiceprint recognition module used to recognize the voiceprint features of the dereverberated speech, and compare the voiceprint features with preset voiceprint features
- the device wake-up module is used to wake up the device when the comparison is successful.
- Embodiments of the present application also provide a storage medium containing computer-executable instructions, which when executed by a computer processor are used to perform a voice processing method, the method including:
- obtaining original speech;
- if the original speech is reverberant speech, inputting the original speech into a pre-trained generator sub-model of the generative adversarial network model, where the generator sub-model is used to perform dereverberation processing on the original speech;
- determining the output speech of the generator sub-model as the dereverberated speech.
- Storage medium — any of various types of memory devices or storage devices. The term "storage medium" is intended to include: installation media such as CD-ROMs, floppy disks, or tape devices; computer system memory or random access memory such as DRAM, DDR RAM, SRAM, EDO RAM, Rambus RAM, etc.; non-volatile memory such as flash memory or magnetic media (e.g., hard disks or optical storage); and registers or other similar types of memory elements.
- the storage medium may also include other types of memory or a combination thereof.
- the storage medium may be located in the first computer system in which the program is executed, or may be located in a different second computer system that is connected to the first computer system through a network such as the Internet.
- the second computer system may provide program instructions to the first computer for execution.
- storage medium may include two or more storage media that may reside in different locations (eg, in different computer systems connected through a network).
- the storage medium may store program instructions executable by one or more processors (eg, embodied as a computer program).
- of course, in the storage medium containing computer-executable instructions provided by the embodiments of the present application, the computer-executable instructions are not limited to the voice processing operations described above and can also execute related operations in the voice processing method provided by any embodiment of the present application.
- An embodiment of the present application provides an electronic device, and the voice processing apparatus provided by the embodiment of the present application may be integrated into the electronic device.
- FIG. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
- the electronic device 600 may include: a memory 601, a processor 602, and a computer program stored in the memory 601 and executable on the processor 602; when the processor 602 executes the computer program, the voice processing method described in the embodiments of the present application is implemented.
- the electronic device provided by the embodiment of the present application performs dereverberation on the original speech input by the user based on the GAN network; without extracting speech features from the original speech, it quickly obtains high-precision dereverberated speech, improving the processing efficiency and accuracy of original speech signal processing.
- the electronic device may include: a housing (not shown in the figure), a memory 701, a central processing unit (CPU) 702 (also called a processor, hereinafter referred to as CPU), and a circuit board (not shown in the figure) And power circuit (not shown in the figure).
- the circuit board is disposed inside the space enclosed by the housing; the CPU 702 and the memory 701 are provided on the circuit board; and the power circuit is used to supply power to each circuit or device of the electronic device
- the memory 701 is used to store executable program code; the CPU 702 runs the computer program corresponding to the executable program code by reading the executable program code stored in the memory 701 to achieve the following steps:
- obtaining original speech;
- if the original speech is reverberant speech, inputting the original speech into a pre-trained generator sub-model of the generative adversarial network model, where the generator sub-model is used to perform dereverberation processing on the original speech;
- determining the output speech of the generator sub-model as the dereverberated speech.
- the electronic device further includes: a peripheral interface 703, an RF (Radio Frequency) circuit 705, an audio circuit 706, a speaker 711, a power management chip 708, an input/output (I/O) subsystem 709, other input/control devices 710, a touch screen 712, and an external port 704; these components communicate through one or more communication buses or signal lines 707.
- the illustrated electronic device 700 is only an example of an electronic device; the electronic device 700 may have more or fewer components than shown in the figure, may combine two or more components, or may have a different component configuration.
- the various components shown in the figures may be implemented in hardware, software, or a combination of hardware and software, including one or more signal processing and / or application specific integrated circuits.
- the electronic device for voice processing operations provided in this embodiment is described in detail below, taking a mobile phone as an example.
- Peripheral interface 703, which can connect input and output peripherals of the device to CPU 702 and memory 701.
- a touch screen 712 which is an input interface and an output interface between the user's electronic device and the user, and displays visual output to the user, and the visual output may include graphics, text, icons, video, and the like.
- the display controller 7091 in the I / O subsystem 709 receives electrical signals from the touch screen 712 or sends electrical signals to the touch screen 712.
- the touch screen 712 detects contact on the touch screen, and the display controller 7091 converts the detected contact into interaction with user interface objects displayed on the touch screen 712, that is, realizes human-computer interaction; the user interface objects displayed on the touch screen 712 may be icons for running games, icons for connecting to the corresponding network, and so on.
- the device may also include a light mouse, which is a touch-sensitive surface that does not display visual output or an extension of the touch-sensitive surface formed by a touch screen.
- the RF circuit 705 is mainly used to establish communication between the mobile phone and the wireless network (that is, the network side) and to receive and send data between the mobile phone and the wireless network, for example sending and receiving short messages and e-mail. Specifically, the RF circuit 705 receives and transmits RF signals, also called electromagnetic signals: it converts electrical signals into electromagnetic signals or electromagnetic signals into electrical signals, and communicates with the communication network and other devices through these electromagnetic signals.
- the RF circuit 705 may include known circuits for performing these functions, including but not limited to an antenna system, an RF transceiver, one or more amplifiers, a tuner, one or more oscillators, a digital signal processor, a CODEC (COder-DECoder) chipset, a Subscriber Identity Module (SIM), and so on.
- the audio circuit 706 is mainly used to receive audio data from the peripheral interface 703, convert the audio data into electrical signals, and send the electrical signals to the speaker 711.
- the speaker 711 is used to restore the voice signal received by the mobile phone from the wireless network through the RF circuit 705 to a sound and play the sound to the user.
- the power management chip 708 is used for power supply and power management for the hardware connected to the CPU 702, the I / O subsystem, and the peripheral interface.
- the voice processing device, storage medium, and electronic device provided in the above embodiments can execute the voice processing method provided in any embodiment of the present application and have the corresponding functional modules and beneficial effects for performing the method. For technical details not exhaustively described in the above embodiments, refer to the voice processing method provided in any embodiment of the present application.
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Quality & Reliability (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Machine Translation (AREA)
- Telephone Function (AREA)
Abstract
A speech processing method, apparatus, storage medium, and electronic device. The speech processing method includes: obtaining original speech (101); if the original speech is reverberant speech, inputting the original speech into a pre-trained generator sub-model of a generative adversarial network model, where the generator sub-model is used to perform dereverberation processing on the original speech (102); and determining the output speech of the generator sub-model as the dereverberated speech (103).
Description
This application claims priority to Chinese Patent Application No. 201811273432.4, filed with the Chinese Patent Office on October 30, 2018 and entitled "Speech processing method, apparatus, storage medium, and electronic device", the entire contents of which are incorporated herein by reference.

The embodiments of this application relate to the field of speech processing technology, and in particular to a speech processing method, apparatus, storage medium, and electronic device.

With the rapid development of electronic devices such as mobile phones and robots, more and more voice functions are applied on electronic devices, for example voiceprint unlocking and voiceprint wake-up.

However, when the user is far away from the electronic device, the voice signal collected by the device's microphone contains reverberation, which reduces the clarity of the collected voice signal and affects the recognition rate of voiceprint information. The commonly used dereverberation technique at present is WPE (weighted prediction error): in the frequency domain, the reverberation component is estimated from the first several frames of the reverberant speech, and the reverberant speech is differenced with the reverberation component to obtain the dereverberated speech.
SUMMARY

The embodiments of this application provide a speech processing method, apparatus, storage medium, and electronic device that improve the clarity of speech collected by an electronic device.

In a first aspect, an embodiment of this application provides a speech processing method, including:

obtaining original speech;

if the original speech is reverberant speech, inputting the original speech into a pre-trained generator sub-model of a generative adversarial network model, where the generator sub-model is used to perform dereverberation processing on the original speech;

determining the output speech of the generator sub-model as the dereverberated speech.

In a second aspect, an embodiment of this application provides a speech processing apparatus, including:

a speech obtaining module, configured to obtain original speech;

a speech processing module, configured to input the original speech into a pre-trained generator sub-model of a generative adversarial network model if the original speech is reverberant speech, where the generator sub-model is used to perform dereverberation processing on the original speech;

a dereverberated speech determination module, configured to determine the output speech of the generator sub-model as the dereverberated speech.

In a third aspect, an embodiment of this application provides a computer-readable storage medium on which a computer program is stored, the program implementing the following when executed by a processor:

obtaining original speech;

if the original speech is reverberant speech, inputting the original speech into a pre-trained generator sub-model of a generative adversarial network model, where the generator sub-model is used to perform dereverberation processing on the original speech;

determining the output speech of the generator sub-model as the dereverberated speech.

In a fourth aspect, an embodiment of this application provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the following when executing the computer program:

obtaining original speech;

if the original speech is reverberant speech, inputting the original speech into a pre-trained generator sub-model of a generative adversarial network model, where the generator sub-model is used to perform dereverberation processing on the original speech;

determining the output speech of the generator sub-model as the dereverberated speech.
To explain the technical solutions in the embodiments of this application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of this application; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.

FIG. 1 is a schematic flowchart of a speech processing method provided by an embodiment of this application;

FIG. 2 is a schematic flowchart of another speech processing method provided by an embodiment of this application;

FIG. 3 is a schematic flowchart of another speech processing method provided by an embodiment of this application;

FIG. 4 is a schematic flowchart of another speech processing method provided by an embodiment of this application;

FIG. 5 is a schematic structural diagram of a speech processing apparatus provided by an embodiment of this application;

FIG. 6 is a schematic structural diagram of an electronic device provided by an embodiment of this application;

FIG. 7 is a schematic structural diagram of another electronic device provided by an embodiment of this application.

The technical solution of this application is further explained below through specific implementations in conjunction with the drawings. It can be understood that the specific embodiments described here are only used to explain this application, not to limit it. It should also be noted that, for ease of description, the drawings show only the parts related to this application rather than the entire structure.

Before discussing the exemplary embodiments in more detail, it should be mentioned that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart describes the steps as sequential processing, many of the steps can be implemented in parallel, concurrently, or simultaneously. In addition, the order of the steps can be rearranged. The processing may be terminated when its operations are completed, but may also have additional steps not included in the drawings. The processing may correspond to a method, a function, a procedure, a subroutine, a subprogram, and so on.
An embodiment of this application provides a speech processing method, including:

obtaining original speech;

if the original speech is reverberant speech, inputting the original speech into a pre-trained generator sub-model of a generative adversarial network model, where the generator sub-model is used to perform dereverberation processing on the original speech;

determining the output speech of the generator sub-model as the dereverberated speech.

In one embodiment, the generative adversarial network model further includes a discriminant sub-model, the discriminant sub-model being used to discriminate the speech type of input speech;

after obtaining the original speech, the method further includes:

inputting the original speech into the discriminant sub-model of the pre-trained generative adversarial network model, and determining whether the original speech is reverberant speech according to the output result of the discriminant sub-model.

In one embodiment, the training method of the generator sub-model includes:

inputting reverberant speech samples into the generator sub-model to be trained to obtain the generated speech output by the generator sub-model;

inputting the generated speech into the pre-trained discriminant sub-model, and determining, according to the output result of the discriminant sub-model, the discrimination probability that the generated speech is clean speech;

determining loss information according to the discrimination probability that the generated speech is clean speech and the expected probability;

adjusting the network parameters of the generator sub-model based on the loss information.

In one embodiment, the training method of the discriminant sub-model includes:

collecting speech samples and setting a type identifier according to the speech type of each sample, where the speech samples include clean speech samples and reverberant speech samples;

inputting the speech samples into the discriminant sub-model to be trained to obtain the discrimination result of the discriminant sub-model;

adjusting the network parameters of the discriminant sub-model according to the discrimination result and the type identifiers of the speech samples.

In one embodiment, the reverberant speech samples are generated by superimposing clean speech samples with different numbers of reverberation superpositions and/or different reverberation times.

In one embodiment, after determining the output speech of the generator sub-model as the dereverberated speech, the method further includes:

transmitting the dereverberated speech into the discriminant sub-model of the pre-trained generative adversarial network model, and obtaining the output result of the discriminant sub-model;

when the discrimination probability in the output result that the dereverberated speech is clean speech is less than a preset probability, inputting the dereverberated speech into the generator sub-model for a second dereverberation processing.

In one embodiment, after determining the output speech of the generator sub-model as the dereverberated speech, the method further includes:

performing masking processing on the dereverberated speech to generate processed speech.

In one embodiment, performing masking processing on the dereverberated speech to generate the processed speech includes:

performing a short-time Fourier transform on the dereverberated speech to generate an amplitude spectrum and a phase spectrum of the dereverberated speech;

performing masking processing on the amplitude spectrum of the dereverberated speech, recombining the processed amplitude spectrum with the phase spectrum, and performing an inverse short-time Fourier transform to generate the processed speech.

In one embodiment, after determining the output speech of the generator sub-model as the dereverberated speech, the method further includes:

identifying the voiceprint features of the dereverberated speech, and comparing the voiceprint features with preset voiceprint features;

when the comparison succeeds, waking up the device.
FIG. 1 is a schematic flowchart of a speech processing method provided by an embodiment of this application. The method can be executed by a speech processing apparatus, where the apparatus can be implemented in software and/or hardware and can generally be integrated in an electronic device. As shown in FIG. 1, the method includes:

Step 101: Obtain original speech.

Step 102: If the original speech is reverberant speech, input the original speech into a pre-trained generator sub-model of a generative adversarial network model, where the generator sub-model is used to perform dereverberation processing on the original speech.

Step 103: Determine the output speech of the generator sub-model as the dereverberated speech.

Illustratively, the electronic device in the embodiments of this application may include smart devices equipped with a voice collection apparatus, such as mobile phones, tablet computers, robots, and speakers.

In this embodiment, the original speech is collected by the voice collection apparatus provided in the electronic device. For example, a voice signal input by a user may be collected through a microphone, the collected voice signal is converted by an analog-to-digital converter to obtain a digital voice signal, and the digital voice signal is amplified by an amplifier to generate the original speech.

Reverberant speech arises because, when the user is at a large distance from the electronic device, sound waves are reflected during propagation; the reflected wave signals are collected by the electronic device and overlap with the original voice signal, making the collected voice signal unclear. For example, when a user wakes the electronic device indoors through a voice signal, the sound waves propagate indoors and are reflected by obstacles such as walls, ceilings, and floors; the resulting multiple reflected sound waves are collected by the electronic device at different times, forming reverberant speech. In this embodiment, the generative adversarial network (Generative Adversarial Net, GAN) model is pre-trained to have the function of dereverberating reverberant speech and generating clean speech. The generative adversarial network model includes a generator sub-model and a discriminant sub-model; the generator sub-model is used to dereverberate the input original speech, and the discriminant sub-model is used to discriminate the input speech. The output result of the discriminant sub-model may be the speech type of the input speech and the discrimination probability of that speech type; for example, the speech types of the input speech may be clean speech and reverberant speech. Optionally, the generator sub-model and the discriminant sub-model are connected, that is, the output of the generator sub-model serves as the input of the discriminant sub-model: the generator sub-model dereverberates the original speech and inputs the generated speech into the discriminant sub-model, and the generator sub-model is verified according to the output result of the discriminant sub-model.

The generative adversarial network model is obtained by pre-training, with the generator sub-model and the discriminant sub-model trained separately. Illustratively, the discriminant sub-model is first trained on the training samples, and its discrimination accuracy is improved by adjusting the network parameters; after the discriminant sub-model is trained, its network parameters are fixed, the generator sub-model is trained, and the network parameters of the generator sub-model are adjusted so that the discrimination probability that the speech output by the generator sub-model is reverberant decreases. The above training process is cycled; when the output results of the discriminant sub-model and the generator sub-model satisfy the preset error, training of the generative adversarial network model is determined to be complete.

In some embodiments, after the generative adversarial network model has been trained, the collected original speech is directly input into the generator sub-model of the generative adversarial network model, and the generated speech output by the generator sub-model is determined as the dereverberated speech, that is, clean speech.

In some embodiments, after the original speech is obtained, the method further includes: inputting the original speech into the discriminant sub-model of the pre-trained generative adversarial network model, and determining whether the original speech is reverberant speech according to the output result of the discriminant sub-model. When the original speech is reverberant speech, dereverberation processing is performed on it based on the pre-trained generative adversarial network model; when the original speech is clean speech, no dereverberation is needed. By discriminating the speech type of the original speech, an ineffective processing pass on clean speech is omitted, the signal loss that this pass would cause is avoided, and the targeting of speech signal processing is improved.

In some embodiments, after the output speech of the generator sub-model is determined as the dereverberated speech, the method may further include: transmitting the dereverberated speech into the discriminant sub-model of the pre-trained generative adversarial network model and obtaining its output result; when the discrimination probability in the output result that the dereverberated speech is clean speech is less than a preset probability, inputting the dereverberated speech into the generator sub-model for a second dereverberation processing. The output result of the generator sub-model is checked by the discriminant sub-model; when it does not meet the preset requirement, it undergoes another dereverberation pass until it does. The preset probability of clean speech in the preset requirement may be set according to user needs, for example 80%. This improves the dereverberation accuracy for the original speech and the clarity of the output speech, further improves the recognition rate of voiceprint recognition and speech matching performed on the output speech, avoids misoperation of the electronic device, and improves control precision.

The speech processing method provided in the embodiments of this application obtains original speech, inputs the original speech into a pre-trained generator sub-model of a generative adversarial network model if the original speech is reverberant speech, where the generator sub-model is used to perform dereverberation processing on the original speech, and determines the output speech of the generator sub-model as the dereverberated speech. With the above scheme, dereverberation is performed on the user's original speech based on the GAN network without extracting speech features from the original speech, quickly obtaining high-precision dereverberated speech and improving the processing efficiency and accuracy of original speech signal processing.
FIG. 2 is a schematic flowchart of another speech processing method provided by an embodiment of this application. Referring to FIG. 2, the method of this embodiment includes the following steps:

Step 201: Collect speech samples and set a type identifier according to the speech type of each sample, where the speech samples include clean speech samples and reverberant speech samples.

Step 202: Input the speech samples into the discriminant sub-model to be trained to obtain the discrimination result of the discriminant sub-model.

Step 203: Adjust the network parameters of the discriminant sub-model according to the discrimination result and the type identifiers of the speech samples.

Step 204: Input reverberant speech samples into the generator sub-model to be trained to obtain the generated speech output by the generator sub-model.

Step 205: Input the generated speech into the pre-trained discriminant sub-model, and determine, according to the output result of the discriminant sub-model, the discrimination probability that the generated speech is clean speech.

Step 206: Determine loss information according to the discrimination probability and the expected probability of the generated speech, and adjust the network parameters of the generator sub-model based on the loss information.

Step 207: Obtain original speech, input the original speech into the discriminant sub-model of the pre-trained generative adversarial network model, and determine whether the original speech is reverberant speech according to the output result of the discriminant sub-model.

Step 208: If the original speech is reverberant speech, input the original speech into the pre-trained generator sub-model of the generative adversarial network model, where the generator sub-model is used to perform dereverberation processing on the original speech.

Step 209: Determine the output speech of the generator sub-model as the dereverberated speech.

In this embodiment, the discriminant sub-model in the generative adversarial network model is trained through steps 201 to 203. The clean speech may be collected by the electronic device or obtained through a network search; the reverberant speech samples are generated by superimposing clean speech samples with different numbers of reverberation superpositions and/or different reverberation times. Illustratively, reverberant speech may be generated by superimposing clean speech twice or multiple times, where the interval between the superimposed speech signals may differ, generating different reverberant speech samples; this improves the diversity of the reverberant speech samples and further improves the training accuracy of the generative adversarial network model.

The type identifier of a clean speech sample may be 1 and that of a reverberant speech sample may be 0, used to distinguish the speech samples. The sample speech is input into the discriminant sub-model to be trained, and the discrimination result of the discriminant sub-model is obtained; the result includes the speech type of the sample speech and the discrimination probability. Illustratively, the discrimination result may be 60% clean speech and 40% reverberant speech. The expected probability is determined according to the type identifier of the speech sample: for example, when the type identifier of the input sample is 1, the expected probability is 100% clean speech and 0% reverberant speech. From the discrimination probability and the expected probability, the loss value is 40%, and the network parameters of the discriminant sub-model are adjusted in reverse according to the loss value, where the network parameters include, but are not limited to, weight values and bias values. Steps 201 to 203 are executed iteratively until the discrimination result meets the preset accuracy, and training of the discriminant sub-model is determined to be complete.

Through steps 204 to 206, the generator sub-model in the generative adversarial network model is trained on the basis of the trained discriminant sub-model: the reverberant speech samples are input into the generator sub-model to be trained to obtain the generated speech it outputs, the generated speech is input into the trained discriminant sub-model to be discriminated, and the speech type and discrimination probability of the generated speech are determined. For example, based on the discriminant sub-model it is determined that the generated speech is reverberant speech with a discrimination probability of 60%, the discrimination probability of clean speech being 40%. In this embodiment, the expected probability for generated speech is 100% clean speech and 0% reverberant speech, so the loss information is 60%; the network parameters of the generator sub-model are adjusted in reverse according to the loss information, where the network parameters include, but are not limited to, weight values and bias values. Steps 204 to 206 are executed iteratively until the discrimination result of the generated speech output by the generator sub-model meets the preset accuracy, and training of the generator sub-model is determined to be complete, that is, the trained generator sub-model has the function of dereverberating input speech.

It should be noted that steps 201 to 203 and steps 204 to 206 can be executed cyclically, that is, the discriminant sub-model and the generator sub-model are trained alternately multiple times until both satisfy the training conditions. The trained discriminant sub-model and generator sub-model satisfy the following formula:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

where D is the discriminant sub-model, G is the generator sub-model, x is the clean speech signal with signal distribution p_data(x), and z is the reverberant speech signal with signal distribution p_z(z).

The speech processing method provided in this embodiment trains the discriminant sub-model and the generator sub-model in the generative adversarial network model separately, obtaining a discriminant sub-model with a reverberant-speech discrimination function and a generator sub-model with a dereverberation function, and performs dereverberation on the original speech collected by the electronic device to obtain clear dereverberated speech, with simple operation and high processing efficiency.
FIG. 3 is a schematic flowchart of another speech processing method provided by an embodiment of this application. Referring to FIG. 3, the method of this embodiment includes the following steps:

Step 301: Obtain original speech, input the original speech into the discriminant sub-model of the pre-trained generative adversarial network model, and determine whether the original speech is reverberant speech according to the output result of the discriminant sub-model.

Step 302: If the original speech is reverberant speech, input the original speech into the pre-trained generator sub-model of the generative adversarial network model, where the generator sub-model is used to perform dereverberation processing on the original speech.

Step 303: Determine the output speech of the generator sub-model as the dereverberated speech.

Step 304: Perform masking processing on the dereverberated speech to generate processed speech.

In this embodiment, masking the dereverberated speech serves to improve its signal quality and avoid signal distortion caused by the dereverberation processing; the masking compensates for distorted signals in the dereverberated speech. Optionally, it is determined whether the dereverberated speech contains signal distortion; if so, masking processing is performed on it, and if not, subsequent processing is performed directly, for example waking the electronic device by voiceprint based on the dereverberated speech, or generating other control instructions based on the dereverberated speech.

Optionally, performing masking processing on the dereverberated speech to generate the processed speech includes: performing a short-time Fourier transform on the dereverberated speech to generate its amplitude spectrum and phase spectrum; performing masking processing on the amplitude spectrum of the dereverberated speech, recombining the processed amplitude spectrum with the phase spectrum, and performing an inverse short-time Fourier transform to generate the processed speech. Masking the amplitude spectrum of the dereverberated speech may be done as follows: for each distorted frequency bin in the amplitude spectrum of a signal frame, smoothing is performed according to the amplitude values of the bins adjacent to the distorted bin to obtain its amplitude value. The smoothing according to the adjacent bins may take the amplitude value of an adjacent bin as the amplitude value of the distorted bin, or take the mean of the amplitude values of the preceding and following adjacent bins as the amplitude value of the distorted bin.

Optionally, masking the amplitude spectrum of the dereverberated speech may also be done by smoothing the amplitude value of each frequency bin of the current signal frame with the amplitude value of the corresponding bin of the previous signal frame for which masking has been completed, generating the processed amplitude spectrum of the current signal frame, for example according to a frame-recursive smoothing formula.

In the speech processing method provided in the embodiments of this application, after the original speech is dereverberated based on the pre-trained generative adversarial network model, the obtained dereverberated speech is masked to eliminate the signal distortion introduced during the dereverberation process, improving the signal quality of the processed speech and facilitating the accuracy of subsequent recognition of the processed speech.
FIG. 4 is a schematic flowchart of another speech processing method provided by an embodiment of this application. This embodiment is an optional solution of the foregoing embodiments. Correspondingly, as shown in FIG. 4, the method of this embodiment includes the following steps:

Step 401: Obtain original speech, input the original speech into the discriminant sub-model of the pre-trained generative adversarial network model, and determine whether the original speech is reverberant speech according to the output result of the discriminant sub-model.

Step 402: If the original speech is reverberant speech, input the original speech into the pre-trained generator sub-model of the generative adversarial network model, where the generator sub-model is used to perform dereverberation processing on the original speech.

Step 403: Determine the output speech of the generator sub-model as the dereverberated speech.

Step 404: Perform masking processing on the dereverberated speech to generate processed speech.

Step 405: Identify the voiceprint features of the processed speech, and compare the voiceprint features with preset voiceprint features.

Step 406: When the comparison succeeds, wake up the device.

Illustratively, when the collected original speech is clean speech, step 404 is performed directly.

In this embodiment, the voiceprint features of an authorized user and a wake-up keyword are preset in the electronic device. The voiceprint features and keywords in the processed speech are identified, the identified keyword is matched against the wake-up keyword, and the extracted voiceprint features are matched against the authorized user's voiceprint features; when both match successfully, the electronic device is woken. Illustratively, when the electronic device is a mobile phone, waking it may be switching from the lock-screen state to the working state and generating a corresponding control instruction according to the keywords in the processed speech. For example, the keyword recognized from the processed speech may be "Hey Siri, how is the weather today"; when the keyword "Hey Siri" matches the preset wake-up keyword and the extracted voiceprint features match those of the authorized user, a weather query instruction is generated according to "how is the weather today", the instruction is executed, and the query result is output through voice playback or graphic-and-text display.

It should be noted that step 404 may be omitted by directly extracting the voiceprint features of the dereverberated speech and waking the electronic device by voiceprint based on them.

The speech processing method provided in this embodiment wakes the electronic device by voiceprint using the original speech collected from the user, and performs high-precision dereverberation on the original speech based on the generator sub-model of the generative adversarial network model, improving the clarity of the dereverberated speech, further improving the accuracy and recognition rate of the voiceprint features of the dereverberated speech, avoiding misoperation of the electronic device, and improving the control precision of the electronic device.
FIG. 5 is a structural block diagram of a speech processing apparatus provided by an embodiment of this application. The apparatus can be implemented in software and/or hardware, is generally integrated in an electronic device, and can perform dereverberation processing on collected voice signals by executing the speech processing method of the electronic device. As shown in FIG. 5, the apparatus includes: a speech obtaining module 501, a speech processing module 502, and a dereverberated speech determination module 503.

The speech obtaining module 501 is configured to obtain original speech;

the speech processing module 502 is configured to input the original speech into a pre-trained generator sub-model of a generative adversarial network model if the original speech is reverberant speech, where the generator sub-model is used to perform dereverberation processing on the original speech;

the dereverberated speech determination module 503 is configured to determine the output speech of the generator sub-model as the dereverberated speech.

The speech processing apparatus provided in the embodiments of this application performs dereverberation on the user's original speech based on the GAN network; without extracting speech features from the original speech, it quickly obtains high-precision dereverberated speech, improving the processing efficiency and accuracy of original speech signal processing.

On the basis of the above embodiments, the generative adversarial network model further includes a discriminant sub-model, where the discriminant sub-model is used to discriminate the speech type of input speech.

On the basis of the above embodiments, the apparatus further includes:

a reverberant speech discrimination module, configured to input the original speech, after it is obtained, into the discriminant sub-model of the pre-trained generative adversarial network model, and determine whether the original speech is reverberant speech according to the output result of the discriminant sub-model.

On the basis of the above embodiments, the apparatus further includes:

a generator sub-model training module, configured to input reverberant speech samples into the generator sub-model to be trained to obtain the generated speech output by the generator sub-model; input the generated speech into the pre-trained discriminant sub-model, and determine, according to the output result of the discriminant sub-model, the discrimination probability that the generated speech is clean speech; determine loss information according to the discrimination probability and the expected probability of the generated speech; and adjust the network parameters of the generator sub-model based on the loss information.

On the basis of the above embodiments, the apparatus further includes:

a discriminant sub-model training module, configured to collect speech samples and set a type identifier according to the speech type of each sample, where the speech samples include clean speech samples and reverberant speech samples; input the speech samples into the discriminant sub-model to be trained to obtain the discrimination result of the discriminant sub-model; and adjust the network parameters of the discriminant sub-model according to the discrimination result and the type identifiers of the speech samples.

On the basis of the above embodiments, the reverberant speech samples are generated by superimposing clean speech samples with different numbers of reverberation superpositions and/or different reverberation times.

On the basis of the above embodiments, the apparatus further includes:

a masking processing module, configured to perform masking processing on the dereverberated speech after the output speech of the generator sub-model is determined as the dereverberated speech, generating processed speech.

On the basis of the above embodiments, the masking processing module is configured to:

perform a short-time Fourier transform on the dereverberated speech to generate the amplitude spectrum and phase spectrum of the dereverberated speech;

perform masking processing on the amplitude spectrum of the dereverberated speech, recombine the processed amplitude spectrum with the phase spectrum, and perform an inverse short-time Fourier transform to generate the processed speech.

On the basis of the above embodiments, the apparatus further includes:

a voiceprint recognition module, configured to identify the voiceprint features of the dereverberated speech and compare the voiceprint features with preset voiceprint features;

a device wake-up module, configured to wake up the device when the comparison succeeds.
An embodiment of this application also provides a storage medium containing computer-executable instructions which, when executed by a computer processor, are used to perform a speech processing method, the method including:

obtaining original speech;

if the original speech is reverberant speech, inputting the original speech into a pre-trained generator sub-model of a generative adversarial network model, where the generator sub-model is used to perform dereverberation processing on the original speech;

determining the output speech of the generator sub-model as the dereverberated speech.

Storage medium — any of various types of memory devices or storage devices. The term "storage medium" is intended to include: installation media such as CD-ROMs, floppy disks, or tape devices; computer system memory or random access memory such as DRAM, DDR RAM, SRAM, EDO RAM, Rambus RAM, etc.; non-volatile memory such as flash memory or magnetic media (e.g., hard disks or optical storage); registers or other similar types of memory elements, etc. The storage medium may also include other types of memory or combinations thereof. In addition, the storage medium may be located in the first computer system in which the program is executed, or in a different, second computer system connected to the first computer system through a network such as the Internet. The second computer system may provide program instructions to the first computer for execution. The term "storage medium" may include two or more storage media that may reside in different locations (for example, in different computer systems connected through a network). The storage medium may store program instructions executable by one or more processors (for example, embodied as a computer program).

Of course, in the storage medium containing computer-executable instructions provided by the embodiments of this application, the computer-executable instructions are not limited to the speech processing operations described above and may also perform related operations in the speech processing method provided by any embodiment of this application.
An embodiment of this application provides an electronic device into which the speech processing apparatus provided by the embodiments of this application can be integrated. FIG. 6 is a schematic structural diagram of an electronic device provided by an embodiment of this application. The electronic device 600 may include: a memory 601, a processor 602, and a computer program stored in the memory 601 and executable on the processor 602; when the processor 602 executes the computer program, the speech processing method described in the embodiments of this application is implemented.

The electronic device provided in the embodiments of this application performs dereverberation on the user's original speech based on the GAN network; without extracting speech features from the original speech, it quickly obtains high-precision dereverberated speech, improving the processing efficiency and accuracy of original speech signal processing.

FIG. 7 is a schematic structural diagram of another electronic device provided by an embodiment of this application. The electronic device may include: a housing (not shown), a memory 701, a central processing unit (CPU) 702 (also called a processor, hereinafter CPU), a circuit board (not shown), and a power circuit (not shown). The circuit board is arranged inside the space enclosed by the housing; the CPU 702 and the memory 701 are arranged on the circuit board; the power circuit is used to supply power to each circuit or component of the electronic device; the memory 701 is used to store executable program code; and the CPU 702 runs the computer program corresponding to the executable program code by reading the executable program code stored in the memory 701, to implement the following steps:

obtaining original speech;

if the original speech is reverberant speech, inputting the original speech into a pre-trained generator sub-model of a generative adversarial network model, where the generator sub-model is used to perform dereverberation processing on the original speech;

determining the output speech of the generator sub-model as the dereverberated speech.

The electronic device further includes: a peripheral interface 703, an RF (Radio Frequency) circuit 705, an audio circuit 706, a speaker 711, a power management chip 708, an input/output (I/O) subsystem 709, other input/control devices 710, a touch screen 712, and an external port 704; these components communicate through one or more communication buses or signal lines 707.

It should be understood that the illustrated electronic device 700 is only an example of an electronic device, and the electronic device 700 may have more or fewer components than shown in the figure, may combine two or more components, or may have a different component configuration. The components shown in the figure may be implemented in hardware, software, or a combination of hardware and software, including one or more signal-processing and/or application-specific integrated circuits.

The electronic device for speech processing operations provided in this embodiment is described in detail below, taking a mobile phone as an example.

Memory 701: the memory 701 may be accessed by the CPU 702, the peripheral interface 703, and so on; it may include high-speed random access memory and may also include non-volatile memory, such as one or more magnetic disk storage devices, flash memory devices, or other non-volatile solid-state storage devices.

Peripheral interface 703: the peripheral interface 703 may connect the input and output peripherals of the device to the CPU 702 and the memory 701.

I/O subsystem 709: the I/O subsystem 709 may connect input/output peripherals on the device, such as the touch screen 712 and the other input/control devices 710, to the peripheral interface 703. The I/O subsystem 709 may include a display controller 7091 and one or more input controllers 7092 for controlling the other input/control devices 710. The one or more input controllers 7092 receive electrical signals from, or send electrical signals to, the other input/control devices 710, which may include physical buttons (press buttons, rocker buttons, etc.), dials, slide switches, joysticks, and click wheels. It is worth noting that an input controller 7092 may be connected to any of the following: a keyboard, an infrared port, a USB interface, or a pointing device such as a mouse.

Touch screen 712: the touch screen 712 is the input and output interface between the electronic device and the user, displaying visual output to the user; the visual output may include graphics, text, icons, video, and the like.

The display controller 7091 in the I/O subsystem 709 receives electrical signals from, or sends electrical signals to, the touch screen 712. The touch screen 712 detects contact on the touch screen, and the display controller 7091 converts the detected contact into interaction with the user interface objects displayed on the touch screen 712, that is, realizes human-computer interaction; the user interface objects displayed on the touch screen 712 may be icons for running games, icons for connecting to the corresponding network, and so on. It is worth noting that the device may also include an optical mouse, which is a touch-sensitive surface that does not display visual output, or an extension of the touch-sensitive surface formed by the touch screen.

RF circuit 705: mainly used to establish communication between the mobile phone and the wireless network (that is, the network side) and to receive and send data between the mobile phone and the wireless network, for example sending and receiving short messages and e-mail. Specifically, the RF circuit 705 receives and sends RF signals, also called electromagnetic signals; it converts electrical signals into electromagnetic signals or electromagnetic signals into electrical signals, and communicates with the communication network and other devices through the electromagnetic signals. The RF circuit 705 may include known circuits for performing these functions, including but not limited to an antenna system, an RF transceiver, one or more amplifiers, a tuner, one or more oscillators, a digital signal processor, a CODEC (COder-DECoder) chipset, a Subscriber Identity Module (SIM), and so on.

Audio circuit 706: mainly used to receive audio data from the peripheral interface 703, convert the audio data into an electrical signal, and send the electrical signal to the speaker 711.

Speaker 711: used to restore the voice signal received by the mobile phone from the wireless network through the RF circuit 705 to sound, and play the sound to the user.

Power management chip 708: used to supply power to and manage the power of the hardware connected to the CPU 702, the I/O subsystem, and the peripheral interface.

The speech processing apparatus, storage medium, and electronic device provided in the above embodiments can execute the speech processing method provided by any embodiment of this application and have the corresponding functional modules and beneficial effects for performing the method. For technical details not exhaustively described in the above embodiments, refer to the speech processing method provided by any embodiment of this application.

Note that the above are only preferred embodiments of this application and the technical principles applied. Those skilled in the art will understand that this application is not limited to the specific embodiments described here, and that various obvious changes, readjustments, and substitutions can be made without departing from the protection scope of this application. Therefore, although this application has been described in some detail through the above embodiments, it is not limited to the above embodiments; without departing from the concept of this application, it may also include more other equivalent embodiments, and the scope of this application is determined by the scope of the appended claims.
Claims (20)
- A speech processing method, comprising: obtaining original speech; if the original speech is reverberant speech, inputting the original speech into a pre-trained generator sub-model of a generative adversarial network model, wherein the generator sub-model is used to perform dereverberation processing on the original speech; and determining the output speech of the generator sub-model as the dereverberated speech.
- The method according to claim 1, wherein the generative adversarial network model further comprises a discriminant sub-model, the discriminant sub-model being used to discriminate the speech type of input speech; and wherein, after obtaining the original speech, the method further comprises: inputting the original speech into the discriminant sub-model of the pre-trained generative adversarial network model, and determining whether the original speech is reverberant speech according to the output result of the discriminant sub-model.
- The method according to claim 2, wherein the training method of the generator sub-model comprises: inputting reverberant speech samples into the generator sub-model to be trained to obtain the generated speech output by the generator sub-model; inputting the generated speech into the pre-trained discriminant sub-model, and determining, according to the output result of the discriminant sub-model, the discrimination probability that the generated speech is clean speech; determining loss information according to the discrimination probability that the generated speech is clean speech and the expected probability; and adjusting the network parameters of the generator sub-model based on the loss information.
- The method according to claim 3, wherein the training method of the discriminant sub-model comprises: collecting speech samples and setting a type identifier according to the speech type of each sample, wherein the speech samples include clean speech samples and reverberant speech samples; inputting the speech samples into the discriminant sub-model to be trained to obtain the discrimination result of the discriminant sub-model; and adjusting the network parameters of the discriminant sub-model according to the discrimination result and the type identifiers of the speech samples.
- The method according to claim 3, wherein the reverberant speech samples are generated by superimposing clean speech samples with different numbers of reverberation superpositions and/or different reverberation times.
- The method according to claim 3, wherein, after determining the output speech of the generator sub-model as the dereverberated speech, the method further comprises: transmitting the dereverberated speech into the discriminant sub-model of the pre-trained generative adversarial network model, and obtaining the output result of the discriminant sub-model; and when the discrimination probability in the output result that the dereverberated speech is clean speech is less than a preset probability, inputting the dereverberated speech into the generator sub-model for a second dereverberation processing.
- The method according to claim 1, wherein, after determining the output speech of the generator sub-model as the dereverberated speech, the method further comprises: performing masking processing on the dereverberated speech to generate processed speech.
- The method according to claim 7, wherein performing masking processing on the dereverberated speech to generate the processed speech comprises: performing a short-time Fourier transform on the dereverberated speech to generate an amplitude spectrum and a phase spectrum of the dereverberated speech; and performing masking processing on the amplitude spectrum of the dereverberated speech, recombining the processed amplitude spectrum with the phase spectrum, and performing an inverse short-time Fourier transform to generate the processed speech.
- The method according to claim 1, wherein, after determining the output speech of the generator sub-model as the dereverberated speech, the method further comprises: identifying the voiceprint features of the dereverberated speech, and comparing the voiceprint features with preset voiceprint features; and when the comparison succeeds, waking up the device.
- A speech processing apparatus, comprising: a speech obtaining module, configured to obtain original speech; a speech processing module, configured to input the original speech into a pre-trained generator sub-model of a generative adversarial network model if the original speech is reverberant speech, wherein the generator sub-model is used to perform dereverberation processing on the original speech; and a dereverberated speech determination module, configured to determine the output speech of the generator sub-model as the dereverberated speech.
- A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements: obtaining original speech; if the original speech is reverberant speech, inputting the original speech into a pre-trained generator sub-model of a generative adversarial network model, wherein the generator sub-model is used to perform dereverberation processing on the original speech; and determining the output speech of the generator sub-model as the dereverberated speech.
- An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements: obtaining original speech; if the original speech is reverberant speech, inputting the original speech into a pre-trained generator sub-model of a generative adversarial network model, wherein the generator sub-model is used to perform dereverberation processing on the original speech; and determining the output speech of the generator sub-model as the dereverberated speech.
- The electronic device according to claim 12, wherein the generative adversarial network model further comprises a discriminant sub-model, the discriminant sub-model being used to discriminate the speech type of input speech, and wherein, after the original speech is obtained, the processor is further configured to: input the original speech into the discriminant sub-model of the pre-trained generative adversarial network model, and determine whether the original speech is reverberant speech according to the output result of the discriminant sub-model.
- The electronic device according to claim 13, wherein the processor is further configured to: input reverberant speech samples into the generator sub-model to be trained to obtain the generated speech output by the generator sub-model; input the generated speech into the pre-trained discriminant sub-model, and determine, according to the output result of the discriminant sub-model, the discrimination probability that the generated speech is clean speech; determine loss information according to the discrimination probability that the generated speech is clean speech and the expected probability; and adjust the network parameters of the generator sub-model based on the loss information.
- The electronic device according to claim 14, wherein the processor is further configured to: collect speech samples and set a type identifier according to the speech type of each sample, wherein the speech samples include clean speech samples and reverberant speech samples; input the speech samples into the discriminant sub-model to be trained to obtain the discrimination result of the discriminant sub-model; and adjust the network parameters of the discriminant sub-model according to the discrimination result and the type identifiers of the speech samples.
- The electronic device according to claim 14, wherein the reverberant speech samples are generated by superimposing clean speech samples with different numbers of reverberation superpositions and/or different reverberation times.
- The electronic device according to claim 14, wherein, after the output speech of the generator sub-model is determined as the dereverberated speech, the processor is further configured to: transmit the dereverberated speech into the discriminant sub-model of the pre-trained generative adversarial network model, and obtain the output result of the discriminant sub-model; and when the discrimination probability in the output result that the dereverberated speech is clean speech is less than a preset probability, input the dereverberated speech into the generator sub-model for a second dereverberation processing.
- The electronic device according to claim 12, wherein, after the output speech of the generator sub-model is determined as the dereverberated speech, the processor is further configured to: perform masking processing on the dereverberated speech to generate processed speech.
- The electronic device according to claim 18, wherein, when performing masking processing on the dereverberated speech to generate the processed speech, the processor is configured to: perform a short-time Fourier transform on the dereverberated speech to generate the amplitude spectrum and phase spectrum of the dereverberated speech; and perform masking processing on the amplitude spectrum of the dereverberated speech, recombine the processed amplitude spectrum with the phase spectrum, and perform an inverse short-time Fourier transform to generate the processed speech.
- The electronic device according to claim 12, wherein, after the output speech of the generator sub-model is determined as the dereverberated speech, the processor is further configured to: identify the voiceprint features of the dereverberated speech, and compare the voiceprint features with preset voiceprint features; and when the comparison succeeds, wake up the device.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201811273432.4 | 2018-10-30 | | |
| CN201811273432.4A (CN109119090A) | 2018-10-30 | 2018-10-30 | Speech processing method, apparatus, storage medium and electronic device |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2020088153A1 | 2020-05-07 |
Family
ID=64854713
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2019/107578 (WO2020088153A1) | Speech processing method, apparatus, storage medium and electronic device | 2018-10-30 | 2019-09-24 |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN109119090A |
| WO (1) | WO2020088153A1 |
Families Citing this family (11)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN109119090A | 2018-10-30 | 2019-01-01 | Oppo广东移动通信有限公司 | Speech processing method, apparatus, storage medium and electronic device |
| CN109887489B | 2019-02-23 | 2021-10-26 | 天津大学 | Speech dereverberation method based on deep features of generative adversarial networks |
| CN110458904B | 2019-08-06 | 2023-11-10 | 苏州瑞派宁科技有限公司 | Method and apparatus for generating capsule endoscope images, and computer storage medium |
| CN110853663B | 2019-10-12 | 2023-04-28 | 平安科技(深圳)有限公司 | Artificial-intelligence-based speech enhancement method, server, and storage medium |
| CN111489760B | 2020-04-01 | 2023-05-16 | 腾讯科技(深圳)有限公司 | Speech signal dereverberation processing method and apparatus, computer device, and storage medium |
| CN112652321B | 2020-09-30 | 2023-05-02 | 北京清微智能科技有限公司 | Deep-learning-based, phase-friendlier speech noise reduction system and method |
| CN112653979A | 2020-12-29 | 2021-04-13 | 苏州思必驰信息科技有限公司 | Adaptive dereverberation method and apparatus |
| CN112992170B | 2021-01-29 | 2022-10-28 | 青岛海尔科技有限公司 | Model training method and apparatus, storage medium, and electronic apparatus |
| CN113112998B | 2021-05-11 | 2024-03-15 | 腾讯音乐娱乐科技(深圳)有限公司 | Model training method, reverberation effect reproduction method, device, and readable storage medium |
| CN114333882B | 2022-03-09 | 2022-08-19 | 深圳市友杰智新科技有限公司 | Amplitude-spectrum-based speech noise reduction method, apparatus, device, and storage medium |
| CN115295013B | 2022-08-02 | 2024-10-01 | 北京声智科技有限公司 | Sample determination method, apparatus, and electronic device |
Family Cites Families (3)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JPWO2017168870A1 | 2016-03-28 | 2019-02-07 | ソニー株式会社 | Information processing apparatus and information processing method |
| CN107452389B | 2017-07-20 | 2020-09-01 | 大象声科(深圳)科技有限公司 | A universal single-channel real-time noise reduction method |
| CN110660403B | 2018-06-28 | 2024-03-08 | 北京搜狗科技发展有限公司 | Audio data processing method, apparatus, device, and readable storage medium |
- 2018-10-30: Chinese application CN201811273432.4A filed (published as CN109119090A; status: pending)
- 2019-09-24: PCT application PCT/CN2019/107578 filed (published as WO2020088153A1)
Patent Citations (6)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2012155301A | 2011-01-21 | 2012-08-16 | Wrk Solution Co Ltd | Situation-aware speech recognition method |
| CN105448302A | 2015-11-10 | 2016-03-30 | 厦门快商通信息技术有限公司 | Environment-adaptive speech reverberation cancellation method and system |
| CN107293289A | 2017-06-13 | 2017-10-24 | 南京医科大学 | Speech generation method based on a deep convolutional generative adversarial network |
| CN108346433A | 2017-12-28 | 2018-07-31 | 北京搜狗科技发展有限公司 | Audio processing method, apparatus, device, and readable storage medium |
| CN108597496A | 2018-05-07 | 2018-09-28 | 广州势必可赢网络科技有限公司 | Speech generation method and apparatus based on a generative adversarial network |
| CN109119090A | 2018-10-30 | 2019-01-01 | Oppo广东移动通信有限公司 | Speech processing method, apparatus, storage medium, and electronic device |
Also Published As
Publication number | Publication date |
---|---|
CN109119090A (zh) | 2019-01-01 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 19879415; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 19879415; Country of ref document: EP; Kind code of ref document: A1 |