CN112420063A - Voice enhancement method and device - Google Patents

Voice enhancement method and device

Info

Publication number
CN112420063A
Authority
CN
China
Prior art keywords
sound
information
voice
electronic device
voiceprint
Prior art date
Legal status (assumed; not a legal conclusion)
Pending
Application number
CN201910774538.0A
Other languages
Chinese (zh)
Inventor
王保辉
李伟
李晓建
胡伟湘
Current Assignee (listing may be inaccurate)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (assumed; not a legal conclusion)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN201910774538.0A
Priority to PCT/CN2020/105296 (published as WO2021031811A1)
Publication of CN112420063A
Current legal status: Pending


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/04 Training, enrolment or model building
    • G10L17/06 Decision making techniques; Pattern matching strategies
    • G10L17/14 Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
    • G10L17/22 Interactive procedures; Man-machine interfaces
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0272 Voice signal separating
    • G10L2021/02087 Noise filtering, the noise being separate speech, e.g. cocktail party

Abstract

The embodiments of this application disclose a voice enhancement method and device, relate to the field of communication technology, and solve the problem that prior-art approaches cannot adapt to the environment in scenes with a complex sound environment, leading to a poor intelligent-interaction experience for the user. The specific scheme is as follows: an electronic device collects a first sound, where the first sound includes at least one of a second sound and a background sound; the electronic device recognizes the first sound; when the second sound is present in the first sound, the electronic device performs sound analysis on the second sound to obtain a third sound; and the electronic device processes the third sound.

Description

Voice enhancement method and device
Technical Field
The embodiment of the application relates to the technical field of communication, in particular to a voice enhancement method and device.
Background
Conventional speech enhancement algorithms apply one and the same enhancement to all input sounds in every environment, regardless of what those sounds are. For example, when the sound received by a smart speaker includes a mixture of human voice, television sound, dog barking, running water, and so on, an existing speech enhancement algorithm enhances all of these sounds equally. The smart speaker may then be unable to acquire the user's command accurately, so the voice interaction produces no output, or an inaccurate output, and the user's intelligent-interaction experience is poor. In other words, existing speech enhancement algorithms cannot adapt to the environment in scenes with a complex sound environment, which degrades the user's intelligent-interaction experience.
Disclosure of Invention
The embodiments of this application provide a voice enhancement method and device that can accurately capture the target user's voice from a complex sound environment and improve the user's intelligent-interaction experience.
In a first aspect of the embodiments of this application, a speech enhancement method is provided. The method includes: an electronic device collects a first sound, where the first sound includes at least one of a second sound and a background sound; the electronic device recognizes the first sound; when the second sound is present in the first sound, the electronic device performs sound analysis on the second sound to obtain a third sound; and the electronic device processes the third sound. Based on this scheme, the electronic device recognizes the collected first sound and, when a second sound is present in it, separates the second sound (human voice) from the first sound and extracts a third sound (the user's voice interaction command) from the second sound. Because only the user's voice interaction command is processed, the resulting voice interaction output is relatively accurate, voice interaction can be completed correctly, and the user's intelligent-interaction experience is improved. When the voice interaction command is extracted from a complex sound environment, the third sound is obtained by recognizing and analyzing the sound in combination with the sound's attribute information. The speech enhancement method therefore does not enhance all input sounds equally, but enhances them in a targeted way based on the attribute information of the currently collected sound, so it can adapt to a complex sound environment and improve the user's intelligent-interaction experience in such an environment.
With reference to the first aspect, in a possible implementation, the electronic device recognizing the first sound includes: the electronic device performs sound event recognition on the first sound according to a sound event recognition model and acquires sound category information of the first sound. With this scheme, the sound category information of the first sound can be obtained by recognizing the first sound, so that the second sound can be extracted from the first sound according to that sound category information.
With reference to the first aspect or any possible implementation of the first aspect, in another possible implementation, the electronic device performing sound analysis on the second sound to obtain the third sound includes: the electronic device separates the second sound from the first sound according to the sound category information of the first sound; the electronic device analyzes the sound attribute information of the second sound, where the sound attribute information includes one or more of sound azimuth information, voiceprint information, sound time information, and sound decibel information; and the electronic device obtains the third sound according to the sound attribute information of the second sound. With this scheme, the second sound is separated from the first sound and its attributes are analyzed, so that a specific voice can be extracted from multiple human voices according to the attribute information of the second sound, yielding a clean user voice interaction command and achieving targeted enhancement.
With reference to the first aspect or any possible implementation of the first aspect, in another possible implementation, the voiceprint information of the third sound matches the voiceprint information of a registered user. With this scheme, the sound whose voiceprint information matches that of a registered user can be determined to be the third sound.
With reference to the first aspect or any possible implementation of the first aspect, in another possible implementation, the method further includes: the electronic device clusters a fourth sound to acquire new sound category information, where the fourth sound is the part of the first sound whose sound category information is not recognized by the sound event recognition model; and the sound event recognition model is updated according to the new sound category information to obtain an updated sound event recognition model. With this scheme, sounds that the sound event recognition model does not recognize can be clustered and a new sound event recognition model trained; in other words, the electronic device can learn sounds that frequently occur in its environment and update the sound event recognition model, adapting itself to the environment and improving the user's interaction experience. The stability and robustness of the sound event recognition model also improve as the user's usage time grows, giving better results over time.
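As an illustration of the clustering step described above, the following sketch (not taken from the patent; the feature dimensionality, the use of k-means, and all names are assumptions) groups feature vectors of sounds that the sound event recognition model could not label into candidate new categories, whose labels could then be used to retrain the model.

```python
# Hypothetical sketch: cluster unrecognized sounds ("fourth sound") into
# candidate new sound categories. Feature extraction and model retraining
# are assumed to happen elsewhere; random vectors stand in for features.
import numpy as np
from sklearn.cluster import KMeans

def discover_new_categories(unrecognized_features: np.ndarray, n_candidates: int = 3):
    kmeans = KMeans(n_clusters=n_candidates, n_init=10, random_state=0)
    labels = kmeans.fit_predict(unrecognized_features)
    # Each cluster can seed a new sound category; the sound event recognition
    # model would then be updated (retrained) with these pseudo-labels.
    return labels, kmeans.cluster_centers_

# Placeholder 128-dimensional embeddings of 200 unrecognized clips
labels, centroids = discover_new_categories(np.random.randn(200, 128))
```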
With reference to the first aspect or any possible implementation of the first aspect, in another possible implementation, the method further includes: the electronic device acquires the sound direction information of the first sound; acquires the direction information of useless sounds according to the sound direction information and the sound category information of the first sound; and filters out sounds coming from those directions based on the direction information of the useless sounds. With this scheme, the directions from which useless sounds frequently occur in a specific scene can be learned from the direction information of various sounds combined with their sound category information, which better assists sound separation.
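One possible way to realize this direction learning is sketched below; the direction binning, category names, and count threshold are illustrative assumptions, not details taken from the patent.

```python
# Hypothetical sketch: count how often each sound category is observed per
# azimuth bin, then report bins dominated by non-voice categories as
# "useless" directions whose sound can be filtered out.
from collections import Counter

direction_stats: dict[int, Counter] = {}              # azimuth bin -> category counts

def observe(azimuth_deg: float, category: str, bin_width: int = 15) -> None:
    """Record that a sound of the given category was heard from this direction."""
    b = int(azimuth_deg // bin_width) * bin_width
    direction_stats.setdefault(b, Counter())[category] += 1

def useless_directions(min_count: int = 20) -> set:
    """Direction bins dominated by non-voice categories (e.g., where a TV sits)."""
    return {b for b, counts in direction_stats.items()
            if sum(counts.values()) >= min_count
            and counts.most_common(1)[0][0] != "human_voice"}

# Usage: after repeatedly hearing television sound from around 95 degrees,
# that direction bin is reported as a candidate to filter out.
for _ in range(30):
    observe(95.0, "tv_sound")
print(useless_directions())                            # -> {90}
```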
With reference to the first aspect or any possible implementation of the first aspect, in another possible implementation, the method further includes: acquiring the voice interaction information of the third sound; and performing voiceprint registration for a third sound that has not been voiceprint-registered, based on its voice interaction information. With this scheme, the electronic device can learn voiceprint information from everyday user interaction speech that receives good interaction feedback, combined with speech recognition and text-independent voiceprint registration, and register that voiceprint. The voice corresponding to the voiceprint information can then be enhanced during intelligent interaction, improving the intelligent-interaction experience.
With reference to the first aspect or any possible implementation of the first aspect, in another possible implementation, the method further includes: the electronic device outputs interaction information according to the sound attribute information of the third sound. The interaction information may be voice interaction information or a control signal. With this scheme, the electronic device can output corresponding interaction information by combining the azimuth information, voiceprint information, time information, age information, and the like of the third sound.
In a second aspect of the embodiments of this application, a speech enhancement apparatus is provided, including a processor configured to: identify a first sound, where the first sound is captured by a speech capture device and includes at least one of a second sound and a background sound; when the second sound is present in the first sound, perform sound analysis on the second sound to obtain a third sound; and process the third sound.
With reference to the second aspect, in a possible implementation, the processor is further configured to perform sound event recognition on the first sound according to a sound event recognition model and acquire sound category information of the first sound.
With reference to the second aspect or any possible implementation manner of the second aspect, in another possible implementation manner, the processor is further configured to: separating the second sound from the first sound according to sound type information of the first sound; analyzing the sound attribute information of the second sound; wherein the sound attribute information includes: one or more of sound azimuth information, voiceprint information, sound time information and sound decibel information; and obtaining the third sound according to the sound attribute information of the second sound.
With reference to the second aspect or any possible implementation manner of the second aspect, in another possible implementation manner, the voiceprint information of the third sound is matched with the voiceprint information of the registered user.
With reference to the second aspect or any possible implementation of the second aspect, in another possible implementation, the processor is further configured to: cluster a fourth sound to acquire new sound category information, where the fourth sound is the part of the first sound whose sound category information is not recognized by the sound event recognition model; and update the sound event recognition model according to the new sound category information to obtain an updated sound event recognition model.
With reference to the second aspect or any possible implementation manner of the second aspect, in another possible implementation manner, the processor is further configured to: acquiring sound direction information of the first sound; acquiring direction information of useless sound according to the sound direction information of the first sound and the sound type information of the first sound; based on the direction information of the useless sound, the sound from the direction is filtered.
With reference to the second aspect or any possible implementation of the second aspect, in another possible implementation, the processor is further configured to: acquire the voice interaction information of the third sound; and perform voiceprint registration for a third sound that has not been voiceprint-registered, based on its voice interaction information.
With reference to the second aspect or any possible implementation manner of the second aspect, in another possible implementation manner, the processor is further configured to: and outputting the interactive information according to the sound attribute information of the third sound.
In a third aspect of the embodiments of this application, an electronic device is provided that can implement the speech enhancement method described in the first aspect; the method may be implemented by hardware, by software, or by hardware executing corresponding software. In one possible design, the electronic device may include a processor and a memory. The processor is configured to enable the electronic device to perform the corresponding functions of the method of the first or second aspect. The memory is configured to couple with the processor and holds the program instructions and data necessary for the electronic device.
In a fourth aspect of the embodiments of the present application, a computer storage medium is provided, where the computer storage medium includes computer instructions, and when the computer instructions are executed on an electronic device, the electronic device is caused to perform the speech enhancement method according to any one of the above aspects and possible design manners.
In a fifth aspect of the embodiments of the present application, there is provided a computer program product, which when run on a computer, causes the computer to execute the speech enhancement method according to any one of the above aspects and possible designs thereof.
For the descriptions of the effects of the second aspect, the third aspect, the fourth aspect and the fifth aspect, reference may be made to the descriptions of the corresponding effects of the first aspect, and details are not repeated here.
Drawings
FIG. 1 is a schematic diagram illustrating an example scenario of a sound environment according to an embodiment of the present application;
fig. 2 is a schematic composition diagram of a hardware structure of an electronic device according to an embodiment of the present disclosure;
fig. 3 is a schematic diagram of a software architecture of an electronic device according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a system architecture adapted by a speech enhancement method according to an embodiment of the present application;
fig. 5 is a flowchart illustrating a speech enhancement method according to an embodiment of the present application;
FIG. 6 is a schematic diagram illustrating a sound localization method according to an embodiment of the present disclosure;
FIG. 7 is a flowchart illustrating another speech enhancement method according to an embodiment of the present application;
FIG. 8 is a flowchart illustrating another speech enhancement method according to an embodiment of the present application;
FIG. 9 is a flowchart illustrating another speech enhancement method according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application. In the present application, "at least one" means one or more, "a plurality" means two or more. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone, wherein A and B can be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of the singular or plural items. For example, at least one (one) of a, b, or c, may represent: a, b, c, a and b, a and c, b and c, or a and b and c, wherein a, b and c can be single or multiple. In addition, for the convenience of clearly describing the technical solutions of the embodiments of the present application, in the embodiments of the present application, the words "first", "second", and the like are used to distinguish the same items or similar items with basically the same functions and actions, and those skilled in the art can understand that the words "first", "second", and the like do not limit the quantity and execution order. For example, the "first" in the first application and the "second" in the second application in the embodiment of the present application are used only to distinguish different application programs. The descriptions of the first, second, etc. appearing in the embodiments of the present application are only for illustrating and differentiating the objects, and do not represent the order or the particular limitation of the number of the devices in the embodiments of the present application, and do not constitute any limitation to the embodiments of the present application.
It is noted that, in the present application, words such as "exemplary" or "for example" are used to mean exemplary, illustrative, or descriptive. Any embodiment or design described herein as "exemplary" or "e.g.," is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, use of the word "exemplary" or "such as" is intended to present concepts related in a concrete fashion.
First, a basic concept in the embodiment of the present application is described.
Voiceprint: a biological characteristic composed of hundreds of feature dimensions such as wavelength, frequency, and intensity. Producing human speech is a complex physiological and physical process involving the language centers of the brain and the vocal organs, and because different people's vocal organs differ in size, form, and function, no two people have the same voiceprint spectrogram. A voiceprint is specific, relatively stable, and variable. Specificity means that different people's voiceprints differ: even if a speaker deliberately imitates another person's voice and tone, or speaks in a whisper, the voiceprints remain different no matter how lifelike the imitation is. Relative stability means that a person's voiceprint remains relatively stable over a long period after adulthood. Variability can arise from physiology, pathology, psychology, simulation, and disguise, and is also related to factors such as environmental interference. Because no two people's vocal organs are exactly alike, the specificity and relative stability of voiceprints make it possible to distinguish different people's voices by their voiceprints.
Voiceprint recognition: one of the biometric recognition technologies, in which voiceprint features are extracted from the speech signal uttered by a speaker in order to identify that speaker. Voiceprint recognition may also be called speaker recognition, and it covers both speaker identification and speaker verification. Speaker identification determines which of several people spoke a given piece of speech, while speaker verification checks whether a given piece of speech was spoken by a specified person. Optionally, a deep learning algorithm may be employed to extract the speaker's voiceprint features, for example a speaker recognition system based on a classical deep neural network or a speaker feature extraction system based on an end-to-end deep neural network.
For example, the electronic device may perform voiceprint registration before performing voiceprint recognition. In voiceprint enrollment, a user enters a piece of speech, which may be called the enrollment speech, through a sound pickup device (e.g., a microphone) of the electronic device; the device extracts the voiceprint features of the enrollment speech; and the device establishes and stores the correspondence between those voiceprint features and the user who entered the speech. In this embodiment, a user who has performed voiceprint registration may be referred to as a registered user. It is understood that the voiceprint registration may be independent of the content of the speech entered by the user, that is, the electronic device may extract voiceprint features from whatever speech the user enters for registration. The embodiments of this application do not limit the specific method for extracting voiceprint features from the voice information.
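The enrollment flow described above might look roughly like the sketch below; the crude spectral-statistics "embedding" is only a stand-in for a trained speaker-embedding model, and all names and sizes are assumptions rather than details from the patent.

```python
# Hypothetical, text-independent voiceprint enrollment sketch: extract a
# feature vector from the enrollment speech and store the user/feature
# correspondence. The "embedding" here is mean spectral statistics, a
# placeholder for a real speaker-embedding network.
import numpy as np

def extract_voiceprint(waveform: np.ndarray, frame: int = 512) -> np.ndarray:
    frames = waveform[: len(waveform) // frame * frame].reshape(-1, frame)
    mean_spectrum = np.abs(np.fft.rfft(frames, axis=1)).mean(axis=0)
    return mean_spectrum / (np.linalg.norm(mean_spectrum) + 1e-9)

registered_users: dict[str, np.ndarray] = {}

def enroll(user_id: str, enrollment_speech: np.ndarray) -> None:
    """Store the correspondence between a user and their voiceprint feature."""
    registered_users[user_id] = extract_voiceprint(enrollment_speech)

# Usage: three users each enroll with one utterance (random placeholders)
for uid in ("user_1", "user_2", "user_3"):
    enroll(uid, np.random.randn(16000))
```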
Illustratively, the electronic device may support voiceprint registration for multiple users. For example, a plurality of users respectively input a voice through the sound pickup device of the electronic device. Illustratively, user 1 enters speech 1, user 2 enters speech 2, and user 3 enters speech 3. The electronic equipment extracts the voiceprint characteristics of each section of voice (voice 1, voice 2 and voice 3). Further, the electronic device establishes and stores a corresponding relationship between the voiceprint feature of the voice 1 and the user 1, establishes and stores a corresponding relationship between the voiceprint feature of the voice 2 and the user 2, and establishes and stores a corresponding relationship between the voiceprint feature of the voice 3 and the user 3. In this way, the electronic device stores the corresponding relationship between a plurality of users and the voiceprint features extracted from the registered voice of the user, and the users are registered users on the electronic device.
For example, the electronic device may perform voiceprint recognition after receiving the user's voice. For example, after receiving a voice of a user, the electronic device may extract a voiceprint feature of the voice; the electronic equipment can also acquire the voiceprint characteristics of the registered voice of each registered user; the electronic equipment compares the voiceprint features of the received user voice with the voiceprint features of the registered voice of each registered user respectively, and whether the voiceprint features of the received user voice are matched with the voiceprint features of the registered voice of each registered user respectively is obtained.
Optionally, the authority of multiple users who have registered voiceprints in the embodiment of the present application may be different. For example, the usage rights may differ depending on the way the user registers the voiceprint. If the registered voiceprint is actively registered by the user, the user's permission is higher. If the registered voiceprint is registered through adaptive learning, the user's permission is low and the privacy function cannot be used. The user right of the registered voiceprint may also be preset by the user, which is not limited in the embodiment of the present application.
The speech enhancement technology is a technology for extracting a useful speech signal from a speech signal containing a plurality of kinds of noise in a complicated sound environment, and suppressing and reducing noise interference. I.e. extracting as pure as possible the original speech from noisy speech.
Illustratively, take the electronic device to be a smart speaker whose sound environment is the scene shown in fig. 1. As shown in fig. 1, the smart speaker may receive a variety of sound inputs from the environment, which may include a user asking "what's the weather like today", a pet dog barking, television sound, people talking, and so on. A conventional voice enhancement method would enhance all four sounds in the environment shown in fig. 1 in the same way, so the sound fed into the smart speaker would include both the user's command and the environmental sounds. The electronic device would then be unable to correctly capture the target user's voice command, resulting in a poor user experience, and such a speech enhancement algorithm cannot adapt to the environment when the electronic device is in a complex sound environment.
To solve the problem that prior-art speech enhancement algorithms give users a poor intelligent-interaction experience in complex sound environments, the embodiments of this application provide a voice enhancement method that can accurately capture the target user's voice from a complex sound environment and improve the user's intelligent-interaction experience.
The voice enhancement method may be applied to an electronic device. The electronic device may be a smart home device capable of voice interaction, such as a smart speaker, smart television, or smart refrigerator; a wearable electronic device capable of voice interaction, such as a smart watch, smart glasses, or a smart helmet; or another form of device capable of voice interaction. The embodiments of this application place no particular limit on the specific form of the electronic device.
Please refer to fig. 2, which is a schematic structural diagram of an electronic device 100 according to an embodiment of the present disclosure. As shown in fig. 2, the electronic device 100 may include a processor 110, a memory 120, a Universal Serial Bus (USB) interface 130, a charge management module 140, a power management module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a microphone 170B, and the like. Optionally, the electronic device 100 may further include a battery 180.
It is to be understood that the illustrated structure of the embodiment of the present invention does not specifically limit the electronic device 100. In other embodiments of the present application, electronic device 100 may include more or fewer components than shown, or some components may be combined, some components may be split, or a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
Processor 110 may include one or more processing units, such as: the processor 110 may include an Application Processor (AP), a controller, a memory, a video codec, and a Digital Signal Processor (DSP). The different processing units may be separate devices or may be integrated into one or more processors.
The controller may be, among other things, a neural center and a command center of the electronic device 100. The controller can generate an operation control signal according to the instruction operation code and the timing signal to complete the control of instruction fetching and instruction execution.
A memory may also be provided in processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache memory. The memory may hold instructions or data that have just been used or recycled by the processor 110. If the processor 110 needs to reuse the instruction or data, it can be called directly from the memory. Avoiding repeated accesses reduces the latency of the processor 110, thereby increasing the efficiency of the system.
Memory 120 for storing instructions and data. In some embodiments, memory 120 is a cache memory. The memory may hold instructions or data that have been used or recycled by the processor 110. If the processor 110 needs to reuse the instruction or data, it may be called directly from the memory 120. Avoiding repeated accesses reduces the latency of the processor 110, thereby increasing the efficiency of the system.
The memory 120 may be used to store computer-executable program code, which includes instructions. The processor 110 executes the various functional applications and data processing of the electronic device 100 by running the instructions stored in the memory 120. For example, in embodiments of the present application, processor 110 may perform enhancement processing on sound by executing instructions stored in memory 120. The storage program area may store an operating system, an application program (such as a sound playing function) required by at least one function, and the like. The storage data area may store data (such as audio data) created during the use of the electronic device 100, and the like. Further, the memory 120 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, a universal flash storage (UFS), and the like.
In some embodiments, the memory 120 may also be disposed in the processor 110, i.e., the processor 110 includes the memory 120. This is not limited in the embodiments of the present application.
The charging management module 140 is configured to receive charging input from a charger. The charger may be a wireless charger or a wired charger. In some wired charging embodiments, the charging management module 140 may receive charging input from a wired charger via the USB interface 130. In some wireless charging embodiments, the charging management module 140 may receive a wireless charging input through a wireless charging coil of the electronic device 100. The charging management module 140 may also supply power to the electronic device through the power management module 150 while charging the battery 180.
The power management module 150 is used to connect the battery 180, the charging management module 140 and the processor 110. The power management module 150 receives input from the battery 180 and/or the charge management module 140 to power the processor 110, the memory 120, and the wireless communication module 160, among other things. The power management module 150 may also be used to monitor parameters such as battery capacity, battery cycle number, battery state of health (leakage, impedance), etc. In other embodiments, the power management module 150 may also be disposed in the processor 110. In other embodiments, the power management module 150 and the charging management module 140 may be disposed in the same device.
The wireless communication module 160 may provide a solution for wireless communication applied to the electronic device 100, including Wireless Local Area Networks (WLANs) (e.g., wireless fidelity (Wi-Fi) networks), bluetooth (bluetooth, BT), Global Navigation Satellite System (GNSS), Frequency Modulation (FM), Near Field Communication (NFC), Infrared (IR), and the like. The wireless communication module 160 may be one or more devices integrating at least one communication processing module. The wireless communication module 160 receives electromagnetic waves via an antenna, performs frequency modulation and filtering processing on electromagnetic wave signals, and transmits the processed signals to the processor 110. Wireless communication module 160 may also receive signals to be transmitted from processor 110, frequency modulate them, amplify them, and convert them into electromagnetic waves via an antenna for radiation.
The electronic device 100 may implement audio functions through the audio module 170, the speaker 170A, and the application processor, etc. Such as music playing, voice collection, etc.
The audio module 170 is used to convert digital audio information into an analog audio signal output and also to convert an analog audio input into a digital audio signal. The audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be disposed in the processor 110, or some functional modules of the audio module 170 may be disposed in the processor 110.
The speaker 170A, also called a "horn", is used to convert the audio electrical signal into an acoustic signal. The electronic apparatus 100 can listen to music through the speaker 170A or listen to a handsfree call.
The microphone 170B, also referred to as a "microphone", is used to convert a sound signal into an electrical signal. The microphone 170B may receive voice input from the user when the user is engaged in a voice interaction with the electronic device. The electronic device 100 may be provided with at least one microphone 170B. In other embodiments, the electronic device 100 may be provided with two microphones 170B to achieve a noise reduction function in addition to collecting sound signals. In other embodiments, the electronic device 100 may further include three, four or more microphones 170B to collect sound signals, reduce noise, identify sound sources, perform directional recording, and the like. Alternatively, the electronic device 100 may be provided with 6 microphones 170B to form a microphone array, by which the azimuth information of the sound may be analyzed.
The methods in the following embodiments may be implemented in the electronic device 100 having the above-described hardware structure.
It can be understood that the electronic device 100 may be divided into different functional modules when implementing the speech enhancement method provided by the embodiment of the present application. For example, as shown in fig. 3, an electronic device 100 provided in an embodiment of the present application may include an audio collecting module, an audio processing module, an audio recognition module, an audio synthesis module, a voice interaction module, and other functional modules.
The audio collection module is used for acquiring audio data and is responsible for storing and forwarding the acquired audio data. For example, the function of the audio collection module can be realized by the microphone 170B in fig. 2, and the microphone 170B can receive various sound inputs and convert the sound inputs into audio data through the audio module 170; the audio collection module may obtain audio data, store it, and forward it to other modules.
The audio processing module is used to enhance the audio data acquired by the audio collection module, for example by performing sound attribute analysis and sound separation and filtering on the received audio data. Optionally, the audio processing module may include a sound event recognition module, a sound localization module, a voiceprint recognition module, and the like. The sound event recognition module is used to recognize the various sounds collected by the audio collection module and determine the category information of each sound. The sound localization module is used to localize the various sounds collected by the audio collection module and analyze the direction information of each sound. The voiceprint recognition module is used to perform voiceprint recognition on the various sound data collected by the audio collection module. In this embodiment of the application, the voiceprint recognition module may be configured to extract voiceprint features from the input voice and determine whether the extracted features match the voiceprint of a registered user.
And the audio recognition module is used for carrying out voice recognition on the processed audio data. Such as recognizing a wake-up word, recognizing a voice command, etc.
The audio synthesis module is used for synthesizing and playing the audio data. For example, the commands of the server may be synthesized as audio data, and voice broadcast may be performed through the speaker 170A.
The input of the voice interaction module is the voice of the target user after the enhancement processing is carried out by the audio processing module, and the voice interaction module is used for obtaining voice interaction output according to the voice of the target user.
Illustratively, the related functions of the audio processing module, the audio recognition module, the audio synthesis module, and the voice interaction module described above may be implemented by program instructions stored in the memory 120. For example, processor 110 may implement voiceprint recognition functionality by executing program instructions associated with an audio processing module stored in memory 120.
Illustratively, the speech enhancement method provided by the embodiment of the present application can be applied to the system architecture shown in fig. 4. As shown in FIG. 4, the sounds input into the smart speech enhancement model include both sounds in the everyday environment (e.g., water flow sounds, television sounds, cooking sounds, etc.) and voice commands input by the user. The intelligent voice enhancement model is used for processing a plurality of input sounds through enhancement, and the output of the intelligent voice enhancement model only comprises voice commands input by a user.
As shown in fig. 4, the processing procedure of the intelligent speech enhancement model includes sound recognition, sound attribute analysis, and sound separation and filtering. Sound recognition includes recognizing sound category information, and the user's voice can be separated from the background sound by combining the recognized sound category information. Sound attribute analysis performs attribute analysis, such as sound localization, time analysis, and voiceprint analysis, on the separated user voice; by combining the sound's attribute information, the enhancement model can determine the target user's voice interaction command from the user voices, that is, it can extract the target user's voice interaction command from multiple human voices. It can be understood that, in this application, the intelligent speech enhancement model applies targeted enhancement to the sounds the electronic device collects in a relatively complex sound environment and extracts the target user's voice interaction command; from this clean command, more accurate interaction information can be output and the user's intelligent-interaction experience is improved.
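For orientation only, the control flow just described might be organized roughly as in the sketch below. Every model call is a trivial stand-in, and the category names, feature computation, and matching threshold are assumptions; the point is the recognition, separation and filtering, attribute analysis, and target-voice extraction flow, not any real signal processing.

```python
# Hypothetical, highly simplified control-flow sketch of the enhancement
# pipeline in fig. 4. All models are replaced by toy placeholders.
import numpy as np

REGISTERED_VOICEPRINT = np.ones(16) / 4.0              # placeholder enrolled feature

def recognize_sound_events(audio: np.ndarray) -> set:
    # Stand-in for the sound event recognition model.
    return {"human_voice", "tv_sound", "dog_bark"} if audio.size else set()

def separate_voices(audio: np.ndarray) -> list:
    # Stand-in for separation/filtering: pretend two human voices were found.
    half = audio.size // 2
    return [audio[:half], audio[half:]]

def voiceprint(segment: np.ndarray) -> np.ndarray:
    spec = np.abs(np.fft.rfft(segment, n=30))[:16]
    return spec / (np.linalg.norm(spec) + 1e-9)

def enhance(first_sound: np.ndarray):
    if "human_voice" not in recognize_sound_events(first_sound):
        return None                                    # no second sound in the input
    for voice in separate_voices(first_sound):         # background already filtered out
        if float(voiceprint(voice) @ REGISTERED_VOICEPRINT) > 0.5:
            return voice                               # third sound: the target user's command
    return None

command = enhance(np.random.randn(32000))
```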
The user interaction feedback in fig. 4 refers to the user's feedback on the interaction process. For example, after the user issues a voice interaction command and the electronic device returns its interaction output, the user may continue the voice interaction with the electronic device, or the user may stop the conversation; either behavior can be used to judge the enhancement result. If the user stops the conversation, the enhancement result may not have been good and the user may be unsatisfied with the voice interaction output. If the user continues to request voice interaction, the enhancement result is good and meets the user's interaction needs. This interaction feedback can assist the intelligent enhancement model in self-learning, so that the enhancement model adapts better to the sound environment.
The technical solution provided by the embodiments of this application is described in detail below, taking a smart speaker as an example of the electronic device described above. With reference to fig. 4, as shown in fig. 5, the speech enhancement method may include steps S501-S505:
S501, the electronic device collects a first sound.
The first sound includes at least one of a second sound and a background sound. The second sound may include a sound of a user speaking (human sound, abbreviated as human voice). The second sound may include a voice interaction command input by the user.
For example, the background sound may be a sound in an environment in which the electronic device is currently located. For example, the sound in the environment where the electronic device is currently located may include daily environment sounds such as a pet sound, a television sound, a water flow sound, a coffee machine sound, a microwave oven sound, a cooking sound, an air conditioner sound, and the like. The specific type of the background sound is not limited in the embodiments of the present application, and is merely an exemplary illustration.
As shown in fig. 1, taking the environment in which the smart speaker is located to be the sound environment shown in fig. 1 as an example, the first sound may include a second sound and a background sound. The second sound includes the sound of people talking and a voice command in which the user asks what the weather is like today. The background sound includes the television sound and the dog barking.
For example, the first sound may include a second sound and a background sound from a plurality of different sound sources. For example, the sound sources of four sounds, i.e., a television sound, a dog call sound, a user voice command, and a character talking sound, are different from each other. Optionally, the sound characteristics of different sound sources are different from each other.
Illustratively, the electronic device collecting the first sound includes: a microphone array of an electronic device collects a first sound. Alternatively, the microphone array may convert the first sound collected by the microphone 170B into audio data through the audio module 170.
S502, the electronic equipment identifies the first sound.
Illustratively, the electronic device recognizing the first sound includes: the electronic device performs sound event recognition on the first sound according to a sound event recognition model to acquire the sound category information of the first sound. For example, the sound category information may include categories such as human voice, television sound, air conditioner sound, dog barking, and the sound of running water.
For example, the voice event recognition model may be a neural network model obtained by training a large number of voices, and the type information of the voices may be obtained from the voice event recognition model. Optionally, the sound event recognition model may be stored locally in the electronic device, or may be stored in the cloud, which is not limited in the embodiment of the present application.
Optionally, the category information of the first sound may be acquired using a deep learning algorithm, such as a convolutional neural network, a fully connected neural network, a long short-term memory network, or a gated recurrent unit. The embodiments of this application do not limit the specific deep learning algorithm used to obtain sound category information; these are only examples. In this embodiment, the sound category information of the first sound collected by the electronic device can be acquired through a deep learning algorithm.
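As a concrete illustration of such a deep-learning classifier, the PyTorch sketch below maps a spectrogram to sound-category logits; the layer sizes and the category list are assumptions made for illustration, not the model used in this application.

```python
# Hypothetical sketch of a convolutional sound-event classifier:
# spectrogram in, per-category logits out.
import torch
import torch.nn as nn

SOUND_CLASSES = ["human_voice", "tv_sound", "dog_bark", "water_flow", "air_conditioner"]

class SoundEventNet(nn.Module):
    def __init__(self, n_classes: int = len(SOUND_CLASSES)):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, spectrogram: torch.Tensor) -> torch.Tensor:
        # spectrogram: (batch, 1, mel_bins, frames)
        x = self.features(spectrogram).flatten(1)
        return self.classifier(x)                      # per-class logits

# Usage with a placeholder 64 x 100 spectrogram
logits = SoundEventNet()(torch.randn(1, 1, 64, 100))
predicted = SOUND_CLASSES[int(logits.argmax(dim=1))]
```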
For example, as shown in fig. 1, the electronic device may determine the category information of the first sound collected by the smart sound box in fig. 1 according to a convolutional neural network algorithm, including a human voice, a dog call, and a television sound.
Illustratively, by the electronic device recognizing the first sound in step S502 and acquiring the sound type information of the first sound, when the electronic device determines that the second sound exists in the first sound, that is, the first sound includes a human voice, the method further includes step S503.
S503, when a second sound is present in the first sound, the electronic device performs sound analysis on the second sound to obtain a third sound.
The third sound is a user voice interaction command. Optionally, the second sound may include a third sound, and may also be the same as the third sound. For example, when the second sound includes only one human voice, the second sound may be the same as the third sound, i.e., the second sound may be a user voice interaction command; when the second sound includes at least two human voices, the second sound includes a third sound, and the electronic device may extract the third sound from the second sound, that is, the electronic device may extract the user voice interaction command from a plurality of human voices.
For example, when the second sound exists in the first sound in step S503, the electronic device performs sound analysis on the second sound to obtain a third sound, which may include steps S5031-S5033.
S5031, the electronic device separates a second sound from the first sound according to the sound type information of the first sound.
For example, the electronic device may separate the second sound from the first sound in conjunction with the sound classification information. That is, when the second sound exists in the first sound, the electronic device may separate the second sound from the background sound in the first sound, filter out the background sound, and separate out the human voice.
Optionally, the electronic device may combine the sound category information, separate the human voice from the background sound of the daily environment by using a sound source separation algorithm, filter out the background sound of the daily environment, and retain only the human voice. For example, if the first sound collected by the electronic device includes sounds such as people talking, a user voice interaction command, television sound, and dog barking, the electronic device may combine the sound category information to separate the television sound, the dog barking, the talking, and the user voice interaction command, filter out the television sound and the dog barking, and retain only the talking and the user voice interaction command.
For example, the sound source separation algorithm may use a conventional separation algorithm to perform sound separation, or may use a deep learning algorithm to perform sound separation. For example, the sound separation may be implemented by using a conventional matrix decomposition algorithm, a singular value decomposition algorithm, or a deep learning algorithm such as a deep neural network, a convolutional neural network, or a fully-connected neural network. The specific algorithm for sound separation in the embodiments of the present application is not limited, and is only an exemplary one. The embodiment of the application performs sound separation based on the deep learning algorithm, and can enable the electronic equipment to separate the voice from the background sound, so that the voice can be extracted from the noisy daily environment sound.
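As one concrete example of the conventional matrix-decomposition route mentioned above, the sketch below factorizes a magnitude spectrogram with non-negative matrix factorization; the component count, frame sizes, and the idea of afterwards assigning components to voice or background are all illustrative assumptions, not the algorithm specified by this application.

```python
# Hypothetical matrix-factorization separation sketch: factor the magnitude
# spectrogram of a mixture into spectral bases and time activations.
import numpy as np
from sklearn.decomposition import NMF

def magnitude_spectrogram(x: np.ndarray, frame: int = 512, hop: int = 256) -> np.ndarray:
    frames = np.stack([x[i:i + frame] * np.hanning(frame)
                       for i in range(0, len(x) - frame, hop)])
    return np.abs(np.fft.rfft(frames, axis=1)).T       # (freq_bins, time_frames)

mixture = np.random.randn(16000)                       # placeholder mixed audio
spec = magnitude_spectrogram(mixture)
nmf = NMF(n_components=4, init="nndsvda", max_iter=200)
bases = nmf.fit_transform(spec)                        # (freq_bins, components)
activations = nmf.components_                          # (components, time_frames)
# Each rank-1 term bases[:, k:k+1] @ activations[k:k+1] approximates one
# source's energy; the sound-category labels (or a classifier) would decide
# which components to keep as voice and which to discard as background.
```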
S5032, the electronic device analyzes the sound attribute information of the second sound.
Wherein the sound attribute information includes: one or more of sound azimuth information, voiceprint information, sound time information, sound decibel information. Illustratively, the sound azimuth information refers to azimuth information of a sound source. The voiceprint information includes voiceprint features extracted from the speech information uttered by the speaker, and the voiceprint features can be used for identity recognition. Sound time information refers to a time period in which a certain type of sound occurs frequently. The specific content of the attribute information of the sound is not limited in the embodiments of the present application, and is merely an exemplary description here.
For example, the electronic device analyzing the sound attribute information of the second sound may include: the electronic device obtains the sound azimuth information of the different sound sources in the second sound through the sound localization module. For example, if the microphone array of the electronic device includes 6 microphones, the electronic device may collect sounds through the 6 microphones and analyze the azimuth information of the different human voices. The embodiments of this application do not limit how the microphone array analyzes the sound azimuth information; reference may be made to existing microphone-array sound localization methods, for example sound localization based on beamforming, on high-resolution spectral estimation, or on the time difference of arrival of the sound.
For example, the electronic device may determine the direction information of the sound based on the time difference of arrival. The microphone array of the smart speaker first estimates the time differences with which the sound from a source arrives at the array, obtaining the delays between array elements; it then uses these time differences together with the known spatial positions of the microphones to determine the position of the sound source. For example, as shown in fig. 6, the solid dots 1, 2, and 3 represent three microphones 170B of the smart speaker. The delay from the sound source to microphones 1 and 3 is a constant, from which the bold black hyperbola in fig. 6 can be drawn; the delay from the sound source to microphones 2 and 3 is also a constant, from which the dashed hyperbola in fig. 6 can be drawn. The intersection of these two hyperbolas, shown as a black filled square in fig. 6, is the location of the sound source. The smart speaker can then determine the azimuth information of the sound source from the spatial positions of the microphone array.
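The delay-based bearing estimate for a single microphone pair, which underlies the hyperbola construction just described, could look like the following sketch; the microphone spacing, sample rate, and the use of a plain cross-correlation peak (rather than, say, GCC-PHAT over several pairs) are assumptions made for illustration.

```python
# Hypothetical single-pair time-difference-of-arrival sketch: estimate the
# inter-microphone delay from the cross-correlation peak, then convert it
# to a bearing relative to the pair's broadside direction.
import numpy as np

SPEED_OF_SOUND = 343.0        # m/s
MIC_SPACING = 0.08            # assumed 8 cm between the two microphones
SAMPLE_RATE = 16000           # assumed sampling rate in Hz

def estimate_azimuth(mic1: np.ndarray, mic2: np.ndarray) -> float:
    corr = np.correlate(mic1, mic2, mode="full")
    lag = int(np.argmax(corr)) - (len(mic2) - 1)       # delay in samples
    tau = lag / SAMPLE_RATE                            # delay in seconds
    ratio = np.clip(SPEED_OF_SOUND * tau / MIC_SPACING, -1.0, 1.0)
    return float(np.degrees(np.arcsin(ratio)))

# Usage with a synthetic 3-sample delay between the two channels
sig = np.random.randn(1600)
print(estimate_azimuth(sig, np.roll(sig, 3)))
```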
For example, the electronic device analyzing the sound attribute information of the second sound may further include: the electronic device separates the different human voices in the second sound according to a blind source separation algorithm, extracts the voiceprint features of the different voices, and acquires the voiceprint features of each voice. Because the voiceprint features of different sound sources differ from one another, the electronic device can distinguish the different voices according to their voiceprint features.
For example, the analyzing, by the electronic device, the sound attribute information of the second sound may further include: the electronic device analyzes the decibel magnitude of the second sound.
S5033, the electronic device obtains the third sound according to the sound attribute information of the second sound.
For example, the electronic device may extract a third sound (user voice interaction command) from at least one human voice according to the attribute information of the sound.
Illustratively, the voiceprint information of the third sound is matched with the voiceprint information of the registered user in the electronic equipment. Optionally, the voiceprint of the registered user may be registered by the user, or may be registered by the electronic device in a self-adaptive learning manner, which is not limited in this embodiment of the application. Optionally, the voiceprint information may be a voiceprint feature.
In one implementation, the obtaining, by the electronic device, the third sound according to the sound attribute information of the second sound may include: and the electronic equipment separates different voices in the second voice according to a blind source separation algorithm, extracts voiceprint features from the different voices, matches the voiceprint features of the different voices with the voiceprint features of the registered user, and determines the matched voice as a third voice if the voiceprint features of the different voices are matched with the voiceprint features of the registered user. Optionally, the electronic device may extract the voiceprint features from the human voice using a convolutional neural network algorithm. The blind source separation algorithm can adopt a traditional separation algorithm to carry out sound separation and can also adopt a deep learning algorithm to carry out sound separation. For example, the sound separation may be implemented by using a conventional matrix decomposition algorithm, a singular value decomposition algorithm, or a deep learning algorithm such as a deep neural network, a convolutional neural network, or a fully-connected neural network. The specific algorithm for sound separation in the embodiments of the present application is not limited, and is only an exemplary one. The embodiment of the application performs sound separation based on the deep learning algorithm, and can separate multiple voices when the second voice comprises the multiple voices.
For example, the matching between the voiceprint feature of a human voice and the voiceprint feature of the registered user may mean that the matching degree between the two exceeds a preset threshold, or that the matching degree is the highest among all candidates. For example, a confidence level may be used to indicate the matching degree between the input voice and the voiceprint feature of a registered user; the higher the confidence level, the higher the matching degree. It can be understood that the confidence level may also be described as a confidence probability, a confidence score, and the like, all of which are values used to characterize the credibility of the voiceprint recognition result. This is not limited in the embodiments of the present application; the term confidence level is used in the description herein.
For example, if the confidence that the input voice belongs to user 1 is higher than the confidence that the input voice belongs to any other registered user, it is determined that the input voice matches user 1. As another example, a confidence threshold may be set. If the confidence that the input speech belongs to each registered user is less than or equal to the confidence threshold, it is determined that the input speech does not belong to any registered user. If the confidence that the input speech belongs to a registered user is greater than or equal to the confidence threshold, it is determined that the input speech belongs to that registered user. Illustratively, if the confidence threshold is 0.5, the confidence that the input speech belongs to user 1 is 0.9, and the confidence that the input speech belongs to user 2 is 0.4, it may be determined that the input speech belongs to user 1, that is, the input speech matches the voiceprint of user 1.
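A minimal sketch of the threshold-based decision just described is given below; the user names are illustrative, and treating a confidence exactly equal to the threshold as no match is an assumption of this sketch:

```python
def match_registered_user(confidences, threshold=0.5):
    """confidences: dict mapping each registered user to the confidence that
    the input speech belongs to that user. Returns the best-matching user,
    or None when no confidence exceeds the threshold."""
    if not confidences:
        return None
    user, best = max(confidences.items(), key=lambda item: item[1])
    return user if best > threshold else None

# Values from the description: threshold 0.5, user 1 -> 0.9, user 2 -> 0.4
# match_registered_user({"user 1": 0.9, "user 2": 0.4}) returns "user 1"
```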
For example, the first sound collected by the electronic device includes sounds such as people's conversation, a user voice interaction command, television sound, and dog barking. The electronic device can separate the television sound, the dog barking, the conversation sound, and the user voice interaction command according to the sound category information, filter out the television sound and the dog barking, and retain the conversation sound and the user voice interaction command. The electronic device then separates the conversation sound and the user voice interaction command according to a blind source separation algorithm, respectively extracts their voiceprint features, and determines whether the voiceprint features match the voiceprint features of the registered user; if a voice matches, the electronic device determines the matched voice as the third sound.
For example, taking the sound environment shown in fig. 1 as an example, the smart speaker may separate the human voices from the sounds of the daily environment, such as the television sound and the dog barking, according to the sound category information, and filter out the dog barking and the television sound. The smart speaker then separates the different human voices (the conversation sound and the user voice interaction command) using a blind source separation algorithm, extracts voiceprint features of the conversation sound and of the user voice interaction command, and matches them respectively with the voiceprint features of the registered users. If the voiceprint feature of the user voice asking "how is the weather today" matches the voiceprint feature of a registered user, the electronic device may determine the user voice asking "how is the weather today" as the third sound.
Optionally, when the first sound collected by the electronic device includes multiple human voices and the voiceprint features of at least two of the human voices match the voiceprint features of registered users, the electronic device may determine, according to the decibel levels of the at least two human voices, the voice with the higher decibel level as the third sound. Optionally, in the same case, the electronic device may also determine, according to the distances of the at least two human voices from the electronic device, the voice closer to the electronic device as the third sound.
In another implementation, the obtaining, by the electronic device, the third sound according to the sound attribute information of the second sound may further include: the electronic device obtains the third sound from the second sound based on information such as the azimuth information, the energy characteristics, or the sound characteristics of the second sound.
For example, the electronic device may determine the third sound from the second sound according to the direction information of the different human voices in the second sound. For example, taking the sound environment shown in fig. 1 as an example, the smart speaker may filter out the television sound and the dog barking from the television sound, the dog barking, the conversation sound, and the user voice interaction command, separate the conversation sound from the user voice interaction command, and obtain the direction information of each. If the direction information of the user voice interaction command indicates the sofa and the direction information of the conversation sound indicates the dining table, the smart speaker can determine the user voice interaction command located at the sofa as the third sound. For another example, if the direction information of the user voice interaction command and of the conversation sound both indicate the sofa, but the decibel level of the user voice interaction command is much higher than that of the conversation sound, the smart speaker can also determine the user voice interaction command as the sound of the target user.
For example, the electronic device may further determine the third sound from the different human voices included in the second sound according to the energy features or sound features of the sounds by using a deep learning algorithm. For example, a neural network model used to determine user voice interaction commands may be trained on a large number of samples of users speaking at different positions relative to the electronic device, together with user voice interaction commands. Based on the trained neural network model, the electronic device may determine the user voice interaction command as the third sound from among the conversation sound and the user voice interaction command according to the sound features or energy features. For example, if some human voices in the second sound face the electronic device and others face away from it, the electronic device may determine the human voice facing the electronic device as the third sound.
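As a hedged sketch of this idea, a small fully connected classifier from scikit-learn stands in for the trained neural network model of the embodiment, and the feature extraction (energy or spectral features per separated voice) is assumed to be done elsewhere:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_command_classifier(features, labels):
    """features: (n_clips, n_features) energy or spectral features of separated
    human voices; labels: 1 for user voice interaction commands, 0 for ordinary
    conversation."""
    clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
    clf.fit(features, labels)
    return clf

def pick_command(clf, candidate_features):
    """candidate_features: (n_candidates, n_features); returns the index of the
    separated voice most likely to be the interaction command (third sound)."""
    probs = clf.predict_proba(np.asarray(candidate_features))[:, 1]
    return int(np.argmax(probs))
```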
In another implementation, the obtaining, by the electronic device, the third sound according to the sound attribute information of the second sound may further include: the electronic device determines, according to the decibel levels of the second sound, the human voice with the higher decibel level in the second sound as the third sound.
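A minimal sketch of this decibel-based selection, assuming waveforms normalized to the range [-1, 1]; the reference level is an assumption of the sketch:

```python
import numpy as np

def sound_level_db(waveform):
    """Approximate level of a waveform in dB relative to full scale."""
    rms = np.sqrt(np.mean(np.square(waveform))) + 1e-12
    return 20.0 * np.log10(rms)

def pick_louder_voice(voices):
    """voices: list of candidate waveforms, for example the separated human
    voices; returns the one with the higher decibel level."""
    return max(voices, key=sound_level_db)
```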
It can be understood that, in the embodiment of the present application, sound attribute analysis is performed on the second sound in combination with a separation algorithm, so that the user voice interaction command can be extracted from a plurality of human voices.
S504, the electronic device processes the third sound.
For example, as shown in fig. 1, the smart speaker may extract the user voice interaction command "how is the weather today" from the sound environment shown in fig. 1 and input "how is the weather today" into the voice interaction module for processing. It can be understood that, because the voice input into the voice interaction module in the embodiment of the present application is a clean user voice interaction command, the voice interaction module can produce more accurate voice interaction output according to the voice interaction command.
S505, the electronic device outputs interaction information according to the third sound.
Illustratively, the interaction information output by the electronic device may be voice interaction information or a control signal. For example, if the user voice interaction command is "how is the weather today", the electronic device may output the voice interaction information "it is sunny today, the lowest temperature is 25 degrees, and the highest temperature is 35 degrees". As another example, if the user voice interaction command is "turn on the living room light", the electronic device may output a control signal for turning on the living room light.
Optionally, the electronic device may further output the interaction information in combination with the sound attribute information of the third sound. For example, the electronic device may output the interaction information in combination with the direction information and/or the time information of the third sound. For example, if the third sound is "turn on the light" and the direction information of the third sound indicates the kitchen, the electronic device may output a control signal for turning on the kitchen light according to the direction information of the third sound. For another example, if the third sound is "turn on the television", the electronic device may output a control signal for turning on the television and, after a period of learning, adjust the volume of the television according to the time information of the third sound. For another example, if the third sound is "turn on the living room light", the electronic device may output a control signal for turning on the living room light in combination with the time information of the third sound, and set an appropriate brightness according to the time information.
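For ease of understanding only, the combination of command, direction, and time can be pictured as a small rule table; the commands, room names, and brightness values below are illustrative assumptions and are not taken from the embodiment:

```python
def build_interaction_output(command, direction=None, hour=None):
    """Toy rule sketch: route a generic light command to the room implied by
    the sound direction, and pick a brightness from the time of day."""
    if command == "turn on the light" and direction is not None:
        return {"device": f"{direction} light", "action": "on"}
    if command == "turn on the living room light":
        brightness = 30 if hour is not None and (hour >= 22 or hour < 6) else 80
        return {"device": "living room light", "action": "on",
                "brightness": brightness}
    return {"device": None, "action": None}

# Example: build_interaction_output("turn on the light", direction="kitchen")
# returns {"device": "kitchen light", "action": "on"}
```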
Illustratively, the electronic device may further output the interaction information in combination with the voiceprint feature of the third sound. For example, when the voice interaction command "turn on the light" is input by a child at home, the electronic device may, after a large amount of learning, output a control signal for turning on the second bedroom light in combination with the voiceprint feature and the sound time information of the user. For another example, when the mother inputs the voice interaction command "turn on the light", the electronic device may output a control signal for turning on the kitchen light in combination with the voiceprint feature and the sound time information of the user.
Illustratively, the electronic device may further output the interaction information in combination with the age information of the third sound. For example, when a child inputs the user voice interaction command "turn on the television", the electronic device may recognize from the voice interaction command that the speaker is a child, and play cartoons or children's programs according to the recognized age information. For another example, when a grandparent inputs the user voice interaction command "turn on the television", the electronic device may recognize that the speaker is an elderly person, and play the television program watched last time according to the recognized age information.
Optionally, the user permissions of different voiceprints in the embodiment of the present application may be different. For example, suppose a child at home does not have the permission to turn on the television, the third sound is the voice interaction command "turn on the television", and the voiceprint feature of the third sound matches the voiceprint feature of a registered user. Because the user corresponding to that voiceprint does not have the permission to turn on the television, the interaction information output by the electronic device may be "you do not have the permission to turn on the television" or other voice prompt information.
Optionally, when users register their voiceprints in different manners, their permissions may be different. For example, a user whose voiceprint is registered in an adaptive learning manner may have lower permissions, while a user who actively registers a voiceprint may have higher permissions and may use functions with higher confidentiality requirements. The user permissions of different voiceprints may also be preset by the user. This is not limited in the embodiments of the present application.
It can be understood that, in this embodiment, the electronic device extracts the clean user voice interaction command (the third sound) from the complex sound environment (the first sound) and processes it, so that more accurate interaction information can be output and the user's intelligent interaction experience is improved.
This embodiment provides a voice enhancement method, which includes: the electronic device collects a first sound; the electronic device recognizes the first sound; when the first sound includes a second sound, the electronic device performs sound analysis on the second sound to obtain a third sound; the electronic device processes the third sound; and the electronic device outputs interaction information. In this embodiment, based on a deep learning algorithm, the human voices are separated from various sounds including background sounds, and the clean user voice interaction command is extracted from the human voices. The user voice interaction command is then processed, so the resulting voice interaction output is accurate, the voice interaction can be completed correctly, and the user's intelligent interaction experience is improved. When the user voice interaction command is extracted from the complex sound environment, the user voice interaction command (the third sound) is obtained by recognizing and analyzing the sound in combination with the attribute information of the sound. Therefore, this voice enhancement method does not enhance all input sounds equally, but performs targeted enhancement in combination with the attribute information of the currently collected sound, so that it can adapt to complex sound environments and improve the user's intelligent interaction experience in such environments.
The embodiment of the present application further provides a speech enhancement method, as shown in fig. 7, steps S701 to S702 may be further included after step S502.
S701, the electronic equipment clusters the fourth sound to acquire new sound category information.
The fourth sound is a sound in the first sound whose sound category information is not recognized by the sound event recognition model. Optionally, the electronic device may store the unrecognized fourth sound in the memory.
For example, when the electronic device leaves the factory, the types of sound that the sound event recognition model can recognize are limited; typically, the sound event recognition model can recognize relatively common sounds. For instance, it may recognize various sounds such as human voices, the sound of running water, dog barking, and television sound, but not the sound of a coffee machine grinding beans, an oven, or a washing machine. The electronic device may collect and store the sounds, such as those of the coffee machine, the oven, and the washing machine, whose categories are not recognized by the sound event recognition model.
Alternatively, the fourth sound may be a frequently occurring unrecognized sound, and the electronic device may collect a large amount of the fourth sound.
For example, the electronic device may cluster a large number of fourth sounds through unsupervised learning, discovering the intrinsic properties and patterns of the data to obtain new sound category information. For example, the electronic device may obtain the new sound category information through cluster analysis, which may include hierarchical clustering algorithms, density-based clustering algorithms, Expectation-Maximization (EM), fuzzy clustering, K-means clustering, and the like. The specific cluster analysis method adopted by the electronic device is not limited in the embodiment of the present application; these are only examples.
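As one hedged sketch using K-means, one of the options listed above, on acoustic feature vectors of the stored unrecognized clips; the feature choice and number of clusters are assumptions of the sketch:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_unknown_sounds(features, n_clusters=3):
    """features: (n_clips, n_features) acoustic features (for example averaged
    MFCC or log-mel vectors) of the stored, unrecognized sound clips. Returns
    one cluster label per clip plus the cluster centers, which can serve as
    candidate new sound categories."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    labels = km.fit_predict(np.asarray(features))
    return labels, km.cluster_centers_
```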
For example, the electronic device may collect a large number of sounds of a coffee machine grinding beans and obtain a new sound category through a cluster analysis algorithm, so that the next time the sound input to the electronic device includes the sound of the coffee machine grinding beans, the electronic device can recognize its category.
S702, the electronic device updates the sound event recognition model according to the new sound category information, and obtains an updated sound event recognition model.
For example, the electronic device may use the new sound category information obtained after clustering as training samples to retrain the sound event recognition model and obtain a new sound event recognition model; the updated model is more stable and more robust. That is, after the electronic device obtains the new sound event recognition model, it may update the original sound event recognition model to the new one.
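A minimal sketch of such an update is given below; a logistic-regression classifier stands in for the sound event recognition model, whose structure the embodiment does not specify, and the category label is illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def retrain_with_new_category(old_features, old_labels, new_features, new_label):
    """Append clips of a newly discovered category (e.g. "coffee_grinder") to
    the existing training set and refit the sound event classifier."""
    X = np.vstack([old_features, new_features])
    y = np.concatenate([np.asarray(old_labels),
                        np.array([new_label] * len(new_features))])
    model = LogisticRegression(max_iter=1000)
    model.fit(X, y)
    return model
```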
Optionally, after the electronic device updates the sound event recognition model, when the electronic device performs the speech enhancement method of steps S501 to S505 again, in step S502 the multiple sounds may be recognized according to the updated sound event recognition model and the category information of each sound may be acquired. Because the updated sound event recognition model can recognize more types of sound events, it achieves a better recognition effect.
Optionally, the steps S701 to S702 may be executed after the steps S503 to S505, or may be executed before the steps S503 to S505, or may be executed simultaneously with the steps S503 to S505, and the execution sequence of the steps is not limited in this embodiment of the application.
It can be understood that, in the embodiment of the present application, the sound event recognition model is updated according to the new sound category information, so that the sound event recognition model can adapt to the environment.
In the embodiment of the present application, the sounds that are not recognized by the sound event recognition model can be clustered, and a new sound event recognition model can be obtained through training. That is, the electronic device can learn the sounds that frequently occur in its environment and update the sound event recognition model, so that it adapts to the environment and improves the user's interaction experience. Moreover, the stability and robustness of the sound event recognition model improve as the user's usage time increases, achieving a better effect over time.
The present application further provides another embodiment, as shown in fig. 8, after step S502, the method may further include: steps S801 to S803.
S801, the electronic equipment acquires sound direction information of the first sound.
For example, the electronic device may obtain the direction information of the first sound according to the microphone-array-based sound localization method in step S5032, and store the sound direction information. For example, the electronic device may record the direction information of fixed sound sources such as the television sound, the sound of running water, and the washing machine sound. Optionally, the electronic device may store the sound direction information in a memory.
S802, the electronic equipment acquires direction information of the useless sound according to the sound direction information of the first sound and the sound type information of the first sound.
For example, the electronic device may include a sound localization module to determine bearing information for a sound source. For example, the sound localization module may determine the bearing information of the sound source through a microphone array in the electronic device.
For example, the electronic device may perform adaptive learning according to the sound direction information and the sound category information to acquire the direction information of useless sounds. For example, the sound localization module determines a sound that comes from a fixed direction and is not a human voice as a useless sound by combining the direction information of the sound with the category information of the sound. For example, the television sound, the sound of running water, cooking sounds, the sound of a coffee machine grinding beans, and the washing machine sound are all useless sounds; the direction information of these useless sounds is relatively fixed with respect to the electronic device, so the electronic device can acquire the direction information of the useless sounds.
S803, the electronic device filters the sound from the direction according to the direction information of the useless sound.
For example, based on the direction information of the useless sounds, the electronic device may filter out the sound coming from those directions when enhancing the sound it receives.
Optionally, the electronic device may filter the sound from a direction according to both the direction information of the useless sounds and the category information of the sound. For example, if, among the sounds received by the electronic device, the sound coming from the direction of a useless sound is not a human voice, the electronic device filters out the sound from that direction. That is, the direction information of the useless sounds can assist the electronic device in sound separation and filtering.
For example, the electronic device may combine the direction information of sounds with the sound category information to obtain the directions of sounds that frequently occur in a specific scene, such as the direction of the television sound or the direction of the sound of running water. The electronic device can then shield the sounds in those directions, better assisting the sound separation.
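The adaptive learning of useless-sound directions described in steps S801 to S803 can be pictured with the small sketch below; the direction bin width, count threshold, and category label are illustrative assumptions, not values from the embodiment:

```python
from collections import Counter

class UselessDirectionTracker:
    """Accumulates how often non-voice sounds arrive from each quantized
    direction; directions observed often enough are treated as useless-sound
    directions and can be shielded during separation."""

    def __init__(self, bin_width_deg=15, min_count=20):
        self.bin_width_deg = bin_width_deg
        self.min_count = min_count
        self.counts = Counter()

    def observe(self, angle_deg, category):
        if category != "human voice":
            self.counts[int(angle_deg) // self.bin_width_deg] += 1

    def is_useless_direction(self, angle_deg):
        return self.counts[int(angle_deg) // self.bin_width_deg] >= self.min_count

# Usage sketch: during separation, a source whose estimated direction satisfies
# tracker.is_useless_direction(angle) and whose category is not a human voice
# can be ignored.
```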
In the embodiment of the present application, by learning the direction information of various sounds and combining it with the sound category information, the directions of sounds that frequently occur in a specific scene can be obtained, so that the sound separation can be better assisted.
It should be noted that a traditional speech enhancement method cannot perform targeted enhancement on sounds from different directions and cannot adapt to the environment. Even when the environment of the electronic device changes, the method in this embodiment can identify the directions of frequently occurring useless sounds through the direction information of the sounds and ignore the sounds from those directions during separation, so that the enhancement effect is ensured, the user interaction experience is improved, and the method adapts to the environment.
The embodiment of the present application further provides a speech enhancement method, as shown in fig. 9, steps S901 to S902 may be further included after the above steps S501 to S505.
S901, the electronic device acquires the voice interaction information of the third sound.
The voice interaction information of the third sound includes: the user voice interaction command (the third sound) processed by the electronic device, and the commands with which the user further interacts by voice with the electronic device after the electronic device feeds back the voice interaction output. That is, when the voice interaction between the electronic device and the user produces good feedback, the user further actively performs voice interaction with the electronic device.
For example, the user's voice interaction command is "how is the weather today?", and the voice interaction output fed back by the electronic device is "it is sunny today, the lowest temperature is 25 degrees, and the highest temperature is 35 degrees". Because the voice interaction output fed back by the electronic device is accurate, the user further asks "what clothes are suitable to wear?". It can be understood that, because the electronic device can accurately respond to the user's interaction command, further voice interaction between the user and the electronic device is encouraged.
Illustratively, the electronic device may collect the voice interaction information of the third sound. The voice interaction information of the third sound may include at least two user voice interaction commands, for example, "how is the weather today?" and "what clothes are suitable to wear?".
S902, the electronic device performs voiceprint registration for a third sound whose voiceprint is not registered in the voice interaction information of the third sound.
For example, the electronic device may perform unsupervised mining on the collected voice interaction information, and if the voiceprint feature extracted from the voice interaction information is not registered, the electronic device may perform text-independent voiceprint registration.
For example, the electronic device may collect a large amount of voice interaction information with good interaction feedback, mine from it, through unsupervised learning, the user voices whose voiceprint features are not registered, and extract voiceprint features from that user voice information based on a deep learning algorithm to register the voiceprints. Optionally, this voiceprint registration process may be referred to as adaptive voiceprint registration.
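A minimal sketch of adaptive voiceprint registration, assuming a voiceprint embedding has already been extracted from a well-received interaction; the similarity measure, threshold, and user-id scheme are illustrative assumptions:

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def adaptive_voiceprint_registration(embedding, registry, match_threshold=0.5):
    """registry: dict user_id -> voiceprint vector. If the new embedding does
    not match any registered voiceprint, register it under an automatically
    generated user id and return that id; otherwise return None."""
    best = max((cosine_similarity(embedding, v) for v in registry.values()),
               default=0.0)
    if best < match_threshold:
        new_id = f"auto_user_{len(registry) + 1}"
        registry[new_id] = np.asarray(embedding)
        return new_id
    return None
```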
Optionally, after the electronic device performs adaptive voiceprint registration through steps S901 to S902, when the electronic device executes the speech enhancement method of steps S501 to S505 again, the voiceprint information of the registered users in step S503 includes the voiceprint information registered through adaptive voiceprint registration. The electronic device may match the voiceprint information of the multiple sounds with the voiceprint information of the registered users to obtain the third sound.
It can be understood that, in this embodiment, by routinely collecting user interaction voices with good interaction feedback and combining voice recognition with text-independent voiceprint registration, the electronic device can learn and register voiceprint information by itself. Therefore, the voice corresponding to the voiceprint information can be enhanced during intelligent interaction, and the intelligent interaction experience is improved.
It should be noted that a traditional speech enhancement algorithm cannot perform targeted enhancement on sounds from different people and cannot perform adaptive voiceprint registration for people whose voiceprints are not registered. In contrast, this embodiment can ensure the enhancement effect through adaptive voiceprint registration even when the environment of the device becomes increasingly complex, and in practical use the effect of the method becomes better as the user's usage time increases.
The voice enhancement algorithm of the present application can be applied to scenes with complex sound environments. The first sound collected by the electronic device includes user voices and the background sound of the daily environment in which the electronic device is located. The electronic device can perform sound event recognition on the collected first sound and determine the sound category information of the first sound, and then separate a second sound from the first sound according to the sound category information of the first sound, that is, filter out the background sound of the daily environment and retain only the user voices. When the second sound includes multiple user voices, the electronic device may separate them according to a deep learning algorithm and determine a third sound (the user voice interaction command) from the second sound in combination with the attribute information of the multiple user voices, the third sound being the voice interaction command of the target user. In this way, a clean voice interaction command of the target user can be extracted from a complex sound environment, and the voice interaction feedback output by the electronic device according to the third sound is more accurate. In this solution, when the user voice interaction command is extracted from the complex sound environment, the sound is recognized and analyzed, and the user voice interaction command (the third sound) is obtained in combination with the attribute information of the sound. The voice enhancement method therefore does not enhance all input sounds equally, but performs targeted enhancement in combination with the attribute information of the currently collected sound, so that it can adapt to complex sound environments and improve the user's intelligent interaction experience in such environments.
By way of example, the voice enhancement algorithm provided by the embodiment of the present application is described with reference to the sound environment of the smart speaker shown in fig. 1. The smart speaker collects a first sound, which includes television sound, dog barking, people's conversation, and a user voice interaction command. The electronic device recognizes the first sound according to the sound event recognition model and determines the sound category information of the first sound; the category information of the first sound includes human voices, dog barking, and television sound. In combination with the sound category information of the first sound, the electronic device can separate the human voices from the dog barking and the television sound according to a sound source separation algorithm, and filter out the dog barking and the television sound to obtain the human voices (the second sound). The electronic device performs blind source separation on the human voices (including the conversation sound and the user voice interaction command), that is, separates the different human voices, and then obtains the third sound according to the sound attribute information (voiceprint features and/or sound direction information) of the different human voices, that is, extracts the voice interaction command of the target user from the multiple human voices. It can be understood that, in the embodiment of the present application, by recognizing and analyzing the collected first sound, that is, by performing targeted enhancement on the sounds in the complex sound environment, the electronic device can extract a clean voice interaction command of the target user from the complex sound environment, thereby improving the user's intelligent interaction experience.
It is understood that the electronic device includes hardware structures and/or software modules for performing the functions in order to realize the functions. Those of skill in the art will readily appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as hardware or combinations of hardware and computer software. Whether a function is performed as hardware or computer software drives hardware depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the embodiments of the present application.
In the embodiment of the present application, the electronic device may be divided into the functional modules according to the method example, for example, each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. It should be noted that, in the embodiment of the present application, the division of the module is schematic, and is only one logic function division, and there may be another division manner in actual implementation.
In the case of an integrated unit, fig. 10 shows a schematic view of a possible configuration of the electronic device involved in the above-described embodiment. The electronic device 1000 includes: a processing unit 1001 and a storage unit 1002.
The processing unit 1001 is configured to control and manage operations of the electronic device 1000. For example, it can be used to execute the processing steps of S501-S505 in fig. 5; or, may be used to execute the processing steps of S701 and S702 in fig. 7; or, may be used to perform the processing steps of S801-S803 in fig. 8; or, may be used to execute the processing steps of S901-S902 in fig. 9; and/or other processes for the techniques described herein.
The storage unit 1002 is used to store program codes and data of the electronic apparatus 1000. For example, it may be used to store sound bearing information and the like.
Of course, the unit modules in the electronic device 1000 include, but are not limited to, the processing unit 1001 and the storage unit 1002. For example, an audio unit, a communication unit, and the like may also be included in the electronic device 1000. The audio unit is used for collecting voice sent by a user and playing the voice. The communication unit is used to support communication between the electronic apparatus 1000 and other devices.
The processing unit 1001 may be a processor or a controller, such as a Central Processing Unit (CPU), a Digital Signal Processor (DSP), an application-specific integrated circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. The processor may include an application processor and a baseband processor, and may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the present disclosure. The processor may also be a combination of computing devices, for example, a combination of one or more microprocessors, or a combination of a DSP and a microprocessor. The storage unit 1002 may be a memory. The audio unit may include a microphone, a speaker, and the like. The communication unit may be a transceiver, a transceiver circuit, a communication interface, or the like.
For example, the processing unit 1001 is a processor (such as the processor 110 shown in fig. 2), and the storage unit 1002 may be a memory (such as the internal memory 120 shown in fig. 2). The audio unit may include a speaker (e.g., speaker 170A shown in fig. 2), a microphone (e.g., microphone 170B shown in fig. 2). The communication unit includes a wireless communication module (such as the wireless communication module 160 shown in fig. 2). The wireless communication module may be collectively referred to as a communication interface. The electronic device 1000 provided by the embodiment of the application may be the electronic device 100 shown in fig. 2. Wherein the processor, the memory, the communication interface, etc. may be coupled together, for example by a bus connection.
An embodiment of the present application further provides a computer storage medium, where computer program codes are stored in the computer storage medium, and when the processor executes the computer program codes, the electronic device executes the relevant method steps in fig. 5, fig. 7, fig. 8, or fig. 9 to implement the method in any of the above embodiments.
Embodiments of the present application also provide a computer program product, which when run on a computer causes the computer to execute the relevant method steps in fig. 5, fig. 7, fig. 8 or fig. 9 to implement the method in any of the above embodiments.
The electronic device 1000, the computer storage medium, or the computer program product provided in the embodiment of the present application are all configured to execute the corresponding methods provided above, so that beneficial effects achieved by the electronic device can refer to the beneficial effects in the corresponding methods provided above, and are not described herein again.
Through the above description of the embodiments, it is clear to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be completed by different functional modules according to needs, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the above described functions.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and for example, the division of the modules or units is only one logical functional division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another device, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may be one physical unit or a plurality of physical units, that is, may be located in one place, or may be distributed in a plurality of different places. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application may be essentially or partially contributed to by the prior art, or all or part of the technical solutions may be embodied in the form of a software product, where the software product is stored in a storage medium and includes several instructions to enable a device (which may be a single chip, a chip, or the like) or a processor (processor) to execute all or part of the steps of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a U disk, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.
The above description is only an embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions within the technical scope of the present disclosure should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (19)

1. A method of speech enhancement, the method comprising:
the electronic device collects a first sound, the first sound comprising at least one of a second sound and a background sound;
the electronic device recognizing the first sound; when the first sound has the second sound, the electronic equipment performs sound analysis on the second sound to obtain a third sound;
the electronic device processes the third sound.
2. The method of claim 1, wherein the electronic device identifies the first sound, comprising:
and the electronic equipment performs sound event identification on the first sound according to a sound event identification model to acquire sound category information of the first sound.
3. The method of claim 2, wherein the electronic device performs sound analysis on the second sound to obtain a third sound, comprising:
the electronic equipment separates the second sound from the first sound according to the sound type information of the first sound;
the electronic equipment analyzes sound attribute information of the second sound; wherein the sound attribute information includes: one or more of sound azimuth information, voiceprint information, sound time information and sound decibel information;
and the electronic equipment obtains the third sound according to the sound attribute information of the second sound.
4. The method according to any one of claims 1 to 3, wherein the voiceprint information of the third sound matches voiceprint information of registered users.
5. The method according to any one of claims 1-4, further comprising:
the electronic equipment clusters the fourth sound to acquire new sound category information; the fourth sound is the sound of which the sound type information is not identified according to the sound event identification model in the first sound;
and updating the sound event identification model according to the new sound category information, and acquiring the updated sound event identification model.
6. The method according to any one of claims 1-5, further comprising:
the electronic equipment acquires sound direction information of the first sound;
acquiring direction information of useless sound according to the sound direction information of the first sound and the sound type information of the first sound;
and the electronic equipment filters the sound from the direction according to the direction information of the useless sound.
7. The method according to any one of claims 1-6, further comprising:
acquiring voice interaction information of the third sound;
and performing voiceprint registration on a third sound for which no voiceprint is registered in the voice interaction information of the third sound.
8. The method according to any one of claims 1-7, further comprising:
and the electronic equipment outputs interactive information according to the sound attribute information of the third sound.
9. An electronic device, characterized in that the electronic device comprises: a processor and a memory; the memory is coupled with the processor; the memory for storing computer program code; the computer program code comprising computer instructions which, when executed by the processor, cause the electronic device to perform the method of any of claims 1-8.
10. A computer storage medium comprising computer instructions that, when executed on an electronic device, cause the electronic device to perform the method of any of claims 1-8.
11. A computer program product, characterized in that, when the computer program product is run on a computer, it causes the computer to perform the method according to any of claims 1-8.
12. A speech enhancement apparatus, comprising:
a processor to identify a first sound, the first sound captured by a speech capture device, the first sound comprising at least one of a second sound and a background sound;
when the first sound has the second sound, the processor performs sound analysis on the second sound to obtain a third sound;
the processor processes the third sound.
13. The apparatus of claim 12, wherein the processor is further configured to perform sound event recognition on the first sound according to a sound event recognition model, and obtain sound category information of the first sound.
14. The apparatus of claim 13, wherein the processor is further configured to:
separating a second sound from the first sound according to the sound type information of the first sound;
analyzing sound attribute information of the second sound; wherein the sound attribute information includes: one or more of sound azimuth information, voiceprint information, sound time information and sound decibel information;
and obtaining the third sound according to the sound attribute information of the second sound.
15. The apparatus according to any one of claims 12-14, wherein the voiceprint information of the third sound matches voiceprint information of registered users.
16. The apparatus according to any of claims 12-15, wherein the processor is further configured to:
clustering the fourth sound to acquire new sound category information; the fourth sound is the sound of which the sound type information is not identified according to the sound event identification model in the first sound;
and updating the sound event identification model according to the new sound category information, and acquiring the updated sound event identification model.
17. The apparatus according to any of claims 12-16, wherein the processor is further configured to:
acquiring sound direction information of the first sound;
acquiring direction information of useless sound according to the sound direction information of the first sound and the sound type information of the first sound;
and filtering the sound from the direction according to the direction information of the useless sound.
18. The apparatus according to any of claims 12-17, wherein the processor is further configured to:
acquiring voice interaction information of the third sound;
and performing voiceprint registration on a third sound for which no voiceprint is registered in the voice interaction information of the third sound.
19. The apparatus according to any of claims 12-18, wherein the processor is further configured to:
and outputting interactive information according to the sound attribute information of the third sound.
CN201910774538.0A 2019-08-21 2019-08-21 Voice enhancement method and device Pending CN112420063A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910774538.0A CN112420063A (en) 2019-08-21 2019-08-21 Voice enhancement method and device
PCT/CN2020/105296 WO2021031811A1 (en) 2019-08-21 2020-07-28 Method and device for voice enhancement

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910774538.0A CN112420063A (en) 2019-08-21 2019-08-21 Voice enhancement method and device

Publications (1)

Publication Number Publication Date
CN112420063A true CN112420063A (en) 2021-02-26

Family

ID=74660413

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910774538.0A Pending CN112420063A (en) 2019-08-21 2019-08-21 Voice enhancement method and device

Country Status (2)

Country Link
CN (1) CN112420063A (en)
WO (1) WO2021031811A1 (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9721202B2 (en) * 2014-02-21 2017-08-01 Adobe Systems Incorporated Non-negative matrix factorization regularized by recurrent neural networks for audio processing
CN107357875B (en) * 2017-07-04 2021-09-10 北京奇艺世纪科技有限公司 Voice search method and device and electronic equipment
US10547937B2 (en) * 2017-08-28 2020-01-28 Bose Corporation User-controlled beam steering in microphone array
CN109087631A (en) * 2018-08-08 2018-12-25 北京航空航天大学 A kind of Vehicular intelligent speech control system and its construction method suitable for complex environment

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7117145B1 (en) * 2000-10-19 2006-10-03 Lear Corporation Adaptive filter for speech enhancement in a noisy environment
CN1889172A (en) * 2005-06-28 2007-01-03 松下电器产业株式会社 Sound sorting system and method capable of increasing and correcting sound class
CN105703978A (en) * 2014-11-24 2016-06-22 武汉物联远科技有限公司 Smart home control system and method
CN105280183A (en) * 2015-09-10 2016-01-27 百度在线网络技术(北京)有限公司 Voice interaction method and system
CN105185378A (en) * 2015-10-20 2015-12-23 珠海格力电器股份有限公司 Voice control method, voice control system and voice-controlled air-conditioner
CN107146614A (en) * 2017-04-10 2017-09-08 北京猎户星空科技有限公司 A kind of audio signal processing method, device and electronic equipment
CN108172219A (en) * 2017-11-14 2018-06-15 珠海格力电器股份有限公司 The method and apparatus for identifying voice
WO2019133732A1 (en) * 2017-12-28 2019-07-04 Knowles Electronics, Llc Content-based audio stream separation
CN108831440A (en) * 2018-04-24 2018-11-16 中国地质大学(武汉) A kind of vocal print noise-reduction method and system based on machine learning and deep learning
CN109712625A (en) * 2019-02-18 2019-05-03 珠海格力电器股份有限公司 Smart machine control method based on gateway, control system, intelligent gateway

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113794963A (en) * 2021-09-14 2021-12-14 深圳大学 Speech enhancement system based on low-cost wearable sensor
CN113794963B (en) * 2021-09-14 2022-08-05 深圳大学 Speech enhancement system based on low-cost wearable sensor
CN114722884A (en) * 2022-06-08 2022-07-08 深圳市润东来科技有限公司 Audio control method, device and equipment based on environmental sound and storage medium
CN114722884B (en) * 2022-06-08 2022-09-30 深圳市润东来科技有限公司 Audio control method, device and equipment based on environmental sound and storage medium

Also Published As

Publication number Publication date
WO2021031811A1 (en) 2021-02-25

Similar Documents

Publication Publication Date Title
CN107799126B (en) Voice endpoint detection method and device based on supervised machine learning
CN111128197B (en) Multi-speaker voice separation method based on voiceprint features and generation confrontation learning
WO2019104698A1 (en) Information processing method and apparatus, multimedia device, and storage medium
CN113035227B (en) Multi-modal voice separation method and system
CN108899044A (en) Audio signal processing method and device
WO2016150001A1 (en) Speech recognition method, device and computer storage medium
CN107257996A (en) The method and system of environment sensitive automatic speech recognition
CN102404278A (en) Song request system based on voiceprint recognition and application method thereof
CN104575504A (en) Method for personalized television voice wake-up by voiceprint and voice identification
CN103024530A (en) Intelligent television voice response system and method
CN103151039A (en) Speaker age identification method based on SVM (Support Vector Machine)
CN109999314A (en) One kind is based on brain wave monitoring Intelligent sleep-assisting system and its sleep earphone
WO2019233228A1 (en) Electronic device and device control method
CN114141230A (en) Electronic device, and voice recognition method and medium thereof
CN109119090A (en) Method of speech processing, device, storage medium and electronic equipment
CN110930987B (en) Audio processing method, device and storage medium
CN109935226A (en) A kind of far field speech recognition enhancing system and method based on deep neural network
CN113205803B (en) Voice recognition method and device with self-adaptive noise reduction capability
WO2021031811A1 (en) Method and device for voice enhancement
CN109147787A (en) A kind of smart television acoustic control identifying system and its recognition methods
CN110428835A (en) A kind of adjusting method of speech ciphering equipment, device, storage medium and speech ciphering equipment
CN114067776A (en) Electronic device and audio noise reduction method and medium thereof
CN114067782A (en) Audio recognition method and device, medium and chip system thereof
CN208724111U (en) Far field speech control system based on television equipment
WO2022007846A1 (en) Speech enhancement method, device, system, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination