CN113053368A - Speech enhancement method, electronic device, and storage medium - Google Patents

Info

Publication number
CN113053368A
CN113053368A · Application CN202110257165.7A
Authority
CN
China
Prior art keywords: voice signal, sound, voice, signal, target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110257165.7A
Other languages
Chinese (zh)
Inventor
夏洁
方思敏
罗丽云
李开
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
RDA Microelectronics Shanghai Co Ltd
RDA Microelectronics Inc
Original Assignee
RDA Microelectronics Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by RDA Microelectronics Shanghai Co Ltd filed Critical RDA Microelectronics Shanghai Co Ltd
Priority to CN202110257165.7A priority Critical patent/CN113053368A/en
Publication of CN113053368A publication Critical patent/CN113053368A/en
Pending legal-status Critical Current

Classifications

    • G10L15/20: Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/30: Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G10L17/24: Interactive procedures; man-machine interfaces; the user being prompted to utter a password or a predefined phrase
    • G10L21/0208: Noise filtering
    • G10L25/30: Speech or voice analysis characterised by the analysis technique using neural networks
    • G10L2021/02082: Noise filtering where the noise is echo or reverberation of the speech

Abstract

The application provides a speech enhancement method, an electronic device, and a storage medium, relating to the technical field of speech processing. The speech enhancement method includes the following steps. First, a voice signal collected by a microphone array is acquired. Then, the voice signal is pre-enhanced according to the sound zone parameters of each sound zone to obtain a pre-enhanced voice signal corresponding to each sound zone. Next, a target voice signal containing a wake-up word is determined from the pre-enhanced voice signals, and the sound zone corresponding to the target voice signal is determined as the target sound zone where the sound source generating the voice signal is located. Finally, the sound source is located within the target sound zone, and the voice signal is directionally enhanced according to the localization information of the sound source. Using the per-zone pre-enhanced voice signals in the wake-up stage improves wake-up performance, so the position of the target sound source can be accurately located even under interference from multiple sound sources, improving the voice enhancement performance in the recognition stage.

Description

Speech enhancement method, electronic device, and storage medium
[ technical field ]
The present application relates to the field of speech processing technologies, and in particular, to a speech enhancement method, an electronic device, and a storage medium.
[ background of the invention ]
In some scenarios involving voice interaction, such as smart speakers, smart cars, and smart robots, it is generally necessary to perform voice signal processing on a voice signal input by a user. The voice signal processing mainly comprises the steps of determining the incoming wave direction of a target sound source and utilizing a beam forming technology to carry out beam enhancement on voice signals in the incoming wave direction, so that the purposes of enhancing effective signals and suppressing noise and interference are achieved.
Currently, when determining the incoming wave direction of a target sound source, the target sound source is mainly located by a direction of arrival estimation technique. However, when there is interference from multiple sound sources in the environment, the current technology cannot accurately locate the direction of the target sound source, which causes the beam generated in the voice enhancement process to diverge, thereby affecting the subsequent voice interaction service.
[ summary of the invention ]
The embodiment of the application provides a voice enhancement method, electronic equipment and a storage medium, so that the position of a target sound source is accurately positioned under the condition of interference of a plurality of sound sources, and the voice enhancement performance in the awakening and recognition stages is improved.
In a first aspect, an embodiment of the present application provides a speech enhancement method, where the method includes: acquiring a voice signal acquired by a microphone array; according to the sound zone parameters of each sound zone, respectively pre-enhancing the voice signals to obtain pre-enhanced voice signals respectively corresponding to each sound zone; wherein the sound zones are divided in advance according to azimuth information of microphones included in the microphone array; determining a target voice signal containing a wake-up word from each of the pre-enhanced voice signals; determining a sound zone corresponding to the target voice signal as a target sound zone where a sound source generating the voice signal is located; and positioning a sound source generating the voice signal in the target sound area, and directionally enhancing the voice signal according to the positioning information of the sound source.
In one possible implementation manner, the azimuth information of the microphone includes: a relative position parameter of a microphone in the microphone array; pre-dividing each sound area according to azimuth information of each microphone contained in the microphone array, wherein the pre-dividing comprises the following steps: dividing a signal acquisition area of the microphone array into a plurality of sound areas according to relative position parameters of all microphones contained in the microphone array, and determining sound area parameters of the sound areas according to the central line positions of the sound areas.
In one possible implementation manner, determining a target speech signal containing a wake-up word from each of the pre-enhanced speech signals includes: scoring the similarity between the signal characteristics of each pre-enhanced voice signal and preset signal characteristics by using a neural network model; the preset signal characteristics are signal characteristics of a wake-up voice signal corresponding to a wake-up word; and determining the target voice signal according to the scoring result.
In one possible implementation manner, determining the target speech signal according to the scoring result includes: and determining the pre-enhanced voice signal with the score higher than a preset threshold value and the score highest in each pre-enhanced voice signal as a target voice signal.
In one possible implementation manner, if the score of each of the pre-enhanced speech signals is lower than the preset threshold, the method further includes: and acquiring new voice signals through the microphone array until the score of at least one pre-enhanced voice signal in each generated pre-enhanced voice signal is higher than the preset threshold value.
In one possible implementation manner, after directionally enhancing the speech signal according to the positioning information of the sound source, the method further includes: and sending the directionally enhanced voice signal to a cloud server so that the cloud server performs voice recognition according to the directionally enhanced voice signal and performs voice interaction according to a voice recognition result.
In a second aspect, an embodiment of the present application provides a speech enhancement apparatus, including: the acquisition module is used for acquiring the voice signals acquired by the microphone array; the pre-enhancement module is used for respectively pre-enhancing the voice signals according to the sound zone parameters of each sound zone to obtain pre-enhanced voice signals corresponding to each sound zone; wherein the sound zones are divided in advance according to azimuth information of microphones included in the microphone array; the first determining module is used for determining a target voice signal containing a wake-up word from each pre-enhanced voice signal; the second determining module is used for determining the sound zone corresponding to the target voice signal as the target sound zone where the sound source generating the voice signal is located; and the execution module is used for positioning the sound source generating the voice signal in the target sound area and directionally enhancing the voice signal according to the positioning information of the sound source.
In one possible implementation manner, after directionally enhancing the voice signal according to the positioning information of the sound source, the execution module is further configured to: and sending the directionally enhanced voice signal to a cloud server so that the cloud server performs voice recognition according to the directionally enhanced voice signal and performs voice interaction according to a voice recognition result.
In a third aspect, an embodiment of the present application provides an electronic device, including: at least one processor; and at least one memory communicatively coupled to the processor, wherein: the memory stores program instructions executable by the processor, and the processor, when executing the program instructions, is capable of performing the method of the first aspect.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium storing computer instructions for causing a computer to perform the method according to the first aspect.
In the above technical scheme, a voice signal collected by a microphone array is first acquired. Then, the voice signal is pre-enhanced according to the sound zone parameters of each sound zone to obtain a pre-enhanced voice signal corresponding to each sound zone. Next, a target voice signal containing a wake-up word is determined from the pre-enhanced voice signals, and the sound zone corresponding to the target voice signal is determined as the target sound zone where the sound source generating the voice signal is located. Finally, the sound source is located within the target sound zone, and the voice signal is directionally enhanced according to the localization information of the sound source. In this scheme, beam enhancement is applied in both the wake-up and recognition stages, and sound source localization is restricted to a preset sound zone range, which improves the reliability of the localization result.
[ description of the drawings ]
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a block diagram of a speech enhancement system according to an embodiment of the present application;
FIG. 2 is a block diagram of another speech enhancement system provided in an embodiment of the present application;
fig. 3 is a flowchart of a speech enhancement method according to an embodiment of the present application;
fig. 4 is a schematic diagram of sound zone division in a speech enhancement method according to an embodiment of the present application;
fig. 5 is another schematic diagram of sound zone division in a speech enhancement method according to an embodiment of the present application;
FIG. 6 is a flow chart of another speech enhancement method provided by an embodiment of the present application;
fig. 7 is a schematic structural diagram of a speech enhancement apparatus according to an embodiment of the present application;
fig. 8 is a schematic view of an electronic device according to an embodiment of the present application.
[ detailed description ]
For better understanding of the technical solutions of the present application, the following detailed descriptions of the embodiments of the present application are provided with reference to the accompanying drawings.
It should be understood that the embodiments described are only a few embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terminology used in the embodiments of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the examples of this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
The embodiment of the application can provide a voice enhancement system, and the voice enhancement system can be positioned in terminal equipment with a voice interaction function, such as an intelligent sound box, an intelligent automobile, an intelligent robot and the like. The speech enhancement system provided by the embodiment of the application can be used for executing the speech enhancement method provided by the embodiment of the application.
Fig. 1 is a block diagram of a speech enhancement system according to an embodiment of the present application. As shown in fig. 1, the speech enhancement system 10 may include: a microphone array 11, a first enhancing unit 12, a positioning unit 13, a second enhancing unit 14 and a wake-up unit 15.
The microphone array 11 is connected to the first enhancement unit 12, the positioning unit 13, and the second enhancement unit 14, respectively. The first enhancement unit 12 is connected to a wake-up unit 15. The wake-up unit 15 is connected to the positioning unit 13. The positioning unit 13 is connected to the second enhancement unit 14.
Further, the speech enhancement system 10 provided in the embodiment of the present application may be connected to the cloud server 20, so that the enhanced speech signal may be uploaded to the cloud server 20, and the recognition unit 21 of the cloud server 20 performs speech recognition, and triggers speech interaction according to a recognition result.
Fig. 2 is a block diagram of another speech enhancement system according to an embodiment of the present application. Compared to fig. 1, the speech enhancement system shown in fig. 2 may further comprise an echo cancellation unit 16, a first speech processing unit 17 and a second speech processing unit 18.
The input end of the echo cancellation unit 16 is connected to the microphone array 11, and the output end is connected to the first enhancement unit 12, the positioning unit 13, and the second enhancement unit 14, respectively. The echo cancellation unit 16 may perform echo cancellation on the voice signal collected by the microphone array 11. The first speech processing unit 17 is connected to the second enhancement unit 14. The second speech processing unit 18 is connected to the first speech processing unit 17.
Fig. 3 is a flowchart of a speech enhancement method according to an embodiment of the present application. As shown in fig. 3, the speech enhancement method may include:
Step 101, acquiring a voice signal collected by a microphone array.
In the embodiment of the present application, as shown in fig. 1, after the microphone array 11 collects the voice signals, the voice signals may be sent to the first enhancing unit 12, the positioning unit 13, and the second enhancing unit 14, respectively.
Step 102, pre-enhancing the voice signals respectively according to the sound zone parameters of each sound zone to obtain pre-enhanced voice signals corresponding to each sound zone.
In the embodiment of the present application, to accurately determine the position of the sound source, the signal acquisition area of the microphone array 11 may be divided into a plurality of non-overlapping sound zones according to the azimuth information of each microphone in the microphone array 11.
Specifically, the signal collection area of the microphone array 11 may be divided into a plurality of sound zones according to the relative position parameters of each microphone in the microphone array 11. The division may be an even division, and the number of sound zones obtained may be equal to the number of microphones, with the centerline of each sound zone corresponding to one microphone. The sound zone parameters of a sound zone can be determined according to its centerline position and may include the sound zone direction.
In the embodiment of the present application, the more microphones the microphone array 11 contains, the more sound zones can be divided and the smaller the range of each sound zone; accordingly, the higher the accuracy of sound source localization and the better the voice signal enhancement effect.
The division of the sound zones will be described by taking a six-microphone circular array as an example.
As shown in fig. 4, the signal acquisition area of the microphone array 30 can be equally divided into 6 non-overlapping sub-areas according to the relative positions of the 6 microphones included in the microphone array 30. Each sub-area is a sound zone, and the 6 sound zones are: sound zone 1, sound zone 2, sound zone 3, sound zone 4, sound zone 5, and sound zone 6. The centerline of each sound zone corresponds to one microphone, and the direction of the centerline is the sound zone direction of that zone.
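The even division described above can be sketched as follows. This is a minimal illustration; the function name `divide_sound_zones` and the angle convention (centerline of zone 1 at 0 degrees) are assumptions, not part of the patent:

```python
def divide_sound_zones(num_mics):
    """Evenly divide the 360-degree signal acquisition area of a circular
    microphone array into non-overlapping sound zones, one per microphone.
    Each zone is described by (centerline_deg, lower_deg, upper_deg), where
    the centerline is aligned with one microphone and gives the zone direction."""
    span = 360.0 / num_mics
    zones = []
    for i in range(num_mics):
        center = i * span                    # centerline aligned with microphone i
        lower = (center - span / 2) % 360.0  # zone boundary on one side
        upper = (center + span / 2) % 360.0  # zone boundary on the other side
        zones.append((center, lower, upper))
    return zones

# For the six-microphone array of fig. 4: centerlines 60 degrees apart
zones = divide_sound_zones(6)
```

For the six-microphone array this yields six equal 60-degree zones, matching the six sub-areas shown in fig. 4.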
In order to determine the sound zone where the sound source of the voice signal is located, the first enhancing unit 12 may pre-enhance the voice signal according to the sound zone parameters of each sound zone to obtain the pre-enhanced voice signal corresponding to each sound zone. Specifically, a fixed beamforming (FBF) algorithm may be used to pre-enhance the acquired voice signal in each sound zone direction, so as to weaken noise and interference in the voice signal and obtain pre-enhanced voice signals corresponding to the different sound zone directions.
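For illustration, such a fixed beamformer can be sketched as a frequency-domain delay-and-sum beamformer steered toward a sound zone centerline. This is a minimal free-field, far-field sketch; the patent does not specify which FBF variant is used, and the function and parameter names are assumptions:

```python
import numpy as np

def delay_and_sum(frames, mic_xy, steer_deg, fs=16000, c=343.0):
    """Fixed (delay-and-sum) beamformer: steer the array toward one
    sound zone centerline by compensating each microphone's plane-wave
    delay in the frequency domain, then averaging the channels.
    frames: (num_mics, num_samples) time-domain signals.
    mic_xy: (num_mics, 2) microphone positions in metres."""
    num_mics, n = frames.shape
    theta = np.deg2rad(steer_deg)
    u = np.array([np.cos(theta), np.sin(theta)])  # unit vector toward the source
    delays = mic_xy @ u / c                       # arrival-time lead of each mic (s)
    spec = np.fft.rfft(frames, axis=1)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    # remove each channel's lead so the steered direction adds coherently
    align = np.exp(-2j * np.pi * freqs[None, :] * delays[:, None])
    return np.fft.irfft((spec * align).mean(axis=0), n=n)
```

Running this once per sound zone direction on the same captured frames yields the per-zone pre-enhanced voice signals that are scored in the wake-up stage.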
Step 103, determining a target voice signal containing the awakening word from each pre-enhanced voice signal.
Since each pre-enhanced voice signal is enhanced according to different sound zone parameters, the signal strength of each pre-enhanced voice signal and the speech information it contains are different.
According to the basic principle of fixed beamforming, the closer the enhancement direction is to the sound source position, the better the directivity of the resulting beam and the better the enhancement effect. Therefore, the pre-enhanced voice signal corresponding to the sound zone where the sound source is located, or to a zone adjacent to it, contains the most speech information.
Based on the above understanding, in the embodiment of the present application, after the first enhancing unit 12 generates the pre-enhanced voice signals, it may send them to the wake-up unit 15, and the wake-up unit 15 determines the target voice signal containing the wake-up word from the pre-enhanced voice signals. The sound zone corresponding to the target voice signal is the sound zone where the sound source of the voice signal is located.
The wake-up word refers to a specific word or phrase that can trigger the terminal device from a standby state into a voice interaction state. Determining the target voice signal containing the wake-up word from the pre-enhanced voice signals may proceed as follows:
first, a neural network model, such as a deep neural network model, a convolutional neural network model, etc., is used to score the similarity between the signal characteristics of each pre-enhanced speech signal and the preset signal characteristics.
The preset signal characteristic may be a signal characteristic of a wake-up voice signal corresponding to the wake-up word. The scoring result may reflect whether the pre-enhanced speech signal includes a wake-up word.
Then, a target speech signal is determined based on the scoring result.
In this embodiment of the application, a pre-enhanced voice signal whose score is higher than the preset threshold may be determined to contain the wake-up word, and among those signals, the one with the highest score may be determined as the target voice signal. The preset threshold is the critical value above which a pre-enhanced voice signal can be considered to contain the wake-up word.
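The selection rule above (a score must exceed the threshold, and the highest such score wins) can be sketched as follows. The function name, the 0-to-1 score range in the examples, and the idea of returning None to trigger re-acquisition are illustrative assumptions:

```python
def select_target_signal(scores, threshold):
    """Select the target voice signal from per-zone wake-word scores.
    Returns the index of the highest-scoring pre-enhanced signal if its
    score exceeds the threshold, otherwise None (the caller then keeps
    acquiring new voice signals through the microphone array).
    On a tie the lowest index wins, i.e. any one of the tied zones may
    serve as the target zone, consistent with the text."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best if scores[best] > threshold else None
```

For example, with scores [0.1, 0.9, 0.3] and threshold 0.5 the signal of zone index 1 is selected, while all-low scores yield None and a new acquisition round.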
In a possible case, if the score of each pre-enhanced speech signal is lower than the preset threshold, it may be determined that no wake-up word is included in each pre-enhanced speech signal. New speech signals may then be acquired by the microphone array 11 until the score of at least one of the generated respective pre-enhanced speech signals is above the above-mentioned preset threshold.
Step 104, determining the sound zone corresponding to the target voice signal as the target sound zone where the sound source generating the voice signal is located.
According to step 103, the sound zone corresponding to the target speech signal is the sound zone where the sound source of the speech signal is located. In this embodiment, after the wake-up unit 15 determines the target voice signal containing the wake-up word, the sound zone corresponding to the target voice signal may be determined as the target sound zone where the sound source is located.
Specifically, if a single pre-enhanced voice signal has the highest score, the sound zone corresponding to it is determined as the target sound zone where the sound source generating the voice signal is located. If several pre-enhanced voice signals share the same highest score, the sound zone corresponding to any one of them may be determined as the target sound zone.
For ease of understanding, the six-microphone circular array is still taken as an example to explain the actual scenarios corresponding to these two situations.
As shown in fig. 5, if the sound source of the voice signal is located at A, it is likely that only the pre-enhanced voice signal corresponding to sound zone 1 has the highest score. In this case, there is one target voice signal, and sound zone 1, which corresponds to it, is the target sound zone where the sound source generating the voice signal is located.
If the sound source of the voice signal is located at B, on the boundary of sound zone 2 and sound zone 3, the scores of the pre-enhanced voice signals corresponding to sound zone 2 and sound zone 3 may be equal and highest. In this case, the sound zone corresponding to either of these pre-enhanced voice signals may be determined as the target sound zone where the sound source generating the voice signal is located.
Step 105, locating the sound source generating the voice signal within the target sound zone, and directionally enhancing the voice signal according to the localization information of the sound source.
In the embodiment of the present application, after the wake-up unit 15 determines the target sound zone where the sound source generating the voice signal is located, it may send the target sound zone information to the positioning unit 13. The positioning unit 13 may locate the sound source within the target sound zone, based on the voice signal received in step 101, using a direction-of-arrival (DOA) estimation algorithm. This narrows the localization range and improves the accuracy of sound source localization.
Having obtained accurate sound source localization information, the positioning unit 13 may send it to the second enhancement unit 14. The second enhancement unit 14 may use an adaptive beamforming (ABF) algorithm to directionally enhance the voice signal based on this localization information, thereby improving the speech enhancement effect while suppressing noise.
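One way to sketch localization restricted to the target sound zone is a steered-response-power search over candidate angles inside the zone only. This stands in for the DOA estimation the patent names without fixing a particular algorithm; the function name and grid parameters are assumptions:

```python
import numpy as np

def doa_in_zone(frames, mic_xy, zone_lo, zone_hi, fs=16000, c=343.0, step=1.0):
    """Grid-search localization restricted to the target sound zone:
    steer a delay-and-sum beam at each candidate angle in
    [zone_lo, zone_hi] (degrees; zone_lo < zone_hi assumed, wrap-around
    zones would need splitting) and return the angle with maximum output
    power. Restricting the grid to one zone shrinks the search space and
    avoids locking onto interfering sources outside the zone."""
    spec = np.fft.rfft(frames, axis=1)
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / fs)
    best_angle, best_power = zone_lo, -np.inf
    for ang in np.arange(zone_lo, zone_hi + step, step):
        u = np.array([np.cos(np.deg2rad(ang)), np.sin(np.deg2rad(ang))])
        delays = mic_xy @ u / c                 # per-mic arrival-time lead (s)
        align = np.exp(-2j * np.pi * freqs[None, :] * delays[:, None])
        power = np.sum(np.abs((spec * align).mean(axis=0)) ** 2)
        if power > best_power:
            best_angle, best_power = ang, power
    return best_angle
```

The returned angle would then drive the adaptive beamformer of the second enhancement unit; searching only the target zone is what gives the accuracy gain described above.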
In the embodiment of the application, the voice signal collected by the microphone array is first acquired. Then, the voice signal is pre-enhanced according to the sound zone parameters of each sound zone to obtain a pre-enhanced voice signal corresponding to each sound zone. Next, a target voice signal containing a wake-up word is determined from the pre-enhanced voice signals, and the sound zone corresponding to the target voice signal is determined as the target sound zone where the sound source generating the voice signal is located. Finally, the sound source is located within the target sound zone, and the voice signal is directionally enhanced according to the localization information of the sound source. In this scheme, fixed beam enhancement over preset sound zones is performed in the wake-up stage, and sound source localization is restricted to the preset sound zone range, which improves the reliability of the localization result and the speech enhancement performance in the recognition stage.
Fig. 6 is a flowchart of another speech enhancement method according to an embodiment of the present application. As shown in fig. 6, after the step 105, the speech enhancement method provided in the embodiment of the present application may further include:
step 201, sending the directionally enhanced voice signal to a cloud server.
In this embodiment, the second enhancing unit 14 may directly send the directionally enhanced voice signal to the cloud server 20. Alternatively, as shown in fig. 2, the second enhancement unit 14 may first send the directionally enhanced speech signal to the first speech processing unit 17 to implement further dereverberation; the dereverberated speech signal is then sent to the second speech processing unit 18 for further noise suppression. The processed voice signal is then sent by the second voice processing unit 18 to the cloud server 20.
After receiving the voice signal, the voice recognition unit 21 of the cloud server 20 may perform voice recognition, and trigger voice interaction according to a voice recognition result. Specifically, natural language understanding can be triggered according to the speech recognition result, then speech synthesis is performed according to the understanding result and the service logic of speech interaction, and speech interaction is realized.
Fig. 7 is a schematic structural diagram of a speech enhancement apparatus according to an embodiment of the present application. As shown in fig. 7, a speech enhancement apparatus provided in an embodiment of the present application may include: an acquisition module 61, a pre-enhancement module 62, a first determination module 63, a second determination module 64, and an execution module 65.
The obtaining module 61 is configured to obtain the voice signal collected by the microphone array.
The pre-enhancement module 62 is configured to pre-enhance the voice signal according to the sound zone parameters of each sound zone, so as to obtain a pre-enhanced voice signal corresponding to each sound zone; each sound zone is divided in advance according to the azimuth information of the microphones included in the microphone array.
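The pre-division of sound zones can be sketched as an even split of the array's 360° signal acquisition area, with each zone's center line serving as that zone's sound zone parameter. The zone count and the dictionary layout below are assumptions for illustration, not taken from the application.

```python
def divide_sound_zones(num_zones):
    """Split the 360-degree signal acquisition area into equal sound zones.

    Returns one dict per zone with its angular span and the center-line
    azimuth used as that zone's steering direction (its 'zone parameter').
    """
    width = 360.0 / num_zones
    zones = []
    for i in range(num_zones):
        start = i * width
        zones.append({
            "zone_id": i,
            "start_deg": start,
            "end_deg": start + width,
            "center_deg": start + width / 2.0,  # center line of the zone
        })
    return zones
```

A real division would also account for the microphones' relative positions (e.g., a linear array only covers a 180° half-plane), which this even split ignores.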
The first determining module 63 is configured to determine a target voice signal containing the wake-up word from the pre-enhanced voice signals.

The second determining module 64 is configured to determine the sound zone corresponding to the target voice signal as the target sound zone in which the sound source generating the voice signal is located.

The execution module 65 is configured to locate the sound source generating the voice signal within the target sound zone and to directionally enhance the voice signal according to the positioning information of the sound source.
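Locating the sound source within the target sound zone can be realized, for example, by estimating the time difference of arrival (TDOA) for a microphone pair and searching only lags consistent with that zone. The GCC-PHAT sketch below is one such method, offered as an illustration rather than the localization algorithm of the application.

```python
import numpy as np

def gcc_phat_tdoa(sig_ref, sig_other, fs, max_tau):
    """GCC-PHAT time-difference-of-arrival estimate for one microphone pair.

    Only lags with |tau| <= max_tau are searched, which is how the search
    can be restricted to directions inside the target sound zone. A positive
    result means sig_other lags sig_ref.
    """
    n = len(sig_ref) * 2  # zero-pad to avoid circular-correlation wrap-around
    ref = np.fft.rfft(sig_ref, n=n)
    other = np.fft.rfft(sig_other, n=n)
    cross = np.conj(ref) * other
    cross /= np.abs(cross) + 1e-12      # PHAT weighting: keep only the phase
    cc = np.fft.irfft(cross, n=n)
    max_shift = int(max_tau * fs)
    # Stitch negative lags (tail) and positive lags (head) around lag zero.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (int(np.argmax(np.abs(cc))) - max_shift) / fs
```

From the TDOA and the known microphone spacing, the azimuth inside the target zone follows from simple geometry, and that azimuth can then steer the directional enhancement.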
In a specific implementation, when determining the target voice signal containing the wake-up word from the pre-enhanced voice signals, the first determining module 63 is specifically configured to score, using a neural network model, the similarity between the signal features of each pre-enhanced voice signal and preset signal features, and to determine the target voice signal according to the scoring result.
In a specific implementation, the first determining module 63 determines the target voice signal according to the scoring result by determining, as the target voice signal, the pre-enhanced voice signal whose score is the highest among the pre-enhanced voice signals and is higher than a preset threshold.
In a specific implementation, if the first determining module 63 determines that the scores of all the pre-enhanced voice signals are lower than the preset threshold, a new voice signal is acquired through the microphone array until the score of at least one of the newly generated pre-enhanced voice signals is higher than the preset threshold.
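The threshold-and-maximum selection rule applied by the first determining module 63 can be sketched as follows; the `scores` input stands in for the neural network model's similarity outputs and is not the model used in the application.

```python
import numpy as np

def select_target_index(scores, threshold):
    """Return the index of the highest-scoring pre-enhanced signal if that
    score exceeds the preset threshold; return None when no signal passes,
    signalling the caller to acquire a new voice signal from the array."""
    best = int(np.argmax(scores))
    if scores[best] <= threshold:
        return None
    return best
```

A `None` return corresponds to the reacquisition path: no pre-enhanced signal contained the wake-up word with sufficient confidence, so the array keeps collecting new voice signals.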
In a specific implementation, after directionally enhancing the voice signal according to the positioning information of the sound source, the execution module 65 is further configured to send the directionally enhanced voice signal to a cloud server, so that the cloud server performs voice recognition on the directionally enhanced voice signal and conducts voice interaction according to the recognition result.
In this embodiment of the application, the obtaining module 61 first obtains the voice signal collected by the microphone array. The pre-enhancement module 62 then pre-enhances the voice signal according to the sound zone parameters of each sound zone, obtaining a pre-enhanced voice signal corresponding to each sound zone. Next, the first determining module 63 determines a target voice signal containing the wake-up word from the pre-enhanced voice signals, and the second determining module 64 determines the sound zone corresponding to the target voice signal as the target sound zone in which the sound source generating the voice signal is located. Finally, the execution module 65 locates the sound source within the target sound zone and directionally enhances the voice signal according to the positioning information of the sound source. In this way, even under interference from multiple sound sources, the position of the target sound source is located accurately and the voice enhancement performance is improved.
Fig. 8 is a schematic diagram of an electronic device according to an embodiment of the present application. As shown in Fig. 8, the electronic device may include at least one processor and at least one memory communicatively coupled to the processor, where the memory stores program instructions executable by the processor, and the processor calls the program instructions to execute the speech enhancement method provided in the embodiments of the present application.
The electronic device may be a voice enhancement device, and the embodiment does not limit the specific form of the electronic device.
Fig. 8 illustrates a block diagram of an exemplary electronic device suitable for implementing embodiments of the present application. The electronic device shown in Fig. 8 is only an example and should not limit the functions or the scope of use of the embodiments of the present application.
As shown in fig. 8, the electronic device is in the form of a general purpose computing device. Components of the electronic device may include, but are not limited to: one or more processors 410, a memory 430, a communication interface 420, and a communication bus 440 that connects the various system components (including the memory 430 and the processors 410).
Communication bus 440 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA (EISA) bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
Electronic devices typically include a variety of computer system readable media. Such media may be any available media that is accessible by the electronic device and includes both volatile and nonvolatile media, removable and non-removable media.
Memory 430 may include computer system readable media in the form of volatile memory, such as random access memory (RAM) and/or cache memory. The electronic device may further include other removable/non-removable, volatile/non-volatile computer system storage media. Although not shown in Fig. 8, a magnetic disk drive for reading from and writing to a removable, non-volatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, non-volatile optical disk (e.g., a compact disc read-only memory (CD-ROM), a digital versatile disc read-only memory (DVD-ROM), or other optical media) may be provided. In these cases, each drive may be connected to the communication bus 440 by one or more data media interfaces. Memory 430 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of the embodiments of the application.
A program/utility having a set (at least one) of program modules may be stored in memory 430; such program modules include, but are not limited to, an operating system, one or more application programs, other program modules, and program data, each of which, or some combination of which, may include an implementation of a network environment. The program modules generally perform the functions and/or methods of the embodiments described herein.
The electronic device may also communicate with one or more external devices (e.g., a keyboard, a pointing device, or a display), with one or more devices that enable a user to interact with the electronic device, and/or with any device (e.g., a network card or modem) that enables the electronic device to communicate with one or more other computing devices. Such communication may occur via the communication interface 420. Furthermore, the electronic device may communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) via a network adapter (not shown in Fig. 8) that communicates with the other modules of the electronic device over the communication bus 440. It should be appreciated that, although not shown in Fig. 8, other hardware and/or software modules may be used in conjunction with the electronic device, including but not limited to microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
The processor 410 executes various functional applications and data processing, such as implementing a speech enhancement method provided by an embodiment of the present application, by executing programs stored in the memory 430.
An embodiment of the present application further provides a computer-readable storage medium, where the computer-readable storage medium stores computer instructions, and the computer instructions enable the computer to execute the speech enhancement method provided in the embodiment of the present application.
The computer-readable storage medium described above may be any combination of one or more computer-readable media. A computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a flash memory, an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof.
A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out the operations of the present application may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" programming language or similar languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the latter case, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.
It should be noted that the terminal according to the embodiments of the present application may include, but is not limited to, a personal computer (PC), a personal digital assistant (PDA), a wireless handheld device, a tablet computer, a mobile phone, an MP3 player, an MP4 player, and the like.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions in actual implementation, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the scope of protection of the present application.

Claims (10)

1. A method of speech enhancement, comprising:
acquiring a voice signal acquired by a microphone array;
according to the sound zone parameters of each sound zone, respectively pre-enhancing the voice signals to obtain pre-enhanced voice signals respectively corresponding to each sound zone; wherein the sound zones are divided in advance according to azimuth information of microphones included in the microphone array;
determining a target voice signal containing a wake-up word from each of the pre-enhanced voice signals;
determining a sound zone corresponding to the target voice signal as a target sound zone where a sound source generating the voice signal is located;
and positioning a sound source generating the voice signal in the target sound area, and directionally enhancing the voice signal according to the positioning information of the sound source.
2. The method of claim 1, wherein the azimuth information of the microphones comprises relative position parameters of the microphones in the microphone array, and wherein pre-dividing the sound zones according to the azimuth information of the microphones included in the microphone array comprises:
dividing a signal acquisition area of the microphone array into a plurality of sound zones according to the relative position parameters of the microphones included in the microphone array, and determining the sound zone parameter of each sound zone according to the center-line position of that sound zone.
3. The method of claim 1, wherein determining a target speech signal containing a wake-up word from each of the pre-enhanced speech signals comprises:
scoring the similarity between the signal characteristics of each pre-enhanced voice signal and preset signal characteristics by using a neural network model; the preset signal characteristics are signal characteristics of a wake-up voice signal corresponding to a wake-up word;
and determining the target voice signal according to the scoring result.
4. The method of claim 3, wherein determining the target speech signal based on the scoring comprises:
and determining the pre-enhanced voice signal with the score higher than a preset threshold value and the score highest in each pre-enhanced voice signal as a target voice signal.
5. The method according to claim 4, wherein if the score of each of the pre-enhanced speech signals is below the preset threshold, the method further comprises:
and acquiring new voice signals through the microphone array until the score of at least one pre-enhanced voice signal in each generated pre-enhanced voice signal is higher than the preset threshold value.
6. The method according to claim 1, wherein after directionally enhancing the speech signal according to the localization information of the sound source, the method further comprises:
and sending the directionally enhanced voice signal to a cloud server so that the cloud server performs voice recognition according to the directionally enhanced voice signal and performs voice interaction according to a voice recognition result.
7. A speech enhancement apparatus, comprising:
the acquisition module is used for acquiring the voice signals acquired by the microphone array;
the pre-enhancement module is used for respectively pre-enhancing the voice signals according to the sound zone parameters of each sound zone to obtain pre-enhanced voice signals corresponding to each sound zone; wherein the respective sound zones are determined from azimuth information of respective microphones comprised by the microphone array;
the first determining module is used for determining a target voice signal containing a wake-up word from each pre-enhanced voice signal;
the second determining module is used for determining the sound zone corresponding to the target voice signal as the target sound zone where the sound source generating the voice signal is located;
and the execution module is used for positioning the sound source generating the voice signal in the target sound area and directionally enhancing the voice signal according to the positioning information of the sound source.
8. The apparatus of claim 7, wherein the execution module, after performing directional enhancement on the speech signal according to the positioning information of the sound source, is further configured to:
and sending the directionally enhanced voice signal to a cloud server so that the cloud server performs voice recognition according to the directionally enhanced voice signal and performs voice interaction according to a voice recognition result.
9. An electronic device, comprising:
at least one processor; and
at least one memory communicatively coupled to the processor, wherein:
the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the method of any of claims 1 to 6.
10. A computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1 to 6.
CN202110257165.7A 2021-03-09 2021-03-09 Speech enhancement method, electronic device, and storage medium Pending CN113053368A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110257165.7A CN113053368A (en) 2021-03-09 2021-03-09 Speech enhancement method, electronic device, and storage medium

Publications (1)

Publication Number Publication Date
CN113053368A true CN113053368A (en) 2021-06-29

Family

ID=76510851

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110257165.7A Pending CN113053368A (en) 2021-03-09 2021-03-09 Speech enhancement method, electronic device, and storage medium

Country Status (1)

Country Link
CN (1) CN113053368A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113782024A (en) * 2021-09-27 2021-12-10 上海互问信息科技有限公司 Method for improving automatic voice recognition accuracy rate after voice awakening
CN114500733A (en) * 2022-01-21 2022-05-13 维沃移动通信有限公司 Capacitive sound control method, device, equipment and medium
WO2023138632A1 (en) * 2022-01-24 2023-07-27 维沃移动通信有限公司 Voice recording method and apparatus, and electronic device
CN117854526A (en) * 2024-03-08 2024-04-09 深圳市声扬科技有限公司 Speech enhancement method, device, electronic equipment and computer readable storage medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10051366B1 (en) * 2017-09-28 2018-08-14 Sonos, Inc. Three-dimensional beam forming with a microphone array
CN108538305A (en) * 2018-04-20 2018-09-14 百度在线网络技术(北京)有限公司 Audio recognition method, device, equipment and computer readable storage medium
CN108597507A (en) * 2018-03-14 2018-09-28 百度在线网络技术(北京)有限公司 Far field phonetic function implementation method, equipment, system and storage medium
CN109949810A (en) * 2019-03-28 2019-06-28 华为技术有限公司 A kind of voice awakening method, device, equipment and medium
CN110556103A (en) * 2018-05-31 2019-12-10 阿里巴巴集团控股有限公司 Audio signal processing method, apparatus, system, device and storage medium
CN110646763A (en) * 2019-10-10 2020-01-03 出门问问信息科技有限公司 Sound source positioning method and device based on semantics and storage medium
CN110673096A (en) * 2019-09-30 2020-01-10 北京地平线机器人技术研发有限公司 Voice positioning method and device, computer readable storage medium and electronic equipment
CN111599371A (en) * 2020-05-19 2020-08-28 苏州奇梦者网络科技有限公司 Voice adding method, system, device and storage medium
CN112216295A (en) * 2019-06-25 2021-01-12 大众问问(北京)信息科技有限公司 Sound source positioning method, device and equipment



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination