CN113053368A - Speech enhancement method, electronic device, and storage medium - Google Patents

Info

Publication number
CN113053368A
CN113053368A · Application CN202110257165.7A
Authority
CN
China
Prior art keywords: voice signal, sound, voice, signal, target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110257165.7A
Other languages
Chinese (zh)
Inventor
夏洁
方思敏
罗丽云
李开
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
RDA Microelectronics Shanghai Co Ltd
RDA Microelectronics Inc
Original Assignee
RDA Microelectronics Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by RDA Microelectronics Shanghai Co Ltd filed Critical RDA Microelectronics Shanghai Co Ltd
Priority to CN202110257165.7A priority Critical patent/CN113053368A/en
Publication of CN113053368A publication Critical patent/CN113053368A/en
Pending legal-status Critical Current

Classifications

    • G10L15/20: Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/30: Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G10L17/24: Interactive procedures; man-machine interfaces; the user being prompted to utter a password or a predefined phrase
    • G10L21/0208: Noise filtering
    • G10L25/30: Speech or voice analysis characterised by the analysis technique using neural networks
    • G10L2021/02082: Noise filtering where the noise is echo or reverberation of the speech

Abstract

The application provides a speech enhancement method, an electronic device, and a storage medium, relating to the technical field of speech processing. The speech enhancement method includes the following steps. First, a voice signal collected by a microphone array is acquired. Then, the voice signal is pre-enhanced according to the sound zone parameters of each sound zone to obtain a pre-enhanced voice signal corresponding to each sound zone. Next, a target voice signal containing a wake-up word is determined from the pre-enhanced voice signals, and the sound zone corresponding to the target voice signal is determined as the target sound zone where the sound source generating the voice signal is located. Finally, the sound source is located within the target sound zone, and the voice signal is directionally enhanced according to the localization information of the sound source. Using the per-zone pre-enhanced voice signals in the wake-up stage improves wake-up performance, so the position of the target sound source can be accurately located even under interference from multiple sound sources, improving the voice enhancement performance in the recognition stage.

Description

Speech enhancement method, electronic device, and storage medium
[ technical field ]
The present application relates to the field of speech processing technologies, and in particular, to a speech enhancement method, an electronic device, and a storage medium.
[ background of the invention ]
In some scenarios involving voice interaction, such as smart speakers, smart cars, and smart robots, it is generally necessary to perform voice signal processing on a voice signal input by a user. The voice signal processing mainly comprises the steps of determining the incoming wave direction of a target sound source and utilizing a beam forming technology to carry out beam enhancement on voice signals in the incoming wave direction, so that the purposes of enhancing effective signals and suppressing noise and interference are achieved.
Currently, when determining the incoming wave direction of a target sound source, the target sound source is mainly located by a direction of arrival estimation technique. However, when there is interference from multiple sound sources in the environment, the current technology cannot accurately locate the direction of the target sound source, which causes the beam generated in the voice enhancement process to diverge, thereby affecting the subsequent voice interaction service.
[ summary of the invention ]
The embodiment of the application provides a voice enhancement method, electronic equipment and a storage medium, so that the position of a target sound source is accurately positioned under the condition of interference of a plurality of sound sources, and the voice enhancement performance in the awakening and recognition stages is improved.
In a first aspect, an embodiment of the present application provides a speech enhancement method, where the method includes: acquiring a voice signal acquired by a microphone array; according to the sound zone parameters of each sound zone, respectively pre-enhancing the voice signals to obtain pre-enhanced voice signals respectively corresponding to each sound zone; wherein the sound zones are divided in advance according to azimuth information of microphones included in the microphone array; determining a target voice signal containing a wake-up word from each of the pre-enhanced voice signals; determining a sound zone corresponding to the target voice signal as a target sound zone where a sound source generating the voice signal is located; and positioning a sound source generating the voice signal in the target sound area, and directionally enhancing the voice signal according to the positioning information of the sound source.
In one possible implementation manner, the azimuth information of the microphone includes: a relative position parameter of a microphone in the microphone array; pre-dividing each sound area according to azimuth information of each microphone contained in the microphone array, wherein the pre-dividing comprises the following steps: dividing a signal acquisition area of the microphone array into a plurality of sound areas according to relative position parameters of all microphones contained in the microphone array, and determining sound area parameters of the sound areas according to the central line positions of the sound areas.
In one possible implementation manner, determining a target speech signal containing a wake-up word from each of the pre-enhanced speech signals includes: scoring the similarity between the signal characteristics of each pre-enhanced voice signal and preset signal characteristics by using a neural network model; the preset signal characteristics are signal characteristics of a wake-up voice signal corresponding to a wake-up word; and determining the target voice signal according to the scoring result.
In one possible implementation manner, determining the target speech signal according to the scoring result includes: and determining the pre-enhanced voice signal with the score higher than a preset threshold value and the score highest in each pre-enhanced voice signal as a target voice signal.
In one possible implementation manner, if the score of each of the pre-enhanced speech signals is lower than the preset threshold, the method further includes: and acquiring new voice signals through the microphone array until the score of at least one pre-enhanced voice signal in each generated pre-enhanced voice signal is higher than the preset threshold value.
In one possible implementation manner, after directionally enhancing the speech signal according to the positioning information of the sound source, the method further includes: and sending the directionally enhanced voice signal to a cloud server so that the cloud server performs voice recognition according to the directionally enhanced voice signal and performs voice interaction according to a voice recognition result.
In a second aspect, an embodiment of the present application provides a speech enhancement apparatus, including: the acquisition module is used for acquiring the voice signals acquired by the microphone array; the pre-enhancement module is used for respectively pre-enhancing the voice signals according to the sound zone parameters of each sound zone to obtain pre-enhanced voice signals corresponding to each sound zone; wherein the sound zones are divided in advance according to azimuth information of microphones included in the microphone array; the first determining module is used for determining a target voice signal containing a wake-up word from each pre-enhanced voice signal; the second determining module is used for determining the sound zone corresponding to the target voice signal as the target sound zone where the sound source generating the voice signal is located; and the execution module is used for positioning the sound source generating the voice signal in the target sound area and directionally enhancing the voice signal according to the positioning information of the sound source.
In one possible implementation manner, after directionally enhancing the voice signal according to the positioning information of the sound source, the execution module is further configured to: and sending the directionally enhanced voice signal to a cloud server so that the cloud server performs voice recognition according to the directionally enhanced voice signal and performs voice interaction according to a voice recognition result.
In a third aspect, an embodiment of the present application provides an electronic device, including: at least one processor; and at least one memory communicatively coupled to the processor, wherein: the memory stores program instructions executable by the processor, and the processor, when executing the program instructions, is capable of performing the method of the first aspect.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium storing computer instructions for causing a computer to perform the method according to the first aspect.
In the above technical scheme, a voice signal collected by a microphone array is first acquired. Then, the voice signal is pre-enhanced according to the sound zone parameters of each sound zone to obtain a pre-enhanced voice signal corresponding to each sound zone. Next, a target voice signal containing a wake-up word is determined from the pre-enhanced voice signals, and the sound zone corresponding to the target voice signal is determined as the target sound zone where the sound source generating the voice signal is located. Finally, the sound source is located within the target sound zone, and the voice signal is directionally enhanced according to the localization information of the sound source. In this scheme, beam enhancement is applied in both the wake-up and recognition stages, and sound source localization is restricted to a preset sound zone range, which improves the reliability of the localization result.
[ description of the drawings ]
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a block diagram of a speech enhancement system according to an embodiment of the present application;
FIG. 2 is a block diagram of another speech enhancement system provided in an embodiment of the present application;
fig. 3 is a flowchart of a speech enhancement method according to an embodiment of the present application;
fig. 4 is a schematic diagram of sound zone division in a speech enhancement method according to an embodiment of the present application;
fig. 5 is another schematic diagram of sound zone division in a speech enhancement method according to an embodiment of the present application;
FIG. 6 is a flow chart of another speech enhancement method provided by an embodiment of the present application;
fig. 7 is a schematic structural diagram of a speech enhancement apparatus according to an embodiment of the present application;
fig. 8 is a schematic view of an electronic device according to an embodiment of the present application.
[ detailed description ]
For better understanding of the technical solutions of the present application, the following detailed descriptions of the embodiments of the present application are provided with reference to the accompanying drawings.
It should be understood that the embodiments described are only a few embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terminology used in the embodiments of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the examples of this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
The embodiment of the application can provide a voice enhancement system, and the voice enhancement system can be positioned in terminal equipment with a voice interaction function, such as an intelligent sound box, an intelligent automobile, an intelligent robot and the like. The speech enhancement system provided by the embodiment of the application can be used for executing the speech enhancement method provided by the embodiment of the application.
Fig. 1 is a block diagram of a speech enhancement system according to an embodiment of the present application. As shown in fig. 1, the speech enhancement system 10 may include: a microphone array 11, a first enhancing unit 12, a positioning unit 13, a second enhancing unit 14 and a wake-up unit 15.
The microphone array 11 is connected to the first enhancement unit 12, the positioning unit 13, and the second enhancement unit 14, respectively. The first enhancement unit 12 is connected to a wake-up unit 15. The wake-up unit 15 is connected to the positioning unit 13. The positioning unit 13 is connected to the second enhancement unit 14.
Further, the speech enhancement system 10 provided in the embodiment of the present application may be connected to the cloud server 20, so that the enhanced speech signal may be uploaded to the cloud server 20, and the recognition unit 21 of the cloud server 20 performs speech recognition, and triggers speech interaction according to a recognition result.
Fig. 2 is a block diagram of another speech enhancement system according to an embodiment of the present application. Compared to fig. 1, the speech enhancement system shown in fig. 2 may further comprise an echo cancellation unit 16, a first speech processing unit 17 and a second speech processing unit 18.
The input end of the echo cancellation unit 16 is connected to the microphone array 11, and the output end is connected to the first enhancement unit 12, the positioning unit 13, and the second enhancement unit 14, respectively. The echo cancellation unit 16 may perform echo cancellation on the voice signal collected by the microphone array 11. The first speech processing unit 17 is connected to the second enhancement unit 14. The second speech processing unit 18 is connected to the first speech processing unit 17.
Fig. 3 is a flowchart of a speech enhancement method according to an embodiment of the present application. As shown in fig. 3, the speech enhancement method may include:
Step 101, acquiring a voice signal collected by a microphone array.
In the embodiment of the present application, as shown in fig. 1, after the microphone array 11 collects the voice signals, the voice signals may be sent to the first enhancing unit 12, the positioning unit 13, and the second enhancing unit 14, respectively.
Step 102, pre-enhancing the voice signals respectively according to the sound zone parameters of each sound zone to obtain pre-enhanced voice signals corresponding to each sound zone.
In the embodiment of the present application, to accurately determine the position of the sound source, the signal acquisition area of the microphone array 11 may be divided into a plurality of non-overlapping sound zones according to the azimuth information of each microphone in the microphone array 11.
Specifically, the signal collection area of the microphone array 11 may be divided into a plurality of sound zones according to the relative position parameters of each microphone in the microphone array 11. The division may be an even division, and the number of sound zones obtained may be equal to the number of microphones, with the centerline of each sound zone corresponding to one microphone. The sound zone parameters of a sound zone can be determined according to its centerline position and may include the sound zone direction.
In the embodiment of the present application, the more microphones the microphone array 11 contains, the more sound zones can be divided and the smaller the range of each sound zone; accordingly, the higher the accuracy of sound source localization and the better the voice signal enhancement effect.
The division of the sound zones will be described by taking a six-microphone circular array as an example.
As shown in fig. 4, the signal acquisition area of the microphone array 30 can be equally divided into 6 non-overlapping sub-areas according to the relative positions of the 6 microphones included in the microphone array 30. Each sub-area is a sound zone, and the 6 sound zones are: sound zone 1, sound zone 2, sound zone 3, sound zone 4, sound zone 5, and sound zone 6. The centerline of each sound zone corresponds to one microphone, and the direction of the centerline is the sound zone direction of that zone.
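The even division described above can be sketched as follows. This is a minimal illustration; the function name `divide_sound_zones` and the angle convention (centerline of zone 1 at 0 degrees) are assumptions, not part of the patent:

```python
def divide_sound_zones(num_mics):
    """Evenly divide the 360-degree signal acquisition area of a circular
    microphone array into non-overlapping sound zones, one per microphone.
    Each zone is described by (centerline_deg, lower_deg, upper_deg), where
    the centerline is aligned with one microphone and gives the zone direction."""
    span = 360.0 / num_mics
    zones = []
    for i in range(num_mics):
        center = i * span                    # centerline aligned with microphone i
        lower = (center - span / 2) % 360.0  # zone boundary on one side
        upper = (center + span / 2) % 360.0  # zone boundary on the other side
        zones.append((center, lower, upper))
    return zones

# For the six-microphone array of fig. 4: centerlines 60 degrees apart
zones = divide_sound_zones(6)
```

For the six-microphone array this yields six equal 60-degree zones, matching the six sub-areas shown in fig. 4.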
In order to determine the sound zone where the sound source of the voice signal is located, the first enhancing unit 12 may pre-enhance the voice signal according to the sound zone parameters of each sound zone to obtain the pre-enhanced voice signal corresponding to each sound zone. Specifically, a fixed beamforming (FBF) algorithm may be used to pre-enhance the acquired voice signal in each sound zone direction, so as to weaken noise and interference in the voice signal and obtain pre-enhanced voice signals corresponding to the different sound zone directions.
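For illustration, such a fixed beamformer can be sketched as a frequency-domain delay-and-sum beamformer steered toward a sound zone centerline. This is a minimal free-field, far-field sketch; the patent does not specify which FBF variant is used, and the function and parameter names are assumptions:

```python
import numpy as np

def delay_and_sum(frames, mic_xy, steer_deg, fs=16000, c=343.0):
    """Fixed (delay-and-sum) beamformer: steer the array toward one
    sound zone centerline by compensating each microphone's plane-wave
    delay in the frequency domain, then averaging the channels.
    frames: (num_mics, num_samples) time-domain signals.
    mic_xy: (num_mics, 2) microphone positions in metres."""
    num_mics, n = frames.shape
    theta = np.deg2rad(steer_deg)
    u = np.array([np.cos(theta), np.sin(theta)])  # unit vector toward the source
    delays = mic_xy @ u / c                       # arrival-time lead of each mic (s)
    spec = np.fft.rfft(frames, axis=1)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    # remove each channel's lead so the steered direction adds coherently
    align = np.exp(-2j * np.pi * freqs[None, :] * delays[:, None])
    return np.fft.irfft((spec * align).mean(axis=0), n=n)
```

Running this once per sound zone direction on the same captured frames yields the per-zone pre-enhanced voice signals that are scored in the wake-up stage.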
Step 103, determining a target voice signal containing the awakening word from each pre-enhanced voice signal.
Since each pre-enhanced voice signal is enhanced according to different sound zone parameters, the signal strength of each pre-enhanced voice signal and the speech information it contains are different.
According to the basic principle of fixed beamforming, the closer the enhancement direction is to the sound source position, the better the directivity of the resulting beam and the better the enhancement effect. Therefore, the pre-enhanced voice signal corresponding to the sound zone where the sound source is located, or to a zone adjacent to it, contains the most speech information.
Based on the above understanding, in the embodiment of the present application, after the first enhancing unit 12 generates the pre-enhanced voice signals, it may send them to the wake-up unit 15, and the wake-up unit 15 determines the target voice signal containing the wake-up word from the pre-enhanced voice signals. The sound zone corresponding to the target voice signal is the sound zone where the sound source of the voice signal is located.
The wake-up word refers to a specific word or phrase that can trigger the terminal device from a standby state into a voice interaction state. Determining the target voice signal containing the wake-up word from the pre-enhanced voice signals may proceed as follows:
first, a neural network model, such as a deep neural network model, a convolutional neural network model, etc., is used to score the similarity between the signal characteristics of each pre-enhanced speech signal and the preset signal characteristics.
The preset signal characteristic may be a signal characteristic of a wake-up voice signal corresponding to the wake-up word. The scoring result may reflect whether the pre-enhanced speech signal includes a wake-up word.
Then, a target speech signal is determined based on the scoring result.
In this embodiment of the application, a pre-enhanced voice signal whose score is higher than the preset threshold may be determined to contain the wake-up word, and among those signals, the one with the highest score may be determined as the target voice signal. The preset threshold is the critical value above which a pre-enhanced voice signal can be considered to contain the wake-up word.
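The selection rule above (a score must exceed the threshold, and the highest such score wins) can be sketched as follows. The function name, the 0-to-1 score range in the examples, and the idea of returning None to trigger re-acquisition are illustrative assumptions:

```python
def select_target_signal(scores, threshold):
    """Select the target voice signal from per-zone wake-word scores.
    Returns the index of the highest-scoring pre-enhanced signal if its
    score exceeds the threshold, otherwise None (the caller then keeps
    acquiring new voice signals through the microphone array).
    On a tie the lowest index wins, i.e. any one of the tied zones may
    serve as the target zone, consistent with the text."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best if scores[best] > threshold else None
```

For example, with scores [0.1, 0.9, 0.3] and threshold 0.5 the signal of zone index 1 is selected, while all-low scores yield None and a new acquisition round.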
In a possible case, if the score of each pre-enhanced speech signal is lower than the preset threshold, it may be determined that no wake-up word is included in each pre-enhanced speech signal. New speech signals may then be acquired by the microphone array 11 until the score of at least one of the generated respective pre-enhanced speech signals is above the above-mentioned preset threshold.
Step 104, determining the sound zone corresponding to the target voice signal as the target sound zone where the sound source generating the voice signal is located.
According to step 103, the sound zone corresponding to the target speech signal is the sound zone where the sound source of the speech signal is located. In this embodiment, after the wake-up unit 15 determines the target voice signal containing the wake-up word, the sound zone corresponding to the target voice signal may be determined as the target sound zone where the sound source is located.
Specifically, if a single pre-enhanced voice signal has the highest score, the sound zone corresponding to it is determined as the target sound zone where the sound source generating the voice signal is located. If several pre-enhanced voice signals share the same highest score, the sound zone corresponding to any one of them may be determined as the target sound zone.
For ease of understanding, the six-microphone circular array is still taken as an example to explain the actual scenarios corresponding to these two situations.
As shown in fig. 5, if the sound source of the voice signal is located at A, it is likely that only the pre-enhanced voice signal corresponding to sound zone 1 has the highest score. In this case, there is one target voice signal, and sound zone 1, which corresponds to it, is the target sound zone where the sound source generating the voice signal is located.
If the sound source of the voice signal is located at B, on the boundary of sound zone 2 and sound zone 3, the scores of the pre-enhanced voice signals corresponding to sound zone 2 and sound zone 3 may be equal and highest. In this case, the sound zone corresponding to either of these pre-enhanced voice signals may be determined as the target sound zone where the sound source generating the voice signal is located.
Step 105, locating the sound source generating the voice signal within the target sound zone, and directionally enhancing the voice signal according to the localization information of the sound source.
In the embodiment of the present application, after the wake-up unit 15 determines the target sound zone where the sound source generating the voice signal is located, it may send the target sound zone information to the positioning unit 13. The positioning unit 13 may locate the sound source within the target sound zone, based on the voice signal received in step 101, using a direction-of-arrival (DOA) estimation algorithm. This narrows the localization range and improves the accuracy of sound source localization.
Having obtained accurate sound source localization information, the positioning unit 13 may send it to the second enhancement unit 14. The second enhancement unit 14 may use an adaptive beamforming (ABF) algorithm to directionally enhance the voice signal based on this localization information, thereby improving the speech enhancement effect while suppressing noise.
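One way to sketch localization restricted to the target sound zone is a steered-response-power search over candidate angles inside the zone only. This stands in for the DOA estimation the patent names without fixing a particular algorithm; the function name and grid parameters are assumptions:

```python
import numpy as np

def doa_in_zone(frames, mic_xy, zone_lo, zone_hi, fs=16000, c=343.0, step=1.0):
    """Grid-search localization restricted to the target sound zone:
    steer a delay-and-sum beam at each candidate angle in
    [zone_lo, zone_hi] (degrees; zone_lo < zone_hi assumed, wrap-around
    zones would need splitting) and return the angle with maximum output
    power. Restricting the grid to one zone shrinks the search space and
    avoids locking onto interfering sources outside the zone."""
    spec = np.fft.rfft(frames, axis=1)
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / fs)
    best_angle, best_power = zone_lo, -np.inf
    for ang in np.arange(zone_lo, zone_hi + step, step):
        u = np.array([np.cos(np.deg2rad(ang)), np.sin(np.deg2rad(ang))])
        delays = mic_xy @ u / c                 # per-mic arrival-time lead (s)
        align = np.exp(-2j * np.pi * freqs[None, :] * delays[:, None])
        power = np.sum(np.abs((spec * align).mean(axis=0)) ** 2)
        if power > best_power:
            best_angle, best_power = ang, power
    return best_angle
```

The returned angle would then drive the adaptive beamformer of the second enhancement unit; searching only the target zone is what gives the accuracy gain described above.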
In the embodiment of the application, the voice signal collected by the microphone array is first acquired. Then, the voice signal is pre-enhanced according to the sound zone parameters of each sound zone to obtain a pre-enhanced voice signal corresponding to each sound zone. Next, a target voice signal containing a wake-up word is determined from the pre-enhanced voice signals, and the sound zone corresponding to the target voice signal is determined as the target sound zone where the sound source generating the voice signal is located. Finally, the sound source is located within the target sound zone, and the voice signal is directionally enhanced according to the localization information of the sound source. In this scheme, fixed beam enhancement over preset sound zones is performed in the wake-up stage, and sound source localization is restricted to the preset sound zone range, which improves the reliability of the localization result and the speech enhancement performance in the recognition stage.
Fig. 6 is a flowchart of another speech enhancement method according to an embodiment of the present application. As shown in fig. 6, after the step 105, the speech enhancement method provided in the embodiment of the present application may further include:
step 201, sending the directionally enhanced voice signal to a cloud server.
In this embodiment, the second enhancing unit 14 may directly send the directionally enhanced voice signal to the cloud server 20. Alternatively, as shown in fig. 2, the second enhancement unit 14 may first send the directionally enhanced speech signal to the first speech processing unit 17 to implement further dereverberation; the dereverberated speech signal is then sent to the second speech processing unit 18 for further noise suppression. The processed voice signal is then sent by the second voice processing unit 18 to the cloud server 20.
After receiving the voice signal, the voice recognition unit 21 of the cloud server 20 may perform voice recognition, and trigger voice interaction according to a voice recognition result. Specifically, natural language understanding can be triggered according to the speech recognition result, then speech synthesis is performed according to the understanding result and the service logic of speech interaction, and speech interaction is realized.
Fig. 7 is a schematic structural diagram of a speech enhancement apparatus according to an embodiment of the present application. As shown in fig. 7, a speech enhancement apparatus provided in an embodiment of the present application may include: an acquisition module 61, a pre-enhancement module 62, a first determination module 63, a second determination module 64, and an execution module 65.
The obtaining module 61 is configured to obtain the voice signal collected by the microphone array.
The pre-enhancement module 62 is configured to pre-enhance the voice signal according to the sound zone parameters of each sound zone, so as to obtain a pre-enhanced voice signal corresponding to each sound zone; each sound zone is divided in advance according to the azimuth information of the microphones included in the microphone array.
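The pre-division of sound zones can be sketched as an even split of the array's 360° signal acquisition area, with each zone's center line serving as that zone's sound zone parameter. The zone count and the dictionary layout below are assumptions for illustration, not taken from the application.

```python
def divide_sound_zones(num_zones):
    """Split the 360-degree signal acquisition area into equal sound zones.

    Returns one dict per zone with its angular span and the center-line
    azimuth used as that zone's steering direction (its 'zone parameter').
    """
    width = 360.0 / num_zones
    zones = []
    for i in range(num_zones):
        start = i * width
        zones.append({
            "zone_id": i,
            "start_deg": start,
            "end_deg": start + width,
            "center_deg": start + width / 2.0,  # center line of the zone
        })
    return zones
```

A real division would also account for the microphones' relative positions (e.g., a linear array only covers a 180° half-plane), which this even split ignores.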
The first determining module 63 is configured to determine a target voice signal containing the wake-up word from the pre-enhanced voice signals.

The second determining module 64 is configured to determine the sound zone corresponding to the target voice signal as the target sound zone in which the sound source generating the voice signal is located.

The execution module 65 is configured to locate the sound source generating the voice signal within the target sound zone and to directionally enhance the voice signal according to the positioning information of the sound source.
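Locating the sound source within the target sound zone can be realized, for example, by estimating the time difference of arrival (TDOA) for a microphone pair and searching only lags consistent with that zone. The GCC-PHAT sketch below is one such method, offered as an illustration rather than the localization algorithm of the application.

```python
import numpy as np

def gcc_phat_tdoa(sig_ref, sig_other, fs, max_tau):
    """GCC-PHAT time-difference-of-arrival estimate for one microphone pair.

    Only lags with |tau| <= max_tau are searched, which is how the search
    can be restricted to directions inside the target sound zone. A positive
    result means sig_other lags sig_ref.
    """
    n = len(sig_ref) * 2  # zero-pad to avoid circular-correlation wrap-around
    ref = np.fft.rfft(sig_ref, n=n)
    other = np.fft.rfft(sig_other, n=n)
    cross = np.conj(ref) * other
    cross /= np.abs(cross) + 1e-12      # PHAT weighting: keep only the phase
    cc = np.fft.irfft(cross, n=n)
    max_shift = int(max_tau * fs)
    # Stitch negative lags (tail) and positive lags (head) around lag zero.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (int(np.argmax(np.abs(cc))) - max_shift) / fs
```

From the TDOA and the known microphone spacing, the azimuth inside the target zone follows from simple geometry, and that azimuth can then steer the directional enhancement.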
In a specific implementation, when determining the target voice signal containing the wake-up word from the pre-enhanced voice signals, the first determining module 63 is specifically configured to score, using a neural network model, the similarity between the signal features of each pre-enhanced voice signal and preset signal features, and to determine the target voice signal according to the scoring result.
In a specific implementation, the first determining module 63 determines the target voice signal according to the scoring result by determining, as the target voice signal, the pre-enhanced voice signal whose score is the highest among the pre-enhanced voice signals and is higher than a preset threshold.
In a specific implementation, if the first determining module 63 determines that the scores of all the pre-enhanced voice signals are lower than the preset threshold, a new voice signal is acquired through the microphone array until the score of at least one of the newly generated pre-enhanced voice signals is higher than the preset threshold.
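The threshold-and-maximum selection rule applied by the first determining module 63 can be sketched as follows; the `scores` input stands in for the neural network model's similarity outputs and is not the model used in the application.

```python
import numpy as np

def select_target_index(scores, threshold):
    """Return the index of the highest-scoring pre-enhanced signal if that
    score exceeds the preset threshold; return None when no signal passes,
    signalling the caller to acquire a new voice signal from the array."""
    best = int(np.argmax(scores))
    if scores[best] <= threshold:
        return None
    return best
```

A `None` return corresponds to the reacquisition path: no pre-enhanced signal contained the wake-up word with sufficient confidence, so the array keeps collecting new voice signals.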
In a specific implementation, after directionally enhancing the voice signal according to the positioning information of the sound source, the execution module 65 is further configured to send the directionally enhanced voice signal to a cloud server, so that the cloud server performs voice recognition on the directionally enhanced voice signal and conducts voice interaction according to the recognition result.
In this embodiment of the application, the obtaining module 61 first obtains the voice signal collected by the microphone array. The pre-enhancement module 62 then pre-enhances the voice signal according to the sound zone parameters of each sound zone, obtaining a pre-enhanced voice signal corresponding to each sound zone. Next, the first determining module 63 determines a target voice signal containing the wake-up word from the pre-enhanced voice signals, and the second determining module 64 determines the sound zone corresponding to the target voice signal as the target sound zone in which the sound source generating the voice signal is located. Finally, the execution module 65 locates the sound source within the target sound zone and directionally enhances the voice signal according to the positioning information of the sound source. In this way, even under interference from multiple sound sources, the position of the target sound source is located accurately and the voice enhancement performance is improved.
Fig. 8 is a schematic diagram of an electronic device according to an embodiment of the present application. As shown in Fig. 8, the electronic device may include at least one processor and at least one memory communicatively coupled to the processor, where the memory stores program instructions executable by the processor, and the processor calls the program instructions to execute the speech enhancement method provided in the embodiments of the present application.
The electronic device may be a voice enhancement device, and the embodiment does not limit the specific form of the electronic device.
Fig. 8 illustrates a block diagram of an exemplary electronic device suitable for implementing embodiments of the present application. The electronic device shown in Fig. 8 is only an example and should not limit the functions or the scope of use of the embodiments of the present application.
As shown in fig. 8, the electronic device is in the form of a general purpose computing device. Components of the electronic device may include, but are not limited to: one or more processors 410, a memory 430, a communication interface 420, and a communication bus 440 that connects the various system components (including the memory 430 and the processors 410).
Communication bus 440 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA (EISA) bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
Electronic devices typically include a variety of computer system readable media. Such media may be any available media that is accessible by the electronic device and includes both volatile and nonvolatile media, removable and non-removable media.
Memory 430 may include computer system readable media in the form of volatile memory, such as random access memory (RAM) and/or cache memory. The electronic device may further include other removable/non-removable, volatile/non-volatile computer system storage media. Although not shown in Fig. 8, a magnetic disk drive for reading from and writing to a removable, non-volatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, non-volatile optical disk (e.g., a compact disc read-only memory (CD-ROM), a digital versatile disc read-only memory (DVD-ROM), or other optical media) may be provided. In these cases, each drive may be connected to the communication bus 440 by one or more data media interfaces. Memory 430 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of the embodiments of the application.
A program/utility having a set (at least one) of program modules may be stored in memory 430; such program modules include, but are not limited to, an operating system, one or more application programs, other program modules, and program data, each of which, or some combination of which, may include an implementation of a network environment. The program modules generally perform the functions and/or methods of the embodiments described herein.
The electronic device may also communicate with one or more external devices (e.g., a keyboard, a pointing device, or a display), with one or more devices that enable a user to interact with the electronic device, and/or with any device (e.g., a network card or modem) that enables the electronic device to communicate with one or more other computing devices. Such communication may occur via the communication interface 420. Furthermore, the electronic device may communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) via a network adapter (not shown in Fig. 8) that communicates with the other modules of the electronic device over the communication bus 440. It should be appreciated that, although not shown in Fig. 8, other hardware and/or software modules may be used in conjunction with the electronic device, including but not limited to microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
The processor 410 executes various functional applications and data processing, such as implementing a speech enhancement method provided by an embodiment of the present application, by executing programs stored in the memory 430.
An embodiment of the present application further provides a computer-readable storage medium, where the computer-readable storage medium stores computer instructions, and the computer instructions enable the computer to execute the speech enhancement method provided in the embodiment of the present application.
The computer-readable storage medium described above may be any combination of one or more computer-readable media. A computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a flash memory, an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof.
A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out the operations of the present application may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" programming language or similar languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the latter case, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.
It should be noted that the terminal according to the embodiments of the present application may include, but is not limited to, a personal computer (PC), a personal digital assistant (PDA), a wireless handheld device, a tablet computer, a mobile phone, an MP3 player, an MP4 player, and the like.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions in actual implementation, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the scope of protection of the present application.

Claims (10)

1. A method of speech enhancement, comprising:
acquiring a voice signal acquired by a microphone array;
according to the sound zone parameters of each sound zone, respectively pre-enhancing the voice signals to obtain pre-enhanced voice signals respectively corresponding to each sound zone; wherein the sound zones are divided in advance according to azimuth information of microphones included in the microphone array;
determining a target voice signal containing a wake-up word from each of the pre-enhanced voice signals;
determining a sound zone corresponding to the target voice signal as a target sound zone where a sound source generating the voice signal is located;
and positioning a sound source generating the voice signal in the target sound area, and directionally enhancing the voice signal according to the positioning information of the sound source.
2. The method of claim 1, wherein the azimuth information of the microphones comprises relative position parameters of the microphones in the microphone array, and wherein pre-dividing the sound zones according to the azimuth information of the microphones included in the microphone array comprises:
dividing a signal acquisition area of the microphone array into a plurality of sound zones according to the relative position parameters of the microphones included in the microphone array, and determining the sound zone parameter of each sound zone according to the center-line position of that sound zone.
3. The method of claim 1, wherein determining a target speech signal containing a wake-up word from each of the pre-enhanced speech signals comprises:
scoring the similarity between the signal characteristics of each pre-enhanced voice signal and preset signal characteristics by using a neural network model; the preset signal characteristics are signal characteristics of a wake-up voice signal corresponding to a wake-up word;
and determining the target voice signal according to the scoring result.
4. The method of claim 3, wherein determining the target speech signal based on the scoring comprises:
and determining the pre-enhanced voice signal with the score higher than a preset threshold value and the score highest in each pre-enhanced voice signal as a target voice signal.
5. The method according to claim 4, wherein if the score of each of the pre-enhanced speech signals is below the preset threshold, the method further comprises:
and acquiring new voice signals through the microphone array until the score of at least one pre-enhanced voice signal in each generated pre-enhanced voice signal is higher than the preset threshold value.
6. The method according to claim 1, wherein after directionally enhancing the speech signal according to the localization information of the sound source, the method further comprises:
and sending the directionally enhanced voice signal to a cloud server so that the cloud server performs voice recognition according to the directionally enhanced voice signal and performs voice interaction according to a voice recognition result.
7. A speech enhancement apparatus, comprising:
the acquisition module is used for acquiring the voice signals acquired by the microphone array;
the pre-enhancement module is used for respectively pre-enhancing the voice signals according to the sound zone parameters of each sound zone to obtain pre-enhanced voice signals corresponding to each sound zone; wherein the respective sound zones are determined from azimuth information of respective microphones comprised by the microphone array;
the first determining module is used for determining a target voice signal containing a wake-up word from each pre-enhanced voice signal;
the second determining module is used for determining the sound zone corresponding to the target voice signal as the target sound zone where the sound source generating the voice signal is located;
and the execution module is used for positioning the sound source generating the voice signal in the target sound area and directionally enhancing the voice signal according to the positioning information of the sound source.
8. The apparatus of claim 7, wherein the execution module, after performing directional enhancement on the speech signal according to the positioning information of the sound source, is further configured to:
and sending the directionally enhanced voice signal to a cloud server so that the cloud server performs voice recognition according to the directionally enhanced voice signal and performs voice interaction according to a voice recognition result.
9. An electronic device, comprising:
at least one processor; and
at least one memory communicatively coupled to the processor, wherein:
the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the method of any of claims 1 to 6.
10. A computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1 to 6.
CN202110257165.7A 2021-03-09 2021-03-09 Speech enhancement method, electronic device, and storage medium Pending CN113053368A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110257165.7A CN113053368A (en) 2021-03-09 2021-03-09 Speech enhancement method, electronic device, and storage medium

Publications (1)

Publication Number Publication Date
CN113053368A true CN113053368A (en) 2021-06-29

Family

ID=76510851

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110257165.7A Pending CN113053368A (en) 2021-03-09 2021-03-09 Speech enhancement method, electronic device, and storage medium

Country Status (1)

Country Link
CN (1) CN113053368A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113782024A (en) * 2021-09-27 2021-12-10 上海互问信息科技有限公司 Method for improving automatic voice recognition accuracy rate after voice awakening
CN114500733A (en) * 2022-01-21 2022-05-13 维沃移动通信有限公司 Capacitive sound control method, device, equipment and medium
WO2023138632A1 (en) * 2022-01-24 2023-07-27 维沃移动通信有限公司 Voice recording method and apparatus, and electronic device
CN117854526A (en) * 2024-03-08 2024-04-09 深圳市声扬科技有限公司 Speech enhancement method, device, electronic equipment and computer readable storage medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10051366B1 (en) * 2017-09-28 2018-08-14 Sonos, Inc. Three-dimensional beam forming with a microphone array
CN108538305A (en) * 2018-04-20 2018-09-14 百度在线网络技术(北京)有限公司 Audio recognition method, device, equipment and computer readable storage medium
CN108597507A (en) * 2018-03-14 2018-09-28 百度在线网络技术(北京)有限公司 Far field phonetic function implementation method, equipment, system and storage medium
CN109949810A (en) * 2019-03-28 2019-06-28 华为技术有限公司 A kind of voice awakening method, device, equipment and medium
CN110556103A (en) * 2018-05-31 2019-12-10 阿里巴巴集团控股有限公司 Audio signal processing method, apparatus, system, device and storage medium
CN110646763A (en) * 2019-10-10 2020-01-03 出门问问信息科技有限公司 Sound source positioning method and device based on semantics and storage medium
CN110673096A (en) * 2019-09-30 2020-01-10 北京地平线机器人技术研发有限公司 Voice positioning method and device, computer readable storage medium and electronic equipment
CN111599371A (en) * 2020-05-19 2020-08-28 苏州奇梦者网络科技有限公司 Voice adding method, system, device and storage medium
CN112216295A (en) * 2019-06-25 2021-01-12 大众问问(北京)信息科技有限公司 Sound source positioning method, device and equipment



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination