WO2020048431A1 - Voice processing method, electronic device and display device - Google Patents

Voice processing method, electronic device and display device

Info

Publication number
WO2020048431A1
WO2020048431A1 · PCT/CN2019/104081 · CN2019104081W
Authority
WO
WIPO (PCT)
Prior art keywords
voice data
data
channel
speech
original
Prior art date
Application number
PCT/CN2019/104081
Other languages
French (fr)
Chinese (zh)
Inventor
纳跃跃
刘鑫
刘勇
高杰
付强
Original Assignee
阿里巴巴集团控股有限公司
Application filed by 阿里巴巴集团控股有限公司 filed Critical 阿里巴巴集团控股有限公司
Publication of WO2020048431A1 publication Critical patent/WO2020048431A1/en

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification
    • G10L17/22: Interactive procedures; Man-machine interfaces
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272: Voice signal separating

Definitions

  • the present application belongs to the field of Internet technology, and particularly relates to a voice processing method, an electronic device, and a display device.
  • In human-machine voice interaction, the intelligent device converts a voice command into text through speech recognition technology, then understands the intention of the command through semantic understanding technology and gives corresponding feedback.
  • The premise of human-machine voice interaction is that the machine must be able to hear the content of the voice command clearly.
  • the purpose of this application is to provide a voice processing method, an electronic device, and a display device, which can realize accurate recognition of voice data, and thus can obtain more effective human-machine voice interaction.
  • the present application provides a voice processing method, an electronic device, and a display device as follows:
  • a speech processing method includes:
  • An electronic device includes a processor and a memory for storing processor-executable instructions. When the processor executes the instructions, the processor implements:
  • a display device includes a processor and a memory for storing processor-executable instructions. When the processor executes the instructions, the display device implements:
  • a computer-readable storage medium stores computer instructions thereon, the steps of the above method being implemented when the instructions are executed.
  • In the voice processing method, electronic device, and display device provided by the present application, the original voice data is separated into multiple channels of voice data, the credibility of each channel of voice data is determined, and voice recognition is performed on the most credible channel's voice data.
  • This reduces the impact of noise on speech recognition, solves the existing problem of low speech recognition accuracy caused by interference sounds such as noise, achieves accurate recognition of voice data, and enables more effective human-machine voice interaction.
  • FIG. 1 is a schematic diagram of conventional sound transmission.
  • FIG. 2 is a schematic diagram of a voice interaction system architecture provided by this application.
  • FIG. 3 is a schematic diagram of channel separation and determination provided in the present application.
  • FIG. 4 is a schematic structural diagram of a voice processing process provided by the present application.
  • FIG. 5 is a flowchart of a voice processing method provided by the present application.
  • FIG. 6 is a flowchart of another method of speech recognition provided by the present application.
  • FIG. 7 is a schematic structural diagram of a computing terminal provided by the present application.
  • FIG. 8 is a structural block diagram of a voice interaction device provided by the present application.
  • If noise signals such as device echo, non-target speech, and environmental noise can be suppressed, thereby improving the signal-to-noise ratio of the target speech, then speech recognition accuracy and human-machine voice interaction efficiency can be effectively improved.
  • In this application, a multi-channel signal is obtained through speech separation, the channel that interacts with the smart device is determined based on the credibility of the wake word in each channel signal, and speech recognition is performed on the channel signal of the original sound source, which can effectively reduce the impact of noise.
  • a voice interaction system which may include: User X, a noise source, and an interactive device.
  • The interactive device includes a processor configured to complete multi-channel voice separation and wake-word-based channel selection.
  • the above user X is a user who performs voice interaction with the interactive device.
  • User X sends out interactive voices, for example: Hello TV, help me turn up the TV sound to 50.
  • the sound collector may be a microphone array.
  • the microphone array is used to collect sound and provide the collected sound to the processor for processing.
  • the processor processes the sound.
  • The microphone array may have a specifically shaped or a regularly shaped structure; the specific structure of the microphone array is not limited in this application and may be selected and set according to actual needs.
  • The voices mixed in the original sound are separated using speech separation technology, and each separated voice is output on its own channel.
  • Using voice wake-up technology, the channel with the highest wake-word score is selected from the awakened channels as the target voice, and its data is then sent to the speech recognition system for voice recognition.
  • The user's original voice data (that is, the voice data collected by the microphone array) is obtained.
  • The original voice data includes the user's original voice (that is, the voice data that needs to be acquired) together with noise data (for example, interference from other users speaking and other sounds).
  • Through voice separation processing, the processor can separate the original voice data into multi-channel voice data: the first channel voice data, the second channel voice data, the third channel voice data, the fourth channel voice data, and so on.
  • Suppose the wake word is scored 20 in the first channel of voice data, 98 in the second channel, 50 in the third channel, and 35 in the fourth channel.
  • It can then be determined that the second channel of voice data is the original sound, and the voice data of the second channel is sent to the speech recognition system as the determined original sound.
  • The speech recognition system converts the voice command into text through speech recognition technology, understands the intention of the command through semantic understanding technology, and gives corresponding feedback.
  • the separation can be performed in one of the following ways, but not limited to:
  • Method 1: Since different sound sources are generated by different physical processes, it can be assumed that different sound source signals are statistically independent. Because the original speech signal is a mixture of multiple source signals, the signals collected by the channels of the microphone array are no longer independent; an objective function can therefore be defined that maximizes the independence between the output channels during iteration, achieving speech separation.
  • Method 2: Since the speech signal is sparse in the frequency domain, it can be assumed that only one sound source is dominant at any given time-frequency point. A time-frequency masking method can therefore be used: the time-frequency points belonging to the same sound source are separated and classified, and the energy and covariance matrix of each source are calculated from its time-frequency mask to achieve speech separation.
  • Method 3: With the topology of the microphone array known, a sound source localization algorithm estimates the azimuth of each of the multiple sound sources, and a beamforming algorithm then forms a beam toward each sound source to output multi-channel voice signals.
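As an illustration of Method 3, a minimal delay-and-sum beamformer for a linear microphone array can be sketched as follows. This is a hedged sketch under a far-field assumption with integer-sample delays; the function and parameter names are illustrative, not from the patent:

```python
import numpy as np

def delay_and_sum(mic_signals, mic_positions, azimuth, fs, c=343.0):
    """Steer a beam toward `azimuth` (radians) for a linear array.

    mic_signals:   (n_mics, n_samples) time-domain signals.
    mic_positions: (n_mics,) microphone x-coordinates in metres.
    Far-field model with integer-sample delays (illustrative only).
    """
    # Relative arrival delay of the target wavefront at each microphone.
    delays = mic_positions * np.cos(azimuth) / c            # seconds
    sample_delays = np.round(delays * fs).astype(int)
    sample_delays -= sample_delays.min()                    # non-negative shifts

    n_mics, n_samples = mic_signals.shape
    out = np.zeros(n_samples)
    for m in range(n_mics):
        d = sample_delays[m]
        # Align the target source across channels, then average:
        # coherent for the target, incoherent for interferers.
        out[:n_samples - d] += mic_signals[m, d:]
    return out / n_mics
```

Running one such beamformer per estimated source azimuth yields the multi-channel output the text describes.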
  • the echo cancellation processing can be performed on the data before the voice separation, thereby eliminating the echo data in the voice data of each channel.
  • The voice data of each channel can be subjected to noise reduction processing, followed by gain control. After gain control, a wake-word probability judgment is performed on each channel of voice data to determine the original sound, that is, to realize channel selection.
  • Channel 2 (that is, the second channel voice data) is determined as the original sound, and the voice data of channel 2 is used as the data for voice instruction recognition to be transmitted to the speech recognition system.
  • the probability of the presence of the wake word in the voice data of each channel may be determined, and the voice data of the channel with the highest probability is used as the voice data corresponding to the original sound.
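The selection step just described reduces to an argmax over per-channel wake-word scores. A toy sketch (the scoring itself would come from the wake-word engine; names here are illustrative):

```python
def select_original_channel(wake_scores):
    """Return the channel id with the highest wake-word score.

    wake_scores: dict mapping channel id -> wake-word score
    (or probability). The highest-scoring channel is taken as
    the channel carrying the original sound.
    """
    return max(wake_scores, key=wake_scores.get)

# Scores from the example above: channel 2 wins with a score of 98.
example = {1: 20, 2: 98, 3: 50, 4: 35}
```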
  • The wake word can be a sensitive phrase set in advance. For example, if the interactive device is a TV, the wake word can be "Hello TV"; if the interactive device is a speaker, the wake word can be "Hello speaker"; if the interactive device is a vending machine, the wake word can be "Hello vending machine". The device can also be given a name, for example "miumiu", in which case the wake word can be set to "Hello miumiu", or simply "miumiu", and so on.
  • how to set the wake word can be selected according to actual needs, which is not limited in this application.
  • Automatic gain control (AGC) is an automatic control method that adjusts the gain of a channel according to its signal strength.
  • AGC is a kind of limited-output control: it adjusts the output signal (as in hearing aids) by effectively combining linear amplification and compression amplification.
  • For weak input signals, the linear amplification path works to ensure the strength of the output signal; when the input signal reaches a certain intensity, the compression amplification path is engaged to reduce the output amplitude.
  • The AGC function automatically controls the gain amplitude by changing the input-output compression ratio.
  • One approach increases the AGC voltage to reduce the gain; this is called forward AGC. The other reduces the gain by reducing the AGC voltage; this is called reverse AGC.
  • Forward AGC has strong control capability but requires large control power; the working-point range of the controlled amplifier is large, and the impedance at both ends of the amplifier also changes significantly.
  • Reverse AGC requires little control power, but its control range is also small.
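A minimal block-wise AGC along these lines might look as follows. This is a hedged sketch: capping the gain stands in for the linear-amplification stage, and loud blocks automatically receive a gain below 1, which acts like the compression stage; the parameter values are illustrative assumptions:

```python
import numpy as np

def agc_block(signal, target_rms=0.1, max_gain=10.0):
    """Per-block automatic gain control (illustrative parameters).

    Weak blocks are amplified toward `target_rms` but the gain is
    capped at `max_gain`; strong blocks get a gain below 1, which
    limits the output amplitude like the compression path above.
    """
    rms = np.sqrt(np.mean(signal ** 2))
    if rms == 0.0:
        return signal                      # silence: nothing to normalize
    gain = min(target_rms / rms, max_gain)
    return signal * gain
```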
  • The wake-word score can be used alone, or combined with other data such as the wake-word duration and the signal-to-noise ratio, to select and output the channel with the highest comprehensive ranking. Which method is used to select the target channel may be chosen according to actual needs and is not limited in this application.
  • A user triggers voice interaction by speaking the wake word to start an interaction process; after the interaction ends, an interaction with another user can be triggered.
  • The above voice recognition method assumes that the interacting user's position does not move during the interaction. For example, the user stands in front of the TV and says: "Hello TV, please turn up the volume", or stands in front of the vending machine and says: "Hello vending machine, I want a subway ticket from Suzhou Street to Finance Street."
  • A voice user identification step can be added: after the original-sound channel is determined and subsequent voice data of that channel is acquired, identity recognition is performed first to determine whether the speaker is the user corresponding to the original sound. If so, the channel's voice data continues to be acquired and used for speech recognition; if not, the channel is determined again.
  • The voice interaction system includes an enhancement system and a wake-up system. The enhancement system enhances the original voice signal received by the microphone array and outputs multiple channels of sound source signals with high signal-to-noise ratio.
  • In FIG. 4, two channels of high signal-to-noise-ratio source signals ("hello TV", "today's weather") are taken as an example; in actual implementation, two or more channels of source signals can be included, that is, multi-channel voice output and multi-channel selection are supported.
  • the wake-up system determines whether the multi-channel signal output contains a user-defined wake-up word, such as "hello TV", and determines the output channel based on the wake-up score of the voice data output from each channel signal.
  • the enhancement system may include an echo cancellation module, a voice separation and noise reduction module, and a gain control module.
  • the echo cancellation module is used to suppress the sound emitted by the interactive device itself, such as a program or a prompt sound played by a television or a speaker.
  • The voice separation and noise reduction module is used to separate each source signal from the mixed signal and to suppress environmental noise, such as stationary noise from air conditioners and microwave ovens.
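One common way to implement the noise-suppression part of such a module is spectral subtraction. The sketch below assumes the noise magnitude spectrum has already been estimated from non-speech frames; the function name and spectral-floor value are illustrative, not from the patent:

```python
import numpy as np

def spectral_subtract(frame, noise_mag, floor=0.05):
    """Single-frame spectral subtraction (illustrative sketch).

    frame:     time-domain samples of one analysis frame.
    noise_mag: estimated noise magnitude spectrum
               (len(frame)//2 + 1 bins), assumed to have been
               measured on earlier non-speech frames.
    A small spectral floor is kept to limit musical-noise artifacts.
    """
    spec = np.fft.rfft(frame)
    mag, phase = np.abs(spec), np.angle(spec)
    # Subtract the noise estimate, but never go below the floor.
    clean_mag = np.maximum(mag - noise_mag, floor * mag)
    return np.fft.irfft(clean_mag * np.exp(1j * phase), n=len(frame))
```

In a real front end this would run per overlapping frame with windowing and overlap-add.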
  • a gain control module is used to automatically adjust the gain of the output signal so that the output signal meets the input requirements of the wake word module and speech recognition.
  • The voice separation module may use, but is not limited to, one of the three separation methods described above (independence maximization, time-frequency masking, or sound source localization plus beamforming) to separate the voice data.
  • the wake-up system in FIG. 4 may include: a wake-up word module and a channel selection module.
  • The wake-up word module is configured to detect whether a predefined wake word appears in the input signal and to give a score for the wake word; the higher the score, the better the signal quality. The channel selection module selects the channel based on the scores of the multiple wake words and other characteristics, such as the wake-word duration and signal-to-noise ratio, and outputs the channel with the highest comprehensive ranking.
  • The distortion caused by overlapping speech is small, and the method does not rely on localizing the azimuth of the sound source: even if the target sound and the interference sound are located at the same azimuth, as long as there is a difference in their distances from the microphone array, they can be effectively processed by the method of the present application.
  • The voices mixed in the original sound are separated using voice separation technology, and each separated voice is output on its own channel.
  • Using voice wake-up technology, the channel with the highest score among the awakened channels is selected as the channel where the target voice is located, and the voice data of that channel is sent to the speech recognition system for processing. Because multi-channel enhanced output is combined with multi-channel wake-word detection and wake-up posterior probability scoring, each channel of enhanced output is a separate voice whose signal-to-noise ratio can be effectively improved; this raises the probability that a real wake-up is detected by the wake-up algorithm, so that the channel most likely to contain the target speaker is selected for subsequent operations.
  • FIG. 5 is a method flowchart of an embodiment of a speech processing method described in this application.
  • Although this application provides method operation steps or device structures as shown in the following embodiments or drawings, the method or device may include more or fewer operation steps or module units based on conventional means or without creative labor.
  • The execution order of these steps or the module structure of the device is not limited to the execution order or module structure shown in the embodiments of the present application and the accompanying drawings.
  • The method or module structure shown in the embodiments or drawings may be executed sequentially or in parallel in an actual device or terminal product (for example, in a parallel-processor or multi-threaded environment, or even a distributed processing environment).
  • a voice processing method provided by an embodiment of the present application may include:
  • Step 501: Separate the original voice data into multi-channel voice data.
  • The original voice data can be obtained from sound data picked up by the microphone array, followed by echo cancellation, in which the sound emitted by the interactive device itself, for example a program or prompt sound played by a television or speaker, is suppressed.
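Echo cancellation of this kind is typically done with an adaptive filter. The following is a hedged sketch using normalized LMS (NLMS), assuming the device's own playback signal is available as a reference; real acoustic echo cancellers also add double-talk detection and delay estimation:

```python
import numpy as np

def nlms_echo_cancel(mic, ref, filter_len=64, mu=0.5, eps=1e-8):
    """Subtract an adaptively estimated echo of `ref` from `mic`.

    mic: microphone signal containing speech plus device echo.
    ref: the signal the device itself is playing (loudspeaker feed).
    A normalized-LMS filter learns the echo path; the returned
    error signal is the echo-suppressed microphone signal.
    """
    w = np.zeros(filter_len)       # echo-path estimate
    x = np.zeros(filter_len)       # most recent reference samples
    out = np.zeros(len(mic))
    for n in range(len(mic)):
        x = np.roll(x, 1)
        x[0] = ref[n]
        e = mic[n] - w @ x                 # mic minus estimated echo
        out[n] = e
        w += mu * e * x / (x @ x + eps)    # normalized gradient step
    return out
```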
  • Step 502: Determine the credibility of the voice data of each channel in the multi-channel voice data.
  • the credibility may be determined according to at least one of the following: the signal quality of the predetermined phrase, the signal-to-noise ratio of the voice data, and the length of time that the predetermined phrase appears.
  • the predetermined phrase may be a wake-up word
  • The wake word may be a sensitive phrase set in advance. For example, if the interactive device is a TV, the wake word may be "Hello TV"; if the interactive device is a speaker, the wake word may be "Hello speaker"; if the interactive device is a vending machine, the wake word can be "Hello vending machine". The device can also be given a name, such as "miumiu", in which case the wake word can be set to "Hello miumiu", or simply "miumiu", and so on.
  • how to set the wake word can be selected according to actual needs, which is not limited in this application.
  • Step 503: Use the channel corresponding to the most credible voice data as the target channel.
  • Step 504: Perform voice recognition on the voice data of the target channel.
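Steps 501 to 504 can be sketched as a small pipeline; the three callables are assumed stand-ins for the separation, credibility-scoring, and recognition components described above:

```python
def process_voice(original, separate, credibility, recognize):
    """Steps 501-504 as one pipeline (illustrative stand-in
    callables, not the patent's actual components)."""
    channels = separate(original)                      # step 501
    scores = [credibility(ch) for ch in channels]      # step 502
    target = channels[scores.index(max(scores))]       # step 503
    return recognize(target)                           # step 504
```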
  • The voices mixed with the original sound are separated using voice separation technology, and each separated voice is output on its own channel.
  • The credibility of each channel's signal is determined, and the channel with the highest credibility is used as the channel where the target voice is located; the voice data of this channel is then sent to the speech recognition system for processing.
  • This reduces the impact of noise on speech recognition, solves the existing problem of low speech recognition accuracy caused by interference sounds such as noise, achieves accurate recognition of voice data, and enables more effective human-machine voice interaction.
  • performing voice recognition on the voice data of the target channel may be converting the voice data of the target channel into text content; identifying the intent of the text content, and then generating feedback data according to the intent .
  • For example, if the voice data is "Hello TV, please adjust the volume to 50", the intent is determined to be adjusting the volume.
  • The correspondingly generated data can be the operation data for adjusting the volume, and voice data for feedback to the user can also be generated, for example: "OK, the volume has been adjusted", and so on.
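A toy version of the intent-recognition step for the volume example might look as follows; the matching rule and feedback wording are hypothetical, not taken from the patent:

```python
import re

def parse_volume_intent(text):
    """Toy intent parser for commands like
    'Hello TV, please adjust the volume to 50'."""
    m = re.search(r"volume to (\d+)", text)
    if m:
        level = int(m.group(1))
        return {"intent": "set_volume", "level": level,
                "feedback": "OK, the volume has been adjusted to %d." % level}
    return {"intent": "unknown"}
```

A production system would use a semantic-understanding model rather than a regular expression, but the input/output shape is the same: text in, intent plus feedback out.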
  • the interactive device may also be other devices, such as smart speakers, smart vending machines, etc., which is not limited in this application.
  • The original voice data can be separated into multi-channel voice data in one of the ways described above.
  • noise reduction processing and gain control can also be performed on the voice data of each channel in the multi-channel voice data.
  • The noise reduction processing suppresses environmental noise, for example stationary noise from air conditioners and microwave ovens.
  • the gain control is to automatically adjust the gain of the output signal so that the output signal meets the input requirements of the wake word module and speech recognition.
  • a voice processing method is also provided in this application. As shown in FIG. 6, the method may include the following steps:
  • Step 601: Separate the original voice data into one or more channels of voice data.
  • Step 602: Determine the credibility of the separated voice data.
  • Step 603: Perform speech recognition on the most credible voice data.
  • Determining the credibility of the separated voice data may be based on the wake word: for each channel of voice data, detect whether a predefined wake word appears and determine the score of the wake word; determine the score of the voice data according to the wake-word score; and use the determined score as the credibility of the voice data.
  • The credibility can be further determined by combining other information such as the signal-to-noise ratio. For example, the wake-word duration and the signal-to-noise ratio can be obtained, and the score of each channel of voice data can be calculated from the wake-word score, the wake-word duration, and the signal-to-noise ratio.
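Such a combined credibility score could be sketched as a weighted sum of the wake-word score, the wake-word duration, and the SNR; the weights and scaling below are illustrative assumptions, not values from the patent:

```python
def channel_credibility(wake_score, wake_duration_s, snr_db,
                        w_score=0.6, w_dur=10.0, w_snr=0.3):
    """Combine wake-word score, wake-word duration (seconds) and
    SNR (dB) into a single credibility value. The weights are
    illustrative; `w_dur` also rescales the short duration into
    a range comparable to the other terms."""
    return (w_score * wake_score
            + w_dur * wake_duration_s
            + w_snr * snr_db)
```

Ranking the channels by this value and taking the maximum implements the "highest comprehensive ranking" selection described in the text.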
  • The acquired sound data may contain echo, so before separating the original voice data into one or more channels of voice data, the sound data can be acquired and echo cancellation performed on it to obtain the original voice data, so that the effects of echo are effectively eliminated.
  • the above step 601 can separate the original voice data into multi-channel voice data. Specifically, one of the following methods can be adopted:
  • Method 1: Iteratively process the original voice data using an objective function that maximizes the independence between the output channels to obtain the multi-channel voice data; or,
  • Method 2: Separate and classify the time-frequency points belonging to the same sound source in the original voice data, determine multiple sound source signals, and calculate the energy and covariance matrix of each sound source from its time-frequency mask to obtain the multi-channel voice data; or,
  • Method 3: Obtain the topology of the microphone array, use a sound source localization algorithm to determine the azimuth of each of the multiple sound sources, and form a beam toward each sound source through a beamforming algorithm to obtain the multi-channel voice data.
  • noise reduction processing and gain control can also be performed on the voice data of each channel in the multi-channel voice data.
  • The noise reduction processing suppresses environmental noise, for example stationary noise from air conditioners and microwave ovens.
  • the gain control is to automatically adjust the gain of the output signal so that the output signal meets the input requirements of the wake word module and speech recognition.
  • the noise reduction process can be performed first, and then the gain control can be performed.
  • FIG. 7 is a block diagram of a hardware structure of a computer terminal of a voice processing method according to an embodiment of the present invention.
  • The computer terminal 10 may include one or more processors 102 (only one is shown in the figure; the processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 104 for storing data, and a transmission module 106 for communication functions.
  • the structure shown in FIG. 7 is only schematic, and it does not limit the structure of the electronic device.
  • the computer terminal 10 may further include more or fewer components than those shown in FIG. 7, or have a configuration different from that shown in FIG. 7.
  • the memory 104 may be used to store software programs and modules of application software, such as program instructions / modules corresponding to the voice processing method in the embodiment of the present invention.
  • The processor 102 runs the software programs and modules stored in the memory 104 to execute various functional applications and data processing, that is, to implement the voice processing method of the above application program.
  • the memory 104 may include a high-speed random access memory, and may further include a non-volatile memory, such as one or more magnetic storage devices, a flash memory, or other non-volatile solid-state memory.
  • the memory 104 may further include memory remotely disposed with respect to the processor 102, and these remote memories may be connected to the computer terminal 10 through a network. Examples of the above network include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
  • the transmission module 106 is configured to receive or send data via a network.
  • a specific example of the above network may include a wireless network provided by a communication provider of the computer terminal 10.
  • the transmission module 106 includes a network adapter (NIC), which can be connected to other network devices through a base station so as to communicate with the Internet.
  • the transmission module 106 may be a radio frequency (RF) module, which is used to communicate with the Internet in a wireless manner.
  • the voice recognition device shown in FIG. 8 may include: a separation module 801, a first determination module 802, a second determination module 803, and a recognition module 804, where:
  • a separation module 801, configured to separate the original voice data into multi-channel voice data
  • a first determining module 802 configured to determine the credibility of the voice data of each channel in the multi-channel voice data
  • a second determining module 803, configured to use a channel corresponding to the most reliable voice data as a target channel
  • the recognition module 804 is configured to perform voice recognition on the voice data of the target channel.
  • the first determining module 802 may specifically determine the credibility according to at least one of the following: the signal quality of the predetermined phrase, the signal-to-noise ratio of the voice data, and the duration of occurrence of the predetermined phrase.
  • The above device may further include a pick-up and cancellation module, configured to pick up sound data through the microphone array before the original voice data is separated into multi-channel voice data, and to perform echo cancellation on the sound data to obtain the original voice data.
  • the identification module 804 may be specifically configured to convert the voice data of the target channel into text content; identify the intent of the text content; and generate feedback data according to the intent.
  • The separation module 801 may, but is not limited to, separate the original voice data into multi-channel voice data in one of the ways described above.
  • the device may further perform noise reduction processing and gain control on the voice data of each channel in the multi-channel voice data after separating the original voice data into multi-channel voice data.
  • a smart TV is also provided.
  • the smart TV may include a processor and a memory for storing processor-executable instructions, and the processor implements when the instructions are executed:
  • The original voice data is separated into multi-channel voice data, the credibility of each channel of voice data is determined, and the channel corresponding to the most credible voice data is used as the target channel.
  • Voice recognition is performed on the voice data of the target channel, which reduces the impact of noise on voice data recognition, solves the existing problem of low voice recognition accuracy caused by noise and other interference sounds, achieves accurate recognition of voice data, and allows more effective human-machine voice interaction.
  • the devices or modules described in the foregoing embodiments may be specifically implemented by a computer chip or entity, or may be implemented by a product having a certain function.
  • the functions are divided into various modules and described separately.
  • the functions of each module may be implemented in the same or multiple software and / or hardware.
  • a module that implements a certain function may also be implemented by combining multiple submodules or subunits.
  • the method, device or module described in this application may be implemented in a computer-readable program code by the controller in any suitable manner.
  • the controller may take the form of, for example, a microprocessor or processor together with a computer-readable medium storing computer-readable program code (such as software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (ASIC), a programmable logic controller, or an embedded microcontroller.
  • examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320.
  • the memory controller can also be implemented as part of the control logic of the memory.
  • a controller implemented in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like can be regarded as a hardware component.
  • the device included in the controller for implementing various functions can also be considered as a structure within the hardware component.
  • the means for implementing various functions can be regarded both as software modules implementing the method and as structures within the hardware component.
  • program modules include routines, programs, objects, components, data structures, classes, etc. that perform specific tasks or implement specific abstract data types.
  • program modules may be located in local and remote computer storage media, including storage devices.
  • the present application can be implemented by means of software plus the necessary hardware. Based on such an understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, may be embodied in the form of a software product, or may be reflected in a data-migration implementation process.
  • the computer software product can be stored in a storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes a number of instructions to enable a computer device (which can be a personal computer, a mobile terminal, a server, a network device, etc.) to execute the method described in each embodiment of the present application, or in certain parts of the embodiments.

Abstract

A voice processing method, an electronic device, and a display device, the method comprising: separating original voice data into one or more channels of voice data (601); determining the credibility of the separated channels of voice data (602); and performing voice recognition on the voice data having the highest credibility (603). The impact of noise on voice data recognition may be reduced, and the existing problem of low voice recognition accuracy due to noise and other disturbing sounds is solved, thus achieving accurate recognition of voice data and enabling more effective human-machine voice interaction.

Description

Voice processing method, electronic device and display device
This application claims priority to Chinese patent application No. 201811020120.2, filed on September 3, 2018 and entitled "A voice recognition method, smart device, and smart TV", the entire contents of which are incorporated herein by reference.
Technical field
The present application belongs to the field of Internet technology, and particularly relates to a voice processing method, an electronic device, and a display device.
Background
With the development of computers, the Internet, the mobile Internet, and the Internet of Things, smart devices (such as mobile phones, computers, smart home appliances, and smart robots) are used ever more frequently. Single-mode human-computer interaction based on a keyboard, mouse, or remote controller can no longer meet the demands of smart-device control; accordingly, demand for voice recognition and voice control has become increasingly widespread, and voice interaction is becoming a more pervasive form of human-computer interaction.
In human-machine voice interaction, the smart device converts a voice command into text through speech recognition technology, then understands the intent of the command through semantic understanding technology and gives corresponding feedback. The premise of human-machine voice interaction, however, is that the machine must be able to hear the content of the voice command clearly.
However, as shown in FIG. 1, in an actual voice interaction scene there are generally, in addition to the voice of the target speaker, multiple adverse acoustic factors such as device echo, non-target speakers' voices, external noise interference, and room reverberation. As a result, the original sound received by the pickup device is a noisy speech signal with a low signal-to-noise ratio; such a signal is difficult for speech recognition algorithms to process, making effective human-machine voice interaction impossible.
In view of the above problems, no effective solution has currently been proposed.
Summary of the Invention
The purpose of this application is to provide a voice processing method, an electronic device, and a display device that can achieve accurate recognition of voice data and thereby enable more effective human-machine voice interaction.
The voice processing method, electronic device, and display device provided by the present application are implemented as follows:
A voice processing method, the method comprising:
separating original voice data into one or more channels of voice data;
determining the credibility of each separated channel of voice data;
performing speech recognition on the voice data with the highest credibility.
An electronic device, comprising a processor and a memory storing processor-executable instructions, wherein the processor, when executing the instructions, implements:
separating original voice data into one or more channels of voice data;
determining the credibility of each channel of voice data;
performing speech recognition on the voice data with the highest credibility.
A display device, comprising a processor and a memory storing processor-executable instructions, wherein the processor, when executing the instructions, implements:
separating original voice data into one or more channels of voice data;
determining the credibility of each separated channel of voice data;
performing speech recognition on the voice data with the highest credibility.
A computer-readable storage medium having computer instructions stored thereon, which, when executed, implement the steps of the above method.
The voice processing method, electronic device, and display device provided by the present application separate original voice data into multiple channels of voice data, determine the credibility of each channel of voice data, and perform speech recognition on the voice data with the highest credibility. This reduces the impact of noise on voice data recognition, solves the existing problem of low speech recognition accuracy caused by noise and other interfering sounds, achieves accurate recognition of voice data, and enables more effective human-machine voice interaction.
Brief Description of the Drawings
In order to explain the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are merely some of the embodiments recorded in this application; a person of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is a schematic diagram of conventional sound propagation;
FIG. 2 is a schematic diagram of the architecture of the voice interaction system provided by this application;
FIG. 3 is a schematic diagram of channel separation and determination provided by this application;
FIG. 4 is a schematic diagram of the architecture of the voice processing flow provided by this application;
FIG. 5 is a flowchart of the voice processing method provided by this application;
FIG. 6 is a flowchart of another speech recognition method provided by this application;
FIG. 7 is a schematic diagram of the architecture of the computing terminal provided by this application;
FIG. 8 is a structural block diagram of the voice interaction device provided by this application.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions in the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of this application, not all of them. Based on the embodiments in this application, all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of this application.
Consider that if the original audio signal collected by a smart device's microphone array can be enhanced, and noise signals such as device echo, non-target speech, and environmental noise can be suppressed, the signal-to-noise ratio of the target speech improves; this can effectively raise the accuracy of speech recognition and improve the efficiency of human-machine voice interaction.
In this example, multi-channel signals are obtained through speech separation; the signal interacting with the smart device is determined based on the credibility of the wake word in each channel signal, and the channel carrying the source signal is thereby identified. Performing speech recognition on that source signal can effectively reduce the impact of noise.
As shown in FIG. 2, this example provides a voice interaction system that may include: a user X, a noise source, and an interactive device, where the interactive device includes a processor used to perform multi-channel separation of the voice and wake-word-based channel selection.
The above user X is a user who interacts with the interactive device by voice. User X utters an interactive voice command, for example: "Hello TV, turn the TV volume up to 50."
The interactive device must be provided not only with a processor but also with a sound collector. The sound collector may be a microphone array, which collects sound and provides the collected sound to the processor for processing. The microphone array may have a purpose-built shape or a regular shape; this application does not specifically limit the structure of the microphone array, which can be selected and configured according to actual needs.
To achieve accurate recognition of speech, this example uses speech separation technology to separate out the voices mixed in the original sound; in the speech enhancement stage, each separated voice gets one channel of output. Then, using voice wake-up technology, the channel with the highest score is selected from among the awakened channels as the target voice and sent to the speech recognition system for recognition.
For example, as shown in FIG. 3, the user's original voice data (i.e., the voice data collected by the microphone array) is obtained. The original voice data contains the user's original sound (i.e., the voice data to be acquired) as well as a large amount of noise data (for example, interference from other users' speech and other sound interference). After obtaining the voice data, the processor can separate the original voice data into multi-channel voice data through voice separation processing: channel-1 voice data, channel-2 voice data, channel-3 voice data, channel-4 voice data, and so on.
Each of the channel-1, channel-2, channel-3, channel-4, ... voice data is used as an input signal; whether a predefined wake word appears in each channel's data is detected, and the wake word is scored, with a higher score indicating better wake-word signal quality. For example, if the wake-word score is 20 for the channel-1 voice data, 98 for channel 2, 50 for channel 3, and 35 for channel 4, it can be determined that the channel-2 voice data is the original sound.
Therefore, the voice data of channel 2 can be sent as the determined original sound to the speech recognition system, which converts the voice command into text through speech recognition technology, then understands the intent of the command through semantic understanding technology and gives corresponding feedback.
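The selection rule illustrated by the example above (take the channel whose wake-word score is highest) reduces to a one-line maximization. A minimal sketch, using the scores from the example:

```python
# Channel selection reduces to taking the arg-max of the wake-word scores.
def select_target_channel(channel_scores):
    """channel_scores: mapping from channel index to wake-word score."""
    return max(channel_scores, key=channel_scores.get)

# Scores from the example above: channel 2 carries the original sound.
scores = {1: 20, 2: 98, 3: 50, 4: 35}
assert select_target_channel(scores) == 2
```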
When performing voice separation, that is, separating the original voice data into multi-channel voice data, one of (but not limited to) the following methods may be used:
Method 1: Since different sound sources are produced by different physical processes, the sound source signals can be assumed to be statistically independent. The original voice signal described above is a mixture of multiple source signals, so the signals collected by the individual channels of the microphone array are no longer independent. Therefore, an objective function can be defined that maximizes the independence between the output channels during iteration, thereby achieving speech separation.
Method 2: Since speech signals are sparse in the frequency domain, it can be assumed that only one sound source dominates any given time-frequency point. Accordingly, a time-frequency masking (Mask) method can be defined that separates out and groups together the time-frequency points belonging to the same sound source; the energy changes and covariance matrices of the individual sources are then calculated from each source signal's time-frequency mask, thereby achieving speech separation.
Method 3: Given the known topology of the microphone array, a sound source localization algorithm is used to estimate the azimuth of each of the multiple sound sources; then a beamforming algorithm forms a beam for each sound source, so as to output multi-channel voice signals.
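Method 2 can be illustrated with a minimal binary time-frequency masking sketch. The per-source magnitude estimates are assumed to be given, and the energy and covariance computations mentioned above are omitted; this is only an illustration of assigning each time-frequency bin to its dominant source.

```python
import numpy as np

def binary_tf_masks(source_mags):
    """source_mags: (n_sources, n_freq, n_frames) estimated source
    magnitudes. Assign each time-frequency bin to the dominant source
    and return one binary mask per source."""
    dominant = np.argmax(source_mags, axis=0)          # (n_freq, n_frames)
    n_sources = source_mags.shape[0]
    return np.stack([(dominant == s).astype(float) for s in range(n_sources)])

def apply_masks(mixture_stft, masks):
    # Each separated signal is the mixture weighted by its own mask.
    return masks * mixture_stft

# Two sources, one frequency bin, two frames: source 0 dominates frame 0,
# source 1 dominates frame 1.
mags = np.array([[[3.0, 0.2]],
                 [[0.5, 4.0]]])
masks = binary_tf_masks(mags)
```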
To process the original voice data effectively, echo cancellation can be performed before voice separation, thereby removing echo data from each channel's voice data. After voice separation yields multi-channel voice data, noise reduction can be applied to each channel's voice data, followed by gain control on the denoised channels; after gain control, wake-word probability judgment is performed on each channel's voice data to determine the original sound, i.e., to perform channel selection.
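The processing order just described (echo cancellation before separation; per-channel noise reduction and gain control after it; wake-word scoring last) can be sketched as a pipeline. Every stage function here is a toy stand-in, not a real implementation:

```python
# Hypothetical sketch of the processing order: echo cancellation runs on
# the raw pickup, separation produces channels, noise reduction and gain
# control run per channel, and wake-word scoring picks the target.
def process(raw, echo_cancel, separate, denoise, agc, wake_score):
    cleaned = echo_cancel(raw)
    channels = separate(cleaned)                     # multi-channel output
    channels = [agc(denoise(ch)) for ch in channels]
    scores = [wake_score(ch) for ch in channels]
    target = scores.index(max(scores))
    return target, channels[target]

# Toy stand-ins, only to show the data flow through the stages.
target, data = process(
    raw="raw",
    echo_cancel=lambda x: x,
    separate=lambda x: ["ch0", "ch1"],
    denoise=lambda c: c,
    agc=lambda c: c,
    wake_score=lambda c: {"ch0": 0.2, "ch1": 0.9}[c],
)
assert target == 1
```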
For example, if channel 2 (i.e., the channel-2 voice data) is determined to be the original sound, then in this voice interaction the voice data of channel 2 is used as the data to be sent to the speech recognition system for voice command recognition.
For wake-word judgment, the probability that a wake word exists in each channel's voice data may be determined, and the channel of voice data with the highest probability is taken as the voice data corresponding to the original sound. The wake word may be one or more pre-set sensitive words. For example, if the interactive device is a TV, the wake word may be "Hello TV"; if the interactive device is a speaker, "Hello speaker"; if the interactive device is a vending machine, "Hello vending machine"; or the device may be given a name, e.g., "miumiu", in which case the wake word may be set to "Hello miumiu", "miumiu", and so on. How the wake word is set can be chosen according to actual needs, and this application does not limit it.
Automatic Gain Control (AGC) is an automatic control method that adjusts a channel's gain automatically with the signal strength. Automatic gain control is a form of limiting output; it adjusts the output signal (for example, of a hearing aid) using an effective combination of linear amplification and compression amplification. When a weak signal is input, the linear amplification path operates to ensure the strength of the output signal; when the input signal reaches a certain strength, the compression amplification path is activated to reduce the output amplitude. In other words, the AGC function can automatically control the gain by changing the input-output compression ratio. Reducing the gain by increasing the AGC voltage is called forward AGC; reducing the gain by decreasing the AGC voltage is called reverse AGC. Forward AGC has strong control capability but requires large control power; the operating point of the controlled amplification stage varies over a wide range, and the impedance at both ends of the amplifier also changes greatly. Reverse AGC requires little control power, and its control range is also small.
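The limiting behaviour described above (linear amplification below a threshold, compression of the excess above it) can be sketched per sample as follows. The threshold, gain, and compression ratio are illustrative values, not taken from this application:

```python
def agc_sample(x, threshold=0.5, linear_gain=2.0, ratio=4.0):
    """Toy AGC for a single sample amplitude: linear amplification below
    `threshold`, compression of the excess above it. All parameter
    values are illustrative."""
    y = abs(x) * linear_gain
    if y > threshold:
        y = threshold + (y - threshold) / ratio   # compress the excess
    return y if x >= 0 else -y

assert agc_sample(0.1) == 0.2                 # weak input: linear path
assert abs(agc_sample(0.5) - 0.625) < 1e-12   # strong input: compressed
```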
When selecting a channel, the selection may be based solely on the wake-word score, or it may combine other data, such as the wake-word duration and the signal-to-noise ratio, to select and output the channel with the highest overall ranking. Which method is used to select the target channel can be chosen according to actual needs, and this application does not specify it.
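Combining the wake-word score with other features such as duration and signal-to-noise ratio can be sketched as a weighted ranking. The weights and feature values here are purely illustrative; the application leaves the combination method open:

```python
def rank_channels(features, weights=(0.6, 0.2, 0.2)):
    """features: list of (wake_score, duration_score, snr_score) tuples,
    one per channel, each normalized to [0, 1]. Returns the index of the
    channel with the highest weighted combination. Weights are illustrative."""
    w1, w2, w3 = weights
    totals = [w1 * a + w2 * b + w3 * c for a, b, c in features]
    return totals.index(max(totals))

# Channel 1 wins on the weighted combination despite a lower duration score.
assert rank_channels([(0.3, 0.9, 0.5), (0.9, 0.6, 0.7)]) == 1
```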
In the process of voice interaction, one user generally triggers one voice interaction: the user first says the wake word, which triggers an interaction flow, and after the interaction ends, an interaction with another user can be triggered. In this example, the voice recognition method described above assumes that the interacting user's position does not move during the interaction; for example, the user stands in front of the TV and says: "Hello TV, please turn up the volume", or stands in front of a vending machine and says: "Hello vending machine, I want a subway ticket from Suzhou Street to Financial Street."
That is, in this flow the user's position hardly moves. Considering that in practice the user may well move, continuing to pick up and recognize voice strictly on the previously determined channel could obstruct the entire interaction. To address this, a voice-based user identification step can be added: after the original-sound channel is determined, when subsequent voice data is acquired from that channel, identity recognition is first performed to determine whether the user is the one corresponding to the original sound. If so, voice data continues to be acquired from that channel and recognized; if not, the channel is determined anew.
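The added identity-check step can be sketched as a simple guard around recognition. All callables are illustrative stand-ins, since the application does not specify a speaker-verification algorithm:

```python
def handle_utterance(channel_id, voice, verify_speaker, recognize, reselect):
    """Guard step: only recognize audio from the locked channel when the
    speaker matches the one who triggered the interaction; otherwise
    re-run channel selection. All callables are illustrative stand-ins."""
    if verify_speaker(voice):
        return recognize(channel_id, voice)
    return reselect(voice)

# Toy usage: the enrolled speaker matches, so recognition proceeds on channel 2.
result = handle_utterance(
    2, "audio-frame",
    verify_speaker=lambda v: True,
    recognize=lambda ch, v: ("recognized", ch),
    reselect=lambda v: ("reselect", None),
)
assert result == ("recognized", 2)
```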
The above voice interaction method is described below with reference to a specific example. As shown in FIG. 4, the voice interaction system includes an enhancement system and a wake-up system. The enhancement system enhances the original voice signals received by the microphone array and outputs multiple sound source signals with relatively high signal-to-noise ratios. FIG. 4 takes two channels of high-SNR sound source signals ("Hello TV", "today's weather") as an example; in actual implementation, two or more channels of sound source signals may be included, i.e., multi-channel voice output and multi-channel selection are supported. The wake-up system determines whether the multi-channel signal output contains a user-predefined wake word, such as "Hello TV", and determines the output channel based on the wake-word score of the voice data output on each channel.
The enhancement system may include: an echo cancellation module, a voice separation and noise reduction module, and a gain control module. The echo cancellation module suppresses sound emitted by the interactive device itself, for example programs or prompt tones played by a TV or speaker. The voice separation and noise reduction module separates each source signal from the mixed signal and suppresses environmental noise, for example stationary noise from air conditioners or microwave ovens. The gain control module automatically adjusts the gain of the output signal so that it meets the input requirements of the wake-word module and of speech recognition.
The voice separation module may use, but is not limited to, one of the following methods to separate voice data:
Method 1: Since different sound sources are produced by different physical processes, the sound source signals can be assumed to be statistically independent. The original voice signal described above is a mixture of multiple source signals, so the signals collected by the individual channels of the microphone array are no longer independent. Therefore, an objective function can be defined that maximizes the independence between the output channels during iteration, thereby achieving speech separation.
Method 2: Since speech signals are sparse in the frequency domain, it can be assumed that only one sound source dominates any given time-frequency point. Accordingly, a time-frequency masking (Mask) method can be defined that separates out and groups together the time-frequency points belonging to the same sound source; the energy changes and covariance matrices of the individual sources are then calculated from each source signal's time-frequency mask, thereby achieving speech separation.
Method 3: Given the known topology of the microphone array, a sound source localization algorithm is used to estimate the azimuth of each of the multiple sound sources; then a beamforming algorithm forms a beam for each sound source, so as to output multi-channel voice signals.
The wake-up system in FIG. 4 may include: a wake-word module and a channel selection module. The wake-word module detects whether a predefined wake word appears in the input signal and gives the wake word a score; a higher score indicates better wake-word signal quality. The channel selection module performs channel selection based on the scores of multiple wake words, as well as other features such as wake-word duration and signal-to-noise ratio, and selects and outputs the channel with the highest overall ranking.
The above solution in this example is based on speech separation technology; it introduces little distortion to overlapping speech and does not rely on localizing the azimuth of the sound source. Even if the target sound and the interfering sound are at the same azimuth, as long as their distances to the microphone array differ, the approach of this application can still process them effectively. Specifically, in this example, the voices mixed in the original sound are separated out using speech separation technology; in the speech enhancement stage, each separated voice gets one channel of output. Then, using voice wake-up technology, the channel with the highest score is selected from among the awakened channels as the channel carrying the target voice, and that channel's voice data is sent to the speech recognition system for processing. Because multi-channel enhanced output is combined with multi-channel wake-word detection and wake-up posterior probability scoring, each channel of enhanced output is a separated voice whose signal-to-noise ratio is effectively improved. This raises the probability that the true wake word is detected by the wake-up algorithm, so that the channel most likely to belong to the target speaker is selected for subsequent operations.
FIG. 5 is a flowchart of an embodiment of the voice processing method described in this application. Although this application provides method steps or device structures as shown in the following embodiments or drawings, the method or device may, conventionally or without creative effort, include more or fewer steps or module units. For steps or structures without a logically necessary causal relationship, the execution order of the steps or the module structure of the device is not limited to the execution order or module structure described in the embodiments of this application and shown in the drawings. When the described method or module structure is applied in an actual device or terminal product, it may be executed sequentially or in parallel according to the method or module structure shown in the embodiments or drawings (for example, in a parallel-processor or multi-threaded environment, or even a distributed processing environment).
Specifically, as shown in FIG. 5, the voice processing method provided by an embodiment of this application may include:
Step 501: Separate the original voice data into multi-channel voice data.
The original voice data may be obtained by performing echo cancellation on the sound data picked up by the microphone array, suppressing the sound emitted by the interactive device itself, for example programs or prompt tones played by a TV or speaker.
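The echo-cancellation step can be illustrated with a minimal normalized-LMS adaptive filter. The application does not specify an algorithm, so NLMS is only an assumed, commonly used choice, and the filter order and step size below are illustrative:

```python
import numpy as np

def nlms_echo_cancel(mic, ref, order=4, mu=0.5, eps=1e-8):
    """Minimal NLMS adaptive echo canceller: `ref` is the signal the
    device itself plays out, `mic` is the microphone pickup. The filter
    estimates the echo path and subtracts the estimated echo. Filter
    order and step size are illustrative."""
    w = np.zeros(order)
    out = np.zeros(len(mic))
    for n in range(len(mic)):
        x = ref[max(0, n - order + 1):n + 1][::-1]   # most recent samples
        x = np.pad(x, (0, order - len(x)))
        e = mic[n] - w @ x                           # subtract estimated echo
        w += mu * e * x / (x @ x + eps)              # NLMS weight update
        out[n] = e
    return out

# Toy echo path: the microphone hears the playback scaled by 0.5.
ref = np.ones(50)
residual = nlms_echo_cancel(0.5 * ref, ref)
```

For this toy constant signal the residual echo decays geometrically toward zero as the filter adapts.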
Step 502: Determine the credibility of each channel's voice data in the multi-channel voice data.
Specifically, the credibility may be determined according to at least one of the following: the signal quality of a predetermined phrase, the signal-to-noise ratio of the voice data, and the duration of the predetermined phrase. The predetermined phrase may be a wake word, and the wake word may be one or more pre-set sensitive words. For example, if the interactive device is a TV, the wake word may be "Hello TV"; if the interactive device is a speaker, "Hello speaker"; if the interactive device is a vending machine, "Hello vending machine"; or the device may be given a name, e.g., "miumiu", in which case the wake word may be set to "Hello miumiu", "miumiu", and so on. How the wake word is set can be chosen according to actual needs, and this application does not limit it.
Step 503: Use the channel corresponding to the voice data with the highest credibility as the target channel.
Step 504: Perform speech recognition on the voice data of the target channel.
In the above example, voice separation technology is used to separate the voices mixed in the original sound, and each separated voice is output on its own channel. The credibility of each channel's signal is then determined, the channel with the highest credibility is selected as the channel carrying the target voice, and the voice data of that channel is sent to the speech recognition system for processing. This reduces the influence of noise on the recognition of voice data, solves the existing problem of low speech recognition accuracy caused by interfering sounds such as noise, achieves accurate recognition of voice data, and enables more effective human-machine voice interaction.
In step 504 above, performing speech recognition on the voice data of the target channel may include converting the voice data of the target channel into text content, identifying the intent of the text content, and then generating feedback data according to the intent. For example, if the voice data is "Hello TV, please adjust the volume to 50", the identified intent is to adjust the volume, and the corresponding generated data may be operation data for adjusting the volume; voice data for feedback to the user may also be generated, for example, "Master, the adjustment is complete", and so on.
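The text-to-intent-to-feedback step could be illustrated with a toy keyword matcher; the intent name, regular-expression pattern, and reply strings below are invented for the sketch (a real system would use an NLU model, not regular expressions).

```python
import re

# Minimal illustrative intent recognizer for the "adjust the volume" example.
# Intent names, patterns and feedback strings are assumptions for this sketch.

def recognize_intent(text):
    m = re.search(r"volume to (\d+)", text.lower())
    if m:
        level = int(m.group(1))
        return {
            "intent": "set_volume",
            "operation": {"action": "set_volume", "level": level},  # device command
            "reply": f"Volume has been set to {level}.",            # spoken feedback
        }
    return {"intent": "unknown", "operation": None,
            "reply": "Sorry, I did not understand that."}

result = recognize_intent("Hello TV, please adjust the volume to 50")
print(result["intent"], result["operation"]["level"])  # set_volume 50
```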
The foregoing uses a television as the interactive device. In actual implementation, the interactive device may also be another device, such as a smart speaker or a smart vending machine, which is not limited in this application.
In implementation, the original voice data may be separated into multi-channel voice data in, but not limited to, one of the following ways:
1) iteratively computing on the original voice data with an objective function that maximizes the independence between the output channels, to obtain the multi-channel voice data;
2) separating and classifying the time-frequency points in the original voice data that belong to the same sound source, determining multiple sound-source signals, and computing the energy variation and covariance matrix of each sound source from the time-frequency mask of each sound-source signal, to obtain the multi-channel voice data;
or, 3) obtaining the topology of the microphone array, determining the azimuth of each of the multiple sound sources with a sound-source localization algorithm, and forming a beam for each sound source with a beamforming algorithm, to obtain the multi-channel voice data.
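For way 3), one classical beamformer that fits the description is delay-and-sum: steer toward a source's azimuth by delaying each microphone so the source's wavefront is aligned, then average. The sketch below assumes a two-microphone linear array, a 0.1 m spacing, a 16 kHz sample rate, and integer-sample delays; none of these values come from the application.

```python
import math

# Illustrative delay-and-sum beamformer for a 2-microphone linear array.
# Geometry, sample rate and steering angle are assumptions for this sketch.

def steering_delays(n_mics, spacing_m, angle_deg, fs, c=343.0):
    """Integer sample delays that steer the beam toward angle_deg."""
    tau = spacing_m * math.sin(math.radians(angle_deg)) / c  # inter-mic delay, s
    base = [round(m * tau * fs) for m in range(n_mics)]
    lo = min(base)
    return [d - lo for d in base]  # shift so all delays are non-negative

def delay_and_sum(mic_signals, delays):
    """Align each microphone signal by its integer delay and average them."""
    n = min(len(s) - d for s, d in zip(mic_signals, delays))
    return [sum(s[d + i] for s, d in zip(mic_signals, delays)) / len(mic_signals)
            for i in range(n)]

delays = steering_delays(2, spacing_m=0.1, angle_deg=30.0, fs=16000)  # [0, 2]

# Source arrives at mic 1 two samples later than at mic 0.
src = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
mic0 = src
mic1 = [0.0, 0.0] + src[:-2]
out = delay_and_sum([mic0, mic1], delays)
print(out[:4])  # the two aligned copies reinforce: [0.0, 1.0, 0.0, -1.0]
```

Forming one such beam per localized source yields one output channel per source, consistent with the multi-channel output described above.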
Specifically, after the original voice data is separated into multi-channel voice data, noise reduction and gain control may also be performed on each channel of voice data in the multi-channel voice data. The noise reduction suppresses environmental noise, for example, stationary noise from air conditioners, microwave ovens, and the like. The gain control automatically adjusts the gain of the output signal so that the output signal meets the input requirements of the wake-up word module and of speech recognition.
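A minimal sketch of such gain control is RMS normalization toward a target level; the target level and gain cap are assumptions for the example, and the noise-reduction stage mentioned above is omitted for brevity.

```python
import math

# Illustrative automatic gain control: scale a channel so its RMS level
# matches an assumed target, with a cap so silence is not over-amplified.

def rms(samples):
    return math.sqrt(sum(x * x for x in samples) / len(samples))

def apply_agc(samples, target_rms=0.1, max_gain=20.0):
    level = rms(samples)
    if level == 0.0:
        return samples[:]                      # pure silence: leave untouched
    gain = min(target_rms / level, max_gain)   # cap the boost
    return [x * gain for x in samples]

quiet = [0.01, -0.01, 0.01, -0.01]  # RMS = 0.01, well below the target
boosted = apply_agc(quiet)
print(round(rms(boosted), 3))  # 0.1
```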
This application also provides a voice processing method. As shown in FIG. 6, the method may include the following steps:
Step 601: Separate the original voice data into one or more channels of voice data.
Step 602: Determine the credibility of each separated channel of voice data.
Step 603: Perform speech recognition on the voice data with the highest credibility.
That is, after the original voice data is separated into multiple channels of voice data, the voice data with the highest credibility is selected for speech recognition, thereby solving the problem of low recognition accuracy caused by noise and the like in existing speech recognition.
Specifically, in step 602 above, the credibility of each separated channel of voice data may be determined based on a wake-up word. For example, for each channel of voice data, whether a predefined wake-up word appears may be detected and a wake-up word score determined; a score of the voice data is then determined according to the wake-up word score; and the determined score of the voice data is used as the credibility of the voice data.
To improve the accuracy of the credibility determination, the credibility may further be determined in combination with the signal-to-noise ratio and other factors. For example, the wake-up word duration and the signal-to-noise ratio may be obtained, and the score of the voice data may be computed from the wake-up word score, the wake-up word duration, and the signal-to-noise ratio.
Considering that in actual implementation the acquired sound data may contain echo, before the original voice data is separated into one or more channels of voice data, sound data may be acquired and echo cancellation performed on it to obtain the original voice data, thereby effectively eliminating the influence of echo.
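One standard way to realize such echo cancellation (a sketch under assumptions, not the implementation claimed here) is a normalized-LMS adaptive filter: the device's own playback signal serves as the reference, an FIR filter estimates the unknown echo path, and the estimated echo is subtracted from the microphone signal. The filter length, step size, and toy echo path below are invented for the example.

```python
import random

# Illustrative NLMS echo canceller. mic = near-end microphone signal,
# ref = loudspeaker (reference) signal; the filter w adapts toward the
# unknown echo path and the residual e is the echo-cancelled output.

def nlms_echo_cancel(mic, ref, taps=4, mu=0.5, eps=1e-6):
    w = [0.0] * taps
    out = []
    for n in range(len(mic)):
        x = [ref[n - k] if n - k >= 0 else 0.0 for k in range(taps)]
        echo_hat = sum(wk * xk for wk, xk in zip(w, x))
        e = mic[n] - echo_hat                       # residual after echo removal
        norm = sum(xk * xk for xk in x) + eps
        w = [wk + mu * e * xk / norm for wk, xk in zip(w, x)]
        out.append(e)
    return out

random.seed(0)
ref = [random.uniform(-1, 1) for _ in range(2000)]   # playback signal
echo_path = [0.5, -0.3, 0.2, 0.1]                    # assumed room response
mic = [sum(h * (ref[n - k] if n - k >= 0 else 0.0)
           for k, h in enumerate(echo_path)) for n in range(len(ref))]

residual = nlms_echo_cancel(mic, ref)
early = sum(e * e for e in residual[:200])
late = sum(e * e for e in residual[-200:])
print(late < early * 1e-3)  # True: the echo is almost fully cancelled
```

In this toy setup the microphone contains only echo, so the residual decays toward zero once the filter has converged; with real near-end speech present, the residual would retain the speech while removing the echo.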
In implementation, step 601 above may separate the original voice data into multi-channel voice data in one of the following ways:
Way 1: iteratively computing on the original voice data with an objective function that maximizes the independence between the output channels, to obtain the multi-channel voice data; or,
Way 2: separating and classifying the time-frequency points in the original voice data that belong to the same sound source, determining multiple sound-source signals, and computing the energy variation and covariance matrix of each sound source from the time-frequency mask of each sound-source signal, to obtain the multi-channel voice data; or,
Way 3: obtaining the topology of the microphone array, determining the azimuth of each of the multiple sound sources with a sound-source localization algorithm, and forming a beam for each sound source with a beamforming algorithm, to obtain the multi-channel voice data.
Specifically, after the original voice data is separated into multi-channel voice data, noise reduction and gain control may also be performed on each channel of voice data in the multi-channel voice data. The noise reduction suppresses environmental noise, for example, stationary noise from air conditioners, microwave ovens, and the like. The gain control automatically adjusts the gain of the output signal so that the output signal meets the input requirements of the wake-up word module and of speech recognition. In implementation, the noise reduction may be performed first, followed by the gain control.
The method embodiments provided in this application may be executed in a mobile terminal, a computer terminal, or a similar computing device. Taking execution on a computer terminal as an example, FIG. 7 is a block diagram of the hardware structure of a computer terminal for a voice processing method according to an embodiment of the present invention. As shown in FIG. 7, the computer terminal 10 may include one or more processors 102 (only one is shown in the figure; the processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 104 for storing data, and a transmission module 106 for communication functions. A person of ordinary skill in the art can understand that the structure shown in FIG. 7 is only illustrative and does not limit the structure of the above electronic device. For example, the computer terminal 10 may include more or fewer components than shown in FIG. 7, or have a configuration different from that shown in FIG. 7.
The memory 104 may be used to store software programs and modules of application software, such as the program instructions/modules corresponding to the voice processing method in the embodiment of the present invention. The processor 102 executes the software programs and modules stored in the memory 104 so as to perform various functional applications and data processing, that is, to implement the voice processing method of the above application program. The memory 104 may include high-speed random access memory, and may further include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory disposed remotely with respect to the processor 102, and such remote memory may be connected to the computer terminal 10 through a network. Examples of the above network include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
The transmission module 106 is configured to receive or send data via a network. A specific example of the above network may include a wireless network provided by the communication provider of the computer terminal 10. In one example, the transmission module 106 includes a network interface controller (NIC), which may be connected to other network devices through a base station so as to communicate with the Internet. In another example, the transmission module 106 may be a radio frequency (RF) module, which is used to communicate with the Internet wirelessly.
At the software level, the above voice recognition apparatus, as shown in FIG. 8, may include a separation module 801, a first determination module 802, a second determination module 803, and a recognition module 804, wherein:
the separation module 801 is configured to separate original voice data into multi-channel voice data;
the first determination module 802 is configured to determine the credibility of each channel of voice data in the multi-channel voice data;
the second determination module 803 is configured to use the channel corresponding to the voice data with the highest credibility as the target channel;
the recognition module 804 is configured to perform speech recognition on the voice data of the target channel.
In one embodiment, the first determination module 802 may determine the credibility according to at least one of the following: the signal quality of a predetermined phrase, the signal-to-noise ratio of the voice data, and the duration for which the predetermined phrase appears.
In one embodiment, the above apparatus may further include a pickup-and-cancellation module, configured to pick up sound data through a microphone array before the original voice data is separated into multi-channel voice data, and to perform echo cancellation on the sound data to obtain the original voice data.
In one embodiment, the recognition module 804 may be specifically configured to convert the voice data of the target channel into text content, identify the intent of the text content, and generate feedback data according to the intent.
In one embodiment, the separation module 801 may, but is not limited to, separate the original voice data into multi-channel voice data in one of the following ways:
1) iteratively computing on the original voice data with an objective function that maximizes the independence between the output channels, to obtain the multi-channel voice data;
2) separating and classifying the time-frequency points in the original voice data that belong to the same sound source, determining multiple sound-source signals, and computing the energy variation and covariance matrix of each sound source from the time-frequency mask of each sound-source signal, to obtain the multi-channel voice data;
3) obtaining the topology of the microphone array, determining the azimuth of each of the multiple sound sources with a sound-source localization algorithm, and forming a beam for each sound source with a beamforming algorithm, to obtain the multi-channel voice data.
In one embodiment, after the original voice data is separated into multi-channel voice data, the above apparatus may further perform noise reduction and gain control on each channel of voice data in the multi-channel voice data.
This example also provides a smart TV. The smart TV may include a processor and a memory for storing processor-executable instructions, and the processor, when executing the instructions, implements:
separating original voice data into multi-channel voice data;
determining the credibility of each channel of voice data in the multi-channel voice data;
using the channel corresponding to the voice data with the highest credibility as the target channel;
performing speech recognition on the voice data of the target channel.
In the above embodiment, the original voice data is separated into multi-channel voice data, the credibility of each channel of voice data in the multi-channel voice data is determined, the channel corresponding to the voice data with the highest credibility is used as the target channel, and speech recognition is performed on the voice data of the target channel. This reduces the influence of noise on the recognition of voice data, solves the existing problem of low speech recognition accuracy caused by interfering sounds such as noise, achieves accurate recognition of voice data, and enables more effective human-machine voice interaction.
Although this application provides the method operation steps described in the embodiments or flowcharts, more or fewer operation steps may be included based on routine or non-inventive labor. The order of steps listed in the embodiments is only one of many possible execution orders and does not represent the only execution order. When an actual device or client product executes, the steps may be executed sequentially or in parallel according to the method order shown in the embodiments or drawings (for example, in a parallel-processor or multi-threaded environment).
The apparatuses or modules described in the above embodiments may be implemented by a computer chip or entity, or by a product having a certain function. For convenience of description, the above apparatus is described with its functions divided into various modules. When implementing this application, the functions of the modules may be implemented in one or more pieces of software and/or hardware. Of course, a module that implements a certain function may also be implemented by a combination of multiple sub-modules or sub-units.
The methods, apparatuses, or modules described in this application may be implemented by a controller with computer-readable program code, in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of such controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicone Labs C8051F320. A memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art also know that, in addition to implementing the controller purely with computer-readable program code, the method steps may be logically programmed so that the controller achieves the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Therefore, such a controller can be regarded as a hardware component, and the means included in it for implementing various functions can also be regarded as structures within the hardware component.
Or even, the means for implementing various functions can be regarded as both software modules implementing the method and structures within the hardware component.
Some modules of the apparatus described in this application may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, classes, and the like that perform specific tasks or implement specific abstract data types. This application may also be practiced in distributed computing environments, where tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
From the description of the above embodiments, those skilled in the art can clearly understand that this application may be implemented by software plus the necessary hardware. Based on this understanding, the technical solution of this application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product, or may be embodied in the implementation process of data migration. The computer software product may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes a number of instructions causing a computer device (which may be a personal computer, a mobile terminal, a server, a network device, or the like) to execute the methods described in the embodiments of this application or in certain parts of the embodiments.
The embodiments in this specification are described in a progressive manner; for the same or similar parts between the embodiments, reference may be made to each other, and each embodiment focuses on its differences from the other embodiments. All or part of this application may be used in many general-purpose or special-purpose computer system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, mobile communication terminals, multiprocessor systems, microprocessor-based systems, programmable electronic devices, network PCs, minicomputers, mainframe computers, and distributed computing environments including any of the above systems or devices.
Although this application has been described through embodiments, those of ordinary skill in the art know that there are many variations and changes of this application without departing from its spirit, and it is intended that the appended claims cover these variations and changes without departing from the spirit of this application.

Claims (18)

  1. A voice processing method, wherein the method comprises:
    separating original voice data into one or more channels of voice data;
    determining the credibility of each separated channel of voice data;
    performing speech recognition on the voice data with the highest credibility.
  2. The method according to claim 1, wherein determining the credibility of each separated channel of voice data comprises:
    for each channel of voice data, detecting whether a predefined wake-up word appears, and determining a wake-up word score;
    determining a score of the voice data according to the wake-up word score;
    using the determined score of the voice data as the credibility of the voice data.
  3. The method according to claim 2, wherein determining the score of the voice data according to the wake-up word score comprises:
    obtaining a wake-up word duration and a signal-to-noise ratio;
    computing the score of the voice data according to the wake-up word score, the wake-up word duration, and the signal-to-noise ratio.
  4. The method according to claim 1, wherein before separating the original voice data into one or more channels of voice data, the method further comprises:
    acquiring sound data;
    performing echo cancellation on the sound data to obtain the original voice data.
  5. The method according to claim 1, wherein the original voice data is separated into multi-channel voice data by:
    iteratively computing on the original voice data with an objective function that maximizes the independence between output channels, to obtain the multi-channel voice data; or,
    separating and classifying time-frequency points in the original voice data that belong to the same sound source, determining multiple sound-source signals, and computing the energy variation and covariance matrix of each sound source from the time-frequency mask of each sound-source signal, to obtain the multi-channel voice data.
  6. The method according to claim 1, wherein separating the original voice data into multi-channel voice data comprises:
    obtaining the topology of a microphone array, determining the azimuth of each of multiple sound sources with a sound-source localization algorithm, and forming a beam for each sound source with a beamforming algorithm, to obtain the multi-channel voice data.
  7. The method according to claim 1, wherein after separating the original voice data into one or more channels of voice data, the method further comprises:
    performing noise reduction and/or gain control on each separated channel of voice data.
  8. An electronic device, comprising a processor and a memory for storing processor-executable instructions, wherein the processor, when executing the instructions, implements:
    separating original voice data into one or more channels of voice data;
    determining the credibility of each channel of voice data;
    performing speech recognition on the voice data with the highest credibility.
  9. The device according to claim 8, wherein the processor determining the credibility of each separated channel of voice data comprises:
    for each channel of voice data, detecting whether a predefined wake-up word appears, and determining a wake-up word score;
    determining a score of the voice data according to the wake-up word score;
    using the determined score of the voice data as the credibility of the voice data.
  10. The device according to claim 9, wherein the processor determining the score of the voice data according to the wake-up word score comprises:
    obtaining a wake-up word duration and a signal-to-noise ratio;
    computing the score of the voice data according to the wake-up word score, the wake-up word duration, and the signal-to-noise ratio.
  11. The device according to claim 8, wherein before separating the original voice data into one or more channels of voice data, the processor is further configured to:
    acquire sound data;
    perform echo cancellation on the sound data to obtain the original voice data.
  12. The device according to claim 8, wherein the processor separates the original voice data into multi-channel voice data by:
    iteratively computing on the original voice data with an objective function that maximizes the independence between output channels, to obtain the multi-channel voice data; or,
    separating and classifying time-frequency points in the original voice data that belong to the same sound source, determining multiple sound-source signals, and computing the energy variation and covariance matrix of each sound source from the time-frequency mask of each sound-source signal, to obtain the multi-channel voice data.
  13. The device according to claim 8, wherein the processor separating the original voice data into multi-channel voice data comprises:
    obtaining the topology of a microphone array, determining the azimuth of each of multiple sound sources with a sound-source localization algorithm, and forming a beam for each sound source with a beamforming algorithm, to obtain the multi-channel voice data.
  14. The device according to claim 8, wherein after separating the original voice data into one or more channels of voice data, the processor is further configured to:
    perform noise reduction and/or gain control on each separated channel of voice data.
  15. A display device, comprising a processor and a memory for storing processor-executable instructions, wherein the processor, when executing the instructions, implements:
    separating original voice data into one or more channels of voice data;
    determining the credibility of each separated channel of voice data;
    performing speech recognition on the voice data with the highest credibility.
  16. A data processing system, comprising an enhancement module and a wake-up module, wherein:
    the enhancement module is configured to separate original voice data into one or more channels of voice data;
    the wake-up module is configured to determine the credibility of each separated channel of voice data, and to perform speech recognition on the voice data with the highest credibility.
  17. The system according to claim 16, wherein the enhancement module comprises:
    an echo cancellation unit, configured to perform echo cancellation on acquired sound data to obtain the original voice data;
    a voice separation unit, configured to separate the original voice data into one or more channels of voice data;
    a noise reduction unit, configured to perform noise reduction on the separated one or more channels of voice data;
    a gain control unit, configured to perform gain control on the noise-reduced data.
  18. A computer-readable storage medium having computer instructions stored thereon, wherein the instructions, when executed, implement the steps of the method according to any one of claims 1 to 7.
PCT/CN2019/104081 2018-09-03 2019-09-03 Voice processing method, electronic device and display device WO2020048431A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811020120.2 2018-09-03
CN201811020120.2A CN110875045A (en) 2018-09-03 2018-09-03 Voice recognition method, intelligent device and intelligent television

Publications (1)

Publication Number Publication Date
WO2020048431A1 2020-03-12

Family

ID=69716878

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/104081 WO2020048431A1 (en) 2018-09-03 2019-09-03 Voice processing method, electronic device and display device

Country Status (2)

Country Link
CN (1) CN110875045A (en)
WO (1) WO2020048431A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111402883B (en) * 2020-03-31 2023-05-26 云知声智能科技股份有限公司 Nearby response system and method in distributed voice interaction system under complex environment
CN111615035B (en) * 2020-05-22 2021-05-14 歌尔科技有限公司 Beam forming method, device, equipment and storage medium
CN112397083A (en) * 2020-11-13 2021-02-23 Oppo广东移动通信有限公司 Voice processing method and related device
CN113555033A (en) * 2021-07-30 2021-10-26 乐鑫信息科技(上海)股份有限公司 Automatic gain control method, device and system of voice interaction system
CN113608449B (en) * 2021-08-18 2023-09-15 四川启睿克科技有限公司 Speech equipment positioning system and automatic positioning method in smart home scene
CN113782024B (en) * 2021-09-27 2024-03-12 上海互问信息科技有限公司 Method for improving accuracy of automatic voice recognition after voice awakening
CN114220454B (en) * 2022-01-25 2022-12-09 北京荣耀终端有限公司 Audio noise reduction method, medium and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100217590A1 (en) * 2009-02-24 2010-08-26 Broadcom Corporation Speaker localization system and method
CN106531179A (en) * 2015-09-10 2017-03-22 中国科学院声学研究所 Multi-channel speech enhancement method based on semantic prior selective attention
CN107464565A (en) * 2017-09-20 2017-12-12 百度在线网络技术(北京)有限公司 A kind of far field voice awakening method and equipment
CN108109617A (en) * 2018-01-08 2018-06-01 深圳市声菲特科技技术有限公司 A kind of remote pickup method
CN108122563A (en) * 2017-12-19 2018-06-05 北京声智科技有限公司 Improve voice wake-up rate and the method for correcting DOA

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2115743A1 (en) * 2007-02-26 2009-11-11 QUALCOMM Incorporated Systems, methods, and apparatus for signal separation
US8831936B2 (en) * 2008-05-29 2014-09-09 Qualcomm Incorporated Systems, methods, apparatus, and computer program products for speech signal processing using spectral contrast enhancement
CN104637494A (en) * 2015-02-02 2015-05-20 哈尔滨工程大学 Double-microphone mobile equipment voice signal enhancing method based on blind source separation
CN104882140A (en) * 2015-02-05 2015-09-02 宇龙计算机通信科技(深圳)有限公司 Voice recognition method and system based on blind signal extraction algorithm
CN108447498B (en) * 2018-03-19 2022-04-19 中国科学技术大学 Speech enhancement method applied to microphone array


Also Published As

Publication number Publication date
CN110875045A (en) 2020-03-10

Similar Documents

Publication Publication Date Title
WO2020048431A1 (en) Voice processing method, electronic device and display device
CN108351872B (en) Method and system for responding to user speech
US20210005197A1 (en) Detecting Self-Generated Wake Expressions
CN107464564B (en) Voice interaction method, device and equipment
US9940949B1 (en) Dynamic adjustment of expression detection criteria
CN109599124B (en) Audio data processing method and device and storage medium
EP3923273B1 (en) Voice recognition method and device, storage medium, and air conditioner
CN111192591B (en) Awakening method and device of intelligent equipment, intelligent sound box and storage medium
CN106898348B (en) Dereverberation control method and device for sound production equipment
US20200058293A1 (en) Object recognition method, computer device, and computer-readable storage medium
US9494683B1 (en) Audio-based gesture detection
CN107450390B (en) intelligent household appliance control device, control method and control system
WO2020228270A1 (en) Speech processing method and device, computer device and storage medium
CN110265020B (en) Voice wake-up method and device, electronic equipment and storage medium
WO2020062900A1 (en) Sound processing method, apparatus and device
WO2020088153A1 (en) Speech processing method and apparatus, storage medium and electronic device
TWI711035B (en) Method, device, audio interaction system, and storage medium for azimuth estimation
CN110610718B (en) Method and device for extracting expected sound source voice signal
US11393490B2 (en) Method, apparatus, device and computer-readable storage medium for voice interaction
CN110517702A (en) The method of signal generation, audio recognition method and device based on artificial intelligence
CN112382279B (en) Voice recognition method and device, electronic equipment and storage medium
WO2023103693A1 (en) Audio signal processing method and apparatus, device, and storage medium
CN112666522A (en) Awakening word sound source positioning method and device
US11659332B2 (en) Estimating user location in a system including smart audio devices
CN112017662A (en) Control instruction determination method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19858306

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19858306

Country of ref document: EP

Kind code of ref document: A1