WO2020048431A1 - Voice processing method, electronic device and display device - Google Patents

Voice processing method, electronic device and display device

Info

Publication number
WO2020048431A1
WO2020048431A1 · PCT/CN2019/104081 · CN2019104081W
Authority
WO
WIPO (PCT)
Prior art keywords
voice data
data
channel
speech
original
Prior art date
Application number
PCT/CN2019/104081
Other languages
French (fr)
Chinese (zh)
Inventor
纳跃跃
刘鑫
刘勇
高杰
付强
Original Assignee
阿里巴巴集团控股有限公司
Application filed by 阿里巴巴集团控股有限公司 filed Critical 阿里巴巴集团控股有限公司
Publication of WO2020048431A1 publication Critical patent/WO2020048431A1/en

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification
    • G10L17/22: Interactive procedures; Man-machine interfaces
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272: Voice signal separating

Definitions

  • the present application belongs to the field of Internet technology, and particularly relates to a voice processing method, an electronic device, and a display device.
  • In human-machine voice interaction, the intelligent device converts a voice command into text through speech recognition technology, then understands the intention of the command through semantic understanding technology and gives corresponding feedback.
  • The premise of human-machine voice interaction is that the machine must be able to hear the content of the voice command clearly.
  • the purpose of this application is to provide a voice processing method, an electronic device, and a display device, which can realize accurate recognition of voice data, and thus can obtain more effective human-machine voice interaction.
  • the present application provides a voice processing method, an electronic device, and a display device as follows:
  • a speech processing method includes:
  • An electronic device includes a processor and a memory for storing processor-executable instructions. When the processor executes the instructions, the processor implements:
  • a display device includes a processor and a memory for storing processor-executable instructions. When the processor executes the instructions, the display device implements:
  • a computer-readable storage medium stores computer instructions thereon, the steps of the above method being implemented when the instructions are executed.
  • In the voice processing method, electronic device, and display device provided by the present application, the original voice data is separated into multiple channels of voice data, the credibility of each channel of voice data is determined, and voice recognition is performed on the most credible channel's voice data.
  • This reduces the impact of noise on speech recognition, solves the existing problem of low speech recognition accuracy caused by interference sounds such as noise, achieves accurate recognition of voice data, and enables more effective human-machine voice interaction.
  • FIG. 1 is a schematic diagram of conventional sound transmission.
  • FIG. 2 is a schematic diagram of a voice interaction system architecture provided by this application.
  • FIG. 3 is a schematic diagram of channel separation and determination provided in the present application.
  • FIG. 4 is a schematic structural diagram of a voice processing process provided by the present application.
  • FIG. 5 is a flowchart of a voice processing method provided by the present application.
  • FIG. 6 is a flowchart of another method of speech recognition provided by the present application.
  • FIG. 7 is a schematic structural diagram of a computing terminal provided by the present application.
  • FIG. 8 is a structural block diagram of a voice interaction device provided by the present application.
  • If noise signals such as device echo, non-target speech, and environmental noise can be suppressed, thereby improving the signal-to-noise ratio of the target speech, then speech recognition accuracy and human-machine voice interaction efficiency can be effectively improved.
  • In this application, a multi-channel signal is obtained through speech separation, the channel that interacts with the smart device is determined based on the credibility of the wake word in each channel signal, and speech recognition is performed on the channel signal of the original sound source, which can effectively reduce the impact of noise.
  • a voice interaction system which may include: User X, a noise source, and an interactive device.
  • The interactive device includes a processor configured to complete multi-channel voice separation and wake-word-based channel selection.
  • the above user X is a user who performs voice interaction with the interactive device.
  • User X sends out interactive voices, for example: Hello TV, help me turn up the TV sound to 50.
  • the sound collector may be a microphone array.
  • the microphone array is used to collect sound and provide the collected sound to the processor for processing.
  • the processor processes the sound.
  • The microphone array may have a specifically shaped or a regularly shaped structure; the specific structure of the microphone array is not limited in this application and may be selected and set according to actual needs.
  • The voices mixed in the original sound are separated using speech separation technology, and each separated voice is output on its own channel.
  • Using voice wake-up technology, the channel with the highest wake-word score is selected from the awakened channels as the target voice, and its data is then sent to the speech recognition system for voice recognition.
  • The user's original voice data (that is, the voice data collected by the microphone array) is obtained.
  • The original voice data includes the user's original voice (that is, the voice data that needs to be acquired) together with noise data (for example, interference from other users speaking and other sounds).
  • Through voice separation processing, the processor can separate the original voice data into multi-channel voice data: the first channel voice data, the second channel voice data, the third channel voice data, the fourth channel voice data, and so on.
  • Suppose the wake word is scored 20 in the first channel of voice data, 98 in the second channel, 50 in the third channel, and 35 in the fourth channel.
  • It can then be determined that the second channel of voice data is the original sound, and the voice data of the second channel is sent to the speech recognition system as the determined original sound.
  • The speech recognition system converts the voice command into text through speech recognition technology, understands the intention of the command through semantic understanding technology, and gives corresponding feedback.
  • the separation can be performed in one of the following ways, but not limited to:
  • Method 1: Since different sound sources are generated by different physical processes, it can be assumed that different sound source signals are statistically independent. Because the original speech signal is a mixture of multiple source signals, the signals collected by the channels of the microphone array are no longer independent; an objective function can therefore be defined that maximizes the independence between the output channels during iteration, achieving speech separation.
  • Method 2: Since the speech signal is sparse in the frequency domain, it can be assumed that only one sound source is dominant at any given time-frequency point. A time-frequency masking method can therefore be used: the time-frequency points belonging to the same sound source are separated and classified, and the energy and covariance matrix of each source are calculated from its time-frequency mask to achieve speech separation.
  • Method 3: With the topology of the microphone array known, a sound source localization algorithm estimates the azimuth of each of the multiple sound sources, and a beamforming algorithm then forms a beam toward each sound source to output multi-channel voice signals.
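As an illustration of Method 3, a minimal delay-and-sum beamformer for a linear microphone array can be sketched as follows. This is a hedged sketch under a far-field assumption with integer-sample delays; the function and parameter names are illustrative, not from the patent:

```python
import numpy as np

def delay_and_sum(mic_signals, mic_positions, azimuth, fs, c=343.0):
    """Steer a beam toward `azimuth` (radians) for a linear array.

    mic_signals:   (n_mics, n_samples) time-domain signals.
    mic_positions: (n_mics,) microphone x-coordinates in metres.
    Far-field model with integer-sample delays (illustrative only).
    """
    # Relative arrival delay of the target wavefront at each microphone.
    delays = mic_positions * np.cos(azimuth) / c            # seconds
    sample_delays = np.round(delays * fs).astype(int)
    sample_delays -= sample_delays.min()                    # non-negative shifts

    n_mics, n_samples = mic_signals.shape
    out = np.zeros(n_samples)
    for m in range(n_mics):
        d = sample_delays[m]
        # Align the target source across channels, then average:
        # coherent for the target, incoherent for interferers.
        out[:n_samples - d] += mic_signals[m, d:]
    return out / n_mics
```

Running one such beamformer per estimated source azimuth yields the multi-channel output the text describes.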
  • the echo cancellation processing can be performed on the data before the voice separation, thereby eliminating the echo data in the voice data of each channel.
  • The voice data of each channel can be subjected to noise reduction processing, followed by gain control. After gain control, a wake-word probability judgment is performed on each channel of voice data to determine the original sound, that is, to realize channel selection.
  • Channel 2 (that is, the second channel voice data) is determined as the original sound, and the voice data of channel 2 is used as the data for voice instruction recognition to be transmitted to the speech recognition system.
  • the probability of the presence of the wake word in the voice data of each channel may be determined, and the voice data of the channel with the highest probability is used as the voice data corresponding to the original sound.
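The selection step just described reduces to an argmax over per-channel wake-word scores. A toy sketch (the scoring itself would come from the wake-word engine; names here are illustrative):

```python
def select_original_channel(wake_scores):
    """Return the channel id with the highest wake-word score.

    wake_scores: dict mapping channel id -> wake-word score
    (or probability). The highest-scoring channel is taken as
    the channel carrying the original sound.
    """
    return max(wake_scores, key=wake_scores.get)

# Scores from the example above: channel 2 wins with a score of 98.
example = {1: 20, 2: 98, 3: 50, 4: 35}
```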
  • The wake word can be a sensitive phrase set in advance. For example, if the interactive device is a TV, the wake word can be "Hello TV"; if the interactive device is a speaker, the wake word can be "Hello speaker"; if the interactive device is a vending machine, the wake word can be "Hello vending machine". The device can also be given a name, for example "miumiu", in which case the wake word can be set to "Hello miumiu", or simply "miumiu", and so on.
  • how to set the wake word can be selected according to actual needs, which is not limited in this application.
  • Automatic gain control (AGC) is an automatic control method that adjusts the gain of a channel according to its signal strength.
  • AGC is a kind of limited-output control: it adjusts the output signal (as in hearing aids) by effectively combining linear amplification and compression amplification.
  • For weak input signals, the linear amplification path works to ensure the strength of the output signal; when the input signal reaches a certain intensity, the compression amplification path is engaged to reduce the output amplitude.
  • The AGC function automatically controls the gain amplitude by changing the input-output compression ratio.
  • One approach increases the AGC voltage to reduce the gain; this is called forward AGC. The other reduces the gain by reducing the AGC voltage; this is called reverse AGC.
  • Forward AGC has strong control capability but requires large control power; the working-point range of the controlled amplifier is large, and the impedance at both ends of the amplifier also changes significantly.
  • Reverse AGC requires little control power, but its control range is also small.
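A minimal block-wise AGC along these lines might look as follows. This is a hedged sketch: capping the gain stands in for the linear-amplification stage, and loud blocks automatically receive a gain below 1, which acts like the compression stage; the parameter values are illustrative assumptions:

```python
import numpy as np

def agc_block(signal, target_rms=0.1, max_gain=10.0):
    """Per-block automatic gain control (illustrative parameters).

    Weak blocks are amplified toward `target_rms` but the gain is
    capped at `max_gain`; strong blocks get a gain below 1, which
    limits the output amplitude like the compression path above.
    """
    rms = np.sqrt(np.mean(signal ** 2))
    if rms == 0.0:
        return signal                      # silence: nothing to normalize
    gain = min(target_rms / rms, max_gain)
    return signal * gain
```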
  • The wake-word score can be used alone, or combined with other data such as the wake-word duration and the signal-to-noise ratio, to select and output the channel with the highest comprehensive ranking. Which method is used to select the target channel may be chosen according to actual needs and is not limited in this application.
  • A user triggers voice interaction by speaking the wake word to start an interaction process; after the interaction ends, an interaction with another user can be triggered.
  • The above voice recognition method assumes that the interacting user's position does not move during the interaction. For example, the user stands in front of the TV and says: "Hello TV, please turn up the volume", or stands in front of the vending machine and says: "Hello vending machine, I want a subway ticket from Suzhou Street to Finance Street."
  • A voice user identification step can be added: after the original-sound channel is determined and subsequent voice data of that channel is acquired, identity recognition is performed first to determine whether the speaker is the user corresponding to the original sound. If so, the channel's voice data continues to be acquired and used for speech recognition; if not, the channel is determined again.
  • The voice interaction system includes an enhancement system and a wake-up system. The enhancement system enhances the original voice signal received by the microphone array and outputs multiple channels of sound source signals with high signal-to-noise ratio.
  • In FIG. 4, two channels of high signal-to-noise-ratio source signals ("hello TV", "today's weather") are taken as an example; in actual implementation, two or more channels of source signals can be included, that is, multi-channel voice output and multi-channel selection are supported.
  • the wake-up system determines whether the multi-channel signal output contains a user-defined wake-up word, such as "hello TV", and determines the output channel based on the wake-up score of the voice data output from each channel signal.
  • the enhancement system may include an echo cancellation module, a voice separation and noise reduction module, and a gain control module.
  • the echo cancellation module is used to suppress the sound emitted by the interactive device itself, such as a program or a prompt sound played by a television or a speaker.
  • The voice separation and noise reduction module is used to separate each source signal from the mixed signal and to suppress environmental noise, such as stationary noise from air conditioners and microwave ovens.
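One common way to implement the noise-suppression part of such a module is spectral subtraction. The sketch below assumes the noise magnitude spectrum has already been estimated from non-speech frames; the function name and spectral-floor value are illustrative, not from the patent:

```python
import numpy as np

def spectral_subtract(frame, noise_mag, floor=0.05):
    """Single-frame spectral subtraction (illustrative sketch).

    frame:     time-domain samples of one analysis frame.
    noise_mag: estimated noise magnitude spectrum
               (len(frame)//2 + 1 bins), assumed to have been
               measured on earlier non-speech frames.
    A small spectral floor is kept to limit musical-noise artifacts.
    """
    spec = np.fft.rfft(frame)
    mag, phase = np.abs(spec), np.angle(spec)
    # Subtract the noise estimate, but never go below the floor.
    clean_mag = np.maximum(mag - noise_mag, floor * mag)
    return np.fft.irfft(clean_mag * np.exp(1j * phase), n=len(frame))
```

In a real front end this would run per overlapping frame with windowing and overlap-add.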
  • a gain control module is used to automatically adjust the gain of the output signal so that the output signal meets the input requirements of the wake word module and speech recognition.
  • The voice separation module may use, but is not limited to, one of the three separation methods described above (independence maximization, time-frequency masking, or sound source localization plus beamforming) to separate the voice data.
  • the wake-up system in FIG. 4 may include: a wake-up word module and a channel selection module.
  • The wake-up word module is configured to detect whether a predefined wake word appears in the input signal and to give a score for the wake word; the higher the score, the better the signal quality. The channel selection module selects the channel based on the scores of the multiple wake words and other characteristics, such as the wake-word duration and signal-to-noise ratio, and outputs the channel with the highest comprehensive ranking.
  • The distortion caused by overlapping speech is small, and the method does not rely on localizing the azimuth of the sound source: even if the target sound and the interference sound are located at the same azimuth, as long as there is a difference in their distances from the microphone array, they can be effectively processed by the method of the present application.
  • The voices mixed in the original sound are separated using voice separation technology, and each separated voice is output on its own channel.
  • Using voice wake-up technology, the channel with the highest score among the awakened channels is selected as the channel where the target voice is located, and the voice data of that channel is sent to the speech recognition system for processing. Because multi-channel enhanced output is combined with multi-channel wake-word detection and wake-up posterior probability scoring, each channel of enhanced output is a separate voice whose signal-to-noise ratio can be effectively improved; this raises the probability that a real wake-up is detected by the wake-up algorithm, so that the channel most likely to contain the target speaker is selected for subsequent operations.
  • FIG. 5 is a method flowchart of an embodiment of a speech processing method described in this application.
  • Although this application provides method operation steps or device structures as shown in the following embodiments or drawings, the method or device may include more or fewer operation steps or module units based on conventional means or without creative labor.
  • The execution order of these steps or the module structure of the device is not limited to the execution order or module structure shown in the embodiments of the present application and the accompanying drawings.
  • The method or module structure shown in the embodiments or drawings may be executed sequentially or in parallel in an actual device or terminal product (for example, in a parallel-processor or multi-threaded environment, or even a distributed processing environment).
  • a voice processing method provided by an embodiment of the present application may include:
  • Step 501: Separate the original voice data into multi-channel voice data.
  • The original voice data can be obtained from sound data picked up by the microphone array, followed by echo cancellation, in which the sound emitted by the interactive device itself, for example a program or prompt sound played by a television or speaker, is suppressed.
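Echo cancellation of this kind is typically done with an adaptive filter. The following is a hedged sketch using normalized LMS (NLMS), assuming the device's own playback signal is available as a reference; real acoustic echo cancellers also add double-talk detection and delay estimation:

```python
import numpy as np

def nlms_echo_cancel(mic, ref, filter_len=64, mu=0.5, eps=1e-8):
    """Subtract an adaptively estimated echo of `ref` from `mic`.

    mic: microphone signal containing speech plus device echo.
    ref: the signal the device itself is playing (loudspeaker feed).
    A normalized-LMS filter learns the echo path; the returned
    error signal is the echo-suppressed microphone signal.
    """
    w = np.zeros(filter_len)       # echo-path estimate
    x = np.zeros(filter_len)       # most recent reference samples
    out = np.zeros(len(mic))
    for n in range(len(mic)):
        x = np.roll(x, 1)
        x[0] = ref[n]
        e = mic[n] - w @ x                 # mic minus estimated echo
        out[n] = e
        w += mu * e * x / (x @ x + eps)    # normalized gradient step
    return out
```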
  • Step 502: Determine the credibility of the voice data of each channel in the multi-channel voice data.
  • the credibility may be determined according to at least one of the following: the signal quality of the predetermined phrase, the signal-to-noise ratio of the voice data, and the length of time that the predetermined phrase appears.
  • the predetermined phrase may be a wake-up word
  • The wake word may be a sensitive phrase set in advance. For example, if the interactive device is a TV, the wake word may be "Hello TV"; if the interactive device is a speaker, the wake word may be "Hello speaker"; if the interactive device is a vending machine, the wake word can be "Hello vending machine". The device can also be given a name, such as "miumiu", in which case the wake word can be set to "Hello miumiu", or simply "miumiu", and so on.
  • how to set the wake word can be selected according to actual needs, which is not limited in this application.
  • Step 503: Use the channel corresponding to the most credible voice data as the target channel.
  • Step 504: Perform voice recognition on the voice data of the target channel.
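Steps 501 to 504 can be sketched as a small pipeline; the three callables are assumed stand-ins for the separation, credibility-scoring, and recognition components described above:

```python
def process_voice(original, separate, credibility, recognize):
    """Steps 501-504 as one pipeline (illustrative stand-in
    callables, not the patent's actual components)."""
    channels = separate(original)                      # step 501
    scores = [credibility(ch) for ch in channels]      # step 502
    target = channels[scores.index(max(scores))]       # step 503
    return recognize(target)                           # step 504
```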
  • The voices mixed with the original sound are separated using voice separation technology, and each separated voice is output on its own channel.
  • The credibility of each channel's signal is determined, and the channel with the highest credibility is used as the channel where the target voice is located; the voice data of this channel is then sent to the speech recognition system for processing.
  • This reduces the impact of noise on speech recognition, solves the existing problem of low speech recognition accuracy caused by interference sounds such as noise, achieves accurate recognition of voice data, and enables more effective human-machine voice interaction.
  • performing voice recognition on the voice data of the target channel may be converting the voice data of the target channel into text content; identifying the intent of the text content, and then generating feedback data according to the intent .
  • For example, if the voice data is "Hello TV, please adjust the volume to 50", the intent is determined to be adjusting the volume.
  • The correspondingly generated data can be the operation data for adjusting the volume, and voice data for feedback to the user can also be generated, for example: "OK, the volume has been adjusted", and so on.
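A toy version of the intent-recognition step for the volume example might look as follows; the matching rule and feedback wording are hypothetical, not taken from the patent:

```python
import re

def parse_volume_intent(text):
    """Toy intent parser for commands like
    'Hello TV, please adjust the volume to 50'."""
    m = re.search(r"volume to (\d+)", text)
    if m:
        level = int(m.group(1))
        return {"intent": "set_volume", "level": level,
                "feedback": "OK, the volume has been adjusted to %d." % level}
    return {"intent": "unknown"}
```

A production system would use a semantic-understanding model rather than a regular expression, but the input/output shape is the same: text in, intent plus feedback out.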
  • the interactive device may also be other devices, such as smart speakers, smart vending machines, etc., which is not limited in this application.
  • The original voice data can be separated into multi-channel voice data in one of the ways described above.
  • noise reduction processing and gain control can also be performed on the voice data of each channel in the multi-channel voice data.
  • The noise reduction processing suppresses environmental noise, for example stationary noise from air conditioners and microwave ovens.
  • the gain control is to automatically adjust the gain of the output signal so that the output signal meets the input requirements of the wake word module and speech recognition.
  • a voice processing method is also provided in this application. As shown in FIG. 6, the method may include the following steps:
  • Step 601: Separate the original voice data into one or more channels of voice data.
  • Step 602: Determine the credibility of the separated voice data.
  • Step 603: Perform speech recognition on the most credible voice data.
  • Determining the credibility of the separated voice data may be based on the wake word: for each channel of voice data, detect whether a predefined wake word appears and determine the score of the wake word; determine the score of the voice data according to the wake-word score; and use the determined score as the credibility of the voice data.
  • The credibility can be further determined by combining other information such as the signal-to-noise ratio. For example, the wake-word duration and the signal-to-noise ratio can be obtained, and the score of each channel of voice data can be calculated from the wake-word score, the wake-word duration, and the signal-to-noise ratio.
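Such a combined credibility score could be sketched as a weighted sum of the wake-word score, the wake-word duration, and the SNR; the weights and scaling below are illustrative assumptions, not values from the patent:

```python
def channel_credibility(wake_score, wake_duration_s, snr_db,
                        w_score=0.6, w_dur=10.0, w_snr=0.3):
    """Combine wake-word score, wake-word duration (seconds) and
    SNR (dB) into a single credibility value. The weights are
    illustrative; `w_dur` also rescales the short duration into
    a range comparable to the other terms."""
    return (w_score * wake_score
            + w_dur * wake_duration_s
            + w_snr * snr_db)
```

Ranking the channels by this value and taking the maximum implements the "highest comprehensive ranking" selection described in the text.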
  • The acquired sound data may contain echo, so before separating the original voice data into one or more channels of voice data, the sound data can be acquired and echo cancellation performed on it to obtain the original voice data, so that the effects of echo are effectively eliminated.
  • the above step 601 can separate the original voice data into multi-channel voice data. Specifically, one of the following methods can be adopted:
  • Method 1: Iteratively process the original voice data using an objective function that maximizes the independence between the output channels to obtain the multi-channel voice data; or,
  • Method 2: Separate and classify the time-frequency points belonging to the same sound source in the original voice data, determine multiple sound source signals, and calculate the energy and covariance matrix of each sound source from its time-frequency mask to obtain the multi-channel voice data; or,
  • Method 3: Obtain the topology of the microphone array, use a sound source localization algorithm to determine the azimuth of each of the multiple sound sources, and form a beam toward each sound source through a beamforming algorithm to obtain the multi-channel voice data.
  • noise reduction processing and gain control can also be performed on the voice data of each channel in the multi-channel voice data.
  • The noise reduction processing suppresses environmental noise, for example stationary noise from air conditioners and microwave ovens.
  • the gain control is to automatically adjust the gain of the output signal so that the output signal meets the input requirements of the wake word module and speech recognition.
  • the noise reduction process can be performed first, and then the gain control can be performed.
  • FIG. 7 is a block diagram of a hardware structure of a computer terminal of a voice processing method according to an embodiment of the present invention.
  • The computer terminal 10 may include one or more processors 102 (only one is shown in the figure; the processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 104 for storing data, and a transmission module 106 for communication functions.
  • the structure shown in FIG. 7 is only schematic, and it does not limit the structure of the electronic device.
  • the computer terminal 10 may further include more or fewer components than those shown in FIG. 7, or have a configuration different from that shown in FIG. 7.
  • the memory 104 may be used to store software programs and modules of application software, such as program instructions / modules corresponding to the voice processing method in the embodiment of the present invention.
  • The processor 102 runs the software programs and modules stored in the memory 104 to execute various functional applications and data processing, that is, to implement the voice processing method of the above application program.
  • the memory 104 may include a high-speed random access memory, and may further include a non-volatile memory, such as one or more magnetic storage devices, a flash memory, or other non-volatile solid-state memory.
  • the memory 104 may further include memory remotely disposed with respect to the processor 102, and these remote memories may be connected to the computer terminal 10 through a network. Examples of the above network include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
  • the transmission module 106 is configured to receive or send data via a network.
  • a specific example of the above network may include a wireless network provided by a communication provider of the computer terminal 10.
  • the transmission module 106 includes a network adapter (NIC), which can be connected to other network devices through a base station so as to communicate with the Internet.
  • the transmission module 106 may be a radio frequency (RF) module, which is used to communicate with the Internet in a wireless manner.
  • the voice recognition device shown in FIG. 8 may include: a separation module 801, a first determination module 802, a second determination module 803, and a recognition module 804, where:
  • a separation module 801, configured to separate the original voice data into multi-channel voice data
  • a first determining module 802 configured to determine the credibility of the voice data of each channel in the multi-channel voice data
  • a second determining module 803, configured to use a channel corresponding to the most reliable voice data as a target channel
  • the recognition module 804 is configured to perform voice recognition on the voice data of the target channel.
  • the first determining module 802 may specifically determine the credibility according to at least one of the following: the signal quality of the predetermined phrase, the signal-to-noise ratio of the voice data, and the duration of occurrence of the predetermined phrase.
  • The above device may further include a pick-up and cancellation module, configured to pick up sound data through the microphone array before the original voice data is separated into multi-channel voice data, and to perform echo cancellation on the sound data to obtain the original voice data.
  • the identification module 804 may be specifically configured to convert the voice data of the target channel into text content; identify the intent of the text content; and generate feedback data according to the intent.
  • The separation module 801 may, but is not limited to, separate the original voice data into multi-channel voice data in one of the ways described above.
  • the device may further perform noise reduction processing and gain control on the voice data of each channel in the multi-channel voice data after separating the original voice data into multi-channel voice data.
  • a smart TV is also provided.
  • the smart TV may include a processor and a memory for storing processor-executable instructions, and the processor implements when the instructions are executed:
  • The original voice data is separated into multi-channel voice data, the credibility of each channel of voice data is determined, and the channel corresponding to the most credible voice data is used as the target channel.
  • Voice recognition is performed on the voice data of the target channel, which reduces the impact of noise on voice data recognition, solves the existing problem of low voice recognition accuracy caused by noise and other interference sounds, achieves accurate recognition of voice data, and allows more effective human-machine voice interaction.
  • the devices or modules described in the foregoing embodiments may be specifically implemented by a computer chip or entity, or may be implemented by a product having a certain function.
  • the functions are divided into various modules and described separately.
  • the functions of each module may be implemented in the same or multiple software and / or hardware.
  • a module that implements a certain function may also be implemented by combining multiple submodules or subunits.
  • the method, device or module described in this application may be implemented in a computer-readable program code by the controller in any suitable manner.
  • the controller may take the form of, for example, a microprocessor or processor together with a computer-readable medium storing computer-readable program code (such as software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (ASIC), a programmable logic controller, or an embedded microcontroller.
  • examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320.
  • the memory controller can also be implemented as part of the control logic of the memory.
  • a controller implemented in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like can be regarded as a hardware component.
  • the device included in the controller for implementing various functions can also be considered as a structure within the hardware component.
  • the means for implementing various functions can be regarded both as software modules implementing the method and as structures within the hardware component.
  • program modules include routines, programs, objects, components, data structures, classes, etc. that perform specific tasks or implement specific abstract data types.
  • program modules may be located in local and remote computer storage media, including storage devices.
  • the present application can be implemented by means of software plus the necessary hardware. Based on such an understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, may be embodied in the form of a software product, or may be reflected in a data-migration implementation process.
  • the computer software product can be stored in a storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes a number of instructions to enable a computer device (which can be a personal computer, a mobile terminal, a server, a network device, etc.) to execute the method described in each embodiment of the present application, or in certain parts of the embodiments.

Abstract

A voice processing method, an electronic device, and a display device, the method comprising: separating original voice data into one or more channels of voice data (601); determining the credibility of the separated channels of voice data (602); and performing voice recognition on the voice data having the highest credibility (603). The impact of noise on voice data recognition may be reduced, and the existing problem of low voice recognition accuracy due to noise and other disturbing sounds is solved, thus achieving accurate recognition of voice data and enabling more effective human-machine voice interaction.

Description

Voice processing method, electronic device and display device
This application claims priority to Chinese patent application No. 201811020120.2, filed on September 3, 2018 and entitled "A voice recognition method, smart device, and smart TV", the entire contents of which are incorporated herein by reference.
Technical field
The present application belongs to the field of Internet technology, and particularly relates to a voice processing method, an electronic device, and a display device.
Background
With the development of computers, the Internet, the mobile Internet, and the Internet of Things, smart devices (such as mobile phones, computers, smart home appliances, and smart robots) are used ever more frequently. Single-mode human-computer interaction based on a keyboard, mouse, or remote controller can no longer meet the demands of smart-device control; accordingly, demand for voice recognition and voice control has become increasingly widespread, and voice interaction is becoming a more pervasive form of human-computer interaction.
In human-machine voice interaction, the smart device converts a voice command into text through speech recognition technology, then understands the intent of the command through semantic understanding technology and gives corresponding feedback. The premise of human-machine voice interaction, however, is that the machine must be able to hear the content of the voice command clearly.
However, as shown in FIG. 1, in an actual voice interaction scene there are generally, in addition to the voice of the target speaker, multiple adverse acoustic factors such as device echo, non-target speakers' voices, external noise interference, and room reverberation. As a result, the original sound received by the pickup device is a noisy speech signal with a low signal-to-noise ratio; such a signal is difficult for speech recognition algorithms to process, making effective human-machine voice interaction impossible.
In view of the above problems, no effective solution has currently been proposed.
Summary of the Invention
The purpose of this application is to provide a voice processing method, an electronic device, and a display device that can achieve accurate recognition of voice data and thereby enable more effective human-machine voice interaction.
The voice processing method, electronic device, and display device provided by the present application are implemented as follows:
A voice processing method, the method comprising:
separating original voice data into one or more channels of voice data;
determining the credibility of each separated channel of voice data;
performing speech recognition on the voice data with the highest credibility.
An electronic device, comprising a processor and a memory storing processor-executable instructions, wherein the processor, when executing the instructions, implements:
separating original voice data into one or more channels of voice data;
determining the credibility of each channel of voice data;
performing speech recognition on the voice data with the highest credibility.
A display device, comprising a processor and a memory storing processor-executable instructions, wherein the processor, when executing the instructions, implements:
separating original voice data into one or more channels of voice data;
determining the credibility of each separated channel of voice data;
performing speech recognition on the voice data with the highest credibility.
A computer-readable storage medium having computer instructions stored thereon, which, when executed, implement the steps of the above method.
The voice processing method, electronic device, and display device provided by the present application separate original voice data into multiple channels of voice data, determine the credibility of each channel of voice data, and perform speech recognition on the voice data with the highest credibility. This reduces the impact of noise on voice data recognition, solves the existing problem of low speech recognition accuracy caused by noise and other interfering sounds, achieves accurate recognition of voice data, and enables more effective human-machine voice interaction.
Brief Description of the Drawings
In order to explain the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are merely some of the embodiments recorded in this application; a person of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is a schematic diagram of conventional sound propagation;
FIG. 2 is a schematic diagram of the architecture of the voice interaction system provided by this application;
FIG. 3 is a schematic diagram of channel separation and determination provided by this application;
FIG. 4 is a schematic diagram of the architecture of the voice processing flow provided by this application;
FIG. 5 is a flowchart of the voice processing method provided by this application;
FIG. 6 is a flowchart of another speech recognition method provided by this application;
FIG. 7 is a schematic diagram of the architecture of the computing terminal provided by this application;
FIG. 8 is a structural block diagram of the voice interaction device provided by this application.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions in the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of this application, not all of them. Based on the embodiments in this application, all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of this application.
Consider that if the original audio signal collected by a smart device's microphone array can be enhanced, and noise signals such as device echo, non-target speech, and environmental noise can be suppressed, the signal-to-noise ratio of the target speech improves; this can effectively raise the accuracy of speech recognition and improve the efficiency of human-machine voice interaction.
In this example, multi-channel signals are obtained through speech separation; the signal interacting with the smart device is determined based on the credibility of the wake word in each channel signal, and the channel carrying the source signal is thereby identified. Performing speech recognition on that source signal can effectively reduce the impact of noise.
As shown in FIG. 2, this example provides a voice interaction system that may include: a user X, a noise source, and an interactive device, where the interactive device includes a processor used to perform multi-channel separation of the voice and wake-word-based channel selection.
The above user X is a user who interacts with the interactive device by voice. User X utters an interactive voice command, for example: "Hello TV, turn the TV volume up to 50."
The interactive device must be provided not only with a processor but also with a sound collector. The sound collector may be a microphone array, which collects sound and provides the collected sound to the processor for processing. The microphone array may have a purpose-built shape or a regular shape; this application does not specifically limit the structure of the microphone array, which can be selected and configured according to actual needs.
To achieve accurate recognition of speech, this example uses speech separation technology to separate out the voices mixed in the original sound; in the speech enhancement stage, each separated voice gets one channel of output. Then, using voice wake-up technology, the channel with the highest score is selected from among the awakened channels as the target voice and sent to the speech recognition system for recognition.
For example, as shown in FIG. 3, the user's original voice data (i.e., the voice data collected by the microphone array) is obtained. The original voice data contains the user's original sound (i.e., the voice data to be acquired) as well as a large amount of noise data (for example, interference from other users' speech and other sound interference). After obtaining the voice data, the processor can separate the original voice data into multi-channel voice data through voice separation processing: channel-1 voice data, channel-2 voice data, channel-3 voice data, channel-4 voice data, and so on.
Each of the channel-1, channel-2, channel-3, channel-4, ... voice data is used as an input signal; whether a predefined wake word appears in each channel's data is detected, and the wake word is scored, with a higher score indicating better wake-word signal quality. For example, if the wake-word score is 20 for the channel-1 voice data, 98 for channel 2, 50 for channel 3, and 35 for channel 4, it can be determined that the channel-2 voice data is the original sound.
Therefore, the voice data of channel 2 can be sent as the determined original sound to the speech recognition system, which converts the voice command into text through speech recognition technology, then understands the intent of the command through semantic understanding technology and gives corresponding feedback.
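The selection rule illustrated by the example above (take the channel whose wake-word score is highest) reduces to a one-line maximization. A minimal sketch, using the scores from the example:

```python
# Channel selection reduces to taking the arg-max of the wake-word scores.
def select_target_channel(channel_scores):
    """channel_scores: mapping from channel index to wake-word score."""
    return max(channel_scores, key=channel_scores.get)

# Scores from the example above: channel 2 carries the original sound.
scores = {1: 20, 2: 98, 3: 50, 4: 35}
assert select_target_channel(scores) == 2
```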
When performing voice separation, that is, separating the original voice data into multi-channel voice data, one of (but not limited to) the following methods may be used:
Method 1: Since different sound sources are produced by different physical processes, the sound source signals can be assumed to be statistically independent. The original voice signal described above is a mixture of multiple source signals, so the signals collected by the individual channels of the microphone array are no longer independent. Therefore, an objective function can be defined that maximizes the independence between the output channels during iteration, thereby achieving speech separation.
Method 2: Since speech signals are sparse in the frequency domain, it can be assumed that only one sound source dominates any given time-frequency point. Accordingly, a time-frequency masking (Mask) method can be defined that separates out and groups together the time-frequency points belonging to the same sound source; the energy changes and covariance matrices of the individual sources are then calculated from each source signal's time-frequency mask, thereby achieving speech separation.
Method 3: Given the known topology of the microphone array, a sound source localization algorithm is used to estimate the azimuth of each of the multiple sound sources; then a beamforming algorithm forms a beam for each sound source, so as to output multi-channel voice signals.
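Method 2 can be illustrated with a minimal binary time-frequency masking sketch. The per-source magnitude estimates are assumed to be given, and the energy and covariance computations mentioned above are omitted; this is only an illustration of assigning each time-frequency bin to its dominant source.

```python
import numpy as np

def binary_tf_masks(source_mags):
    """source_mags: (n_sources, n_freq, n_frames) estimated source
    magnitudes. Assign each time-frequency bin to the dominant source
    and return one binary mask per source."""
    dominant = np.argmax(source_mags, axis=0)          # (n_freq, n_frames)
    n_sources = source_mags.shape[0]
    return np.stack([(dominant == s).astype(float) for s in range(n_sources)])

def apply_masks(mixture_stft, masks):
    # Each separated signal is the mixture weighted by its own mask.
    return masks * mixture_stft

# Two sources, one frequency bin, two frames: source 0 dominates frame 0,
# source 1 dominates frame 1.
mags = np.array([[[3.0, 0.2]],
                 [[0.5, 4.0]]])
masks = binary_tf_masks(mags)
```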
To process the original voice data effectively, echo cancellation can be performed before voice separation, thereby removing echo data from each channel's voice data. After voice separation yields multi-channel voice data, noise reduction can be applied to each channel's voice data, followed by gain control on the denoised channels; after gain control, wake-word probability judgment is performed on each channel's voice data to determine the original sound, i.e., to perform channel selection.
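The processing order just described (echo cancellation before separation; per-channel noise reduction and gain control after it; wake-word scoring last) can be sketched as a pipeline. Every stage function here is a toy stand-in, not a real implementation:

```python
# Hypothetical sketch of the processing order: echo cancellation runs on
# the raw pickup, separation produces channels, noise reduction and gain
# control run per channel, and wake-word scoring picks the target.
def process(raw, echo_cancel, separate, denoise, agc, wake_score):
    cleaned = echo_cancel(raw)
    channels = separate(cleaned)                     # multi-channel output
    channels = [agc(denoise(ch)) for ch in channels]
    scores = [wake_score(ch) for ch in channels]
    target = scores.index(max(scores))
    return target, channels[target]

# Toy stand-ins, only to show the data flow through the stages.
target, data = process(
    raw="raw",
    echo_cancel=lambda x: x,
    separate=lambda x: ["ch0", "ch1"],
    denoise=lambda c: c,
    agc=lambda c: c,
    wake_score=lambda c: {"ch0": 0.2, "ch1": 0.9}[c],
)
assert target == 1
```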
For example, if channel 2 (i.e., the channel-2 voice data) is determined to be the original sound, then in this voice interaction the voice data of channel 2 is used as the data to be sent to the speech recognition system for voice command recognition.
For wake-word judgment, the probability that a wake word exists in each channel's voice data may be determined, and the channel of voice data with the highest probability is taken as the voice data corresponding to the original sound. The wake word may be one or more pre-set sensitive words. For example, if the interactive device is a TV, the wake word may be "Hello TV"; if the interactive device is a speaker, "Hello speaker"; if the interactive device is a vending machine, "Hello vending machine"; or the device may be given a name, e.g., "miumiu", in which case the wake word may be set to "Hello miumiu", "miumiu", and so on. How the wake word is set can be chosen according to actual needs, and this application does not limit it.
Automatic Gain Control (AGC) is an automatic control method that adjusts a channel's gain automatically with the signal strength. Automatic gain control is a form of limiting output; it adjusts the output signal (for example, of a hearing aid) using an effective combination of linear amplification and compression amplification. When a weak signal is input, the linear amplification path operates to ensure the strength of the output signal; when the input signal reaches a certain strength, the compression amplification path is activated to reduce the output amplitude. In other words, the AGC function can automatically control the gain by changing the input-output compression ratio. Reducing the gain by increasing the AGC voltage is called forward AGC; reducing the gain by decreasing the AGC voltage is called reverse AGC. Forward AGC has strong control capability but requires large control power; the operating point of the controlled amplification stage varies over a wide range, and the impedance at both ends of the amplifier also changes greatly. Reverse AGC requires little control power, and its control range is also small.
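The limiting behaviour described above (linear amplification below a threshold, compression of the excess above it) can be sketched per sample as follows. The threshold, gain, and compression ratio are illustrative values, not taken from this application:

```python
def agc_sample(x, threshold=0.5, linear_gain=2.0, ratio=4.0):
    """Toy AGC for a single sample amplitude: linear amplification below
    `threshold`, compression of the excess above it. All parameter
    values are illustrative."""
    y = abs(x) * linear_gain
    if y > threshold:
        y = threshold + (y - threshold) / ratio   # compress the excess
    return y if x >= 0 else -y

assert agc_sample(0.1) == 0.2                 # weak input: linear path
assert abs(agc_sample(0.5) - 0.625) < 1e-12   # strong input: compressed
```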
When selecting a channel, the selection may be based solely on the wake-word score, or it may combine other data, such as the wake-word duration and the signal-to-noise ratio, to select and output the channel with the highest overall ranking. Which method is used to select the target channel can be chosen according to actual needs, and this application does not specify it.
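Combining the wake-word score with other features such as duration and signal-to-noise ratio can be sketched as a weighted ranking. The weights and feature values here are purely illustrative; the application leaves the combination method open:

```python
def rank_channels(features, weights=(0.6, 0.2, 0.2)):
    """features: list of (wake_score, duration_score, snr_score) tuples,
    one per channel, each normalized to [0, 1]. Returns the index of the
    channel with the highest weighted combination. Weights are illustrative."""
    w1, w2, w3 = weights
    totals = [w1 * a + w2 * b + w3 * c for a, b, c in features]
    return totals.index(max(totals))

# Channel 1 wins on the weighted combination despite a lower duration score.
assert rank_channels([(0.3, 0.9, 0.5), (0.9, 0.6, 0.7)]) == 1
```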
In the process of voice interaction, one user generally triggers one voice interaction: the user first says the wake word, which triggers an interaction flow, and after the interaction ends, an interaction with another user can be triggered. In this example, the voice recognition method described above assumes that the interacting user's position does not move during the interaction; for example, the user stands in front of the TV and says: "Hello TV, please turn up the volume", or stands in front of a vending machine and says: "Hello vending machine, I want a subway ticket from Suzhou Street to Financial Street."
That is, in this flow the user's position hardly moves. Considering that in practice the user may well move, continuing to pick up and recognize voice strictly on the previously determined channel could obstruct the entire interaction. To address this, a voice-based user identification step can be added: after the original-sound channel is determined, when subsequent voice data is acquired from that channel, identity recognition is first performed to determine whether the user is the one corresponding to the original sound. If so, voice data continues to be acquired from that channel and recognized; if not, the channel is determined anew.
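The added identity-check step can be sketched as a simple guard around recognition. All callables are illustrative stand-ins, since the application does not specify a speaker-verification algorithm:

```python
def handle_utterance(channel_id, voice, verify_speaker, recognize, reselect):
    """Guard step: only recognize audio from the locked channel when the
    speaker matches the one who triggered the interaction; otherwise
    re-run channel selection. All callables are illustrative stand-ins."""
    if verify_speaker(voice):
        return recognize(channel_id, voice)
    return reselect(voice)

# Toy usage: the enrolled speaker matches, so recognition proceeds on channel 2.
result = handle_utterance(
    2, "audio-frame",
    verify_speaker=lambda v: True,
    recognize=lambda ch, v: ("recognized", ch),
    reselect=lambda v: ("reselect", None),
)
assert result == ("recognized", 2)
```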
The above voice interaction method is described below with reference to a specific example. As shown in FIG. 4, the voice interaction system includes an enhancement system and a wake-up system. The enhancement system enhances the original voice signals received by the microphone array and outputs multiple sound source signals with relatively high signal-to-noise ratios. FIG. 4 takes two channels of high-SNR sound source signals ("Hello TV", "today's weather") as an example; in actual implementation, two or more channels of sound source signals may be included, i.e., multi-channel voice output and multi-channel selection are supported. The wake-up system determines whether the multi-channel signal output contains a user-predefined wake word, such as "Hello TV", and determines the output channel based on the wake-word score of the voice data output on each channel.
The enhancement system may include: an echo cancellation module, a voice separation and noise reduction module, and a gain control module. The echo cancellation module suppresses sound emitted by the interactive device itself, for example programs or prompt tones played by a TV or speaker. The voice separation and noise reduction module separates each source signal from the mixed signal and suppresses environmental noise, for example stationary noise from air conditioners or microwave ovens. The gain control module automatically adjusts the gain of the output signal so that it meets the input requirements of the wake-word module and of speech recognition.
The voice separation module may use, but is not limited to, one of the following methods to separate voice data:
Method 1: Since different sound sources are produced by different physical processes, the sound source signals can be assumed to be statistically independent. The original voice signal described above is a mixture of multiple source signals, so the signals collected by the individual channels of the microphone array are no longer independent. Therefore, an objective function can be defined that maximizes the independence between the output channels during iteration, thereby achieving speech separation.
Method 2: Since speech signals are sparse in the frequency domain, it can be assumed that only one sound source dominates any given time-frequency point. Accordingly, a time-frequency masking (Mask) method can be defined that separates out and groups together the time-frequency points belonging to the same sound source; the energy changes and covariance matrices of the individual sources are then calculated from each source signal's time-frequency mask, thereby achieving speech separation.
Method 3: Given the known topology of the microphone array, a sound source localization algorithm is used to estimate the azimuth of each of the multiple sound sources; then a beamforming algorithm forms a beam for each sound source, so as to output multi-channel voice signals.
The wake-up system in FIG. 4 may include: a wake-word module and a channel selection module. The wake-word module detects whether a predefined wake word appears in the input signal and gives the wake word a score; a higher score indicates better wake-word signal quality. The channel selection module performs channel selection based on the scores of multiple wake words, as well as other features such as wake-word duration and signal-to-noise ratio, and selects and outputs the channel with the highest overall ranking.
The above solution in this example is based on speech separation technology; it introduces little distortion to overlapping speech and does not rely on localizing the azimuth of the sound source. Even if the target sound and the interfering sound are at the same azimuth, as long as their distances to the microphone array differ, the approach of this application can still process them effectively. Specifically, in this example, the voices mixed in the original sound are separated out using speech separation technology; in the speech enhancement stage, each separated voice gets one channel of output. Then, using voice wake-up technology, the channel with the highest score is selected from among the awakened channels as the channel carrying the target voice, and that channel's voice data is sent to the speech recognition system for processing. Because multi-channel enhanced output is combined with multi-channel wake-word detection and wake-up posterior probability scoring, each channel of enhanced output is a separated voice whose signal-to-noise ratio is effectively improved. This raises the probability that the true wake word is detected by the wake-up algorithm, so that the channel most likely to belong to the target speaker is selected for subsequent operations.
FIG. 5 is a flowchart of an embodiment of the voice processing method described in this application. Although this application provides method steps or device structures as shown in the following embodiments or drawings, the method or device may, conventionally or without creative effort, include more or fewer steps or module units. For steps or structures without a logically necessary causal relationship, the execution order of the steps or the module structure of the device is not limited to the execution order or module structure described in the embodiments of this application and shown in the drawings. When the described method or module structure is applied in an actual device or terminal product, it may be executed sequentially or in parallel according to the method or module structure shown in the embodiments or drawings (for example, in a parallel-processor or multi-threaded environment, or even a distributed processing environment).
Specifically, as shown in FIG. 5, the voice processing method provided by an embodiment of this application may include:
Step 501: Separate the original voice data into multi-channel voice data.
The original voice data may be obtained by performing echo cancellation on the sound data picked up by the microphone array, suppressing the sound emitted by the interactive device itself, for example programs or prompt tones played by a TV or speaker.
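The echo-cancellation step can be illustrated with a minimal normalized-LMS adaptive filter. The application does not specify an algorithm, so NLMS is only an assumed, commonly used choice, and the filter order and step size below are illustrative:

```python
import numpy as np

def nlms_echo_cancel(mic, ref, order=4, mu=0.5, eps=1e-8):
    """Minimal NLMS adaptive echo canceller: `ref` is the signal the
    device itself plays out, `mic` is the microphone pickup. The filter
    estimates the echo path and subtracts the estimated echo. Filter
    order and step size are illustrative."""
    w = np.zeros(order)
    out = np.zeros(len(mic))
    for n in range(len(mic)):
        x = ref[max(0, n - order + 1):n + 1][::-1]   # most recent samples
        x = np.pad(x, (0, order - len(x)))
        e = mic[n] - w @ x                           # subtract estimated echo
        w += mu * e * x / (x @ x + eps)              # NLMS weight update
        out[n] = e
    return out

# Toy echo path: the microphone hears the playback scaled by 0.5.
ref = np.ones(50)
residual = nlms_echo_cancel(0.5 * ref, ref)
```

For this toy constant signal the residual echo decays geometrically toward zero as the filter adapts.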
Step 502: Determine the credibility of each channel's voice data in the multi-channel voice data.
Specifically, the credibility may be determined according to at least one of the following: the signal quality of a predetermined phrase, the signal-to-noise ratio of the voice data, and the duration of the predetermined phrase. The predetermined phrase may be a wake word, and the wake word may be one or more pre-set sensitive words. For example, if the interactive device is a TV, the wake word may be "Hello TV"; if the interactive device is a speaker, "Hello speaker"; if the interactive device is a vending machine, "Hello vending machine"; or the device may be given a name, e.g., "miumiu", in which case the wake word may be set to "Hello miumiu", "miumiu", and so on. How the wake word is set can be chosen according to actual needs, and this application does not limit it.
Step 503: Use the channel corresponding to the voice data with the highest credibility as the target channel.
Step 504: Perform speech recognition on the voice data of the target channel.
In the above example, voice separation technology is used to separate the voices mixed in the original sound, and each separated voice is output on its own channel. The credibility of each channel's signal is then determined, the channel with the highest credibility is selected as the channel carrying the target voice, and the voice data of that channel is sent to the speech recognition system for processing. This reduces the influence of noise on the recognition of voice data, solves the existing problem of low speech recognition accuracy caused by interfering sounds such as noise, achieves accurate recognition of voice data, and enables more effective human-machine voice interaction.
In step 504 above, performing speech recognition on the voice data of the target channel may include converting the voice data of the target channel into text content, identifying the intent of the text content, and then generating feedback data according to the intent. For example, if the voice data is "Hello TV, please adjust the volume to 50", the identified intent is to adjust the volume, and the corresponding generated data may be operation data for adjusting the volume; voice data for feedback to the user may also be generated, for example, "Master, the adjustment is complete", and so on.
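The text-to-intent-to-feedback step could be illustrated with a toy keyword matcher; the intent name, regular-expression pattern, and reply strings below are invented for the sketch (a real system would use an NLU model, not regular expressions).

```python
import re

# Minimal illustrative intent recognizer for the "adjust the volume" example.
# Intent names, patterns and feedback strings are assumptions for this sketch.

def recognize_intent(text):
    m = re.search(r"volume to (\d+)", text.lower())
    if m:
        level = int(m.group(1))
        return {
            "intent": "set_volume",
            "operation": {"action": "set_volume", "level": level},  # device command
            "reply": f"Volume has been set to {level}.",            # spoken feedback
        }
    return {"intent": "unknown", "operation": None,
            "reply": "Sorry, I did not understand that."}

result = recognize_intent("Hello TV, please adjust the volume to 50")
print(result["intent"], result["operation"]["level"])  # set_volume 50
```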
The foregoing uses a television as the interactive device. In actual implementation, the interactive device may also be another device, such as a smart speaker or a smart vending machine, which is not limited in this application.
In implementation, the original voice data may be separated into multi-channel voice data in, but not limited to, one of the following ways:
1) iteratively computing on the original voice data with an objective function that maximizes the independence between the output channels, to obtain the multi-channel voice data;
2) separating and classifying the time-frequency points in the original voice data that belong to the same sound source, determining multiple sound-source signals, and computing the energy variation and covariance matrix of each sound source from the time-frequency mask of each sound-source signal, to obtain the multi-channel voice data;
or, 3) obtaining the topology of the microphone array, determining the azimuth of each of the multiple sound sources with a sound-source localization algorithm, and forming a beam for each sound source with a beamforming algorithm, to obtain the multi-channel voice data.
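For way 3), one classical beamformer that fits the description is delay-and-sum: steer toward a source's azimuth by delaying each microphone so the source's wavefront is aligned, then average. The sketch below assumes a two-microphone linear array, a 0.1 m spacing, a 16 kHz sample rate, and integer-sample delays; none of these values come from the application.

```python
import math

# Illustrative delay-and-sum beamformer for a 2-microphone linear array.
# Geometry, sample rate and steering angle are assumptions for this sketch.

def steering_delays(n_mics, spacing_m, angle_deg, fs, c=343.0):
    """Integer sample delays that steer the beam toward angle_deg."""
    tau = spacing_m * math.sin(math.radians(angle_deg)) / c  # inter-mic delay, s
    base = [round(m * tau * fs) for m in range(n_mics)]
    lo = min(base)
    return [d - lo for d in base]  # shift so all delays are non-negative

def delay_and_sum(mic_signals, delays):
    """Align each microphone signal by its integer delay and average them."""
    n = min(len(s) - d for s, d in zip(mic_signals, delays))
    return [sum(s[d + i] for s, d in zip(mic_signals, delays)) / len(mic_signals)
            for i in range(n)]

delays = steering_delays(2, spacing_m=0.1, angle_deg=30.0, fs=16000)  # [0, 2]

# Source arrives at mic 1 two samples later than at mic 0.
src = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
mic0 = src
mic1 = [0.0, 0.0] + src[:-2]
out = delay_and_sum([mic0, mic1], delays)
print(out[:4])  # the two aligned copies reinforce: [0.0, 1.0, 0.0, -1.0]
```

Forming one such beam per localized source yields one output channel per source, consistent with the multi-channel output described above.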
Specifically, after the original voice data is separated into multi-channel voice data, noise reduction and gain control may also be performed on each channel of voice data in the multi-channel voice data. The noise reduction suppresses environmental noise, for example, stationary noise from air conditioners, microwave ovens, and the like. The gain control automatically adjusts the gain of the output signal so that the output signal meets the input requirements of the wake-up word module and of speech recognition.
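A minimal sketch of such gain control is RMS normalization toward a target level; the target level and gain cap are assumptions for the example, and the noise-reduction stage mentioned above is omitted for brevity.

```python
import math

# Illustrative automatic gain control: scale a channel so its RMS level
# matches an assumed target, with a cap so silence is not over-amplified.

def rms(samples):
    return math.sqrt(sum(x * x for x in samples) / len(samples))

def apply_agc(samples, target_rms=0.1, max_gain=20.0):
    level = rms(samples)
    if level == 0.0:
        return samples[:]                      # pure silence: leave untouched
    gain = min(target_rms / level, max_gain)   # cap the boost
    return [x * gain for x in samples]

quiet = [0.01, -0.01, 0.01, -0.01]  # RMS = 0.01, well below the target
boosted = apply_agc(quiet)
print(round(rms(boosted), 3))  # 0.1
```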
This application also provides a voice processing method. As shown in FIG. 6, the method may include the following steps:
Step 601: Separate the original voice data into one or more channels of voice data.
Step 602: Determine the credibility of each separated channel of voice data.
Step 603: Perform speech recognition on the voice data with the highest credibility.
That is, after the original voice data is separated into multiple channels of voice data, the voice data with the highest credibility is selected for speech recognition, thereby solving the problem of low recognition accuracy caused by noise and the like in existing speech recognition.
Specifically, in step 602 above, the credibility of each separated channel of voice data may be determined based on a wake-up word. For example, for each channel of voice data, whether a predefined wake-up word appears may be detected and a wake-up word score determined; a score of the voice data is then determined according to the wake-up word score; and the determined score of the voice data is used as the credibility of the voice data.
To improve the accuracy of the credibility determination, the credibility may further be determined in combination with the signal-to-noise ratio and other factors. For example, the wake-up word duration and the signal-to-noise ratio may be obtained, and the score of the voice data may be computed from the wake-up word score, the wake-up word duration, and the signal-to-noise ratio.
Considering that in actual implementation the acquired sound data may contain echo, before the original voice data is separated into one or more channels of voice data, sound data may be acquired and echo cancellation performed on it to obtain the original voice data, thereby effectively eliminating the influence of echo.
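One standard way to realize such echo cancellation (a sketch under assumptions, not the implementation claimed here) is a normalized-LMS adaptive filter: the device's own playback signal serves as the reference, an FIR filter estimates the unknown echo path, and the estimated echo is subtracted from the microphone signal. The filter length, step size, and toy echo path below are invented for the example.

```python
import random

# Illustrative NLMS echo canceller. mic = near-end microphone signal,
# ref = loudspeaker (reference) signal; the filter w adapts toward the
# unknown echo path and the residual e is the echo-cancelled output.

def nlms_echo_cancel(mic, ref, taps=4, mu=0.5, eps=1e-6):
    w = [0.0] * taps
    out = []
    for n in range(len(mic)):
        x = [ref[n - k] if n - k >= 0 else 0.0 for k in range(taps)]
        echo_hat = sum(wk * xk for wk, xk in zip(w, x))
        e = mic[n] - echo_hat                       # residual after echo removal
        norm = sum(xk * xk for xk in x) + eps
        w = [wk + mu * e * xk / norm for wk, xk in zip(w, x)]
        out.append(e)
    return out

random.seed(0)
ref = [random.uniform(-1, 1) for _ in range(2000)]   # playback signal
echo_path = [0.5, -0.3, 0.2, 0.1]                    # assumed room response
mic = [sum(h * (ref[n - k] if n - k >= 0 else 0.0)
           for k, h in enumerate(echo_path)) for n in range(len(ref))]

residual = nlms_echo_cancel(mic, ref)
early = sum(e * e for e in residual[:200])
late = sum(e * e for e in residual[-200:])
print(late < early * 1e-3)  # True: the echo is almost fully cancelled
```

In this toy setup the microphone contains only echo, so the residual decays toward zero once the filter has converged; with real near-end speech present, the residual would retain the speech while removing the echo.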
In implementation, step 601 above may separate the original voice data into multi-channel voice data in one of the following ways:
Way 1: iteratively computing on the original voice data with an objective function that maximizes the independence between the output channels, to obtain the multi-channel voice data; or,
Way 2: separating and classifying the time-frequency points in the original voice data that belong to the same sound source, determining multiple sound-source signals, and computing the energy variation and covariance matrix of each sound source from the time-frequency mask of each sound-source signal, to obtain the multi-channel voice data; or,
Way 3: obtaining the topology of the microphone array, determining the azimuth of each of the multiple sound sources with a sound-source localization algorithm, and forming a beam for each sound source with a beamforming algorithm, to obtain the multi-channel voice data.
Specifically, after the original voice data is separated into multi-channel voice data, noise reduction and gain control may also be performed on each channel of voice data in the multi-channel voice data. The noise reduction suppresses environmental noise, for example, stationary noise from air conditioners, microwave ovens, and the like. The gain control automatically adjusts the gain of the output signal so that the output signal meets the input requirements of the wake-up word module and of speech recognition. In implementation, the noise reduction may be performed first, followed by the gain control.
The method embodiments provided in this application may be executed in a mobile terminal, a computer terminal, or a similar computing device. Taking execution on a computer terminal as an example, FIG. 7 is a block diagram of the hardware structure of a computer terminal for a voice processing method according to an embodiment of the present invention. As shown in FIG. 7, the computer terminal 10 may include one or more processors 102 (only one is shown in the figure; the processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 104 for storing data, and a transmission module 106 for communication functions. A person of ordinary skill in the art can understand that the structure shown in FIG. 7 is only illustrative and does not limit the structure of the above electronic device. For example, the computer terminal 10 may include more or fewer components than shown in FIG. 7, or have a configuration different from that shown in FIG. 7.
The memory 104 may be used to store software programs and modules of application software, such as the program instructions/modules corresponding to the voice processing method in the embodiment of the present invention. The processor 102 executes the software programs and modules stored in the memory 104 so as to perform various functional applications and data processing, that is, to implement the voice processing method of the above application program. The memory 104 may include high-speed random access memory, and may further include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory disposed remotely with respect to the processor 102, and such remote memory may be connected to the computer terminal 10 through a network. Examples of the above network include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
The transmission module 106 is configured to receive or send data via a network. A specific example of the above network may include a wireless network provided by the communication provider of the computer terminal 10. In one example, the transmission module 106 includes a network interface controller (NIC), which may be connected to other network devices through a base station so as to communicate with the Internet. In another example, the transmission module 106 may be a radio frequency (RF) module, which is used to communicate with the Internet wirelessly.
At the software level, the above voice recognition apparatus, as shown in FIG. 8, may include a separation module 801, a first determination module 802, a second determination module 803, and a recognition module 804, wherein:
the separation module 801 is configured to separate original voice data into multi-channel voice data;
the first determination module 802 is configured to determine the credibility of each channel of voice data in the multi-channel voice data;
the second determination module 803 is configured to use the channel corresponding to the voice data with the highest credibility as the target channel;
the recognition module 804 is configured to perform speech recognition on the voice data of the target channel.
In one embodiment, the first determination module 802 may determine the credibility according to at least one of the following: the signal quality of a predetermined phrase, the signal-to-noise ratio of the voice data, and the duration for which the predetermined phrase appears.
In one embodiment, the above apparatus may further include a pickup-and-cancellation module, configured to pick up sound data through a microphone array before the original voice data is separated into multi-channel voice data, and to perform echo cancellation on the sound data to obtain the original voice data.
In one embodiment, the recognition module 804 may be specifically configured to convert the voice data of the target channel into text content, identify the intent of the text content, and generate feedback data according to the intent.
In one embodiment, the separation module 801 may, but is not limited to, separate the original voice data into multi-channel voice data in one of the following ways:
1) iteratively computing on the original voice data with an objective function that maximizes the independence between the output channels, to obtain the multi-channel voice data;
2) separating and classifying the time-frequency points in the original voice data that belong to the same sound source, determining multiple sound-source signals, and computing the energy variation and covariance matrix of each sound source from the time-frequency mask of each sound-source signal, to obtain the multi-channel voice data;
3) obtaining the topology of the microphone array, determining the azimuth of each of the multiple sound sources with a sound-source localization algorithm, and forming a beam for each sound source with a beamforming algorithm, to obtain the multi-channel voice data.
In one embodiment, after the original voice data is separated into multi-channel voice data, the above apparatus may further perform noise reduction and gain control on each channel of voice data in the multi-channel voice data.
This example also provides a smart TV. The smart TV may include a processor and a memory for storing processor-executable instructions, and the processor, when executing the instructions, implements:
separating original voice data into multi-channel voice data;
determining the credibility of each channel of voice data in the multi-channel voice data;
using the channel corresponding to the voice data with the highest credibility as the target channel;
performing speech recognition on the voice data of the target channel.
In the above embodiment, the original voice data is separated into multi-channel voice data, the credibility of each channel of voice data in the multi-channel voice data is determined, the channel corresponding to the voice data with the highest credibility is used as the target channel, and speech recognition is performed on the voice data of the target channel. This reduces the influence of noise on the recognition of voice data, solves the existing problem of low speech recognition accuracy caused by interfering sounds such as noise, achieves accurate recognition of voice data, and enables more effective human-machine voice interaction.
Although this application provides the method operation steps described in the embodiments or flowcharts, more or fewer operation steps may be included based on routine or non-inventive labor. The order of steps listed in the embodiments is only one of many possible execution orders and does not represent the only execution order. When an actual device or client product executes, the steps may be executed sequentially or in parallel according to the method order shown in the embodiments or drawings (for example, in a parallel-processor or multi-threaded environment).
The apparatuses or modules described in the above embodiments may be implemented by a computer chip or entity, or by a product having a certain function. For convenience of description, the above apparatus is described with its functions divided into various modules. When implementing this application, the functions of the modules may be implemented in one or more pieces of software and/or hardware. Of course, a module that implements a certain function may also be implemented by a combination of multiple sub-modules or sub-units.
The methods, apparatuses, or modules described in this application may be implemented by a controller with computer-readable program code, in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of such controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicone Labs C8051F320. A memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art also know that, in addition to implementing the controller purely with computer-readable program code, the method steps may be logically programmed so that the controller achieves the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Therefore, such a controller can be regarded as a hardware component, and the means included in it for implementing various functions can also be regarded as structures within the hardware component.
Or even, the means for implementing various functions can be regarded as both software modules implementing the method and structures within the hardware component.
Some modules of the apparatus described in this application may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, classes, and the like that perform specific tasks or implement specific abstract data types. This application may also be practiced in distributed computing environments, where tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
From the description of the above embodiments, those skilled in the art can clearly understand that this application may be implemented by software plus the necessary hardware. Based on this understanding, the technical solution of this application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product, or may be embodied in the implementation process of data migration. The computer software product may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes a number of instructions causing a computer device (which may be a personal computer, a mobile terminal, a server, a network device, or the like) to execute the methods described in the embodiments of this application or in certain parts of the embodiments.
The embodiments in this specification are described in a progressive manner; for the same or similar parts between the embodiments, reference may be made to each other, and each embodiment focuses on its differences from the other embodiments. All or part of this application may be used in many general-purpose or special-purpose computer system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, mobile communication terminals, multiprocessor systems, microprocessor-based systems, programmable electronic devices, network PCs, minicomputers, mainframe computers, and distributed computing environments including any of the above systems or devices.
Although this application has been described through embodiments, those of ordinary skill in the art know that there are many variations and changes of this application without departing from its spirit, and it is intended that the appended claims cover these variations and changes without departing from the spirit of this application.

Claims (18)

  1. A voice processing method, wherein the method comprises:
    separating original voice data into one or more channels of voice data;
    determining the credibility of each separated channel of voice data;
    performing speech recognition on the voice data with the highest credibility.
  2. The method according to claim 1, wherein determining the credibility of each separated channel of voice data comprises:
    for each channel of voice data, detecting whether a predefined wake-up word appears, and determining a wake-up word score;
    determining a score of the voice data according to the wake-up word score;
    using the determined score of the voice data as the credibility of the voice data.
  3. The method according to claim 2, wherein determining the score of the voice data according to the wake-up word score comprises:
    obtaining a wake-up word duration and a signal-to-noise ratio;
    computing the score of the voice data according to the wake-up word score, the wake-up word duration, and the signal-to-noise ratio.
  4. The method according to claim 1, wherein before separating the original voice data into one or more channels of voice data, the method further comprises:
    acquiring sound data;
    performing echo cancellation on the sound data to obtain the original voice data.
  5. The method according to claim 1, wherein the original voice data is separated into multi-channel voice data by:
    iteratively computing on the original voice data with an objective function that maximizes the independence between output channels, to obtain the multi-channel voice data; or,
    separating and classifying time-frequency points in the original voice data that belong to the same sound source, determining multiple sound-source signals, and computing the energy variation and covariance matrix of each sound source from the time-frequency mask of each sound-source signal, to obtain the multi-channel voice data.
  6. The method according to claim 1, wherein separating the original voice data into multi-channel voice data comprises:
    obtaining the topology of a microphone array, determining the azimuth of each of multiple sound sources with a sound-source localization algorithm, and forming a beam for each sound source with a beamforming algorithm, to obtain the multi-channel voice data.
  7. The method according to claim 1, wherein after separating the original voice data into one or more channels of voice data, the method further comprises:
    performing noise reduction and/or gain control on each separated channel of voice data.
  8. An electronic device, comprising a processor and a memory for storing processor-executable instructions, wherein the processor, when executing the instructions, implements:
    separating original voice data into one or more channels of voice data;
    determining the credibility of each channel of voice data;
    performing speech recognition on the voice data with the highest credibility.
  9. The device according to claim 8, wherein the processor determining the credibility of each separated channel of voice data comprises:
    for each channel of voice data, detecting whether a predefined wake-up word appears, and determining a wake-up word score;
    determining a score of the voice data according to the wake-up word score;
    using the determined score of the voice data as the credibility of the voice data.
  10. The device according to claim 9, wherein the processor determining the score of the voice data according to the wake-up word score comprises:
    obtaining a wake-up word duration and a signal-to-noise ratio;
    computing the score of the voice data according to the wake-up word score, the wake-up word duration, and the signal-to-noise ratio.
  11. The device according to claim 8, wherein before separating the original voice data into one or more channels of voice data, the processor is further configured to:
    acquire sound data;
    perform echo cancellation on the sound data to obtain the original voice data.
  12. The device according to claim 8, wherein the processor separates the original voice data into multi-channel voice data by:
    iteratively computing on the original voice data with an objective function that maximizes the independence between output channels, to obtain the multi-channel voice data; or,
    separating and classifying time-frequency points in the original voice data that belong to the same sound source, determining multiple sound-source signals, and computing the energy variation and covariance matrix of each sound source from the time-frequency mask of each sound-source signal, to obtain the multi-channel voice data.
  13. The device according to claim 8, wherein the processor separating the original voice data into multi-channel voice data comprises:
    obtaining the topology of a microphone array, determining the azimuth of each of multiple sound sources with a sound-source localization algorithm, and forming a beam for each sound source with a beamforming algorithm, to obtain the multi-channel voice data.
  14. The device according to claim 8, wherein after separating the original voice data into one or more channels of voice data, the processor is further configured to:
    perform noise reduction and/or gain control on each separated channel of voice data.
  15. A display device, comprising a processor and a memory for storing processor-executable instructions, wherein the processor, when executing the instructions, implements:
    separating original voice data into one or more channels of voice data;
    determining the credibility of each separated channel of voice data;
    performing speech recognition on the voice data with the highest credibility.
  16. A data processing system, comprising an enhancement module and a wake-up module, wherein:
    the enhancement module is configured to separate original voice data into one or more channels of voice data;
    the wake-up module is configured to determine the credibility of each separated channel of voice data, and to perform speech recognition on the voice data with the highest credibility.
  17. The system according to claim 16, wherein the enhancement module comprises:
    an echo cancellation unit, configured to perform echo cancellation on acquired sound data to obtain the original voice data;
    a voice separation unit, configured to separate the original voice data into one or more channels of voice data;
    a noise reduction unit, configured to perform noise reduction on the separated one or more channels of voice data;
    a gain control unit, configured to perform gain control on the noise-reduced data.
  18. A computer-readable storage medium having computer instructions stored thereon, wherein the instructions, when executed, implement the steps of the method according to any one of claims 1 to 7.
PCT/CN2019/104081 2018-09-03 2019-09-03 Voice processing method, electronic device and display device WO2020048431A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811020120.2 2018-09-03
CN201811020120.2A CN110875045A (en) 2018-09-03 2018-09-03 Voice recognition method, intelligent device and intelligent television

Publications (1)

Publication Number Publication Date
WO2020048431A1 2020-03-12

Family

ID=69716878

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/104081 WO2020048431A1 (en) 2018-09-03 2019-09-03 Voice processing method, electronic device and display device

Country Status (2)

Country Link
CN (1) CN110875045A (en)
WO (1) WO2020048431A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111402883B (en) * 2020-03-31 2023-05-26 云知声智能科技股份有限公司 Nearby response system and method in distributed voice interaction system under complex environment
CN111615035B (en) * 2020-05-22 2021-05-14 歌尔科技有限公司 Beam forming method, device, equipment and storage medium
CN112397083A (en) * 2020-11-13 2021-02-23 Oppo广东移动通信有限公司 Voice processing method and related device
CN113555033A (en) * 2021-07-30 2021-10-26 乐鑫信息科技(上海)股份有限公司 Automatic gain control method, device and system of voice interaction system
CN113608449B (en) * 2021-08-18 2023-09-15 四川启睿克科技有限公司 Speech equipment positioning system and automatic positioning method in smart home scene
CN113782024B (en) * 2021-09-27 2024-03-12 上海互问信息科技有限公司 Method for improving accuracy of automatic voice recognition after voice awakening
CN114220454B (en) * 2022-01-25 2022-12-09 北京荣耀终端有限公司 Audio noise reduction method, medium and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100217590A1 (en) * 2009-02-24 2010-08-26 Broadcom Corporation Speaker localization system and method
CN106531179A (en) * 2015-09-10 2017-03-22 中国科学院声学研究所 Multi-channel speech enhancement method based on semantic prior selective attention
CN107464565A (en) * 2017-09-20 2017-12-12 百度在线网络技术(北京)有限公司 A kind of far field voice awakening method and equipment
CN108109617A (en) * 2018-01-08 2018-06-01 深圳市声菲特科技技术有限公司 A kind of remote pickup method
CN108122563A (en) * 2017-12-19 2018-06-05 北京声智科技有限公司 Improve voice wake-up rate and the method for correcting DOA

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2115743A1 (en) * 2007-02-26 2009-11-11 QUALCOMM Incorporated Systems, methods, and apparatus for signal separation
US8831936B2 (en) * 2008-05-29 2014-09-09 Qualcomm Incorporated Systems, methods, apparatus, and computer program products for speech signal processing using spectral contrast enhancement
CN104637494A (en) * 2015-02-02 2015-05-20 哈尔滨工程大学 Double-microphone mobile equipment voice signal enhancing method based on blind source separation
CN104882140A (en) * 2015-02-05 2015-09-02 宇龙计算机通信科技(深圳)有限公司 Voice recognition method and system based on blind signal extraction algorithm
CN108447498B (en) * 2018-03-19 2022-04-19 中国科学技术大学 Speech enhancement method applied to microphone array


Also Published As

Publication number Publication date
CN110875045A (en) 2020-03-10

Similar Documents

Publication Publication Date Title
WO2020048431A1 (en) Voice processing method, electronic device and display device
CN108351872B (en) Method and system for responding to user speech
US20210005197A1 (en) Detecting Self-Generated Wake Expressions
CN107464564B (en) Voice interaction method, device and equipment
US9940949B1 (en) Dynamic adjustment of expression detection criteria
CN109599124B (en) Audio data processing method and device and storage medium
EP3923273B1 (en) Voice recognition method and device, storage medium, and air conditioner
CN111192591B (en) Awakening method and device of intelligent equipment, intelligent sound box and storage medium
CN106898348B (en) Dereverberation control method and device for sound production equipment
US20200058293A1 (en) Object recognition method, computer device, and computer-readable storage medium
US9494683B1 (en) Audio-based gesture detection
CN107450390B (en) intelligent household appliance control device, control method and control system
WO2020228270A1 (en) Speech processing method and device, computer device and storage medium
CN110265020B (en) Voice wake-up method and device, electronic equipment and storage medium
WO2020062900A1 (en) Sound processing method, apparatus and device
WO2020088153A1 (en) Speech processing method and apparatus, storage medium and electronic device
TWI711035B (en) Method, device, audio interaction system, and storage medium for azimuth estimation
CN110610718B (en) Method and device for extracting expected sound source voice signal
US11393490B2 (en) Method, apparatus, device and computer-readable storage medium for voice interaction
CN110517702A (en) The method of signal generation, audio recognition method and device based on artificial intelligence
CN112382279B (en) Voice recognition method and device, electronic equipment and storage medium
WO2023103693A1 (en) Audio signal processing method and apparatus, device, and storage medium
CN112666522A (en) Awakening word sound source positioning method and device
US11659332B2 (en) Estimating user location in a system including smart audio devices
CN112017662A (en) Control instruction determination method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19858306

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19858306

Country of ref document: EP

Kind code of ref document: A1