CN110875045A - Voice recognition method, intelligent device and intelligent television - Google Patents

Voice recognition method, intelligent device and intelligent television

Info

Publication number
CN110875045A
CN110875045A (application number CN201811020120.2A)
Authority
CN
China
Prior art keywords
voice data
data
voice
channel
score
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811020120.2A
Other languages
Chinese (zh)
Inventor
纳跃跃
刘鑫
刘勇
高杰
付强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN201811020120.2A priority Critical patent/CN110875045A/en
Priority to PCT/CN2019/104081 priority patent/WO2020048431A1/en
Publication of CN110875045A publication Critical patent/CN110875045A/en
Pending legal-status Critical Current

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 — Speaker identification or verification techniques
    • G10L17/22 — Interactive procedures; Man-machine interfaces
    • G10L21/00 — Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 — Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 — Voice signal separating

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The application provides a voice recognition method, an intelligent device and an intelligent television, wherein the method comprises: separating original voice data into one or more paths of voice data; determining the credibility of each separated path of voice data; and performing voice recognition on the voice data with the highest credibility. This scheme reduces the influence of noise on voice-data recognition, solves the prior-art problem of low voice recognition accuracy caused by noise and other interfering sounds, achieves accurate recognition of voice data, and enables more effective human-machine voice interaction.

Description

Voice recognition method, intelligent device and intelligent television
Technical Field
The application belongs to the technical field of internet, and particularly relates to a voice recognition method, intelligent equipment and an intelligent television.
Background
With the development of computers, the internet, the mobile internet and the internet of things, intelligent devices (such as mobile phones, computers, smart homes, intelligent robots and the like) are used more and more frequently. Single-mode human-machine interaction based on a keyboard, mouse or remote controller can no longer meet the demands of controlling intelligent devices; correspondingly, the demand for voice recognition and voice control has grown, and voice interaction has become a widespread mode of human-machine interaction.
In human-machine voice interaction, the intelligent device converts a voice command into text through voice recognition technology, understands the intent of the command through semantic understanding technology, and gives corresponding feedback. A prerequisite for such interaction, however, is that the machine can clearly hear the content of the voice command.
However, as shown in fig. 1, in an actual voice interaction scenario, besides the voice of the target object there are generally several adverse acoustic factors, such as device echo, the voice of a non-target speaker, external noise and room reverberation. The original sound received by the pickup device is therefore a noisy speech signal with a low signal-to-noise ratio; such a signal is hard for a speech recognition algorithm to process, so effective human-machine voice interaction fails.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The application aims to provide a voice recognition method, an intelligent device and an intelligent television that can accurately recognize voice data and thereby enable more effective human-machine voice interaction.
The application provides a voice recognition method, intelligent equipment and an intelligent television, which are realized as follows:
a method of speech recognition, the method comprising:
separating the original voice data into one or more paths of voice data;
determining the credibility of each separated path of voice data;
and performing voice recognition on the voice data with the highest credibility.
An electronic device comprising a processor and a memory for storing processor-executable instructions that when executed by the processor implement:
separating the original voice data into one or more paths of voice data;
determining the credibility of each path of voice data;
and performing voice recognition on the voice data with the highest credibility.
A display device comprising a processor and a memory for storing processor-executable instructions that when executed by the processor implement:
separating the original voice data into one or more paths of voice data;
determining the credibility of each separated path of voice data;
and performing voice recognition on the voice data with the highest credibility.
A computer readable storage medium having stored thereon computer instructions which, when executed, implement the steps of the above-described method.
According to the voice recognition method, the intelligent device and the intelligent television, original voice data are separated into multiple paths of voice data, the credibility of each path is determined, and voice recognition is performed on the voice data with the highest credibility. This reduces the influence of noise on voice recognition, solves the existing problem of low recognition accuracy caused by noise and other interfering sounds, achieves accurate recognition of voice data, and enables more effective human-machine voice interaction.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments described in the present application; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a schematic diagram of a prior art sound transmission;
FIG. 2 is a schematic diagram of a voice interaction system architecture provided in the present application;
FIG. 3 is a schematic illustration of the channel separation and determination provided herein;
FIG. 4 is a block diagram of a speech processing flow provided herein;
FIG. 5 is a flow chart of a speech recognition method provided herein;
FIG. 6 is a flow diagram of another method of speech recognition provided herein;
FIG. 7 is a schematic diagram of an architecture of a computing terminal provided herein;
fig. 8 is a block diagram of a voice interaction apparatus provided in the present application.
Detailed Description
In order to make those skilled in the art better understand the technical solutions in the present application, the technical solutions in the embodiments will be described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only a part of the embodiments of the present application, not all of them. All other embodiments derived by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present application.
If the original audio signal collected by the microphone array of the intelligent device can be enhanced, and noise signals such as device echo, non-target voice and environmental noise suppressed, the signal-to-noise ratio of the target voice is improved; the accuracy of voice recognition can then be effectively improved, and with it the efficiency of human-machine voice interaction.
In the embodiments, multi-channel signals are obtained through voice separation; the signal interacting with the intelligent device is then determined based on the credibility of the wake-up word in each channel signal, i.e. the channel carrying the original (target) voice is identified, and voice recognition is performed on that signal, which effectively reduces the influence of noise.
As shown in fig. 2, in this example a voice interaction system is provided, which may include user X, a noise source and an interactive device, wherein the interactive device includes a processor configured to perform multi-channel separation of speech and wake-word-based channel selection.
User X is a user performing voice interaction with the interactive device; user X utters interactive voice, for example: "Hello TV, please turn the TV volume up to 50."
The interactive device requires not only the processor but also a sound collector, which may be a microphone array that collects sound and provides it to the processor for processing. The microphone array may have a special or a regular geometry; the specific structure of the array is not limited in this application and may be chosen according to actual needs.
To recognize speech accurately, in this example the voices mixed in the original speech are separated by a voice separation technique, and each separated voice is output on its own channel in the speech enhancement stage. A voice wake-up technique then selects, among the awakened channels, the channel with the highest score as the target voice, which is sent to the speech recognition system for recognition.
For example, as shown in fig. 3, the raw voice data of a user (i.e., the voice data collected by the microphone array) is obtained. It contains the user's original voice (the voice data to be obtained) together with a lot of noise (e.g., interference from other speakers and other sounds). After acquiring the data, the processor may separate it into multi-channel voice data through voice separation: 1st-channel voice data, 2nd-channel voice data, 3rd-channel voice data, 4th-channel voice data, and so on.
Taking each of the 1st-, 2nd-, 3rd- and 4th-channel voice data as an input signal, the system detects whether the predefined wake-up word appears in each channel and scores it; the higher the score, the better the signal quality of the wake-up word. For example, if the wake-up word scores 20 in the 1st channel, 98 in the 2nd channel, 50 in the 3rd channel and 35 in the 4th channel, the 2nd-channel voice data can be determined to be the original sound.
The 2nd-channel voice data can therefore be sent, as the determined original voice, to the speech recognition system, which converts the voice command into text, understands the intent of the command through semantic understanding, and makes the corresponding feedback.
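The channel-selection step just described reduces to an argmax over per-channel wake-word scores. A minimal sketch follows; the scoring function `score_wake_word` and the threshold are illustrative assumptions standing in for whatever keyword-spotting model the device actually runs, not an API from the patent:

```python
# Hypothetical sketch: pick the separated channel whose wake-word score
# is highest; all names and the threshold are illustrative assumptions.

def select_channel(channels, score_wake_word, threshold=30.0):
    """channels: list of 1-D sample arrays, one per separated output.
    score_wake_word: callable mapping samples to a score (higher means
    the predefined wake word was detected with better signal quality)."""
    scores = [score_wake_word(ch) for ch in channels]
    best = max(range(len(scores)), key=scores.__getitem__)
    if scores[best] < threshold:
        return None  # no channel woke the device; keep listening
    return best

# With the scores from the example above (20, 98, 50, 35) this returns
# index 1, i.e. the 2nd channel, which is forwarded to recognition.
```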
When performing voice separation, i.e., separating the original voice data into multi-channel voice data, one of the following modes may be used, but the choice is not limited to them (a code sketch of mode 1 follows the list):
Mode 1: since different sound sources are generated by different physical processes, the source signals can be assumed to be statistically independent. The original voice signal is a mixture of several source signals, so the signals collected by the channels of the microphone array are no longer independent; an objective function can therefore be defined and the independence among the output channels maximized iteratively, thereby achieving voice separation.
Mode 2: because voice signals are sparse in the frequency domain, it can be assumed that only one sound source dominates at any given time-frequency point. A time-frequency masking (mask) method can therefore be defined, which separates and classifies the time-frequency points belonging to the same sound source and calculates the energy variation and covariance matrix of each source from its time-frequency mask, thereby implementing voice separation.
Mode 3: with the topology of the microphone array known, a sound source localization algorithm estimates the azimuth of each of the multiple sound sources, and a beamforming algorithm then forms a beam toward each source so as to output a multi-channel voice signal.
To process the original voice data effectively, echo cancellation may be performed before voice separation to remove echo from each channel; after separation into multi-channel voice data, noise reduction may be applied to each channel, followed by gain control, and wake-word probability judgment is then performed on the gain-controlled voice data of each channel to determine the original voice, i.e., to achieve channel selection.
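Put together, the chain just described (echo cancellation, separation, per-channel noise reduction and gain control, wake-word scoring) can be sketched as follows; every stage is a placeholder callable rather than an API defined by the patent:

```python
# Hypothetical end-to-end sketch of the enhancement-plus-selection chain;
# aec, separate, denoise, agc and wake_score are assumed callables.
def process(raw_frames, aec, separate, denoise, agc, wake_score):
    x = aec(raw_frames)                             # 1. echo cancellation
    channels = separate(x)                          # 2. multi-channel separation
    channels = [agc(denoise(c)) for c in channels]  # 3. per-channel cleanup
    scores = [wake_score(c) for c in channels]      # 4. wake-word scoring
    best = max(range(len(channels)), key=scores.__getitem__)
    return channels[best]                           # 5. selected original voice
```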
For example, if it is determined that channel 2 (i.e., channel 2 voice data) is the original voice, then in this voice interaction, the voice data of channel 2 is used as the data to be transmitted to the voice recognition system for voice command recognition.
For the wake-word judgment, the probability that the wake-up word exists in each channel of voice data can be determined, and the channel with the highest probability is taken as the one corresponding to the original voice. The wake-up word may be a preset sensitive phrase, for example: if the interactive device is a television, the wake-up word may be "hello TV"; for a speaker, "hello speaker"; for a vending machine, "hello vending machine". It may also use a name given to the device: if the device is named miumiu, the wake-up word can be set to "hello miumiu", "miumiu", and so on. How the wake-up word is set can be chosen according to actual needs and is not limited by this application.
Automatic Gain Control (AGC) automatically adjusts the channel gain according to signal strength. AGC is a form of amplitude-limited output that adjusts the output signal using an effective combination of linear amplification and compression amplification. When a weak signal is input, the linear amplification circuit works to ensure the strength of the output signal; when the input signal reaches a certain strength, the compression amplification circuit is enabled, reducing the output amplitude. In other words, AGC automatically controls the gain by changing the input-output compression ratio. Forward AGC has strong control capability and large control power, with a wide variation range of the controlled amplifier's operating point and a large impedance change across the amplifier; reverse AGC needs little control power and has a smaller control range.
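A toy frame-wise AGC in this spirit might look like the following; the target level, smoothing factor and limiter threshold are illustrative assumptions, not values from the patent:

```python
import numpy as np

def agc_frame(frame, state, target_rms=0.1, smooth=0.9, limit=0.5):
    """Track the input level and move the gain toward a target RMS;
    clip (limit) the output so strong input is compressed rather than
    amplified linearly. `state` carries the gain between frames."""
    rms = np.sqrt(np.mean(frame ** 2)) + 1e-12
    desired = target_rms / rms
    state["gain"] = smooth * state.get("gain", 1.0) + (1 - smooth) * desired
    out = np.clip(frame * state["gain"], -limit, limit)
    return out, state
```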
When selecting the channel, the choice may be based on the wake-word score alone, or it may be combined with other data, such as the wake-word duration and the signal-to-noise ratio, with the channel ranked highest overall being selected and output. Which way the target channel is selected can be chosen according to actual needs and is not specifically limited by this application.
In a voice-data interaction, one user generally triggers one voice interaction: speaking the wake-up word starts an interaction process, and after it finishes, an interaction with another user can be triggered. In this example, the voice recognition mode assumes that the interacting user's position does not move during the interaction; for example, the user stands in front of the television and says: "Hello TV, please turn the volume up," or stands in front of the vending machine and says: "Hello vending machine, I want a subway ticket from Suzhou Street to Finance Street."
That is, in this flow the position of the user hardly moves. Considering that movement easily occurs in practice, continuously picking up and recognizing voice on the previously determined channel could obstruct the whole interaction, so a speaker-identification step may be added: after the channel of the original voice is determined, whenever voice data is subsequently acquired on that channel, identification first checks whether it corresponds to the same user as the original voice; if so, the channel's voice data continues to be acquired and recognized; if not, the channel is re-determined.
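One plausible realization of this identification step (the patent leaves the mechanism open) is to compare speaker embeddings of the locked-on user and of newly arriving speech; the embedding network and the threshold below are assumptions:

```python
import numpy as np

def same_speaker(emb_locked, emb_new, threshold=0.75):
    """Cosine similarity between two speaker embeddings (e.g. from a
    d-vector/x-vector network, which is assumed, not specified here)."""
    cos = float(np.dot(emb_locked, emb_new) /
                (np.linalg.norm(emb_locked) * np.linalg.norm(emb_new) + 1e-12))
    return cos >= threshold

# If same_speaker(...) is False, the device re-runs channel selection
# instead of continuing to pick up the previously chosen channel.
```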
The above voice interaction method is now described with a specific example. As shown in fig. 4, the voice interaction system includes an enhancement system and a wake-up system. The enhancement system enhances the original voice signals received by the microphone array and outputs several sound-source signals with high signal-to-noise ratio. In fig. 4, for example, sound-source signals with relatively high signal-to-noise ratio are obtained on two channels ("hello TV", "today's weather"); in practice two or more channels may be involved, i.e., multi-channel voice output and multi-channel selection are supported. The wake-up system judges whether a user-predefined wake-up word, such as "hello TV", is included in the multi-channel output, and determines the output channel according to the wake-up score of each channel's voice data.
The enhancement system may include an echo cancellation module, a voice separation and noise reduction module, and a gain control module. The echo cancellation module suppresses the sound emitted by the interactive device itself, for example programs or prompt tones played by the television or speaker. The voice separation and noise reduction module separates each source signal from the mixed signal and suppresses ambient noise, for example from an air conditioner or microwave oven. The gain control module automatically adjusts the gain of the output signal so that it meets the input requirements of the wake-word module and of speech recognition.
The voice separation module may perform voice data separation using, but not limited to, one of the three modes already described above: independence maximization (mode 1), time-frequency masking (mode 2), or sound source localization plus beamforming (mode 3).
The wake-up system in fig. 4 may include a wake-word module and a channel selection module. The wake-word module detects whether a predefined wake-up word appears in an input signal and gives the wake-up word a score; the higher the score, the better the signal quality of the wake-up word. The channel selection module selects a channel based on the wake-word scores and other features, such as wake-word duration and signal-to-noise ratio, and outputs the channel ranked highest overall.
In the above scheme, because it is based on a voice separation technique, the distortion caused by overlapped voices is small and the scheme does not depend on localizing the azimuth of the sound source: even if the target sound and the interfering sound lie in the same direction, the method can effectively handle overlapped voices as long as their distances to the microphone array differ. Specifically, in this example the voices mixed in the original speech are separated by voice separation; each separated voice is output on its own channel in the enhancement stage; the voice wake-up technique then selects, among the awakened channels, the highest-scoring channel as the one carrying the target voice, and that channel's voice data is sent to the speech recognition system for processing. Because multi-channel enhanced output is combined with multi-channel wake-words and wake-up posterior-probability scoring, each channel's enhanced output is a separated voice, the speech signal-to-noise ratio is effectively enhanced, the probability that the real wake-up word is detected by the wake-up algorithm is improved, and the channel most likely to carry the target speaker is selected for subsequent operation.
FIG. 5 is a flow chart of one embodiment of the speech recognition method described herein. Although the present application provides the method steps or apparatus structures shown in the following embodiments or figures, more or fewer steps or modular units may be included based on conventional or non-inventive effort. For steps or structures with no logically necessary cause-and-effect relationship, the execution order of the steps or the module structure of the apparatus is not limited to that described in the embodiments or shown in the drawings. When applied in an actual device or end product, the method or module structure may be executed sequentially or in parallel (for example, in a parallel-processor or multi-threaded environment, or even in a distributed processing environment).
Specifically, as shown in fig. 5, a speech recognition method provided in an embodiment of the present application may include:
step 501: separating original voice data into multi-channel voice data;
the original voice data may be sound data picked up by a microphone array, and then obtained by performing echo cancellation on the picked-up sound data, wherein the sound emitted by the interactive device itself is suppressed, for example: television, programs or prompt tones played by the sound box, and the like.
Step 502: determining the credibility of each channel of voice data in the multi-channel voice data;
Specifically, the credibility may be determined based on at least one of: the signal quality of a predetermined phrase, the signal-to-noise ratio of the voice data, and the duration of occurrence of the predetermined phrase. The predetermined phrase may be a wake-up word, which may be a preset sensitive phrase, for example: if the interactive device is a television, the wake-up word may be "hello TV"; for a speaker, "hello speaker"; for a vending machine, "hello vending machine". It may also use a name given to the device: if the device is named miumiu, the wake-up word can be set to "hello miumiu", "miumiu", and so on. How the wake-up word is set can be chosen according to actual needs and is not limited by this application.
Step 503: taking a channel corresponding to the voice data with the highest credibility as a target channel;
step 504: and performing voice recognition on the voice data of the target channel.
In the above example, the voices mixed in the original speech are separated by voice separation, each separated voice is output on its own channel, the credibility of each channel signal is determined, the most credible channel is selected as the one carrying the target voice, and that channel's voice data is sent to the speech recognition system for processing. This reduces the influence of noise on voice recognition, solves the prior-art problem of low recognition accuracy caused by noise and other interfering sounds, achieves accurate recognition of voice data, and enables more effective human-machine voice interaction.
In step 504, performing speech recognition on the voice data of the target channel may consist of converting it into text content, identifying the intent of the text, and generating feedback data based on the intent. For example, if the voice data is "Hello TV, turn the volume up to 50", the intent is determined to be raising the volume; the corresponding generated data may be volume-up operating data, and voice data for feedback to the user may also be generated, e.g., a reply such as "OK, adjusted."
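The recognize-understand-feedback flow can be illustrated with a toy intent parser; the regular expression below stands in for a real semantic-understanding model and is purely a data-flow illustration:

```python
# Toy sketch of step 504's downstream flow: text -> intent -> feedback.
import re

def handle_command(text):
    m = re.search(r"volume (?:up )?to (\d+)", text.lower())
    if m:
        level = int(m.group(1))
        return {"action": "set_volume", "level": level,
                "tts": "OK, volume set to %d." % level}
    return {"action": "none", "tts": "Sorry, I didn't catch that."}

print(handle_command("Hello TV, turn the volume up to 50"))
# {'action': 'set_volume', 'level': 50, 'tts': 'OK, volume set to 50.'}
```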
The above takes the television as the interactive device; in actual implementation the interactive device may also be another device, for example a smart speaker or a smart vending machine, which this application does not limit.
In implementation, the original voice data may be separated into multi-channel voice data by, but not limited to, one of the following ways (a delay-and-sum beamforming sketch follows the list):
1) performing iterative computation on the original voice data by maximizing an objective function of independence among all output channels to obtain the multi-channel voice data;
2) separating and classifying time-frequency points belonging to the same sound source in the original voice data to determine a plurality of sound source signals, and calculating energy change and covariance matrix of each sound source from time-frequency masking of each sound source signal to obtain the multi-channel voice data;
or, 3) acquiring a topological structure of the microphone array, determining an azimuth angle of each sound source in the plurality of sound sources by adopting a sound source positioning algorithm, and forming a beam for each sound source by a beam forming algorithm to obtain the multi-channel voice data.
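For way 3), a frequency-domain delay-and-sum beamformer is the simplest concrete instance; the sketch below assumes a far-field source and a linear array along the x-axis (both assumptions, since the patent fixes neither the geometry nor the beamformer):

```python
import numpy as np

def delay_and_sum(mics, fs, mic_x, azimuth_deg, c=343.0):
    """mics: (n_mics, n_samples) time-domain signals; mic_x: (n_mics,)
    microphone x-coordinates in meters. Aligns each microphone to the
    estimated azimuth and averages, forming one beam output."""
    mic_x = np.asarray(mic_x, dtype=float)
    tau = mic_x * np.cos(np.deg2rad(azimuth_deg)) / c  # per-mic delay (s)
    n = mics.shape[1]
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    spec = np.fft.rfft(mics, axis=1)
    aligned = spec * np.exp(2j * np.pi * freqs * tau[:, None])
    return np.fft.irfft(aligned.mean(axis=0), n=n)

# Forming one such beam per localized source yields the multi-channel
# voice data referred to above.
```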
Specifically, after the original voice data is separated into multi-channel voice data, noise reduction and gain control may be performed on each channel. The noise reduction suppresses ambient noise, for example from an air conditioner or microwave oven; the gain control automatically adjusts the gain of the output signal so that it meets the input requirements of the wake-word module and of speech recognition.
The present application further provides a speech processing method, as shown in fig. 6, which may include the following steps:
step 601: separating the original voice data into one or more paths of voice data;
step 602: determining the credibility of each separated path of voice data;
step 603: and performing voice recognition on the voice data with the highest credibility.
That is, the original voice data is separated into multiple paths of voice data, and the path with the highest credibility is selected for voice recognition, which overcomes the low recognition accuracy caused by noise and similar interference in existing voice recognition.
Specifically, in step 602 the credibility of each separated path of voice data may be determined based on the wake-up word: detect whether the predefined wake-up word appears in each path and determine the wake-word score; determine the score of the voice data from the wake-word score; and take that score as the credibility of the voice data.
To improve the accuracy of the credibility determination, it can further be combined with information such as the signal-to-noise ratio: for example, the duration of the wake-up word and the signal-to-noise ratio can be obtained, and the score of the voice data computed from the wake-word score, the wake-word duration and the signal-to-noise ratio.
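A hedged sketch of such a combined score is given below; the weights and normalizing constants are placeholders, since the patent does not specify the combination formula:

```python
def channel_score(wake_score, duration_s, snr_db,
                  w_wake=0.6, w_dur=0.2, w_snr=0.2,
                  typical_dur=1.0, max_snr=30.0):
    """Combine the wake-word score (assumed 0-100) with the wake-word
    duration and the channel SNR into one credibility score in [0, 1]."""
    dur_term = min(duration_s / typical_dur, 1.0)    # plausible length
    snr_term = min(max(snr_db / max_snr, 0.0), 1.0)  # clamp to [0, 1]
    return w_wake * (wake_score / 100.0) + w_dur * dur_term + w_snr * snr_term
```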
Considering that echo may exist in the acquired voice data in practice, sound data can be acquired before the original voice data is separated into one or more paths, and echo cancellation performed on it to obtain the original voice data, effectively eliminating the influence of echo.
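Echo cancellation of this kind is commonly built on an adaptive filter fed with the device's own playback signal; the normalized-LMS core below is a standard textbook construction offered as a sketch, not the algorithm mandated by the patent:

```python
import numpy as np

def nlms_aec(mic, ref, order=256, mu=0.5, eps=1e-8):
    """Estimate the echo of the playback reference `ref` inside the
    microphone signal `mic` with an adaptive FIR filter and subtract it."""
    w = np.zeros(order)            # adaptive filter taps
    buf = np.zeros(order)          # recent reference samples
    out = np.zeros(len(mic))
    for n in range(len(mic)):
        buf = np.roll(buf, 1)
        buf[0] = ref[n]
        e = mic[n] - w @ buf                   # echo-cancelled sample
        w += mu * e * buf / (buf @ buf + eps)  # NLMS tap update
        out[n] = e
    return out
```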
In implementation, step 601 may separate the original voice data into multi-channel voice data in one of the following ways:
mode 1: performing iterative computation on the original voice data by maximizing an objective function of independence among all output channels to obtain the multi-channel voice data; or,
mode 2: separating and classifying time-frequency points belonging to the same sound source in the original voice data to determine a plurality of sound source signals, and calculating energy change and covariance matrix of each sound source from time-frequency masking of each sound source signal to obtain the multi-channel voice data;
mode 3: and acquiring a topological structure of the microphone array, determining the azimuth angle of each sound source in the plurality of sound sources by adopting a sound source positioning algorithm, and forming a beam for each sound source by a beam forming algorithm to obtain the multi-channel voice data.
Specifically, after the original voice data is separated into multi-channel voice data, noise reduction and gain control may be performed on each channel: the noise reduction suppresses ambient noise, for example from an air conditioner or microwave oven, and the gain control automatically adjusts the gain of the output signal so that it meets the input requirements of the wake-word module and of speech recognition. In implementation, noise reduction may be performed first, followed by gain control.
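As one concrete (and assumed, since the patent leaves the algorithm open) form of the noise-reduction step, magnitude-domain spectral subtraction on STFT frames looks like this:

```python
import numpy as np

def spectral_subtract(frame_spec, noise_psd, floor=0.05):
    """frame_spec: complex STFT of one frame; noise_psd: running noise
    power estimate (e.g. tracked over non-speech frames). Subtract the
    noise magnitude, keep a spectral floor, and restore the phase."""
    mag, phase = np.abs(frame_spec), np.angle(frame_spec)
    clean = np.maximum(mag - np.sqrt(noise_psd), floor * mag)
    return clean * np.exp(1j * phase)
```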
The method provided by the embodiments of the application can be executed in a mobile terminal, a computer terminal or a similar computing device. Taking execution on a computer terminal as an example, fig. 7 is a hardware block diagram of a computer terminal for the speech recognition method according to an embodiment of the present invention. As shown in fig. 7, the computer terminal 10 may include one or more processors 102 (only one shown; the processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 104 for storing data, and a transmission module 106 for communication. It will be understood by those skilled in the art that the structure shown in fig. 7 is only an illustration and does not limit the structure of the electronic device; for example, the computer terminal 10 may include more or fewer components than shown in fig. 7, or have a different configuration.
The memory 104 may be used to store software programs and modules of application software, such as program instructions/modules corresponding to the speech recognition method in the embodiment of the present invention, and the processor 102 executes various functional applications and data processing by executing the software programs and modules stored in the memory 104, that is, implements the speech recognition method of the application program. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the computer terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission module 106 is used to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the computer terminal 10. In one example, the transmission module 106 includes a Network adapter (NIC) that can be connected to other Network devices through a base station to communicate with the internet. In one example, the transmission module 106 may be a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.
In the software aspect, the speech recognition apparatus as shown in fig. 8 may include: a separation module 801, a first determination module 802, a second determination module 803, and an identification module 804, wherein:
a separation module 801, configured to separate original voice data into multi-channel voice data;
a first determining module 802, configured to determine a reliability of each channel of the multi-channel voice data;
a second determining module 803, configured to use a channel corresponding to the voice data with the highest reliability as a target channel;
and the recognition module 804 is configured to perform voice recognition on the voice data of the target channel.
In one embodiment, the first determining module 802 may specifically determine the confidence level based on at least one of: signal quality of the predetermined phrase, signal-to-noise ratio of the voice data, and a time duration of occurrence of the predetermined phrase.
In one embodiment, the apparatus may further include a pickup and echo-cancellation module, configured to pick up sound data via the microphone array before the original voice data is separated into multi-channel voice data, and to perform echo cancellation on the sound data to obtain the original voice data.
In one embodiment, the recognition module 804 may be specifically configured to convert the voice data of the target channel into text content; identifying an intent of the textual content; generating feedback data according to the intention.
In one embodiment, the separation module 801 may separate the original voice data into multi-channel voice data by, but not limited to, one of the following:
1) performing iterative computation on the original voice data by maximizing an objective function of independence among all output channels to obtain the multi-channel voice data;
2) separating and classifying time-frequency points belonging to the same sound source in the original voice data to determine a plurality of sound source signals, and calculating energy change and covariance matrix of each sound source from time-frequency masking of each sound source signal to obtain the multi-channel voice data;
3) acquiring a topological structure of the microphone array, determining the azimuth angle of each sound source in the plurality of sound sources by adopting a sound source positioning algorithm, and forming a beam for each sound source by a beam forming algorithm to obtain the multi-channel voice data.
In one embodiment, the apparatus may further perform noise reduction processing and gain control on each channel voice data in the multi-channel voice data after separating the original voice data into the multi-channel voice data.
In this example, a smart television is also provided, which may include a processor and a memory for storing processor-executable instructions; when executing the instructions, the processor implements:
separating original voice data into multi-channel voice data;
determining the credibility of each channel of voice data in the multi-channel voice data;
taking a channel corresponding to the voice data with the highest credibility as a target channel;
and performing voice recognition on the voice data of the target channel.
In this embodiment, the original voice data is separated into multi-channel voice data, the credibility of each channel is determined, the channel corresponding to the most credible voice data is taken as the target channel, and voice recognition is performed on the target channel's voice data. This reduces the influence of noise on voice recognition, solves the problem of low recognition accuracy caused by noise and other interfering sounds, achieves accurate recognition of voice data, and enables more effective human-machine voice interaction.
Although the present application provides the method steps described in the embodiments or flowcharts, more or fewer steps may be included based on conventional or non-inventive effort. The order of steps recited in the embodiments is merely one of many possible execution orders and does not represent the only one; an actual apparatus or client product may execute the steps sequentially or in parallel (e.g., in a parallel-processor or multi-threaded environment) according to the embodiments or the methods shown in the figures.
The apparatuses or modules illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. For convenience of description, the above devices are divided into various modules by function and described separately; the functionality of the modules may be implemented in one or more pieces of software and/or hardware. Of course, a module realizing a certain function may also be implemented by a combination of several sub-modules or sub-units.
The methods, apparatus or modules described herein may be implemented as computer-readable program code in a controller realized in any suitable manner; for example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, application-specific integrated circuits (ASICs), programmable logic controllers and embedded microcontrollers. Examples of such controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20 and Silicon Labs C8051F320; a memory controller may also be implemented as part of the control logic of the memory. Those skilled in the art will also appreciate that, in addition to implementing the controller as pure computer-readable program code, the same functionality can be achieved by logically programming the method steps so that the controller takes the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Such a controller may therefore be regarded as a hardware component, and the means included in it for performing the various functions may also be regarded as structures within the hardware component; or even the means for performing the functions may be regarded both as software modules implementing the method and as structures within the hardware component.
Some of the modules in the apparatus described herein may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, classes, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus the necessary hardware. Based on this understanding, the technical solutions of the present application, or the part contributing to the prior art, may be embodied in the form of a software product or realized in a data migration process. The computer software product may be stored in a storage medium such as a ROM/RAM, a magnetic disk or an optical disk, and includes instructions causing a computer device (which may be a personal computer, mobile terminal, server or network device, etc.) to perform the methods described in the embodiments or parts of the embodiments of the present application.
The embodiments in the present specification are described in a progressive manner, and the same or similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. All or portions of the present application are operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, mobile communication terminals, multiprocessor systems, microprocessor-based systems, programmable electronic devices, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
While the present application has been described through examples, those of ordinary skill in the art will appreciate that there are numerous variations of the present application that do not depart from its spirit, and it is intended that the appended claims encompass such variations.

Claims (18)

1. A method of speech processing, wherein the method comprises:
separating the original voice data into one or more paths of voice data;
determining the credibility of each separated path of voice data;
and performing voice recognition on the voice data with the highest credibility.
2. The method of claim 1, wherein determining the trustworthiness of the separated paths of voice data comprises:
detecting whether a predefined awakening word appears in each path of voice data, and determining the score of the awakening word;
determining the score of the voice data according to the score of the awakening word;
and taking the score of the determined voice data as the credibility of the voice data.
3. The method of claim 2, wherein determining a score for the voice data from the wake word score comprises:
acquiring the time length and the signal-to-noise ratio of the awakening word;
and calculating to obtain the score of the voice data according to the score of the awakening word, the time length of the awakening word and the signal to noise ratio.
4. The method of claim 1, wherein prior to separating the original voice data into one or more paths of voice data, the method further comprises:
acquiring sound data;
and carrying out echo cancellation on the sound data to obtain original voice data.
5. The method of claim 1, wherein separating the original voice data into multi-channel voice data comprises:
performing iterative computation on the original voice data by maximizing an objective function of independence among all output channels to obtain the multi-channel voice data; or,
and separating and classifying time-frequency points belonging to the same sound source in the original voice data to determine a plurality of sound source signals, and calculating the energy change and covariance matrix of each sound source from the time-frequency masking of each sound source signal to obtain the multi-channel voice data.
6. The method of claim 1, wherein separating the raw speech data into multi-channel speech data comprises:
and acquiring a topological structure of the microphone array, determining the azimuth angle of each sound source in the plurality of sound sources by adopting a sound source positioning algorithm, and forming a beam for each sound source by a beam forming algorithm to obtain the multi-channel voice data.
7. The method of claim 1, wherein after separating the original voice data into one or more paths of voice data, the method further comprises:
and performing noise reduction processing and/or gain control on each path of separated voice data.
8. An electronic device comprising a processor and a memory for storing processor-executable instructions that when executed by the processor implement:
separating the original voice data into one or more paths of voice data;
determining the credibility of each path of voice data;
and performing voice recognition on the voice data with the highest credibility.
9. The apparatus of claim 8, wherein the processor determines the trustworthiness of the separated paths of voice data, comprising:
detecting whether a predefined awakening word appears in each path of voice data, and determining the score of the awakening word;
determining the score of the voice data according to the score of the awakening word;
and taking the score of the determined voice data as the credibility of the voice data.
10. The apparatus of claim 9, wherein the processor determines a score for the voice data from the wake-word score, comprising:
acquiring the time length and the signal-to-noise ratio of the awakening word;
and calculating to obtain the score of the voice data according to the score of the awakening word, the time length of the awakening word and the signal to noise ratio.
11. The apparatus of claim 8, wherein the processor, prior to separating the original voice data into one or more paths of voice data, is further configured to:
acquiring sound data;
and carrying out echo cancellation on the sound data to obtain original voice data.
12. The apparatus of claim 8, wherein the processor separates the original voice data into multi-channel voice data by:
performing iterative computation on the original voice data by maximizing an objective function of independence among all output channels to obtain the multi-channel voice data; or,
and separating and classifying time-frequency points belonging to the same sound source in the original voice data to determine a plurality of sound source signals, and calculating the energy change and covariance matrix of each sound source from the time-frequency masking of each sound source signal to obtain the multi-channel voice data.
13. The apparatus of claim 8, wherein the processor separates raw speech data into multi-channel speech data, comprising:
and acquiring a topological structure of the microphone array, determining the azimuth angle of each sound source in the plurality of sound sources by adopting a sound source positioning algorithm, and forming a beam for each sound source by a beam forming algorithm to obtain the multi-channel voice data.
14. The apparatus of claim 8, wherein the processor, after separating the original voice data into one or more paths of voice data, is further configured to:
and performing noise reduction processing and/or gain control on each path of separated voice data.
15. A display device comprising a processor and a memory for storing processor-executable instructions that when executed by the processor implement:
separating the original voice data into one or more paths of voice data;
determining the credibility of each separated path of voice data;
and performing voice recognition on the voice data with the highest credibility.
16. A data processing system, comprising an enhancement module and a wake-up module, wherein:
the enhancement module is used for separating the original voice data into one path or multiple paths of voice data;
and the awakening module is used for determining the credibility of each separated path of voice data and carrying out voice recognition on the voice data with the highest credibility.
17. The system of claim 16, wherein the augmentation module comprises:
the echo cancellation unit is used for carrying out echo cancellation on the acquired sound data to obtain original voice data;
the voice separation unit is used for separating the original voice data into one path or multiple paths of voice data;
the noise reduction unit is used for carrying out noise reduction processing on the separated one-way or multi-way voice data;
and the gain control unit is used for carrying out gain control on the data subjected to the noise reduction processing.
18. A computer readable storage medium having stored thereon computer instructions which, when executed, implement the steps of the method of any one of claims 1 to 7.
CN201811020120.2A 2018-09-03 2018-09-03 Voice recognition method, intelligent device and intelligent television Pending CN110875045A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201811020120.2A CN110875045A (en) 2018-09-03 2018-09-03 Voice recognition method, intelligent device and intelligent television
PCT/CN2019/104081 WO2020048431A1 (en) 2018-09-03 2019-09-03 Voice processing method, electronic device and display device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811020120.2A CN110875045A (en) 2018-09-03 2018-09-03 Voice recognition method, intelligent device and intelligent television

Publications (1)

Publication Number Publication Date
CN110875045A true CN110875045A (en) 2020-03-10

Family

ID=69716878

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811020120.2A Pending CN110875045A (en) 2018-09-03 2018-09-03 Voice recognition method, intelligent device and intelligent television

Country Status (2)

Country Link
CN (1) CN110875045A (en)
WO (1) WO2020048431A1 (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100217590A1 (en) * 2009-02-24 2010-08-26 Broadcom Corporation Speaker localization system and method
CN107464565B (en) * 2017-09-20 2020-08-04 百度在线网络技术(北京)有限公司 Far-field voice awakening method and device
CN108109617B (en) * 2018-01-08 2020-12-15 深圳市声菲特科技技术有限公司 Remote pickup method

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101622669A (en) * 2007-02-26 2010-01-06 高通股份有限公司 Systems, methods, and apparatus for signal separation
CN102047326A (en) * 2008-05-29 2011-05-04 高通股份有限公司 Systems, methods, apparatus, and computer program products for spectral contrast enhancement
CN104637494A (en) * 2015-02-02 2015-05-20 哈尔滨工程大学 Double-microphone mobile equipment voice signal enhancing method based on blind source separation
CN104882140A (en) * 2015-02-05 2015-09-02 宇龙计算机通信科技(深圳)有限公司 Voice recognition method and system based on blind signal extraction algorithm
CN106531179A (en) * 2015-09-10 2017-03-22 中国科学院声学研究所 Multi-channel speech enhancement method based on semantic prior selective attention
CN108122563A (en) * 2017-12-19 2018-06-05 北京声智科技有限公司 Improve voice wake-up rate and the method for correcting DOA
CN108447498A (en) * 2018-03-19 2018-08-24 中国科学技术大学 Sound enhancement method applied to microphone array

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LI Xianwei, HUANG Fenggang, ZHANG Lidan, YU Dagang: "Speech signal separation based on frequency-domain decorrelation" (基于频域去相关的语音信号分离), Applied Science and Technology (应用科技), no. 02, pages 1 - 18 *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111402883B (en) * 2020-03-31 2023-05-26 云知声智能科技股份有限公司 Nearby response system and method in distributed voice interaction system under complex environment
CN111402883A (en) * 2020-03-31 2020-07-10 云知声智能科技股份有限公司 Nearby response system and method in distributed voice interaction system in complex environment
CN111615035A (en) * 2020-05-22 2020-09-01 歌尔科技有限公司 Beam forming method, device, equipment and storage medium
CN111615035B (en) * 2020-05-22 2021-05-14 歌尔科技有限公司 Beam forming method, device, equipment and storage medium
CN112397083A (en) * 2020-11-13 2021-02-23 Oppo广东移动通信有限公司 Voice processing method and related device
CN112397083B (en) * 2020-11-13 2024-05-24 Oppo广东移动通信有限公司 Voice processing method and related device
CN113555033A (en) * 2021-07-30 2021-10-26 乐鑫信息科技(上海)股份有限公司 Automatic gain control method, device and system of voice interaction system
CN113608449B (en) * 2021-08-18 2023-09-15 四川启睿克科技有限公司 Speech equipment positioning system and automatic positioning method in smart home scene
CN113608449A (en) * 2021-08-18 2021-11-05 四川启睿克科技有限公司 Voice equipment positioning system and automatic positioning method under intelligent home scene
CN113782024A (en) * 2021-09-27 2021-12-10 上海互问信息科技有限公司 Method for improving automatic voice recognition accuracy rate after voice awakening
CN113782024B (en) * 2021-09-27 2024-03-12 上海互问信息科技有限公司 Method for improving accuracy of automatic voice recognition after voice awakening
CN114220454A (en) * 2022-01-25 2022-03-22 荣耀终端有限公司 Audio noise reduction method, medium and electronic equipment
CN118365506A (en) * 2024-06-18 2024-07-19 北京象帝先计算技术有限公司 MMU configuration method, graphics processing system, electronic component and equipment

Also Published As

Publication number Publication date
WO2020048431A1 (en) 2020-03-12

Similar Documents

Publication Publication Date Title
CN110875045A (en) Voice recognition method, intelligent device and intelligent television
CN111223497B (en) Nearby wake-up method and device for terminal, computing equipment and storage medium
CN106898348B (en) Dereverberation control method and device for sound production equipment
CN111192591B (en) Awakening method and device of intelligent equipment, intelligent sound box and storage medium
CN111161714B (en) Voice information processing method, electronic equipment and storage medium
CN110364156A (en) Voice interactive method, system, terminal and readable storage medium storing program for executing
CN112581960A (en) Voice wake-up method and device, electronic equipment and readable storage medium
CN112687286A (en) Method and device for adjusting noise reduction model of audio equipment
CN112562742A (en) Voice processing method and device
CN112185425A (en) Audio signal processing method, device, equipment and storage medium
CN104464746A (en) Voice filtering method and device and electron equipment
CN113766385B (en) Earphone noise reduction method and device
CN112466305B (en) Voice control method and device of water dispenser
CN113889116A (en) Voice information processing method and device, storage medium and electronic device
US20210110838A1 (en) Acoustic aware voice user interface
CN113889084A (en) Audio recognition method and device, electronic equipment and storage medium
CN106231109A (en) A kind of communication means and terminal
CN113870879A (en) Sharing method of microphone of intelligent household appliance, intelligent household appliance and readable storage medium
CN112885341A (en) Voice wake-up method and device, electronic equipment and storage medium
US11917386B2 (en) Estimating user location in a system including smart audio devices
CN109716432A (en) Gain process method and device thereof, electronic equipment, signal acquisition method and its system
CN117012202B (en) Voice channel recognition method and device, storage medium and electronic equipment
CN113571038B (en) Voice dialogue method and device, electronic equipment and storage medium
CN112992137B (en) Voice interaction method and device, storage medium and electronic device
EP4383253A2 (en) Relevance based source selection for far-field voice systems

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code (ref country code: HK; ref legal event code: DE; ref document number: 40025320)
RJ01 Rejection of invention patent application after publication (application publication date: 20200310)