WO2020227955A1

WO2020227955A1 - Sound recognition method, interaction method, sound recognition system, computer-readable storage medium and mobile platform

Info

Publication number: WO2020227955A1
Application number: PCT/CN2019/086979
Authority: WO
Inventors: 吴俊峰; 赵文泉; 李皓宇; 周事成; 吴晟
Original assignee: 深圳市大疆创新科技有限公司
Priority date: 2019-05-15
Filing date: 2019-05-15
Publication date: 2020-11-19
Also published as: CN111684522A

Abstract

Disclosed in the present application are a sound recognition method, an interaction method, a sound recognition system, a computer-readable storage medium and a mobile platform. The sound recognition method is used to recognize a percussive sound. The sound recognition method comprises: acquiring at least one sound snippet of a sound signal to be recognized and first feature information of the sound snippet, the first feature information being an energy value of the sound snippet, and if the energy value of a central region of the sound snippet is greater than an energy threshold, extracting second feature information from the sound snippet; according to second feature information of the at least one sound snippet, recognizing whether the sound signal to be recognized comprises a percussive sound.

Description

Voice recognition method, interaction method, voice recognition system, computer-readable storage medium and movable platform

Technical field

This application relates to the field of voice recognition, and in particular to a voice recognition method, an interaction method, a voice recognition system, a computer-readable storage medium, and a movable platform.

Background art

As smart hardware becomes common in settings such as home life and education, sound, for example voice interaction, has gradually become an important means of human-computer interaction. However, because of hardware limitations, when the user is far from the hardware device, for example more than 2 meters away, the signal-to-noise ratio is low and the environmental noise mixed into the voice signal makes voice recognition very challenging. Compared with a voice signal, a slap sound signal is simpler, more resistant to interference, and has stronger instantaneous energy. Slap sounds such as applause can therefore be used to control hardware devices, as in sound-activated switches. However, existing sound-activated switches based on a waveform comparison circuit are not robust enough in use: most loud sounds can trigger them, so false triggers are too frequent for them to serve as a reliable means of human-computer interaction.

Summary of the invention

This application provides an improved voice recognition method, interaction method, voice recognition system, computer-readable storage medium, and movable platform.

According to one aspect of the embodiments of the present application, there is provided a voice recognition method for recognizing slap sounds. The voice recognition method includes: acquiring at least one sound segment of a sound signal to be recognized and first feature information of the sound segment, the first feature information being an energy value of the sound segment, and, if the energy value of the middle region of the sound segment is greater than an energy threshold, extracting second feature information from the sound segment; and recognizing, according to the second feature information of the at least one sound segment, whether the sound signal to be recognized includes a slap sound.

According to another aspect of the embodiments of the present application, an interaction method is provided, including: acquiring a sound signal to be recognized; performing the voice recognition method; and, if it is recognized according to the voice recognition method that the sound signal to be recognized includes a slap sound, outputting a corresponding control instruction according to the slap sound.

According to another aspect of the embodiments of the present application, there is provided a voice recognition system including one or more processors for implementing a voice recognition method.

According to another aspect of the embodiments of the present application, there is provided a computer-readable storage medium having a program stored thereon, and when the program is executed by a processor, a voice recognition method is implemented.

According to another aspect of the embodiments of the present application, there is provided a movable platform, including: a body; a power system, which is provided in the body and is used to provide power to the movable platform; a microphone, which is used to receive a sound to be recognized and generate a corresponding sound signal to be recognized; and one or more processors for implementing a voice recognition method and, if it is recognized according to the voice recognition method that the sound signal to be recognized includes a slap sound, outputting a corresponding control instruction according to the slap sound.

In the voice recognition method of the embodiments of the present application, the second feature information is extracted from a sound segment only if the energy value of the middle region of the sound segment is greater than the energy threshold, which performs a preliminary screening of the sound signal to be recognized. Whether the sound signal to be recognized includes a slap sound is then recognized according to the second feature information. As a result, even over a long distance, the recognition rate of the slap sound is high, the robustness is good, and the possibility of false triggering is low, so the method is suitable as a reliable means of human-computer interaction.

Description of the drawings

In order to more clearly describe the technical solutions in the embodiments of the present invention, the following will briefly introduce the accompanying drawings used in the description of the embodiments. Obviously, the accompanying drawings in the following description are only some embodiments of the present invention. For those of ordinary skill in the art, other drawings can be obtained based on these drawings without creative labor.

Fig. 1 shows a flowchart of an embodiment of the voice recognition method of this application.

Fig. 2 shows a sub-flowchart of an embodiment of the voice recognition method of this application.

Fig. 3 shows a flowchart of an embodiment of the interaction method of this application.

Fig. 4 is a schematic diagram of an embodiment of the voice recognition system of this application.

Fig. 5 is a block diagram of a module of an embodiment of the mobile platform of this application.

Detailed description

The following will clearly and completely describe the technical solutions in the embodiments of the present application with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only a part of the embodiments of the present application, rather than all the embodiments. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of this application.

Here, exemplary embodiments will be described in detail, and examples thereof are shown in the accompanying drawings. When the following description refers to the drawings, unless otherwise indicated, the same numbers in different drawings indicate the same or similar elements. The implementation manners described in the following exemplary embodiments do not represent all implementation manners consistent with the present application. On the contrary, they are only examples of devices and methods consistent with some aspects of the application as detailed in the appended claims.

The terms used in this application are only for the purpose of describing specific embodiments and are not intended to limit the application. The singular forms "a", "said" and "the" used in this application and the appended claims are also intended to include plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" used herein refers to and includes any or all possible combinations of one or more of the associated listed items. Unless otherwise indicated, words such as "front", "rear", "lower" and/or "upper" are used only for convenience of description and are not limited to one position or one spatial orientation. Words such as "connected" or "coupled" are not limited to physical or mechanical connections, and may include electrical connections, whether direct or indirect. Words such as "plurality" or "several" mean at least two.

The voice recognition method of the embodiments of the present application is used to recognize slap sounds. The voice recognition method includes: acquiring at least one sound segment of the sound signal to be recognized and first feature information of the sound segment, the first feature information being the energy value of the sound segment, and, if the energy value of the middle region of the sound segment is greater than the energy threshold, extracting second feature information from the sound segment; and recognizing, according to the second feature information of the at least one sound segment, whether the sound signal to be recognized includes a slap sound.

In the voice recognition method, the second feature information is extracted from a sound segment only if the energy value of the middle region of the sound segment is greater than the energy threshold, which performs a preliminary screening of the sound signal to be recognized. Whether the sound signal to be recognized includes a slap sound is then recognized according to the second feature information. As a result, even over a long distance, the recognition rate of the slap sound is high, the robustness is good, and the possibility of false triggering is low, so the method is suitable as a reliable means of human-computer interaction.

An interaction method in an embodiment of the present application includes: acquiring a sound signal to be recognized; performing the foregoing voice recognition method, which includes acquiring at least one sound segment of the sound signal to be recognized and first feature information of the sound segment, the first feature information being the energy value of the sound segment, extracting second feature information from the sound segment if the energy value of the middle region of the sound segment is greater than the energy threshold, and recognizing, according to the second feature information of the at least one sound segment, whether the sound signal to be recognized includes a slap sound; and, if it is recognized according to the voice recognition method that the sound signal to be recognized includes a slap sound, outputting a corresponding control instruction according to the slap sound.

The voice recognition method has a high recognition rate for slap sounds, good robustness, and a low possibility of false triggering, which makes the interaction method reliable. Moreover, the instantaneous energy of a slap sound is stronger than that of a voice, and a slap sound is not easily attenuated in the air. Therefore, beyond a certain distance, for example more than 2 meters, a slap sound is recognized better than a voice. At such distances the slap sound can still be used for human-computer interaction, with a higher recognition rate and stronger resistance to interference.

The voice recognition system of the embodiment of the present application includes one or more processors for implementing the above voice recognition method.

The machine-readable storage medium of the embodiment of the present application has a program stored thereon, and when the program is executed by a processor, the above voice recognition method is realized.

The movable platform of the embodiment of the present application includes a body, a power system, a microphone, and one or more processors. The power system is arranged in the body to provide power for the movable platform. The microphone is used to receive the sound to be recognized and generate a corresponding sound signal to be recognized. One or more processors are configured to implement the above-mentioned voice recognition method, and if the voice signal to be recognized includes a slap sound according to the voice recognition method, output a corresponding control instruction according to the slap sound.

FIG. 1 shows a flowchart of an embodiment of a voice recognition method 100. The voice recognition method 100 is used to recognize slap sounds. In some embodiments, the frequency range of the slap sound is 300 Hz to 8000 Hz. The sound is crisp, its instantaneous energy is stronger than that of a voice, and it is not easily attenuated in the air, so it is easy to recognize, with a high recognition rate and strong resistance to interference. In one embodiment, the slap sound includes at least one of applause and tapping sounds. A tapping sound may be the sound of striking an object, such as tapping on a wall or a table, and its waveform is similar to that of applause. Slap sounds such as applause and/or tapping have a high recognition rate and strong resistance to interference, and can be recognized at a longer distance. In this embodiment, the voice recognition method 100 includes steps 101 and 102.

In step 101, at least one sound segment of the sound signal to be recognized and first feature information of the sound segment are acquired. The first feature information is the energy value of the sound segment. If the energy value of the middle region of the sound segment is greater than the energy threshold, the second feature information is extracted from the sound segment.

In an embodiment, the sound signal to be recognized may be one or more sound signals in a real-time sound signal stream. In one embodiment, the voice recognition method 100 may include acquiring the sound signal to be recognized, which can be cut out of the real-time sound signal stream. In one embodiment, the sound signal between two adjacent silent periods that each exceed a silent time threshold may be cut out as the sound signal to be recognized. The sound signal in a silent period indicates no sound or a very quiet sound; it can be called a "silent signal", and its energy value is lower than the minimum energy value of a slap sound. The energy value of the sound signal in the real-time sound signal stream can be compared with a preset silence energy threshold. If the energy value of the sound signal is less than the silence energy threshold, the sound signal is determined to be a silent signal, and the duration of the silent signal, namely the silent period, can be determined. The silence energy threshold does not exceed the minimum energy value of a slap sound. The silent time threshold can be preset. In one embodiment, the silent time threshold exceeds the interval between two consecutive slaps. In one example, the silent time threshold is any value greater than or equal to 2 seconds; when the interval between two adjacent slaps is less than 2 seconds, they are regarded as consecutive slaps. In another embodiment, the sound signal to be recognized is the entire real-time sound signal stream.
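The cutting of candidate signals out of a stream can be sketched as follows. This is a minimal illustration, not the applicant's implementation: it assumes that a per-frame energy value is already available, that the start and end of the buffer act as boundaries, and the names `split_on_silence`, `mute_energy_threshold` and `mute_time_threshold` are hypothetical.

```python
from typing import List, Tuple


def split_on_silence(frame_energies: List[float],
                     frame_seconds: float,
                     mute_energy_threshold: float,
                     mute_time_threshold: float = 2.0) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges (end exclusive) of candidate signals.

    A candidate is the run of frames between two silent periods that each last
    at least `mute_time_threshold` seconds. Shorter pauses stay inside the
    candidate, so slaps less than 2 s apart end up in the same candidate.
    """
    min_silent_frames = int(round(mute_time_threshold / frame_seconds))
    segments: List[Tuple[int, int]] = []
    start = None      # index of the first non-silent frame of the open candidate
    silent_run = 0    # length of the current run of silent frames
    for i, energy in enumerate(frame_energies):
        if energy < mute_energy_threshold:
            silent_run += 1
            if start is not None and silent_run >= min_silent_frames:
                # The silent period is long enough: close the candidate just
                # before the first frame of this silent run.
                segments.append((start, i - silent_run + 1))
                start = None
        else:
            if start is None:
                start = i
            silent_run = 0
    if start is not None:
        # Assumption: the end of the buffer also closes an open candidate.
        segments.append((start, len(frame_energies) - silent_run))
    return segments
```

With 0.5-second frames, for example, a 2-second silent time threshold corresponds to four consecutive silent frames.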

In another embodiment, the energy value may be used to perform a preliminary screening of the sound signal to be recognized. If the energy value of the middle region of a sound segment of the sound signal to be recognized is greater than the energy threshold, the sound corresponding to the middle region of the sound segment is loud, which indicates that the sound segment may include a slap sound; this provides a good preliminary screening for slap sounds. For example, in one embodiment, the middle region of the sound segment has a sharp peak, and the two end regions have small, flat values. The waveform of the sound segment is thus high in the middle and low at both ends, which indicates that the sound segment may include a slap sound. In an embodiment, the middle region of the sound segment may be the exact center of the sound segment, or a region extending from the exact center toward one end or toward both ends. In one embodiment, the energy threshold is a preset fixed value or a value that changes in real time. In one embodiment, the energy threshold may be determined according to the energy value of one or both end regions outside the middle region of the sound segment, so the energy threshold may differ between sound segments. In another embodiment, the energy threshold is a preset value, and a fixed energy threshold can be set according to the characteristics of slap sounds and experience.

Figure 2 shows a sub-flow diagram of an embodiment of step 101. Step 101 includes sub-steps 111 and 112. In sub-step 111, the sound signal to be identified is framed and windowed to obtain multiple sound frames corresponding to the sound signal to be identified.

Windowing the sound signal to be recognized divides it into multiple sound frames. Generally, a slap sound lasts approximately 80-160 milliseconds. In one embodiment, the sound signal to be recognized is divided into frames of 11-23 milliseconds each, and 4-15 consecutive sound frames are judged each time. In one example, the sound signal to be recognized is divided into frames of 16 milliseconds each, and 7 consecutive sound frames are judged each time. In other examples, frames of other durations may be used, and/or a different number of sound frames may be judged each time, which is not limited here.
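A minimal sketch of framing and windowing, under stated assumptions: the text does not name a window function or say whether frames overlap, so the Hamming window and the non-overlapping frames below are illustrative choices, and `frame_and_window` is a hypothetical name.

```python
import math
from typing import List, Sequence


def frame_and_window(samples: Sequence[float],
                     sample_rate: int,
                     frame_ms: float = 16.0) -> List[List[float]]:
    """Split samples into non-overlapping frames and apply a Hamming window.

    Trailing samples that do not fill a whole frame are dropped.
    """
    frame_len = int(sample_rate * frame_ms / 1000.0)
    if frame_len < 2:
        raise ValueError("frame is too short for the given sample rate")
    # Hamming window (an assumption; the source does not name the window).
    window = [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frames.append([samples[start + n] * window[n] for n in range(frame_len)])
    return frames
```

At a 16 kHz sample rate, a 16-millisecond frame holds 256 samples, so the 7 frames judged at a time in the example span 112 milliseconds, which lies within the 80-160 millisecond duration of a typical slap.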

In sub-step 112, if the energy value of the sound frame in the middle region of the multiple sound frames corresponding to the sound segment is greater than the energy threshold, the second feature information is extracted from the sound segment.

In one embodiment, the sound frame in the middle region includes the center sound frame. In another embodiment, the sound frame in the middle region includes the center sound frame and one or more sound frames on one or both sides of the center sound frame. In one embodiment, the sound segment includes an odd number of sound frames, and the center sound frame is the sound frame at the center of the sound segment. In another embodiment, the sound segment includes an even number of sound frames, and the center sound frame is the one or two sound frames closest to the center of the sound segment. When the middle region contains one sound frame, the energy value of the middle region is the energy value of that sound frame. When the middle region contains multiple sound frames, the energy value of the middle region can be calculated from the energy values of those sound frames by a suitable algorithm, such as the mean, the median, or the variance, which is not limited herein.
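The choice of middle-region frames and the reduction of their energies to one value might look like the sketch below. The helper names `center_indices` and `middle_energy` are hypothetical, and for an even number of frames the sketch assumes the two-frame variant of the center.

```python
from statistics import mean, median
from typing import Callable, List, Sequence


def center_indices(num_frames: int, half_width: int = 0) -> List[int]:
    """Indices of the middle region: center frame(s) plus neighbours per side."""
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    if num_frames % 2 == 1:
        centers = [num_frames // 2]                        # single center frame
    else:
        centers = [num_frames // 2 - 1, num_frames // 2]   # two center frames
    lo = max(0, centers[0] - half_width)
    hi = min(num_frames - 1, centers[-1] + half_width)
    return list(range(lo, hi + 1))


def middle_energy(frame_energies: Sequence[float],
                  half_width: int = 0,
                  reducer: Callable[[List[float]], float] = mean) -> float:
    """Reduce the energies of the middle-region frames to a single value."""
    idx = center_indices(len(frame_energies), half_width)
    return reducer([frame_energies[i] for i in idx])
```

For a 7-frame segment the center index is 3, that is, the fourth frame; passing `median` as the reducer selects the median variant mentioned in the text.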

In some embodiments, the window can be slid sequentially across the sound frames so that multiple consecutive sound frames are judged at each position. This avoids missing slap sounds and makes the judgment more accurate and more robust. In one example, when the window has been slid several times, the consecutive sound frames that have been judged together form the sound frames of one sound segment. For example, if the window slides three times, one frame each time, and 7 consecutive frames are judged each time, then after three slides from the initial position of the window a total of 10 consecutive sound frames have been judged, and these 10 frames are regarded as the sound frames of one sound segment. Sound segments are thus obtained by sliding the window. In one embodiment, the window slides one frame at a time. In other embodiments, according to actual needs, the window may also slide two frames, three frames or more at a time, which is not limited herein.
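The frame arithmetic of the sliding window can be checked with a short sketch; the function names are hypothetical.

```python
from typing import List, Tuple


def window_positions(num_frames: int, window_len: int = 7,
                     hop: int = 1) -> List[Tuple[int, int]]:
    """(start, end) frame ranges (end exclusive) visited by the sliding window."""
    return [(s, s + window_len)
            for s in range(0, num_frames - window_len + 1, hop)]


def frames_covered(window_len: int, slides: int, hop: int = 1) -> int:
    """Number of consecutive frames judged after `slides` slides of the window."""
    return window_len + slides * hop
```

A 7-frame window slid three times by one frame visits four positions and covers 7 + 3 = 10 frames, matching the example above.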

In an embodiment, the energy value includes the frequency spectrum value of the sound frame, and the fast Fourier transform may be performed on multiple sound frames to obtain the frequency spectrum value of the multiple sound frames. If the spectral value of the sound frame in the middle region of the multiple sound frames of the sound segment is greater than the energy threshold, the second feature information is extracted from the sound segment. The frequency spectrum value can reflect the energy of the sound. When the frequency spectrum value of the sound frame in the middle region is greater than the energy threshold, it indicates that the sound segment may include a slap sound, so the second characteristic information is extracted from the sound segment. The method of obtaining the spectrum value is simple. The spectrum value can be used to perform a preliminary screening of the sound to be recognized, and some sound segments that obviously do not include the slap sound can be removed. The method is simple and effective.
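As a rough illustration of a per-frame spectral value, the sketch below sums the normalized magnitudes of the one-sided spectrum. The text does not say how the transform output is reduced to a single value, so that reduction is an assumption; a naive discrete Fourier transform is written out to keep the example dependency-free, whereas a practical implementation would call an FFT routine, which yields the same bin values.

```python
import cmath
import math
from typing import Sequence


def spectral_value(frame: Sequence[float]) -> float:
    """Sum of normalized magnitudes over the bins 0..N/2 of the frame's DFT.

    Assumption: this sum stands in for the "spectral value" of a sound frame.
    """
    n = len(frame)
    total = 0.0
    for k in range(n // 2 + 1):
        acc = 0j
        for t, x in enumerate(frame):
            acc += x * cmath.exp(-2j * math.pi * k * t / n)
        total += abs(acc) / n
    return total
```

Because the value scales linearly with amplitude, a louder frame produces a proportionally larger spectral value, which is what the threshold comparison relies on.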

In one embodiment, if the spectral value of the sound frame in the middle region of the window is greater than the energy threshold, a trigger signal is generated; as the window slides through several sound frames in sequence, if the number of consecutively generated trigger signals reaches a trigger number threshold, the second feature information is extracted from the sound segment formed by the consecutive frames containing those sound frames. If multiple trigger signals are generated consecutively while the window slides through several sound frames, the sound segment may contain a slap sound. The window can move one frame at a time, and a single slap sound may trigger repeatedly to generate multiple trigger signals, which avoids missing the slap sound and improves the accuracy of the judgment.

In an embodiment, the trigger signal is generated when the spectral value of the sound frame in the middle region of the window is greater than a first energy value and the spectral values of the sound frames in the two end regions of the window are less than a second energy value, where the first energy value is greater than the second energy value; as the window slides through several sound frames in sequence, if the number of consecutively generated trigger signals reaches the trigger number threshold, the second feature information is extracted from the sound segment. In some embodiments, the first energy value includes a preset fixed value and/or a value related to the spectral values of the sound frames in the two end regions. For example, the first energy value may vary with the spectral values of the sound frames in the two end regions, such as being greater than the spectral values of the sound frames at both ends. In some embodiments, the second energy value includes a preset fixed value and/or a value related to the spectral value of the sound frame in the middle region. For example, the second energy value may vary with the spectral value of the sound frame in the middle region, such as being smaller than the spectral value of the sound frame in the middle region. In other embodiments, the spectral value of the sound frame at one end of the window may be required to be smaller than the second energy value, while the spectral value of the sound frame at the other end varies with the spectral value of the sound frame at the first end. This is not limited here.

In one example, 7 consecutive sound frames of the sound to be recognized are judged at a time, and the sound frame in the middle region of these 7 frames is the fourth frame. Let the spectral value of the x-th of these 7 frames be M(x), and let the minimum value over the third to fifth frames be MI. For example, in one embodiment, the following rule is preset: if M(4) > 2*MI and M(4) > 5*M(2) and M(4) > 3*M(6) and M(4) > 20*M(1) and M(4) > 7*M(7) and M(4) > 0.05, one trigger is judged to have occurred and a trigger signal is generated, where M(4) is the spectral value of the sound frame in the middle region. Whether or not a trigger occurs, the window then slides to the next frame and the above judgment is performed again. If a trigger occurs 4 times in a row, the sound segment formed by the consecutive sound frames that have been judged is considered to contain a slap sound; that is, the ten sound frames from the first frame to the tenth frame, counted from the initial position of the window, contain the slap sound, and the second feature information is then extracted from this sound segment. In this example, the trigger number threshold is 4, but it is not limited to this value, and other trigger number thresholds can be set in other examples. In this example, there are multiple energy thresholds, namely 2*MI, 5*M(2), 3*M(6), 20*M(1), 7*M(7), and 0.05. They include a fixed threshold of 0.05 and thresholds related to the spectral values of the sound frames at both ends, which may be multiples of those spectral values. In this way, the preliminary screening can be carried out more accurately, and missing a slap sound can be avoided.
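The example rule can be written down directly. The sketch below is illustrative only: it keeps the single-window trigger test separate from the consecutive-count test, because the text does not specify how much adjacent frames overlap or how the spectral values are recomputed between window positions. The names `is_trigger` and `reaches_trigger_threshold` are hypothetical.

```python
from typing import Iterable, Sequence


def is_trigger(m: Sequence[float]) -> bool:
    """Apply the example rule to 7 spectral values M(1)..M(7) in m[0]..m[6]."""
    if len(m) != 7:
        raise ValueError("expected exactly 7 spectral values")
    center = m[3]        # M(4), the frame in the middle region
    mi = min(m[2:5])     # MI, the minimum over frames 3 to 5
    return (center > 2 * mi
            and center > 5 * m[1]     # M(2)
            and center > 3 * m[5]     # M(6)
            and center > 20 * m[0]    # M(1)
            and center > 7 * m[6]     # M(7)
            and center > 0.05)        # fixed threshold


def reaches_trigger_threshold(triggers: Iterable[bool],
                              trigger_count_threshold: int = 4) -> bool:
    """True once `trigger_count_threshold` consecutive triggers have occurred."""
    consecutive = 0
    for fired in triggers:
        consecutive = consecutive + 1 if fired else 0
        if consecutive >= trigger_count_threshold:
            return True
    return False
```

A sharply peaked window such as `[0.001, 0.01, 0.1, 1.0, 0.1, 0.01, 0.001]` satisfies the rule, while a flat window or a peaked but very quiet one (center below 0.05) does not.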

Continuing to refer to FIG. 1, in step 102, according to the second feature information of at least one sound segment, it is recognized whether the sound signal to be recognized includes a slap sound.

The at least one sound segment that may contain a slap sound after the preliminary screening is further analyzed to determine whether the sound signal to be recognized includes a slap sound. In the voice recognition method 100 of the embodiments of the present application, the second feature information is extracted from a sound segment only if the energy value of the middle region of the sound segment is greater than the energy threshold, which performs a preliminary screening of the sound signal to be recognized. Whether the sound to be recognized includes a slap sound is then recognized according to the second feature information. As a result, the slap sound can be recognized at a high rate even over a long distance, and the possibility of false triggering is low, so the method is suitable as a reliable means of human-computer interaction.

In some embodiments, when the sound signal to be recognized includes a slap sound, the category of the slap sound is recognized. Further, each category of slap sound corresponds to a control instruction. The category of the slap sound is defined by at least one of the number of slaps, the duration of the slaps, and the frequency of the slaps. The number of slaps can be the number of consecutive slaps in one slap sound. The duration of the slaps may be the total duration of the consecutive slaps in one slap sound. The frequency of the slaps reflects how fast the slaps follow one another. In this way, the method not only recognizes whether the sound signal to be recognized includes a slap sound but, when it does, can further recognize the category of the slap sound. This is better suited to human-computer interaction, because different categories of slap sound can trigger different interactions.
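The mapping from slap category to control instruction might be sketched as follows. Everything specific here is an assumption made for illustration: the slap onset times are taken as given, the duration is approximated as the time from the first to the last onset, and the command names in `COMMANDS` are hypothetical placeholders, since the text names no concrete instructions.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence


@dataclass(frozen=True)
class SlapType:
    count: int          # number of consecutive slaps
    duration_s: float   # time from first to last slap onset (an approximation)
    rate_hz: float      # slaps per second between first and last onset


def describe_slaps(onset_times_s: Sequence[float]) -> SlapType:
    """Summarize one slap sound from the onset times of its slaps."""
    if not onset_times_s:
        raise ValueError("at least one slap onset is required")
    count = len(onset_times_s)
    duration = onset_times_s[-1] - onset_times_s[0]
    rate = (count - 1) / duration if duration > 0 else 0.0
    return SlapType(count, duration, rate)


# Hypothetical command names keyed by the number of consecutive slaps.
COMMANDS: Dict[int, str] = {2: "start_recording", 3: "stop_recording"}


def command_for(slap_type: SlapType) -> Optional[str]:
    """Look up the control instruction for a slap type, or None if unmapped."""
    return COMMANDS.get(slap_type.count)
```

Keying the table on the slap count is only one option; a table keyed on duration or rate ranges would follow the same pattern.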

In some embodiments, the second feature information is input into a recognition model to further recognize whether the sound signal to be recognized includes a slap sound. The recognition model allows accurate and fast recognition. In some embodiments, the second feature information includes acoustic features, and the acoustic features include at least one of Mel Frequency Cepstral Coefficient (MFCC) features, Linear Prediction Coefficient (LPC) features, filter bank (Filterbank) features, and bottleneck features. One or more of these acoustic features can be used by the recognition model to recognize the slap sound.

In some embodiments, the recognition model includes multiple sound categories. The likelihood of the second feature information with respect to the feature information of each sound category is determined; the likelihoods are sorted, and the sound category with the highest likelihood is determined to be the category of the sound to be recognized, which identifies whether the sound to be recognized includes a slap sound. This allows fast recognition. In one embodiment, the sound categories include a slap sound category and a non-slap sound category. The likelihood of the second feature information with respect to the feature information of the slap sound category and its likelihood with respect to the feature information of the non-slap sound category are determined, and the sound category with the higher likelihood is determined to be the category of the sound to be recognized. In this way, it can be determined whether the sound to be recognized includes a slap sound, with high recognition accuracy and speed.
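The ranking of likelihoods can be illustrated with a toy scorer. As an assumption made purely for brevity, each sound category is represented here by a single diagonal Gaussian over the feature vector, standing in for whatever recognition model is actually used; the function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Model = Tuple[Sequence[float], Sequence[float]]   # (means, variances)


def gaussian_log_likelihood(features: Sequence[float],
                            means: Sequence[float],
                            variances: Sequence[float]) -> float:
    """Log-likelihood of a feature vector under a diagonal Gaussian."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
               for x, m, v in zip(features, means, variances))


def classify(features: Sequence[float],
             category_models: Dict[str, Model]) -> Tuple[str, List[str]]:
    """Return the most likely category and all categories ranked by likelihood."""
    scores = {name: gaussian_log_likelihood(features, means, variances)
              for name, (means, variances) in category_models.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[0], ranked
```

Adding further categories, such as separate entries for two and three consecutive slaps, only requires more entries in `category_models`; the ranking step is unchanged.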

In some embodiments, the slap sound category includes at least two slap sound categories representing different numbers of consecutive slaps. For example, one slap sound category represents two consecutive slaps, another represents three consecutive slaps, and others represent more consecutive slaps. When the second feature information is input into the recognition model, it can be determined both whether the sound to be recognized includes a slap sound and how many consecutive slaps it contains. In this way, slap sounds with different numbers of consecutive slaps can be distinguished, and the recognition is more precise. In some other embodiments, the slap sound category may include at least two slap sound categories representing different durations and/or frequencies of the slaps.

In some embodiments, the slap sound training data and the non-slap sound training data are used to train the recognition model. In this way, the slap sound category and the non-slap sound category can be obtained. The non-slap sound training data may include data of sounds other than the slap, such as noise and speech sounds. A large amount of slap sound training data and non-slap sound training data can be collected to train the recognition model. In some embodiments, the recognition model may be trained multiple times to obtain a recognition model with better performance.

In some embodiments, the slap sound training data includes first slap sound training data and second slap sound training data, which differ in at least one of the number of slaps, the duration of the slaps, and the frequency of the slaps that they represent. The recognition model is trained using the first slap sound training data and the second slap sound training data. In this way, different slap sound categories can be obtained, which can be used to identify the category of a slap sound. In one embodiment, the first slap sound training data and the second slap sound training data represent different numbers of consecutive slaps. In an example, the first slap sound training data represents two consecutive slaps and the second slap sound training data represents three consecutive slaps, but the application is not limited to this example. The recognition model can be trained according to the actual application to obtain different slap sound categories.

In some embodiments, the recognition model includes at least one of a deep model and a shallow model, and such a recognition model achieves a high recognition rate. In some embodiments, the deep model includes at least one of the following: Deep Neural Networks (DNN), Long Short-Term Memory networks (LSTM), and Convolutional Neural Networks (CNN).

In one embodiment, the shallow model includes a Gaussian Mixture Model-Hidden Markov Model (GMM-HMM). Recognizing the signal to be recognized with the Gaussian mixture model-hidden Markov model gives a high recognition rate and a fast recognition speed. In some embodiments, the slap sound training data and the non-slap sound training data are used to train the Gaussian mixture model-hidden Markov model, where the slap sound training data includes first slap sound training data and second slap sound training data that differ in at least one of the number of slaps, the duration of the slaps, and the frequency of the slaps that they represent. The Gaussian mixture model-hidden Markov model trained in this way includes a non-slap sound category and slap sound categories, where the slap sound categories include a first slap sound category and a second slap sound category that differ in at least one of the number of slaps, the duration of the slaps, and the frequency of the slaps. See the description above for details. MFCC features are extracted from the slap sound training data and the non-slap sound training data and used to train the Gaussian mixture model-hidden Markov model.

In some embodiments, parameter estimation is performed on the hidden Markov model (HMM). In one embodiment, the parameters of the hidden Markov model are estimated with the Baum-Welch algorithm and/or a genetic algorithm. The Baum-Welch algorithm is also commonly referred to as the forward-backward algorithm. It first makes an initial estimate of the parameters of the HMM, which is likely to be a poor guess. It then evaluates the validity of these parameters against the given training data (for example, by cross-validation) and updates the parameters to reduce the errors they cause, so that the error with respect to the given training data becomes smaller. A genetic algorithm is a computational model that simulates natural selection and the genetic mechanisms of Darwin's theory of biological evolution; it searches for an optimal solution by simulating the process of natural evolution.
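The re-estimation loop described above can be illustrated with a minimal sketch of one Baum-Welch pass. This is not the implementation of this application: it assumes a discrete-emission HMM in place of Gaussian mixture emissions, applies no numerical scaling (so it only suits short sequences), and every name in it (`forward`, `backward`, `baum_welch_step`, the toy observation sequence) is hypothetical.

```python
# Illustrative sketch only: a discrete-emission HMM stands in for the
# Gaussian-mixture emissions of the text, and all names are hypothetical.

def forward(obs, pi, A, B):
    """alpha[t][i] = P(o_0..o_t, state_t = i)."""
    n = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(n)) * B[j][o]
                      for j in range(n)])
    return alpha


def backward(obs, A, B):
    """beta[t][i] = P(o_{t+1}..o_{T-1} | state_t = i)."""
    n = len(A)
    beta = [[1.0] * n]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[j][o] * nxt[j] for j in range(n))
                        for i in range(n)])
    return beta


def likelihood(obs, pi, A, B):
    """P(obs | model); a Baum-Welch pass does not decrease this value."""
    return sum(forward(obs, pi, A, B)[-1])


def baum_welch_step(obs, pi, A, B):
    """One re-estimation pass; returns the updated (pi, A, B)."""
    n, m, T = len(pi), len(B[0]), len(obs)
    alpha, beta = forward(obs, pi, A, B), backward(obs, A, B)
    p_obs = sum(alpha[-1])
    # gamma[t][i]: posterior probability of being in state i at time t.
    gamma = [[alpha[t][i] * beta[t][i] / p_obs for i in range(n)]
             for t in range(T)]
    # xi[t][i][j]: posterior probability of the transition i -> j at time t.
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / p_obs
            for j in range(n)] for i in range(n)] for t in range(T - 1)]
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(n)] for i in range(n)]
    new_B = [[sum(gamma[t][i] for t in range(T) if obs[t] == k) /
              sum(gamma[t][i] for t in range(T))
              for k in range(m)] for i in range(n)]
    return new_pi, new_A, new_B


# Toy data: symbol 1 could mark a high-energy frame, symbol 0 a quiet one.
OBS = [0, 0, 1, 0, 0, 1, 1, 0]
PI0 = [0.6, 0.4]
A0 = [[0.7, 0.3], [0.4, 0.6]]
B0 = [[0.9, 0.1], [0.2, 0.8]]
PI1, A1, B1 = baum_welch_step(OBS, PI0, A0, B0)
```

Repeating `baum_welch_step` until the value returned by `likelihood` stops improving gives the usual iterative estimate; a practical system would work in the log domain and use continuous emission densities.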

In some embodiments, the number of Gaussians of the Gaussian mixture model-hidden Markov model ranges from 3 to 12. This range is suitable for recognizing slap sounds and balances recognition performance against recognition speed, so that the recognition accuracy is as high as possible and the recognition speed is as fast as possible. In some embodiments, the first slap sound training data includes slap sound training data of two consecutive slaps, and the number of states of the Gaussian mixture model-hidden Markov model corresponding to the first slap sound training data ranges from 6 to 14. In some embodiments, the second slap sound training data includes slap sound training data of three consecutive slaps, and the number of states of the Gaussian mixture model-hidden Markov model corresponding to the second slap sound training data ranges from 9 to 21. In some embodiments, the number of states of the Gaussian mixture model-hidden Markov model corresponding to the non-slap sound training data ranges from 7 to 18. Within each of these ranges, the performance of the recognition model is as good as possible and the recognition speed is as fast as possible. In an example, the number of states for the first slap sound training data is 10, the number of states for the second slap sound training data is 15, the number of states for the non-slap sound training data is 12, and the number of Gaussians is 3. This is only an example and does not limit the application. In other examples, the number of states and/or the number of Gaussians may take other values; for example, the number of Gaussians may be 5 or 8.
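As a minimal sketch, the ranges and the example values given above can be captured in a small configuration object that checks itself against those ranges. The numeric ranges are taken from the text; the class name, field names, and category keys (`double_slap`, `triple_slap`, `non_slap`) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Ranges quoted from the text; the identifiers themselves are hypothetical.
GAUSSIAN_RANGE = (3, 12)
STATE_RANGES = {
    "double_slap": (6, 14),   # first slap sound training data
    "triple_slap": (9, 21),   # second slap sound training data
    "non_slap": (7, 18),      # non-slap sound training data
}


@dataclass(frozen=True)
class GmmHmmConfig:
    num_gaussians: int
    num_states: dict  # category name -> number of HMM states

    def validate(self) -> list:
        """Return a list of range violations (empty if the config is in range)."""
        problems = []
        lo, hi = GAUSSIAN_RANGE
        if not lo <= self.num_gaussians <= hi:
            problems.append(
                f"num_gaussians={self.num_gaussians} outside [{lo}, {hi}]")
        for category, (lo, hi) in STATE_RANGES.items():
            n = self.num_states.get(category)
            if n is None or not lo <= n <= hi:
                problems.append(f"{category} states={n} outside [{lo}, {hi}]")
        return problems


# The example values from the text: 10, 15 and 12 states with 3 Gaussians.
EXAMPLE_CONFIG = GmmHmmConfig(
    num_gaussians=3,
    num_states={"double_slap": 10, "triple_slap": 15, "non_slap": 12},
)
```

Calling `EXAMPLE_CONFIG.validate()` returns an empty list, while a configuration with, say, 2 Gaussians would be reported as out of range.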

In some embodiments, the Gaussian mixture model (GMM) in the Gaussian mixture model-hidden Markov model may be trained multiple times to obtain a model with high recognition accuracy. In some embodiments, the method for training the Gaussian mixture model in the Gaussian mixture model-hidden Markov model multiple times includes the Expectation Maximization (EM) method or the Maximum Likelihood (ML) method. The expectation maximization method obtains maximum likelihood estimates of the parameters of a probability model from incomplete data or from data sets with missing data (that is, with hidden variables). The maximum likelihood method, also called maximum likelihood estimation, is a point estimation method that can be used to estimate the parameters of the model.
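To make the repeated-training idea concrete, the following is a minimal sketch of expectation maximization for a one-dimensional Gaussian mixture. It is an assumption-laden illustration rather than the training procedure of this application: real acoustic features such as MFCCs are multi-dimensional, and the function names, the toy data, and the variance floor are all hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def log_likelihood(data, weights, means, variances):
    """Log-likelihood of the data under the mixture."""
    return sum(
        math.log(sum(w * gaussian_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)))
        for x in data)


def em_step(data, weights, means, variances, var_floor=1e-6):
    """One EM iteration: compute responsibilities, then re-estimate parameters."""
    k = len(weights)
    # E-step: posterior probability of each component for each sample.
    resp = []
    for x in data:
        joint = [w * gaussian_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    # M-step: weighted counts, means and variances.
    nk = [sum(r[c] for r in resp) for c in range(k)]
    new_weights = [n / len(data) for n in nk]
    new_means = [sum(r[c] * x for r, x in zip(resp, data)) / nk[c]
                 for c in range(k)]
    new_vars = [max(sum(r[c] * (x - new_means[c]) ** 2
                        for r, x in zip(resp, data)) / nk[c], var_floor)
                for c in range(k)]
    return new_weights, new_means, new_vars


# Toy 1-D data with two well-separated clusters (hypothetical values).
DATA = [0.1, -0.2, 0.0, 0.3, 4.9, 5.2, 5.0, 4.8]
params = ([0.5, 0.5], [1.0, 4.0], [1.0, 1.0])
LOG_LIKELIHOODS = [log_likelihood(DATA, *params)]
for _ in range(20):  # "trained multiple times"
    params = em_step(DATA, *params)
    LOG_LIKELIHOODS.append(log_likelihood(DATA, *params))
WEIGHTS, MEANS, VARIANCES = params
```

Each iteration leaves the log-likelihood no lower than before, which is why repeating the step yields a progressively better-fitting mixture.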

In this way, after the Gaussian mixture model-hidden Markov model is trained multiple times, a Gaussian mixture model-hidden Markov recognition model with better performance is obtained.

FIG. 3 shows a flowchart of an embodiment of an interaction method 200 of this application. The interaction method 200 includes steps 201-203.

In step 201, a sound signal to be recognized is obtained. The sound signal to be recognized can be obtained from a real-time sound signal stream. For details, refer to the description above, which is not repeated here.

In step 202, the voice recognition method 100 as described above is executed to recognize the acquired voice signal to be recognized.

In step 203, if it is recognized according to the voice recognition method 100 that the voice signal to be recognized includes a slap sound, a corresponding control instruction is output according to the slap sound.
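Steps 201 to 203 can be sketched as a single function that takes an already acquired signal, runs a recognizer, and emits a control instruction only when a slap sound is found. In this sketch the recognizer is passed in as a callable; `stub_recognizer` is a hypothetical placeholder based on peak amplitude and is not the voice recognition method 100 described in this application. All names are assumptions made for illustration.

```python
from typing import Callable, Optional, Sequence

# A recognizer returns a slap sound category, or None for "no slap sound".
Recognizer = Callable[[Sequence[float]], Optional[str]]


def interact(signal: Sequence[float],
             recognize: Recognizer,
             instruction_for: Callable[[str], str]) -> Optional[str]:
    """Run steps 202 and 203 on a signal obtained in step 201."""
    category = recognize(signal)        # step 202: recognize the signal
    if category is None:                # no slap sound: output nothing
        return None
    return instruction_for(category)    # step 203: output the instruction


def stub_recognizer(signal: Sequence[float]) -> Optional[str]:
    """Hypothetical stand-in for method 100; NOT the pipeline of the text."""
    peak = max((abs(s) for s in signal), default=0.0)
    return "slap" if peak > 0.8 else None
```

For example, `interact([0.0, 0.95, 0.1], stub_recognizer, lambda c: "control:" + c)` yields an instruction string, while a quiet signal yields `None`.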

The interaction method 200 uses slap sounds for interaction. The voice recognition method has a high recognition rate for slap sounds, good robustness, and a low probability of false triggering, which makes the interaction method reliable. Moreover, the instantaneous energy of a slap sound is stronger than that of speech, and it is less easily attenuated in the air. Therefore, at a certain distance, such as 2 meters or more, slap sounds are recognized better than speech, so human-computer interaction by slap sound can be realized over a longer distance range, with a higher recognition rate and stronger anti-interference capability.

In some embodiments, the control instruction includes a control instruction for controlling the movable platform when the sound signal to be recognized includes a slap sound. The control instruction can control the movable platform, for example, to move forward, move backward, turn, rotate, stand still, or fire bullets. Movable platforms may include mobile cars, unmanned aerial vehicles, automobiles, robots, or other movable devices. The slap sound is used to interact with and control the movable platform. Because the recognition rate of the slap sound is high, control of the movable platform is more accurate, the probability of erroneous control is low, and multiple interaction modes can be realized over a longer distance range, which improves user experience.

In some embodiments, the control instruction includes a control instruction for controlling the visual system of the movable platform when the sound signal to be recognized includes a tapping sound. When the sound signal to be recognized includes the tapping sound, the working state of the visual system can be controlled. Using slap sound to control the visual system can improve the accuracy of controlling the visual system. In one embodiment, the control instruction includes a control instruction for controlling the vision system to start visual tracking, and/or a control instruction for controlling the vision system to end visual tracking. The visual tracking can be started and/or ended by tapping sound, and the control of visual tracking can be accurately realized.

In some other embodiments, the control instruction can control other systems of the movable platform, for example, the power device of the movable platform can be controlled to control the movement of the movable platform; the camera of the movable platform can be controlled to take pictures. In some other embodiments, the control command can control other devices, and is not limited to a movable platform.

In some embodiments, at least one of the number of slaps of the slap sound, the duration of the slap, and the frequency of the slap is acquired, and different control instructions are output according to at least one of the number of slaps, the duration of the slap, and the frequency of the slap. Slap sounds that differ in at least one of these properties produce different control instructions, so that different controls can be realized with different slap sounds. For example, different control instructions can be generated according to different slap sounds to control the start and the end of visual tracking, respectively.

In one embodiment, different control instructions are generated according to different numbers of consecutive slaps. In an example, the user claps twice in a row; the interaction method 200 recognizes the slap sound representing two consecutive slaps and controls the movable platform to start visual tracking, and the movable platform starts to follow the user. While it is following, the user claps three times in a row; the interaction method 200 recognizes the slap sound representing three consecutive slaps and controls the movable platform to stop moving. This is only an example and does not limit the application. In an embodiment, the mapping between the slap sound category and the control instruction may be preset, or may be set by the user, which enhances the flexibility of interactive control and improves user experience.
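The preset-or-user-defined mapping described above can be sketched as a small lookup table with an override method. The default entries mirror the example in the text (two consecutive slaps start visual tracking, three stop the platform); the category keys, instruction strings, and class name are hypothetical.

```python
# Preset mapping mirroring the example in the text; identifiers are hypothetical.
DEFAULT_MAPPING = {
    "double_slap": "start_visual_tracking",
    "triple_slap": "stop_moving",
}


class SlapCommandMap:
    """Maps a recognized slap sound category to a control instruction."""

    def __init__(self, preset=None):
        # Copy so that user changes never alter the shared preset.
        self._mapping = dict(DEFAULT_MAPPING if preset is None else preset)

    def rebind(self, category, instruction):
        """Let the user assign a different instruction to a category."""
        self._mapping[category] = instruction

    def instruction_for(self, category):
        """Return the instruction for a category, or None if unmapped."""
        return self._mapping.get(category)
```

A user who prefers three slaps to take a photo could call `rebind("triple_slap", "take_photo")` on their own instance without affecting the preset.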

FIG. 4 is a schematic diagram of an embodiment of the voice recognition system 300 of this application. The voice recognition system 300 includes one or more processors for implementing a voice recognition method. The processor 301 of the voice recognition system 300 can implement the voice recognition method 100 described above. In some embodiments, the voice recognition system 300 may include a computer-readable storage medium 304, which may store a program that can be called by the processor 301, and may include a non-volatile storage medium. In some embodiments, the voice recognition system 300 may include a memory 303 and an interface 302. In some embodiments, the voice recognition system 300 may also include other hardware according to actual applications.

The computer-readable storage medium 304 of this application has a program stored thereon, and when the program is executed by the processor 301, the voice recognition method 100 is implemented.

FIG. 5 shows a block diagram of an embodiment of the movable platform 400 of the present application. The movable platform 400 includes a body 401, a power system 402, a microphone 403, and one or more processors 404. The movable platform 400 may include a mobile car, an unmanned aerial vehicle, an automobile, a robot, or another movable device. The power system 402 is provided in the body 401 and is used to provide power for the movable platform 400. In some embodiments, the power system 402 may include an electric motor. In one embodiment, the movable platform 400 is an unmanned aerial vehicle, and the power system 402 includes a propeller connected to a motor. In another embodiment, the movable platform 400 is a mobile trolley, and the power system 402 includes wheels connected to motors, such as universal wheels.

The microphone 403 is used to receive the sound to be recognized and generate a corresponding sound signal to be recognized. The microphone 403 may be installed in the body 401. Since the instantaneous energy of a slap sound is stronger than that of speech and is less easily attenuated in the air, the slap sound can be better received by the microphone 403. There may be one or more microphones. In an embodiment, the microphone may also include wind-protection accessories, such as a furry windscreen, a shock mount, etc., to better receive the sound to be recognized.

The one or more processors 404 are configured to implement a voice recognition method and, if it is recognized according to the voice recognition method that the sound signal to be recognized includes a slap sound, to output a corresponding control instruction according to the slap sound. The processor 404 can control the power system 402.

In one embodiment, the control instruction includes a control instruction for controlling the movable platform 400 when the sound signal to be recognized includes a slap sound. In one embodiment, the movable platform 400 includes a vision system 405, and the control instruction includes a control instruction for controlling the vision system 405 when the sound signal to be recognized includes a slap sound. The processor 404 can control the vision system 405. In one embodiment, the control instruction includes a control instruction for controlling the vision system 405 to start visual tracking, and/or a control instruction for controlling the vision system 405 to end visual tracking. For details, refer to the description above.

In one embodiment, the processor 404 is configured to acquire at least one of the number of slaps of the slap sound, the duration of the slap, and the frequency of the slap, and to output different control instructions according to at least one of the number of slaps of the slap sound, the duration of the slap, and the frequency of the slap. For details, refer to the description above.

This application can take the form of a computer program product implemented on one or more storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing program code. Computer-readable storage media include permanent and non-permanent, removable and non-removable media, and can store information by any method or technology. The information can be computer-readable instructions, data structures, program modules, or other data. Examples of computer-readable storage media include, but are not limited to: phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.

It should be understood that each part of this application can be implemented by hardware, software, or a combination thereof. In the above embodiments, multiple steps or methods can be implemented by software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented by hardware, they can be implemented by any one or a combination of the following technologies: discrete logic circuits with logic gates for realizing logic functions on data signals, application-specific integrated circuits with suitable combinational logic gates, programmable gate arrays (PGA), field-programmable gate arrays (FPGA), etc.

A person of ordinary skill in the art can understand that all or part of the steps of the methods in the above embodiments can be completed by a program instructing relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, includes one or a combination of the steps of the method embodiments.

It should be noted that in this document, relational terms such as "first" and "second" are only used to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any such actual relationship or sequence between these entities or operations. The terms "include", "comprise", or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device including a series of elements includes not only those elements but also other elements that are not explicitly listed, or elements inherent to such a process, method, article, or device. Without further restrictions, an element defined by the phrase "including a..." does not exclude the existence of other identical elements in the process, method, article, or device that includes the element.

The methods and devices provided by the embodiments of the present invention are described in detail above. Specific examples are used herein to illustrate the principles and implementations of the present invention, and the descriptions of the above embodiments are only intended to help in understanding the method of the present invention and its core idea. At the same time, for those of ordinary skill in the art, the specific implementation and the scope of application may change according to the idea of the present invention. In summary, the content of this specification should not be construed as a limitation of the present invention.

The content disclosed in this patent document contains copyrighted material. The copyright belongs to the copyright owner. The copyright owner does not object to anyone copying the patent document or the patent disclosure in the official records and archives of the Patent and Trademark Office.

Claims

A voice recognition method for recognizing tapping sounds, characterized in that the voice recognition method includes:

Acquiring at least one sound segment of a sound signal to be recognized and first feature information of the sound segment, wherein the first feature information is an energy value of the sound segment, and if the energy value of a middle region of the sound segment is greater than an energy threshold, extracting second feature information from the sound segment; and

According to the second feature information of at least one of the sound segments, it is recognized whether the sound signal to be recognized includes a slap sound.
2. The voice recognition method according to claim 1, wherein if the energy value of the middle region of the sound segment is greater than an energy threshold, extracting second feature information from the sound segment comprises:

Performing frame division and windowing processing on the voice signal to be recognized to obtain multiple sound frames corresponding to the voice signal to be recognized;

If the energy value of the sound frame in the middle region of the plurality of sound frames corresponding to the sound segment is greater than the energy threshold, the second characteristic information is extracted from the sound segment.
The voice recognition method according to claim 2, wherein the energy value includes the spectral value of the sound frame, and extracting the second feature information from the sound segment if the energy value of the sound frame in the middle region of the plurality of sound frames corresponding to the sound segment is greater than the energy threshold comprises:

Performing fast Fourier transform on a plurality of the sound frames to obtain the spectral values of the plurality of sound frames;

If the frequency spectrum value of the sound frame in the middle region of the multiple sound frames corresponding to the sound segment is greater than the energy threshold, the second feature information is extracted from the sound segment.
The voice recognition method according to claim 3, wherein the window can be slid sequentially across the plurality of sound frames, and extracting the second feature information from the sound segment if the spectral value of the sound frame in the middle region of the plurality of sound frames corresponding to the sound segment is greater than the energy threshold comprises:

If the spectral value of the sound frame in the middle area of the window is greater than the energy threshold, generating a trigger signal;

When the window slides through several sound frames in sequence, if the number of continuously generated trigger signals reaches the trigger number threshold, the second characteristic information is extracted from the sound segment.
The voice recognition method according to claim 3, wherein the window can be slid sequentially across the plurality of sound frames, and extracting the second feature information from the sound segment if the spectral value of the sound frame in the middle region of the plurality of sound frames corresponding to the sound segment is greater than the energy threshold comprises:

When the spectral value of the sound frame in the middle region of the window is greater than a first energy value, and the spectral values of the sound frames in the two end regions of the window are smaller than a second energy value, generating a trigger signal, wherein the first energy value is greater than the second energy value;

When the window slides through a number of the sound frames in sequence, if the number of continuously generated trigger signals reaches the trigger number threshold, the second characteristic information is extracted from the sound segment.
The voice recognition method according to claim 1, wherein the recognizing whether the sound signal to be recognized includes a slap sound according to the second feature information of at least one of the sound segments comprises:

The second feature information is input into a recognition model for recognition, so as to recognize whether the sound signal to be recognized includes a tapping sound.
The voice recognition method according to claim 6, wherein the second feature information includes acoustic features, and the acoustic features include at least one of Mel-frequency cepstral coefficient features, linear prediction coefficient features, filter bank features, and bottleneck features.
The voice recognition method according to claim 6, wherein the recognition model includes multiple sound categories, and inputting the second feature information into the recognition model for recognition, so as to recognize whether the sound signal to be recognized includes a slap sound, comprises:

Determining, respectively, the likelihoods between the second feature information and the feature information of the multiple sound categories;

The likelihood is sorted, and the sound category with the highest likelihood is determined as the category of the sound segment, so as to identify whether the sound signal to be recognized includes a slap sound.
9. The voice recognition method according to claim 8, wherein the sound categories include a slap sound category and a non-slap sound category.
The voice recognition method according to claim 9, wherein the slap sound category includes at least two slap sound categories representing different numbers of consecutive slaps.
The voice recognition method according to claim 6, wherein the voice recognition method comprises: training the recognition model using slap sound training data and non-slap sound training data.
The voice recognition method according to claim 11, wherein the slap sound training data includes first slap sound training data and second slap sound training data, and the first slap sound training data and the second slap sound training data differ in at least one of the number of slaps, the duration of the slap, and the frequency of the slap that they represent;

The training the recognition model using the slap sound training data and the non-slap sound training data includes:

Training the recognition model using the first slap sound training data and the second slap sound training data.
The voice recognition method according to claim 6, wherein the recognition model includes at least one of a deep model and a shallow model.
The voice recognition method according to claim 13, wherein the deep model comprises at least one of the following: a deep neural network, a long short-term memory network, and a convolutional neural network.
The voice recognition method according to claim 13, wherein the shallow model comprises a Gaussian mixture model-hidden Markov model.
The voice recognition method according to claim 15, wherein the Gaussian number of the Gaussian mixture model-hidden Markov model ranges from 3 to 12.
The voice recognition method according to claim 15, wherein slap sound training data and non-slap sound training data are used to train the Gaussian mixture model-hidden Markov model, wherein the slap sound training data includes first slap sound training data and second slap sound training data, and the first slap sound training data and the second slap sound training data differ in at least one of the number of slaps, the duration of the slap, and the frequency of the slap that they represent.
The voice recognition method according to claim 17, wherein the first slap sound training data includes slap sound training data of two consecutive slaps, and the number of states of the Gaussian mixture model-hidden Markov model corresponding to the first slap sound training data ranges from 6 to 14.
The voice recognition method according to claim 17, wherein the second slap sound training data includes slap sound training data of three consecutive slaps, and the number of states of the Gaussian mixture model-hidden Markov model corresponding to the second slap sound training data ranges from 9 to 21.
The voice recognition method according to claim 17, wherein the number of states of the Gaussian mixture model-hidden Markov model corresponding to the non-slap sound training data ranges from 7 to 18.
The voice recognition method according to claim 17, wherein the training the Gaussian mixture model-hidden Markov model using slap sound training data and non-slap sound training data comprises:

Estimating the parameters of the hidden Markov model.
The voice recognition method according to claim 21, wherein the method for estimating the parameters of the hidden Markov model comprises: the Baum-Welch algorithm and/or a genetic algorithm.
The voice recognition method according to claim 17, wherein the training the Gaussian mixture model-hidden Markov model using slap sound training data and non-slap sound training data comprises:

The Gaussian mixture model in the Gaussian mixture model-hidden Markov model is trained multiple times.
The voice recognition method according to claim 23, wherein the method of training the Gaussian mixture model in the Gaussian mixture model-hidden Markov model multiple times comprises: an expectation maximization method or a maximum likelihood method.
The voice recognition method according to claim 1, wherein the frequency range of the slap sound is 300 Hz to 8000 Hz.
The voice recognition method according to claim 1, wherein the slap sound comprises at least one of a clapping sound and a knocking sound.
The voice recognition method according to claim 1, wherein the recognizing whether the sound signal to be recognized includes a slap sound according to the second feature information of at least one of the sound segments further comprises:

When the sound signal to be recognized includes the slap sound, identifying the category of the slap sound, wherein each category of the slap sound corresponds to a respective control instruction, and the category of the slap sound is characterized by at least one of the number of slaps, the duration of the slap, and the frequency of the slap.
An interactive method, characterized in that it includes:

Acquire the sound signal to be recognized;

Executing the voice recognition method of any one of claims 1-27; and

If it is recognized according to the voice recognition method that the sound signal to be recognized includes a slap sound, a corresponding control instruction is output according to the slap sound.
The interaction method according to claim 28, wherein the control instruction comprises a control instruction for controlling a movable platform when the sound signal to be recognized includes the tapping sound.
The interaction method according to claim 29, wherein the control instruction comprises a control instruction for controlling the visual system of the movable platform when the sound signal to be recognized includes the slap sound.
The interaction method according to claim 30, wherein the control instruction comprises a control instruction for controlling the vision system to start visual tracking, and/or a control instruction for controlling the vision system to end the visual tracking.
The interaction method according to claim 28, wherein said outputting a corresponding control instruction according to said tapping sound comprises:

Acquiring at least one of the number of times of the slap of the slap sound, the duration of the slap, and the frequency of the slap;

Different control commands are output according to at least one of the number of slaps of the slap sound, the duration of the slap, and the frequency of the slap.
A voice recognition system, characterized in that it comprises one or more processors for implementing the voice recognition method according to any one of claims 1-27.
A computer-readable storage medium, characterized in that a program is stored thereon, and when the program is executed by a processor, the voice recognition method according to any one of claims 1-27 is realized.
A movable platform, characterized in that it comprises:

A body;

A power system provided in the body and used to provide power to the movable platform;

A microphone used to receive the sound to be recognized and generate the corresponding sound signal to be recognized; and

One or more processors, configured to implement the voice recognition method of any one of claims 1-27, and if it is recognized according to the voice recognition method that the sound signal to be recognized includes a slap sound, to output a corresponding control instruction according to the slap sound.
The movable platform according to claim 35, wherein the control instruction comprises a control instruction for controlling the movable platform when the sound signal to be recognized includes the tapping sound.
The movable platform according to claim 36, wherein the movable platform includes a vision system, and the control instruction includes a control instruction for controlling the vision system when the sound signal to be recognized includes the slap sound.
The movable platform according to claim 37, wherein the control instruction comprises a control instruction for controlling the vision system to start visual tracking, and/or a control instruction for controlling the vision system to end the visual tracking.
The movable platform according to claim 35, wherein the processor is configured to:

Acquiring at least one of the number of times of slap of the slap sound, the duration of the slap, and the frequency of the slap;

Different control commands are output according to at least one of the number of slaps of the slap sound, the duration of the slap, and the frequency of the slap.