CN111684522A - Sound recognition method, interaction method, sound recognition system, computer-readable storage medium, and movable platform
- Publication number: CN111684522A (application number CN201980009292.6A)
- Authority
- CN
- China
- Prior art keywords
- sound
- clapping
- training data
- model
- segment
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
Abstract
This application discloses a sound recognition method, an interaction method, a sound recognition system, a computer-readable storage medium, and a movable platform. The sound recognition method is used to recognize clapping sounds and comprises the following steps: acquiring at least one sound segment of a sound signal to be recognized together with first feature information of the segment, the first feature information being an energy value of the segment; if the energy value of the middle region of the sound segment is greater than an energy threshold, extracting second feature information from the segment; and recognizing, according to the second feature information of the at least one sound segment, whether the sound signal to be recognized includes a clapping sound.
Description
Technical Field
The present application relates to the field of sound recognition, and in particular to a sound recognition method, an interaction method, a sound recognition system, a computer-readable storage medium, and a movable platform.
Background
With the spread of intelligent hardware into settings such as home life and education, sound has gradually become an important mode of human-machine interaction, for example voice interaction. However, owing to hardware limitations, when the user is far away, for example more than 2 meters from the device, the environmental noise mixed into the speech signal lowers the signal-to-noise ratio and poses a great challenge to speech recognition. Compared with a speech signal, a slap sound signal is simpler, has stronger anti-interference capability, and carries stronger instantaneous energy. A slap sound such as a clap can therefore be used to control a hardware device, for example through a sound-activated switch. However, existing sound-activated switches based on waveform-comparison circuits are not robust in use: almost any loud sound can trigger them, and such frequent false triggering makes them unreliable as a human-machine interaction mode.
Disclosure of Invention
The present application provides an improved sound recognition method, interaction method, sound recognition system, computer-readable storage medium, and movable platform.
According to one aspect of the embodiments of the present application, a sound recognition method for recognizing a clapping sound is provided, the method comprising: acquiring at least one sound segment of a sound signal to be recognized and first feature information of the sound segment, the first feature information being an energy value of the segment, and, if the energy value of the middle region of the segment is greater than an energy threshold, extracting second feature information from the segment; and recognizing, according to the second feature information of the at least one sound segment, whether the sound signal to be recognized includes a clapping sound.
According to one aspect of the embodiments of the present application, an interaction method is provided, comprising: acquiring a sound signal to be recognized; performing the sound recognition method described above; and, if the sound signal to be recognized is found to include a clapping sound, outputting a corresponding control instruction according to the clapping sound.
According to another aspect of the embodiments of the present application, a sound recognition system is provided, comprising one or more processors configured to implement the sound recognition method described above.
According to another aspect of the embodiments of the present application, a computer-readable storage medium is provided, having stored thereon a program which, when executed by a processor, implements the sound recognition method described above.
According to another aspect of the embodiments of the present application, a movable platform is provided, comprising: a body; a power system, arranged on the body and used to power the movable platform; a microphone, used to receive the sound to be recognized and generate a corresponding sound signal to be recognized; and one or more processors configured to implement the sound recognition method described above and, if the sound signal to be recognized is found to include a clapping sound, to output a corresponding control instruction according to the clapping sound.
In the sound recognition method, second feature information is extracted from a sound segment only if the energy value of the middle region of the segment exceeds an energy threshold, so the sound signal to be recognized is first coarsely screened; whether it includes a clapping sound is then recognized from the second feature information. As a result, even over a relatively long distance the clapping-sound recognition rate is high, robustness is good, and the probability of false triggering is low, making the method suitable as a reliable human-machine interaction mode.
Drawings
To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below are merely some embodiments of the present invention; those of ordinary skill in the art may derive other drawings from them without inventive effort.
FIG. 1 is a flow chart of one embodiment of the sound recognition method of the present application.
FIG. 2 is a sub-flow chart of one embodiment of the sound recognition method of the present application.
FIG. 3 is a flow chart illustrating one embodiment of the interaction method of the present application.
FIG. 4 is a schematic diagram of one embodiment of the sound recognition system of the present application.
FIG. 5 is a block diagram illustrating modules of one embodiment of a movable platform of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. Where the following description refers to the drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present application; rather, they are merely examples of apparatus and methods consistent with some aspects of the present application, as detailed in the appended claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. Unless otherwise indicated, "front", "rear", "lower" and/or "upper" and the like are for convenience of description and are not limited to one position or one spatial orientation. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The word "plurality" or "a number" and the like mean at least two.
The sound recognition method of the embodiments of the present application is used to recognize a clapping sound. The method comprises: acquiring at least one sound segment of a sound signal to be recognized and first feature information of the sound segment, the first feature information being an energy value of the segment, and, if the energy value of the middle region of the segment is greater than an energy threshold, extracting second feature information from the segment; and recognizing, according to the second feature information of the at least one sound segment, whether the sound signal to be recognized includes a clapping sound.
In this sound recognition method, second feature information is extracted from a sound segment only if the energy value of the middle region of the segment exceeds the energy threshold, so the sound signal to be recognized is first coarsely screened; whether it includes a clapping sound is then recognized from the second feature information. Even over a relatively long distance the clap recognition rate is therefore high, robustness is good, and the probability of false triggering is low, making the method suitable as a reliable human-machine interaction mode.
An interaction method of the embodiments of the present application comprises: acquiring a sound signal to be recognized; acquiring at least one sound segment of the sound signal and first feature information of the segment, the first feature information being an energy value of the segment, and, if the energy value of the middle region of the segment is greater than an energy threshold, extracting second feature information from the segment; recognizing, according to the second feature information of the at least one segment, whether the sound signal to be recognized includes a clapping sound; and, if it does, outputting a corresponding control instruction according to the clapping sound.
Because the sound recognition method has a high clapping-sound recognition rate, good robustness, and a low probability of false triggering, the interaction method is reliable. Moreover, the instantaneous energy of a clapping sound is stronger than that of speech and is not easily attenuated away in air, so a clapping sound transmitted over some distance, for example more than 2 meters, is recognized better than speech. Human-machine interaction can thus be realized with clapping sounds over a longer range, with a higher recognition rate and stronger anti-interference performance.
The sound recognition system of the embodiments of the present application comprises one or more processors configured to implement the sound recognition method described above.
A computer-readable storage medium of the embodiments of the present application stores a program that, when executed by a processor, implements the sound recognition method described above.
The movable platform of the embodiments of the present application includes a body, a power system, a microphone, and one or more processors. The power system is arranged on the body and powers the movable platform. The microphone receives the sound to be recognized and generates a corresponding sound signal to be recognized. The one or more processors implement the sound recognition method and, if the sound signal to be recognized is found to include a clapping sound, output a corresponding control instruction according to the clapping sound.
FIG. 1 is a flow chart illustrating one embodiment of a sound recognition method 100. The sound recognition method 100 is used to recognize slap sounds. In some embodiments, the frequency range of a slap sound is 300 Hz to 8000 Hz; the sound is crisp, its instantaneous energy is stronger than that of speech, and it is not easily attenuated in air, so it is easy to recognize, with a good recognition effect, a high recognition rate, and strong anti-interference performance. In one embodiment, the slap sound includes at least one of a clapping sound and a tapping sound. The tapping sound may be the sound of tapping an object such as a wall or a table, whose waveform is similar to that of a clap. Slap sounds such as claps and/or taps have a high recognition rate and strong anti-interference performance and can be recognized at a longer distance. In this embodiment, the sound recognition method 100 includes steps 101 and 102.
In step 101, at least one sound segment of a sound signal to be recognized and first feature information of the segment are acquired; the first feature information is an energy value of the segment, and if the energy value of the middle region of the segment is greater than an energy threshold, second feature information is extracted from the segment.
In one embodiment, the sound signal to be recognized may be one or more segments of a real-time sound signal stream, and the sound recognition method 100 may include acquiring the sound signal to be recognized by intercepting it from that stream. In one embodiment, the sound signal between two adjacent mute periods that each exceed a mute time threshold is intercepted as the sound signal to be recognized. A sound signal within a mute period represents no sound or very little sound and may be called a "mute signal"; its energy value is below the minimum energy value of a clapping sound. The energy value of the incoming stream may be compared against a preset mute energy threshold: if the energy value is smaller than the threshold, the signal is judged to be a mute signal, and the duration of that mute signal, i.e. the mute period, can be measured. The mute energy threshold does not exceed the minimum energy value of a clapping sound. The mute time threshold may be preset and should exceed the interval between two adjacent claps in a run of consecutive claps. In one example, the mute time threshold is any value greater than or equal to 2 seconds, so claps less than 2 seconds apart are treated as consecutive. In another embodiment, the entire real-time sound signal stream is taken as the sound signal to be recognized.
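As an illustration of this interception step, the following is a minimal sketch in Python. It assumes per-frame mean-square energy as the energy measure and uses placeholder threshold values; the patent fixes only the relationships (the mute energy threshold must not exceed the minimum clap energy, and the mute time threshold must exceed the gap between consecutive claps), not these specific numbers.

```python
import numpy as np

def split_on_silence(signal, sr, mute_energy_thresh=1e-4,
                     mute_time_s=2.0, frame_ms=16):
    """Cut a real-time stream into candidate segments at long silences.

    mute_energy_thresh is a placeholder value, not from the patent.
    """
    frame_len = int(sr * frame_ms / 1000)
    n = len(signal) // frame_len
    # Per-frame mean-square energy; frames below the threshold are "mute".
    energy = np.array([np.mean(signal[i*frame_len:(i+1)*frame_len] ** 2)
                       for i in range(n)])
    silent = energy < mute_energy_thresh
    min_run = int(mute_time_s * 1000 / frame_ms)  # mute time threshold, in frames

    segments, seg_start, run = [], None, 0
    for i in range(n):
        if silent[i]:
            run += 1
            # A mute period long enough closes the open segment at the
            # frame where the silence began.
            if run == min_run and seg_start is not None:
                segments.append(signal[seg_start*frame_len:(i-run+1)*frame_len])
                seg_start = None
        else:
            run = 0
            if seg_start is None:
                seg_start = i  # sound resumes: open a new segment
    if seg_start is not None:
        segments.append(signal[seg_start*frame_len:])
    return segments
```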
The energy value may be used to preliminarily screen the sound signal to be recognized: if the energy value of the middle region of a sound segment is greater than the energy threshold, the sound in that region is loud, which indicates the segment may contain a slap, so the screening catches slap sounds well. For example, in one embodiment a candidate segment is peaked in its middle region and small and flat at the two ends, i.e. its waveform is high in the middle and low at both ends, suggesting a slap. The middle region of a segment may be its center, or a region extending from the center toward one or both ends. The energy threshold may be a preset fixed value or a value that varies in real time. In one embodiment, the threshold is determined from the energy values of one or both end regions outside the middle region, so different segments may have different thresholds. In another embodiment, a fixed threshold is preset based on the characteristics of slap sounds and on experience.
FIG. 2 is a sub-flow chart illustrating one embodiment of step 101. Step 101 comprises sub-steps 111 and 112. In sub-step 111, the sound signal to be recognized is framed and windowed to obtain a plurality of sound frames corresponding to the signal.
Windowing the sound signal to be recognized realizes the framing, yielding a plurality of sound frames. A single slap typically lasts about 80-160 milliseconds. In one embodiment, the signal is framed at 11-23 milliseconds per frame, and 4-15 consecutive frames are judged at a time. In one example, the signal is framed at 16 milliseconds per frame and 7 consecutive frames are judged each time. In other examples, other frame durations and/or frame counts may be used; this is not limited here.
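A sketch of sub-step 111 under the example's 16 ms framing follows. The Hamming window is an assumption; the text calls for windowing but does not name a window function.

```python
import numpy as np

def frame_and_window(signal, sr, frame_ms=16):
    """Cut a segment into non-overlapping frames and apply a window.

    16 ms sits inside the patent's 11-23 ms range; the Hamming window
    is one common choice, assumed here.
    """
    frame_len = int(sr * frame_ms / 1000)
    n = len(signal) // frame_len
    frames = signal[:n * frame_len].reshape(n, frame_len)
    return frames * np.hamming(frame_len)  # shape: (n_frames, frame_len)
```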
In sub-step 112, if the energy value of the sound frame in the middle region of the plurality of sound frames corresponding to the sound segment is greater than the energy threshold, the second feature information is extracted from the sound segment.
In one embodiment, the middle-region frames consist of the center frame alone; in another, they include the center frame plus one or more frames on one or both sides of it. If the segment contains an odd number of frames, the center frame is the single frame at the center of the segment; if an even number, the center frame is the one or two frames nearest the center. When the middle region is a single frame, its energy value is that frame's energy value; when it spans several frames, the energy value may be computed by a suitable statistic such as the mean, median, or variance, without limitation here.
In some embodiments, the analysis window slides across the sound frames in sequence so that consecutive runs of frames are judged one after another; this avoids missing a slap and makes the judgment more accurate and robust. In one example, the window is slid several times, and all frames judged across those slides form one sound segment. For instance, if the window covers 7 consecutive frames, slides one frame at a time, and is slid three times from its initial position, then 10 consecutive frames are judged in total, and those 10 frames constitute one sound segment; the segment is thus obtained by sliding the window. In one embodiment the window slides one frame at a time; in other embodiments it may slide two, three, or more frames at a time as needed, without limitation here.
In one embodiment, the energy values are spectral values of the sound frames: a fast Fourier transform is applied to the frames to obtain their spectral values, and if the spectral values of the middle-region frames of a segment exceed the energy threshold, second feature information is extracted from the segment. The spectral value represents the energy of the sound, so a middle-region spectral value above the threshold indicates the segment may contain a slap. The spectral value is simple to obtain, and screening on it removes segments that obviously contain no slap, making the preliminary screening simple and effective.
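For instance, the per-frame spectral value might be computed as below. Reducing the FFT magnitude spectrum to a single scalar with a sum is an assumption; the text does not pin down the exact definition.

```python
import numpy as np

def spectral_values(frames):
    # Magnitude spectrum of each (windowed) frame via a real FFT, then
    # summed into one scalar "spectral value" per frame. Other
    # reductions (peak, band-limited sum) would also fit the text.
    spectra = np.abs(np.fft.rfft(frames, axis=1))
    return spectra.sum(axis=1)
```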
In one embodiment, a trigger signal is generated whenever the spectral value of the middle-region frame of the window exceeds the energy threshold. As the window slides across the frames, if the number of consecutively generated trigger signals reaches a trigger-count threshold, second feature information is extracted from the segment formed by those consecutive frames. A run of consecutive triggers indicates the segment may contain a slap. Because the window can move one frame at a time, a single clap can trigger repeatedly and generate several trigger signals, which prevents claps from being missed and makes the judgment more reliable.
In one embodiment, the trigger signal is generated when the spectral value of the middle-region frame of the window is greater than a first energy value and the spectral values of the frames in the two end regions of the window are less than a second energy value, the first energy value being greater than the second. As the window slides across the frames, if the number of consecutive trigger signals reaches the trigger-count threshold, second feature information is extracted from the segment. In some embodiments, the first energy value is a preset fixed value and/or a value tied to the spectral values of the end-region frames, for example varying with them and exceeding them. Likewise, the second energy value may be a preset fixed value and/or a value tied to the spectral value of the middle-region frame, for example varying with it and remaining below it. In other embodiments, the spectral values of the frames at one end of the window may be required to stay below the second energy value while the threshold for the other end varies with that end's spectral values; this is not limited here.
In one example, 7 consecutive frames of the sound to be recognized are judged at a time, the middle-region frame being the fourth. Let M(x) denote the spectral value of the x-th of the 7 frames, and let MI be the minimum of M(3), M(4), and M(5). In one embodiment, the trigger rule is preset as follows: if M(4) > 2×MI and M(4) > 5×M(2) and M(4) > 3×M(6) and M(4) > 20×M(1) and M(4) > 7×M(7) and M(4) > 0.05, one trigger is counted and a trigger signal is generated; M(4) is the spectral value of the middle-region frame. Whether or not a trigger occurs, the window then slides to the next frame and the judgment repeats. If four consecutive triggers occur, the sound segment spanning the judged frames, i.e. the ten frames from the first to the tenth counted from the window's initial position, is taken to contain a clapping sound, and second feature information is extracted from it. In this example the trigger-count threshold is 4, though other thresholds may be set in other examples. The energy threshold here is in fact a set of thresholds: 2×MI, 5×M(2), 3×M(6), 20×M(1), 7×M(7), and 0.05. It includes the fixed threshold 0.05 together with thresholds tied to the spectral values of the end frames, namely multiples of those values. Screening in this way is accurate and avoids missing slap sounds.
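This worked example translates directly into code. The sketch below uses 0-based indexing, so the text's M(1)..M(7) become M[0]..M[6]; resetting the run counter after a detection is our choice, as the text leaves that open.

```python
def clap_trigger(M):
    """Trigger test on a 7-frame window of spectral values, following
    the example's inequalities; M[3] is the middle (fourth) frame."""
    MI = min(M[2], M[3], M[4])  # minimum of the third to fifth frames
    return (M[3] > 2 * MI and M[3] > 5 * M[1] and M[3] > 3 * M[5]
            and M[3] > 20 * M[0] and M[3] > 7 * M[6] and M[3] > 0.05)

def candidate_clips(spec, trigger_count=4):
    """Slide the window one frame at a time over per-frame spectral
    values; four consecutive triggers mark a 10-frame candidate clip
    starting at the first triggering window position."""
    run = 0
    for i in range(len(spec) - 6):
        if clap_trigger(spec[i:i + 7]):
            run += 1
            if run == trigger_count:
                first = i - trigger_count + 1
                yield (first, first + 10)  # frame-index range of the clip
                run = 0  # reset after detection (not specified in the text)
        else:
            run = 0
```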
Referring again to FIG. 1, in step 102 it is recognized, according to the second feature information of the at least one sound segment, whether the sound signal to be recognized includes a clapping sound.
The segments that pass the preliminary screening, and thus may contain a clap, are further identified to determine whether the sound signal to be recognized includes a clapping sound. In the sound recognition method 100 of the embodiments of the present application, second feature information is extracted from a segment only if the energy value of its middle region exceeds the energy threshold, so the signal is first coarsely screened and then recognized from the second feature information. Even over a relatively long distance the clap recognition rate is therefore high and the probability of false triggering is low, making the method suitable as a reliable human-machine interaction mode.
In some embodiments, when the sound signal to be recognized includes a clapping sound, the category of the clapping sound is also recognized, and each category is mapped to a corresponding control instruction. The category of a slap sound includes at least one of the number of slaps, the duration of the slapping, and the frequency of the slapping. The number of slaps is the count of consecutive slaps in a passage of slap sound; the duration is the total length of that run of consecutive slaps; the frequency reflects how fast the slapping is. Recognizing the category as well as the presence of a clap allows the method to serve human-machine interaction better, with different categories of claps driving different interactions.
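The three category attributes could be derived from detected clap timestamps as sketched below. This is a hypothetical helper: the patent names the attributes but gives no formulas, so the rate definition here is an assumption.

```python
def clap_category(clap_times):
    """Derive count, duration, and rate from detected clap timestamps
    (ascending, in seconds). Hypothetical helper, not from the patent."""
    count = len(clap_times)
    duration = clap_times[-1] - clap_times[0] if count > 1 else 0.0
    # One possible rate definition: inter-clap events per second.
    rate = (count - 1) / duration if duration > 0 else 0.0
    return {"count": count, "duration_s": duration, "rate_per_s": rate}
```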
In some embodiments, the second feature information is input into a recognition model to identify whether the sound signal to be recognized includes a clapping sound; a recognition model can identify it accurately and quickly. In some embodiments, the second feature information comprises acoustic features including at least one of Mel-frequency cepstral coefficient (MFCC) features, linear prediction coefficient (LPC) features, filterbank features, and bottleneck features. One or more of these acoustic features may be used in the recognition model to identify the clapping sound.
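As one concrete option for the MFCC features, librosa can be used; the library choice and n_mfcc=13 are assumptions, not taken from the patent.

```python
import librosa

def clip_mfcc(clip, sr, n_mfcc=13):
    # Returns an (n_frames, n_mfcc) feature matrix for one candidate clip.
    return librosa.feature.mfcc(y=clip, sr=sr, n_mfcc=n_mfcc).T
```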
In some embodiments, the recognition model covers a plurality of sound categories. The likelihood of the second feature information under the feature information of each category is determined; the likelihoods are ranked, and the category with the highest likelihood is taken as the category of the sound to be recognized, thereby determining whether it includes a slap. This allows rapid identification. In one embodiment, the categories are a clapping-sound category and a non-clapping-sound category: the likelihood of the second feature information under each is computed, and the more likely category is assigned, determining whether the sound includes a clap with high accuracy and speed.
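The likelihood ranking amounts to scoring the features under each class model and taking the argmax. A minimal sketch, assuming each model exposes a log-likelihood score() method as hmmlearn's GMMHMM does (see the training sketch further below):

```python
def classify_clip(features, models):
    """models: dict mapping class name (e.g. "clap_x2", "non_clap") to a
    fitted model whose score(features) returns a log-likelihood."""
    scores = {name: m.score(features) for name, m in models.items()}
    best = max(scores, key=scores.get)  # highest-likelihood category
    return best, scores
```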
In some embodiments, the clapping-sound categories include at least two categories representing different numbers of consecutive claps, for example a category for two consecutive claps, one for three, and one for more. Feeding the second feature information into the recognition model then determines both whether the sound includes a clap and how many consecutive claps occurred, so claps with different repeat counts are recognized more precisely. In other embodiments, the clapping-sound categories may instead represent at least two different durations and/or frequencies of clapping.
In some embodiments, the recognition model is trained with clapping-sound training data and non-clapping-sound training data, yielding a clapping-sound class and a non-clapping-sound class. The non-clapping data may include sounds other than slaps, such as noise and speech. A large amount of both kinds of training data may be collected for training, and in some embodiments the model is trained multiple times to obtain better performance.
In some embodiments, the clapping-sound training data comprises first and second clapping-sound training data that differ in at least one of the number of claps, the duration of the clapping, and the frequency of the clapping. Training the recognition model with both yields different clap categories that can then be used to identify the category of a clap. In one embodiment, the two data sets represent different numbers of consecutive claps; in one example, the first represents two consecutive claps and the second three, though this is not limiting, and the model may be trained for whatever clap categories the application requires.
In some embodiments, the recognition model includes at least one of a deep model and a shallow model, either of which gives a high recognition rate. The deep model may include at least one of a deep neural network (DNN), a long short-term memory network (LSTM), and a convolutional neural network (CNN).
In one embodiment, the shallow model is a Gaussian mixture model-hidden Markov model (GMM-HMM), which recognizes the signal with both a high recognition rate and high speed. In some embodiments, the GMM-HMM is trained with clapping-sound and non-clapping-sound training data, the clapping data comprising first and second sets that differ in at least one of the number, duration, and frequency of claps. The GMM-HMM so trained includes a non-clapping-sound class and clapping-sound classes, the latter comprising first and second classes that differ in at least one of clap count, duration, and frequency, as described above. MFCC features are extracted from both the clapping and non-clapping training data and used to train the GMM-HMM.
In some embodiments, the parameters of the hidden Markov model (HMM) are estimated with the Baum-Welch algorithm and/or a genetic algorithm. The Baum-Welch algorithm, also known as the forward-backward algorithm, starts from an initial, likely inaccurate, guess of the HMM parameters and then updates them by evaluating their validity on the given training data (for example via cross-validation) and reducing the errors they cause, so the model's error on the training data decreases. A genetic algorithm is a computational model of biological evolution that mimics the natural selection and genetic mechanisms of Darwinian theory, searching for an optimal solution by simulating the process of natural evolution.
In some embodiments, the number of Gaussians in the GMM-HMM ranges from 3 to 12, which suits clap recognition and balances recognition accuracy against recognition speed. Where the first clapping-sound training data covers two consecutive claps, the corresponding state count of the GMM-HMM ranges from 6 to 14; where the second clapping-sound training data covers three consecutive claps, the corresponding state count ranges from 9 to 21; and the state count for the non-clapping training data ranges from 7 to 18. These ranges keep model performance as good as possible while keeping recognition as fast as possible. In one example, the state counts for the first clapping data, second clapping data, and non-clapping data are 10, 15, and 12 respectively, with 3 Gaussians. This is merely an example: other state counts and/or Gaussian numbers, such as 5 or 8 Gaussians, may be used.
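A training sketch using hmmlearn's GMMHMM with the example's state counts (10 / 15 / 12) and 3 Gaussians follows. hmmlearn is an assumed implementation choice, not named by the patent; its fit() performs the EM (Baum-Welch) re-estimation discussed below.

```python
import numpy as np
from hmmlearn.hmm import GMMHMM

# State counts follow the example in the text; class names are ours.
N_STATES = {"clap_x2": 10, "clap_x3": 15, "non_clap": 12}

def train_models(features_by_class, n_mix=3, n_iter=20):
    """features_by_class: class name -> list of (n_frames, n_mfcc) MFCC
    arrays, one per training clip."""
    models = {}
    for name, clips in features_by_class.items():
        X = np.vstack(clips)
        lengths = [len(c) for c in clips]  # clip boundaries for hmmlearn
        m = GMMHMM(n_components=N_STATES[name], n_mix=n_mix, n_iter=n_iter)
        m.fit(X, lengths)  # EM / Baum-Welch re-estimation internally
        models[name] = m
    return models
```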
In some embodiments, the Gaussian mixture model (GMM) within the GMM-HMM is trained multiple times, yielding a model with high recognition accuracy. The multiple training may use expectation maximization (EM) or the maximum likelihood (ML) method. EM is a method for solving maximum likelihood parameter estimates from incomplete data or data sets with missing data (hidden variables); ML, also called maximum likelihood estimation, is a classical point-estimation method that can be used to estimate the parameters of a model.
Therefore, after the GMM-HMM is trained multiple times, a GMM-HMM recognition model with better performance is obtained.
FIG. 3 is a flow chart illustrating one embodiment of an interaction method 200 of the present application. The interaction method 200 comprises steps 201 to 203.
In step 201, a sound signal to be recognized is acquired. The sound signal to be recognized may be obtained from a real-time sound signal stream. The detailed description may refer to the above description and will not be repeated herein.
In step 202, the sound recognition method 100 described above is performed to recognize the acquired sound signal.
In step 203, if the sound signal to be recognized is found, per the sound recognition method 100, to include a clapping sound, a corresponding control instruction is output according to the clapping sound.
The interaction method 200 interacts through clapping sounds. Because the sound recognition method has a high clap recognition rate, good robustness, and a low probability of false triggering, the interaction method is reliable. Moreover, since the instantaneous energy of a clap is stronger than that of speech and is not easily attenuated away in air, a clap transmitted over some distance, for example more than 2 meters, is recognized better than speech; human-machine interaction can thus be realized with claps over a longer range, with a higher recognition rate and stronger anti-interference performance.
In some embodiments, the control instruction controls the movable platform when the sound signal to be recognized includes a clap: for example, it may make the platform advance, retreat, turn, rotate, stop, or fire a projectile. The movable platform may be a mobile cart, an unmanned aerial vehicle, an automobile, a robot, or another movable device. Because the clap recognition rate is high, the platform is controlled more accurately, the probability of erroneous control is low, varied interaction over a longer range becomes possible, and user experience improves.
In some embodiments, the control instruction controls a vision system of the movable platform when the sound signal to be recognized includes a clap, for example switching the vision system's operating state. Controlling the vision system with claps improves control accuracy. In one embodiment, the control instructions include an instruction for the vision system to start visual tracking and/or an instruction to end visual tracking, so tracking can be started and/or ended precisely by clapping.
In other embodiments, the control instruction may operate other systems of the movable platform, for example driving its power plant to move the platform, or triggering its camera to take a picture. In still other embodiments, the control instruction may operate devices other than a movable platform.
In some embodiments, at least one of the number of claps, the duration of the clapping, and the frequency of the clapping is obtained, and different control instructions are output accordingly. Because differing counts, durations, or frequencies yield different instructions, different clap sounds realize different controls; for example, distinct clap patterns may respectively start and end visual tracking.
In one embodiment, different control instructions are generated for different numbers of consecutive claps. In one example, the user claps twice in succession; the interaction method 200 recognizes the two-clap sound, instructs the movable platform to start visual tracking, and the platform begins following the user. Later the user claps three times in succession; the method recognizes the three-clap sound and instructs the platform to stop moving. This is merely an example and is not limiting. In one embodiment, the mapping between clap categories and control instructions may be preset or user-configurable, which makes interactive control more flexible and improves user experience.
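A sketch of such a configurable mapping follows; the command names and the platform interface are hypothetical, since the patent leaves the concrete mapping to presets or user settings.

```python
# Hypothetical clap-count -> command table, e.g. user-configurable.
CLAP_COMMANDS = {2: "start_visual_tracking", 3: "stop_moving"}

def on_clap(category, platform):
    """category: output of a helper such as clap_category() above;
    platform: object with start_tracking()/stop() methods (assumed API)."""
    cmd = CLAP_COMMANDS.get(category["count"])
    if cmd == "start_visual_tracking":
        platform.start_tracking()
    elif cmd == "stop_moving":
        platform.stop()
    return cmd
```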
FIG. 4 is a schematic diagram of one embodiment of a sound recognition system 300 of the present application. The sound recognition system 300 includes one or more processors for implementing the sound recognition method; the processor 301 of the system 300 may implement the sound recognition method 100 described above. In some embodiments, the system 300 includes a computer-readable storage medium 304, which may include a non-volatile storage medium and may store a program invocable by the processor 301. In some embodiments, the system 300 includes a memory 303 and an interface 302, and it may include further hardware depending on the application.
The present application provides a computer readable storage medium 304 having stored thereon a program which, when executed by a processor 301, implements the voice recognition method 100.
FIG. 5 is a module block diagram of one embodiment of a movable platform 400 of the present application. The movable platform 400 includes a body 401, a power system 402, a microphone 403, and one or more processors 404. The movable platform 400 may be a mobile cart, an unmanned aerial vehicle, an automobile, a robot, or another movable device. The power system 402 is arranged on the body 401 to power the platform and, in some embodiments, includes a motor. In one embodiment, the platform 400 is an unmanned aerial vehicle and the power system 402 includes a propeller coupled to a motor; in another, the platform is a mobile cart and the power system includes wheels, such as universal wheels, coupled to a motor.
The microphone 403 receives the sound to be recognized and generates the corresponding sound signal to be recognized; it may be mounted on the body 401. Because a clap's instantaneous energy is stronger than that of speech and attenuates less in air, the microphone 403 receives claps well. There may be one or more microphones. In one embodiment, the microphone also carries a wind-shielding accessory, such as a windscreen or a shock mount, so as to receive the sound to be recognized better.
The one or more processors 404 are configured to implement the sound recognition method and, if the sound signal to be recognized is found to include a clapping sound, to output a corresponding control instruction according to the clap. The processor 404 may control the power system 402.
In one embodiment, the control instruction controls the movable platform 400 when the sound signal to be recognized includes a clap. In one embodiment, the movable platform 400 includes a vision system 405, which the processor 404 may control, and the control instruction controls the vision system when the signal includes a clap, for example instructing the vision system 405 to start and/or end visual tracking. See the description above for details.
In one embodiment, the processor 404 obtains at least one of the number of claps, the duration of the clapping, and the frequency of the clapping, and outputs different control instructions accordingly. See the description above for details.
This application may take the form of a computer program product embodied on one or more storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having program code embodied therein. Computer-readable storage media include permanent and non-permanent, removable and non-removable media and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer readable storage media include, but are not limited to: phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technologies, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic tape storage or other magnetic storage devices, or any other non-transmission medium, may be used to store information that may be accessed by a computing device.
It should be understood that portions of the present application may be implemented in hardware, software, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or hardware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, any one or a combination of the following techniques may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps of the above method embodiments may be carried out by hardware under the instruction of a program; the program may be stored in a computer-readable storage medium and, when executed, performs one of or a combination of the steps of the method embodiments.
It is noted that, in this document, relational terms such as "first" and "second" may be used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between those entities or actions. The terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but possibly also other elements not expressly listed or inherent to it. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises it.
The methods and apparatus provided by the embodiments of the present invention have been described in detail above, with specific examples used to explain the principles and implementations of the invention; the description of the embodiments is intended only to aid understanding of the method and its core idea. Meanwhile, those skilled in the art may, following the idea of the invention, vary the specific implementation and the application scope. In summary, the content of this specification should not be construed as limiting the invention.
The disclosure of this patent document contains material which is subject to copyright protection. The copyright is owned by the copyright owner, who has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights.
Claims (39)
1. A sound recognition method for recognizing a clapping sound, the sound recognition method comprising:
acquiring at least one sound segment of a sound signal to be identified and first feature information of the sound segment, wherein the first feature information is an energy value of the sound segment, and, if the energy value of a middle region of the sound segment is greater than an energy threshold, extracting second feature information from the sound segment; and
identifying, according to the second feature information of the at least one sound segment, whether the sound signal to be identified comprises a clapping sound.
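By way of illustration only (not part of the claims), a minimal Python sketch of the energy gate above, assuming mean-square amplitude as the energy value and taking the central half of the segment as the middle region; the claim fixes neither of these, nor the threshold value:

```python
import numpy as np

# Assumed threshold; the claim does not fix a concrete value.
ENERGY_THRESHOLD = 0.1

def middle_region_is_energetic(segment: np.ndarray) -> bool:
    """Gate of claim 1: only segments whose middle region carries enough
    energy proceed to second-feature extraction."""
    n = len(segment)
    middle = segment[n // 4 : 3 * n // 4]  # central half as "middle region"
    return float(np.mean(middle ** 2)) > ENERGY_THRESHOLD
```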
2. The sound recognition method according to claim 1, wherein extracting the second feature information from the sound segment if the energy value of the middle region of the sound segment is greater than the energy threshold comprises:
performing framing and windowing on the sound signal to be identified to obtain a plurality of sound frames corresponding to the sound signal to be identified;
if the energy value of a sound frame in the middle region of the plurality of sound frames corresponding to the sound segment is greater than the energy threshold, extracting the second feature information from the sound segment.
3. The sound recognition method according to claim 2, wherein the energy values comprise spectral values of the sound frames, and extracting the second feature information from the sound segment if the energy value of the sound frame in the middle region of the plurality of sound frames corresponding to the sound segment is greater than the energy threshold comprises:
performing a fast Fourier transform on the plurality of sound frames to obtain the spectral values of the plurality of sound frames;
if the spectral value of the sound frame in the middle region of the plurality of sound frames corresponding to the sound segment is greater than the energy threshold, extracting the second feature information from the sound segment.
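An illustrative sketch of the framing, windowing, and FFT steps of claims 2-3; the frame length, hop size, and Hamming window below are assumptions, as the claims do not specify them:

```python
import numpy as np

def frame_spectral_values(signal: np.ndarray, frame_len: int = 512,
                          hop: int = 256) -> list:
    """Frame and window the signal (claim 2), then take per-frame FFT
    magnitudes as the spectral values of claim 3."""
    window = np.hamming(frame_len)
    return [np.abs(np.fft.rfft(signal[i:i + frame_len] * window))
            for i in range(0, len(signal) - frame_len + 1, hop)]
```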
4. The sound recognition method according to claim 3, wherein a window slides sequentially over the plurality of sound frames, and extracting the second feature information from the sound segment if the spectral value of the sound frame in the middle region of the plurality of sound frames corresponding to the sound segment is greater than the energy threshold comprises:
if the spectral value of the sound frame in the middle region of the window is greater than the energy threshold, generating a trigger signal;
as the window slides sequentially over the plurality of sound frames, if the number of consecutively generated trigger signals reaches a trigger count threshold, extracting the second feature information from the sound segment.
5. The sound recognition method according to claim 3, wherein a window slides sequentially over the plurality of sound frames, and extracting the second feature information from the sound segment if the spectral value of the sound frame in the middle region of the plurality of sound frames corresponding to the sound segment is greater than the energy threshold comprises:
if the spectral value of the sound frame in the middle region of the window is greater than a first energy value and the spectral values of the sound frames in the two end regions of the window are smaller than a second energy value, generating a trigger signal, wherein the first energy value is greater than the second energy value;
as the window slides sequentially over the plurality of sound frames, if the number of consecutively generated trigger signals reaches a trigger count threshold, extracting the second feature information from the sound segment.
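An illustrative sketch of the sliding-window trigger logic, covering the two-threshold scheme of claim 5 (claim 4 is the single-threshold special case); the window length and trigger count threshold are assumed values:

```python
def clap_trigger(frame_energies, first_energy, second_energy,
                 win_len=5, trigger_count=3):
    """Fire a trigger when the window's middle frame is loud
    (> first_energy) while its end frames are quiet (< second_energy);
    report detection once enough consecutive triggers accumulate."""
    consecutive = 0
    for i in range(len(frame_energies) - win_len + 1):
        w = frame_energies[i:i + win_len]
        fired = (w[win_len // 2] > first_energy
                 and w[0] < second_energy and w[-1] < second_energy)
        consecutive = consecutive + 1 if fired else 0
        if consecutive >= trigger_count:
            return True  # extract the second feature information
    return False
```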
6. The sound recognition method according to claim 1, wherein identifying whether the sound signal to be identified comprises a clapping sound according to the second feature information of the at least one sound segment comprises:
inputting the second feature information into a recognition model for recognition, so as to recognize whether the sound signal to be identified comprises a clapping sound.
7. The sound recognition method according to claim 6, wherein the second feature information comprises acoustic features, the acoustic features including at least one of Mel-frequency cepstral coefficient (MFCC) features, linear prediction coefficient features, Filterbank features, and bottleneck features.
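For illustration, such acoustic features are commonly extracted with an audio library such as librosa; the file path, sample rate, and coefficient counts below are assumptions, not values from the claims:

```python
import librosa

# "segment.wav" is a placeholder path; 16 kHz and 13 coefficients are
# common defaults rather than claimed values.
y, sr = librosa.load("segment.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)             # MFCC features
fbank = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40)  # Filterbank-type features
```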
8. The sound recognition method according to claim 6, wherein the recognition model comprises a plurality of sound categories, and inputting the second feature information into the recognition model for recognition to recognize whether the sound signal to be identified comprises a clapping sound comprises:
determining the likelihood of the second feature information with respect to the feature information of each of the plurality of sound categories;
ranking the likelihoods, and determining the sound category with the highest likelihood as the category of the sound segment, so as to recognize whether the sound signal to be identified comprises the clapping sound.
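An illustrative sketch of the likelihood ranking of claim 8, assuming one fitted Gaussian mixture model per sound category (the claim itself does not mandate this model family for the scoring step):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def classify_segment(models: dict, features: np.ndarray) -> str:
    """models maps a category name (e.g. 'two_claps', 'non_clap') to a
    fitted GaussianMixture; return the category whose model assigns the
    highest average log-likelihood to the segment's feature frames."""
    scores = {cat: model.score(features) for cat, model in models.items()}
    return max(scores, key=scores.get)
```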
9. The sound recognition method according to claim 8, wherein the sound categories include a clapping sound category and a non-clapping sound category.
10. The sound recognition method according to claim 9, wherein the clapping sound category comprises at least two clapping sound categories representing different numbers of consecutive claps.
11. The sound recognition method according to claim 6, further comprising: training the recognition model using clapping sound training data and non-clapping sound training data.
12. The sound recognition method according to claim 11, wherein the clapping sound training data comprises first clapping sound training data and second clapping sound training data, the first clapping sound training data and the second clapping sound training data differing in at least one of the number of claps, the clapping duration, and the clapping frequency; and
training the recognition model using the clapping sound training data and the non-clapping sound training data comprises:
training the recognition model using the first clapping sound training data and the second clapping sound training data.
13. The sound recognition method according to claim 6, wherein the recognition model comprises at least one of a deep model and a shallow model.
14. The sound recognition method according to claim 13, wherein the deep model comprises at least one of: a deep neural network, a long short-term memory (LSTM) network, and a convolutional neural network.
15. The sound recognition method according to claim 13, wherein the shallow model comprises a Gaussian mixture model-hidden Markov model (GMM-HMM).
16. The sound recognition method according to claim 15, wherein the number of Gaussian components of the Gaussian mixture model-hidden Markov model ranges from 3 to 12.
17. The sound recognition method according to claim 15, wherein the Gaussian mixture model-hidden Markov model is trained using clapping sound training data and non-clapping sound training data, wherein the clapping sound training data comprises first clapping sound training data and second clapping sound training data, and the first clapping sound training data and the second clapping sound training data differ in at least one of the number of claps, the clapping duration, and the clapping frequency.
18. The sound recognition method according to claim 17, wherein the first clapping sound training data comprises training data of two consecutive claps, and the number of states of the Gaussian mixture model-hidden Markov model corresponding to the first clapping sound training data ranges from 6 to 14.
19. The sound recognition method according to claim 17, wherein the second clapping sound training data comprises training data of three consecutive claps, and the number of states of the Gaussian mixture model-hidden Markov model corresponding to the second clapping sound training data ranges from 9 to 21.
20. The sound recognition method according to claim 17, wherein the number of states of the Gaussian mixture model-hidden Markov model corresponding to the non-clapping sound training data ranges from 7 to 18.
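For illustration, one GMM-HMM per category as in claims 15-20 could be set up with the hmmlearn library; the specific state and Gaussian counts below are picked from the claimed ranges, and every other setting is an assumption:

```python
from hmmlearn import hmm

# State counts chosen from the claimed ranges (6-14, 9-21, 7-18 states)
# and n_mix from the claimed 3-12 Gaussians; n_iter is assumed.
two_clap_model   = hmm.GMMHMM(n_components=10, n_mix=8, n_iter=20)
three_clap_model = hmm.GMMHMM(n_components=15, n_mix=8, n_iter=20)
non_clap_model   = hmm.GMMHMM(n_components=12, n_mix=8, n_iter=20)

# Training (X: stacked per-frame features, lengths: frames per recording):
# two_clap_model.fit(X, lengths)
```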
21. The sound recognition method according to claim 17, wherein training the Gaussian mixture model-hidden Markov model using the clapping sound training data and the non-clapping sound training data comprises:
performing parameter estimation on the hidden Markov model.
22. The sound recognition method according to claim 21, wherein the method of performing parameter estimation on the hidden Markov model comprises: a Baum-Welch algorithm and/or a genetic algorithm.
23. The sound recognition method according to claim 17, wherein training the Gaussian mixture model-hidden Markov model using the clapping sound training data and the non-clapping sound training data comprises:
training the Gaussian mixture model in the Gaussian mixture model-hidden Markov model multiple times.
24. The sound recognition method according to claim 23, wherein the method of training the Gaussian mixture model in the Gaussian mixture model-hidden Markov model multiple times comprises: an expectation-maximization algorithm or a maximum likelihood method.
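An illustrative sketch of claims 21-24, assuming the hmmlearn and scikit-learn libraries: hmmlearn's fit() re-estimates HMM parameters with the Baum-Welch algorithm, and repeated expectation-maximization runs of a Gaussian mixture, keeping the maximum-likelihood run, stand in for the multiple-training step (the genetic-algorithm alternative of claim 22 is not shown). The data below is a random stand-in only:

```python
import numpy as np
from hmmlearn import hmm
from sklearn.mixture import GaussianMixture

# Stand-in data: 300 feature frames of dimension 13, three clips.
X = np.random.randn(300, 13)
lengths = [100, 100, 100]

# Claims 21-22: hmmlearn's fit() performs Baum-Welch (EM) re-estimation.
ghmm = hmm.GMMHMM(n_components=5, n_mix=2, n_iter=20, tol=1e-2)
ghmm.fit(X, lengths)

# Claims 23-24: several EM runs with different initializations; keep
# the run with the highest (maximum) likelihood on the training data.
best = max((GaussianMixture(n_components=4, random_state=s).fit(X)
            for s in range(5)), key=lambda g: g.score(X))
```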
25. The sound recognition method according to claim 1, wherein the frequency of the clapping sound ranges from 300 Hz to 8000 Hz.
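For illustration, the claimed 300 Hz to 8000 Hz band could be enforced with a band-pass filter before analysis; the Butterworth design, filter order, and sample rate are assumptions:

```python
from scipy.signal import butter, sosfilt

def restrict_to_clap_band(signal, fs=44100):
    """Band-pass the signal to the claimed 300 Hz to 8000 Hz clapping
    band; a 4th-order Butterworth filter is one reasonable choice."""
    sos = butter(4, [300, 8000], btype="bandpass", fs=fs, output="sos")
    return sosfilt(sos, signal)
```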
26. The sound recognition method according to claim 1, wherein the clapping sound comprises at least one of a hand-clapping sound and a tapping sound.
27. The sound recognition method according to claim 1, wherein identifying whether the sound signal to be identified comprises a clapping sound according to the second feature information of the at least one sound segment further comprises:
when the sound signal to be identified comprises the clapping sound, identifying the category of the clapping sound, wherein the category of the clapping sound corresponds to a respective control instruction, and the category of the clapping sound includes at least one of the number of claps, the clapping duration, and the clapping frequency.
28. An interaction method, comprising:
acquiring a sound signal to be identified;
performing the sound recognition method of any one of claims 1-27 on the sound signal to be identified; and
if the sound signal to be identified is recognized to comprise a clapping sound according to the sound recognition method, outputting a corresponding control instruction according to the clapping sound.
29. The interaction method according to claim 28, wherein the control instruction comprises a control instruction for controlling a movable platform when the sound signal to be identified comprises the clapping sound.
30. The interaction method according to claim 29, wherein the control instruction comprises a control instruction for controlling a vision system of the movable platform when the sound signal to be identified comprises the clapping sound.
31. The interaction method according to claim 30, wherein the control instruction comprises a control instruction for controlling the vision system to start visual tracking and/or a control instruction for controlling the vision system to end visual tracking.
32. The interaction method according to claim 28, wherein outputting the corresponding control instruction according to the clapping sound comprises:
acquiring at least one of the number of claps, the clapping duration, and the clapping frequency of the clapping sound;
outputting different control instructions according to the at least one of the number of claps, the clapping duration, and the clapping frequency.
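An illustrative mapping from recognized clap patterns to control instructions; the command names below are hypothetical, since the claims only require that different patterns yield different instructions:

```python
# Hypothetical pattern-to-command table keyed by the number of claps.
COMMANDS = {
    2: "START_VISUAL_TRACKING",  # two consecutive claps
    3: "STOP_VISUAL_TRACKING",   # three consecutive claps
}

def to_control_instruction(num_claps: int) -> str:
    """Return the command for a recognized clap count, or a no-op."""
    return COMMANDS.get(num_claps, "NO_OP")
```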
33. A sound recognition system, comprising one or more processors configured to implement the sound recognition method of any one of claims 1-27.
34. A computer-readable storage medium having stored thereon a program which, when executed by a processor, implements the sound recognition method of any one of claims 1-27.
35. A movable platform, comprising:
a body;
a power system arranged on the body and configured to provide power for the movable platform;
a microphone configured to receive a sound to be identified and generate a corresponding sound signal to be identified; and
one or more processors configured to implement the sound recognition method of any one of claims 1-27 and, if the sound signal to be identified is recognized to comprise a clapping sound according to the sound recognition method, output a corresponding control instruction according to the clapping sound.
36. The movable platform according to claim 35, wherein the control instruction comprises a control instruction for controlling the movable platform when the sound signal to be identified comprises the clapping sound.
37. The movable platform according to claim 36, wherein the movable platform comprises a vision system, and the control instruction comprises a control instruction for controlling the vision system when the sound signal to be identified comprises the clapping sound.
38. The movable platform according to claim 37, wherein the control instruction comprises a control instruction for controlling the vision system to start visual tracking and/or a control instruction for controlling the vision system to end visual tracking.
39. The movable platform according to claim 35, wherein the one or more processors are configured to:
acquire at least one of the number of claps, the clapping duration, and the clapping frequency of the clapping sound;
output different control instructions according to the at least one of the number of claps, the clapping duration, and the clapping frequency.
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2019/086979 WO2020227955A1 (en) | 2019-05-15 | 2019-05-15 | Sound recognition method, interaction method, sound recognition system, computer-readable storage medium and mobile platform |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111684522A (en) | 2020-09-18 |
Family
ID=72451467
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201980009292.6A Pending CN111684522A (en) | 2019-05-15 | 2019-05-15 | Voice recognition method, interaction method, voice recognition system, computer-readable storage medium, and removable platform |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN111684522A (en) |
WO (1) | WO2020227955A1 (en) |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR960008435B1 (en) * | 1992-05-26 | 1996-06-26 | 엘지전자 주식회사 | Remote controller searching device |
CN100555353C (en) * | 2006-08-28 | 2009-10-28 | 日本胜利株式会社 | The control device of electronic equipment and the control method of electronic equipment |
2019
- 2019-05-15: WO PCT/CN2019/086979 filed as WO2020227955A1 (active, Application Filing)
- 2019-05-15: CN CN201980009292.6A filed as CN111684522A (active, Pending)
Patent Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1928781A (en) * | 2005-09-09 | 2007-03-14 | 三星电子株式会社 | Method and apparatus to control operation of multimedia device |
WO2010112677A1 (en) * | 2009-04-03 | 2010-10-07 | Aalto-Korkeakoulusäätiö | Method for controlling an apparatus |
CN102281484A (en) * | 2010-04-07 | 2011-12-14 | 索尼公司 | Audio signal processing apparatus, audio signal processing method, and program |
CN101957736A (en) * | 2010-09-30 | 2011-01-26 | 汉王科技股份有限公司 | Electronic reading device and control method thereof |
CN102915728A (en) * | 2011-08-01 | 2013-02-06 | 佳能株式会社 | Sound segmentation device and method and speaker recognition system |
US20130058488A1 (en) * | 2011-09-02 | 2013-03-07 | Dolby Laboratories Licensing Corporation | Audio Classification Method and System |
CN104766610A (en) * | 2015-04-07 | 2015-07-08 | 马业成 | Voice recognition system and method based on vibration |
CN107110590A (en) * | 2015-08-25 | 2017-08-29 | Lg 电子株式会社 | Refrigerator |
CN107655143A (en) * | 2017-09-25 | 2018-02-02 | 合肥艾斯克光电科技有限责任公司 | A kind of indoor temperature control system based on voice recognition technology |
CN109712641A (en) * | 2018-12-24 | 2019-05-03 | 重庆第二师范学院 | A kind of processing method of audio classification and segmentation based on support vector machines |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114186581A (en) * | 2021-11-15 | 2022-03-15 | 国网天津市电力公司 | Cable hidden danger identification method and device based on MFCC (Mel frequency cepstrum coefficient) and diffusion Gaussian mixture model |
CN115798514A (en) * | 2023-02-06 | 2023-03-14 | 成都启英泰伦科技有限公司 | Knocking sound detection method |
CN118016106A (en) * | 2024-04-08 | 2024-05-10 | 山东第一医科大学附属省立医院(山东省立医院) | Elderly emotion health analysis and support system |
Also Published As
Publication number | Publication date |
---|---|
WO2020227955A1 (en) | 2020-11-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
KR102134201B1 (en) | Method, apparatus, and storage medium for constructing speech decoding network in numeric speech recognition | |
US10872602B2 (en) | Training of acoustic models for far-field vocalization processing systems | |
US7647224B2 (en) | Apparatus, method, and computer program product for speech recognition | |
US11069352B1 (en) | Media presence detection | |
KR101056511B1 (en) | Speech Segment Detection and Continuous Speech Recognition System in Noisy Environment Using Real-Time Call Command Recognition | |
CN111684522A (en) | Voice recognition method, interaction method, voice recognition system, computer-readable storage medium, and removable platform | |
CN102222499B (en) | Voice judging system, voice judging method and program for voice judgment | |
WO2020043162A1 (en) | System and method for performing multi-model automatic speech recognition in challenging acoustic environments | |
CN108538310A (en) | It is a kind of based on it is long when power spectrum signal variation sound end detecting method | |
CN105308679A (en) | Method and system for identifying location associated with voice command to control home appliance | |
US10755704B2 (en) | Information processing apparatus | |
EP1675102A2 (en) | Method for extracting feature vectors for speech recognition | |
CN106548775B (en) | Voice recognition method and system | |
US11393473B1 (en) | Device arbitration using audio characteristics | |
CN102436816A (en) | Method and device for decoding voice data | |
CN112599152B (en) | Voice data labeling method, system, electronic equipment and storage medium | |
CN112002349B (en) | Voice endpoint detection method and device | |
CN110767240B (en) | Equipment control method, equipment, storage medium and device for identifying child accent | |
Waqar et al. | Real-time voice-controlled game interaction using convolutional neural networks | |
CN111128174A (en) | Voice information processing method, device, equipment and medium | |
US11308939B1 (en) | Wakeword detection using multi-word model | |
Loh et al. | Speech recognition interactive system for vehicle | |
Tuasikal et al. | Voice activation using speaker recognition for controlling humanoid robot | |
JP2007233149A (en) | Voice recognition device and voice recognition program | |
US11670326B1 (en) | Noise detection and suppression |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WD01 | Invention patent application deemed withdrawn after publication | |
Application publication date: 20200918 |