WO2020227955A1 - Sound recognition method, interaction method, sound recognition system, computer-readable storage medium and mobile platform - Google Patents

Publication number
WO2020227955A1
Authority
WO
WIPO (PCT)
Prior art keywords
sound
slap
voice recognition
recognition method
training data
Application number
PCT/CN2019/086979
Other languages
French (fr)
Chinese (zh)
Inventor
吴俊峰
赵文泉
李皓宇
周事成
吴晟
Original Assignee
深圳市大疆创新科技有限公司
Application filed by 深圳市大疆创新科技有限公司
Priority to PCT/CN2019/086979 priority Critical patent/WO2020227955A1/en
Priority to CN201980009292.6A priority patent/CN111684522A/en
Publication of WO2020227955A1 publication Critical patent/WO2020227955A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use

Definitions

  • This application relates to the field of sound recognition, and in particular to a sound recognition method, an interaction method, a sound recognition system, a computer-readable storage medium, and a movable platform.
  • This application provides an improved sound recognition method, interaction method, sound recognition system, computer-readable storage medium, and movable platform.
  • A sound recognition method is provided for recognizing slap sounds.
  • The sound recognition method includes: acquiring at least one sound segment of the sound signal to be recognized and first feature information of the sound segment, the first feature information being the energy value of the sound segment; if the energy value of the middle region of the sound segment is greater than an energy threshold, extracting second feature information from the sound segment; and identifying, according to the second feature information of the at least one sound segment, whether the sound signal to be recognized includes a slap sound.
  • An interaction method is provided, including: acquiring a sound signal to be recognized; executing the sound recognition method; and, if the sound recognition method recognizes that the sound signal to be recognized includes a slap sound, outputting a corresponding control instruction according to the slap sound.
  • A sound recognition system is provided, including one or more processors for implementing the sound recognition method.
  • A computer-readable storage medium is provided, having a program stored thereon; when the program is executed by a processor, the sound recognition method is implemented.
  • A movable platform is provided, including: a body; a power system, provided in the body and used to power the movable platform; a microphone, used to receive the sound to be recognized and generate a corresponding sound signal to be recognized; and one or more processors for implementing the sound recognition method and, if the sound recognition method recognizes that the sound signal to be recognized includes a slap sound, outputting a corresponding control instruction according to the slap sound.
  • In the sound recognition method of the embodiments of this application, the second feature information is extracted from a sound segment only if the energy value of the middle region of the segment is greater than the energy threshold, which serves as a preliminary screening of the sound signal to be recognized. Whether the signal includes a slap sound is then identified according to the second feature information. As a result, the recognition rate for slap sounds is high over a long distance range, robustness is good, and the possibility of false triggering is low, so the method is suitable as a reliable means of human-computer interaction.
  • Fig. 1 shows a flowchart of an embodiment of the sound recognition method of this application.
  • Fig. 2 shows a sub-flowchart of an embodiment of the sound recognition method of this application.
  • Fig. 3 shows a flowchart of an embodiment of the interaction method of this application.
  • Fig. 4 is a schematic diagram of an embodiment of the sound recognition system of this application.
  • Fig. 5 is a block diagram of an embodiment of the movable platform of this application.
  • The sound recognition method of the embodiments of this application is used to recognize slap sounds.
  • The sound recognition method includes: acquiring at least one sound segment of the sound signal to be recognized and first feature information of the sound segment, the first feature information being the energy value of the sound segment; if the energy value of the middle region of the sound segment is greater than the energy threshold, extracting second feature information from the sound segment; and identifying, according to the second feature information of the at least one sound segment, whether the sound signal to be recognized includes a slap sound.
  • In this sound recognition method, the second feature information is extracted from a sound segment only if the energy value of the middle region of the segment is greater than the energy threshold, so the sound signal to be recognized is first screened, and whether it includes a slap sound is then identified from the second feature information. As a result, the recognition rate for slap sounds is high over a long distance range, robustness is good, and the possibility of false triggering is low, so the method is suitable as a reliable means of human-computer interaction.
  • An interaction method in an embodiment of this application includes: acquiring a sound signal to be recognized; executing the foregoing sound recognition method, which includes acquiring first feature information of at least one sound segment, the first feature information being the energy value of the sound segment, extracting second feature information from the sound segment if the energy value of the middle region of the sound segment is greater than the energy threshold, and identifying, according to the second feature information of the at least one sound segment, whether the sound signal to be recognized includes a slap sound; and, if the sound recognition method recognizes that the sound signal to be recognized includes a slap sound, outputting a corresponding control instruction according to the slap sound.
  • The sound recognition method has a high recognition rate for slap sounds, good robustness, and a low possibility of false triggering, which makes the interaction method reliable. Moreover, the instantaneous energy of a slap sound is stronger than that of speech and attenuates less readily in air. At a distance of, for example, more than 2 meters, slap sounds are therefore recognized better than speech, so slap sounds can be used for human-computer interaction over a longer distance range, with a higher recognition rate and stronger resistance to interference.
  • The sound recognition system of the embodiments of this application includes one or more processors for implementing the above sound recognition method.
  • The computer-readable storage medium of the embodiments of this application has a program stored thereon, and when the program is executed by a processor, the above sound recognition method is implemented.
  • The movable platform of the embodiments of this application includes a body, a power system, a microphone, and one or more processors.
  • The power system is arranged in the body to provide power for the movable platform.
  • The microphone is used to receive the sound to be recognized and generate a corresponding sound signal to be recognized.
  • The one or more processors are configured to implement the above sound recognition method and, if the sound recognition method recognizes that the sound signal to be recognized includes a slap sound, to output a corresponding control instruction according to the slap sound.
  • Fig. 1 shows a flowchart of an embodiment of a sound recognition method 100.
  • The sound recognition method 100 is used to recognize slap sounds.
  • A slap sound has a frequency range of 300 Hz to 8000 Hz and sounds crisp. Its instantaneous energy is stronger than that of speech and it attenuates less readily in air, so it is easy to recognize, with a high recognition rate and strong resistance to interference.
  • The slap sound includes at least one of a hand-clapping sound and a tapping sound.
  • The tapping sound may include the sound of striking an object, such as a wall or a table.
  • The waveform of a tapping sound is similar to that of a hand clap. Slap sounds such as clapping and/or tapping have a high recognition rate and strong resistance to interference, and can be recognized at a longer distance.
  • The sound recognition method 100 includes steps 101 and 102.
  • In step 101, at least one sound segment of the sound signal to be recognized and first feature information of the sound segment are acquired, the first feature information being the energy value of the sound segment. If the energy value of the middle region of the sound segment is greater than the energy threshold, second feature information is extracted from the sound segment.
  • The sound signal to be recognized may be one or more sound signals in a real-time sound signal stream.
  • The sound recognition method 100 may include acquiring the sound signal to be recognized.
  • The sound signal to be recognized can be extracted from the real-time sound signal stream.
  • The sound signal between two adjacent silent periods that each exceed the silent time threshold may be extracted as the sound signal to be recognized.
  • The sound signal in a silent period indicates no sound or only low-level sound. It can be called a "silent signal", and its energy value is lower than the minimum energy value of a slap sound.
  • The energy value of the sound signal in the real-time sound signal stream can be compared with a set silence energy threshold. If the energy value is below the silence energy threshold, the sound signal is determined to be a silent signal, and the duration of the silent signal, namely the silent period, can be determined.
  • The silence energy threshold does not exceed the minimum energy value of a slap sound.
  • The silent time threshold can be preset. In one embodiment, the silent time threshold exceeds the interval between two consecutive slaps. In one example, the silent time threshold is any value greater than or equal to 2 seconds; when the interval between two adjacent slaps is less than 2 seconds, they are regarded as continuous slaps.
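The silence-based extraction described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `split_on_silence`, the use of per-frame energy values as input, and the decision to emit a trailing segment when the stream ends are all assumptions made here.

```python
import math

def split_on_silence(frame_energies, frame_ms, silence_energy_threshold,
                     silent_time_threshold_ms=2000):
    """Return (start, end) frame ranges (end exclusive) of candidate signals.

    A candidate is closed once a silent run of at least
    silent_time_threshold_ms has been observed after it.
    """
    min_silent_frames = math.ceil(silent_time_threshold_ms / frame_ms)
    segments = []
    start = None       # first frame of the current candidate signal
    last_loud = None   # most recent non-silent frame
    silent_run = 0     # length of the current run of silent frames
    for i, energy in enumerate(frame_energies):
        if energy >= silence_energy_threshold:
            if start is None:
                start = i
            last_loud = i
            silent_run = 0
        else:
            silent_run += 1
            if start is not None and silent_run >= min_silent_frames:
                segments.append((start, last_loud + 1))
                start = None
    # Assumption: flush a candidate that is still open when the data ends.
    if start is not None:
        segments.append((start, last_loud + 1))
    return segments
```

Short pauses between slaps (shorter than the silent time threshold) stay inside one candidate signal, which matches the treatment of slaps less than 2 seconds apart as continuous.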
  • The sound signals to be recognized may be all of the sound signals in the real-time sound signal stream.
  • The energy value may be used to perform a preliminary screening of the sound signal to be recognized. If the energy value of the middle region of a sound segment is greater than the energy threshold, the sound corresponding to the middle region is loud, which indicates that the sound segment may include a slap sound. This provides an effective preliminary screening for slap sounds.
  • The middle region of such a sound segment has a sharp peak, while the two end regions have small, flat values.
  • A waveform that is high in the middle and low at both ends indicates that the sound segment may include a slap sound.
  • The middle region of the sound segment may be the exact center of the sound segment, or a region extending from the exact center toward one end or toward both ends.
  • The energy threshold is either a preset fixed value or a value that changes in real time. In one embodiment, the energy threshold is determined according to the energy value of one or both end regions outside the middle region of the sound segment, so it may differ between sound segments. In another embodiment, the energy threshold is a preset value; a fixed energy threshold can be set according to the characteristics of slap sounds and experience.
  • Step 101 includes sub-steps 111 and 112.
  • In sub-step 111, the sound signal to be recognized is framed and windowed to obtain multiple sound frames corresponding to the sound signal.
  • A slap sound lasts approximately 80-160 milliseconds.
  • The sound signal to be recognized is divided into frames of 11-23 milliseconds each, and 4-15 consecutive sound frames are judged each time.
  • In one example, the sound signal to be recognized is divided into frames of 16 milliseconds, and 7 consecutive sound frames are judged each time. In other examples, frames of other durations may be used and/or a different number of sound frames may be judged each time, which is not limited here.
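Framing and windowing as in sub-step 111 might look like the sketch below. The source gives only the frame duration; the 16 kHz sample rate, the non-overlapping frames, the Hamming window, and the name `frame_and_window` are assumptions chosen for illustration.

```python
import math

def frame_and_window(samples, sample_rate=16000, frame_ms=16):
    """Split samples into non-overlapping frames and apply a Hamming window."""
    frame_len = sample_rate * frame_ms // 1000   # 256 samples at 16 kHz / 16 ms
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    # Trailing samples that do not fill a whole frame are dropped.
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        chunk = samples[start:start + frame_len]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames
```

With 16 ms frames, a slap lasting 80-160 ms spans roughly 5 to 10 frames, which is consistent with judging 7 consecutive frames at a time.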
  • In sub-step 112, if the energy value of the sound frame in the middle region of the multiple sound frames corresponding to the sound segment is greater than the energy threshold, the second feature information is extracted from the sound segment.
  • In one embodiment, the sound frame in the middle region includes the center sound frame. In another embodiment, the sound frame in the middle region includes the center sound frame and one or more sound frames on one or both sides of it. In one embodiment, the sound segment includes an odd number of sound frames, and the center sound frame is the sound frame at the center of the sound segment. In another embodiment, the sound segment includes an even number of sound frames, and the center sound frame is the one or two sound frames closest to the center of the sound segment.
  • When the middle region contains a single sound frame, the energy value of the middle region is the energy value of that frame. When the middle region contains multiple sound frames, its energy value can be calculated from the energy values of those frames by a suitable algorithm, such as the average, the median, or the variance, which is not limited herein.
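A small sketch of how the middle region and its energy value could be computed for odd and even frame counts. The helper names `middle_frames` and `middle_energy` and the `half_width` parameter are hypothetical; the choice of reducer (mean, median, and so on) is left open, as in the text.

```python
from statistics import mean, median

def middle_frames(num_frames, half_width=0):
    """Indices of the middle-region frames (0-based).

    Odd counts have one center frame, even counts have two;
    half_width extends the region symmetrically on both sides.
    """
    if num_frames % 2 == 1:
        centers = [num_frames // 2]
    else:
        centers = [num_frames // 2 - 1, num_frames // 2]
    low = max(0, centers[0] - half_width)
    high = min(num_frames - 1, centers[-1] + half_width)
    return list(range(low, high + 1))

def middle_energy(frame_energies, half_width=0, reduce=mean):
    """Energy value of the middle region, reduced with the given function."""
    indices = middle_frames(len(frame_energies), half_width)
    return reduce([frame_energies[i] for i in indices])
```

For example, `middle_energy(values, half_width=1, reduce=median)` takes the median over the center frame and its two neighbours.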
  • The window can be slid sequentially across the sound frames so that successive groups of consecutive sound frames are judged. This avoids missing slap sounds and makes the judgment more accurate and more robust.
  • When the window is slid multiple times, the consecutive sound frames that have been judged together form the multiple sound frames of one sound segment. For example, if the window slides three times, one frame each time, and 7 consecutive frames are judged each time, then after sliding three times from the initial window position a total of 10 consecutive sound frames have been judged, and these 10 frames are regarded as the sound frames of one sound segment. Sound segments are thus obtained by sliding the window.
  • The window slides one frame at a time.
  • The window may also slide two, three, or more frames at a time, which is not limited herein.
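The frame count in the sliding example follows from a simple relation. The symbols are introduced here for illustration and do not appear in the source: W is the window length in frames, s the slide step in frames, k the number of slides, and N the number of frames covered.

```latex
N = W + k \cdot s, \qquad W = 7,\; s = 1,\; k = 3 \;\Rightarrow\; N = 7 + 3 \cdot 1 = 10
```

With a step of two frames and the same three slides, the covered span would be 13 frames.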
  • The energy value includes the spectrum value of the sound frame.
  • A fast Fourier transform may be performed on the multiple sound frames to obtain their spectrum values. If the spectrum value of the sound frame in the middle region of the multiple sound frames of the sound segment is greater than the energy threshold, the second feature information is extracted from the sound segment.
  • The spectrum value reflects the energy of the sound. When the spectrum value of the sound frame in the middle region is greater than the energy threshold, the sound segment may include a slap sound, so the second feature information is extracted from the sound segment.
  • The spectrum value is simple to obtain. It can be used to perform a preliminary screening of the sound to be recognized and to remove sound segments that obviously do not include a slap sound, which is simple and effective.
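The sketch below shows one way a per-frame spectrum value could be computed. The source does not say how the transform output is reduced to a single number, so taking the mean magnitude over the non-negative frequency bins is an assumption, as are the function names. A plain discrete Fourier transform is used for readability; a fast Fourier transform yields the same magnitudes more efficiently.

```python
import cmath
import math

def dft_magnitudes(frame):
    """Magnitudes of the DFT bins 0 .. N/2 of a real-valued frame."""
    n = len(frame)
    magnitudes = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        magnitudes.append(abs(acc))
    return magnitudes

def spectrum_value(frame):
    """Assumed reduction: mean magnitude across the frequency bins."""
    magnitudes = dft_magnitudes(frame)
    return sum(magnitudes) / len(magnitudes)
```

A louder frame yields a proportionally larger spectrum value, which is what the energy-threshold comparison relies on.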
  • If the spectrum value of the sound frame in the middle region within the window is greater than the energy threshold, a trigger signal is generated; when the window slides through several sound frames in sequence, if the number of continuously generated trigger signals reaches the trigger number threshold, the second feature information is extracted from the sound segment formed by the consecutive frames in which those sound frames are located.
  • When the window slides through several sound frames in sequence, multiple trigger signals generated in succession indicate that the sound segment may contain a slap sound.
  • The window can move one frame at a time, so a single slap sound may trigger repeatedly and generate multiple trigger signals; this avoids missing the slap sound and improves the accuracy of the judgment.
  • In one embodiment, if the spectrum value of the sound frame in the middle region is greater than a first energy value and the spectrum values of the sound frames in the two end regions are less than a second energy value, the trigger signal is generated, where the first energy value is greater than the second energy value; when the window slides through several sound frames in sequence, if the number of continuously generated trigger signals reaches the trigger number threshold, the second feature information is extracted from the sound segment.
  • The first energy value includes a preset fixed value and/or a value related to the spectrum values of the sound frames in the two end regions. For example, the first energy value varies with the spectrum values of the sound frames in the two end regions, such as being greater than the spectrum values of the sound frames at both ends.
  • The second energy value includes a preset fixed value and/or a value related to the spectrum value of the sound frame in the middle region. For example, the second energy value may vary with the spectrum value of the sound frame in the middle region, such as being smaller than the spectrum value of the sound frame in the middle region.
  • Alternatively, the spectrum value of the sound frame at one end of the window can be required to be smaller than the second energy value, while the spectrum value of the sound frame at the other end varies with the spectrum value of the sound frame at the first end. Such variations are not limited here.
  • In one example, 7 consecutive sound frames are used for each judgment, and the sound frame in the middle region of the 7 frames is the fourth frame.
  • Let the spectrum value of the x-th of these 7 frames be M(x), and let MI be the minimum value among the third to fifth frames.
  • The following rule is preset: if M(4) > 2*MI and M(4) > 5*M(2) and M(4) > 3*M(6) and M(4) > 20*M(1) and M(4) > 7*M(7) and M(4) > 0.05, one trigger is registered and a trigger signal is generated, where M(4) is the spectrum value of the sound frame in the middle region.
  • The window then slides to the next frame and the above judgment is performed again. If the rule triggers 4 times in a row, the sound segment containing the consecutive sound frames that have been judged is considered to contain a slap sound; that is, the ten sound frames from the first frame to the tenth frame, counted from the initial position of the window, contain the slap sound, and the second feature information is then extracted from the sound segment.
  • In this example the trigger number threshold is 4, but it is not limited to this; other trigger number thresholds can be set in other examples.
  • In this example there are multiple energy thresholds: 2*MI, 5*M(2), 3*M(6), 20*M(1), 7*M(7), and 0.05.
  • The energy thresholds include a fixed threshold of 0.05 and thresholds related to the spectrum values of the sound frames at both ends, which may be multiples of those spectrum values. In this way, the preliminary screening can be carried out more accurately and missing a slap sound can be avoided.
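The 7-frame rule and the consecutive-trigger count from this example can be sketched as two small functions. The names `window_triggers` and `candidate_segments` are hypothetical, and passing per-position trigger flags into the counter is a design choice made here so that any trigger rule can be plugged in.

```python
def window_triggers(m):
    """Apply the example rule to 7 spectrum values; m[0] is M(1), m[3] is M(4)."""
    m1, m2, m3, m4, m5, m6, m7 = m
    mi = min(m3, m4, m5)   # MI: minimum of the third to fifth frames
    return (m4 > 2 * mi and m4 > 5 * m2 and m4 > 3 * m6
            and m4 > 20 * m1 and m4 > 7 * m7 and m4 > 0.05)

def candidate_segments(trigger_flags, window_len=7, trigger_count_threshold=4):
    """Return (start, end) frame ranges (end exclusive) of candidate segments.

    trigger_flags[p] says whether the window starting at frame p triggered.
    A segment is emitted when the run of consecutive triggers first reaches
    the threshold; it spans every frame judged during that run.
    """
    segments = []
    run = 0
    for pos, fired in enumerate(trigger_flags):
        run = run + 1 if fired else 0
        if run == trigger_count_threshold:
            first = pos - trigger_count_threshold + 1
            segments.append((first, pos + window_len))
    return segments
```

Four consecutive triggers with a 7-frame window yield a 10-frame segment, as in the text. One caveat, derived here rather than stated in the source: with positive spectrum values, non-overlapping frame indices, and a one-frame slide, the literal rule cannot fire at three consecutive positions, because M(4) > 3*M(6) at one position contradicts M(4) > 5*M(2) two positions later. The source does not explain how the multipliers and the threshold of 4 are reconciled, so the counter is kept independent of the rule.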
  • In step 102, whether the sound signal to be recognized includes a slap sound is identified according to the second feature information of the at least one sound segment.
  • The at least one sound segment that may contain a slap sound after the preliminary screening is examined further to determine whether the sound signal to be recognized includes a slap sound.
  • In the sound recognition method 100 of the embodiments of this application, the second feature information is extracted from a sound segment only if the energy value of the middle region of the segment is greater than the energy threshold, which serves as a preliminary screening of the sound signal to be recognized; whether the sound to be recognized includes a slap sound is then identified according to the second feature information. Slap sounds can therefore be recognized at a high rate even over a long distance range, with a low possibility of false triggering, so the method is suitable as a reliable means of human-computer interaction.
  • The category of the slap sound is recognized. Further, each category of slap sound corresponds to a control instruction.
  • The category of the slap sound includes at least one of the number of slaps, the duration of the slaps, and the frequency of the slaps.
  • The number of slaps can be the number of consecutive slaps in a slap sound.
  • The duration of the slaps may be the total duration of continuous slaps in a segment of slap sound.
  • The frequency of the slaps reflects how fast the slaps follow one another.
  • Further recognizing the category of the slap sound makes it more useful in human-computer interaction, since different categories of slap sounds can trigger different interactions.
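To make the three attributes concrete, the sketch below derives them from the onset times of the slaps in one continuous sequence. The source does not define how duration or frequency is measured; measuring duration from the first to the last onset and frequency as intervals per second are assumptions, and `describe_slaps` is a hypothetical name.

```python
def describe_slaps(onset_times_s):
    """Summarize one continuous slap sequence from its onset times in seconds."""
    count = len(onset_times_s)
    # Assumed definition: time from the first onset to the last onset.
    duration = onset_times_s[-1] - onset_times_s[0] if count > 1 else 0.0
    # Assumed definition: slap intervals per second.
    frequency = (count - 1) / duration if duration > 0 else 0.0
    return {"count": count, "duration_s": duration, "frequency_hz": frequency}
```

Three slaps at 0.0 s, 0.5 s, and 1.0 s give a count of 3, a duration of 1.0 s, and a frequency of 2 slaps per second.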
  • The second feature information is input into a recognition model for recognition, so as to further identify whether the sound signal to be recognized includes a slap sound.
  • Recognition with the recognition model is accurate and fast.
  • The second feature information includes acoustic features, and the acoustic features include at least one of Mel Frequency Cepstral Coefficient (MFCC) features, Linear Prediction Coefficient (LPC) features, filter bank (Filterbank) features, and bottleneck features.
  • One or more of the above-mentioned acoustic features can be used to recognize the slap sound in the recognition model.
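As an illustration of two of the listed features, the sketch below computes log mel filter bank energies and MFCCs from the power spectrum of one frame. It follows the common textbook recipe rather than anything specified in the source: the mel formula, the triangular filters, the 300 Hz to 8000 Hz band (borrowed from the stated slap frequency range), and all function names and default values are assumptions.

```python
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10 ** (m / 2595.0) - 1.0)

def mel_filterbank(num_filters, num_bins, sample_rate,
                   f_low=300.0, f_high=8000.0):
    """Triangular filters spaced evenly on the mel scale.

    num_bins is N/2 + 1 for an N-point transform; bin k sits at
    k * (sample_rate / 2) / (num_bins - 1) Hz.
    """
    mel_low, mel_high = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (mel_high - mel_low) / (num_filters + 1)
    edges = [mel_to_hz(mel_low + i * step) for i in range(num_filters + 2)]
    bin_hz = (sample_rate / 2) / (num_bins - 1)
    filters = []
    for i in range(1, num_filters + 1):
        left, center, right = edges[i - 1], edges[i], edges[i + 1]
        row = []
        for k in range(num_bins):
            f = k * bin_hz
            if left < f <= center:
                row.append((f - left) / (center - left))
            elif center < f < right:
                row.append((right - f) / (right - center))
            else:
                row.append(0.0)
        filters.append(row)
    return filters

def mfcc(power_spectrum, filters, num_coeffs=13, eps=1e-10):
    """Log filter bank energies followed by a DCT-II."""
    log_energies = [math.log(sum(w * p for w, p in zip(row, power_spectrum)) + eps)
                    for row in filters]
    n = len(log_energies)
    return [sum(log_energies[j] * math.cos(math.pi * c * (j + 0.5) / n)
                for j in range(n))
            for c in range(num_coeffs)]
```

The intermediate `log_energies` correspond to filter bank features; the cosine transform decorrelates them into MFCCs.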
  • The recognition model includes multiple sound categories. The likelihood of the second feature information with respect to the feature information of each sound category is determined; the likelihoods are sorted, and the sound category with the highest likelihood is determined as the category of the sound to be recognized, thereby identifying whether the sound to be recognized includes a slap sound. This allows fast identification.
  • The sound categories include a slap sound category and a non-slap sound category. The likelihood of the second feature information with respect to the feature information of the slap sound category and with respect to the feature information of the non-slap sound category can be determined, and the sound category with the highest likelihood is determined as the category of the sound to be recognized. In this way, it can be determined whether the sound to be recognized includes a slap sound, with high recognition accuracy and speed.
  • The slap sound category includes at least two slap sound categories representing different numbers of consecutive slaps, for example a category representing two consecutive slaps, a category representing three consecutive slaps, and a category representing more consecutive slaps.
  • When the second feature information is input into the recognition model, it can be determined both whether the sound to be recognized includes a slap sound and how many consecutive slaps it contains. In this way, slap sounds with different numbers of consecutive slaps can be recognized, and recognition is more precise.
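The likelihood ranking can be illustrated with a deliberately simplified stand-in. Each category is scored here by a single diagonal Gaussian instead of the GMM-HMM described elsewhere in the text; the category names, parameter values, and function names are all hypothetical.

```python
import math

def diag_gaussian_loglik(features, means, variances):
    """Log-likelihood of a feature vector under a diagonal Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
               for x, m, v in zip(features, means, variances))

def classify(features, category_params):
    """Score every category, sort by likelihood, return (best, ranking)."""
    scores = {name: diag_gaussian_loglik(features, means, variances)
              for name, (means, variances) in category_params.items()}
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking[0], ranking

# Hypothetical toy parameters: (means, variances) per category.
CATEGORIES = {
    "non_slap": ([0.0, 0.0], [1.0, 1.0]),
    "two_slaps": ([2.0, 2.0], [1.0, 1.0]),
    "three_slaps": ([4.0, 4.0], [1.0, 1.0]),
}
```

A result of `"non_slap"` means no slap sound was recognized; any other winning category also conveys the number of consecutive slaps.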
  • the slap sound category may include at least two slap sound categories representing different durations and/or frequencies of the slap.
  • the slap sound training data and the non-slap sound training data are used to train the recognition model.
  • the non-slap sound training data may include data of sounds other than the slap, such as noise and speech sounds.
  • a large amount of slap sound training data and non-slap sound training data can be collected to train the recognition model.
  • the recognition model may be trained multiple times to obtain a recognition model with better performance.
  • The slap sound training data includes first slap sound training data and second slap sound training data, which differ in at least one of the number of slaps, the duration of the slaps, and the frequency of the slaps.
  • The recognition model is trained with the first and second slap sound training data. In this way, different slap sound categories are obtained, which can be used to identify the category of a slap sound.
  • The first slap sound training data and the second slap sound training data represent different numbers of consecutive slaps. For example, the first slap sound training data represents two consecutive slaps and the second slap sound training data represents three consecutive slaps, but this example is not limiting.
  • The recognition model can be trained according to the actual application to obtain different slap sound categories.
  • The recognition model includes at least one of a deep model and a shallow model; such recognition models achieve a high recognition rate.
  • The deep model includes at least one of the following: Deep Neural Networks (DNN), Long Short-Term Memory networks (LSTM), and Convolutional Neural Networks (CNN).
  • The shallow model includes a Gaussian Mixture Model-Hidden Markov Model (GMM-HMM).
  • The signal to be recognized is recognized by the GMM-HMM with a high recognition rate and fast recognition speed.
  • The slap sound training data and the non-slap sound training data are used to train the GMM-HMM, where the slap sound training data includes the first slap sound training data and the second slap sound training data.
  • The first slap sound training data and the second slap sound training data differ in at least one of the number of slaps, the duration of the slaps, and the frequency of the slaps.
  • The GMM-HMM trained in this way includes the non-slap sound category and the slap sound category, where the slap sound category includes a first slap sound category and a second slap sound category that differ in at least one of the number of slaps, the duration of the slaps, and the frequency of the slaps. See above for details.
  • MFCC features are extracted from the slap sound training data and the non-slap sound training data and used to train the GMM-HMM.
  • Parameter estimation is performed on the hidden Markov model (HMM).
  • The parameters of the HMM are estimated by the Baum-Welch algorithm and/or a genetic algorithm.
  • The Baum-Welch algorithm is also known as the forward-backward algorithm.
  • The Baum-Welch algorithm first makes an initial estimate of the parameters of the HMM, which is likely to be a poor guess. It then evaluates how well these parameters explain the given training data and updates them to reduce the mismatch, so that the error with respect to the given training data becomes smaller with each update.
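One Baum-Welch re-estimation step can be sketched as below. To keep the example short it uses a discrete-observation HMM and unscaled probabilities (adequate only for short sequences), whereas the model in the text uses Gaussian mixture emissions; the function names are chosen here for illustration.

```python
def forward(obs, pi, A, B):
    """alpha[t][i] = P(o_0..o_t, state_t = i)."""
    n = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(n)) * B[j][o]
                      for j in range(n)])
    return alpha

def backward(obs, A, B):
    """beta[t][i] = P(o_{t+1}..o_{T-1} | state_t = i)."""
    n = len(A)
    beta = [[1.0] * n]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[j][o] * nxt[j] for j in range(n))
                        for i in range(n)])
    return beta

def baum_welch_step(obs, pi, A, B):
    """Return re-estimated (pi, A, B) and the likelihood under the old parameters."""
    n, m, T = len(pi), len(B[0]), len(obs)
    alpha, beta = forward(obs, pi, A, B), backward(obs, A, B)
    likelihood = sum(alpha[-1])
    # gamma[t][i]: probability of being in state i at time t.
    gamma = [[alpha[t][i] * beta[t][i] / likelihood for i in range(n)]
             for t in range(T)]
    # xi[t][i][j]: probability of the transition i -> j at time t.
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / likelihood
            for j in range(n)] for i in range(n)] for t in range(T - 1)]
    new_pi = gamma[0]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(n)] for i in range(n)]
    new_B = [[sum(gamma[t][i] for t in range(T) if obs[t] == k) /
              sum(gamma[t][i] for t in range(T))
              for k in range(m)] for i in range(n)]
    return new_pi, new_A, new_B, likelihood
```

Calling `baum_welch_step` repeatedly yields a likelihood sequence that does not decrease, which is the sense in which each update reduces the mismatch with the training data.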
  • A genetic algorithm is a computational model that simulates natural selection and the genetic mechanisms of Darwin's theory of biological evolution; it searches for an optimal solution by simulating the process of natural evolution.
  • The number of Gaussians in the GMM-HMM ranges from 3 to 12, which is suitable for recognizing slap sounds and balances recognition performance against recognition speed, so that recognition accuracy is as high as possible and recognition speed is as fast as possible.
  • The first slap sound training data includes slap sound training data of two consecutive slaps. The number of states of the GMM-HMM corresponding to the first slap sound training data ranges from 6 to 14, so that the performance of the recognition model is as good as possible and the recognition speed is as fast as possible.
  • The second slap sound training data includes slap sound training data of three consecutive slaps. The number of states of the GMM-HMM corresponding to the second slap sound training data ranges from 9 to 21, so that the performance of the recognition model is as good as possible and the recognition speed is as fast as possible.
  • The number of states of the GMM-HMM corresponding to the non-slap sound training data ranges from 7 to 18, so that the performance of the recognition model is as good as possible and the recognition speed is as fast as possible.
  • In one example, the number of states for the first slap sound training data is 10, the number of states for the second slap sound training data is 15, the number of states for the non-slap sound training data is 12, and the number of Gaussians is 3.
  • The above is only an example and is not limiting. In other examples, the number of states and/or the number of Gaussians may take other values; for example, the number of Gaussians may be 5 or 8.
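The stated ranges and the example values can be captured in a small configuration check. The dictionary layout, the category keys, and the `validate` helper are hypothetical; only the numbers come from the text.

```python
# Ranges from the text: allowed number of HMM states per category.
STATE_RANGES = {
    "two_slaps": (6, 14),     # first slap sound training data
    "three_slaps": (9, 21),   # second slap sound training data
    "non_slap": (7, 18),      # non-slap sound training data
}
GAUSSIAN_RANGE = (3, 12)      # number of Gaussians per mixture

# Example values from the text.
EXAMPLE_CONFIG = {
    "states": {"two_slaps": 10, "three_slaps": 15, "non_slap": 12},
    "gaussians": 3,
}

def validate(config):
    """Return a list of human-readable range violations (empty if none)."""
    errors = []
    low, high = GAUSSIAN_RANGE
    if not low <= config["gaussians"] <= high:
        errors.append(f"gaussians={config['gaussians']} outside {low}-{high}")
    for name, count in config["states"].items():
        low, high = STATE_RANGES[name]
        if not low <= count <= high:
            errors.append(f"{name} states={count} outside {low}-{high}")
    return errors
```

The example configuration passes the check, as do the alternative Gaussian counts of 5 and 8 mentioned in the text.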
  • The Gaussian mixture model (GMM) in the GMM-HMM may be trained multiple times to obtain a model with high recognition accuracy.
  • The methods for training the GMM-HMM multiple times include expectation maximization (EM) or maximum likelihood; training the model multiple times with either method yields a model with high recognition accuracy.
  • The expectation maximization method obtains maximum likelihood estimates of parameters: it solves for the parameters of a probability model from incomplete data or from data sets with missing data (that is, with hidden variables).
  • The maximum likelihood method (Maximum Likelihood, ML), also called maximum likelihood estimation, is a point estimation method that can be used to estimate the parameters of a model.
  • FIG. 3 shows a flowchart of an embodiment of an interaction method 200 of this application.
  • The interaction method 200 includes steps 201-203.
  • In step 201, a sound signal to be recognized is obtained.
  • The sound signal to be recognized can be obtained from a real-time sound signal stream.
  • In step 202, the sound recognition method 100 described above is executed to recognize the acquired sound signal.
  • In step 203, if the sound recognition method 100 recognizes that the sound signal to be recognized includes a slap sound, a corresponding control instruction is output according to the slap sound.
  • The interaction method 200 uses slap sounds for interaction.
  • The sound recognition method has a high recognition rate for slap sounds, good robustness, and a low probability of false triggering, which makes the interaction method reliable.
  • The instantaneous energy of a slap sound is stronger than that of speech, and a slap sound does not readily die away in the air. A slap sound that has traveled some distance, for example 2 meters or more, is therefore recognized better than speech, so human-computer interaction by slap sound is possible over a longer distance range, with a higher recognition rate and stronger resistance to interference.
  • When the sound signal to be recognized includes a slap sound, the control instruction includes a control instruction for controlling the movable platform.
  • The control instruction can control the movable platform, for example to move forward, move backward, turn, rotate, stand still, or fire bullets.
  • The movable platform may include a mobile cart, an unmanned aerial vehicle, an automobile, a robot, or another movable device. Slap sounds are used to interact with and control the movable platform. Because the recognition rate for slap sounds is high, control of the movable platform is more accurate and the probability of erroneous control is low. Multiple interaction modes can be provided over a longer distance range, which improves the user experience.
  • When the sound signal to be recognized includes a slap sound, the control instruction includes a control instruction for controlling the vision system of the movable platform.
  • The control instruction includes a control instruction for controlling the vision system to start visual tracking, and/or a control instruction for controlling the vision system to end visual tracking. Visual tracking can thus be started and/or ended by slap sound and controlled accurately.
  • The control instruction can also control other systems of the movable platform; for example, the power device can be controlled to move the movable platform, or the camera of the movable platform can be controlled to take pictures.
  • The control instruction can also control other devices and is not limited to a movable platform.
  • At least one of the number of slaps, the duration of the slaps, and the frequency of the slaps of the slap sound is acquired, and different control instructions are output according to at least one of these quantities. Slap sounds that differ in at least one of the number, duration, and frequency of slaps produce different control instructions, so that different slap sounds realize different controls. For example, different control instructions generated from different slap sounds can respectively start and end visual tracking.
  • Different control instructions are generated according to the number of consecutive slaps.
  • For example, when the user claps twice in a row, the interaction method 200 recognizes the slap sound representing two consecutive slaps and controls the movable platform to start visual tracking, so that the movable platform begins to follow the user.
  • When the user claps three times in a row, the interaction method 200 recognizes the slap sound representing three consecutive slaps and controls the movable platform to stop moving.
  • The mapping between the type of slap sound and the control instruction may be preset or may be set by the user, which makes interactive control more flexible and improves the user experience.
  • FIG. 4 is a schematic diagram of an embodiment of the sound recognition system 300 of this application.
  • The sound recognition system 300 includes one or more processors for implementing a sound recognition method.
  • The processor 301 of the sound recognition system 300 can implement the sound recognition method 100 described above.
  • The sound recognition system 300 may include a computer-readable storage medium 304, which may store a program that can be called by the processor 301 and may include a non-volatile storage medium.
  • The sound recognition system 300 may include a memory 303 and an interface 302. In some embodiments, the sound recognition system 300 may also include other hardware, depending on the actual application.
  • The computer-readable storage medium 304 of this application has a program stored thereon; when the program is executed by the processor 301, the sound recognition method 100 is implemented.
  • FIG. 5 shows a block diagram of an embodiment of the movable platform 400 of this application.
  • The movable platform 400 includes a body 401, a power system 402, a microphone 403, and one or more processors 404.
  • The movable platform 400 may include a mobile cart, an unmanned aerial vehicle, an automobile, a robot, or another movable device.
  • The power system 402 is provided in the body 401 and is used to provide power for the movable platform.
  • The power system 402 may include an electric motor.
  • In one example, the movable platform 400 is an unmanned aerial vehicle, and the power system 402 includes a propeller connected to a motor.
  • In another example, the movable platform 400 is a mobile cart, and the power system 402 includes wheels connected to motors, such as universal wheels.
  • The microphone 403 is used to receive the sound to be recognized and to generate a corresponding sound signal to be recognized.
  • The microphone 403 may be installed in the body 401. Because the instantaneous energy of a slap sound is stronger than that of speech and is less readily attenuated in the air, the slap sound can be received well by the microphone 403.
  • There may be one or more microphones.
  • The microphone may also include accessories such as a furry windscreen or a shock absorber, to better receive the sound to be recognized.
  • The one or more processors 404 are configured to implement a sound recognition method and, if the sound recognition method recognizes that the sound signal to be recognized includes a slap sound, to output a corresponding control instruction according to the slap sound.
  • The processor 404 can control the power system 402.
  • When the sound signal to be recognized includes a slap sound, the control instruction includes a control instruction for controlling the movable platform 400.
  • The movable platform 400 includes a vision system 405, and when the sound signal to be recognized includes a slap sound, the control instruction includes a control instruction for controlling the vision system.
  • The processor 404 can control the vision system 405.
  • The control instruction includes a control instruction for controlling the vision system 405 to start visual tracking, and/or a control instruction for controlling the vision system 405 to end visual tracking. For details, refer to the description above.
  • The processor 404 is configured to obtain at least one of the number of slaps, the duration of the slaps, and the frequency of the slaps of the slap sound, and to output different control instructions according to at least one of the number, duration, and frequency of the slaps. For details, refer to the description above.
  • This application can take the form of a computer program product implemented on one or more storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing program code.
  • Computer-readable storage media include permanent and non-permanent, removable and non-removable media, and may store information by any method or technology.
  • The information can be computer-readable instructions, data structures, program modules, or other data.
  • Examples of computer-readable storage media include, but are not limited to: phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
  • Each part of this application can be implemented in hardware, software, or a combination thereof.
  • Multiple steps or methods can be implemented by software or hardware stored in a memory and executed by a suitable instruction execution system.
  • For example, if implemented in hardware, any one or a combination of the following technologies can be used: discrete logic circuits with logic gates for implementing logic functions on data signals, application-specific integrated circuits with suitable combinational logic gates, programmable gate arrays (PGA), field programmable gate arrays (FPGA), and so on.
  • A person of ordinary skill in the art can understand that all or part of the steps of the method embodiments described above can be completed by a program instructing the relevant hardware.
  • The program can be stored in a computer-readable storage medium and, when executed, includes one of the steps of the method embodiment or a combination thereof.

Abstract

Disclosed in the present application are a sound recognition method, an interaction method, a sound recognition system, a computer-readable storage medium and a mobile platform. The sound recognition method is used to recognize a percussive sound. The sound recognition method comprises: acquiring at least one sound snippet of a sound signal to be recognized and first feature information of the sound snippet, the first feature information being an energy value of the sound snippet, and if the energy value of a central region of the sound snippet is greater than an energy threshold, extracting second feature information from the sound snippet; according to second feature information of the at least one sound snippet, recognizing whether the sound signal to be recognized comprises a percussive sound.

Description

Sound recognition method, interaction method, sound recognition system, computer-readable storage medium and movable platform
Technical field
This application relates to the field of sound recognition, and in particular to a sound recognition method, an interaction method, a sound recognition system, a computer-readable storage medium, and a movable platform.
Background
With the spread of smart hardware in home, education, and other settings, sound has gradually become an important means of human-computer interaction, for example voice interaction. However, because of hardware limitations, when the user is far from the device, for example more than 2 meters away, the signal-to-noise ratio is low, and ambient noise mixed into the speech signal poses a great challenge for speech recognition. Compared with a speech signal, a slap sound signal is simple, more resistant to interference, and has stronger instantaneous energy. Slap sounds such as handclaps can therefore be used to control hardware devices such as sound-activated switches. However, existing sound-activated switches based on waveform comparison circuits are not robust in use: most loud sounds can trigger them, false triggers are too frequent, and they are unreliable as a means of human-computer interaction.
Summary of the invention
This application provides an improved sound recognition method, interaction method, sound recognition system, computer-readable storage medium, and movable platform.
According to one aspect of the embodiments of this application, a sound recognition method for recognizing slap sounds is provided. The sound recognition method includes: acquiring at least one sound segment of a sound signal to be recognized and first feature information of the sound segment, the first feature information being the energy value of the sound segment, and, if the energy value of the middle region of the sound segment is greater than an energy threshold, extracting second feature information from the sound segment; and recognizing, according to the second feature information of the at least one sound segment, whether the sound signal to be recognized includes a slap sound.
According to one aspect of the embodiments of this application, an interaction method is provided, including: acquiring a sound signal to be recognized; performing the sound recognition method; and, if the sound recognition method recognizes that the sound signal to be recognized includes a slap sound, outputting a corresponding control instruction according to the slap sound.
According to another aspect of the embodiments of this application, a sound recognition system is provided, including one or more processors for implementing the sound recognition method.
According to another aspect of the embodiments of this application, a computer-readable storage medium is provided, having a program stored thereon; when the program is executed by a processor, the sound recognition method is implemented.
According to another aspect of the embodiments of this application, a movable platform is provided, including: a body; a power system, provided in the body, for providing power to the movable platform; a microphone for receiving a sound to be recognized and generating a corresponding sound signal to be recognized; and one or more processors for implementing the sound recognition method and, if the sound recognition method recognizes that the sound signal to be recognized includes a slap sound, outputting a corresponding control instruction according to the slap sound.
In the sound recognition method of the embodiments of this application, second feature information is extracted from a sound segment only if the energy value of the middle region of the sound segment is greater than the energy threshold. This pre-screens the sound signal to be recognized, after which the second feature information is used to recognize whether the sound signal includes a slap sound. As a result, over a longer distance range the recognition rate for slap sounds is high, robustness is good, and the probability of false triggering is low, making the method suitable as a reliable means of human-computer interaction.
Description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of an embodiment of the sound recognition method of this application.
FIG. 2 is a sub-flowchart of an embodiment of the sound recognition method of this application.
FIG. 3 is a flowchart of an embodiment of the interaction method of this application.
FIG. 4 is a schematic diagram of an embodiment of the sound recognition system of this application.
FIG. 5 is a block diagram of an embodiment of the movable platform of this application.
Detailed description
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings in the embodiments of this application. The described embodiments are only some of the embodiments of this application, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in this application, without creative effort, fall within the scope of protection of this application.
Exemplary embodiments are described in detail here, and examples of them are shown in the drawings. Where the following description refers to the drawings, the same numbers in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with this application. Rather, they are only examples of devices and methods consistent with some aspects of this application as detailed in the appended claims.
The terms used in this application are for the purpose of describing particular embodiments only and are not intended to limit this application. The singular forms "a", "said", and "the" used in this application and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. The term "and/or" as used herein refers to and includes any and all possible combinations of one or more of the associated listed items. Unless otherwise indicated, words such as "front", "rear", "lower", and/or "upper" are used only for convenience of description and are not limited to one position or one spatial orientation. Words such as "connected" or "coupled" are not limited to physical or mechanical connections and may include electrical connections, whether direct or indirect. Words such as "multiple" or "several" mean at least two.
The sound recognition method of the embodiments of this application is used to recognize slap sounds. The sound recognition method includes: acquiring at least one sound segment of a sound signal to be recognized and first feature information of the sound segment, the first feature information being the energy value of the sound segment, and, if the energy value of the middle region of the sound segment is greater than an energy threshold, extracting second feature information from the sound segment; and recognizing, according to the second feature information of the at least one sound segment, whether the sound signal to be recognized includes a slap sound.
In the sound recognition method, second feature information is extracted from a sound segment only if the energy value of the middle region of the sound segment is greater than the energy threshold. This pre-screens the sound signal to be recognized, after which the second feature information is used to recognize whether the sound signal includes a slap sound. As a result, over a longer distance range the recognition rate for slap sounds is high, robustness is good, and the probability of false triggering is low, making the method suitable as a reliable means of human-computer interaction.
An interaction method of the embodiments of this application includes: acquiring a sound signal to be recognized; performing the sound recognition method described above, which includes acquiring first feature information of at least one sound segment, the first feature information being the energy value of the sound segment, extracting second feature information from the sound segment if the energy value of its middle region is greater than the energy threshold, and recognizing, according to the second feature information of the at least one sound segment, whether the sound signal to be recognized includes a slap sound; and, if the sound recognition method recognizes that the sound signal to be recognized includes a slap sound, outputting a corresponding control instruction according to the slap sound.
The sound recognition method has a high recognition rate for slap sounds, good robustness, and a low probability of false triggering, which makes the interaction method reliable. In addition, the instantaneous energy of a slap sound is stronger than that of speech, and a slap sound does not readily die away in the air. A slap sound that has traveled some distance, for example more than 2 meters, is therefore recognized better than speech, so human-computer interaction by slap sound is possible over a longer distance range, with a higher recognition rate and stronger resistance to interference.
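As a minimal sketch of outputting a control instruction according to the slap sound, the mapping below follows the example given elsewhere in this application, in which two consecutive claps start visual tracking and three consecutive claps stop the movable platform. The names `Command`, `DEFAULT_MAPPING`, and `command_for_slaps` are hypothetical and are not part of the application.

```python
from enum import Enum


class Command(Enum):
    """Hypothetical control instructions for a movable platform."""
    START_VISUAL_TRACKING = "start_visual_tracking"
    STOP_MOVING = "stop_moving"


# Assumed preset mapping from the number of consecutive slaps to a command.
# The application notes that this mapping may instead be set by the user.
DEFAULT_MAPPING = {
    2: Command.START_VISUAL_TRACKING,
    3: Command.STOP_MOVING,
}


def command_for_slaps(slap_count, mapping=DEFAULT_MAPPING):
    """Return the command mapped to slap_count, or None if it is unmapped."""
    return mapping.get(slap_count)
```

A user-defined mapping can be passed as the `mapping` argument, which reflects the option of letting the user set the relationship between slap sound type and control instruction.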
The sound recognition system of the embodiments of this application includes one or more processors for implementing the sound recognition method described above.
The machine-readable storage medium of the embodiments of this application has a program stored thereon; when the program is executed by a processor, the sound recognition method described above is implemented.
The movable platform of the embodiments of this application includes a body, a power system, a microphone, and one or more processors. The power system is provided in the body and provides power for the movable platform. The microphone receives the sound to be recognized and generates a corresponding sound signal to be recognized. The one or more processors implement the sound recognition method described above and, if the sound recognition method recognizes that the sound signal to be recognized includes a slap sound, output a corresponding control instruction according to the slap sound.
FIG. 1 shows a flowchart of an embodiment of a sound recognition method 100. The sound recognition method 100 is used to recognize slap sounds. In some embodiments, the frequency range of the slap sound is 300 Hz to 8000 Hz. The sound is crisp, has stronger instantaneous energy than speech, and does not readily die away in the air, so it is easy to recognize, with a high recognition rate and strong resistance to interference. In one embodiment, the slap sound includes at least one of a handclap and a knocking sound. A knocking sound may include the sound of knocking on an object such as a wall or a table; its waveform is similar to that of a handclap. Slap sounds such as handclaps and/or knocking sounds have a high recognition rate and strong resistance to interference, and can be recognized at longer distances. In this embodiment, the sound recognition method 100 includes steps 101 and 102.
In step 101, at least one sound segment of the sound signal to be recognized and first feature information of the sound segment are acquired, the first feature information being the energy value of the sound segment. If the energy value of the middle region of the sound segment is greater than the energy threshold, second feature information is extracted from the sound segment.
In one embodiment, the sound signal to be recognized may be one or more sections of a real-time sound signal stream. In one embodiment, the sound recognition method 100 may include acquiring the sound signal to be recognized, which can be cut from the real-time sound signal stream. In one embodiment, the sound signal between two adjacent silent periods that each exceed a silence time threshold is taken as the sound signal to be recognized. The sound signal within a silent period represents no sound or very quiet sound and can be called a "silent signal"; its energy value is lower than the minimum energy value of a slap sound. The energy value of the sound signal in the real-time stream can be compared with a set silence energy threshold: if the energy value is less than the silence energy threshold, the sound signal is determined to be a silent signal, and the duration of the silent signal, that is, the silent period, can be determined. The silence energy threshold does not exceed the minimum energy value of a slap sound. The silence time threshold can be preset. In one embodiment, the silence time threshold exceeds the interval between two adjacent slaps in a run of consecutive slaps. In one example, the silence time threshold is any value greater than or equal to 2 seconds, and two adjacent slaps less than 2 seconds apart are treated as consecutive slaps. In another embodiment, the sound signal to be recognized is the entire real-time sound signal stream.
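The cutting of the stream at long silent periods can be sketched as follows. This is only an illustration under stated assumptions: the stream is assumed to have already been reduced to one energy value per fixed-length frame, and the function name `split_on_silence` and its parameters are hypothetical rather than taken from the application.

```python
def split_on_silence(frame_energies, frame_seconds, silence_energy,
                     silence_seconds=2.0):
    """Cut a stream of per-frame energies into candidate signals.

    A frame is silent when its energy is below silence_energy. A candidate
    signal ends once the silence that follows it lasts longer than
    silence_seconds. Returns (start, end) frame index pairs, end exclusive.
    """
    segments = []
    start = None      # first non-silent frame of the current candidate
    silent_run = 0    # number of consecutive silent frames seen so far
    for i, energy in enumerate(frame_energies):
        if energy < silence_energy:
            silent_run += 1
            long_silence = silent_run * frame_seconds > silence_seconds
            if start is not None and long_silence:
                # Close the candidate at its last non-silent frame.
                segments.append((start, i - silent_run + 1))
                start = None
        else:
            if start is None:
                start = i
            silent_run = 0
    if start is not None:
        # The stream ended before a long silence; drop trailing silence.
        segments.append((start, len(frame_energies) - silent_run))
    return segments
```

With one-second frames and the 2-second silence time threshold from the example above, two slaps separated by a single silent frame stay in the same candidate signal, while a three-frame silence separates candidates.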
In another embodiment, the energy value can be used to pre-screen the sound signal to be recognized. If the energy value of the middle region of a sound segment of the sound signal is greater than the energy threshold, the sound in the middle region of the segment is loud, which indicates that the segment may include a slap sound; this gives an effective pre-screen for slap sounds. For example, in one implementation, the middle region of the sound segment has a sharp peak and the two end regions have small, flat values, so that the waveform of the segment is high in the middle and low at both ends, which indicates that the segment may include a slap sound. In one embodiment, the middle region of the sound segment may be the exact center of the segment, or a region extending from the exact center toward one end or toward both ends. In one embodiment, the energy threshold is a preset fixed value or a value that changes in real time. In one embodiment, the energy threshold may be determined from the energy values of one or both end regions outside the middle region of the sound segment, so the energy threshold may differ between sound segments. In another embodiment, the energy threshold is a preset value, and a fixed energy threshold can be set based on the characteristics of slap sounds and on experience.
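One way to derive the energy threshold from the end regions of a segment is sketched below. The application does not say how the threshold is computed from the end-region energies, so the rule used here (a multiple of the mean end-region energy, with a small floor) is an assumption, as are the name `passes_energy_prescreen` and the parameters `edge_frames`, `ratio`, and `floor`.

```python
def passes_energy_prescreen(energies, edge_frames=2, ratio=4.0, floor=1e-6):
    """Pre-screen one sound segment given its per-frame energy values.

    The middle region is taken to be the single center frame. The threshold
    is assumed to be ratio times the mean energy of the first and last
    edge_frames frames, and never less than floor.
    """
    if len(energies) < 2 * edge_frames + 1:
        raise ValueError("segment too short for the requested edge size")
    center = energies[len(energies) // 2]
    edges = list(energies[:edge_frames]) + list(energies[-edge_frames:])
    threshold = max(floor, ratio * sum(edges) / len(edges))
    return center > threshold
```

Under this rule a segment that is high in the middle and low at both ends passes, while a segment that is uniformly loud does not, which is the behavior the pre-screen is meant to have.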
FIG. 2 is a sub-flowchart of one embodiment of step 101. Step 101 includes sub-steps 111 and 112. In sub-step 111, the sound signal to be recognized is framed and windowed to obtain multiple sound frames corresponding to the sound signal to be recognized.
Framing is achieved by windowing the sound signal to be recognized, yielding multiple sound frames. A single slap sound typically lasts about 80 to 160 milliseconds. In one embodiment, the sound signal to be recognized is divided into frames of 11 to 23 milliseconds each, and 4 to 15 consecutive sound frames are judged at a time. In one example, the signal is divided into 16-millisecond frames and 7 consecutive sound frames are judged at a time. In other examples, other frame durations and/or other numbers of frames per judgment can be used, which is not limited here.
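As a rough sketch of the framing step, the following splits a list of samples into 16-millisecond frames and applies a window to each. The source does not name the window function or say whether frames overlap, so the Hamming window and the non-overlapping layout used here are assumptions, as is the function name `frame_signal`.

```python
import math


def frame_signal(samples, sample_rate, frame_ms=16):
    """Split `samples` into non-overlapping windowed frames of `frame_ms` ms."""
    frame_len = int(sample_rate * frame_ms / 1000)
    if frame_len < 2:
        raise ValueError("frame too short for the given sample rate")
    # Hamming window (an assumption; the source only says "windowing").
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    # Incomplete trailing samples are dropped in this sketch.
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        chunk = samples[start:start + frame_len]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames


frames = frame_signal([1.0] * 1000, sample_rate=16000)
```

At a 16 kHz sample rate a 16-millisecond frame holds 256 samples, so a slap lasting 80 to 160 milliseconds spans roughly 5 to 10 such frames, consistent with judging 7 frames at a time.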
In sub-step 112, if the energy value of the sound frames in the middle region of the multiple sound frames corresponding to a sound segment is greater than the energy threshold, the second feature information is extracted from the sound segment.
In one embodiment, the sound frames of the middle region consist of the centre sound frame. In another embodiment, the sound frames of the middle region include the centre sound frame and one or more sound frames on one or both sides of it. In one embodiment, the sound segment contains an odd number of sound frames, and the centre sound frame is the frame at the centre of the segment. In another embodiment, the sound segment contains an even number of sound frames, and the centre sound frame is the one or two frames closest to the centre of the segment. When the middle region contains a single sound frame, the energy value of the middle region is the energy value of that frame. When the middle region contains multiple sound frames, its energy value can be computed from the energy values of those frames by a suitable algorithm, for example the mean, the median or the variance, which is not limited here.
In some embodiments, the window used for framing and windowing can slide sequentially across the sound frames so that consecutive groups of sound frames are judged. This avoids missing slap sounds and makes the judgment more accurate and more robust. In one example, the consecutive sound frames judged over several slides of the window form the sound frames of one sound segment. For instance, if the window slides three times, one frame per slide, and 7 consecutive frames are judged each time, then from the initial window position to the position after the third slide a total of 10 consecutive sound frames have been judged, and these 10 frames are taken as the sound frames of one sound segment. The sound segment is thus obtained by sliding the window. In one embodiment, the window slides one frame at a time. Of course, in other embodiments, the window may slide two, three or more frames at a time as required, which is not limited here.
In one embodiment, the energy value includes the spectral value of a sound frame, and a fast Fourier transform can be applied to the sound frames to obtain their spectral values. If the spectral value of the sound frames in the middle region of a sound segment is greater than the energy threshold, the second feature information is extracted from the sound segment. The spectral value reflects the energy of the sound: when the spectral value of the middle-region sound frames exceeds the energy threshold, the sound segment may include a slap sound, so the second feature information is extracted from it. Obtaining the spectral value is straightforward, and using it to pre-screen the sound to be recognized removes sound segments that clearly do not include a slap sound. The method is simple and effective.
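The source does not say how a frame's spectrum is reduced to a single spectral value. The sketch below assumes the mean magnitude over the non-negative frequency bins, and uses a plain discrete Fourier transform so that it has no dependencies; a real implementation would call an FFT routine instead. The function name `spectral_value` is hypothetical.

```python
import cmath


def spectral_value(frame):
    """Mean DFT magnitude over bins 0..n/2 (an assumed definition)."""
    n = len(frame)
    bins = n // 2 + 1
    total = 0.0
    for k in range(bins):
        # Naive O(n^2) DFT coefficient for bin k; fine for short frames only.
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(frame))
        total += abs(coeff)
    return total / bins


quiet = spectral_value([0.0, 0.0, 0.0, 0.0])
loud = spectral_value([2.0, 0.0, 0.0, 0.0])
```

Whatever reduction is chosen, the screening only needs the value to grow with the loudness of the frame, so that it can be compared against the energy threshold.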
In one embodiment, a trigger signal is generated if the spectral value of the sound frames in the middle region of the window is greater than the energy threshold. When the window slides sequentially across several sound frames, if the number of consecutively generated trigger signals reaches a trigger count threshold, the second feature information is extracted from the sound segment formed by the consecutive frames that contain those sound frames. Multiple consecutive trigger signals generated as the window slides indicate that the sound segment may contain a slap sound. The window can move one frame at a time, so a single slap sound may trigger repeatedly and produce multiple trigger signals, which avoids missing slap sounds and improves the accuracy of the judgment.
In one embodiment, a trigger signal is generated when the spectral value of the sound frames in the middle region of the window is greater than a first energy value and the spectral value of the sound frames in the two end regions of the window is less than a second energy value, where the first energy value is greater than the second energy value. When the window slides sequentially across several sound frames, if the number of consecutively generated trigger signals reaches the trigger count threshold, the second feature information is extracted from the sound segment. In some embodiments, the first energy value includes a preset fixed value and/or a value related to the spectral values of the sound frames in the end regions; for example, the first energy value varies with the spectral values of the end-region sound frames, such as being greater than those spectral values. In some embodiments, the second energy value includes a preset fixed value and/or a value related to the spectral value of the sound frames in the middle region; for example, the second energy value may vary with the spectral value of the middle-region sound frames, such as being less than that spectral value. Of course, in other embodiments, the spectral value of the sound frames at one end of the window may be required to be less than the second energy value, while the spectral value of the sound frames at the other end varies with that of the first end, which is not limited here.
In one example, 7 consecutive sound frames of the sound to be recognized are judged, and the middle-region sound frame of these 7 frames is the fourth frame. Let M(x) be the spectral value of the x-th of these 7 frames, and let MI be the minimum value among the third to fifth frames. For example, in one implementation, it is preset that if M(4) > 2*MI and M(4) > 5*M(2) and M(4) > 3*M(6) and M(4) > 20*M(1) and M(4) > 7*M(7) and M(4) > 0.05, one trigger is determined and one trigger signal is generated; M(4) is the spectral value of the middle-region sound frame. Whether or not a trigger occurs, the window then slides to the next frame and the judgment is repeated. If triggers occur 4 times in a row, the sound segment made up of the consecutive sound frames that were judged is considered to contain a slap sound; that is, the ten sound frames from the first frame to the tenth frame, counted from the initial window position, contain a slap sound, and the second feature information is extracted from this sound segment. In this example the trigger count threshold is 4, but other trigger count thresholds can be set in other examples. The energy threshold here comprises multiple energy thresholds: 2*MI, 5*M(2), 3*M(6), 20*M(1), 7*M(7) and 0.05. That is, it includes the fixed threshold 0.05 and energy thresholds related to the spectral values of the sound frames at the two ends, which may be multiples of those spectral values. This allows a fairly accurate preliminary screening and avoids missing slap sounds.
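The 7-frame rule in this example can be written down directly. The sketch below is an illustration under assumptions: it takes a list of per-frame spectral values, treats each slide as a shift of one list position, and reports the longest run of consecutive triggers rather than hard-coding the trigger count threshold. The names `is_triggered` and `longest_trigger_run` are hypothetical.

```python
def is_triggered(m):
    """Apply the example rule to 7 spectral values; m[0] is frame 1, m[6] is frame 7."""
    centre = m[3]               # M(4), the middle-region frame
    mi = min(m[2:5])            # MI: minimum over frames 3 to 5
    return (centre > 2 * mi
            and centre > 5 * m[1]    # M(2)
            and centre > 3 * m[5]    # M(6)
            and centre > 20 * m[0]   # M(1)
            and centre > 7 * m[6]    # M(7)
            and centre > 0.05)       # fixed threshold


def longest_trigger_run(values, window=7):
    """Slide the window one position at a time; return the longest run of triggers."""
    best = run = 0
    for start in range(len(values) - window + 1):
        if is_triggered(values[start:start + window]):
            run += 1
            best = max(best, run)
        else:
            run = 0
    return best


values = [0, 0, 0.01, 1, 1, 0.01, 0, 0]
run_length = longest_trigger_run(values)
```

A caller would compare `run_length` with the trigger count threshold (4 in the example). Note that under a one-position hop over a single list of positive values, the ratio conditions on frames 2 and 6 prevent windows two positions apart from both firing, so how four consecutive triggers arise (for instance through overlapping frames or a finer hop) is something the source leaves unspecified.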
Continuing with FIG. 1, in step 102, whether the sound signal to be recognized includes a slap sound is identified according to the second feature information of at least one sound segment.
At least one sound segment that passed the preliminary screening and may contain a slap sound is further recognized to determine whether the sound signal to be recognized includes a slap sound. In the sound recognition method 100 of the embodiments of this application, the second feature information is extracted from a sound segment if the energy value of its middle region is greater than the energy threshold, which pre-screens the sound signal to be recognized; whether the sound to be recognized includes a slap sound is then identified from the second feature information. As a result, slap sounds can be recognized at a high rate even over a relatively long distance, with a low likelihood of false triggering, which makes the method suitable as a reliable means of human-computer interaction.
In some embodiments, when the sound signal to be recognized includes a slap sound, the category of the slap sound is identified, and the category of the slap sound corresponds to a respective control instruction. The category of the slap sound includes at least one of the number of slaps, the duration of the slaps and the frequency of the slaps. The number of slaps may be the number of consecutive slaps in one stretch of slap sound. The duration may be the total duration of the consecutive slaps in one stretch of slap sound. The frequency reflects how fast the slaps are. In this way, the method not only identifies whether the sound signal to be recognized includes a slap sound but, when it does, also identifies the category of the slap sound. This serves human-computer interaction better, since different categories of slap sound can drive different interactions.
In some embodiments, the second feature information is input into a recognition model to further identify whether the sound signal to be recognized includes a slap sound. Recognition through the recognition model is accurate and fast. In some embodiments, the second feature information includes acoustic features, and the acoustic features include at least one of Mel-frequency cepstral coefficient (MFCC) features, linear prediction coefficient (LPC) features, filterbank features and bottleneck features. One or more of these acoustic features can be used by the recognition model to recognize slap sounds.
In some embodiments, the recognition model includes multiple sound categories. The likelihood of the second feature information with respect to the feature information of each sound category is determined; the likelihoods are sorted, and the sound category with the highest likelihood is taken as the category of the sound to be recognized, which identifies whether the sound to be recognized includes a slap sound. This allows fast recognition. In one embodiment, the sound categories include a slap sound category and a non-slap sound category. The likelihood of the second feature information with respect to the slap sound category and its likelihood with respect to the non-slap sound category can be determined, and the category with the highest likelihood is taken as the category of the sound to be recognized. In this way it can be determined whether the sound to be recognized includes a slap sound, with high accuracy and high speed.
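The likelihood ranking can be illustrated with a deliberately simple stand-in model. The sketch below scores a feature vector against one diagonal Gaussian per sound category and picks the category with the highest log-likelihood. The single-Gaussian model, the category labels and the function names are assumptions made for illustration; the source's recognition model is not specified at this level of detail.

```python
import math


def log_likelihood(x, mean, var):
    """Log-density of x under a diagonal Gaussian with the given mean and variance."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def classify(x, models):
    """Rank categories by likelihood and return (best_label, scores)."""
    scores = {label: log_likelihood(x, mean, var)
              for label, (mean, var) in models.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[0], scores


# Hypothetical two-dimensional feature statistics for each category.
models = {
    "slap": ([5.0, 1.0], [1.0, 1.0]),
    "non_slap": ([0.0, 0.0], [1.0, 1.0]),
}
label, scores = classify([4.8, 0.9], models)
```

Adding categories for two, three or more consecutive slaps only means adding more entries to `models`; the ranking step is unchanged.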
In some embodiments, the slap sound category includes at least two slap sound categories that represent different numbers of consecutive slaps, for example a category for two consecutive slaps, a category for three consecutive slaps and a category for more consecutive slaps. When the second feature information is input into the recognition model, it can be determined whether the sound to be recognized includes a slap sound and how many consecutive slaps it contains. Slap sounds with different numbers of consecutive slaps can thus be distinguished, making recognition more precise. In some other embodiments, the slap sound category may include at least two slap sound categories that differ in slap duration and/or slap frequency.
In some embodiments, the recognition model is trained using slap sound training data and non-slap sound training data, which yields the slap sound category and the non-slap sound category. The non-slap sound training data may include data for sounds other than slaps, such as noise and speech. A large amount of slap sound training data and non-slap sound training data can be collected to train the recognition model. In some embodiments, the recognition model may be trained multiple times to obtain a model with better performance.
In some embodiments, the slap sound training data includes first slap sound training data and second slap sound training data, which differ in at least one of the number of slaps, the duration of the slaps and the frequency of the slaps that they represent. The recognition model is trained using the first and second slap sound training data. This yields different slap sound categories, which can then be used to identify the category of a slap sound. In one embodiment, the first and second slap sound training data represent different numbers of consecutive slaps. In one example, the first slap sound training data represents two consecutive slaps and the second slap sound training data represents three consecutive slaps, although this is not limiting. The recognition model can be trained according to the actual application to obtain different slap sound categories.
In some embodiments, the recognition model includes at least one of a deep model and a shallow model, and such recognition models give a high recognition rate. In some embodiments, the deep model includes at least one of the following: a deep neural network (DNN), a long short-term memory network (LSTM) and a convolutional neural network (CNN).
In one embodiment, the shallow model includes a Gaussian mixture model-hidden Markov model (GMM-HMM). Recognizing the signal to be recognized with the GMM-HMM gives a high recognition rate and a high recognition speed. In some embodiments, the GMM-HMM is trained using slap sound training data and non-slap sound training data, where the slap sound training data includes first slap sound training data and second slap sound training data that differ in at least one of the number of slaps, the duration of the slaps and the frequency of the slaps that they represent. The GMM-HMM trained in this way includes a non-slap sound category and slap sound categories, where the slap sound categories include a first slap sound category and a second slap sound category that differ in at least one of the number, duration and frequency of the slaps they represent. See the description above for details. MFCC features are extracted from the slap sound training data and the non-slap sound training data and used to train the GMM-HMM.
In some embodiments, parameter estimation is performed on the hidden Markov model (HMM). In one embodiment, the parameters of the hidden Markov model are estimated by the Baum-Welch algorithm and/or a genetic algorithm. The Baum-Welch algorithm, also called the forward-backward algorithm, starts from an initial estimate of the HMM parameters, which is quite likely a poor guess. It then evaluates how well these parameters fit the given training data (for example by cross-validation) and updates the HMM parameters to reduce the resulting error, so that the error with respect to the training data becomes smaller. A genetic algorithm is a computational model that simulates the natural selection and genetic mechanisms of Darwinian biological evolution; it searches for an optimal solution by simulating the process of natural evolution.
In some embodiments, the number of Gaussians in the GMM-HMM ranges from 3 to 12, which suits slap sound recognition and balances recognition performance against recognition speed, keeping accuracy as high and speed as fast as possible. The first slap sound training data includes slap sound training data for two consecutive slaps, and the number of states of the GMM-HMM corresponding to the first slap sound training data ranges from 6 to 14. In some embodiments, the second slap sound training data includes slap sound training data for three consecutive slaps, and the number of states of the GMM-HMM corresponding to the second slap sound training data ranges from 9 to 21. In some embodiments, the number of states of the GMM-HMM corresponding to the non-slap sound training data ranges from 7 to 18. In each case the range is chosen so that the recognition model performs as well and recognizes as quickly as possible. In one example, the number of states for the first slap sound training data is 10, the number of states for the second slap sound training data is 15, the number of states for the non-slap sound training data is 12, and the number of Gaussians is 3. This is only one example and is not limiting; in other examples the number of states and/or the number of Gaussians may take other values, for example 5 or 8 Gaussians.
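The ranges and the example values above can be captured as a small configuration check. The dictionary layout, the category labels and the function name `out_of_range` are hypothetical choices for illustration; only the numbers come from the text.

```python
# Allowed number of HMM states per category, as (low, high) inclusive.
STATE_RANGES = {
    "double_slap": (6, 14),   # first slap sound training data
    "triple_slap": (9, 21),   # second slap sound training data
    "non_slap": (7, 18),      # non-slap sound training data
}
GAUSSIAN_RANGE = (3, 12)      # allowed number of Gaussians

# The example configuration given in the text.
EXAMPLE_CONFIG = {
    "num_gaussians": 3,
    "num_states": {"double_slap": 10, "triple_slap": 15, "non_slap": 12},
}


def out_of_range(config):
    """Return the names of settings that fall outside the stated ranges."""
    problems = []
    low, high = GAUSSIAN_RANGE
    if not low <= config["num_gaussians"] <= high:
        problems.append("num_gaussians")
    for label, (low, high) in STATE_RANGES.items():
        if not low <= config["num_states"][label] <= high:
            problems.append(label)
    return problems


problems = out_of_range(EXAMPLE_CONFIG)
```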
In some embodiments, the Gaussian mixture model (GMM) within the GMM-HMM may be trained multiple times to obtain a model with high recognition accuracy. In some embodiments, the method used to train the GMM-HMM multiple times includes expectation maximization (EM) or the maximum likelihood method. Expectation maximization is a way of obtaining maximum likelihood estimates of parameters: it solves for the maximum likelihood estimate of the parameters of a probability model from incomplete data or from data sets with missing data (that is, with hidden variables). The maximum likelihood (ML) method, also known as maximum likelihood estimation, is a theoretically grounded point estimation method that can be used to estimate the parameters of a model.
In this way, after the GMM-HMM has been trained multiple times, a GMM-HMM recognition model with better performance is obtained.
FIG. 3 is a flowchart of one embodiment of the interaction method 200 of this application. The interaction method 200 includes steps 201 to 203.
In step 201, the sound signal to be recognized is acquired. It can be acquired from a real-time sound signal stream. See the description above for details, which are not repeated here.
In step 202, the sound recognition method 100 described above is executed to recognize the acquired sound signal to be recognized.
In step 203, if the sound recognition method 100 identifies that the sound signal to be recognized includes a slap sound, a corresponding control instruction is output according to the slap sound.
The interaction method 200 uses slap sounds for interaction. The sound recognition method recognizes slap sounds at a high rate, is robust and has a low likelihood of false triggering, which makes the interaction method reliable. In addition, the instantaneous energy of a slap sound is stronger than that of speech and does not readily die away in air. Slap sounds that have travelled some distance, for example 2 meters or more, are therefore recognized better than speech, so human-computer interaction by slap sound can work over a longer range with a higher recognition rate and stronger resistance to interference.
In some embodiments, the control instruction includes a control instruction that controls a movable platform when the sound signal to be recognized includes a slap sound. The control instruction can, for example, make the movable platform move forward, move backward, turn, rotate, stay still or fire projectiles. The movable platform may include a mobile cart, an unmanned aerial vehicle, a car, a robot or another movable device. Using slap sounds to interact with and control the movable platform benefits from the high recognition rate of slap sounds: control is more accurate, the probability of erroneous control is low, and multiple kinds of interaction can be achieved over a longer range, which improves the user experience.
In some embodiments, the control instruction includes a control instruction that controls the vision system of the movable platform when the sound signal to be recognized includes a slap sound, for example to control the working state of the vision system. Controlling the vision system by slap sound can improve the accuracy of that control. In one embodiment, the control instruction includes a control instruction that makes the vision system start visual tracking and/or a control instruction that makes the vision system end visual tracking. Slap sounds can thus start and/or end visual tracking, giving accurate control over it.
In some other embodiments, the control instruction can control other systems of the movable platform. For example, it can control the power device of the movable platform and thereby the movement of the platform, or it can make the camera of the movable platform take pictures. In some other embodiments, the control instruction can control other devices and is not limited to movable platforms.
In some embodiments, at least one of the number of slaps, the duration of the slaps and the frequency of the slaps of the slap sound is obtained, and different control instructions are output accordingly. When slap sounds differ in at least one of these properties, different control instructions are output, so different slap sounds produce different control instructions and achieve different controls. For example, different slap sounds can generate different control instructions that respectively start and end visual tracking.
In one embodiment, different control instructions are generated for different numbers of consecutive slaps. In one example, the user claps twice in a row; the interaction method 200 recognizes the slap sound representing two consecutive slaps and makes the movable platform start visual tracking, so that the movable platform begins to follow the user. While it is following, the user claps three times in a row; the interaction method 200 recognizes the slap sound representing three consecutive slaps and makes the movable platform stop moving. This is only an example and is not limiting. In one embodiment, the mapping between slap sound types and control instructions may be preset or may be configured by the user, which makes interactive control more flexible and improves the user experience.
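The mapping between slap sound types and control instructions can be sketched as a lookup table with optional user overrides. The command strings and the function name `command_for` are hypothetical; the default entries mirror the two-clap and three-clap example above.

```python
# Preset mapping from the number of consecutive slaps to a control instruction.
DEFAULT_COMMANDS = {
    2: "start_visual_tracking",
    3: "stop_moving",
}


def command_for(slap_count, user_mapping=None):
    """Return the control instruction for a slap count, or None if unmapped.

    Entries in `user_mapping` take precedence over the preset mapping.
    """
    mapping = {**DEFAULT_COMMANDS, **(user_mapping or {})}
    return mapping.get(slap_count)


preset = command_for(2)
custom = command_for(2, user_mapping={2: "take_photo"})
```

Returning `None` for an unmapped count lets the caller ignore slap patterns that have no assigned instruction rather than guessing one.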
FIG. 4 is a schematic diagram of one embodiment of the sound recognition system 300 of this application. The sound recognition system 300 includes one or more processors for implementing the sound recognition method. The processor 301 of the sound recognition system 300 can implement the sound recognition method 100 described above. In some embodiments, the sound recognition system 300 may include a computer-readable storage medium 304, which may store a program that can be called by the processor 301 and may include a non-volatile storage medium. In some embodiments, the sound recognition system 300 may include a memory 303 and an interface 302. In some embodiments, the sound recognition system 300 may also include other hardware depending on the actual application.
The computer-readable storage medium 304 of this application stores a program that, when executed by the processor 301, implements the sound recognition method 100.
FIG. 5 is a block diagram of one embodiment of the movable platform 400 of this application. The movable platform 400 includes a body 401, a power system 402, a microphone 403, and one or more processors 404. The movable platform 400 may be a mobile cart, an unmanned aerial vehicle, an automobile, a robot, or another movable device. The power system 402 is provided on the body 401 and supplies power to the movable platform. In some embodiments, the power system 402 may include a motor. In one embodiment, the movable platform 400 is an unmanned aerial vehicle, and the power system 402 includes a propeller connected to the motor. In another embodiment, the movable platform 400 is a mobile cart, and the power system 402 includes wheels connected to the motor, for example universal wheels.
The microphone 403 receives the sound to be recognized and generates a corresponding sound signal to be recognized. The microphone 403 may be mounted on the body 401. Because the instantaneous energy of a slap sound is higher than that of speech, it is less likely to be fully attenuated in air, so the microphone 403 can receive slap sounds more reliably. There may be one or more microphones. In one implementation, the microphone may also include wind-protection accessories, such as a furry windscreen or a shock mount, to better receive the sound to be recognized.
The one or more processors 404 implement the sound recognition method and, if the sound signal to be recognized is identified by that method as including a slap sound, output a corresponding control instruction according to the slap sound. The processor 404 can control the power system 402.
In one embodiment, the control instructions include a control instruction that controls the movable platform 400 when the sound signal to be recognized includes a slap sound. In one embodiment, the movable platform 400 includes a vision system 405, and the control instructions include a control instruction that controls the vision system when the sound signal to be recognized includes a slap sound. The processor 404 can control the vision system 405. In one embodiment, the control instructions include a control instruction that controls the vision system 405 to start visual tracking and/or a control instruction that controls the vision system 405 to end visual tracking. For details, see the description above.
In one embodiment, the processor 404 is configured to obtain at least one of the number, duration, and frequency of slaps in the slap sound, and to output different control instructions according to at least one of these. For details, see the description above.
This application may take the form of a computer program product implemented on one or more storage media (including but not limited to magnetic disk storage, CD-ROM, and optical storage) that contain program code. Computer-readable storage media include permanent and non-permanent, removable and non-removable media, and may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer-readable storage media include, but are not limited to: phase-change memory (PRAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
It should be understood that each part of this application may be implemented in hardware, software, or a combination of the two. In the embodiments above, multiple steps or methods may be implemented by software or hardware that is stored in a memory and executed by a suitable instruction execution system. For example, a hardware implementation may use any one or a combination of the following technologies: a discrete logic circuit having logic gates for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gates, a programmable gate array (PGA), a field-programmable gate array (FPGA), and so on.
A person of ordinary skill in the art will understand that all or some of the steps of the method embodiments above may be carried out by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, performs one of the steps of the method embodiments or a combination of them.
It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another and do not necessarily require or imply any actual relationship or order between those entities or operations. The terms "comprise", "include", and any variants of them are intended to cover non-exclusive inclusion, so that a process, method, article, or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to that process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that comprises the element.
The methods and devices provided by the embodiments of the present invention have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present invention, and the description of the embodiments above is intended only to aid understanding of the method of the present invention and its core idea. At the same time, a person of ordinary skill in the art may, in accordance with the idea of the present invention, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.
The disclosure of this patent document contains material that is subject to copyright protection. The copyright belongs to the copyright owner. The copyright owner has no objection to reproduction by anyone of the patent document or the patent disclosure as it appears in the official records and files of the Patent and Trademark Office.

Claims (39)

  1. A sound recognition method for recognizing slap sounds, characterized in that the sound recognition method comprises:
    acquiring at least one sound segment of a sound signal to be recognized and first feature information of the sound segment, the first feature information being an energy value of the sound segment, and, if the energy value of a middle region of the sound segment is greater than an energy threshold, extracting second feature information from the sound segment; and
    recognizing, according to the second feature information of at least one sound segment, whether the sound signal to be recognized includes a slap sound.
  2. The sound recognition method according to claim 1, characterized in that extracting the second feature information from the sound segment if the energy value of the middle region of the sound segment is greater than the energy threshold comprises:
    performing framing and windowing on the sound signal to be recognized to obtain a plurality of sound frames corresponding to the sound signal to be recognized;
    if the energy value of the sound frames in the middle region of the plurality of sound frames corresponding to the sound segment is greater than the energy threshold, extracting the second feature information from the sound segment.
  3. The sound recognition method according to claim 2, characterized in that the energy value comprises a spectral value of the sound frame, and extracting the second feature information from the sound segment if the energy value of the sound frames in the middle region of the plurality of sound frames corresponding to the sound segment is greater than the energy threshold comprises:
    performing a fast Fourier transform on the plurality of sound frames to obtain spectral values of the plurality of sound frames;
    if the spectral value of the sound frames in the middle region of the plurality of sound frames corresponding to the sound segment is greater than the energy threshold, extracting the second feature information from the sound segment.
  4. The sound recognition method according to claim 3, characterized in that the window can slide sequentially across the plurality of sound frames, and extracting the second feature information from the sound segment if the spectral value of the sound frames in the middle region of the plurality of sound frames corresponding to the sound segment is greater than the energy threshold comprises:
    generating a trigger signal if the spectral value of the sound frames in the middle region of the window is greater than the energy threshold;
    as the window slides sequentially across several of the sound frames, extracting the second feature information from the sound segment if the number of consecutively generated trigger signals reaches a trigger-count threshold.
  5. The sound recognition method according to claim 3, characterized in that the window can slide sequentially across the plurality of sound frames, and extracting the second feature information from the sound segment if the spectral value of the sound frames in the middle region of the plurality of sound frames corresponding to the sound segment is greater than the energy threshold comprises:
    generating a trigger signal when the spectral value of the sound frames in the middle region of the window is greater than a first energy value and the spectral value of the sound frames in the two end regions of the window is less than a second energy value, wherein the first energy value is greater than the second energy value;
    as the window slides sequentially across several of the sound frames, extracting the second feature information from the sound segment if the number of consecutively generated trigger signals reaches a trigger-count threshold.
  6. The sound recognition method according to claim 1, characterized in that recognizing, according to the second feature information of at least one sound segment, whether the sound signal to be recognized includes a slap sound comprises:
    inputting the second feature information into a recognition model for recognition, so as to recognize whether the sound signal to be recognized includes a slap sound.
  7. The sound recognition method according to claim 6, characterized in that the second feature information comprises acoustic features, and the acoustic features comprise at least one of Mel-frequency cepstral coefficient features, linear prediction coefficient features, filterbank features, and bottleneck features.
  8. The sound recognition method according to claim 6, characterized in that the recognition model comprises a plurality of sound categories, and inputting the second feature information into the recognition model for recognition, so as to recognize whether the sound signal to be recognized includes a slap sound, comprises:
    determining the likelihood of the second feature information with respect to the feature information of each of the plurality of sound categories;
    sorting the likelihoods and determining the sound category with the highest likelihood as the category of the sound segment, so as to recognize whether the sound signal to be recognized includes a slap sound.
  9. The sound recognition method according to claim 8, characterized in that the sound categories comprise a slap sound category and a non-slap sound category.
  10. The sound recognition method according to claim 9, characterized in that the slap sound category comprises at least two slap sound categories representing different numbers of consecutive slaps.
  11. The sound recognition method according to claim 6, characterized in that the sound recognition method comprises: training the recognition model using slap sound training data and non-slap sound training data.
  12. The sound recognition method according to claim 11, characterized in that the slap sound training data comprises first slap sound training data and second slap sound training data, and the first slap sound training data and the second slap sound training data differ in at least one of the number of slaps, the duration of the slaps, and the frequency of the slaps that they represent;
    training the recognition model using the slap sound training data and the non-slap sound training data comprises:
    training the recognition model using the first slap sound training data and the second slap sound training data.
  13. The sound recognition method according to claim 6, characterized in that the recognition model comprises at least one of a deep model and a shallow model.
  14. The sound recognition method according to claim 13, characterized in that the deep model comprises at least one of the following: a deep neural network, a long short-term memory network, and a convolutional neural network.
  15. The sound recognition method according to claim 13, characterized in that the shallow model comprises a Gaussian mixture model-hidden Markov model.
  16. The sound recognition method according to claim 15, characterized in that the number of Gaussians of the Gaussian mixture model-hidden Markov model ranges from 3 to 12.
  17. The sound recognition method according to claim 15, characterized in that the Gaussian mixture model-hidden Markov model is trained using slap sound training data and non-slap sound training data, wherein the slap sound training data comprises first slap sound training data and second slap sound training data, and the first slap sound training data and the second slap sound training data differ in at least one of the number of slaps, the duration of the slaps, and the frequency of the slaps that they represent.
  18. The sound recognition method according to claim 17, characterized in that the first slap sound training data comprises slap sound training data of two consecutive slaps, and the number of states of the Gaussian mixture model-hidden Markov model corresponding to the first slap sound training data ranges from 6 to 14.
  19. The sound recognition method according to claim 17, characterized in that the second slap sound training data comprises slap sound training data of three consecutive slaps, and the number of states of the Gaussian mixture model-hidden Markov model corresponding to the second slap sound training data ranges from 9 to 21.
  20. The sound recognition method according to claim 17, characterized in that the number of states of the Gaussian mixture model-hidden Markov model corresponding to the non-slap sound training data ranges from 7 to 18.
  21. The sound recognition method according to claim 17, characterized in that training the Gaussian mixture model-hidden Markov model using the slap sound training data and the non-slap sound training data comprises:
    performing parameter estimation on the hidden Markov model.
  22. The sound recognition method according to claim 21, characterized in that the method for performing parameter estimation on the hidden Markov model comprises: the Baum-Welch algorithm and/or a genetic algorithm.
  23. The sound recognition method according to claim 17, characterized in that training the Gaussian mixture model-hidden Markov model using the slap sound training data and the non-slap sound training data comprises:
    training the Gaussian mixture model of the Gaussian mixture model-hidden Markov model multiple times.
  24. The sound recognition method according to claim 23, characterized in that the method for training the Gaussian mixture model of the Gaussian mixture model-hidden Markov model multiple times comprises: an expectation-maximization method or a maximum likelihood method.
  25. The sound recognition method according to claim 1, characterized in that the frequency range of the slap sound is 300 Hz to 8000 Hz.
  26. The sound recognition method according to claim 1, characterized in that the slap sound comprises at least one of a clapping sound and a knocking sound.
  27. The sound recognition method according to claim 1, characterized in that recognizing, according to the second feature information of at least one sound segment, whether the sound signal to be recognized includes a slap sound further comprises:
    when the sound signal to be recognized includes the slap sound, recognizing the category of the slap sound, the category of the slap sound corresponding to a respective control instruction; wherein the category of the slap sound comprises at least one of the number of slaps, the duration of the slaps, and the frequency of the slaps.
  28. An interaction method, characterized by comprising:
    acquiring a sound signal to be recognized;
    the sound recognition method according to any one of claims 1-27; and
    if the sound signal to be recognized is identified by the sound recognition method as including a slap sound, outputting a corresponding control instruction according to the slap sound.
  29. The interaction method according to claim 28, characterized in that the control instruction comprises a control instruction that controls a movable platform when the sound signal to be recognized includes the slap sound.
  30. The interaction method according to claim 29, characterized in that the control instruction comprises a control instruction that controls a vision system of the movable platform when the sound signal to be recognized includes the slap sound.
  31. The interaction method according to claim 30, characterized in that the control instruction comprises a control instruction that controls the vision system to start visual tracking and/or a control instruction that controls the vision system to end the visual tracking.
  32. The interaction method according to claim 28, characterized in that outputting the corresponding control instruction according to the slap sound comprises:
    obtaining at least one of the number of slaps, the duration of the slaps, and the frequency of the slaps in the slap sound;
    outputting different control instructions according to at least one of the number of slaps, the duration of the slaps, and the frequency of the slaps in the slap sound.
  33. A sound recognition system, characterized by comprising one or more processors for implementing the sound recognition method according to any one of claims 1-27.
  34. A computer-readable storage medium, characterized in that a program is stored thereon, and the program, when executed by a processor, implements the sound recognition method according to any one of claims 1-27.
  35. A movable platform, characterized by comprising:
    a body;
    a power system provided on the body and used to supply power to the movable platform;
    a microphone used to receive a sound to be recognized and generate a corresponding sound signal to be recognized; and
    one or more processors used to implement the sound recognition method according to any one of claims 1-27 and, if the sound signal to be recognized is identified by the sound recognition method as including a slap sound, to output a corresponding control instruction according to the slap sound.
  36. The movable platform according to claim 35, characterized in that the control instruction comprises a control instruction that controls the movable platform when the sound signal to be recognized includes the slap sound.
  37. The movable platform according to claim 36, characterized in that the movable platform comprises a vision system, and the control instruction comprises a control instruction that controls the vision system when the sound signal to be recognized includes the slap sound.
  38. The movable platform according to claim 37, characterized in that the control instruction comprises a control instruction that controls the vision system to start visual tracking and/or a control instruction that controls the vision system to end the visual tracking.
  39. The movable platform according to claim 35, characterized in that the processor is configured to:
    obtain at least one of the number of slaps, the duration of the slaps, and the frequency of the slaps in the slap sound;
    output different control instructions according to at least one of the number of slaps, the duration of the slaps, and the frequency of the slaps in the slap sound.
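The energy gate recited in claims 2, 4, and 5 can be illustrated with a short sketch. It makes several simplifying assumptions that are not taken from the claims: per-frame energy is the plain sum of squared samples rather than an FFT spectral value, no window function is applied to the frames, the "middle region" is the single central frame of the sliding window, and the "two end regions" are its first and last frames. All function names and default thresholds are hypothetical.

```python
from typing import List, Sequence


def frame_energies(samples: Sequence[float], frame_len: int, hop: int) -> List[float]:
    """Split the signal into frames and return the sum of squares of each frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame))
    return energies


def window_triggers(energies: Sequence[float], window: int = 5,
                    high: float = 1.0, low: float = 0.1) -> List[bool]:
    """One flag per window position, in the manner of claim 5.

    A position triggers when the central frame exceeds `high` (first energy
    value) and both end frames fall below `low` (second energy value).
    """
    mid = window // 2
    flags = []
    for i in range(len(energies) - window + 1):
        w = energies[i:i + window]
        flags.append(w[mid] > high and w[0] < low and w[-1] < low)
    return flags


def should_extract_features(flags: Sequence[bool], min_consecutive: int = 1) -> bool:
    """True once the run of consecutive triggers reaches the trigger-count threshold."""
    run = 0
    for flag in flags:
        run = run + 1 if flag else 0
        if run >= min_consecutive:
            return True
    return False
```

Requiring quiet end frames is what separates a short burst from sustained loud sound: a constant high-energy signal never satisfies the end-region condition, so the more expensive feature extraction and model inference are skipped for it.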
PCT/CN2019/086979 2019-05-15 2019-05-15 Sound recognition method, interaction method, sound recognition system, computer-readable storage medium and mobile platform WO2020227955A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2019/086979 WO2020227955A1 (en) 2019-05-15 2019-05-15 Sound recognition method, interaction method, sound recognition system, computer-readable storage medium and mobile platform
CN201980009292.6A CN111684522A (en) 2019-05-15 2019-05-15 Voice recognition method, interaction method, voice recognition system, computer-readable storage medium, and removable platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/086979 WO2020227955A1 (en) 2019-05-15 2019-05-15 Sound recognition method, interaction method, sound recognition system, computer-readable storage medium and mobile platform

Publications (1)

Publication Number Publication Date
WO2020227955A1 true WO2020227955A1 (en) 2020-11-19

Family

ID=72451467

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/086979 WO2020227955A1 (en) 2019-05-15 2019-05-15 Sound recognition method, interaction method, sound recognition system, computer-readable storage medium and mobile platform

Country Status (2)

Country Link
CN (1) CN111684522A (en)
WO (1) WO2020227955A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114186581A (en) * 2021-11-15 2022-03-15 国网天津市电力公司 Cable hidden danger identification method and device based on MFCC (Mel frequency cepstrum coefficient) and diffusion Gaussian mixture model
CN115798514B (en) * 2023-02-06 2023-04-21 成都启英泰伦科技有限公司 Knock detection method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0654370A (en) * 1992-05-26 1994-02-25 Gold Star Co Ltd Remote-controlled search device
CN101136135A (en) * 2006-08-28 2008-03-05 日本胜利株式会社 Control device of electronic equipment and control method of electronic equipment
CN102281484A (en) * 2010-04-07 2011-12-14 索尼公司 Audio signal processing apparatus, audio signal processing method, and program
CN107655143A (en) * 2017-09-25 2018-02-02 合肥艾斯克光电科技有限责任公司 A kind of indoor temperature control system based on voice recognition technology

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100677613B1 (en) * 2005-09-09 2007-02-02 삼성전자주식회사 Method for controlling operation of multimedia device and apparatus therefore
FI20095371A (en) * 2009-04-03 2010-10-04 Aalto Korkeakoulusaeaetioe A method for controlling the device
CN101957736A (en) * 2010-09-30 2011-01-26 汉王科技股份有限公司 Electronic reading device and control method thereof
CN102915728B (en) * 2011-08-01 2014-08-27 佳能株式会社 Sound segmentation device and method and speaker recognition system
CN102982804B (en) * 2011-09-02 2017-05-03 杜比实验室特许公司 Method and system of voice frequency classification
CN104766610A (en) * 2015-04-07 2015-07-08 马业成 Voice recognition system and method based on vibration
KR101892794B1 (en) * 2015-08-25 2018-08-28 엘지전자 주식회사 Refrigerator
CN109712641A (en) * 2018-12-24 2019-05-03 重庆第二师范学院 A kind of processing method of audio classification and segmentation based on support vector machines

Also Published As

Publication number Publication date
CN111684522A (en) 2020-09-18

Similar Documents

Publication Publication Date Title
KR102134201B1 (en) Method, apparatus, and storage medium for constructing speech decoding network in numeric speech recognition
US10872602B2 (en) Training of acoustic models for far-field vocalization processing systems
CN110364143B (en) Voice awakening method and device and intelligent electronic equipment
Giri et al. Improving speech recognition in reverberation using a room-aware deep neural network and multi-task learning
US8930196B2 (en) System for detecting speech interval and recognizing continuous speech in a noisy environment through real-time recognition of call commands
EP3553773A1 (en) Training and testing utterance-based frameworks
US11069352B1 (en) Media presence detection
CN106710599A (en) Particular sound source detection method and particular sound source detection system based on deep neural network
KR20120054845A (en) Speech recognition method for robot
US11393473B1 (en) Device arbitration using audio characteristics
TWI711035B (en) Method, device, audio interaction system, and storage medium for azimuth estimation
JP4964204B2 (en) Multiple signal section estimation device, multiple signal section estimation method, program thereof, and recording medium
KR101065188B1 (en) Apparatus and method for speaker adaptation by evolutional learning, and speech recognition system using thereof
WO2020227955A1 (en) Sound recognition method, interaction method, sound recognition system, computer-readable storage medium and mobile platform
CN108665907B (en) Voice recognition device, voice recognition method, recording medium, and robot
CN109065026B (en) Recording control method and device
US11580955B1 (en) Synthetic speech processing
JP2021033051A (en) Information processing device, information processing method and program
US11521635B1 (en) Systems and methods for noise cancellation
CN112951219A (en) Noise rejection method and device
US20230260501A1 (en) Synthetic speech processing
Nakadai et al. A robot referee for rock-paper-scissors sound games
Loh et al. Speech recognition interactive system for vehicle
US11670326B1 (en) Noise detection and suppression
Tuasikal et al. Voice activation using speaker recognition for controlling humanoid robot

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 19929099; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 19929099; Country of ref document: EP; Kind code of ref document: A1)