CN117958654A - Cleaning robot and voice control method and device thereof - Google Patents

Cleaning robot and voice control method and device thereof

Info

Publication number
CN117958654A
CN117958654A (application CN202311360331.1A)
Authority
CN
China
Prior art keywords
voice
directional microphone
signal
cleaning robot
original voice
Prior art date
Legal status
Pending
Application number
CN202311360331.1A
Other languages
Chinese (zh)
Inventor
罗杰
方义
马峰
Current Assignee
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN202311360331.1A
Publication of CN117958654A

Abstract

The application discloses a cleaning robot and a voice control method and device thereof. Because a directional microphone is directive, it can suppress noise arriving from behind its axis, that is, the main noise sources at the bottom of the robot, which improves the signal-to-noise ratio of the collected original voice signals. The processor calculates, for each directional microphone, the signal energy of the collected original voice signal within the frequency band where human voice lies, selects the candidate original voice signals with the largest signal energy, detects whether each candidate original voice signal satisfies a set voice control condition, and, if so, executes the action matching that condition. By means of original voice signals with a high signal-to-noise ratio, the voice wake-up rate and the command word recognition rate can be improved. In addition, the application relies mainly on acoustic localization, which avoids the drawback of camera-based human detection and positioning, namely that the real speaker cannot be distinguished when multiple people are present.

Description

Cleaning robot and voice control method and device thereof
Technical Field
The application relates to the technical field of intelligent equipment control, in particular to a cleaning robot and a voice control method and device thereof.
Background
With the improvement of living standards, cleaning robots are favored by more and more consumers. Such robots free people's hands and are very convenient to operate. In addition to controlling the robot through a mobile phone APP, voice control has become a basic function of the cleaning robot. To receive sound from all directions around the cleaning robot, the prior art generally arranges an omnidirectional microphone on the side of the cleaning robot to collect the interactive voice of surrounding users.
However, a home cleaning robot, especially a sweeping robot, produces strong self-noise during operation, such as noise from its internal motor, external rollers, and sweeping components. Because the omnidirectional microphone on the cleaning robot is very close to these noise sources, the signal-to-noise ratio of the collected voice signal is low, which significantly reduces the voice wake-up rate and recognition rate and thus hampers the user's voice control.
Disclosure of Invention
In view of the above problems, the present application provides a cleaning robot and a voice control method and apparatus thereof, so as to reduce the adverse effect of the cleaning robot's self-noise on voice control and improve the voice wake-up and recognition success rates. The specific scheme is as follows:
In a first aspect, there is provided a cleaning robot including:
The directional microphone array is arranged at the top of the cleaning robot body, and the pickup directions of different directional microphones in the directional microphone array are different;
The processor is used for acquiring original voice signals acquired by each directional microphone in the directional microphone array; respectively calculating the signal energy of the original voice signal acquired by each directional microphone in a first frequency band, wherein the first frequency band is a frequency band where the set voice is located; selecting a plurality of candidate original voice signals with the maximum signal energy; and respectively detecting whether each candidate original voice signal meets a set voice control condition, and if so, executing the action matched with the set voice control condition.
Preferably, the axial direction of each directional microphone in the directional microphone array forms an included angle θ with respect to the top plane of the body.
Preferably, the included angle θ is configured to be adjustable by a user.
Preferably, the cleaning robot further comprises:
the laser radar LDS panel is arranged at the top of the body;
the directional microphone array is disposed above the LDS panel.
Preferably, the directional microphone array is a circular directional microphone array.
In a second aspect, a cleaning robot voice control method is provided, applied to a voice control process of the cleaning robot, and the method includes:
acquiring original voice signals collected by each directional microphone in a directional microphone array;
Respectively calculating the signal energy of the original voice signal acquired by each directional microphone in a first frequency band, wherein the first frequency band is a frequency band where the set voice is located;
selecting a plurality of candidate original voice signals with the maximum signal energy;
and respectively detecting whether each candidate original voice signal meets a set voice control condition, and if so, executing the action matched with the set voice control condition.
Preferably, the set voice control condition includes: a wake-up condition and/or a sound source localization condition;
the act matching the wake-up condition includes: waking up the cleaning robot;
the act matching the sound source localization condition includes: screening the target directional microphone corresponding to the candidate original voice signal that satisfies the sound source localization condition, and taking the pickup direction of the target directional microphone as the direction of the user's sound source.
Preferably, the step of detecting whether each candidate original speech signal satisfies a set speech control condition, if yes, executing an action matching the set speech control condition includes:
respectively sending each candidate original voice signal into a configured voice recognition model, and obtaining, from the voice recognition model's output, the probability that each candidate original voice signal contains a set wake-up word and/or the probability that it contains a set sound-source-localization command word;
if the probability that at least one candidate original voice signal contains the set wake-up word exceeds a first probability threshold, waking up the cleaning robot;
screening the candidate original voice signal whose probability of containing the set sound-source-localization command word exceeds a second probability threshold and is the largest, and taking the pickup direction of the target directional microphone corresponding to the screened candidate original voice signal as the user's sound source direction.
Preferably, the training process of the speech recognition model includes:
obtaining wake-up word audio samples, sound-source-localization word audio samples, and generic-word and noise audio samples, and forming a training sample set from these samples;
extracting acoustic characteristics of each sample in the training sample set, and sending the acoustic characteristics into a voice recognition model to obtain a model recognition result;
based on the model recognition result, training the voice recognition model with a three-class loss function, with wake-up words as one class, sound-source-localization words as a second class, and generic words and noise as a third class.
Preferably, before calculating the signal energy of the original voice signal collected by each directional microphone in the first frequency band, the method further includes:
performing noise reduction on the original voice signals collected by each directional microphone by using a preconfigured noise reduction model.
Preferably, when the axial direction of each directional microphone in the directional microphone array forms an angle θ with respect to the top plane of the cleaning robot body, the method further comprises:
Receiving an instruction for adjusting the angle of a target directional microphone;
And adjusting the included angle theta of the target directional microphone according to the requirement of the instruction.
In a third aspect, there is provided a cleaning robot voice control apparatus, applied to a processor of the aforementioned cleaning robot, comprising:
The original voice signal acquisition unit is used for acquiring original voice signals acquired by each directional microphone in the directional microphone array;
the signal energy calculating unit is used for calculating the signal energy of the original voice signal acquired by each directional microphone in a first frequency band, wherein the first frequency band is a frequency band where the set voice is located;
the signal energy screening unit is used for selecting a plurality of candidate original voice signals with the maximum signal energy;
And the condition detection unit is used for respectively detecting whether each candidate original voice signal meets the set voice control condition, and if so, executing the action matched with the set voice control condition.
By means of the above technical scheme, a directional microphone array is arranged at the top of the cleaning robot body, with different directional microphones picking up sound from different directions, so that interactive voice from users in any direction around the cleaning robot is collected. Because a directional microphone is directive, it suppresses noise arriving from behind its axis. Mounting the array on top of the body points the rear of each microphone's axis toward the bottom of the body, where the main noise sources are located, so the cleaning robot's self-noise is greatly suppressed in the collected voice signals and their signal-to-noise ratio is greatly improved. Arranging multiple directional microphones with different pickup directions also covers the various possible directions of the user.
Further, the processor acquires the original voice signals collected by the directional microphones, calculates each signal's energy in the first frequency band where the set voice lies, selects the several candidate original voice signals with the largest energy, and detects whether each candidate satisfies the set voice control condition; if so, the matching action is executed. The set voice control condition may be a wake-up condition, a sound source localization condition, or the like, and by means of original voice signals with a high signal-to-noise ratio the voice wake-up rate and recognition rate can be greatly improved. In addition, the application relies mainly on acoustic localization, which avoids the drawback of camera-based localization that the real speaker cannot be distinguished when multiple people are present, and improves sound source localization accuracy.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
Fig. 1 is a schematic perspective view of a cleaning robot according to an example of an embodiment of the present application;
Fig. 2 is a top view of a cleaning robot according to an example of an embodiment of the present application;
Fig. 3 is a rear view of a cleaning robot according to an example of an embodiment of the present application;
Fig. 4 is a schematic flow chart of a voice control method of a cleaning robot according to an embodiment of the present application;
Fig. 5 is a schematic structural diagram of a voice control device for a cleaning robot according to an example of an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The application introduces a cleaning robot and a voice control scheme thereof, which reduce the influence of the cleaning robot's self-noise and improve the signal-to-noise ratio of the collected voice signals. On this basis, the subsequent wake-up and command-word recognition are performed on higher-SNR voice signals, improving the wake-up success rate and the command-word recognition accuracy.
In addition, some existing high-end cleaning robots offer a "call to clean" function: when the user issues an instruction, the cleaning robot identifies the user in the environment through a camera and then moves to the user's position to clean. However, locating the user purely with a camera easily fails to distinguish the real speaker when multiple people are present. With the directional microphone array, acoustic localization can serve as the main basis: judging the signal energy of the original voice signals collected by the different directional microphones within the human-voice band gives a preliminary estimate of the user's possible direction, and further condition detection, such as recognizing sound-source-localization command words in each candidate original voice signal, pins down the user's direction accurately.
Next, the structure of the cleaning robot of the present application will be described first.
As shown in Figs. 1 to 3, a directional microphone array 101 is provided on top of the cleaning robot body 100. The directional microphone array 101 includes a plurality of directional microphones, such as the 6 directional microphones illustrated in Fig. 1; of course, the number of directional microphones may be adjusted, for example according to the pickup angle of each microphone. Different directional microphones have different pickup directions so as to cover users in different directions. In a typical use scene of the cleaning robot, the voice signals received by the directional microphone array have a higher signal-to-noise ratio than those of an omnidirectional microphone in the mid-to-high frequency part of speech (the band where human voice generally lies), which is friendlier to voice wake-up and recognition.
Further, the cleaning robot comprises a processor for acquiring the raw speech signals collected by each directional microphone in the directional microphone array 101 and calculating, for each microphone, the signal energy of its original voice signal in a first frequency band, the frequency band where the set voice lies. In general the first frequency band may be mid-to-high frequency, for example 1000 Hz to 5000 Hz.
The processor selects a plurality of candidate original voice signals with the largest signal energy, and the pickup direction of the directional microphone corresponding to the candidate original voice signals is the direction of a possible user. And further detecting whether each candidate original voice signal meets a set voice control condition or not respectively, and if so, executing the action matched with the set voice control condition.
The set voice control condition may include a wake-up condition, a sound source localization condition, and the like. That is, the user may issue a wake-up instruction, or, after waking the cleaning robot, further issue a task instruction, such as cleaning the area where the user is located, which requires the robot to accurately locate the direction of the user's sound source.
According to the application, the directional microphone array is arranged at the top of the cleaning robot body, with different directional microphones picking up sound from different directions, so that interactive voice from users in any direction around the cleaning robot is collected. Because a directional microphone is directive, it suppresses noise arriving from behind its axis. Mounting the array on top of the body points the rear of each microphone's axis toward the bottom of the body, where the main noise sources are located, so the cleaning robot's self-noise is greatly suppressed in the collected voice signals and their signal-to-noise ratio is greatly improved. Arranging multiple directional microphones with different pickup directions also covers the various possible directions of the user.
Further, the processor acquires the original voice signals collected by the directional microphones, calculates each signal's energy in the first frequency band where the set voice lies, selects the several candidate original voice signals with the largest energy, and detects whether each candidate satisfies the set voice control condition; if so, the matching action is executed. The set voice control condition may be a wake-up condition, a sound source localization condition, or the like, and by means of original voice signals with a high signal-to-noise ratio the voice wake-up rate and recognition rate can be greatly improved. In addition, the application relies mainly on acoustic localization, which avoids the drawback of camera-based localization that the real speaker cannot be distinguished when multiple people are present, and improves sound source localization accuracy.
As further shown in Figs. 1 to 3, the cleaning robot may further include a laser radar LDS panel 102 disposed on the top of the body 100; the laser radar collects surrounding information for navigation path planning and the like. The LDS panel 102 is raised relative to the top of the body 100, and the directional microphone array 101 can be disposed on this raised panel 102, moving the array farther from the robot's noise sources and further reducing noise interference.
In addition, the directional microphone array 101 may be configured as a ring, with the directional microphones arranged at equal intervals around the circumference, so that every direction around the robot is covered: no matter from which direction the user speaks an instruction to the cleaning robot, some of the directional microphones will point toward the user to pick up sound.
As further shown in Fig. 3, the axis of each directional microphone in the directional microphone array 101 forms an angle θ with the top plane of the body 100. The direction pointing obliquely upward along the axis is the pickup direction, which forms the angle θ with the horizontal plane. Fig. 3 illustrates the array disposed on the LDS panel, with the angle θ measured between each microphone's axis and the LDS panel; in general the LDS panel is parallel to the top plane of the body 100.
The angle θ between each directional microphone's axis and the horizontal plane may be the same or different across the array. It may be fixed at the factory, or it may support user adjustment: the user can manually adjust each directional microphone's angle, or issue an angle-adjustment instruction through the APP control interface or by voice. Making the angle θ adjustable accommodates different users' preferences, so that the microphones' orientation better matches the direction of the user's sound source.
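For intuition only (this sketch is not part of the patent text), a microphone's pickup direction can be expressed as a unit vector derived from its azimuth on the ring and its tilt angle θ above the horizontal plane:

```python
import math

def pickup_direction(azimuth_deg, theta_deg):
    """Unit vector (x, y, z) of a microphone at the given azimuth on the
    ring, tilted theta degrees above the horizontal (top-plane) direction."""
    az = math.radians(azimuth_deg)
    th = math.radians(theta_deg)
    return (math.cos(th) * math.cos(az),
            math.cos(th) * math.sin(az),
            math.sin(th))

# theta = 90 degrees would point straight up, losing directivity toward
# the user; a moderate tilt keeps a horizontal pickup component.
x, y, z = pickup_direction(90, 30)
assert abs(z - 0.5) < 1e-9   # sin(30 deg) = 0.5
assert abs(x) < 1e-9         # azimuth 90 deg has no x component
```

The specific azimuth and tilt values here are illustrative; the patent only requires that θ exists and may be adjustable.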
On the basis of the cleaning robot structure described in the above embodiments, the voice control logic of the cleaning robot is further described in this embodiment from the viewpoint of the processor of the cleaning robot. As shown in fig. 4, the cleaning robot voice control method may include the steps of:
Step S100, acquiring original voice signals acquired by each directional microphone in the directional microphone array.
Specifically, the original voice signal collected by each directional microphone can be treated as one path, so the directional microphone array yields N paths of original voice signals, where N equals the number of directional microphones in the array.
It should be noted that if a directional microphone in the array is not activated, or is faulty, no original voice signal will be obtained from it.
Further, the processor maintains the correspondence between each path of original voice signal and its directional microphone, i.e., the source directional microphone can be determined from any signal path.
Step S110, respectively calculating signal energy of original voice signals collected by each directional microphone in a first frequency band, wherein the first frequency band is a frequency band where the set voice is located.
Specifically, the first frequency band may generally be set to mid-to-high frequency, such as 1000 Hz to 5000 Hz. In this part of the speech spectrum, the voice signals received by the directional microphone array have a higher signal-to-noise ratio than those received by an omnidirectional microphone, which is friendlier to voice wake-up and recognition. In this step, the signal energy of the original voice signal collected by each directional microphone in the first frequency band is calculated respectively.
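The band-energy computation of step S110 can be sketched as follows. This is an illustrative sketch, not part of the patent; the sample rate, frame length, and windowing choice are assumptions:

```python
import numpy as np

def band_energy(frame, sample_rate=16000, f_lo=1000.0, f_hi=5000.0):
    """Energy of one audio frame restricted to the [f_lo, f_hi] Hz band."""
    windowed = frame * np.hanning(len(frame))
    spectrum = np.fft.rfft(windowed)
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    band = (freqs >= f_lo) & (freqs <= f_hi)
    return float(np.sum(np.abs(spectrum[band]) ** 2))

# A 2 kHz tone (inside the band, voice-like) carries far more band
# energy than a 100 Hz tone (outside it, typical of motor noise).
t = np.arange(16000) / 16000.0
voice_like = np.sin(2 * np.pi * 2000.0 * t)
noise_like = np.sin(2 * np.pi * 100.0 * t)
assert band_energy(voice_like) > 1000 * band_energy(noise_like)
```

In practice the energy would be computed per frame and per microphone channel and smoothed over time; those details are not specified in the patent.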
Step S120, selecting a plurality of candidate original voice signals with the largest signal energy.
Specifically, the top-N candidate original speech signals with the largest signal energy may be selected, or the top N% may be selected. Taking top-N selection as an example, N may be 3 or another value.
The pickup direction of the directional microphone corresponding to a selected candidate original speech signal can be regarded as a possible direction of the user's sound source. It should be understood that if multiple users speak at the same time, the pickup directions of the selected directional microphones may cover the sound source directions of several different users, among which may be the target user who actually issued the voice control command, or which may all be interfering sound sources.
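The top-N selection of step S120 is straightforward; in this sketch (N = 3 as in the example above, energy values invented for illustration) the function returns the indices of the microphone channels that are kept as candidates:

```python
import numpy as np

def select_candidates(energies, n=3):
    """Indices of the n channels whose original voice signals have the
    largest band energy (step S120)."""
    order = np.argsort(energies)[::-1]  # channel indices, descending energy
    return sorted(order[:n].tolist())

# Six microphones; energies measured in the 1-5 kHz voice band.
energies = [0.2, 3.1, 7.5, 0.4, 2.8, 0.1]
assert select_candidates(energies, n=3) == [1, 2, 4]
```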
Step S130, detecting whether each candidate original voice signal meets the set voice control condition, if so, executing the action matched with the set voice control condition.
Specifically, as described in the previous step, multiple candidate original speech signals may be selected. To determine whether each candidate contains a voice control instruction issued by the target user, this step performs condition detection on each candidate original speech signal; if a candidate satisfies the set voice control condition, it is considered to contain the target user's voice control instruction, and the action matching that condition is executed.
Wherein, the set voice control conditions may include a wake-up condition, a sound source localization condition, and the like.
Taking a wake-up condition as an example, the corresponding matching actions may include: the cleaning robot is awakened to enter a command recognition stage.
Taking the sound source localization condition as an example, the corresponding matching actions may include: screening target directional microphones corresponding to candidate original voice signals meeting sound source positioning conditions, and taking the pickup direction of the target directional microphones as the direction of a user sound source.
An alternative implementation of step S130 is provided in this embodiment.
Specifically, in this embodiment, a speech recognition model may be trained in advance, so as to detect whether the candidate original speech signal includes a set wake-up word and a set sound source localization command word.
There may be a single speech recognition model, i.e., one voice recognition model detects whether a candidate original voice signal contains both the set wake-up word and the set sound-source-localization command word.
Alternatively, there may be two voice recognition models: one detects whether the candidate original voice signals contain the set wake-up word, and the other detects whether they contain the set sound-source-localization command word.
In this embodiment, a case of a speech recognition model is taken as an example for illustration, and the training process of the speech recognition model may include:
S1, obtaining wake-up word audio samples, sound-source-localization word audio samples, and generic-word and noise audio samples, and forming a training sample set from these samples.
The wake-up words are preset words, such as "XX hello" or "hello XX". The sound-source-localization words are also preset, such as "clean over here" or "come to me".
S2, extracting acoustic characteristics of each sample in the training sample set, and sending the acoustic characteristics into a voice recognition model to obtain a model recognition result.
Specifically, this embodiment may adopt an end-to-end speech recognition scheme. The acoustic features extracted for each sample may be filter-bank features or other acoustic features; the extracted features are fed into the speech recognition model, which outputs the probability that the speech segment contains a wake-up word and the probability that it contains a sound-source-localization word.
S3, training the voice recognition model with a three-class loss function based on the model recognition result, where the wake-up words form one class, the sound-source-localization words a second class, and generic words and noise a third class.
Specifically, the training process can adopt a three-class cross-entropy loss function: wake-up words are the first class, sound-source-localization command words the second, and other generic words and noise the third, which also helps prevent model overfitting.
Through the training, the voice recognition model can recognize the probability of the input voice signal belonging to three types of labels.
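The three-class cross-entropy loss described above can be written out as a minimal sketch. The model itself is reduced here to arbitrary logits with a softmax on top; the class numbering (0 = wake-up word, 1 = localization command word, 2 = generic word/noise) is an assumption for illustration:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def three_class_ce(logits, labels):
    """Mean cross-entropy over 3 classes: 0 = wake word,
    1 = localization command word, 2 = generic word / noise."""
    probs = softmax(logits)
    return float(-np.mean(np.log(probs[np.arange(len(labels)), labels])))

# Confident, correct predictions give a small loss; a uniform
# (uninformed) model gives exactly log(3) per sample.
good = np.array([[5.0, 0.0, 0.0], [0.0, 0.0, 5.0]])
labels = np.array([0, 2])
assert three_class_ce(good, labels) < 0.1
uniform = np.zeros((2, 3))
assert abs(three_class_ce(uniform, labels) - np.log(3)) < 1e-9
```

A real training loop would backpropagate this loss through the acoustic model; that machinery is omitted here.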
On this basis, if the above step S130 is a scenario of detecting the wake-up condition, the specific steps may include:
And respectively sending each candidate original voice signal into the voice recognition model to obtain the probability that each candidate original voice signal output by the voice recognition model contains the set wake-up word.
If the probability that at least one candidate original voice signal contains the set wake-up word exceeds the first probability threshold, the cleaning robot is woken up.
Specifically, if the probability that a candidate original voice signal contains the set wake-up word exceeds the first probability threshold, it indicates that the target user has currently issued a wake-up instruction, so the cleaning robot can be woken up.
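The wake-up decision reduces to a simple threshold test over the candidates' wake-word probabilities. A sketch follows; the threshold value 0.8 is an assumption, since the patent does not specify the first probability threshold:

```python
def should_wake(wake_probs, first_threshold=0.8):
    """Wake the robot if any candidate signal's wake-word probability
    exceeds the threshold (threshold value assumed, not from the patent)."""
    return any(p > first_threshold for p in wake_probs)

assert should_wake([0.1, 0.92, 0.3]) is True
assert should_wake([0.1, 0.40, 0.3]) is False
```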
Further, if the cleaning robot is currently known to be in the wake-up state, the step S130 is a scenario of detecting the sound source positioning condition, and the specific steps may include:
and respectively sending each candidate original voice signal into the voice recognition model to obtain the probability that each candidate original voice signal output by the voice recognition model contains the set sound source positioning command word.
Screening the candidate original voice signal whose probability of containing the set sound-source-localization command word exceeds a second probability threshold and is the largest, and taking the pickup direction of the target directional microphone corresponding to the screened candidate original voice signal as the user's sound source direction.
For example, assume 6 directional microphones whose pickup directions are 30°, 90°, 150°, 210°, 270°, and 330°, respectively. If, according to this step, the candidate original voice signal collected by directional microphone No. 2 has the largest probability of containing the sound-source-localization command word and that probability exceeds the second probability threshold, the pickup direction of microphone No. 2 (90°) is taken as the user's sound source direction, and the cleaning robot can be controlled to rotate 90° so as to face that direction.
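The localization step in this six-microphone example can be sketched as below. The second probability threshold of 0.8 and the probability values are illustrative assumptions:

```python
def locate_user(mic_azimuths, loc_probs, second_threshold=0.8):
    """Return the pickup azimuth (degrees) of the microphone whose
    localization-command probability is largest and exceeds the
    threshold; return None if no candidate qualifies."""
    best = max(range(len(loc_probs)), key=lambda i: loc_probs[i])
    if loc_probs[best] <= second_threshold:
        return None
    return mic_azimuths[best]

# Six mics at 30..330 degrees as in the example; the second mic
# (index 1, pickup direction 90 degrees) wins.
azimuths = [30, 90, 150, 210, 270, 330]
probs = [0.05, 0.91, 0.10, 0.02, 0.30, 0.01]
assert locate_user(azimuths, probs) == 90
assert locate_user(azimuths, [0.1] * 6) is None
```

The returned azimuth would then drive the robot's rotation toward the user, as the example in the text describes.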
In yet another scenario, if step S130 detects the wake-up condition and the sound source localization condition simultaneously, the specific steps may include:
Each candidate original voice signal is sent to the voice recognition model, which outputs, for each candidate original voice signal, the probability that it contains the set wake-up word and the probability that it contains the set sound source localization command word.
If the probability that at least one candidate original voice signal contains the set wake-up word exceeds the first probability threshold, the cleaning robot is awakened.
The candidate original voice signal whose probability of containing the set sound source localization command word exceeds the second probability threshold and is the largest is selected, and the pickup direction of the target directional microphone corresponding to the selected candidate original voice signal is taken as the direction of the user's sound source.
In the method provided by this embodiment, the pre-trained voice recognition model recognizes the wake-up word and the sound source localization command word in the candidate original voice signals, so that both whether the wake-up condition is met and the sound source direction of the target user who issued the voice control instruction can be judged accurately.
It can be understood that the sound source localization provided by this embodiment of the application may further be fused with a scheme that performs sound source localization based on images acquired by a camera; fusing the two sound source localization algorithms can further improve the accuracy of sound source localization. Many fusion strategies are possible and are not detailed in this embodiment.
In some embodiments of the application, in order to further improve the signal-to-noise ratio of the original voice signals collected by the directional microphones, a noise reduction step may be added before the signal energy of the original voice signal in the first frequency band is calculated in the foregoing steps. The signal energy in the first frequency band is then calculated on the noise-reduced voice signal, reducing the influence of noise on the accuracy of the calculation result.
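The in-band energy computation referred to above can be sketched as follows; the 300–3400 Hz band is an assumed stand-in for the "frequency band where the set voice is", since the application does not fix concrete band edges:

```python
import numpy as np

def band_energy(signal, fs, band=(300.0, 3400.0)):
    """Energy of `signal` restricted to `band` (Hz), computed from the
    discrete spectrum. The band edges are illustrative assumptions."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    return float(np.sum(np.abs(spectrum[in_band]) ** 2))

# A 1 kHz tone falls inside the assumed voice band; a 100 Hz hum does not,
# so ranking microphones by this energy favors signals dominated by speech.
fs = 16000
t = np.arange(fs) / fs
speech_like = np.sin(2 * np.pi * 1000 * t)
hum = np.sin(2 * np.pi * 100 * t)
```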
Next, a procedure of performing noise reduction processing on an original speech signal in this embodiment will be described.
A pre-configured noise reduction model may be used to perform noise reduction processing on the original speech signal collected by each directional microphone.
The noise reduction model is trained with noisy voice training signals, each containing the cleaning robot's self-noise and a clean word voice signal (a clean wake-up word and/or a clean sound source localization command word), as training samples, and with the proportion that the clean word voice signal occupies in the noisy voice training signal as the sample label.
The original voice signal is input into the noise reduction model, which outputs the target proportion that the useful clean word voice signal occupies in the original voice signal. The noise-reduced voice signal is then determined from the original voice signal based on this target proportion.
Because the cleaning robot's self-noise has been filtered out of the noise-reduced voice signal, performing the subsequent wake-up and recognition processes on the noise-reduced voice signal improves the wake-up success rate and the command word recognition accuracy.
Furthermore, because the training samples contain only the set clean word voice signals apart from the device's self-noise, the noise reduction model can be understood as a deeply customized model: it performs noise reduction only for the set clean words (which may include the wake-up word and the sound source localization command word). Its noise reduction effect is therefore better, which further improves the device wake-up success rate and the command word recognition accuracy.
Next, a training process of the noise reduction model will be described.
S1, firstly, a plurality of noisy speech training signals are obtained.
Each noisy voice training signal comprises a recorded self-noise signal of the cleaning robot in the working state, plus a voice signal obtained by convolving clean word speech with an impulse response simulating the sound source and the microphone device on the cleaning robot at different distances and in different environments.
Specifically, the application can simulate the impulse response I generated between the sound source and the microphone device on the cleaning robot at different distances and in different environments, and convolve it with the preset clean word speech s to obtain a convolved voice signal. The recorded self-noise signal n of the cleaning robot in the working state is then added to the convolved voice signal, and the result is used as the noisy voice training signal y, as in the following formula:
y=s*I+a*n
where a is a set parameter; by controlling the value of a, the proportion of the noise signal relative to the clean word speech signal in the generated noisy voice training signal y can be adjusted, so that multiple noisy voice training signals y with different signal-to-noise ratios are generated.
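The generation of noisy training signals with different signal-to-noise ratios can be sketched as follows. The clean speech, impulse response and self-noise below are toy stand-ins for the simulated and recorded data described above, not real recordings:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_noisy_sample(s, impulse_response, n, a):
    """Generate one noisy training signal y = s * I + a * n (formula above):
    clean word speech convolved with a simulated impulse response, plus
    scaled recorded robot self-noise. The parameter a controls the SNR."""
    reverberant = np.convolve(s, impulse_response)[: len(s)]
    return reverberant + a * n[: len(reverberant)]

# Illustrative stand-ins (assumed data, 1 s at 16 kHz):
s = rng.standard_normal(16000)          # "clean word speech"
I = np.exp(-np.arange(512) / 64.0)      # toy decaying impulse response
n = rng.standard_normal(16000)          # "robot self-noise"

# Vary a to produce training signals with different signal-to-noise ratios.
samples = [make_noisy_sample(s, I, n, a) for a in (0.1, 0.5, 1.0)]
```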
S2, for each noisy speech training signal y obtained above:
Determine the proportion mask that the clean word speech s occupies in the noisy voice training signal y, and generate a piece of training data from the noisy voice training signal y and the corresponding proportion mask.
S3, training a noise reduction model by adopting each piece of training data.
In some embodiments of the present application, an alternative implementation manner of the step S2 is described, which is specifically as follows:
for each noisy speech training signal y:
S21, perform a short-time Fourier transform on the noisy voice training signal y and on the clean word speech s, and take the magnitudes to obtain the magnitude spectrum Y corresponding to the noisy voice training signal y and the magnitude spectrum S corresponding to the clean word speech s.
S22, take the ratio mask of the magnitude spectrum S to the magnitude spectrum Y as the proportion that the clean word speech s occupies in the noisy voice training signal y, and form a piece of training data from the magnitude spectrum Y of the noisy voice training signal y and the corresponding proportion mask.
mask=S/Y
In this embodiment, the ratio of the magnitude spectrum of the clean word speech to the magnitude spectrum of the noisy voice training signal in the frequency domain is used as the mask, and the magnitude spectrum Y of the noisy voice training signal together with the mask forms one piece of training data. When the noise reduction model is trained on such data, it learns the mapping between the magnitude spectrum of the input signal and the mask.
Besides the above manner, when calculating the proportion mask that the clean word speech s occupies in the noisy voice training signal y, the ratio between the energy spectra, or between the Mel spectra, of the clean word speech s and the noisy voice training signal y may also be used as the mask.
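A minimal sketch of the mask computation in steps S21–S22, using SciPy's STFT; the small epsilon guarding against division by zero is an implementation detail assumed here, not stated in the text:

```python
import numpy as np
from scipy.signal import stft

def magnitude_mask(clean, noisy, fs=16000):
    """Ratio mask = |S| / |Y| over STFT magnitude spectra, as in S21-S22.
    The epsilon avoids division by zero in silent bins (assumed detail)."""
    _, _, S = stft(clean, fs=fs, nperseg=512)   # spectrum of clean word speech s
    _, _, Y = stft(noisy, fs=fs, nperseg=512)   # spectrum of noisy training signal y
    return np.abs(S) / (np.abs(Y) + 1e-8)
```

For example, if the noisy signal were exactly twice the clean signal, the mask would be 0.5 in every significant time-frequency bin.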
Based on the noise reduction model trained in the above embodiment, the process in the foregoing step of using the pre-trained noise reduction model to process the original voice signal, obtain the target proportion output by the model, and determine the noise-reduced voice signal from the original voice signal based on that target proportion may include:
Perform a short-time Fourier transform on the original voice signal and take the magnitude spectrum, obtaining the target magnitude spectrum corresponding to the original voice signal.
Input the target magnitude spectrum into the noise reduction model to obtain the target proportion output by the noise reduction model.
Multiply the target proportion by the target magnitude spectrum of the original voice signal and perform an inverse short-time Fourier transform to obtain the noise-reduced voice signal.
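The inference path just described can be sketched as follows. Reusing the noisy signal's phase for reconstruction is a common assumption not stated explicitly in the text, and `model_predict` stands in for the trained noise reduction model (assumed interface: magnitude spectrum in, same-shape mask out):

```python
import numpy as np
from scipy.signal import stft, istft

def denoise(original, model_predict, fs=16000):
    """Apply a predicted ratio mask (the 'target proportion') to the original
    signal's magnitude spectrum and reconstruct by inverse STFT."""
    _, _, Z = stft(original, fs=fs, nperseg=512)
    magnitude, phase = np.abs(Z), np.angle(Z)
    mask = model_predict(magnitude)              # target proportion per bin
    _, cleaned = istft(mask * magnitude * np.exp(1j * phase), fs=fs, nperseg=512)
    return cleaned
```

With an identity mask (all ones), the function reduces to an STFT round trip and returns the input signal unchanged, which is a convenient sanity check.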
With reference to the structure of the cleaning robot described in the foregoing embodiments, the orientation of each directional microphone in the directional microphone array forms an angle θ with the horizontal plane, and the magnitude of the angle θ is adjustable. This embodiment provides a method for adjusting the angle θ, which may specifically include:
Receive an instruction issued by the user for adjusting the angle of a target directional microphone. The instruction may be in voice form, or may be issued by the user operating a control interface in an APP. The instruction may include the identifier of the target directional microphone whose angle is to be adjusted, and the specific adjustment angle value. On this basis, the processor adjusts the angle θ of the target directional microphone as required by the instruction.
This embodiment supports the user in uniformly adjusting the angle of all directional microphones in the directional microphone array, and also in adjusting the angle of a single directional microphone. By adjusting the angle of the directional microphones, their orientation can better match the direction of the user's sound source.
The cleaning robot voice control device provided by an embodiment of the application is described below. The voice control device described below and the voice control method described above may be referred to in correspondence with each other; the device may be applied to the processor of the cleaning robot.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a voice control device for a cleaning robot according to an embodiment of the present application.
As shown in fig. 5, the apparatus may include:
an original voice signal obtaining unit 11, configured to obtain an original voice signal collected by each directional microphone in the directional microphone array;
a signal energy calculating unit 12, configured to calculate signal energy of an original voice signal acquired by each directional microphone in a first frequency band, where the first frequency band is a frequency band where a set voice is located;
The signal energy screening unit 13 is used for selecting a plurality of candidate original voice signals with the largest signal energy;
and a condition detection unit 14, configured to detect whether each candidate original speech signal meets a set speech control condition, and if yes, execute an action matching the set speech control condition.
Optionally, the processing procedure of the condition detection unit may specifically include:
Sending each candidate original voice signal into a configured voice recognition model respectively, and obtaining the probability that each candidate original voice signal output by the voice recognition model contains a set wake-up word and/or the probability that each candidate original voice signal contains a set sound source positioning command word;
If the probability that at least one candidate original voice signal contains a set wake-up word exceeds a first probability threshold, waking up the cleaning robot;
Select the candidate original voice signal whose probability of containing the set sound source localization command word exceeds a second probability threshold and is the largest, and take the pickup direction of the target directional microphone corresponding to the selected candidate original voice signal as the direction of the user's sound source.
Optionally, the apparatus of the application may further include a model training unit for training the above voice recognition model. The training process may include:
Obtain wake-up word audio samples, sound source localization word audio samples, and general word and noise audio samples, and form a training sample set from these samples;
extract the acoustic features of each sample in the training sample set and send them into the voice recognition model to obtain model recognition results;
based on the model recognition results, train the voice recognition model with a three-class loss function, taking the wake-up word as one class, the sound source localization word as one class, and general words and noise as one class.
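The three-class training setup above can be illustrated with a toy classifier. The linear model, random features and hyperparameters below are stand-ins for the real acoustic model and acoustic features, which the application does not specify:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for acoustic features and labels of the three sample classes:
# 0 = wake-up word, 1 = sound-source-localization word, 2 = general word / noise.
X = rng.standard_normal((300, 20))
y = rng.integers(0, 3, size=300)

W = np.zeros((20, 3))  # linear classifier weights (stand-in for the real model)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Gradient descent on the three-class cross-entropy loss.
for _ in range(200):
    p = softmax(X @ W)
    onehot = np.eye(3)[y]
    W -= 0.1 * X.T @ (p - onehot) / len(X)

# Per-sample probabilities for the three classes; the thresholding described
# earlier (first/second probability thresholds) would be applied to these.
probs = softmax(X @ W)
```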
Optionally, the apparatus of the present application may further include:
And the noise reduction unit is used for carrying out noise reduction processing on the original voice signals acquired by each directional microphone by adopting a pre-configured noise reduction model before the signal energy calculation unit is used for processing.
Optionally, the apparatus of the present application may further include:
the angle adjusting unit is used for receiving an instruction for adjusting the angle of the target directional microphone; and adjusting the included angle theta of the target directional microphone according to the requirement of the instruction.
The embodiment of the application also provides a storage medium, which can store a program suitable for being executed by a processor, wherein the program is used for realizing the steps of the voice control method of the cleaning robot.
Finally, it is further noted that relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
In this specification, the embodiments are described in a progressive manner, with each embodiment focusing on its differences from the other embodiments; the embodiments may be combined as needed, and for identical or similar parts the embodiments may be referred to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (12)

1. A cleaning robot, comprising:
The directional microphone array is arranged at the top of the cleaning robot body, and the pickup directions of different directional microphones in the directional microphone array are different;
The processor is used for acquiring original voice signals acquired by each directional microphone in the directional microphone array; respectively calculating the signal energy of the original voice signal acquired by each directional microphone in a first frequency band, wherein the first frequency band is a frequency band where the set voice is located; selecting a plurality of candidate original voice signals with the maximum signal energy; and respectively detecting whether each candidate original voice signal meets a set voice control condition, and if so, executing the action matched with the set voice control condition.
2. The cleaning robot of claim 1, wherein an axial direction of each of the directional microphones in the directional microphone array is at an angle θ with respect to the body top plane.
3. The cleaning robot of claim 2, wherein the included angle θ is configured to be adjustable by a user.
4. The cleaning robot of claim 1, further comprising:
the laser radar LDS panel is arranged at the top of the body;
the directional microphone array is disposed above the LDS panel.
5. The cleaning robot of claim 1, wherein the directional microphone array is a circular directional microphone array.
6. A cleaning robot voice control method, characterized by being applied to a voice control process of the cleaning robot according to any one of claims 1 to 5, the method comprising:
acquiring original voice signals collected by each directional microphone in a directional microphone array;
Respectively calculating the signal energy of the original voice signal acquired by each directional microphone in a first frequency band, wherein the first frequency band is a frequency band where the set voice is located;
selecting a plurality of candidate original voice signals with the maximum signal energy;
and respectively detecting whether each candidate original voice signal meets a set voice control condition, and if so, executing the action matched with the set voice control condition.
7. The method of claim 6, wherein the setting the voice control conditions comprises: wake-up conditions and/or sound source localization conditions;
The act of matching the wake-up condition includes: waking up the cleaning robot;
the act of matching the sound source localization conditions includes: and screening target directional microphones corresponding to the candidate original voice signals meeting the sound source positioning conditions, and taking the pickup direction of the target directional microphones as the direction of the user sound source.
8. The method of claim 7, wherein detecting whether each of the candidate original speech signals satisfies a set speech control condition, respectively, and if so, performing an action matching the set speech control condition, comprises:
Sending each candidate original voice signal into a configured voice recognition model respectively, and obtaining the probability that each candidate original voice signal output by the voice recognition model contains a set wake-up word and/or the probability that each candidate original voice signal contains a set sound source positioning command word;
If the probability that at least one candidate original voice signal contains a set wake-up word exceeds a first probability threshold, waking up the cleaning robot;
Selecting the candidate original voice signal whose probability of containing the set sound source positioning command word exceeds a second probability threshold and is the largest, and taking the pickup direction of the target directional microphone corresponding to the selected candidate original voice signal as the direction of the user's sound source.
9. The method of claim 8, wherein the training process of the speech recognition model comprises:
Obtaining an awakening word audio sample, a sound source positioning word audio sample, a general word and noise audio sample, and forming a training sample set by each sample;
extracting acoustic characteristics of each sample in the training sample set, and sending the acoustic characteristics into a voice recognition model to obtain a model recognition result;
Based on the model recognition result, the wake-up word is used as a class, the sound source localization word is used as a class, the general word and the noise are used as a class, and the three-classification loss function is adopted to train the voice recognition model.
10. The method of claim 6, further comprising, prior to calculating the signal energy of each of the directional microphones for the original speech signal at the first frequency band:
And adopting a preconfigured noise reduction model to perform noise reduction treatment on the original voice signals acquired by each directional microphone.
11. The method of claim 6, wherein when the axis of each of the directional microphones in the directional microphone array is at an angle θ with respect to the cleaning robot body top plane, the method further comprises:
Receiving an instruction for adjusting the angle of a target directional microphone;
And adjusting the included angle theta of the target directional microphone according to the requirement of the instruction.
12. A cleaning robot voice control apparatus, characterized by being applied to the processor of the cleaning robot according to any one of claims 1 to 5, comprising:
The original voice signal acquisition unit is used for acquiring original voice signals acquired by each directional microphone in the directional microphone array;
the signal energy calculating unit is used for calculating the signal energy of the original voice signal acquired by each directional microphone in a first frequency band, wherein the first frequency band is a frequency band where the set voice is located;
the signal energy screening unit is used for selecting a plurality of candidate original voice signals with the maximum signal energy;
And the condition detection unit is used for respectively detecting whether each candidate original voice signal meets the set voice control condition, and if so, executing the action matched with the set voice control condition.
Publication: CN117958654A, published 2024-05-03.
