CN112382277A

CN112382277A - Smart device wake-up method, smart device and computer-readable storage medium

Info

Publication number: CN112382277A
Application number: CN202110019710.9A
Authority: CN
Inventors: 傅涛; 杨杰; 王力; 冯凌
Original assignee: Bozhi Safety Technology Co ltd
Current assignee: Bozhi Safety Technology Co ltd
Priority date: 2021-01-07
Filing date: 2021-01-07
Publication date: 2021-02-19

Abstract

The invention discloses an intelligent device awakening method, an intelligent device and a computer readable storage medium, wherein the method comprises the following steps: acquiring a voice instruction signal; determining a first phoneme value corresponding to the voice command signal; acquiring a mouth image sequence of a person who sends a voice instruction signal; determining a second phoneme value corresponding to the mouth image sequence; and respectively calculating the similarity between the first phoneme value and the second phoneme value and the phoneme value corresponding to the set awakening word, and awakening the intelligent equipment when at least one similarity is greater than a set similarity threshold value. The method combines the image recognition technology and the voice technology, reduces the misjudgment rate of the intelligent equipment, improves the non-inductive interaction experience of the user, and makes voice interaction more smooth and natural.

Description

Smart device wake-up method, smart device and computer-readable storage medium

Technical Field

The present invention relates to a method for waking up an intelligent device, an intelligent device using the method, and a computer-readable storage medium storing the method, and more particularly, to a method for waking up a voice by combining voice and image recognition, which belongs to the technical field of image recognition and voice recognition.

Background

The voice recognition technology has made remarkable progress in recent years, and has entered various fields such as industry, home appliances, smart home, and the like. The voice wake-up technology containing the wake-up word is a form of voice recognition technology, which does not directly contact with a hardware device, and the wake-up or operation of the device can be realized through the voice containing the wake-up word. The existing playing interruption function of the intelligent voice device with the loudspeaker, such as an intelligent sound box, a vehicle-mounted mobile phone frame or a voice robot, is also realized by adopting a voice awakening technology containing awakening words, and the awakening words in the existing voice awakening technology applied to the intelligent voice device are all in a fixed threshold mode, namely, a balance value is taken between the positive awakening rate and the false awakening rate of the intelligent voice device as a fixed awakening word threshold value. In the working process of the intelligent voice device, for example, when music or voice broadcast is played, because the sound emitted by the speaker of the intelligent voice device is transmitted to the microphone of the intelligent voice device and is collected by the microphone, the sound emitted by the speaker may interfere with the voice recognition of the intelligent voice device. For such a situation, the smart voice device usually performs echo cancellation processing on the sound emitted by the speaker, but if the echo cancellation is not complete or the nonlinear distortion from the speaker to the microphone is too large, the situation may result in an excessive echo residue, and when the smart voice device is in an environment with an excessive echo residue for a long time, since the threshold of the wakeup word applied in the smart voice device is always fixed, the possibility that the smart voice device is mistakenly woken by the echo residue may be greatly increased. If the microphone of the intelligent voice device does not receive the voice containing the awakening words sent by the user, but the current playing state of the intelligent voice device is interrupted because of the residual echo, the use experience of the user is greatly reduced.

Disclosure of Invention

The present application aims to provide an intelligent device wake-up method, an intelligent device, and a computer-readable storage medium, so as to solve the technical problems that the wake-up method in the prior art is easily interfered and has misjudgment.

A first embodiment of the present invention provides a method for waking up an intelligent device, including:

acquiring a voice instruction signal, and determining a first phoneme value corresponding to the voice instruction signal;

acquiring a mouth image sequence of a person who sends the voice instruction signal, and determining a second phoneme value corresponding to the mouth image sequence;

and respectively calculating the similarity between the first phoneme value and the second phoneme value and the phoneme value corresponding to the set awakening word, and awakening the intelligent equipment when at least one of the similarity is greater than a set similarity threshold value.

Preferably, the acquiring the voice instruction signal specifically includes:

obtaining noise data of current environment without voice command signal

；

Acquiring the voice data of the current environment containing the voice instruction signal

；

Based on the noise data

And the voice data

The voice instruction signal is determined.

Preferably, based on said noise data

And the voice data

Determining the voice instruction signal, specifically:

processing the noise data using a short-time Fourier transform

Obtaining a processed noise frequency domain signal

；

Processing the vocal data using short-time Fourier transform

Obtaining the processed human voice frequency domain signal

；

According to the noise frequency domain signal

And the human voice frequency domain signal

Determining a frequency domain signal corresponding to the voice instruction signal

；

Correspondingly, determining a first phoneme value corresponding to the voice instruction signal, specifically:

The corresponding first phoneme value.

Preferably, the frequency domain signal is based on the noise

And the human voice frequency domain signal

The method specifically comprises the following steps:

determining a frequency domain signal corresponding to the voice instruction signal by using a first formula

The first formula is:

in the formula (I), the compound is shown in the specification,

is the index of the frame or frames,

is a frequency index, and

is the number of points of the short-time fourier transform,

in order to be a filter coefficient, the filter coefficient,

wherein

In order to adjust the factor in terms of the step size,

which represents the conjugate of the two or more different molecules,

is that

The ORD is the number of frames of the cache value.

Preferably, the determining a second phoneme value corresponding to the mouth image sequence specifically includes:

extracting mouth features from the mouth image sequence;

determining a recognition probability result of the factor unit by using the mouth features;

inputting the recognition probability result of the phoneme unit into a connected time sequence classifier to obtain a classification result of the phoneme unit;

and decoding the classification result of the phoneme unit by adopting a decoding method introducing an attention mechanism to obtain a second phoneme value corresponding to the mouth image sequence.

Preferably, the mouth feature is extracted from the mouth image sequence, specifically:

extracting the spatial feature of the mouth movement from the mouth image sequence by using a 2D convolutional neural network to obtain the spatial feature information of the mouth movement;

extracting the time characteristics of mouth movement from the mouth image sequence by using a 1D convolutional neural network to obtain time domain characteristic information of the mouth movement;

fusing the time domain characteristic information and the space characteristic information by utilizing a multi-space-time information fusion residual error network to obtain the fused mouth characteristic;

correspondingly, by using the mouth features, determining the recognition probability result of the factor unit, specifically:

and determining the recognition probability result of the factor unit by using the mouth features after fusion.

Preferably, the determining, by using the mouth feature, the recognition probability result of the factor unit specifically includes:

and inputting the mouth features into a Bi-GRU model to obtain a recognition probability result of the phoneme unit.

Preferably, the decoding method with the attention-introducing mechanism is used to decode the classification result of the phoneme unit to obtain a second phoneme value corresponding to the mouth image sequence, and specifically:

obtaining the hidden state of each moment of the phoneme unit in the classification result of the phoneme unit through attention;

obtaining a score for each of the hidden states;

acquiring a score of attention;

calculating a weighted sum of the score of the hidden state and the score of the attention to obtain a context vector;

and inputting the context vector into the decoder for joint training to obtain a second phoneme value corresponding to the mouth image sequence.

A second embodiment of the present invention provides an intelligent device, including:

the first phoneme determining module is used for acquiring a voice instruction signal and determining a first phoneme value corresponding to the voice instruction signal;

the second phoneme determining module is used for acquiring a mouth image sequence of a person who sends the voice instruction signal and determining a second phoneme value corresponding to the mouth image sequence;

and the awakening module is used for respectively calculating the similarity between the first phoneme value and the second phoneme value and the phoneme value corresponding to the set awakening word, and awakening the intelligent equipment when at least one of the similarity is greater than a set similarity threshold value.

A third embodiment of the invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the above-mentioned method.

Compared with the prior art, the intelligent device awakening method, the intelligent device and the computer readable storage medium have the following beneficial effects:

the method combines the image recognition technology and the voice technology, reduces the misjudgment rate of the intelligent equipment, improves the non-inductive interaction experience of the user, and makes voice interaction more smooth and natural.

Drawings

FIG. 1 is a flowchart of a wake-up method for an intelligent device according to the present invention;

fig. 2 is a detailed flowchart of step 1 in the wake-up method of the smart device in the embodiment of the present invention;

fig. 3 is a detailed flowchart of step 2 in the wake-up method of the smart device in the embodiment of the present invention;

FIG. 4 is a schematic diagram of an improved MST (multi-spatiotemporal information fusion) unit after 2D convolution kernel and 1D convolution fusion in an intelligent device wake-up method according to an embodiment of the present invention;

fig. 5 is a schematic structural diagram of a lip spatiotemporal feature extraction network in the wake-up method of the smart device according to the embodiment of the present invention.

Detailed Description

In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.

In order to explain the technical means of the present invention, the following examples are given.

The invention provides an intelligent device awakening method fusing voice and image recognition, an intelligent device and a computer readable storage medium. The voice acquisition device acquires the voice signal (namely the voice command signal) of the current environment person in real time, the algorithm is used for analyzing and judging whether the voice signal of the current person is intended to awaken the intelligent device or not, the image sensor is used for monitoring the mouth action signal of the current environment person in real time, and the algorithm is used for analyzing and judging whether the mouth action characteristic of the current person is intended to awaken the intelligent device or not. If it is determined that one of the two exists, the mobile terminal wakes up. The method and the device can effectively reduce the possibility of mistaken awakening of the intelligent device in a specific environment, and improve the non-inductive interaction experience of the user.

Fig. 1 is a flowchart of a method for waking up an intelligent device according to the present invention.

The method for waking up the intelligent device in the first embodiment of the invention comprises the following steps:

step 1, acquiring a voice instruction signal, and determining a first phoneme value corresponding to the voice instruction signal, specifically:

collecting noise data without voice command signal in current environment by using voice collector

(ii) a Using a voice collector to collect voice data containing voice instruction signals in the current environment

(ii) a The voice collector is a microphone;

processing noise data using short-time Fourier transform

Obtaining a processed noise frequency domain signal

(ii) a Processing human voice data using short-time fourier transform

Obtaining the processed human voice frequency domain signal

；

Then, the preset echo cancellation algorithm corresponding to NLMS algorithm is used to carry out echo cancellation on the human audio frequency domain signal

Echo cancellation is carried out, the echo referred to in the application is the noise data, and a frequency domain signal corresponding to the voice command signal after the echo cancellation is obtained

In particular, toDetermining a frequency domain signal corresponding to a voice command signal using a first formula

The first formula is:

in the formula (I), the compound is shown in the specification,

is the index of the frame or frames,

is a frequency index, and

is the number of points of the short-time fourier transform,

is the filter coefficient;

wherein

In order to adjust the factor in terms of the step size,

which represents the conjugate of the two or more different molecules,

is that

The ORD is the frame number of the cache value;

。

and finally, determining a first phoneme value corresponding to the frequency domain signal corresponding to the voice command signal.

Step 2, obtaining a mouth image sequence of a person who sends a voice instruction signal, and determining a second phoneme value corresponding to the mouth image sequence, wherein the steps are as follows:

the method comprises the following steps of monitoring mouth action signals of people in the current environment in real time through an image sensor, specifically: detecting the image information of the human face in the visible area in real time through a binocular camera, and then detecting and cutting out a mouth image sequence from the image information of the face through a face detector;

then, mouth feature extraction is carried out on the mouth image sequence by utilizing a hybrid convolution neural network, and the method specifically comprises the following steps: the hybrid convolutional neural network of the present application consists of an improved 3D convolutional neural network and an MST (multiple spatiotemporal information fusion) residual network; the improved 3D convolutional neural network is a block that decomposes the 3D convolutional operation into two consecutive sub-convolutional blocks, a 2D convolutional neural network and a 1D convolutional neural network, respectively. The 2D convolutional neural network performs spatial feature extraction of mouth motion on the mouth image sequence to obtain spatial feature information of the mouth; the 1D convolutional neural network performs time dimension feature extraction on mouth movement on the mouth image sequence to obtain time domain feature information of the mouth movement; MST (multi-space-time information fusion) residual error network carries out multi-scale information fusion on the spatial features and the time features of the mouth part to obtain fused mouth part features;

then, inputting the fused mouth features into a Bi-GRU model to obtain a recognition probability result of the factor unit; then inputting the recognition probability result of the phoneme unit into a connection time sequence classifier CTC to obtain the classification result of the phoneme unit;

finally, decoding the classification result of the phoneme unit by adopting a decoding method introducing an attention mechanism to obtain a second phoneme value corresponding to the mouth image sequence, wherein the method specifically comprises the following steps: obtaining a hidden state of each phoneme unit at each moment by attention, obtaining a score state of the attention by scoring each hidden state, aggregating the hidden states of the phoneme units by using a weighted sum of the hidden states of the phoneme units and the attention scores to obtain a context vector, and inputting the context vector into a decoder for joint training to obtain a second phoneme value corresponding to the mouth image sequence.

And 3, respectively calculating the similarity between the first phoneme value and the second phoneme value and the phoneme value corresponding to the set awakening word, and awakening the intelligent equipment when at least one similarity is greater than a set similarity threshold value.

A second embodiment of the present invention discloses an intelligent device, including:

the second phoneme determining module is used for acquiring a mouth image sequence of a person sending the voice command signal and determining a second phoneme value corresponding to the mouth image sequence;

and the awakening module is used for respectively calculating the similarity between the first phoneme value and the second phoneme value and the phoneme value corresponding to the set awakening word, and awakening the intelligent equipment when at least one similarity is greater than a set similarity threshold value.

A third embodiment of the present invention provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the computer program implements the steps of the above-mentioned smart device wake-up method.

Hereinafter, the present application will be described in detail with specific examples.

Fig. 2 is a detailed flowchart of step 1, namely a detailed flowchart of acquiring a voice command signal and determining a first phoneme value corresponding to the voice command signal.

FIG. 3 is a detailed flowchart of step 2; namely, a detailed flow chart for acquiring a mouth image sequence of a person who sends a voice command signal and determining a second phoneme value corresponding to the mouth image sequence.

In this embodiment, the application devices are a microphone and a camera, wherein the camera is a 480pusb camera, and the USB camera and the microphone are fixed in front of the speaker and are 45cm away from the speaker. The awakening method comprises the following specific steps:

firstly, noise data without human voice (namely voice command signal) of intelligent voice equipment is obtained

And voice data including voice

;

And then noise data is processed by short-time Fourier transform

Processing to obtain processed noise frequency domain signal

(ii) a And to human acoustic data by short-time Fourier transform

Processing to obtain processed human voice frequency domain signal

；

To human audio frequency domain signal by preset echo cancellation algorithm

Echo cancellation is carried out to obtain a human voice channel frequency domain signal after echo cancellation

；

Eliminating echo to eliminate human voice channel frequency domain signal

And converting the value into a corresponding phoneme value C1, and judging whether the similarity with the awakening parameter exists.

Acquiring each frame of image in a real-time video collected by a camera from the beginning of receiving a voice signal, and detecting and cutting out a mouth image sequence from face image information by using a face detector;

in the embodiment, a dlib library face 68 characteristic point extractor is adopted to extract lip regions of speakers in lip reading data sets, and a dlib library face detection model can be used to quickly capture large-amplitude shaking of faces, so that the sensitivity is high;

inputting the collected images into a network, finally outputting images surrounding 68 key points of the human face, and extracting coordinates of key points of lips of 46-68 to obtain coordinates of a central point of a rectangular region of the lips

And the width of the rectangle

Rectangular height

；

Performing feature extraction on the lip image sequence by using a hybrid convolutional neural network; adopting a mixed convolutional neural network ((2+1) D + MST) to extract features of different spatial amplitudes and different time periods of the lip sequence; wherein the (2+1) D convolution block is a sub-convolution block that decomposes the 3D convolution operation into two successive ones, respectively a 2D convolution neural network and a 1D convolution neural network; the 2D convolutional neural network performs spatial feature extraction of lip movement on the lip image sequence to obtain spatial feature information of lips; the method comprises the steps that 1D convolutional neural network carries out time dimension feature extraction on lip movement on a lip image sequence to obtain time domain feature information of the lip movement; and performing multi-scale information fusion on the spatial characteristics and the temporal characteristics of the lips by using an MST (multi-spatiotemporal information fusion) residual error network.

In this embodiment, for the defect that each layer of the (2+1) D convolutional neural network has a spatial scale and a time depth with a single size, and each element in the feature map corresponds to single feature information, so that the model generalization capability is poor, 2D convolutional kernels and 1D convolutional kernels with different scales are used in space and time, respectively, and important spatio-temporal information which cannot be captured in a single space-time can be better processed. Fig. 4 is a schematic diagram of an improved MST (multiple spatiotemporal information fusion) unit after 2D convolution kernel ID convolution fusion. The improved MST unit includes n 2D convolution kernels, m 1D convolution kernels, 2 EN layers, and 2 non-linear layers. In the process of feature extraction, firstly, 2D convolution kernels of different scales are used for extracting multi-scale spatial feature information on a single-frame picture at the same time, then the multi-scale spatial feature information is combined into a short video according to a video time sequence, the short video is input into a multi-scale 1D convolution layer, time domain feature information of a long time period, a medium time period and a short time period is extracted at the same time, and finally a new feature map is formed through a fusion layer.

Fig. 5 is a schematic structural diagram of a lip spatiotemporal feature extraction network. The hybrid convolutional neural network specifically comprises 1 input layer, 6 improved MST residual error units, a global pooling layer, 1 full-connection layer, 1 softmax classification layer, 3 time domain down-sampling layers and 4 spatial down-sampling layers. The 3 time domain down-sampling layers are respectively arranged in the 4 th, 5 th and 6 th MST residual units, and the 4 spatial down-sampling layers are respectively arranged in the 1 st, 4 th, 5 th and 6 th MST residual units.

And inputting the lip characteristics into a Bi-directional gating circulation unit Bi-GRU model to obtain a recognition probability result of the phoneme unit. The Bi-GRU network is specifically a forward GRU and a reverse GRU, and is a gate recursion unit GRU, each layer of GRU network is provided with 256 filters, and the output of each time step of the GRU is processed by a full connection layer Softmax to obtain the recognition probability result of the phoneme unit.

Inputting the recognition probability result of the phoneme unit into a connection time sequence classifier CTC to obtain a phoneme unit classification result;

and processing the classification result of the phoneme unit by adopting a decoding method introducing an attention mechanism to obtain a mouth action recognition result.

In order to further increase the accuracy of long sentence mouth motion recognition, an attention mechanism is introduced at the output end of the algorithm framework, namely a decoding method of the attention mechanism is introduced; the method can enable the model decoder to pay attention to the coded content at a specific position, and does not need to take the whole coded content as the basis of decoding, thereby improving the decoding effect of the model and increasing the robustness of the system.

The decoder is a gated cyclic unit (GRU) with 3 layers cascaded, the conventional decoding processing is to directly input the phoneme unit classification result into the decoder for training to obtain a mouth action recognition result, the decoding processing introducing the attention mechanism is to obtain the hidden state of the phoneme unit at each moment through attention, score each hidden state by using an additive function, and obtain the score state of the attention through a softmax layer. And aggregating the hidden states of the phoneme units by using the weighted sum of the hidden states of the phoneme units and the attention scores to obtain a context vector, and inputting the context vector into a decoder for joint training to obtain a mouth action recognition result. Different phoneme unit recognition results can be used at each moment of the decoder by applying an attention mechanism in the decoding process, so that the decoding process can selectively focus on useful parts in the phoneme recognition results, the decoding effect is improved, and the long sentence recognition effect is better. If the attention mechanism is not introduced, the phoneme unit recognition result is converted into the corresponding Chinese characters word by word according to the sequence of the phoneme unit recognition result, but if the sentence is long, the previous conversion result can be forgotten in the conversion process, so that semantic errors and the recognition accuracy rate are reduced.

In the embodiment, the Chinese mouth motion recognition device based on the hybrid convolutional neural network is further disclosed, and comprises an image acquisition unit, a lip detection unit, a lip feature extraction unit and a mouth motion recognition unit; the image acquisition unit is used for acquiring facial image information of a speaker; the lip detection unit detects a cut lip image sequence from the face image information input by the image acquisition unit; the lip feature extraction unit completes lip feature extraction by using a hybrid convolution neural network according to the lip image sequence input by the lip detection unit; the mouth action recognition unit inputs a Bi-GRU model according to the lip features extracted by the lip feature extraction unit to obtain a recognition probability result of the phoneme unit, then is connected with a time sequence classifier CTC to obtain a phoneme unit classification result, and then processes the classification result of the phoneme unit by a decoding method introducing an attention mechanism to obtain a mouth action recognition result. And judging whether the equipment needs to be awakened or not according to the phoneme of the voice recognition and the phoneme of the mouth action recognition.

Although the present application has been described with reference to a few embodiments, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the application as defined by the appended claims.

Claims

1. A method for waking up an intelligent device is characterized by comprising the following steps:

2. The smart device wake-up method according to claim 1, wherein the acquiring of the voice instruction signal specifically comprises:

obtaining noise data of current environment without voice command signal

；

；

Based on the noise data

And the voice data

The voice instruction signal is determined.

3. The smart device wake-up method according to claim 2, wherein the noise data is based on

And the voice data

Determining the voice instruction signal, specifically:

processing the noise data using a short-time Fourier transform

Obtaining a processed noise frequency domain signal

；

Processing the vocal data using short-time Fourier transform

Obtaining the processed human voice frequency domain signal

；

According to the noise frequency domain signal

And the human voice frequency domain signal

；

The corresponding first phoneme value.

4. The smart device wake-up method according to claim 3, wherein the frequency domain signal is derived from the noise

And the human voice frequency domain signal

The method specifically comprises the following steps:

The first formula is:

in the formula (I), the compound is shown in the specification,

is the index of the frame or frames,

is a frequency index, and

is the number of points of the short-time fourier transform,

is the filter coefficient;

wherein

In order to adjust the factor in terms of the step size,

which represents the conjugate of the two or more different molecules,

is that

The ORD is the number of frames of the cache value.

5. The intelligent device wake-up method according to any one of claims 1 to 4, wherein the determining of the second phoneme value corresponding to the mouth image sequence specifically includes:

extracting mouth features from the mouth image sequence;

6. The smart device wake-up method according to claim 5, wherein the extracting mouth features from the mouth image sequence is specifically:

7. The smart device wake-up method according to claim 5, wherein the determining, by using the mouth feature, the recognition probability result of the factor unit specifically includes:

8. The intelligent device wake-up method according to claim 5, wherein the decoding method with the attention mechanism is used to decode the classification result of the phoneme unit to obtain a second phoneme value corresponding to the mouth image sequence, and specifically includes:

obtaining a score for each of the hidden states;

acquiring a score of attention;

9. A smart device, comprising:

10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 8.