CN112382277A - Smart device wake-up method, smart device and computer-readable storage medium - Google Patents

Smart device wake-up method, smart device and computer-readable storage medium

Info

Publication number
CN112382277A
Authority
CN
China
Prior art keywords
phoneme
mouth
voice
determining
instruction signal
Prior art date
Legal status
Pending
Application number
CN202110019710.9A
Other languages
Chinese (zh)
Inventor
傅涛
杨杰
王力
冯凌
Current Assignee
Bozhi Safety Technology Co ltd
Original Assignee
Bozhi Safety Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Bozhi Safety Technology Co ltd
Priority to CN202110019710.9A
Publication of CN112382277A
Legal status: Pending (current)

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L 2015/025 - Phonemes, fenemes or fenones being the recognition units
    • G10L 15/08 - Speech classification or search
    • G10L 15/16 - Speech classification or search using artificial neural networks
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/24 - Speech recognition using non-acoustical features
    • G10L 15/25 - Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a smart device wake-up method, a smart device, and a computer-readable storage medium. The method comprises the following steps: acquiring a voice instruction signal; determining a first phoneme value corresponding to the voice instruction signal; acquiring a mouth image sequence of the person who uttered the voice instruction signal; determining a second phoneme value corresponding to the mouth image sequence; and calculating, separately, the similarity between each of the first and second phoneme values and the phoneme value corresponding to a set wake-up word, and waking the smart device when at least one similarity is greater than a set similarity threshold. By combining image recognition with speech technology, the method reduces the smart device's misjudgment rate, improves the user's seamless, hands-free interaction experience, and makes voice interaction smoother and more natural.

Description

Smart device wake-up method, smart device and computer-readable storage medium
Technical Field
The present invention relates to a method for waking up a smart device, a smart device using the method, and a computer-readable storage medium storing a program implementing the method, and more particularly to a voice wake-up method that combines speech recognition with image recognition; it belongs to the technical fields of image recognition and speech recognition.
Background
Speech recognition technology has made remarkable progress in recent years and has entered many fields such as industry, home appliances and the smart home. Voice wake-up based on a wake-up word is one form of speech recognition: without any physical contact with the hardware, a device can be woken or operated by speech that contains the wake-up word. The play-interruption function of existing smart voice devices with a loudspeaker, such as smart speakers, in-car phone mounts or voice robots, is likewise implemented with wake-word-based voice wake-up. In existing voice wake-up technology applied to smart voice devices, the wake-up word is handled with a fixed threshold: a value balancing the device's true wake-up rate against its false wake-up rate is chosen as a fixed wake-word threshold. While the smart voice device is working, for example playing music or a voice announcement, the sound emitted by its loudspeaker propagates to its own microphone and is picked up, and this sound may interfere with the device's speech recognition. The smart voice device therefore usually applies echo cancellation to the sound emitted by the loudspeaker, but if the echo cancellation is incomplete, or the nonlinear distortion from loudspeaker to microphone is too large, an excessive echo residue remains. When the smart voice device stays in an environment with excessive echo residue for a long time, and because the wake-word threshold applied in the device is always fixed, the probability that the device is falsely woken by the echo residue increases greatly. If the microphone has not received any user speech containing the wake-up word, yet the device's current playback is interrupted because of the residual echo, the user experience is greatly degraded.
Disclosure of Invention
The present application aims to provide a smart device wake-up method, a smart device, and a computer-readable storage medium, so as to solve the technical problem that wake-up methods in the prior art are easily disturbed by interference and prone to misjudgment.
A first embodiment of the present invention provides a method for waking up an intelligent device, including:
acquiring a voice instruction signal, and determining a first phoneme value corresponding to the voice instruction signal;
acquiring a mouth image sequence of a person who sends the voice instruction signal, and determining a second phoneme value corresponding to the mouth image sequence;
and calculating, separately, the similarity between each of the first phoneme value and the second phoneme value and the phoneme value corresponding to a set wake-up word, and waking the smart device when at least one of the similarities is greater than a set similarity threshold.
Preferably, the acquiring of the voice instruction signal specifically includes:
obtaining noise data x(n) of the current environment that does not contain the voice instruction signal;
obtaining voice data d(n) of the current environment that contains the voice instruction signal;
and determining the voice instruction signal based on the noise data x(n) and the voice data d(n).
Preferably, determining the voice instruction signal based on the noise data x(n) and the voice data d(n) specifically includes:
processing the noise data x(n) with a short-time Fourier transform to obtain a noise frequency-domain signal X(l,k);
processing the voice data d(n) with a short-time Fourier transform to obtain a human-voice frequency-domain signal D(l,k);
and determining, from the noise frequency-domain signal X(l,k) and the human-voice frequency-domain signal D(l,k), the frequency-domain signal S(l,k) corresponding to the voice instruction signal.
Correspondingly, determining the first phoneme value corresponding to the voice instruction signal specifically means determining the first phoneme value corresponding to the frequency-domain signal S(l,k) of the voice instruction signal.
Preferably, determining the frequency-domain signal S(l,k) corresponding to the voice instruction signal from the noise frequency-domain signal X(l,k) and the human-voice frequency-domain signal D(l,k) specifically includes:
determining the frequency-domain signal S(l,k) corresponding to the voice instruction signal with a first formula (a frequency-domain NLMS update):
S(l,k) = D(l,k) - Σ_{i=0..ORD-1} W_i(l,k) · X(l-i,k)
where l is the frame index, k is the frequency index (k = 0, ..., N-1, with N the number of points of the short-time Fourier transform), and W_i(l,k) are the filter coefficients, which are updated as
W_i(l+1,k) = W_i(l,k) + μ · X*(l-i,k) · S(l,k) / Σ_{j=0..ORD-1} |X(l-j,k)|²
where μ is the step-size adjustment factor, * denotes the complex conjugate, X(l-i,k) are the buffered values of the noise frequency-domain signal X(l,k), and ORD is the number of buffered frames.
Preferably, the determining a second phoneme value corresponding to the mouth image sequence specifically includes:
extracting mouth features from the mouth image sequence;
determining a recognition probability result of the phoneme units by using the mouth features;
inputting the recognition probability result of the phoneme units into a connectionist temporal classification (CTC) classifier to obtain a classification result of the phoneme units;
and decoding the classification result of the phoneme unit by adopting a decoding method introducing an attention mechanism to obtain a second phoneme value corresponding to the mouth image sequence.
Preferably, the mouth feature is extracted from the mouth image sequence, specifically:
extracting the spatial feature of the mouth movement from the mouth image sequence by using a 2D convolutional neural network to obtain the spatial feature information of the mouth movement;
extracting the time characteristics of mouth movement from the mouth image sequence by using a 1D convolutional neural network to obtain time domain characteristic information of the mouth movement;
fusing the time domain characteristic information and the space characteristic information by utilizing a multi-space-time information fusion residual error network to obtain the fused mouth characteristic;
correspondingly, determining the recognition probability result of the phoneme units by using the mouth features specifically means:
determining the recognition probability result of the phoneme units by using the fused mouth features.
Preferably, the determining of the recognition probability result of the phoneme units by using the mouth features specifically includes:
inputting the mouth features into a Bi-GRU model to obtain the recognition probability result of the phoneme units.
Preferably, the decoding method with the attention-introducing mechanism is used to decode the classification result of the phoneme unit to obtain a second phoneme value corresponding to the mouth image sequence, and specifically:
obtaining the hidden state of each moment of the phoneme unit in the classification result of the phoneme unit through attention;
obtaining a score for each of the hidden states;
normalizing the scores to obtain attention weights;
aggregating the hidden states with a weighted sum using the attention weights to obtain a context vector;
and inputting the context vector into the decoder for joint training to obtain a second phoneme value corresponding to the mouth image sequence.
A second embodiment of the present invention provides an intelligent device, including:
the first phoneme determining module is used for acquiring a voice instruction signal and determining a first phoneme value corresponding to the voice instruction signal;
the second phoneme determining module is used for acquiring a mouth image sequence of a person who sends the voice instruction signal and determining a second phoneme value corresponding to the mouth image sequence;
and the wake-up module is used for calculating, separately, the similarity between each of the first phoneme value and the second phoneme value and the phoneme value corresponding to a set wake-up word, and for waking the smart device when at least one of the similarities is greater than a set similarity threshold.
A third embodiment of the invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the above-mentioned method.
Compared with the prior art, the intelligent device awakening method, the intelligent device and the computer readable storage medium have the following beneficial effects:
the method combines the image recognition technology and the voice technology, reduces the misjudgment rate of the intelligent equipment, improves the non-inductive interaction experience of the user, and makes voice interaction more smooth and natural.
Drawings
FIG. 1 is a flowchart of a wake-up method for an intelligent device according to the present invention;
fig. 2 is a detailed flowchart of step 1 in the wake-up method of the smart device in the embodiment of the present invention;
fig. 3 is a detailed flowchart of step 2 in the wake-up method of the smart device in the embodiment of the present invention;
FIG. 4 is a schematic diagram of an improved MST (multi-spatiotemporal information fusion) unit after 2D convolution kernel and 1D convolution fusion in an intelligent device wake-up method according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a lip spatiotemporal feature extraction network in the wake-up method of the smart device according to the embodiment of the present invention.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.
In order to explain the technical means of the present invention, the following examples are given.
The invention provides a smart device wake-up method that fuses speech and image recognition, a smart device, and a computer-readable storage medium. A voice acquisition device acquires the speech signal (i.e. the voice instruction signal) of the person in the current environment in real time, and an algorithm analyses whether the current person's speech indicates an intention to wake the smart device; at the same time, an image sensor monitors the mouth movements of the person in the current environment in real time, and an algorithm analyses whether the current person's mouth-movement features indicate an intention to wake the smart device. If either of the two indicates a wake-up intention, the smart device is woken. This effectively reduces the possibility of the smart device being falsely woken in specific environments and improves the user's seamless, hands-free interaction experience.
Fig. 1 is a flowchart of a method for waking up an intelligent device according to the present invention.
The method for waking up the intelligent device in the first embodiment of the invention comprises the following steps:
step 1, acquiring a voice instruction signal, and determining a first phoneme value corresponding to the voice instruction signal, specifically:
collecting noise data without voice command signal in current environment by using voice collector
Figure 384323DEST_PATH_IMAGE001
(ii) a Using a voice collector to collect voice data containing voice instruction signals in the current environment
Figure 612042DEST_PATH_IMAGE002
(ii) a The voice collector is a microphone;
processing noise data using short-time Fourier transform
Figure 991202DEST_PATH_IMAGE001
Obtaining a processed noise frequency domain signal
Figure 89608DEST_PATH_IMAGE003
(ii) a Processing human voice data using short-time fourier transform
Figure 732073DEST_PATH_IMAGE002
Obtaining the processed human voice frequency domain signal
Figure 181509DEST_PATH_IMAGE004
Then, the preset echo cancellation algorithm corresponding to NLMS algorithm is used to carry out echo cancellation on the human audio frequency domain signal
Figure 816890DEST_PATH_IMAGE004
Echo cancellation is carried out, the echo referred to in the application is the noise data, and a frequency domain signal corresponding to the voice command signal after the echo cancellation is obtained
Figure 54623DEST_PATH_IMAGE005
In particular, toDetermining a frequency domain signal corresponding to a voice command signal using a first formula
Figure 117257DEST_PATH_IMAGE005
The first formula is:
Figure 257251DEST_PATH_IMAGE015
in the formula (I), the compound is shown in the specification,
Figure 243793DEST_PATH_IMAGE007
is the index of the frame or frames,
Figure 333102DEST_PATH_IMAGE008
is a frequency index, and
Figure 628955DEST_PATH_IMAGE009
is the number of points of the short-time fourier transform,
Figure 990666DEST_PATH_IMAGE016
is the filter coefficient;
Figure 780898DEST_PATH_IMAGE017
wherein
Figure 177245DEST_PATH_IMAGE012
In order to adjust the factor in terms of the step size,
Figure 391801DEST_PATH_IMAGE013
which represents the conjugate of the two or more different molecules,
Figure 771967DEST_PATH_IMAGE014
is that
Figure 552841DEST_PATH_IMAGE003
The ORD is the frame number of the cache value;
Figure 616743DEST_PATH_IMAGE018
and finally, determining a first phoneme value corresponding to the frequency domain signal corresponding to the voice command signal.
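As a reading aid, the following sketch illustrates the echo-cancellation step above with a minimal frequency-domain NLMS canceller in Python. It assumes the noise recording x(n) serves as the echo reference and the voiced recording d(n) as the microphone signal; the frame length, hop size, step size mu and buffer depth ORD are illustrative values, not values taken from this application.

```python
import numpy as np

def stft(sig, n_fft=512, hop=256):
    """Frame the signal and take the short-time Fourier transform."""
    win = np.hanning(n_fft)
    frames = [sig[i:i + n_fft] * win
              for i in range(0, len(sig) - n_fft + 1, hop)]
    return np.fft.rfft(np.asarray(frames), axis=1)   # shape: (L frames, K bins)

def nlms_echo_cancel(X, D, ord_=4, mu=0.5, eps=1e-8):
    """Per-bin frequency-domain NLMS: subtract the filtered noise spectrum
    X(l,k) from the voiced spectrum D(l,k) to estimate the command spectrum S(l,k)."""
    L, K = D.shape
    W = np.zeros((ord_, K), dtype=complex)      # filter coefficients per frequency bin
    Xbuf = np.zeros((ord_, K), dtype=complex)   # last ORD frames of the noise spectrum
    S = np.zeros_like(D)
    for l in range(L):
        Xbuf = np.roll(Xbuf, 1, axis=0)
        Xbuf[0] = X[l]
        S[l] = D[l] - np.sum(W * Xbuf, axis=0)              # echo-cancelled frame
        norm = np.sum(np.abs(Xbuf) ** 2, axis=0) + eps      # per-bin input power
        W += mu * np.conj(Xbuf) * S[l] / norm               # NLMS coefficient update
    return S

# Example with placeholder recordings of equal length
fs = 16000
noise = np.random.randn(fs)       # noise-only recording x(n)
voiced = np.random.randn(fs)      # recording d(n) containing the voice instruction
S = nlms_echo_cancel(stft(noise), stft(voiced))
```

The echo-cancelled spectrum S returned here is what would then be passed to the acoustic model that maps it to the first phoneme value.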
Step 2, obtaining a mouth image sequence of a person who sends a voice instruction signal, and determining a second phoneme value corresponding to the mouth image sequence, wherein the steps are as follows:
the method comprises the following steps of monitoring mouth action signals of people in the current environment in real time through an image sensor, specifically: detecting the image information of the human face in the visible area in real time through a binocular camera, and then detecting and cutting out a mouth image sequence from the image information of the face through a face detector;
Mouth features are then extracted from the mouth image sequence with a hybrid convolutional neural network, specifically as follows. The hybrid convolutional neural network of the present application consists of an improved 3D convolutional neural network and an MST (multi-spatiotemporal information fusion) residual network. The improved 3D convolutional neural network decomposes the 3D convolution operation into two consecutive sub-convolution blocks, a 2D convolutional neural network and a 1D convolutional neural network. The 2D convolutional neural network extracts spatial features of mouth movement from the mouth image sequence to obtain the spatial feature information of the mouth; the 1D convolutional neural network extracts temporal features of mouth movement from the mouth image sequence to obtain the time-domain feature information of the mouth movement; and the MST residual network performs multi-scale information fusion of the spatial and temporal mouth features to obtain the fused mouth features;
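As an illustration of the (2+1)D decomposition just described, the sketch below factorizes a 3D convolution into a 2D spatial sub-block followed by a 1D temporal sub-block (PyTorch). The channel counts, kernel sizes and clip shape are assumptions made for the example, not values specified in this application.

```python
import torch
import torch.nn as nn

class SpatioTemporalConv(nn.Module):
    """(2+1)D block: a 2D convolution over each frame (spatial features of the mouth)
    followed by a 1D convolution across frames (temporal features of mouth movement)."""
    def __init__(self, in_ch, out_ch, spatial_k=3, temporal_k=3):
        super().__init__()
        # 2D spatial convolution, applied frame by frame via a (1, k, k) kernel
        self.spatial = nn.Conv3d(in_ch, out_ch, (1, spatial_k, spatial_k),
                                 padding=(0, spatial_k // 2, spatial_k // 2))
        # 1D temporal convolution across the frame axis via a (k, 1, 1) kernel
        self.temporal = nn.Conv3d(out_ch, out_ch, (temporal_k, 1, 1),
                                  padding=(temporal_k // 2, 0, 0))
        self.bn = nn.BatchNorm3d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):            # x: (batch, channels, frames, height, width)
        x = self.act(self.spatial(x))
        return self.act(self.bn(self.temporal(x)))

# Example: a batch of 2 mouth clips, 1 channel, 29 frames of 64x64 crops
clip = torch.randn(2, 1, 29, 64, 64)
features = SpatioTemporalConv(1, 32)(clip)   # -> (2, 32, 29, 64, 64)
```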
then the fused mouth features are input into a Bi-GRU model to obtain the recognition probability result of the phoneme units, and the recognition probability result of the phoneme units is input into a connectionist temporal classification (CTC) classifier to obtain the classification result of the phoneme units;
finally, the classification result of the phoneme units is decoded with a decoding method that introduces an attention mechanism to obtain the second phoneme value corresponding to the mouth image sequence, specifically: the hidden state of each phoneme unit at each moment is obtained through attention; each hidden state is scored, and the attention weights are obtained from the scores; the hidden states of the phoneme units are aggregated with a weighted sum using the attention weights to obtain a context vector; and the context vector is input into a decoder for joint training to obtain the second phoneme value corresponding to the mouth image sequence.
Step 3, calculating, separately, the similarity between each of the first phoneme value and the second phoneme value and the phoneme value corresponding to the set wake-up word, and waking the smart device when at least one similarity is greater than the set similarity threshold.
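A minimal sketch of this decision rule is given below. It assumes the first and second phoneme values are represented as phoneme strings and uses a normalized edit-distance ratio as the similarity measure; the wake-word phoneme string, the threshold value and the similarity metric are illustrative assumptions, since the application does not fix a particular metric.

```python
from difflib import SequenceMatcher

WAKE_WORD_PHONEMES = "n i h ao x iao zh i"   # hypothetical phoneme string for the wake word
SIMILARITY_THRESHOLD = 0.8                   # illustrative threshold

def similarity(a: str, b: str) -> float:
    """Normalized similarity (0..1) between two space-separated phoneme sequences."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()

def should_wake(first_phonemes: str, second_phonemes: str) -> bool:
    """Wake the device if either the audio-derived or the lip-derived phoneme
    sequence is close enough to the configured wake word."""
    s_audio = similarity(first_phonemes, WAKE_WORD_PHONEMES)
    s_lips = similarity(second_phonemes, WAKE_WORD_PHONEMES)
    return max(s_audio, s_lips) > SIMILARITY_THRESHOLD
```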
A second embodiment of the present invention discloses an intelligent device, including:
the first phoneme determining module is used for acquiring a voice instruction signal and determining a first phoneme value corresponding to the voice instruction signal;
the second phoneme determining module is used for acquiring a mouth image sequence of a person sending the voice command signal and determining a second phoneme value corresponding to the mouth image sequence;
and the wake-up module is used for calculating, separately, the similarity between each of the first phoneme value and the second phoneme value and the phoneme value corresponding to the set wake-up word, and for waking the smart device when at least one similarity is greater than the set similarity threshold.
A third embodiment of the present invention provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the computer program implements the steps of the above-mentioned smart device wake-up method.
Hereinafter, the present application will be described in detail with specific examples.
Fig. 2 is a detailed flowchart of step 1, namely a detailed flowchart of acquiring a voice command signal and determining a first phoneme value corresponding to the voice command signal.
FIG. 3 is a detailed flowchart of step 2; namely, a detailed flow chart for acquiring a mouth image sequence of a person who sends a voice command signal and determining a second phoneme value corresponding to the mouth image sequence.
In this embodiment, the devices used are a microphone and a camera, where the camera is a 480p USB camera; the USB camera and the microphone are fixed in front of the speaker, 45 cm away from the speaker. The wake-up method comprises the following specific steps:
First, noise data x(n) of the smart voice device's environment without human voice (i.e. without the voice instruction signal) and voice data d(n) containing human voice are acquired.
The noise data x(n) is then processed with a short-time Fourier transform to obtain the noise frequency-domain signal X(l,k), and the voice data d(n) is processed with a short-time Fourier transform to obtain the human-voice frequency-domain signal D(l,k). Echo cancellation is performed on the human-voice frequency-domain signal D(l,k) with the preset echo cancellation algorithm to obtain the echo-cancelled human-voice frequency-domain signal S(l,k).
The echo-cancelled frequency-domain signal S(l,k) is converted into the corresponding phoneme value C1, and whether it is similar to the wake-up parameter is judged.
From the moment the voice signal starts to be received, each frame of the real-time video collected by the camera is acquired, and a mouth image sequence is detected and cropped from the face image information by a face detector;
in this embodiment, the dlib 68-point facial landmark extractor is used to extract the speaker's lip region from the lip-reading data set, and the dlib face detection model can quickly capture large face movements, so the sensitivity is high;
the collected images are input into the network, which outputs images annotated with the 68 facial key points; the lip key points (points 46-68) are extracted to obtain the centre coordinates (x_c, y_c) of the lip bounding rectangle, together with the rectangle's width w and height h.
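A compact sketch of this lip-region cropping step is given below, using dlib's frontal face detector and 68-point shape predictor together with OpenCV. The model file path, the padding, and the use of landmark indices 48-67 (dlib's usual 0-based mouth points) are assumptions made for the example rather than values fixed by this application.

```python
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
# Path to the standard 68-point landmark model; assumed to be available locally.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def crop_lip_region(frame, pad=10):
    """Detect the face, locate the mouth landmarks, and return the lip crop
    together with the rectangle centre (cx, cy), width w and height h."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    pts = np.array([(shape.part(i).x, shape.part(i).y) for i in range(48, 68)],
                   dtype=np.int32)
    x, y, w, h = cv2.boundingRect(pts)           # lip bounding rectangle
    cx, cy = x + w // 2, y + h // 2              # rectangle centre
    crop = frame[max(y - pad, 0):y + h + pad, max(x - pad, 0):x + w + pad]
    return crop, (cx, cy), w, h
```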
Feature extraction is performed on the lip image sequence with a hybrid convolutional neural network: the hybrid convolutional neural network ((2+1)D + MST) extracts features of the lip sequence at different spatial scales and over different time spans. The (2+1)D convolution block decomposes the 3D convolution operation into two consecutive sub-convolution blocks, a 2D convolutional neural network and a 1D convolutional neural network. The 2D convolutional neural network extracts spatial features of lip movement from the lip image sequence to obtain the spatial feature information of the lips; the 1D convolutional neural network extracts temporal features of lip movement from the lip image sequence to obtain the time-domain feature information of lip movement; and the MST (multi-spatiotemporal information fusion) residual network performs multi-scale information fusion of the spatial and temporal lip features.
In this embodiment, to address the weakness of the (2+1)D convolutional neural network that each layer has a single spatial scale and a single temporal depth, so that each element of the feature map corresponds to a single kind of feature information and the model generalizes poorly, 2D convolution kernels and 1D convolution kernels of different scales are used in space and in time respectively, which better captures important spatio-temporal information that a single scale cannot. Fig. 4 is a schematic diagram of the improved MST (multi-spatiotemporal information fusion) unit obtained by fusing 2D and 1D convolution kernels. The improved MST unit comprises n 2D convolution kernels, m 1D convolution kernels, 2 BN (batch normalization) layers and 2 nonlinear layers. During feature extraction, 2D convolution kernels of different scales first extract multi-scale spatial feature information from each single-frame picture simultaneously; the multi-scale spatial feature information is then assembled into a short clip in video time order and input into the multi-scale 1D convolution layer, which simultaneously extracts time-domain feature information over long, medium and short time spans; finally a new feature map is formed through the fusion layer.
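The following sketch shows one possible form of such a multi-scale MST unit in PyTorch: parallel 2D spatial kernels of different sizes, parallel 1D temporal kernels of different lengths, a fusion layer, and a residual connection. The number of branches, the kernel sizes and the use of a 1x1x1 fusion convolution are assumptions for illustration; only the overall structure (multi-scale 2D, multi-scale 1D, BN, nonlinearity, fusion, residual) follows the description above.

```python
import torch
import torch.nn as nn

class MSTUnit(nn.Module):
    """Multi-spatiotemporal fusion unit: multi-scale 2D spatial kernels,
    multi-scale 1D temporal kernels, then a fusion layer with a residual path."""
    def __init__(self, ch, spatial_ks=(3, 5), temporal_ks=(3, 5, 7)):
        super().__init__()
        self.spatial = nn.ModuleList(
            [nn.Conv3d(ch, ch, (1, k, k), padding=(0, k // 2, k // 2)) for k in spatial_ks])
        self.temporal = nn.ModuleList(
            [nn.Conv3d(ch * len(spatial_ks), ch, (k, 1, 1), padding=(k // 2, 0, 0))
             for k in temporal_ks])
        self.fuse = nn.Conv3d(ch * len(temporal_ks), ch, 1)   # fusion layer
        self.bn1 = nn.BatchNorm3d(ch * len(spatial_ks))
        self.bn2 = nn.BatchNorm3d(ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):                      # x: (B, C, T, H, W)
        s = torch.cat([conv(x) for conv in self.spatial], dim=1)    # multi-scale spatial
        s = self.act(self.bn1(s))
        t = torch.cat([conv(s) for conv in self.temporal], dim=1)   # multi-scale temporal
        out = self.act(self.bn2(self.fuse(t)))
        return out + x                         # residual connection

# Example: 32-channel feature clip of 29 frames at 16x16 resolution
y = MSTUnit(32)(torch.randn(2, 32, 29, 16, 16))   # -> (2, 32, 29, 16, 16)
```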
Fig. 5 is a schematic structural diagram of the lip spatio-temporal feature extraction network. The hybrid convolutional neural network specifically comprises 1 input layer, 6 improved MST residual units, a global pooling layer, 1 fully connected layer, 1 softmax classification layer, 3 time-domain down-sampling layers and 4 spatial down-sampling layers. The 3 time-domain down-sampling layers are placed in the 4th, 5th and 6th MST residual units, and the 4 spatial down-sampling layers in the 1st, 4th, 5th and 6th MST residual units.
The lip features are input into a bidirectional gated recurrent unit (Bi-GRU) model to obtain the recognition probability result of the phoneme units. The Bi-GRU network consists of a forward GRU and a backward GRU; each GRU layer has 256 units, and the output of each GRU time step is passed through a fully connected layer and a softmax to obtain the recognition probability result of the phoneme units.
The recognition probability result of the phoneme units is input into the connectionist temporal classification (CTC) classifier to obtain the phoneme-unit classification result;
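A minimal Bi-GRU + CTC sketch is shown below (PyTorch). The 256 hidden units per direction follow the description above, while the input feature dimension, the number of stacked layers, the phoneme vocabulary size and the dummy tensors are assumptions made for the example.

```python
import torch
import torch.nn as nn

class PhonemeRecognizer(nn.Module):
    """Bi-GRU over the fused mouth features, per-timestep softmax over phoneme units,
    trained with CTC as in the embodiment."""
    def __init__(self, feat_dim=512, hidden=256, num_phonemes=60):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, num_phonemes + 1)   # +1 for the CTC blank

    def forward(self, feats):                # feats: (B, T, feat_dim)
        out, _ = self.gru(feats)
        return self.fc(out).log_softmax(-1)  # (B, T, num_phonemes + 1)

model = PhonemeRecognizer()
ctc = nn.CTCLoss(blank=0)
feats = torch.randn(2, 29, 512)                       # 2 clips, 29 frames of fused features
log_probs = model(feats).transpose(0, 1)              # CTCLoss expects (T, B, C)
targets = torch.randint(1, 61, (2, 10))               # dummy phoneme-unit label sequences
loss = ctc(log_probs, targets,
           input_lengths=torch.full((2,), 29),
           target_lengths=torch.full((2,), 10))
```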
and processing the classification result of the phoneme unit by adopting a decoding method introducing an attention mechanism to obtain a mouth action recognition result.
To further increase the accuracy of long-sentence mouth-action recognition, an attention mechanism is introduced at the output end of the algorithm framework, i.e. an attention-based decoding method is used. This lets the model's decoder attend to the encoded content at specific positions instead of using the entire encoded content as the basis for decoding, which improves the decoding effect of the model and increases the robustness of the system.
The decoder is a 3-layer cascaded gated recurrent unit (GRU). Conventional decoding feeds the phoneme-unit classification result directly into the decoder for training to obtain the mouth-action recognition result. Decoding with attention instead obtains the hidden state of the phoneme units at each moment through attention, scores each hidden state with an additive function, and obtains the attention weights through a softmax layer. The hidden states of the phoneme units are aggregated by a weighted sum with the attention weights to obtain a context vector, and the context vector is input into the decoder for joint training to obtain the mouth-action recognition result. Applying attention during decoding lets the decoder use different phoneme-unit recognition results at each moment, so the decoding process can selectively focus on the useful parts of the phoneme recognition result; this improves the decoding effect and makes long-sentence recognition work better. Without the attention mechanism, the phoneme-unit recognition result would be converted word by word, in order, into the corresponding Chinese characters; for long sentences, earlier conversion results may be forgotten during the conversion, causing semantic errors and reducing recognition accuracy.
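The sketch below outlines one decoding step of such an attention decoder in PyTorch: additive scoring of the encoder hidden states against the decoder state, a softmax to obtain attention weights, a weighted sum to form the context vector, and a 3-layer GRU decoder as described above. The dimensions, the concatenation of the context vector with the previous output token, and the single-step interface are assumptions made for the example.

```python
import torch
import torch.nn as nn

class AttentionDecoder(nn.Module):
    """Additive-attention decoder over the encoder (phoneme-unit) hidden states,
    with a 3-layer GRU as described in the embodiment."""
    def __init__(self, enc_dim=512, dec_dim=512, vocab=60):
        super().__init__()
        self.score_enc = nn.Linear(enc_dim, dec_dim, bias=False)
        self.score_dec = nn.Linear(dec_dim, dec_dim, bias=False)
        self.v = nn.Linear(dec_dim, 1, bias=False)           # additive scoring function
        self.gru = nn.GRU(enc_dim + vocab, dec_dim, num_layers=3, batch_first=True)
        self.out = nn.Linear(dec_dim, vocab)

    def step(self, prev_token, dec_hidden, enc_states):
        # enc_states: (B, T, enc_dim); dec_hidden: (3, B, dec_dim); prev_token: (B, vocab)
        query = dec_hidden[-1].unsqueeze(1)                                  # (B, 1, dec_dim)
        scores = self.v(torch.tanh(self.score_enc(enc_states)
                                   + self.score_dec(query)))                # (B, T, 1)
        attn = scores.softmax(dim=1)                                         # attention weights
        context = (attn * enc_states).sum(dim=1, keepdim=True)               # (B, 1, enc_dim)
        rnn_in = torch.cat([context, prev_token.unsqueeze(1)], dim=-1)       # (B, 1, enc+vocab)
        out, dec_hidden = self.gru(rnn_in, dec_hidden)
        return self.out(out.squeeze(1)), dec_hidden                          # logits, new state

# One decoding step on dummy encoder states
dec = AttentionDecoder()
enc = torch.randn(2, 29, 512)                  # hidden states of the phoneme units
hidden = torch.zeros(3, 2, 512)                # initial decoder state
token = torch.zeros(2, 60)                     # start token (one-hot style)
logits, hidden = dec.step(token, hidden, enc)  # logits over phoneme units
```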
This embodiment further discloses a Chinese mouth-action recognition device based on the hybrid convolutional neural network, comprising an image acquisition unit, a lip detection unit, a lip feature extraction unit and a mouth-action recognition unit. The image acquisition unit acquires the speaker's face image information; the lip detection unit detects and crops the lip image sequence from the face image information provided by the image acquisition unit; the lip feature extraction unit extracts lip features from the lip image sequence with the hybrid convolutional neural network; and the mouth-action recognition unit feeds the lip features extracted by the lip feature extraction unit into the Bi-GRU model to obtain the recognition probability result of the phoneme units, passes it through the connectionist temporal classification (CTC) classifier to obtain the phoneme-unit classification result, and then processes that classification result with the attention-based decoding method to obtain the mouth-action recognition result. Whether the device needs to be woken is then judged from the phonemes obtained by speech recognition and the phonemes obtained by mouth-action recognition.
Although the present application has been described with reference to a few embodiments, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the application as defined by the appended claims.

Claims (10)

1. A method for waking up an intelligent device is characterized by comprising the following steps:
acquiring a voice instruction signal, and determining a first phoneme value corresponding to the voice instruction signal;
acquiring a mouth image sequence of a person who sends the voice instruction signal, and determining a second phoneme value corresponding to the mouth image sequence;
and calculating, separately, the similarity between each of the first phoneme value and the second phoneme value and the phoneme value corresponding to a set wake-up word, and waking the smart device when at least one of the similarities is greater than a set similarity threshold.
2. The smart device wake-up method according to claim 1, wherein the acquiring of the voice instruction signal specifically comprises:
obtaining noise data x(n) of the current environment that does not contain the voice instruction signal;
obtaining voice data d(n) of the current environment that contains the voice instruction signal;
and determining the voice instruction signal based on the noise data x(n) and the voice data d(n).
3. The smart device wake-up method according to claim 2, wherein determining the voice instruction signal based on the noise data x(n) and the voice data d(n) specifically comprises:
processing the noise data x(n) with a short-time Fourier transform to obtain a noise frequency-domain signal X(l,k);
processing the voice data d(n) with a short-time Fourier transform to obtain a human-voice frequency-domain signal D(l,k);
and determining, from the noise frequency-domain signal X(l,k) and the human-voice frequency-domain signal D(l,k), the frequency-domain signal S(l,k) corresponding to the voice instruction signal;
correspondingly, determining the first phoneme value corresponding to the voice instruction signal specifically means determining the first phoneme value corresponding to the frequency-domain signal S(l,k) of the voice instruction signal.
4. The smart device wake-up method according to claim 3, wherein determining the frequency-domain signal S(l,k) corresponding to the voice instruction signal from the noise frequency-domain signal X(l,k) and the human-voice frequency-domain signal D(l,k) specifically comprises:
determining the frequency-domain signal S(l,k) corresponding to the voice instruction signal with a first formula (a frequency-domain NLMS update):
S(l,k) = D(l,k) - Σ_{i=0..ORD-1} W_i(l,k) · X(l-i,k)
where l is the frame index, k is the frequency index (k = 0, ..., N-1, with N the number of points of the short-time Fourier transform), and W_i(l,k) are the filter coefficients, which are updated as
W_i(l+1,k) = W_i(l,k) + μ · X*(l-i,k) · S(l,k) / Σ_{j=0..ORD-1} |X(l-j,k)|²
where μ is the step-size adjustment factor, * denotes the complex conjugate, X(l-i,k) are the buffered values of the noise frequency-domain signal X(l,k), and ORD is the number of buffered frames.
5. The intelligent device wake-up method according to any one of claims 1 to 4, wherein the determining of the second phoneme value corresponding to the mouth image sequence specifically includes:
extracting mouth features from the mouth image sequence;
determining a recognition probability result of the phoneme units by using the mouth features;
inputting the recognition probability result of the phoneme units into a connectionist temporal classification (CTC) classifier to obtain a classification result of the phoneme units;
and decoding the classification result of the phoneme unit by adopting a decoding method introducing an attention mechanism to obtain a second phoneme value corresponding to the mouth image sequence.
6. The smart device wake-up method according to claim 5, wherein the extracting mouth features from the mouth image sequence is specifically:
extracting the spatial feature of the mouth movement from the mouth image sequence by using a 2D convolutional neural network to obtain the spatial feature information of the mouth movement;
extracting the time characteristics of mouth movement from the mouth image sequence by using a 1D convolutional neural network to obtain time domain characteristic information of the mouth movement;
fusing the time domain characteristic information and the space characteristic information by utilizing a multi-space-time information fusion residual error network to obtain the fused mouth characteristic;
correspondingly, determining the recognition probability result of the phoneme units by using the mouth features specifically means:
determining the recognition probability result of the phoneme units by using the fused mouth features.
7. The smart device wake-up method according to claim 5, wherein the determining of the recognition probability result of the phoneme units by using the mouth features specifically comprises:
and inputting the mouth features into a Bi-GRU model to obtain a recognition probability result of the phoneme unit.
8. The intelligent device wake-up method according to claim 5, wherein the decoding method with the attention mechanism is used to decode the classification result of the phoneme unit to obtain a second phoneme value corresponding to the mouth image sequence, and specifically includes:
obtaining the hidden state of each moment of the phoneme unit in the classification result of the phoneme unit through attention;
obtaining a score for each of the hidden states;
normalizing the scores to obtain attention weights;
aggregating the hidden states with a weighted sum using the attention weights to obtain a context vector;
and inputting the context vector into the decoder for joint training to obtain a second phoneme value corresponding to the mouth image sequence.
9. A smart device, comprising:
the first phoneme determining module is used for acquiring a voice instruction signal and determining a first phoneme value corresponding to the voice instruction signal;
the second phoneme determining module is used for acquiring a mouth image sequence of a person who sends the voice instruction signal and determining a second phoneme value corresponding to the mouth image sequence;
and the wake-up module is used for calculating, separately, the similarity between each of the first phoneme value and the second phoneme value and the phoneme value corresponding to a set wake-up word, and for waking the smart device when at least one of the similarities is greater than a set similarity threshold.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 8.
CN202110019710.9A 2021-01-07 2021-01-07 Smart device wake-up method, smart device and computer-readable storage medium Pending CN112382277A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110019710.9A CN112382277A (en) 2021-01-07 2021-01-07 Smart device wake-up method, smart device and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110019710.9A CN112382277A (en) 2021-01-07 2021-01-07 Smart device wake-up method, smart device and computer-readable storage medium

Publications (1)

Publication Number Publication Date
CN112382277A true CN112382277A (en) 2021-02-19

Family

ID=74590169

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110019710.9A Pending CN112382277A (en) 2021-01-07 2021-01-07 Smart device wake-up method, smart device and computer-readable storage medium

Country Status (1)

Country Link
CN (1) CN112382277A (en)


Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS63303550A (en) * 1987-06-04 1988-12-12 Ricoh Co Ltd Voice recognizing device
CN105206282A (en) * 2015-09-24 2015-12-30 深圳市冠旭电子有限公司 Noise acquisition method and device
CN106485214A (en) * 2016-09-28 2017-03-08 天津工业大学 A kind of eyes based on convolutional neural networks and mouth state identification method
CN108281137A (en) * 2017-01-03 2018-07-13 中国科学院声学研究所 A kind of universal phonetic under whole tone element frame wakes up recognition methods and system
CN108804453A (en) * 2017-04-28 2018-11-13 上海荆虹电子科技有限公司 A kind of video and audio recognition methods and device
CN107945789A (en) * 2017-12-28 2018-04-20 努比亚技术有限公司 Audio recognition method, device and computer-readable recording medium
CN108711430A (en) * 2018-04-28 2018-10-26 广东美的制冷设备有限公司 Audio recognition method, smart machine and storage medium
CN109977811A (en) * 2019-03-12 2019-07-05 四川长虹电器股份有限公司 The system and method for exempting from voice wake-up is realized based on the detection of mouth key position feature
CN110570862A (en) * 2019-10-09 2019-12-13 三星电子(中国)研发中心 voice recognition method and intelligent voice engine device
CN111028842A (en) * 2019-12-10 2020-04-17 上海芯翌智能科技有限公司 Method and equipment for triggering voice interaction response
CN111401250A (en) * 2020-03-17 2020-07-10 东北大学 Chinese lip language identification method and device based on hybrid convolutional neural network
CN111445918A (en) * 2020-03-23 2020-07-24 深圳市友杰智新科技有限公司 Method and device for reducing false awakening of intelligent voice equipment and computer equipment
CN111739534A (en) * 2020-06-04 2020-10-02 广东小天才科技有限公司 Processing method and device for assisting speech recognition, electronic equipment and storage medium
CN111833896A (en) * 2020-07-24 2020-10-27 北京声加科技有限公司 Voice enhancement method, system, device and storage medium for fusing feedback signals
CN111968662A (en) * 2020-08-10 2020-11-20 北京小米松果电子有限公司 Audio signal processing method and device and storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113053377A (en) * 2021-03-23 2021-06-29 南京地平线机器人技术有限公司 Voice wake-up method and device, computer readable storage medium and electronic equipment
CN116110393A (en) * 2023-02-01 2023-05-12 镁佳(北京)科技有限公司 Voice similarity-based refusing method, device, computer and medium
CN116110393B (en) * 2023-02-01 2024-01-23 镁佳(北京)科技有限公司 Voice similarity-based refusing method, device, computer and medium
CN117672228A (en) * 2023-12-06 2024-03-08 山东凌晓通信科技有限公司 Intelligent voice interaction false wake-up system and method based on machine learning


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 2021-02-19)