CN115440196A - Voice recognition method, device, medium and equipment based on user facial expression - Google Patents
- Publication number
- CN115440196A (application number CN202211163199.0A)
- Authority
- CN
- China
- Prior art keywords
- user
- voice
- feature
- target
- preset
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
- G06F40/35—Discourse or dialogue representation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
- G06V10/761—Proximity, similarity or dissimilarity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/809—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data
- G06V10/811—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data the classifiers operating on different input data, e.g. multi-modal recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/172—Classification, e.g. identification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/174—Facial expression recognition
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/223—Execution procedure of a spoken command
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Multimedia (AREA)
- Theoretical Computer Science (AREA)
- Human Computer Interaction (AREA)
- General Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Acoustics & Sound (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- Medical Informatics (AREA)
- Databases & Information Systems (AREA)
- Software Systems (AREA)
- Computing Systems (AREA)
- Signal Processing (AREA)
- Oral & Maxillofacial Surgery (AREA)
- General Engineering & Computer Science (AREA)
- Child & Adolescent Psychology (AREA)
- Hospice & Palliative Care (AREA)
- Psychiatry (AREA)
- Image Analysis (AREA)
Abstract
The present disclosure provides a voice recognition method, apparatus, medium, and device based on a user's facial expression. The method comprises: determining, according to a recognition model, how the facial feature points of a target user in a monitoring environment change within a preset time period, so as to generate a facial dynamic feature image; matching a plurality of feature-region dynamic sub-images against preset dynamic sub-images of the corresponding feature regions to determine an emotion label for the target user; collecting audio data of the target user in the monitoring environment within the preset time period to generate the user voice corresponding to the target user; training a voice recognition model according to the emotion label; and performing semantic recognition on the user voice through the trained voice recognition model to generate semantic information corresponding to the target user. In this way, an intelligent device can recognize the user intention behind the user voice more accurately, the accuracy of voice recognition is improved, and a better product experience is provided to the user.
Description
Technical Field
The present invention relates to the field of speech recognition technologies, and in particular, to a speech recognition method, apparatus, medium, and device based on user facial expressions.
Background
In the prior art, user voice is merely transcribed into text, and the user intention is recognized from the text semantics alone. In a human-computer voice dialogue scenario, however, the user intention obtained purely from semantic analysis is often inaccurate, which seriously degrades the subsequent interaction between the intelligent device and the user and results in a poor user experience.
Disclosure of Invention
In view of the above, an object of the present disclosure is to provide a method, an apparatus, a medium, and a device for speech recognition based on facial expressions of a user, so as to solve the technical problem of inaccurate speech recognition in the related art.
To that end, a first aspect of the present disclosure provides a speech recognition method based on facial expressions of a user, the method comprising:
acquiring thermal images in a monitoring environment through an infrared acquisition device; in the case that an image recognition model determines that a human face exists in the monitoring environment, determining the facial feature points of the target user corresponding to the human face according to a feature recognition algorithm; and cyclically executing the following steps, based on a preset distribution rule of the facial feature points, until the facial feature points of the target user in the monitoring environment are determined to have changed: selecting frame images of a corresponding duration from an initial dynamic image according to a preset target duration to generate a facial dynamic feature image corresponding to the target user; matching the facial dynamic feature image against a preset standard dynamic image to generate a comparison result, and judging from the comparison result whether the matching succeeds; if the matching succeeds, judging that the target user exhibits no emotional fluctuation within the frame images of that duration, extending the target duration, and re-acquiring frame images of the corresponding duration; if the matching fails, extracting the frame images of that duration to generate the facial dynamic feature image corresponding to the target user, and segmenting the facial dynamic feature image, based on the preset distribution rule and a plurality of feature regions of the human face, into a plurality of feature-region dynamic sub-images corresponding to the plurality of feature regions, wherein the plurality of feature regions comprise at least an eye feature region, a nose feature region, and a mouth feature region;
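To make the segmentation step concrete, a minimal Python sketch follows; the fractional bounding boxes in `REGION_RULE` are hypothetical stand-ins for the patent's preset distribution rule of facial feature points, and the frame data is random for illustration:

```python
import numpy as np

# Hypothetical "preset distribution rule": each feature region is a fixed
# fractional bounding box (top, bottom, left, right) of the face image.
REGION_RULE = {
    "eyes":  (0.25, 0.50, 0.125, 0.875),
    "nose":  (0.375, 0.75, 0.25, 0.75),
    "mouth": (0.625, 0.875, 0.25, 0.75),
}

def segment_face(face_img: np.ndarray) -> dict:
    """Cut one H x W frame into per-region sub-images."""
    h, w = face_img.shape[:2]
    return {name: face_img[int(t * h):int(b * h), int(l * w):int(r * w)]
            for name, (t, b, l, r) in REGION_RULE.items()}

# A "feature-region dynamic sub-image" is then the stack of one region's
# crops across all frames of the facial dynamic feature image.
frames = np.random.rand(5, 100, 80)  # 5 frames of the facial dynamic image
dynamic_sub_images = {name: np.stack([segment_face(f)[name] for f in frames])
                      for name in REGION_RULE}
```

Each entry of `dynamic_sub_images` can then be matched against the preset dynamic sub-image of the same region.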
matching the plurality of feature-region dynamic sub-images against a plurality of preset dynamic sub-images corresponding to the plurality of feature regions; determining a plurality of expression recognition results corresponding to the feature-region dynamic sub-images; and fusing the plurality of expression recognition results according to preset weights to determine the emotion label corresponding to the target user, wherein each expression recognition result characterizes an emotion label for the target user, and the preset weights are set according to how strongly each feature region expresses its emotion label;
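The weighted fusion step can be sketched as follows; the patent specifies no concrete weights, so the values in `REGION_WEIGHTS` (mouth strongest) and the confidence scores are purely illustrative:

```python
# Hypothetical per-region weights reflecting the "strength relation" of how
# each feature region expresses emotion.
REGION_WEIGHTS = {"eyes": 0.35, "nose": 0.15, "mouth": 0.50}

def fuse_emotion(region_scores: dict) -> str:
    """Fuse per-region expression results (label -> confidence) by weight."""
    fused = {}
    for region, scores in region_scores.items():
        for emotion_label, confidence in scores.items():
            fused[emotion_label] = (fused.get(emotion_label, 0.0)
                                    + REGION_WEIGHTS[region] * confidence)
    return max(fused, key=fused.get)  # highest weighted score wins

label = fuse_emotion({
    "eyes":  {"happy": 0.7, "neutral": 0.3},
    "nose":  {"happy": 0.4, "neutral": 0.6},
    "mouth": {"happy": 0.9, "neutral": 0.1},
})
```

Here the mouth region dominates the fusion, so the emotion label follows its strongest expression result.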
collecting audio data of the target user in the monitoring environment within the preset time period to generate target audio data; identifying the user voice frequency band corresponding to the target user in the target audio data; performing noise reduction on the target audio data according to the user voice frequency band; and performing voice extraction on the noise-reduced target audio data according to a set voice feature to generate the user voice corresponding to the target user, wherein the user voice, collected through a microphone, is used to issue control instructions to an intelligent terminal;
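One simple way to realize frequency-band noise reduction is to zero all spectral components outside the identified voice band; this FFT-mask sketch is an assumption, since the patent does not name a filtering technique, and the signal frequencies below are synthetic:

```python
import numpy as np

def bandpass_denoise(audio: np.ndarray, sr: int, lo: float, hi: float) -> np.ndarray:
    """Zero every spectral component outside the [lo, hi] Hz voice band."""
    spectrum = np.fft.rfft(audio)
    freqs = np.fft.rfftfreq(audio.size, d=1.0 / sr)
    spectrum[(freqs < lo) | (freqs > hi)] = 0.0
    return np.fft.irfft(spectrum, n=audio.size)

sr = 8000
t = np.arange(sr) / sr                       # one second of audio
voice = np.sin(2 * np.pi * 220 * t)          # component inside the voice band
hum = np.sin(2 * np.pi * 50 * t)             # environmental mains hum
clean = bandpass_denoise(voice + hum, sr, lo=80.0, hi=1000.0)
```

With the band set to 80 to 1000 Hz, the 50 Hz environmental component is removed while the 220 Hz voice component passes through essentially unchanged.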
screening initial sample voice data corresponding to the emotion label out of an initial database; adding the initial sample voice data to a sample training set of a voice recognition model; performing recognition training on the voice recognition model based on the sample training set; and performing semantic recognition on the user voice through the trained voice recognition model to generate the semantic information corresponding to the target user, wherein the initial database comprises mapping relations between a plurality of initial sample voice data and a plurality of emotion labels.
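The screening step can be sketched minimally as a filter over label-annotated records; the records and field names below are invented for illustration and do not come from the patent:

```python
# Minimal sketch of the screening step: the "initial database" is modeled as
# (utterance, emotion_label, emotion_semantics) records.
INITIAL_DB = [
    ("turn it up", "happy", "raise the volume, user is pleased"),
    ("turn it up", "angry", "raise the volume, user is irritated"),
    ("stop",       "angry", "halt playback immediately"),
    ("play music", "happy", "start playback"),
]

def screen_samples(db, emotion_label, limit=None):
    """Select the samples whose stored label matches the detected emotion."""
    matches = [(text, semantics) for text, label, semantics in db
               if label == emotion_label]
    return matches[:limit] if limit is not None else matches

# Only the "angry" samples enter the training set for an angry user.
train_set = screen_samples(INITIAL_DB, "angry")
```

Note that the same utterance can map to different semantics under different emotion labels, which is exactly why the emotion-conditioned screening improves recognition of user intent.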
Further, performing noise reduction on the target audio data according to the user voice frequency band, and performing voice extraction on the noise-reduced target audio data according to the set voice feature to generate the user voice corresponding to the target user, includes:
analyzing the user voice in the target audio data according to the historical user voice corresponding to the target user so as to generate the user voice frequency band and the environmental audio according to the target audio data;
and performing noise reduction processing on the target audio data based on the user voice frequency band to remove the environmental audio in the target audio data, and performing topology recovery on the processed target audio data to generate the user voice corresponding to the target user.
Further, screening initial sample voice data corresponding to the emotion label out of an initial database, adding the initial sample voice data to the sample training set of the voice recognition model, performing recognition training on the voice recognition model based on the sample training set, and performing semantic recognition on the user voice through the trained voice recognition model to generate the semantic information corresponding to the target user, includes:
screening the initial sample voice data in the initial database based on the emotion tags to obtain a preset number of first sample voice data and corresponding first emotion semantics, wherein the first emotion semantics are semantic information of the first sample voice data under the emotion tags;
performing feature extraction on the first sample voice data through a feature extraction network of the voice recognition model to generate a feature vector corresponding to the first sample voice data, performing semantic recognition on the feature vector through a fully-connected neural network of the voice recognition model to generate target semantic information, and updating the voice recognition model according to the first emotion semantic under the condition that the target semantic information is inconsistent with the first emotion semantic;
and performing semantic recognition on the user voice based on the updated voice recognition model to generate the semantic information corresponding to the target user.
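The two-stage recognition described above (a feature extraction network followed by a fully connected network) can be sketched minimally; the toy weights, semantic vocabulary, and `recognize` helper below are hypothetical stand-ins for the patent's trained networks:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the model's two stages.
W_feat = rng.normal(size=(64, 16))   # "feature extraction network" weights
W_cls = rng.normal(size=(16, 3))     # "fully connected neural network" weights
SEMANTICS = ["raise volume", "halt playback", "start playback"]  # toy vocabulary

def recognize(sample_features: np.ndarray) -> str:
    """Map raw sample features to semantic information via the two stages."""
    feature_vector = np.tanh(sample_features @ W_feat)  # feature extraction
    logits = feature_vector @ W_cls                     # semantic recognition
    return SEMANTICS[int(np.argmax(logits))]

pred = recognize(rng.normal(size=64))
```

When the predicted semantics disagree with the first emotion semantics of a training sample, the weights would be updated (for example by gradient descent); that update step is omitted from this sketch.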
Further, matching the plurality of feature-region dynamic sub-images against the plurality of preset dynamic sub-images corresponding to the plurality of feature regions, determining the plurality of expression recognition results corresponding to the feature-region dynamic sub-images, and fusing the plurality of expression recognition results according to the preset weights to determine the emotion label corresponding to the target user, includes:
normalizing the dynamic sub-image of each feature region to generate grayscale sub-images of the same size;

identifying each grayscale sub-image and determining the feature region to which it corresponds;

acquiring a plurality of preset dynamic sub-images corresponding to that feature region, and matching the preset dynamic sub-images against the grayscale sub-image to determine the similarity between each preset dynamic sub-image and the grayscale sub-image, wherein each preset dynamic sub-image corresponds to a preset expression recognition result;

and determining, as the expression recognition result, the target expression recognition result corresponding to the preset dynamic sub-image with the greatest similarity.
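The normalization-and-matching steps above can be illustrated with a cosine-style similarity over mean-subtracted, same-size sub-images; the `normalize` sizing and the similarity measure are illustrative choices, since the patent does not fix a specific metric:

```python
import numpy as np

def normalize(img: np.ndarray, size=(8, 8)) -> np.ndarray:
    """Bring a sub-image to a common size (crop/zero-pad) and zero mean."""
    out = np.zeros(size)
    h, w = min(img.shape[0], size[0]), min(img.shape[1], size[1])
    out[:h, :w] = img[:h, :w]
    return out - out.mean()

def best_match(sub_img: np.ndarray, templates: dict) -> str:
    """Pick the preset expression result whose template is most similar."""
    g = normalize(sub_img)

    def similarity(t: np.ndarray) -> float:
        tg = normalize(t)
        denom = np.linalg.norm(g) * np.linalg.norm(tg)
        return float((g * tg).sum() / denom) if denom else 0.0

    # Each key names a preset expression recognition result; the one with
    # the greatest similarity is returned.
    return max(templates, key=lambda k: similarity(templates[k]))

templates = {"smile": np.eye(8), "frown": np.flipud(np.eye(8))}
result = best_match(np.eye(8), templates)
```

A query identical to the "smile" template scores a similarity of 1 against it and much lower against the "frown" template, so "smile" is selected.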
A second aspect of the present disclosure provides a voice recognition apparatus based on a facial expression of a user, the apparatus including:
the first generation module is configured to acquire thermal images in a monitoring environment through an infrared acquisition device; determine, in the case that an image recognition model determines that a human face exists in the monitoring environment, the facial feature points of the target user corresponding to the human face according to a feature recognition algorithm; and cyclically execute the following steps, based on a preset distribution rule of the facial feature points, until the facial feature points of the target user in the monitoring environment are determined to have changed: selecting frame images of a corresponding duration from an initial dynamic image according to a preset target duration to generate a facial dynamic feature image corresponding to the target user; matching the facial dynamic feature image against a preset standard dynamic image to generate a comparison result, and judging from the comparison result whether the matching succeeds; if the matching succeeds, judging that the target user exhibits no emotional fluctuation within the frame images of that duration, extending the target duration, and re-acquiring frame images of the corresponding duration; if the matching fails, extracting the frame images of that duration to generate the facial dynamic feature image corresponding to the target user, and segmenting the facial dynamic feature image, based on the preset distribution rule and a plurality of feature regions of the human face, into a plurality of feature-region dynamic sub-images corresponding to the plurality of feature regions, wherein the plurality of feature regions comprise at least an eye feature region, a nose feature region, and a mouth feature region;

the determining module is configured to match the plurality of feature-region dynamic sub-images against a plurality of preset dynamic sub-images corresponding to the plurality of feature regions, determine a plurality of expression recognition results corresponding to the feature-region dynamic sub-images, and fuse the plurality of expression recognition results according to preset weights to determine the emotion label corresponding to the target user, wherein each expression recognition result characterizes an emotion label for the target user, and the preset weights are set according to how strongly each feature region expresses its emotion label;

the second generation module is configured to collect audio data of the target user in the monitoring environment within the preset time period to generate target audio data, identify the user voice frequency band corresponding to the target user in the target audio data, perform noise reduction on the target audio data according to the user voice frequency band, and perform voice extraction on the noise-reduced target audio data according to a set voice feature to generate the user voice corresponding to the target user, wherein the user voice, collected through a microphone, is used to issue control instructions to the intelligent terminal;

and the third generation module is configured to screen initial sample voice data corresponding to the emotion label out of an initial database, add the initial sample voice data to a sample training set of a voice recognition model, perform recognition training on the voice recognition model based on the sample training set, and perform semantic recognition on the user voice through the trained voice recognition model to generate the semantic information corresponding to the target user, wherein the initial database comprises mapping relations between a plurality of initial sample voice data and a plurality of emotion labels.
Further, the second generating module may be further configured to:
analyzing the user voice in the target audio data according to the historical user voice corresponding to the target user so as to generate the user voice frequency band and the environmental audio according to the target audio data;
and denoising the target audio data based on the user voice frequency band, removing the environmental audio in the target audio data, and performing topology recovery on the processed target audio data to generate the user voice corresponding to the target user.
A third aspect of the present disclosure provides a computer storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method for speech recognition based on facial expressions of a user according to any one of the first aspect.
A fourth aspect of the present disclosure provides an electronic device comprising a processor and a memory storing a computer program which, when executed by the processor, performs the steps of the speech recognition method based on facial expressions of a user according to any one of the first aspect.
The present disclosure can achieve at least the following advantageous effects:
collecting how the facial feature points of the target user in the monitoring environment change within a preset time period to generate a facial dynamic feature image corresponding to the target user, and segmenting the facial dynamic feature image into a plurality of feature-region dynamic sub-images; matching the feature-region dynamic sub-images against the preset dynamic sub-images of the corresponding feature regions to determine the emotion label corresponding to the target user; collecting audio data of the target user in the monitoring environment within the preset time period, and performing voice extraction on the noise-reduced target audio data according to the voice feature to generate the user voice corresponding to the target user; screening initial sample voice data corresponding to the emotion label out of an initial database, adding it to the sample training set of the voice recognition model, performing recognition training on the voice recognition model based on the sample training set, and performing semantic recognition on the user voice through the trained voice recognition model to generate the semantic information corresponding to the target user. In this way, an emotion label is generated by judging the facial emotion of the user, the voice recognition model is trained according to that emotion label, and the trained model recognizes the semantics of the user voice, so that the intelligent device can recognize the user intention behind the user voice more accurately, the accuracy of voice recognition is improved, and a better product experience is provided to the user.
Drawings
Fig. 1 is a flowchart of a speech recognition method based on facial expressions of a user according to an embodiment of the present disclosure.
Fig. 2 is a block diagram of a speech recognition apparatus based on facial expressions of a user in an embodiment of the present disclosure.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein.
In the description of the present invention, it is to be understood that the terms "central," "longitudinal," "lateral," "length," "width," "thickness," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," "clockwise," "counterclockwise," "axial," "radial," "circumferential," and the like are used in the orientations and positional relationships indicated in the drawings for convenience in describing the invention and to simplify the description, and are not intended to indicate or imply that the referenced device or element must have a particular orientation, be constructed and operated in a particular orientation, and are not to be considered limiting of the invention.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
In the present invention, unless otherwise expressly stated or limited, the terms "mounted," "connected," "secured," and the like are to be construed broadly and can, for example, be fixedly connected, detachably connected, or integrally formed; can be mechanically or electrically connected; they may be directly connected or indirectly connected through intervening media, or they may be connected internally or in any other suitable relationship, unless expressly stated otherwise. The specific meanings of the above terms in the present invention can be understood according to specific situations by those of ordinary skill in the art.
In the present invention, unless expressly stated or limited otherwise, a first feature being "on" or "under" a second feature may mean that the first feature is in direct contact with the second feature, or that the first and second features are in indirect contact through an intermediate medium. Moreover, a first feature being "on," "above," or "over" a second feature may mean that the first feature is directly or obliquely above the second feature, or simply that the first feature is at a higher level than the second feature. A first feature being "under," "below," or "beneath" a second feature may mean that the first feature is directly or obliquely beneath the second feature, or simply that the first feature is at a lower level than the second feature.
It will be understood that when an element is referred to as being "secured to" or "disposed on" another element, it can be directly on the other element or intervening elements may also be present. When an element is referred to as being "connected" to another element, it can be directly connected to the other element or intervening elements may also be present. As used herein, the terms "vertical," "horizontal," "upper," "lower," "left," "right," and the like are for purposes of illustration only and are not intended to represent the only possible embodiments.
Fig. 1 is a flowchart of a speech recognition method based on facial expressions of a user in an embodiment of the present disclosure, as shown in fig. 1, the method includes the following steps:
in step S11, determining the change situation of the facial feature points of the target user in the monitoring environment within a preset time period according to an image recognition model to generate a facial dynamic feature image;
the method comprises the following steps of acquiring a thermal image in a monitoring environment through an infrared acquisition device, determining facial feature points of a target user corresponding to a human face according to a feature recognition algorithm under the condition that human face features exist in the monitoring environment based on an image recognition model, and circularly executing the following steps based on the preset distribution rules of the facial feature points until the facial feature points of the target user in the monitoring environment are determined to be changed: selecting a frame image with corresponding duration from an initial dynamic image according to preset target duration to generate a face dynamic feature image corresponding to a target user, matching the face dynamic feature image with a preset standard dynamic image to generate a comparison result, judging whether the matching is successful according to the comparison result, judging that the target user does not generate emotion fluctuation in the frame image with the corresponding duration if the matching is successful, prolonging the used target duration, re-obtaining the frame image with the corresponding duration, extracting the frame image with the duration if the matching is unsuccessful, generating the face dynamic feature image corresponding to the target user, segmenting the face dynamic feature image based on a preset distribution rule and a plurality of feature areas corresponding to the human face, and generating a plurality of feature area dynamic sub-images corresponding to the plurality of feature areas, wherein the plurality of feature areas at least comprise an eye feature area, a nose feature area and a mouth feature area.
In step S12, matching the plurality of feature region dynamic sub-images with a plurality of preset dynamic sub-images corresponding to the plurality of feature regions, and determining an emotion tag corresponding to the target user;
the expression recognition result is used for representing the emotion label corresponding to the target user, the preset weight is set according to the strength relation of the representation emotion label of each feature area, and the emotion label is determined by fusing a plurality of expression recognition results according to the preset weight and the expression recognition result corresponding to the dynamic sub-image of each feature area.
In step S13, collecting audio data of the target user in the monitoring environment within the preset time period, and generating the user voice corresponding to the target user;
Specifically, audio data of the target user in the monitoring environment within the preset time period is collected to generate target audio data; the user voice frequency band in the target audio data is identified; noise reduction processing is performed on the target audio data according to the user voice frequency band; and voice extraction is performed on the noise-reduced target audio data according to voice features to generate the user voice corresponding to the target user. The user voice, collected through a microphone, is used to issue a control instruction to the intelligent terminal.
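A minimal sketch of noise reduction restricted to the user's voice frequency band is shown below. The FFT-mask approach and the band limits are illustrative assumptions; the disclosure does not specify the noise-reduction algorithm.

```python
# Crude frequency-band noise reduction: zero out FFT bins outside the
# identified user voice band. Band limits here are assumed, not from the
# disclosure.
import numpy as np

def bandpass_denoise(audio, fs, low_hz, high_hz):
    """Keep only the spectral content of `audio` inside [low_hz, high_hz]."""
    spectrum = np.fft.rfft(audio)
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / fs)
    mask = (freqs >= low_hz) & (freqs <= high_hz)
    return np.fft.irfft(spectrum * mask, n=len(audio))
```

Voice extraction by voice features (e.g., speaker characteristics) would then run on the denoised signal.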
In step S14, a voice recognition model is trained according to the emotion labels, and the trained voice recognition model is used for carrying out semantic recognition on the user voice to generate semantic information corresponding to the target user;
Specifically, initial sample voice data corresponding to the emotion label is screened out from an initial database and added to a sample training set of a voice recognition model; recognition training is performed on the voice recognition model based on the sample training set; and semantic recognition is performed on the user voice through the trained voice recognition model to generate the semantic information corresponding to the target user. The initial database comprises mapping relationships between a plurality of groups of initial sample voice data and a plurality of emotion labels.
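The screening of initial sample voice data by emotion label can be sketched as follows; modeling the initial database as a list of (voice, label) pairs is an assumed data layout, since the disclosure only specifies that a mapping between samples and emotion labels exists.

```python
# Sketch of the sample-screening step: select from the "initial database"
# every sample mapped to the given emotion label, and add it to the
# training set for the voice recognition model.

def screen_samples(database, emotion_label, training_set):
    """Append to `training_set` every sample mapped to `emotion_label`."""
    matched = [voice for voice, label in database if label == emotion_label]
    training_set.extend(matched)
    return training_set
```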
By adopting the above technical scheme, the change of the facial feature points of the target user in the monitoring environment within a preset time period is collected to generate a facial dynamic feature image corresponding to the target user, and the facial dynamic feature image is segmented to generate a plurality of feature region dynamic sub-images. The feature region dynamic sub-images are matched with the preset dynamic sub-images of the corresponding feature regions to determine an emotion label for the target user. Audio data of the target user in the monitoring environment within the preset time period is collected, noise reduction is applied, and voice extraction is performed according to voice features to generate the user voice corresponding to the target user. Initial sample voice data corresponding to the emotion label is screened from an initial database and added to the sample training set of a voice recognition model; the model is trained on that set; and the trained model performs semantic recognition on the user voice to generate the semantic information corresponding to the target user. In this way, an emotion label is produced by judging the user's facial emotion, the voice recognition model is trained according to that label, and the trained model recognizes the semantics of the user's voice, so that an intelligent device can more accurately identify the user intention behind the voice, improving the accuracy of voice recognition and giving the user a better product experience.
Further, the step S13 includes:
analyzing the user voice in the target audio data according to the historical user voice corresponding to the target user so as to generate the user voice frequency band and the environmental audio according to the target audio data;
and performing noise reduction processing on the target audio data based on the user voice frequency band, removing the environmental audio in the target audio data, and performing topology recovery on the processed target audio data to generate the user voice corresponding to the target user.
Further, the step S14 includes:
screening initial sample voice data in the initial database based on the emotion tags to obtain a preset number of first sample voice data and corresponding first emotion semantics, wherein the first emotion semantics are semantic information of the first sample voice data under the emotion tags;
performing feature extraction on the first sample voice data through a feature extraction network of the voice recognition model to generate a feature vector corresponding to the first sample voice data, performing semantic recognition on the feature vector through a full-connection neural network of the voice recognition model to generate target semantic information, and updating the voice recognition model according to the first emotion semantics under the condition that the target semantic information is determined to be inconsistent with the first emotion semantics;
and performing semantic recognition on the user voice based on the updated voice recognition model to generate the semantic information corresponding to the target user.
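The recognize-and-update training check in the steps above can be sketched with a toy model; here a plain dictionary stands in for the feature-extraction network and fully-connected neural network, which is an assumption for illustration only.

```python
# Illustrative training-check loop: run each first sample through a
# (stubbed) recognizer and update the model whenever the predicted
# semantics disagree with the labelled first emotion semantics.

def recognition_training(model, samples):
    """`model` maps a feature key to semantics; `samples` is a list of
    (feature_key, expected_semantics) pairs. Returns the update count."""
    updates = 0
    for feature_key, expected in samples:
        predicted = model.get(feature_key)
        if predicted != expected:          # inconsistent with first emotion semantics
            model[feature_key] = expected  # update the model toward the label
            updates += 1
    return updates
```

In a real implementation the "update" would be a gradient step on the network parameters rather than a dictionary write.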
Further, the step S12 includes:
normalizing any one of the feature region dynamic sub-images to generate a dynamic gray level sub-image with the same size;
identifying the gray level sub-image to determine the characteristic region corresponding to the gray level sub-image;
acquiring a plurality of preset dynamic sub-images corresponding to the characteristic area, and matching the preset dynamic sub-images with the gray level sub-images to determine the similarity between the preset dynamic sub-images and the gray level sub-images, wherein each frame of preset dynamic sub-image corresponds to one preset expression recognition result;
and determining a target expression recognition result corresponding to the target preset dynamic sub-image with the maximum similarity as the expression recognition result.
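The normalize-and-match procedure above can be sketched as follows. The nearest-neighbour resize, the mean-absolute-difference similarity, and the 8x8 target size are illustrative assumptions; the disclosure does not fix the normalization or similarity measure.

```python
# Sketch of the matching step: normalize a grayscale sub-image to a fixed
# size, score similarity against each preset sub-image, and return the
# recognition result of the most similar preset.
import numpy as np

def resize_nearest(img, shape):
    """Nearest-neighbour resize of a 2-D array to `shape`."""
    rows = np.linspace(0, img.shape[0] - 1, shape[0]).round().astype(int)
    cols = np.linspace(0, img.shape[1] - 1, shape[1]).round().astype(int)
    return img[np.ix_(rows, cols)]

def match_expression(gray_img, presets, shape=(8, 8)):
    """`presets` maps an expression recognition result to a preset
    grayscale image; returns the result with maximum similarity."""
    target = resize_nearest(gray_img, shape).astype(float)
    def similarity(p):
        q = resize_nearest(p, shape).astype(float)
        return -np.abs(target - q).mean()  # less negative = more similar
    return max(presets, key=lambda k: similarity(presets[k]))
```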
Fig. 2 is a block diagram of a speech recognition apparatus based on facial expressions of a user according to an embodiment of the present disclosure, where the recognition apparatus 100 includes: a first generation module 110, a determination module 120, a second generation module 130, and a third generation module 140.
The first generating module 110 is configured to acquire a thermal image in a monitoring environment through an infrared acquisition device; determine, according to a feature recognition algorithm, facial feature points of the target user corresponding to the human face when it is confirmed, based on an image recognition model, that a human face exists in the monitoring environment; and cyclically execute the following steps, based on a preset distribution rule of the facial feature points, until the facial feature points of the target user in the monitoring environment are determined to have changed: select frame images of a corresponding duration from an initial dynamic image according to a preset target duration to generate a facial dynamic feature image corresponding to the target user; match the facial dynamic feature image against a preset standard dynamic image to generate a comparison result, and judge from the comparison result whether the matching succeeds; if the matching succeeds, judge that the target user shows no emotion fluctuation within the frame images of that duration, extend the target duration, and obtain frame images of the extended duration again; if the matching fails, extract the frame images of that duration to generate the facial dynamic feature image corresponding to the target user; and segment the facial dynamic feature image, based on the preset distribution rule and a plurality of feature regions corresponding to the human face, into a plurality of feature region dynamic sub-images corresponding to the plurality of feature regions, wherein the plurality of feature regions include at least an eye feature region, a nose feature region, and a mouth feature region.
The determining module 120 is configured to match the plurality of feature region dynamic sub-images with the preset dynamic sub-images of the plurality of feature regions, determine a plurality of expression recognition results corresponding to the plurality of feature region dynamic sub-images, and fuse the plurality of expression recognition results according to preset weights to determine the emotion label corresponding to the target user, where the expression recognition results are used for representing the emotion label corresponding to the target user, and the preset weights are set according to how strongly each feature region characterizes the emotion label.

The second generating module 130 is configured to collect audio data of the target user in the monitoring environment within the preset time period to generate target audio data, identify the user voice frequency band corresponding to the target user in the target audio data, perform noise reduction processing on the target audio data according to the user voice frequency band, and perform voice extraction on the noise-reduced target audio data according to set voice features to generate the user voice corresponding to the target user, where the user voice collected through a microphone is used to issue a control instruction to the intelligent terminal.
A third generating module 140, configured to screen initial sample voice data corresponding to the emotion tags from an initial database, add the initial sample voice data into a sample training set of a voice recognition model, perform recognition training on the voice recognition model based on the sample training set, perform semantic recognition on the user voice through the trained voice recognition model, and generate semantic information corresponding to the target user, where the initial database includes mapping relationships between a plurality of initial sample voice data and a plurality of emotion tags.
The device collects the change of the facial feature points of the target user in the monitoring environment within a preset time period to generate a facial dynamic feature image corresponding to the target user, and segments the facial dynamic feature image to generate a plurality of feature region dynamic sub-images. The feature region dynamic sub-images are matched with the preset dynamic sub-images of the corresponding feature regions to determine an emotion label for the target user. Audio data of the target user in the monitoring environment within the preset time period is collected, noise reduction is applied, and voice extraction is performed according to voice features to generate the user voice corresponding to the target user. Initial sample voice data corresponding to the emotion label is screened from an initial database and added to the sample training set of a voice recognition model; the model is trained on that set; and the trained model performs semantic recognition on the user voice to generate the semantic information corresponding to the target user. In this way, an emotion label is produced by judging the user's facial emotion, the voice recognition model is trained according to that label, and the trained model recognizes the semantics of the user's voice, so that an intelligent device can more accurately identify the user intention behind the voice, improving the accuracy of voice recognition and giving the user a better product experience.
Further, the first generating module 110 may be further configured to:
circularly executing the following steps based on a preset distribution rule of the facial feature points until the facial feature points of the target user in the monitoring environment are determined to have changed:
selecting a frame image with corresponding duration from the initial dynamic image according to the preset target duration to generate a facial dynamic feature image corresponding to the target user; matching the face dynamic characteristic image with a preset standard dynamic image to generate a comparison result, and judging whether the matching is successful according to the comparison result;
if the matching is successful, judging that the target user does not generate emotion fluctuation in the frame image with the corresponding duration, prolonging the used target duration, and obtaining the frame image with the corresponding duration again;
and if the matching is unsuccessful, extracting the frame image of the duration to generate a face dynamic characteristic image corresponding to the target user.
Further, the second generating module 130 may be further configured to:
analyzing the user voice in the target audio data according to the historical user voice corresponding to the target user so as to generate the user voice frequency band and the environmental audio according to the target audio data;
and denoising the target audio data based on the user voice frequency band, removing the environmental audio in the target audio data, and performing topology recovery on the processed target audio data to generate the user voice corresponding to the target user.
Further, the third generating module 140 may be further configured to:
screening the initial sample voice data in the initial database based on the emotion tags to obtain a preset number of first sample voice data and corresponding first emotion semantics, wherein the first emotion semantics are semantic information of the first sample voice data under the emotion tags;
performing feature extraction on the first sample voice data through a feature extraction network of the voice recognition model to generate a feature vector corresponding to the first sample voice data, performing semantic recognition on the feature vector through a full-connection neural network of the voice recognition model to generate target semantic information, and updating the voice recognition model according to the first emotion semantics under the condition that the target semantic information is determined to be inconsistent with the first emotion semantics;
and performing semantic recognition on the user voice based on the updated voice recognition model to generate the semantic information corresponding to the target user.
Further, the determining module 120 may be further configured to:
normalizing the dynamic sub-images in any characteristic region to generate dynamic gray sub-images with the same size;
identifying the gray level sub-image to determine the characteristic region corresponding to the gray level sub-image;
acquiring a plurality of preset dynamic sub-images corresponding to the feature area, and matching the preset dynamic sub-images with the gray level sub-images to determine the similarity between the preset dynamic sub-images and the gray level sub-images, wherein each preset dynamic sub-image corresponds to a preset expression recognition result;
and determining a target expression recognition result corresponding to the target preset dynamic sub-image with the maximum similarity as the expression recognition result.
The present disclosure also provides a computer storage medium having a computer program stored thereon, where the computer program, when executed by a processor, implements the steps of the voice recognition method based on user facial expressions described above.
The present disclosure also provides an electronic device including a computer program which, when executed by a processor of the electronic device, implements the steps of the voice recognition method based on user facial expressions described above.
For the sake of brevity, not all possible combinations of the technical features of the above embodiments are described; however, any combination of these technical features should be considered within the scope of the present disclosure as long as the combination contains no contradiction.
The above-mentioned embodiments express only several implementations of the present invention, and their description is specific and detailed, but they should not be understood as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, and these fall within the protection scope of the present invention. Therefore, the protection scope of this patent should be subject to the appended claims.
Claims (8)
1. A method of speech recognition based on facial expressions of a user, the method comprising:
acquiring thermal images in a monitoring environment through an infrared acquisition device, determining facial feature points of a target user corresponding to a human face according to a feature recognition algorithm under the condition that the human face exists in the monitoring environment based on an image recognition model, and circularly executing the following steps based on a preset distribution rule of the facial feature points until the facial feature points of the target user in the monitoring environment are determined to be changed: selecting a frame image with corresponding duration from an initial dynamic image according to preset target duration to generate a face dynamic feature image corresponding to a target user, matching the face dynamic feature image with a preset standard dynamic image to generate a comparison result, judging whether the matching is successful according to the comparison result, judging that the target user does not generate emotion fluctuation in the frame image with the corresponding duration if the matching is successful, prolonging the used target duration, re-obtaining the frame image with the corresponding duration, extracting the frame image with the duration if the matching is unsuccessful, generating a face dynamic feature image corresponding to the target user, segmenting the face dynamic feature image based on a preset distribution rule and a plurality of feature areas corresponding to the human face, and generating a plurality of feature area dynamic sub-images corresponding to the plurality of feature areas, wherein the plurality of feature areas at least comprise an eye feature area, a nose feature area and a mouth feature area;
matching the plurality of feature area dynamic sub-images with a plurality of preset dynamic sub-images corresponding to the plurality of feature areas, determining a plurality of expression recognition results corresponding to the plurality of feature area dynamic sub-images, fusing the plurality of expression recognition results according to preset weights, and determining an emotion label corresponding to the target user, wherein the expression recognition results are used for representing the emotion label corresponding to the target user, and the preset weights are set according to the strength relation of the emotion label represented by each feature area;
acquiring audio data of the target user in the monitoring environment within the preset time period to generate target audio data, identifying a user voice frequency band corresponding to the target user in the target audio data, performing noise reduction processing on the target audio data according to the user voice frequency band, and performing voice extraction on the target audio data subjected to noise reduction according to set voice characteristics to generate user voice corresponding to the target user, wherein the user voice acquired through a microphone issues a control instruction to an intelligent terminal;
screening initial sample voice data corresponding to the emotion labels from an initial database, adding the initial sample voice data into a sample training set of a voice recognition model, performing recognition training on the voice recognition model based on the sample training set, and performing semantic recognition on the user voice through the trained voice recognition model to generate semantic information corresponding to the target user, wherein the initial database comprises mapping relations between a plurality of initial sample voice data and a plurality of emotion labels.
2. The identification method according to claim 1, wherein the performing noise reduction processing on the target audio data according to the user voice frequency band, and performing voice extraction on the target audio data subjected to noise reduction according to a set voice feature to generate the user voice corresponding to the target user comprises:
analyzing the user voice in the target audio data according to the historical user voice corresponding to the target user so as to generate the user voice frequency band and the environmental audio according to the target audio data;
and performing noise reduction processing on the target audio data based on the user voice frequency band to remove the environmental audio in the target audio data, and performing topology recovery on the processed target audio data to generate the user voice corresponding to the target user.
3. The recognition method according to claim 1, wherein the step of screening out initial sample voice data corresponding to the emotion label from an initial database, adding the initial sample voice data into a sample training set of a voice recognition model, performing recognition training on the voice recognition model based on the sample training set, and performing semantic recognition on the user voice through the trained voice recognition model to generate semantic information corresponding to the target user comprises:
screening the initial sample voice data in the initial database based on the emotion tags to obtain a preset number of first sample voice data and corresponding first emotion semantics, wherein the first emotion semantics are semantic information of the first sample voice data under the emotion tags;
performing feature extraction on the first sample voice data through a feature extraction network of the voice recognition model to generate a feature vector corresponding to the first sample voice data, performing semantic recognition on the feature vector through a fully-connected neural network of the voice recognition model to generate target semantic information, and updating the voice recognition model according to the first emotion semantics under the condition that the target semantic information is determined to be inconsistent with the first emotion semantics;
and performing semantic recognition on the user voice based on the updated voice recognition model to generate the semantic information corresponding to the target user.
4. The identification method according to claim 1, wherein the matching the plurality of feature region dynamic sub-images with a plurality of preset dynamic sub-images corresponding to the plurality of feature regions to determine a plurality of expression identification results corresponding to the plurality of feature region dynamic sub-images, and the fusing the plurality of expression identification results according to preset weights to determine the emotion label corresponding to the target user comprises:
normalizing the dynamic sub-images in any characteristic region to generate dynamic gray sub-images with the same size;
identifying the gray level sub-image to determine the characteristic region corresponding to the gray level sub-image;
acquiring a plurality of preset dynamic sub-images corresponding to the feature area, and matching the preset dynamic sub-images with the gray level sub-images to determine the similarity between the preset dynamic sub-images and the gray level sub-images, wherein each preset dynamic sub-image corresponds to a preset expression recognition result;
and determining a target expression recognition result corresponding to the target preset dynamic sub-image with the maximum similarity as the expression recognition result.
5. A speech recognition apparatus based on facial expressions of a user, comprising:
the first generation module is used for acquiring a thermal image in a monitoring environment through an infrared acquisition device, determining facial feature points of a target user corresponding to a human face according to a feature recognition algorithm under the condition that the human face exists in the monitoring environment is confirmed based on an image recognition model, and circularly executing the following steps based on a preset distribution rule of the facial feature points until the facial feature points of the target user in the monitoring environment are determined to be changed: selecting a frame image with corresponding duration from an initial dynamic image according to preset target duration to generate a facial dynamic feature image corresponding to a target user, matching the facial dynamic feature image with a preset standard dynamic image to generate a comparison result, judging whether the matching is successful according to the comparison result, judging that the target user does not generate emotion fluctuation in the frame image with the corresponding duration if the matching is successful, prolonging the used target duration, re-obtaining the frame image with the corresponding duration, extracting the frame image with the duration if the matching is unsuccessful, generating a facial dynamic feature image corresponding to the target user, and segmenting the facial dynamic feature image based on a preset distribution rule and a plurality of feature areas corresponding to the target user to generate a plurality of feature area dynamic sub-images corresponding to the plurality of feature areas; wherein the plurality of feature regions includes at least an eye feature region, a nose feature region, and a mouth feature region;
the determining module is used for matching the plurality of feature area dynamic sub-images with a plurality of preset dynamic sub-images corresponding to the plurality of feature areas, determining a plurality of expression recognition results corresponding to the plurality of feature area dynamic sub-images, fusing the plurality of expression recognition results according to preset weights, and determining an emotion label corresponding to the target user, wherein the expression recognition results are used for representing the emotion label corresponding to the target user, and the preset weights are set according to the strength relation of representing the emotion label in each feature area;
the second generation module is used for acquiring audio data of the target user in the monitoring environment within the preset time period, generating target audio data, identifying a user voice frequency band corresponding to the target user in the target audio data, performing noise reduction processing on the target audio data according to the user voice frequency band, performing voice extraction on the target audio data subjected to noise reduction according to set voice characteristics, and generating user voice corresponding to the target user, wherein a control instruction is issued to an intelligent terminal by acquiring the user voice acquired by a microphone;
the third generation module is used for screening out initial sample voice data corresponding to the emotion labels from an initial database, adding the initial sample voice data into a sample training set of a voice recognition model, performing recognition training on the voice recognition model based on the sample training set, and performing semantic recognition on the user voice through the trained voice recognition model to generate semantic information corresponding to the target user, wherein the initial database comprises mapping relations between a plurality of initial sample voice data and a plurality of emotion labels.
6. The identification device of claim 5, wherein the second generation module is further configured to:
analyzing the user voice in the target audio data according to the historical user voice corresponding to the target user so as to generate the user voice frequency band and the environmental audio according to the target audio data;
and denoising the target audio data based on the user voice frequency band, removing the environmental audio in the target audio data, and performing topology recovery on the processed target audio data to generate the user voice corresponding to the target user.
7. A computer storage medium, having a computer program stored thereon, which, when being executed by a processor, performs the steps of the method for speech recognition based on facial expressions of a user according to any one of claims 1 to 4.
8. An electronic device comprising a computer program, wherein the computer program, when executed by a processor, performs the steps of the method for speech recognition based on facial expressions of a user according to any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211163199.0A CN115440196A (en) | 2022-09-23 | 2022-09-23 | Voice recognition method, device, medium and equipment based on user facial expression |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115440196A true CN115440196A (en) | 2022-12-06 |
Family
ID=84249871
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211163199.0A Pending CN115440196A (en) | 2022-09-23 | 2022-09-23 | Voice recognition method, device, medium and equipment based on user facial expression |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115440196A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115641837A (en) * | 2022-12-22 | 2023-01-24 | 北京资采信息技术有限公司 | Intelligent robot conversation intention recognition method and system |
CN116916497B (en) * | 2023-09-12 | 2023-12-26 | 深圳市卡能光电科技有限公司 | Nested situation identification-based illumination control method and system for floor cylindrical atmosphere lamp |
CN117641667A (en) * | 2023-09-12 | 2024-03-01 | 深圳市卡能光电科技有限公司 | Intelligent control method and system for brightness of atmosphere lamp |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||