CN115440196A - Voice recognition method, device, medium and equipment based on user facial expression


Info

Publication number
CN115440196A
Authority
CN
China
Prior art keywords
user
voice
feature
target
preset
Prior art date
Legal status
Pending
Application number
CN202211163199.0A
Other languages
Chinese (zh)
Inventor
陶贵宾
Current Assignee
Shenzhen Tonglian Financial Network Technology Service Co., Ltd.
Original Assignee
Shenzhen Tonglian Financial Network Technology Service Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Shenzhen Tonglian Financial Network Technology Service Co., Ltd.
Priority to CN202211163199.0A
Publication of CN115440196A

Classifications

    • G10L 15/063 — Speech recognition; creation of reference templates / training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice: Training
    • G06F 40/35 — Handling natural language data; semantic analysis: Discourse or dialogue representation
    • G06V 10/26 — Image preprocessing: Segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region
    • G06V 10/761 — Image or video pattern matching: Proximity, similarity or dissimilarity measures
    • G06V 10/811 — Fusion of classification results, the classifiers operating on different input data, e.g. multi-modal recognition
    • G06V 40/172 — Recognition of human faces: Classification, e.g. identification
    • G06V 40/174 — Recognition of human faces: Facial expression recognition
    • G10L 15/16 — Speech classification or search using artificial neural networks
    • G10L 15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 25/30 — Speech or voice analysis characterised by the analysis technique: using neural networks
    • G10L 25/63 — Speech or voice analysis specially adapted for comparison or discrimination: estimating an emotional state
    • G10L 2015/223 — Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Signal Processing (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Engineering & Computer Science (AREA)
  • Child & Adolescent Psychology (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure provides a voice recognition method, apparatus, medium, and device based on a user's facial expressions. The method determines, according to a recognition model, how the facial feature points of a target user in a monitored environment change within a preset time period and generates a facial dynamic feature image; matches a plurality of feature-region dynamic sub-images with preset dynamic sub-images of the corresponding feature regions to determine the emotion label corresponding to the target user; collects audio data of the target user in the monitored environment within the preset time period to generate the user voice corresponding to the target user; trains a voice recognition model according to the emotion label; and performs semantic recognition on the user voice through the trained voice recognition model to generate semantic information corresponding to the target user. In this way, an intelligent device can recognize the user intention behind the user's voice more accurately, which improves the accuracy of voice recognition and gives the user a better product experience.

Description

Voice recognition method, device, medium and equipment based on user facial expressions
Technical Field
The present invention relates to the field of speech recognition technologies, and in particular, to a speech recognition method, apparatus, medium, and device based on user facial expressions.
Background
In the prior art, user speech is only transcribed into text, and the user's intention is then recognized from the text semantics alone. In a human-computer voice dialogue scenario, however, the intention obtained by analyzing text semantics in this way is often inaccurate, which seriously affects the subsequent interaction between the intelligent device and the user and leads to a poor user experience.
Disclosure of Invention
In view of the above, an object of the present disclosure is to provide a method, an apparatus, a medium, and a device for speech recognition based on facial expressions of a user, so as to solve the technical problem of inaccurate speech recognition in the related art.
In view of the above object, a first aspect of the present disclosure provides a voice recognition method based on facial expressions of a user, the method comprising:
acquiring a thermal image of a monitored environment through an infrared acquisition device; when it is confirmed, based on an image recognition model, that a human face exists in the monitored environment, determining the facial feature points of the target user corresponding to the face according to a feature recognition algorithm; and cyclically executing the following steps, based on a preset distribution rule of the facial feature points, until it is determined that the facial feature points of the target user in the monitored environment have changed: selecting frame images of a corresponding duration from an initial dynamic image according to a preset target duration to generate a facial dynamic feature image corresponding to the target user; matching the facial dynamic feature image with a preset standard dynamic image to generate a comparison result, and judging from the comparison result whether the matching is successful; if the matching is successful, determining that the target user shows no emotional fluctuation within the frame images of that duration, extending the target duration, and re-acquiring frame images of the corresponding duration; if the matching is unsuccessful, extracting the frame images of that duration to generate the facial dynamic feature image corresponding to the target user; and segmenting the facial dynamic feature image, based on the preset distribution rule and a plurality of feature regions of the human face, to generate a plurality of feature-region dynamic sub-images corresponding to the plurality of feature regions, wherein the plurality of feature regions include at least an eye feature region, a nose feature region, and a mouth feature region;
matching the plurality of feature-region dynamic sub-images with a plurality of preset dynamic sub-images corresponding to the plurality of feature regions, determining a plurality of expression recognition results corresponding to the feature-region dynamic sub-images, and fusing the plurality of expression recognition results according to preset weights to determine the emotion label corresponding to the target user, wherein each expression recognition result is used to characterize an emotion label of the target user, and the preset weights are set according to how strongly each feature region expresses the emotion label;
collecting audio data of the target user in the monitored environment within the preset time period to generate target audio data, identifying the user voice frequency band corresponding to the target user in the target audio data, performing noise reduction on the target audio data according to the user voice frequency band, and performing voice extraction on the noise-reduced target audio data according to set voice features to generate the user voice corresponding to the target user, wherein the user voice collected through a microphone is used to issue control instructions to an intelligent terminal;
screening initial sample voice data corresponding to the emotion label from an initial database, adding the initial sample voice data to a sample training set of a voice recognition model, performing recognition training on the voice recognition model based on the sample training set, and performing semantic recognition on the user voice through the trained voice recognition model to generate semantic information corresponding to the target user, wherein the initial database contains mapping relations between a plurality of initial sample voice data and a plurality of emotion labels.
Further, performing noise reduction on the target audio data according to the user voice frequency band and performing voice extraction on the noise-reduced target audio data according to the set voice features to generate the user voice corresponding to the target user includes:
analyzing the user voice in the target audio data according to the historical user voice corresponding to the target user so as to generate the user voice frequency band and the environmental audio according to the target audio data;
and performing noise reduction processing on the target audio data based on the user voice frequency band to remove the environmental audio in the target audio data, and performing topology recovery on the processed target audio data to generate the user voice corresponding to the target user.
Further, the screening out initial sample voice data corresponding to the emotion tag from an initial database, adding the initial sample voice data into a sample training set of a voice recognition model, performing recognition training on the voice recognition model based on the sample training set, performing semantic recognition on the user voice through the trained voice recognition model, and generating semantic information corresponding to the target user includes:
screening the initial sample voice data in the initial database based on the emotion tags to obtain a preset number of first sample voice data and corresponding first emotion semantics, wherein the first emotion semantics are semantic information of the first sample voice data under the emotion tags;
performing feature extraction on the first sample voice data through a feature extraction network of the voice recognition model to generate a feature vector corresponding to the first sample voice data, performing semantic recognition on the feature vector through a fully-connected neural network of the voice recognition model to generate target semantic information, and updating the voice recognition model according to the first emotion semantic under the condition that the target semantic information is inconsistent with the first emotion semantic;
and performing semantic recognition on the user voice based on the updated voice recognition model to generate the semantic information corresponding to the target user.
Further, the matching the plurality of feature region dynamic sub-images with a plurality of preset dynamic sub-images corresponding to the plurality of feature regions, determining a plurality of expression recognition results corresponding to the plurality of feature region dynamic sub-images, and fusing the plurality of expression recognition results according to preset weights to determine the emotion label corresponding to the target user includes:
normalizing the dynamic sub-image of any feature region to generate dynamic grayscale sub-images of the same size;
identifying the grayscale sub-image and determining the feature region corresponding to the grayscale sub-image;
acquiring a plurality of preset dynamic sub-images corresponding to the feature region, and matching the preset dynamic sub-images with the grayscale sub-image to determine the similarity between each preset dynamic sub-image and the grayscale sub-image, wherein each preset dynamic sub-image corresponds to a preset expression recognition result;
and determining the target expression recognition result corresponding to the target preset dynamic sub-image with the greatest similarity as the expression recognition result.
A second aspect of the present disclosure provides a voice recognition apparatus based on a facial expression of a user, the apparatus including:
a first generating module, configured to acquire a thermal image of a monitored environment through an infrared acquisition device, determine, when it is confirmed based on an image recognition model that a human face exists in the monitored environment, the facial feature points of the target user corresponding to the face according to a feature recognition algorithm, and cyclically execute the following steps, based on a preset distribution rule of the facial feature points, until it is determined that the facial feature points of the target user in the monitored environment have changed: selecting frame images of a corresponding duration from an initial dynamic image according to a preset target duration to generate a facial dynamic feature image corresponding to the target user; matching the facial dynamic feature image with a preset standard dynamic image to generate a comparison result, and judging from the comparison result whether the matching is successful; if the matching is successful, determining that the target user shows no emotional fluctuation within the frame images of that duration, extending the target duration, and re-acquiring frame images of the corresponding duration; if the matching is unsuccessful, extracting the frame images of that duration to generate the facial dynamic feature image corresponding to the target user; and segmenting the facial dynamic feature image, based on the preset distribution rule and a plurality of feature regions corresponding to the target user, to generate a plurality of feature-region dynamic sub-images corresponding to the plurality of feature regions, wherein the plurality of feature regions include at least an eye feature region, a nose feature region, and a mouth feature region;
a determining module, configured to match the plurality of feature-region dynamic sub-images with a plurality of preset dynamic sub-images corresponding to the plurality of feature regions, determine a plurality of expression recognition results corresponding to the feature-region dynamic sub-images, and fuse the plurality of expression recognition results according to preset weights to determine the emotion label corresponding to the target user, wherein each expression recognition result is used to characterize an emotion label of the target user, and the preset weights are set according to how strongly each feature region expresses the emotion label;
a second generating module, configured to collect audio data of the target user in the monitored environment within the preset time period to generate target audio data, identify the user voice frequency band corresponding to the target user in the target audio data, perform noise reduction on the target audio data according to the user voice frequency band, and perform voice extraction on the noise-reduced target audio data according to set voice features to generate the user voice corresponding to the target user, wherein the user voice collected through a microphone is used to issue control instructions to the intelligent terminal;
and a third generating module, configured to screen initial sample voice data corresponding to the emotion label from an initial database, add the initial sample voice data to a sample training set of a voice recognition model, perform recognition training on the voice recognition model based on the sample training set, and perform semantic recognition on the user voice through the trained voice recognition model to generate semantic information corresponding to the target user, wherein the initial database contains mapping relations between a plurality of initial sample voice data and a plurality of emotion labels.
Further, the second generating module may be further configured to:
analyzing the user voice in the target audio data according to the historical user voice corresponding to the target user so as to generate the user voice frequency band and the environmental audio according to the target audio data;
and denoising the target audio data based on the user voice frequency band, removing the environmental audio in the target audio data, and performing topology recovery on the processed target audio data to generate the user voice corresponding to the target user.
A third aspect of the present disclosure provides a computer storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method for speech recognition based on facial expressions of a user according to any one of the first aspect.
A fourth aspect of the present disclosure provides an electronic device comprising a computer program which, when executed by a processor, performs the steps of the speech recognition method based on facial expressions of a user according to any one of the first aspect.
The present disclosure can achieve at least the following advantageous effects:
The changes of the facial feature points of the target user in the monitored environment within a preset time period are collected to generate a facial dynamic feature image corresponding to the target user, and the facial dynamic feature image is segmented to generate a plurality of feature-region dynamic sub-images. The feature-region dynamic sub-images are matched with preset dynamic sub-images of the corresponding feature regions to determine the emotion label corresponding to the target user. Audio data of the target user in the monitored environment within the preset time period is collected, and voice extraction is performed on the noise-reduced target audio data according to voice features to generate the user voice corresponding to the target user. Initial sample voice data corresponding to the emotion label are screened from an initial database and added to the sample training set of a voice recognition model; recognition training is performed on the voice recognition model based on the sample training set; and semantic recognition is performed on the user voice through the trained voice recognition model to generate semantic information corresponding to the target user. In this way, an emotion label is generated by judging the user's facial expression, the voice recognition model is trained according to the emotion label, and the trained model recognizes the semantics of the user's voice, so that the intelligent device can recognize the user intention behind the user's voice more accurately, the accuracy of voice recognition is improved, and a better product experience is provided to the user.
Drawings
Fig. 1 is a flowchart of a speech recognition method based on facial expressions of a user according to an embodiment of the present disclosure.
Fig. 2 is a block diagram of a speech recognition apparatus based on facial expressions of a user in an embodiment of the present disclosure.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein.
In the description of the present invention, it is to be understood that the terms "central," "longitudinal," "lateral," "length," "width," "thickness," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," "clockwise," "counterclockwise," "axial," "radial," "circumferential," and the like are used in the orientations and positional relationships indicated in the drawings for convenience in describing the invention and to simplify the description, and are not intended to indicate or imply that the referenced device or element must have a particular orientation, be constructed and operated in a particular orientation, and are not to be considered limiting of the invention.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
In the present invention, unless otherwise expressly stated or limited, the terms "mounted," "connected," "secured," and the like are to be construed broadly and can, for example, be fixedly connected, detachably connected, or integrally formed; can be mechanically or electrically connected; they may be directly connected or indirectly connected through intervening media, or they may be connected internally or in any other suitable relationship, unless expressly stated otherwise. The specific meanings of the above terms in the present invention can be understood according to specific situations by those of ordinary skill in the art.
In the present invention, unless expressly stated or limited otherwise, a first feature being "on" or "under" a second feature may mean that the first feature directly contacts the second feature, or that the first and second features contact each other indirectly through an intervening medium. Moreover, a first feature being "on," "over," or "above" a second feature may mean that the first feature is directly or obliquely above the second feature, or simply that the first feature is at a higher level than the second feature; a first feature being "under," "below," or "beneath" a second feature may mean that the first feature is directly or obliquely below the second feature, or simply that the first feature is at a lower level than the second feature.
It will be understood that when an element is referred to as being "secured to" or "disposed on" another element, it can be directly on the other element or intervening elements may also be present. When an element is referred to as being "connected" to another element, it can be directly connected to the other element or intervening elements may also be present. As used herein, the terms "vertical," "horizontal," "upper," "lower," "left," "right," and the like are for purposes of illustration only and do not denote a single embodiment.
Fig. 1 is a flowchart of a speech recognition method based on facial expressions of a user in an embodiment of the present disclosure, as shown in fig. 1, the method includes the following steps:
in step S11, determining the change situation of the facial feature points of the target user in the monitoring environment within a preset time period according to an image recognition model to generate a facial dynamic feature image;
the method comprises the following steps of acquiring a thermal image in a monitoring environment through an infrared acquisition device, determining facial feature points of a target user corresponding to a human face according to a feature recognition algorithm under the condition that human face features exist in the monitoring environment based on an image recognition model, and circularly executing the following steps based on the preset distribution rules of the facial feature points until the facial feature points of the target user in the monitoring environment are determined to be changed: selecting a frame image with corresponding duration from an initial dynamic image according to preset target duration to generate a face dynamic feature image corresponding to a target user, matching the face dynamic feature image with a preset standard dynamic image to generate a comparison result, judging whether the matching is successful according to the comparison result, judging that the target user does not generate emotion fluctuation in the frame image with the corresponding duration if the matching is successful, prolonging the used target duration, re-obtaining the frame image with the corresponding duration, extracting the frame image with the duration if the matching is unsuccessful, generating the face dynamic feature image corresponding to the target user, segmenting the face dynamic feature image based on a preset distribution rule and a plurality of feature areas corresponding to the human face, and generating a plurality of feature area dynamic sub-images corresponding to the plurality of feature areas, wherein the plurality of feature areas at least comprise an eye feature area, a nose feature area and a mouth feature area.
In step S12, matching the plurality of feature region dynamic sub-images with a plurality of preset dynamic sub-images corresponding to the plurality of feature regions, and determining an emotion tag corresponding to the target user;
The expression recognition results are used to characterize the emotion label corresponding to the target user, and the preset weights are set according to how strongly each feature region expresses the emotion label. The emotion label is determined by fusing, according to the preset weights, the expression recognition results corresponding to the dynamic sub-images of the individual feature regions.
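A minimal sketch of this weighted fusion is shown below; the region weights and emotion labels are assumed example values, and the patent does not prescribe these particular numbers.

```python
# Illustrative sketch: each feature region votes with per-emotion scores, and assumed
# region weights reflect how strongly that region expresses emotion.
REGION_WEIGHTS = {"eye": 0.35, "mouth": 0.45, "nose": 0.20}

def fuse_emotion_label(region_results):
    """region_results: {region: {emotion_label: similarity_score}} -> fused emotion label."""
    fused = {}
    for region, scores in region_results.items():
        w = REGION_WEIGHTS.get(region, 0.0)
        for label, score in scores.items():
            fused[label] = fused.get(label, 0.0) + w * score
    return max(fused, key=fused.get)

# Example: the mouth strongly indicates "happy" and the eyes mildly agree.
print(fuse_emotion_label({
    "eye":   {"happy": 0.6, "neutral": 0.4},
    "mouth": {"happy": 0.8, "neutral": 0.2},
    "nose":  {"happy": 0.3, "neutral": 0.7},
}))  # -> "happy"
```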
In step S13, collecting audio data of the target user in the monitoring environment within the preset time period, and generating the user voice corresponding to the target user;
Specifically, audio data of the target user in the monitored environment within the preset time period is collected to generate target audio data; the user voice frequency band in the target audio data is identified; noise reduction is performed on the target audio data according to the user voice frequency band; and voice extraction is performed on the noise-reduced target audio data according to the voice features to generate the user voice corresponding to the target user. The user voice collected through the microphone is used to issue control instructions to the intelligent terminal.
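The noise-reduction and voice-extraction step could, for example, be approximated by band-pass filtering on an assumed user voice frequency band followed by a simple energy gate. The sketch below uses SciPy and a 16 kHz mono signal; the band limits and the energy threshold are illustrative assumptions, not values given in the patent.

```python
# Illustrative sketch: keep only the user's voice frequency band, then gate on short-time energy.
import numpy as np
from scipy.signal import butter, sosfiltfilt

def extract_user_voice(target_audio, sample_rate=16000, band=(85.0, 4000.0)):
    """Suppress out-of-band environmental audio, then keep high-energy (voiced) segments."""
    low, high = band
    sos = butter(8, [low, high], btype="bandpass", fs=sample_rate, output="sos")
    denoised = sosfiltfilt(sos, target_audio)
    # Crude voice-activity gate on short-time energy, standing in for "voice extraction
    # according to set voice features"; the 0.1 threshold factor is an assumption.
    frame = int(0.025 * sample_rate)
    energy = np.array([np.sum(denoised[i:i + frame] ** 2)
                       for i in range(0, len(denoised) - frame, frame)])
    mask = np.repeat(energy > 0.1 * energy.max(), frame)
    return denoised[: len(mask)][mask]
```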
In step S14, a voice recognition model is trained according to the emotion labels, and the trained voice recognition model is used for carrying out semantic recognition on the user voice to generate semantic information corresponding to the target user;
the initial sample voice data corresponding to the emotion labels are screened out from an initial database, the initial sample voice data are added into a sample training set of a voice recognition model, the voice recognition model is recognized and trained on the basis of the sample training set, the voice of the user is recognized through the trained voice recognition model, and semantic information corresponding to the target user is generated. The initial database comprises a plurality of groups of initial sample voice data and a plurality of mapping relations between emotion labels.
With the above technical solution, the changes of the facial feature points of the target user in the monitored environment within a preset time period are collected to generate a facial dynamic feature image corresponding to the target user, and the facial dynamic feature image is segmented to generate a plurality of feature-region dynamic sub-images. The feature-region dynamic sub-images are matched with preset dynamic sub-images of the corresponding feature regions to determine the emotion label corresponding to the target user. Audio data of the target user in the monitored environment within the preset time period is collected, and voice extraction is performed on the noise-reduced target audio data according to voice features to generate the user voice corresponding to the target user. Initial sample voice data corresponding to the emotion label are screened from an initial database and added to the sample training set of a voice recognition model; recognition training is performed on the voice recognition model based on the sample training set; and semantic recognition is performed on the user voice through the trained voice recognition model to generate semantic information corresponding to the target user. In this way, an emotion label is generated by judging the user's facial expression, the voice recognition model is trained according to the emotion label, and the trained model recognizes the semantics of the user's voice, so that the intelligent device can recognize the user intention behind the user's voice more accurately, the accuracy of voice recognition is improved, and a better product experience is provided to the user.
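To illustrate the final hand-off from recognized semantic information to a control instruction on the intelligent terminal, a hypothetical dispatcher is sketched below; the intent-to-command table and the `SmartTerminal` class are inventions of this example, not part of the patent.

```python
# Illustrative sketch: map recognized semantic information to a control instruction.
class SmartTerminal:
    def execute(self, command: str) -> None:
        print(f"executing: {command}")

INTENT_TO_COMMAND = {
    "turn_on_light": "light.on",
    "play_music": "media.play",
    "query_weather": "assistant.weather",
}

def dispatch(semantic_info: str, terminal: SmartTerminal) -> None:
    command = INTENT_TO_COMMAND.get(semantic_info)
    if command is None:
        terminal.execute("assistant.clarify")   # fall back when the intent is unknown
    else:
        terminal.execute(command)

dispatch("play_music", SmartTerminal())   # -> executing: media.play
```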
Further, the step S13 includes:
analyzing the user voice in the target audio data according to the historical user voice corresponding to the target user so as to generate the user voice frequency band and the environmental audio according to the target audio data;
and performing noise reduction processing on the target audio data based on the user voice frequency band, removing the environmental audio in the target audio data, and performing topology recovery on the processed target audio data to generate the user voice corresponding to the target user.
Further, the step S14 includes:
screening initial sample voice data in the initial database based on the emotion tags to obtain a preset number of first sample voice data and corresponding first emotion semantics, wherein the first emotion semantics are semantic information of the first sample voice data under the emotion tags;
performing feature extraction on the first sample voice data through a feature extraction network of the voice recognition model to generate a feature vector corresponding to the first sample voice data, performing semantic recognition on the feature vector through a fully-connected neural network of the voice recognition model to generate target semantic information, and updating the voice recognition model according to the first emotion semantics when it is determined that the target semantic information is inconsistent with the first emotion semantics;
and performing semantic recognition on the user voice based on the updated voice recognition model to generate the semantic information corresponding to the target user.
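The description above implies a feature-extraction network followed by a fully-connected semantic classifier that is updated whenever its prediction disagrees with the first emotion semantics. A PyTorch sketch of such a structure follows; the GRU encoder, layer sizes, 40-dimensional input features, and cross-entropy loss are assumptions, since the patent does not specify the network in this detail.

```python
# Illustrative sketch: feature-extraction network + fully-connected semantic classifier,
# updated only when the prediction is inconsistent with the first emotion semantics.
import torch
import torch.nn as nn

class EmotionAwareRecognizer(nn.Module):
    def __init__(self, feat_dim=40, hidden=128, num_semantics=50):
        super().__init__()
        self.feature_net = nn.GRU(feat_dim, hidden, batch_first=True)   # feature extraction network
        self.classifier = nn.Sequential(                                # fully-connected network
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, num_semantics))

    def forward(self, x):                      # x: (batch, time, feat_dim)
        _, h = self.feature_net(x)             # h: (1, batch, hidden) feature vector
        return self.classifier(h.squeeze(0))   # semantic logits

def training_step(model, optimizer, speech_feats, first_emotion_semantics):
    logits = model(speech_feats)
    predicted = logits.argmax(dim=-1)
    if not torch.equal(predicted, first_emotion_semantics):     # inconsistent with the label
        loss = nn.functional.cross_entropy(logits, first_emotion_semantics)
        optimizer.zero_grad()
        loss.backward()                                          # update the model
        optimizer.step()
```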
Further, the step S12 includes:
normalizing the dynamic sub-image of any feature region to generate dynamic grayscale sub-images of the same size;
identifying the grayscale sub-image to determine the feature region corresponding to the grayscale sub-image;
acquiring a plurality of preset dynamic sub-images corresponding to the feature region, and matching the preset dynamic sub-images with the grayscale sub-image to determine the similarity between each preset dynamic sub-image and the grayscale sub-image, wherein each preset dynamic sub-image corresponds to a preset expression recognition result;
and determining the target expression recognition result corresponding to the target preset dynamic sub-image with the greatest similarity as the expression recognition result.
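One way to realize this normalization and similarity matching is fixed-size grayscale normalization followed by a normalized cross-correlation score, taking the preset sub-image with the greatest similarity. The sketch below assumes single-channel frames, a 64x64 target size, and equal frame counts; all of these are illustrative choices rather than requirements of the patent.

```python
# Illustrative sketch: normalize a dynamic sub-image and match it against preset sub-images.
import numpy as np
import cv2

def to_gray_normalized(dynamic_sub_image, size=(64, 64)):
    """dynamic_sub_image: iterable of single-channel frames -> zero-mean, unit-variance stack."""
    frames = [cv2.resize(f, size) for f in dynamic_sub_image]
    g = np.stack(frames).astype(np.float32)
    return (g - g.mean()) / (g.std() + 1e-8)

def recognize_expression(dynamic_sub_image, preset_sub_images):
    """preset_sub_images: {expression_result: preset dynamic sub-image with the same frame count}."""
    query = to_gray_normalized(dynamic_sub_image)

    def similarity(a, b):
        return float(np.sum(a * b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    scores = {result: similarity(query, to_gray_normalized(preset))
              for result, preset in preset_sub_images.items()}
    return max(scores, key=scores.get)   # expression result with the greatest similarity
```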
Fig. 2 is a block diagram of a speech recognition apparatus based on facial expressions of a user according to an embodiment of the present disclosure, where the recognition apparatus 100 includes: a first generation module 110, a determination module 120, a second generation module 130, and a third generation module 140.
A first generating module 110, configured to acquire a thermal image in a monitoring environment through an infrared acquisition device, determine, according to a feature recognition algorithm, facial feature points of a target user corresponding to a human face when it is determined that the human face exists in the monitoring environment based on an image recognition model, and cyclically execute the following steps based on the preset distribution rule of the facial feature points until it is determined that the facial feature points of the target user in the monitoring environment change: selecting a frame image with corresponding duration from an initial dynamic image according to preset target duration to generate a face dynamic feature image corresponding to a target user, matching the face dynamic feature image with a preset standard dynamic image to generate a comparison result, judging whether the matching is successful according to the comparison result, judging that the target user does not generate emotion fluctuation in the frame image with the corresponding duration if the matching is successful, prolonging the used target duration, obtaining the frame image with the corresponding duration again, extracting the frame image with the duration if the matching is unsuccessful, generating a face dynamic feature image corresponding to the target user, segmenting the face dynamic feature image based on the preset distribution rule and a plurality of feature areas corresponding to the target user, and generating a plurality of feature area dynamic sub-images corresponding to the plurality of feature areas; wherein the feature regions include at least an eye feature region, a nose feature region, and a mouth feature region.
The determining module 120 is configured to match the plurality of feature region dynamic sub-images with preset dynamic sub-images of the plurality of feature regions, determine a plurality of expression recognition results corresponding to the plurality of feature region dynamic sub-images, and fuse the plurality of expression recognition results according to a preset weight to determine an emotion tag corresponding to the target user, where the expression recognition results are used for representing the emotion tag corresponding to the target user, and the preset weight is set according to a strength relationship of the representation emotion tags of the respective feature regions.
A second generating module 130, configured to collect audio data of the target user in the monitored environment within the preset time period to generate target audio data, identify the user voice frequency band corresponding to the target user in the target audio data, perform noise reduction on the target audio data according to the user voice frequency band, and perform voice extraction on the noise-reduced target audio data according to set voice features to generate the user voice corresponding to the target user, wherein the user voice collected through a microphone is used to issue control instructions to the intelligent terminal.
A third generating module 140, configured to screen initial sample voice data corresponding to the emotion tags from an initial database, add the initial sample voice data into a sample training set of a voice recognition model, perform recognition training on the voice recognition model based on the sample training set, perform semantic recognition on the user voice through the trained voice recognition model, and generate semantic information corresponding to the target user, where the initial database includes mapping relationships between a plurality of initial sample voice data and a plurality of emotion tags.
The apparatus generates a facial dynamic feature image corresponding to the target user by monitoring the changes of the facial feature points of the target user in the monitored environment within a preset time period, and segments the facial dynamic feature image to generate a plurality of feature-region dynamic sub-images. The feature-region dynamic sub-images are matched with preset dynamic sub-images of the corresponding feature regions to determine the emotion label corresponding to the target user. Audio data of the target user in the monitored environment within the preset time period is collected, and voice extraction is performed on the noise-reduced target audio data according to voice features to generate the user voice corresponding to the target user. Initial sample voice data corresponding to the emotion label are screened from an initial database and added to the sample training set of a voice recognition model; recognition training is performed on the voice recognition model based on the sample training set; and semantic recognition is performed on the user voice through the trained voice recognition model to generate semantic information corresponding to the target user. In this way, an emotion label is generated by judging the user's facial expression, the voice recognition model is trained according to the emotion label, and the trained model recognizes the semantics of the user's voice, so that the intelligent device can recognize the user intention behind the user's voice more accurately, the accuracy of voice recognition is improved, and a better product experience is provided to the user.
Further, the first generating module 110 may be further configured to:
circularly executing the following steps based on a preset distribution rule of the facial feature points until the facial feature points of the target user in the monitoring environment are determined to be changed;
selecting a frame image with corresponding duration from the initial dynamic image according to the preset target duration to generate a facial dynamic feature image corresponding to the target user; matching the face dynamic characteristic image with a preset standard dynamic image to generate a comparison result, and judging whether the matching is successful according to the comparison result;
if the matching is successful, judging that the target user does not generate emotion fluctuation in the frame image with the corresponding duration, prolonging the used target duration, and obtaining the frame image with the corresponding duration again;
and if the matching is unsuccessful, extracting the frame image of the duration to generate a face dynamic characteristic image corresponding to the target user.
Further, the second generating module 130 may be further configured to:
analyzing the user voice in the target audio data according to the historical user voice corresponding to the target user so as to generate the user voice frequency band and the environmental audio according to the target audio data;
and denoising the target audio data based on the user voice frequency band, removing the environmental audio in the target audio data, and performing topology recovery on the processed target audio data to generate the user voice corresponding to the target user.
Further, the third generating module 140 may be further configured to:
screening the initial sample voice data in the initial database based on the emotion tags to obtain a preset number of first sample voice data and corresponding first emotion semantics, wherein the first emotion semantics are semantic information of the first sample voice data under the emotion tags;
performing feature extraction on the first sample voice data through a feature extraction network of the voice recognition model to generate a feature vector corresponding to the first sample voice data, performing semantic recognition on the feature vector through a full-connection neural network of the voice recognition model to generate target semantic information, and updating the voice recognition model according to the first emotion semantics under the condition that the target semantic information is determined to be inconsistent with the first emotion semantics;
and performing semantic recognition on the user voice based on the updated voice recognition model to generate the semantic information corresponding to the target user.
Further, the third generating module 140 may be further configured to:
normalizing the dynamic sub-images in any characteristic region to generate dynamic gray sub-images with the same size;
identifying the gray level sub-image to determine the characteristic region corresponding to the gray level sub-image;
acquiring a plurality of preset dynamic sub-images corresponding to the feature area, and matching the preset dynamic sub-images with the gray level sub-images to determine the similarity between the preset dynamic sub-images and the gray level sub-images, wherein each preset dynamic sub-image corresponds to a preset expression recognition result;
and determining a target expression recognition result corresponding to the target preset dynamic sub-image with the maximum similarity as the expression recognition result.
The present disclosure also provides a computer storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the voice recognition method based on facial expressions of a user according to any of the foregoing embodiments.
The present disclosure also provides an electronic device comprising a computer program which, when executed by a processor, performs the steps of the voice recognition method based on facial expressions of a user according to any of the foregoing embodiments.
All possible combinations of the technical features of the above embodiments may not be described for the sake of brevity, but should be considered as within the scope of the present disclosure as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present invention, and the description thereof is specific and detailed, but not to be understood as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention. Therefore, the protection scope of the present patent should be subject to the appended claims.

Claims (8)

1. A method of speech recognition based on facial expressions of a user, the method comprising:
acquiring thermal images in a monitoring environment through an infrared acquisition device, determining facial feature points of a target user corresponding to a human face according to a feature recognition algorithm under the condition that the human face exists in the monitoring environment based on an image recognition model, and circularly executing the following steps based on a preset distribution rule of the facial feature points until the facial feature points of the target user in the monitoring environment are determined to be changed: selecting a frame image with corresponding duration from an initial dynamic image according to preset target duration to generate a face dynamic feature image corresponding to a target user, matching the face dynamic feature image with a preset standard dynamic image to generate a comparison result, judging whether the matching is successful according to the comparison result, judging that the target user does not generate emotion fluctuation in the frame image with the corresponding duration if the matching is successful, prolonging the used target duration, re-obtaining the frame image with the corresponding duration, extracting the frame image with the duration if the matching is unsuccessful, generating a face dynamic feature image corresponding to the target user, segmenting the face dynamic feature image based on a preset distribution rule and a plurality of feature areas corresponding to the human face, and generating a plurality of feature area dynamic sub-images corresponding to the plurality of feature areas, wherein the plurality of feature areas at least comprise an eye feature area, a nose feature area and a mouth feature area;
matching the plurality of feature area dynamic sub-images with a plurality of preset dynamic sub-images corresponding to the plurality of feature areas, determining a plurality of expression recognition results corresponding to the plurality of feature area dynamic sub-images, fusing the plurality of expression recognition results according to preset weights, and determining an emotion label corresponding to the target user, wherein the expression recognition results are used for representing the emotion label corresponding to the target user, and the preset weights are set according to the strength relation of the emotion label represented by each feature area;
acquiring audio data of the target user in the monitoring environment within the preset time period to generate target audio data, identifying a user voice frequency band corresponding to the target user in the target audio data, performing noise reduction processing on the target audio data according to the user voice frequency band, and performing voice extraction on the target audio data subjected to noise reduction according to set voice characteristics to generate user voice corresponding to the target user, wherein the user voice acquired through a microphone issues a control instruction to an intelligent terminal;
screening initial sample voice data corresponding to the emotion labels from an initial database, adding the initial sample voice data into a sample training set of a voice recognition model, performing recognition training on the voice recognition model based on the sample training set, and performing semantic recognition on the user voice through the trained voice recognition model to generate semantic information corresponding to the target user, wherein the initial database comprises mapping relations between a plurality of initial sample voice data and a plurality of emotion labels.
2. The identification method according to claim 1, wherein the performing noise reduction processing on the target audio data according to the user voice frequency band, and performing voice extraction on the target audio data subjected to noise reduction according to a set voice feature to generate the user voice corresponding to the target user comprises:
analyzing the user voice in the target audio data according to the historical user voice corresponding to the target user so as to generate the user voice frequency band and the environmental audio according to the target audio data;
and performing noise reduction processing on the target audio data based on the user voice frequency band to remove the environmental audio in the target audio data, and performing topology recovery on the processed target audio data to generate the user voice corresponding to the target user.
3. The recognition method according to claim 1, wherein the step of screening out initial sample voice data corresponding to the emotion label from an initial database, adding the initial sample voice data into a sample training set of a voice recognition model, performing recognition training on the voice recognition model based on the sample training set, and performing semantic recognition on the user voice through the trained voice recognition model to generate semantic information corresponding to the target user comprises:
screening the initial sample voice data in the initial database based on the emotion tags to obtain a preset number of first sample voice data and corresponding first emotion semantics, wherein the first emotion semantics are semantic information of the first sample voice data under the emotion tags;
performing feature extraction on the first sample voice data through a feature extraction network of the voice recognition model to generate a feature vector corresponding to the first sample voice data, performing semantic recognition on the feature vector through a fully-connected neural network of the voice recognition model to generate target semantic information, and updating the voice recognition model according to the first emotion semantic information under the condition that the target semantic information is determined to be inconsistent with the first emotion semantic information;
and performing semantic recognition on the user voice based on the updated voice recognition model to generate the semantic information corresponding to the target user.
4. The identification method according to claim 1, wherein the matching the plurality of feature region dynamic sub-images with a plurality of preset dynamic sub-images corresponding to the plurality of feature regions to determine a plurality of expression identification results corresponding to the plurality of feature region dynamic sub-images, and the fusing the plurality of expression identification results according to preset weights to determine the emotion label corresponding to the target user comprises:
normalizing the dynamic sub-images in any characteristic region to generate dynamic gray sub-images with the same size;
identifying the gray level sub-image to determine the characteristic region corresponding to the gray level sub-image;
acquiring a plurality of preset dynamic sub-images corresponding to the feature area, and matching the preset dynamic sub-images with the gray level sub-images to determine the similarity between the preset dynamic sub-images and the gray level sub-images, wherein each preset dynamic sub-image corresponds to a preset expression recognition result;
and determining a target expression recognition result corresponding to the target preset dynamic sub-image with the maximum similarity as the expression recognition result.
5. A speech recognition apparatus based on facial expressions of a user, comprising:
the first generation module is used for acquiring a thermal image in a monitoring environment through an infrared acquisition device, determining facial feature points of a target user corresponding to a human face according to a feature recognition algorithm under the condition that the human face exists in the monitoring environment is confirmed based on an image recognition model, and circularly executing the following steps based on a preset distribution rule of the facial feature points until the facial feature points of the target user in the monitoring environment are determined to be changed: selecting a frame image with corresponding duration from an initial dynamic image according to preset target duration to generate a facial dynamic feature image corresponding to a target user, matching the facial dynamic feature image with a preset standard dynamic image to generate a comparison result, judging whether the matching is successful according to the comparison result, judging that the target user does not generate emotion fluctuation in the frame image with the corresponding duration if the matching is successful, prolonging the used target duration, re-obtaining the frame image with the corresponding duration, extracting the frame image with the duration if the matching is unsuccessful, generating a facial dynamic feature image corresponding to the target user, and segmenting the facial dynamic feature image based on a preset distribution rule and a plurality of feature areas corresponding to the target user to generate a plurality of feature area dynamic sub-images corresponding to the plurality of feature areas; wherein the plurality of feature regions includes at least an eye feature region, a nose feature region, and a mouth feature region;
the determining module is used for matching the plurality of feature region dynamic sub-images with a plurality of preset dynamic sub-images corresponding to the plurality of feature regions to determine a plurality of expression recognition results corresponding to the plurality of feature region dynamic sub-images, and fusing the plurality of expression recognition results according to preset weights to determine an emotion label corresponding to the target user (see the fusion sketch after this claim), wherein the expression recognition results are used for representing the emotion label corresponding to the target user, and the preset weights are set according to how strongly each feature region expresses the emotion label;
the second generation module is used for acquiring audio data of the target user in the monitoring environment within the preset time period to generate target audio data, identifying a user voice frequency band corresponding to the target user in the target audio data, performing noise reduction processing on the target audio data according to the user voice frequency band, and performing voice extraction on the noise-reduced target audio data according to set voice characteristics to generate the user voice corresponding to the target user, wherein the user voice collected by a microphone is used to issue a control instruction to an intelligent terminal;
the third generation module is used for screening out, from an initial database, initial sample voice data corresponding to the emotion label, adding the initial sample voice data into a sample training set of a voice recognition model, performing recognition training on the voice recognition model based on the sample training set, and performing semantic recognition on the user voice through the trained voice recognition model to generate the semantic information corresponding to the target user, wherein the initial database comprises mapping relations between a plurality of initial sample voice data and a plurality of emotion labels.
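The weighted fusion performed by the determining module can be pictured as a weighted vote over per-region expression results; the region names and weights in the sketch below are assumed for illustration only and are not values taken from the patent.

```python
from collections import defaultdict

# Sketch of the determining module's weighted fusion: each feature region votes for an
# emotion label with a preset weight reflecting how strongly that region expresses
# emotion; the label with the highest accumulated weight becomes the emotion label.

REGION_WEIGHTS = {"eyes": 0.4, "mouth": 0.4, "nose": 0.2}  # illustrative preset weights

def fuse_expressions(region_results: dict[str, str]) -> str:
    """region_results maps a feature region to the emotion label it recognised."""
    scores: dict[str, float] = defaultdict(float)
    for region, label in region_results.items():
        scores[label] += REGION_WEIGHTS.get(region, 0.0)
    return max(scores, key=scores.get)

# Example: eyes and mouth both indicate "happy", nose indicates "neutral".
print(fuse_expressions({"eyes": "happy", "mouth": "happy", "nose": "neutral"}))  # happy
```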
6. The recognition apparatus according to claim 5, wherein the second generation module is further configured to:
analyzing the user voice in the target audio data according to historical user voice corresponding to the target user, so as to separate, from the target audio data, the user voice frequency band and the environmental audio;
and denoising the target audio data based on the user voice frequency band, removing the environmental audio from the target audio data, and performing topology recovery on the processed target audio data to generate the user voice corresponding to the target user.
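One way to picture the frequency-band-based noise reduction of claim 6 is a simple spectral mask that keeps only the user's voice band; the band limits and the FFT-based filtering below are illustrative assumptions, not the claimed processing, and the topology recovery step is not modelled.

```python
import numpy as np

# Sketch of band-based denoising: keep only the spectral components inside the user's
# voice frequency band and attenuate environmental audio outside it.

def denoise_by_voice_band(audio: np.ndarray, sr: int, band=(85.0, 3400.0)) -> np.ndarray:
    """Zero out FFT bins outside the assumed user voice band and reconstruct the waveform."""
    spectrum = np.fft.rfft(audio)
    freqs = np.fft.rfftfreq(audio.size, d=1.0 / sr)
    keep = (freqs >= band[0]) & (freqs <= band[1])
    spectrum[~keep] = 0.0
    return np.fft.irfft(spectrum, n=audio.size)

# Example: a 440 Hz "voice" tone plus 6 kHz "environmental" noise, sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
noisy = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 6000 * t)
clean = denoise_by_voice_band(noisy, sr)   # the 6 kHz component is removed
```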
7. A computer storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the method for speech recognition based on facial expressions of a user according to any one of claims 1 to 4.
8. An electronic device comprising a computer program, wherein the computer program, when executed by a processor, performs the steps of the method for speech recognition based on facial expressions of a user according to any one of claims 1 to 4.
CN202211163199.0A 2022-09-23 2022-09-23 Voice recognition method, device, medium and equipment based on user facial expression Pending CN115440196A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211163199.0A CN115440196A (en) 2022-09-23 2022-09-23 Voice recognition method, device, medium and equipment based on user facial expression

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211163199.0A CN115440196A (en) 2022-09-23 2022-09-23 Voice recognition method, device, medium and equipment based on user facial expression

Publications (1)

Publication Number Publication Date
CN115440196A true CN115440196A (en) 2022-12-06

Family

ID=84249871

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211163199.0A Pending CN115440196A (en) 2022-09-23 2022-09-23 Voice recognition method, device, medium and equipment based on user facial expression

Country Status (1)

Country Link
CN (1) CN115440196A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115641837A (en) * 2022-12-22 2023-01-24 北京资采信息技术有限公司 Intelligent robot conversation intention recognition method and system
CN116916497B (en) * 2023-09-12 2023-12-26 深圳市卡能光电科技有限公司 Nested situation identification-based illumination control method and system for floor cylindrical atmosphere lamp
CN117641667A (en) * 2023-09-12 2024-03-01 深圳市卡能光电科技有限公司 Intelligent control method and system for brightness of atmosphere lamp

Similar Documents

Publication Publication Date Title
CN111181939B (en) Network intrusion detection method and device based on ensemble learning
EP3478728B1 (en) Method and system for cell annotation with adaptive incremental learning
CN115440196A (en) Voice recognition method, device, medium and equipment based on user facial expression
WO2019176994A1 (en) Facial image identification system, identifier generation device, identification device, image identification system and identification system
CN110503054B (en) Text image processing method and device
CN113779308B (en) Short video detection and multi-classification method, device and storage medium
CN110807314A (en) Text emotion analysis model training method, device and equipment and readable storage medium
JP2011013732A (en) Information processing apparatus, information processing method, and program
JP6897749B2 (en) Learning methods, learning systems, and learning programs
CN108268823A (en) Target recognition methods and device again
CN110418204B (en) Video recommendation method, device, equipment and storage medium based on micro expression
JP2012203422A (en) Learning device, method and program
CN114639150A (en) Emotion recognition method and device, computer equipment and storage medium
CN111326139A (en) Language identification method, device, equipment and storage medium
CN117197904A (en) Training method of human face living body detection model, human face living body detection method and human face living body detection device
CN110874835B (en) Crop leaf disease resistance identification method and system, electronic equipment and storage medium
CN115690514A (en) Image recognition method and related equipment
CN112949456B (en) Video feature extraction model training and video feature extraction method and device
CN114445691A (en) Model training method and device, electronic equipment and storage medium
JP2002251592A (en) Learning method for pattern recognition dictionary
CN112132239B (en) Training method, device, equipment and storage medium
CN115101074A (en) Voice recognition method, device, medium and equipment based on user speaking emotion
CN104778479B (en) A kind of image classification method and system based on sparse coding extraction
CN114913602A (en) Behavior recognition method, behavior recognition device, behavior recognition equipment and storage medium
CN111382703B (en) Finger vein recognition method based on secondary screening and score fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination