CN115440196A - Voice recognition method, device, medium and equipment based on user facial expression - Google Patents
- Publication number
- CN115440196A (application number CN202211163199.0A)
- Authority
- CN
- China
- Prior art keywords
- user
- voice
- feature
- target
- preset
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
- G06F40/35—Discourse or dialogue representation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
- G06V10/761—Proximity, similarity or dissimilarity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/809—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data
- G06V10/811—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data the classifiers operating on different input data, e.g. multi-modal recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/172—Classification, e.g. identification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/174—Facial expression recognition
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/223—Execution procedure of a spoken command
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Multimedia (AREA)
- Theoretical Computer Science (AREA)
- Human Computer Interaction (AREA)
- General Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Acoustics & Sound (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- Medical Informatics (AREA)
- Databases & Information Systems (AREA)
- Software Systems (AREA)
- Computing Systems (AREA)
- Signal Processing (AREA)
- Oral & Maxillofacial Surgery (AREA)
- General Engineering & Computer Science (AREA)
- Child & Adolescent Psychology (AREA)
- Hospice & Palliative Care (AREA)
- Psychiatry (AREA)
- Image Analysis (AREA)
Abstract
The present disclosure provides a voice recognition method, apparatus, medium, and device based on a user's facial expression. The method comprises: determining, according to a recognition model, how the facial feature points of a target user in a monitoring environment change within a preset time period, so as to generate a facial dynamic feature image; matching a plurality of feature-region dynamic sub-images against preset dynamic sub-images of the corresponding feature regions to determine an emotion label for the target user; collecting audio data of the target user in the monitoring environment within the preset time period to generate the user voice corresponding to the target user; training a voice recognition model according to the emotion label; and performing semantic recognition on the user voice through the trained voice recognition model to generate semantic information corresponding to the target user. In this way, an intelligent device can recognize the user intention behind the user voice more accurately, the accuracy of voice recognition is improved, and a better product experience is provided to the user.
Description
Technical Field
The present invention relates to the field of speech recognition technologies, and in particular, to a speech recognition method, apparatus, medium, and device based on user facial expressions.
Background
In the prior art, user voice is merely transcribed into text, and the user intention is recognized from the text semantics alone. In a human-computer voice dialogue scenario, however, the user intention obtained purely from semantic analysis is often inaccurate, which seriously degrades the subsequent interaction between the intelligent device and the user and results in a poor user experience.
Disclosure of Invention
In view of the above, an object of the present disclosure is to provide a method, an apparatus, a medium, and a device for speech recognition based on facial expressions of a user, so as to solve the technical problem of inaccurate speech recognition in the related art.
To that end, a first aspect of the present disclosure provides a speech recognition method based on facial expressions of a user, the method comprising:
acquiring thermal images in a monitoring environment through an infrared acquisition device; in the case that an image recognition model determines that a human face exists in the monitoring environment, determining the facial feature points of the target user corresponding to the human face according to a feature recognition algorithm; and cyclically executing the following steps, based on a preset distribution rule of the facial feature points, until the facial feature points of the target user in the monitoring environment are determined to have changed: selecting frame images of a corresponding duration from an initial dynamic image according to a preset target duration to generate a facial dynamic feature image corresponding to the target user; matching the facial dynamic feature image against a preset standard dynamic image to generate a comparison result, and judging from the comparison result whether the matching succeeds; if the matching succeeds, judging that the target user exhibits no emotional fluctuation within the frame images of that duration, extending the target duration, and re-acquiring frame images of the corresponding duration; if the matching fails, extracting the frame images of that duration to generate the facial dynamic feature image corresponding to the target user, and segmenting the facial dynamic feature image, based on the preset distribution rule and a plurality of feature regions of the human face, into a plurality of feature-region dynamic sub-images corresponding to the plurality of feature regions, wherein the plurality of feature regions comprise at least an eye feature region, a nose feature region, and a mouth feature region;
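To make the segmentation step concrete, a minimal Python sketch follows; the fractional bounding boxes in `REGION_RULE` are hypothetical stand-ins for the patent's preset distribution rule of facial feature points, and the frame data is random for illustration:

```python
import numpy as np

# Hypothetical "preset distribution rule": each feature region is a fixed
# fractional bounding box (top, bottom, left, right) of the face image.
REGION_RULE = {
    "eyes":  (0.25, 0.50, 0.125, 0.875),
    "nose":  (0.375, 0.75, 0.25, 0.75),
    "mouth": (0.625, 0.875, 0.25, 0.75),
}

def segment_face(face_img: np.ndarray) -> dict:
    """Cut one H x W frame into per-region sub-images."""
    h, w = face_img.shape[:2]
    return {name: face_img[int(t * h):int(b * h), int(l * w):int(r * w)]
            for name, (t, b, l, r) in REGION_RULE.items()}

# A "feature-region dynamic sub-image" is then the stack of one region's
# crops across all frames of the facial dynamic feature image.
frames = np.random.rand(5, 100, 80)  # 5 frames of the facial dynamic image
dynamic_sub_images = {name: np.stack([segment_face(f)[name] for f in frames])
                      for name in REGION_RULE}
```

Each entry of `dynamic_sub_images` can then be matched against the preset dynamic sub-image of the same region.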
matching the plurality of feature-region dynamic sub-images against a plurality of preset dynamic sub-images corresponding to the plurality of feature regions; determining a plurality of expression recognition results corresponding to the feature-region dynamic sub-images; and fusing the plurality of expression recognition results according to preset weights to determine the emotion label corresponding to the target user, wherein each expression recognition result characterizes an emotion label for the target user, and the preset weights are set according to how strongly each feature region expresses its emotion label;
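The weighted fusion step can be sketched as follows; the patent specifies no concrete weights, so the values in `REGION_WEIGHTS` (mouth strongest) and the confidence scores are purely illustrative:

```python
# Hypothetical per-region weights reflecting the "strength relation" of how
# each feature region expresses emotion.
REGION_WEIGHTS = {"eyes": 0.35, "nose": 0.15, "mouth": 0.50}

def fuse_emotion(region_scores: dict) -> str:
    """Fuse per-region expression results (label -> confidence) by weight."""
    fused = {}
    for region, scores in region_scores.items():
        for emotion_label, confidence in scores.items():
            fused[emotion_label] = (fused.get(emotion_label, 0.0)
                                    + REGION_WEIGHTS[region] * confidence)
    return max(fused, key=fused.get)  # highest weighted score wins

label = fuse_emotion({
    "eyes":  {"happy": 0.7, "neutral": 0.3},
    "nose":  {"happy": 0.4, "neutral": 0.6},
    "mouth": {"happy": 0.9, "neutral": 0.1},
})
```

Here the mouth region dominates the fusion, so the emotion label follows its strongest expression result.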
collecting audio data of the target user in the monitoring environment within the preset time period to generate target audio data; identifying the user voice frequency band corresponding to the target user in the target audio data; performing noise reduction on the target audio data according to the user voice frequency band; and performing voice extraction on the noise-reduced target audio data according to a set voice feature to generate the user voice corresponding to the target user, wherein the user voice, collected through a microphone, is used to issue control instructions to an intelligent terminal;
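One simple way to realize frequency-band noise reduction is to zero all spectral components outside the identified voice band; this FFT-mask sketch is an assumption, since the patent does not name a filtering technique, and the signal frequencies below are synthetic:

```python
import numpy as np

def bandpass_denoise(audio: np.ndarray, sr: int, lo: float, hi: float) -> np.ndarray:
    """Zero every spectral component outside the [lo, hi] Hz voice band."""
    spectrum = np.fft.rfft(audio)
    freqs = np.fft.rfftfreq(audio.size, d=1.0 / sr)
    spectrum[(freqs < lo) | (freqs > hi)] = 0.0
    return np.fft.irfft(spectrum, n=audio.size)

sr = 8000
t = np.arange(sr) / sr                       # one second of audio
voice = np.sin(2 * np.pi * 220 * t)          # component inside the voice band
hum = np.sin(2 * np.pi * 50 * t)             # environmental mains hum
clean = bandpass_denoise(voice + hum, sr, lo=80.0, hi=1000.0)
```

With the band set to 80 to 1000 Hz, the 50 Hz environmental component is removed while the 220 Hz voice component passes through essentially unchanged.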
screening initial sample voice data corresponding to the emotion label out of an initial database; adding the initial sample voice data to a sample training set of a voice recognition model; performing recognition training on the voice recognition model based on the sample training set; and performing semantic recognition on the user voice through the trained voice recognition model to generate the semantic information corresponding to the target user, wherein the initial database comprises mapping relations between a plurality of initial sample voice data and a plurality of emotion labels.
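The screening step can be sketched minimally as a filter over label-annotated records; the records and field names below are invented for illustration and do not come from the patent:

```python
# Minimal sketch of the screening step: the "initial database" is modeled as
# (utterance, emotion_label, emotion_semantics) records.
INITIAL_DB = [
    ("turn it up", "happy", "raise the volume, user is pleased"),
    ("turn it up", "angry", "raise the volume, user is irritated"),
    ("stop",       "angry", "halt playback immediately"),
    ("play music", "happy", "start playback"),
]

def screen_samples(db, emotion_label, limit=None):
    """Select the samples whose stored label matches the detected emotion."""
    matches = [(text, semantics) for text, label, semantics in db
               if label == emotion_label]
    return matches[:limit] if limit is not None else matches

# Only the "angry" samples enter the training set for an angry user.
train_set = screen_samples(INITIAL_DB, "angry")
```

Note that the same utterance can map to different semantics under different emotion labels, which is exactly why the emotion-conditioned screening improves recognition of user intent.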
Further, performing noise reduction on the target audio data according to the user voice frequency band, and performing voice extraction on the noise-reduced target audio data according to the set voice feature to generate the user voice corresponding to the target user, includes:
analyzing the user voice in the target audio data according to the historical user voice corresponding to the target user so as to generate the user voice frequency band and the environmental audio according to the target audio data;
and performing noise reduction processing on the target audio data based on the user voice frequency band to remove the environmental audio in the target audio data, and performing topology recovery on the processed target audio data to generate the user voice corresponding to the target user.
Further, screening initial sample voice data corresponding to the emotion label out of an initial database, adding the initial sample voice data to the sample training set of the voice recognition model, performing recognition training on the voice recognition model based on the sample training set, and performing semantic recognition on the user voice through the trained voice recognition model to generate the semantic information corresponding to the target user, includes:
screening the initial sample voice data in the initial database based on the emotion tags to obtain a preset number of first sample voice data and corresponding first emotion semantics, wherein the first emotion semantics are semantic information of the first sample voice data under the emotion tags;
performing feature extraction on the first sample voice data through a feature extraction network of the voice recognition model to generate a feature vector corresponding to the first sample voice data, performing semantic recognition on the feature vector through a fully-connected neural network of the voice recognition model to generate target semantic information, and updating the voice recognition model according to the first emotion semantic under the condition that the target semantic information is inconsistent with the first emotion semantic;
and performing semantic recognition on the user voice based on the updated voice recognition model to generate the semantic information corresponding to the target user.
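The two-stage recognition described above (a feature extraction network followed by a fully connected network) can be sketched minimally; the toy weights, semantic vocabulary, and `recognize` helper below are hypothetical stand-ins for the patent's trained networks:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the model's two stages.
W_feat = rng.normal(size=(64, 16))   # "feature extraction network" weights
W_cls = rng.normal(size=(16, 3))     # "fully connected neural network" weights
SEMANTICS = ["raise volume", "halt playback", "start playback"]  # toy vocabulary

def recognize(sample_features: np.ndarray) -> str:
    """Map raw sample features to semantic information via the two stages."""
    feature_vector = np.tanh(sample_features @ W_feat)  # feature extraction
    logits = feature_vector @ W_cls                     # semantic recognition
    return SEMANTICS[int(np.argmax(logits))]

pred = recognize(rng.normal(size=64))
```

When the predicted semantics disagree with the first emotion semantics of a training sample, the weights would be updated (for example by gradient descent); that update step is omitted from this sketch.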
Further, matching the plurality of feature-region dynamic sub-images against the plurality of preset dynamic sub-images corresponding to the plurality of feature regions, determining the plurality of expression recognition results corresponding to the feature-region dynamic sub-images, and fusing the plurality of expression recognition results according to the preset weights to determine the emotion label corresponding to the target user, includes:
normalizing the dynamic sub-image of each feature region to generate grayscale sub-images of the same size;

identifying each grayscale sub-image and determining the feature region to which it corresponds;

acquiring a plurality of preset dynamic sub-images corresponding to that feature region, and matching the preset dynamic sub-images against the grayscale sub-image to determine the similarity between each preset dynamic sub-image and the grayscale sub-image, wherein each preset dynamic sub-image corresponds to a preset expression recognition result;

and determining, as the expression recognition result, the target expression recognition result corresponding to the preset dynamic sub-image with the greatest similarity.
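The normalization-and-matching steps above can be illustrated with a cosine-style similarity over mean-subtracted, same-size sub-images; the `normalize` sizing and the similarity measure are illustrative choices, since the patent does not fix a specific metric:

```python
import numpy as np

def normalize(img: np.ndarray, size=(8, 8)) -> np.ndarray:
    """Bring a sub-image to a common size (crop/zero-pad) and zero mean."""
    out = np.zeros(size)
    h, w = min(img.shape[0], size[0]), min(img.shape[1], size[1])
    out[:h, :w] = img[:h, :w]
    return out - out.mean()

def best_match(sub_img: np.ndarray, templates: dict) -> str:
    """Pick the preset expression result whose template is most similar."""
    g = normalize(sub_img)

    def similarity(t: np.ndarray) -> float:
        tg = normalize(t)
        denom = np.linalg.norm(g) * np.linalg.norm(tg)
        return float((g * tg).sum() / denom) if denom else 0.0

    # Each key names a preset expression recognition result; the one with
    # the greatest similarity is returned.
    return max(templates, key=lambda k: similarity(templates[k]))

templates = {"smile": np.eye(8), "frown": np.flipud(np.eye(8))}
result = best_match(np.eye(8), templates)
```

A query identical to the "smile" template scores a similarity of 1 against it and much lower against the "frown" template, so "smile" is selected.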
A second aspect of the present disclosure provides a voice recognition apparatus based on a facial expression of a user, the apparatus including:
the first generation module is configured to acquire thermal images in a monitoring environment through an infrared acquisition device; determine, in the case that an image recognition model determines that a human face exists in the monitoring environment, the facial feature points of the target user corresponding to the human face according to a feature recognition algorithm; and cyclically execute the following steps, based on a preset distribution rule of the facial feature points, until the facial feature points of the target user in the monitoring environment are determined to have changed: selecting frame images of a corresponding duration from an initial dynamic image according to a preset target duration to generate a facial dynamic feature image corresponding to the target user; matching the facial dynamic feature image against a preset standard dynamic image to generate a comparison result, and judging from the comparison result whether the matching succeeds; if the matching succeeds, judging that the target user exhibits no emotional fluctuation within the frame images of that duration, extending the target duration, and re-acquiring frame images of the corresponding duration; if the matching fails, extracting the frame images of that duration to generate the facial dynamic feature image corresponding to the target user, and segmenting the facial dynamic feature image, based on the preset distribution rule and a plurality of feature regions of the human face, into a plurality of feature-region dynamic sub-images corresponding to the plurality of feature regions, wherein the plurality of feature regions comprise at least an eye feature region, a nose feature region, and a mouth feature region;

the determining module is configured to match the plurality of feature-region dynamic sub-images against a plurality of preset dynamic sub-images corresponding to the plurality of feature regions, determine a plurality of expression recognition results corresponding to the feature-region dynamic sub-images, and fuse the plurality of expression recognition results according to preset weights to determine the emotion label corresponding to the target user, wherein each expression recognition result characterizes an emotion label for the target user, and the preset weights are set according to how strongly each feature region expresses its emotion label;

the second generation module is configured to collect audio data of the target user in the monitoring environment within the preset time period to generate target audio data, identify the user voice frequency band corresponding to the target user in the target audio data, perform noise reduction on the target audio data according to the user voice frequency band, and perform voice extraction on the noise-reduced target audio data according to a set voice feature to generate the user voice corresponding to the target user, wherein the user voice, collected through a microphone, is used to issue control instructions to the intelligent terminal;

and the third generation module is configured to screen initial sample voice data corresponding to the emotion label out of an initial database, add the initial sample voice data to a sample training set of a voice recognition model, perform recognition training on the voice recognition model based on the sample training set, and perform semantic recognition on the user voice through the trained voice recognition model to generate the semantic information corresponding to the target user, wherein the initial database comprises mapping relations between a plurality of initial sample voice data and a plurality of emotion labels.
Further, the second generating module may be further configured to:
analyzing the user voice in the target audio data according to the historical user voice corresponding to the target user so as to generate the user voice frequency band and the environmental audio according to the target audio data;
and denoising the target audio data based on the user voice frequency band, removing the environmental audio in the target audio data, and performing topology recovery on the processed target audio data to generate the user voice corresponding to the target user.
A third aspect of the present disclosure provides a computer storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method for speech recognition based on facial expressions of a user according to any one of the first aspect.
A fourth aspect of the present disclosure provides an electronic device comprising a processor and a memory storing a computer program which, when executed by the processor, performs the steps of the speech recognition method based on facial expressions of a user according to any one of the first aspect.
The present disclosure can achieve at least the following advantageous effects:
collecting how the facial feature points of the target user in the monitoring environment change within a preset time period to generate a facial dynamic feature image corresponding to the target user, and segmenting the facial dynamic feature image into a plurality of feature-region dynamic sub-images; matching the feature-region dynamic sub-images against the preset dynamic sub-images of the corresponding feature regions to determine the emotion label corresponding to the target user; collecting audio data of the target user in the monitoring environment within the preset time period, and performing voice extraction on the noise-reduced target audio data according to the voice feature to generate the user voice corresponding to the target user; screening initial sample voice data corresponding to the emotion label out of an initial database, adding it to the sample training set of the voice recognition model, performing recognition training on the voice recognition model based on the sample training set, and performing semantic recognition on the user voice through the trained voice recognition model to generate the semantic information corresponding to the target user. In this way, an emotion label is generated by judging the facial emotion of the user, the voice recognition model is trained according to that emotion label, and the trained model recognizes the semantics of the user voice, so that the intelligent device can recognize the user intention behind the user voice more accurately, the accuracy of voice recognition is improved, and a better product experience is provided to the user.
Drawings
Fig. 1 is a flowchart of a speech recognition method based on facial expressions of a user according to an embodiment of the present disclosure.
Fig. 2 is a block diagram of a speech recognition apparatus based on facial expressions of a user in an embodiment of the present disclosure.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein.
In the description of the present invention, it is to be understood that the terms "central," "longitudinal," "lateral," "length," "width," "thickness," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," "clockwise," "counterclockwise," "axial," "radial," "circumferential," and the like are used in the orientations and positional relationships indicated in the drawings for convenience in describing the invention and to simplify the description, and are not intended to indicate or imply that the referenced device or element must have a particular orientation, be constructed and operated in a particular orientation, and are not to be considered limiting of the invention.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
In the present invention, unless otherwise expressly stated or limited, the terms "mounted," "connected," "secured," and the like are to be construed broadly and can, for example, be fixedly connected, detachably connected, or integrally formed; can be mechanically or electrically connected; they may be directly connected or indirectly connected through intervening media, or they may be connected internally or in any other suitable relationship, unless expressly stated otherwise. The specific meanings of the above terms in the present invention can be understood according to specific situations by those of ordinary skill in the art.
In the present invention, unless expressly stated or limited otherwise, a first feature being "on" or "under" a second feature may mean that the first feature is in direct contact with the second feature, or that the first and second features are in indirect contact through an intermediate medium. Moreover, a first feature being "on," "above," or "over" a second feature may mean that the first feature is directly or obliquely above the second feature, or simply that the first feature is at a higher level than the second feature. A first feature being "under," "below," or "beneath" a second feature may mean that the first feature is directly or obliquely beneath the second feature, or simply that the first feature is at a lower level than the second feature.
It will be understood that when an element is referred to as being "secured to" or "disposed on" another element, it can be directly on the other element or intervening elements may also be present. When an element is referred to as being "connected" to another element, it can be directly connected to the other element or intervening elements may also be present. As used herein, the terms "vertical," "horizontal," "upper," "lower," "left," "right," and the like are for purposes of illustration only and are not intended to represent the only possible embodiments.
Fig. 1 is a flowchart of a speech recognition method based on facial expressions of a user in an embodiment of the present disclosure, as shown in fig. 1, the method includes the following steps:
in step S11, determining the change situation of the facial feature points of the target user in the monitoring environment within a preset time period according to an image recognition model to generate a facial dynamic feature image;
the method comprises the following steps of acquiring a thermal image in a monitoring environment through an infrared acquisition device, determining facial feature points of a target user corresponding to a human face according to a feature recognition algorithm under the condition that human face features exist in the monitoring environment based on an image recognition model, and circularly executing the following steps based on the preset distribution rules of the facial feature points until the facial feature points of the target user in the monitoring environment are determined to be changed: selecting a frame image with corresponding duration from an initial dynamic image according to preset target duration to generate a face dynamic feature image corresponding to a target user, matching the face dynamic feature image with a preset standard dynamic image to generate a comparison result, judging whether the matching is successful according to the comparison result, judging that the target user does not generate emotion fluctuation in the frame image with the corresponding duration if the matching is successful, prolonging the used target duration, re-obtaining the frame image with the corresponding duration, extracting the frame image with the duration if the matching is unsuccessful, generating the face dynamic feature image corresponding to the target user, segmenting the face dynamic feature image based on a preset distribution rule and a plurality of feature areas corresponding to the human face, and generating a plurality of feature area dynamic sub-images corresponding to the plurality of feature areas, wherein the plurality of feature areas at least comprise an eye feature area, a nose feature area and a mouth feature area.
In step S12, matching the plurality of feature region dynamic sub-images with a plurality of preset dynamic sub-images corresponding to the plurality of feature regions, and determining an emotion tag corresponding to the target user;
the expression recognition result is used for representing the emotion label corresponding to the target user, the preset weight is set according to the strength relation of the representation emotion label of each feature area, and the emotion label is determined by fusing a plurality of expression recognition results according to the preset weight and the expression recognition result corresponding to the dynamic sub-image of each feature area.
In step S13, collecting audio data of the target user in the monitoring environment within the preset time period, and generating the user voice corresponding to the target user;
Specifically, audio data of the target user in the monitoring environment within the preset time period is collected to generate target audio data; the user voice frequency band in the target audio data is identified; noise reduction processing is performed on the target audio data according to the user voice frequency band; and voice extraction is performed on the noise-reduced target audio data according to voice features to generate the user voice corresponding to the target user. The user voice, collected through a microphone, is used to issue a control instruction to the intelligent terminal.
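A minimal sketch of noise reduction restricted to the user's voice frequency band is shown below. The FFT-mask approach and the band limits are illustrative assumptions; the disclosure does not specify the noise-reduction algorithm.

```python
# Crude frequency-band noise reduction: zero out FFT bins outside the
# identified user voice band. Band limits here are assumed, not from the
# disclosure.
import numpy as np

def bandpass_denoise(audio, fs, low_hz, high_hz):
    """Keep only the spectral content of `audio` inside [low_hz, high_hz]."""
    spectrum = np.fft.rfft(audio)
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / fs)
    mask = (freqs >= low_hz) & (freqs <= high_hz)
    return np.fft.irfft(spectrum * mask, n=len(audio))
```

Voice extraction by voice features (e.g., speaker characteristics) would then run on the denoised signal.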
In step S14, a voice recognition model is trained according to the emotion labels, and the trained voice recognition model is used for carrying out semantic recognition on the user voice to generate semantic information corresponding to the target user;
Specifically, initial sample voice data corresponding to the emotion label is screened out from an initial database and added to a sample training set of a voice recognition model; recognition training is performed on the voice recognition model based on the sample training set; and semantic recognition is performed on the user voice through the trained voice recognition model to generate the semantic information corresponding to the target user. The initial database comprises mapping relationships between a plurality of groups of initial sample voice data and a plurality of emotion labels.
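The screening of initial sample voice data by emotion label can be sketched as follows; modeling the initial database as a list of (voice, label) pairs is an assumed data layout, since the disclosure only specifies that a mapping between samples and emotion labels exists.

```python
# Sketch of the sample-screening step: select from the "initial database"
# every sample mapped to the given emotion label, and add it to the
# training set for the voice recognition model.

def screen_samples(database, emotion_label, training_set):
    """Append to `training_set` every sample mapped to `emotion_label`."""
    matched = [voice for voice, label in database if label == emotion_label]
    training_set.extend(matched)
    return training_set
```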
By adopting the above technical scheme, the change of the facial feature points of the target user in the monitoring environment within a preset time period is collected to generate a facial dynamic feature image corresponding to the target user, and the facial dynamic feature image is segmented to generate a plurality of feature region dynamic sub-images. The feature region dynamic sub-images are matched with the preset dynamic sub-images of the corresponding feature regions to determine an emotion label for the target user. Audio data of the target user in the monitoring environment within the preset time period is collected, noise reduction is applied, and voice extraction is performed according to voice features to generate the user voice corresponding to the target user. Initial sample voice data corresponding to the emotion label is screened from an initial database and added to the sample training set of a voice recognition model; the model is trained on that set; and the trained model performs semantic recognition on the user voice to generate the semantic information corresponding to the target user. In this way, an emotion label is produced by judging the user's facial emotion, the voice recognition model is trained according to that label, and the trained model recognizes the semantics of the user's voice, so that an intelligent device can more accurately identify the user intention behind the voice, improving the accuracy of voice recognition and giving the user a better product experience.
Further, the step S13 includes:
analyzing the user voice in the target audio data according to the historical user voice corresponding to the target user so as to generate the user voice frequency band and the environmental audio according to the target audio data;
and performing noise reduction processing on the target audio data based on the user voice frequency band, removing the environmental audio in the target audio data, and performing topology recovery on the processed target audio data to generate the user voice corresponding to the target user.
Further, the step S14 includes:
screening initial sample voice data in the initial database based on the emotion tags to obtain a preset number of first sample voice data and corresponding first emotion semantics, wherein the first emotion semantics are semantic information of the first sample voice data under the emotion tags;
performing feature extraction on the first sample voice data through a feature extraction network of the voice recognition model to generate a feature vector corresponding to the first sample voice data, performing semantic recognition on the feature vector through a full-connection neural network of the voice recognition model to generate target semantic information, and updating the voice recognition model according to the first emotion semantics under the condition that the target semantic information is determined to be inconsistent with the first emotion semantics;
and performing semantic recognition on the user voice based on the updated voice recognition model to generate the semantic information corresponding to the target user.
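The recognize-and-update training check in the steps above can be sketched with a toy model; here a plain dictionary stands in for the feature-extraction network and fully-connected neural network, which is an assumption for illustration only.

```python
# Illustrative training-check loop: run each first sample through a
# (stubbed) recognizer and update the model whenever the predicted
# semantics disagree with the labelled first emotion semantics.

def recognition_training(model, samples):
    """`model` maps a feature key to semantics; `samples` is a list of
    (feature_key, expected_semantics) pairs. Returns the update count."""
    updates = 0
    for feature_key, expected in samples:
        predicted = model.get(feature_key)
        if predicted != expected:          # inconsistent with first emotion semantics
            model[feature_key] = expected  # update the model toward the label
            updates += 1
    return updates
```

In a real implementation the "update" would be a gradient step on the network parameters rather than a dictionary write.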
Further, the step S12 includes:
normalizing any one of the feature region dynamic sub-images to generate a dynamic gray level sub-image with the same size;
identifying the gray level sub-image to determine the characteristic region corresponding to the gray level sub-image;
acquiring a plurality of preset dynamic sub-images corresponding to the characteristic area, and matching the preset dynamic sub-images with the gray level sub-images to determine the similarity between the preset dynamic sub-images and the gray level sub-images, wherein each frame of preset dynamic sub-image corresponds to one preset expression recognition result;
and determining a target expression recognition result corresponding to the target preset dynamic sub-image with the maximum similarity as the expression recognition result.
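The normalize-and-match procedure above can be sketched as follows. The nearest-neighbour resize, the mean-absolute-difference similarity, and the 8x8 target size are illustrative assumptions; the disclosure does not fix the normalization or similarity measure.

```python
# Sketch of the matching step: normalize a grayscale sub-image to a fixed
# size, score similarity against each preset sub-image, and return the
# recognition result of the most similar preset.
import numpy as np

def resize_nearest(img, shape):
    """Nearest-neighbour resize of a 2-D array to `shape`."""
    rows = np.linspace(0, img.shape[0] - 1, shape[0]).round().astype(int)
    cols = np.linspace(0, img.shape[1] - 1, shape[1]).round().astype(int)
    return img[np.ix_(rows, cols)]

def match_expression(gray_img, presets, shape=(8, 8)):
    """`presets` maps an expression recognition result to a preset
    grayscale image; returns the result with maximum similarity."""
    target = resize_nearest(gray_img, shape).astype(float)
    def similarity(p):
        q = resize_nearest(p, shape).astype(float)
        return -np.abs(target - q).mean()  # less negative = more similar
    return max(presets, key=lambda k: similarity(presets[k]))
```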
Fig. 2 is a block diagram of a speech recognition apparatus based on facial expressions of a user according to an embodiment of the present disclosure, where the recognition apparatus 100 includes: a first generation module 110, a determination module 120, a second generation module 130, and a third generation module 140.
The first generating module 110 is configured to acquire a thermal image in a monitoring environment through an infrared acquisition device; determine, according to a feature recognition algorithm, facial feature points of the target user corresponding to the human face when it is confirmed, based on an image recognition model, that a human face exists in the monitoring environment; and cyclically execute the following steps, based on a preset distribution rule of the facial feature points, until the facial feature points of the target user in the monitoring environment are determined to have changed: select frame images of a corresponding duration from an initial dynamic image according to a preset target duration to generate a facial dynamic feature image corresponding to the target user; match the facial dynamic feature image against a preset standard dynamic image to generate a comparison result, and judge from the comparison result whether the matching succeeds; if the matching succeeds, judge that the target user shows no emotion fluctuation within the frame images of that duration, extend the target duration, and obtain frame images of the extended duration again; if the matching fails, extract the frame images of that duration to generate the facial dynamic feature image corresponding to the target user; and segment the facial dynamic feature image, based on the preset distribution rule and a plurality of feature regions corresponding to the human face, into a plurality of feature region dynamic sub-images corresponding to the plurality of feature regions, wherein the plurality of feature regions include at least an eye feature region, a nose feature region, and a mouth feature region.
The determining module 120 is configured to match the plurality of feature region dynamic sub-images with the preset dynamic sub-images of the plurality of feature regions, determine a plurality of expression recognition results corresponding to the plurality of feature region dynamic sub-images, and fuse the plurality of expression recognition results according to preset weights to determine the emotion label corresponding to the target user, where the expression recognition results are used for representing the emotion label corresponding to the target user, and the preset weights are set according to how strongly each feature region characterizes the emotion label.

The second generating module 130 is configured to collect audio data of the target user in the monitoring environment within the preset time period to generate target audio data, identify the user voice frequency band corresponding to the target user in the target audio data, perform noise reduction processing on the target audio data according to the user voice frequency band, and perform voice extraction on the noise-reduced target audio data according to set voice features to generate the user voice corresponding to the target user, where the user voice collected through a microphone is used to issue a control instruction to the intelligent terminal.
A third generating module 140, configured to screen initial sample voice data corresponding to the emotion tags from an initial database, add the initial sample voice data into a sample training set of a voice recognition model, perform recognition training on the voice recognition model based on the sample training set, perform semantic recognition on the user voice through the trained voice recognition model, and generate semantic information corresponding to the target user, where the initial database includes mapping relationships between a plurality of initial sample voice data and a plurality of emotion tags.
The device collects the change of the facial feature points of the target user in the monitoring environment within a preset time period to generate a facial dynamic feature image corresponding to the target user, and segments the facial dynamic feature image to generate a plurality of feature region dynamic sub-images. The feature region dynamic sub-images are matched with the preset dynamic sub-images of the corresponding feature regions to determine an emotion label for the target user. Audio data of the target user in the monitoring environment within the preset time period is collected, noise reduction is applied, and voice extraction is performed according to voice features to generate the user voice corresponding to the target user. Initial sample voice data corresponding to the emotion label is screened from an initial database and added to the sample training set of a voice recognition model; the model is trained on that set; and the trained model performs semantic recognition on the user voice to generate the semantic information corresponding to the target user. In this way, an emotion label is produced by judging the user's facial emotion, the voice recognition model is trained according to that label, and the trained model recognizes the semantics of the user's voice, so that an intelligent device can more accurately identify the user intention behind the voice, improving the accuracy of voice recognition and giving the user a better product experience.
Further, the first generating module 110 may be further configured to:
circularly executing the following steps based on a preset distribution rule of the facial feature points until the facial feature points of the target user in the monitoring environment are determined to have changed:
selecting a frame image with corresponding duration from the initial dynamic image according to the preset target duration to generate a facial dynamic feature image corresponding to the target user; matching the face dynamic characteristic image with a preset standard dynamic image to generate a comparison result, and judging whether the matching is successful according to the comparison result;
if the matching is successful, judging that the target user does not generate emotion fluctuation in the frame image with the corresponding duration, prolonging the used target duration, and obtaining the frame image with the corresponding duration again;
and if the matching is unsuccessful, extracting the frame image of the duration to generate a face dynamic characteristic image corresponding to the target user.
Further, the second generating module 130 may be further configured to:
analyzing the user voice in the target audio data according to the historical user voice corresponding to the target user so as to generate the user voice frequency band and the environmental audio according to the target audio data;
and denoising the target audio data based on the user voice frequency band, removing the environmental audio in the target audio data, and performing topology recovery on the processed target audio data to generate the user voice corresponding to the target user.
Further, the third generating module 140 may be further configured to:
screening the initial sample voice data in the initial database based on the emotion tags to obtain a preset number of first sample voice data and corresponding first emotion semantics, wherein the first emotion semantics are semantic information of the first sample voice data under the emotion tags;
performing feature extraction on the first sample voice data through a feature extraction network of the voice recognition model to generate a feature vector corresponding to the first sample voice data, performing semantic recognition on the feature vector through a full-connection neural network of the voice recognition model to generate target semantic information, and updating the voice recognition model according to the first emotion semantics under the condition that the target semantic information is determined to be inconsistent with the first emotion semantics;
and performing semantic recognition on the user voice based on the updated voice recognition model to generate the semantic information corresponding to the target user.
Further, the determining module 120 may be further configured to:
normalizing the dynamic sub-images in any characteristic region to generate dynamic gray sub-images with the same size;
identifying the gray level sub-image to determine the characteristic region corresponding to the gray level sub-image;
acquiring a plurality of preset dynamic sub-images corresponding to the feature area, and matching the preset dynamic sub-images with the gray level sub-images to determine the similarity between the preset dynamic sub-images and the gray level sub-images, wherein each preset dynamic sub-image corresponds to a preset expression recognition result;
and determining a target expression recognition result corresponding to the target preset dynamic sub-image with the maximum similarity as the expression recognition result.
The present disclosure also provides a computer storage medium having a computer program stored thereon, where the computer program, when executed by a processor, implements the steps of the voice recognition method based on user facial expressions described above.
The present disclosure also provides an electronic device including a computer program which, when executed by a processor of the electronic device, implements the steps of the voice recognition method based on user facial expressions described above.
For the sake of brevity, not all possible combinations of the technical features of the above embodiments are described; however, any combination of these technical features should be considered within the scope of the present disclosure as long as the combination contains no contradiction.
The above-mentioned embodiments express only several implementations of the present invention, and their description is specific and detailed, but they should not be understood as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, and these fall within the protection scope of the present invention. Therefore, the protection scope of this patent should be subject to the appended claims.
Claims (8)
1. A method of speech recognition based on facial expressions of a user, the method comprising:
acquiring thermal images in a monitoring environment through an infrared acquisition device, determining facial feature points of a target user corresponding to a human face according to a feature recognition algorithm under the condition that the human face exists in the monitoring environment based on an image recognition model, and circularly executing the following steps based on a preset distribution rule of the facial feature points until the facial feature points of the target user in the monitoring environment are determined to be changed: selecting a frame image with corresponding duration from an initial dynamic image according to preset target duration to generate a face dynamic feature image corresponding to a target user, matching the face dynamic feature image with a preset standard dynamic image to generate a comparison result, judging whether the matching is successful according to the comparison result, judging that the target user does not generate emotion fluctuation in the frame image with the corresponding duration if the matching is successful, prolonging the used target duration, re-obtaining the frame image with the corresponding duration, extracting the frame image with the duration if the matching is unsuccessful, generating a face dynamic feature image corresponding to the target user, segmenting the face dynamic feature image based on a preset distribution rule and a plurality of feature areas corresponding to the human face, and generating a plurality of feature area dynamic sub-images corresponding to the plurality of feature areas, wherein the plurality of feature areas at least comprise an eye feature area, a nose feature area and a mouth feature area;
matching the plurality of feature area dynamic sub-images with a plurality of preset dynamic sub-images corresponding to the plurality of feature areas, determining a plurality of expression recognition results corresponding to the plurality of feature area dynamic sub-images, fusing the plurality of expression recognition results according to preset weights, and determining an emotion label corresponding to the target user, wherein the expression recognition results are used for representing the emotion label corresponding to the target user, and the preset weights are set according to the strength relation of the emotion label represented by each feature area;
acquiring audio data of the target user in the monitoring environment within the preset time period to generate target audio data, identifying a user voice frequency band corresponding to the target user in the target audio data, performing noise reduction processing on the target audio data according to the user voice frequency band, and performing voice extraction on the target audio data subjected to noise reduction according to set voice characteristics to generate user voice corresponding to the target user, wherein the user voice acquired through a microphone issues a control instruction to an intelligent terminal;
screening initial sample voice data corresponding to the emotion labels from an initial database, adding the initial sample voice data into a sample training set of a voice recognition model, performing recognition training on the voice recognition model based on the sample training set, and performing semantic recognition on the user voice through the trained voice recognition model to generate semantic information corresponding to the target user, wherein the initial database comprises mapping relations between a plurality of initial sample voice data and a plurality of emotion labels.
2. The identification method according to claim 1, wherein the performing noise reduction processing on the target audio data according to the user voice frequency band, and performing voice extraction on the target audio data subjected to noise reduction according to a set voice feature to generate the user voice corresponding to the target user comprises:
analyzing the user voice in the target audio data according to the historical user voice corresponding to the target user so as to generate the user voice frequency band and the environmental audio according to the target audio data;
and performing noise reduction processing on the target audio data based on the user voice frequency band to remove the environmental audio in the target audio data, and performing topology recovery on the processed target audio data to generate the user voice corresponding to the target user.
3. The recognition method according to claim 1, wherein the step of screening out initial sample voice data corresponding to the emotion label from an initial database, adding the initial sample voice data into a sample training set of a voice recognition model, performing recognition training on the voice recognition model based on the sample training set, and performing semantic recognition on the user voice through the trained voice recognition model to generate semantic information corresponding to the target user comprises:
screening the initial sample voice data in the initial database based on the emotion tags to obtain a preset number of first sample voice data and corresponding first emotion semantics, wherein the first emotion semantics are semantic information of the first sample voice data under the emotion tags;
performing feature extraction on the first sample voice data through a feature extraction network of the voice recognition model to generate a feature vector corresponding to the first sample voice data, performing semantic recognition on the feature vector through a fully-connected neural network of the voice recognition model to generate target semantic information, and updating the voice recognition model according to the first emotion semantics under the condition that the target semantic information is determined to be inconsistent with the first emotion semantics;
and performing semantic recognition on the user voice based on the updated voice recognition model to generate the semantic information corresponding to the target user.
4. The identification method according to claim 1, wherein the matching the plurality of feature region dynamic sub-images with a plurality of preset dynamic sub-images corresponding to the plurality of feature regions to determine a plurality of expression identification results corresponding to the plurality of feature region dynamic sub-images, and the fusing the plurality of expression identification results according to preset weights to determine the emotion label corresponding to the target user comprises:
normalizing the dynamic sub-images in any characteristic region to generate dynamic gray sub-images with the same size;
identifying the gray level sub-image to determine the characteristic region corresponding to the gray level sub-image;
acquiring a plurality of preset dynamic sub-images corresponding to the feature area, and matching the preset dynamic sub-images with the gray level sub-images to determine the similarity between the preset dynamic sub-images and the gray level sub-images, wherein each preset dynamic sub-image corresponds to a preset expression recognition result;
and determining a target expression recognition result corresponding to the target preset dynamic sub-image with the maximum similarity as the expression recognition result.
5. A speech recognition apparatus based on facial expressions of a user, comprising:
the first generation module is used for acquiring a thermal image in a monitoring environment through an infrared acquisition device, determining facial feature points of a target user corresponding to a human face according to a feature recognition algorithm under the condition that the human face exists in the monitoring environment is confirmed based on an image recognition model, and circularly executing the following steps based on a preset distribution rule of the facial feature points until the facial feature points of the target user in the monitoring environment are determined to be changed: selecting a frame image with corresponding duration from an initial dynamic image according to preset target duration to generate a facial dynamic feature image corresponding to a target user, matching the facial dynamic feature image with a preset standard dynamic image to generate a comparison result, judging whether the matching is successful according to the comparison result, judging that the target user does not generate emotion fluctuation in the frame image with the corresponding duration if the matching is successful, prolonging the used target duration, re-obtaining the frame image with the corresponding duration, extracting the frame image with the duration if the matching is unsuccessful, generating a facial dynamic feature image corresponding to the target user, and segmenting the facial dynamic feature image based on a preset distribution rule and a plurality of feature areas corresponding to the target user to generate a plurality of feature area dynamic sub-images corresponding to the plurality of feature areas; wherein the plurality of feature regions includes at least an eye feature region, a nose feature region, and a mouth feature region;
the determining module is used for matching the plurality of feature area dynamic sub-images with a plurality of preset dynamic sub-images corresponding to the plurality of feature areas, determining a plurality of expression recognition results corresponding to the plurality of feature area dynamic sub-images, fusing the plurality of expression recognition results according to preset weights, and determining an emotion label corresponding to the target user, wherein the expression recognition results are used for representing the emotion label corresponding to the target user, and the preset weights are set according to the strength relation of representing the emotion label in each feature area;
the second generation module is used for acquiring audio data of the target user in the monitoring environment within the preset time period, generating target audio data, identifying a user voice frequency band corresponding to the target user in the target audio data, performing noise reduction processing on the target audio data according to the user voice frequency band, performing voice extraction on the target audio data subjected to noise reduction according to set voice characteristics, and generating user voice corresponding to the target user, wherein a control instruction is issued to an intelligent terminal by acquiring the user voice acquired by a microphone;
the third generation module is used for screening out initial sample voice data corresponding to the emotion labels from an initial database, adding the initial sample voice data into a sample training set of a voice recognition model, performing recognition training on the voice recognition model based on the sample training set, and performing semantic recognition on the user voice through the trained voice recognition model to generate semantic information corresponding to the target user, wherein the initial database comprises mapping relations between a plurality of initial sample voice data and a plurality of emotion labels.
6. The identification device of claim 5, wherein the second generation module is further configured to:
analyzing the user voice in the target audio data according to the historical user voice corresponding to the target user so as to generate the user voice frequency band and the environmental audio according to the target audio data;
and denoising the target audio data based on the user voice frequency band, removing the environmental audio in the target audio data, and performing topology recovery on the processed target audio data to generate the user voice corresponding to the target user.
7. A computer storage medium, having a computer program stored thereon, which, when being executed by a processor, performs the steps of the method for speech recognition based on facial expressions of a user according to any one of claims 1 to 4.
8. An electronic device comprising a computer program, wherein the computer program, when executed by a processor, performs the steps of the method for speech recognition based on facial expressions of a user according to any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211163199.0A CN115440196A (en) | 2022-09-23 | 2022-09-23 | Voice recognition method, device, medium and equipment based on user facial expression |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115440196A true CN115440196A (en) | 2022-12-06 |
Family
ID=84249871
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211163199.0A Pending CN115440196A (en) | 2022-09-23 | 2022-09-23 | Voice recognition method, device, medium and equipment based on user facial expression |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115440196A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115641837A (en) * | 2022-12-22 | 2023-01-24 | 北京资采信息技术有限公司 | Intelligent robot conversation intention recognition method and system |
CN116916497B (en) * | 2023-09-12 | 2023-12-26 | 深圳市卡能光电科技有限公司 | Nested situation identification-based illumination control method and system for floor cylindrical atmosphere lamp |
CN117641667A (en) * | 2023-09-12 | 2024-03-01 | 深圳市卡能光电科技有限公司 | Intelligent control method and system for brightness of atmosphere lamp |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||