CN109766759A - Emotion recognition method and related product - Google Patents

Emotion recognition method and related product

Info

Publication number
CN109766759A
CN109766759A
Authority
CN
China
Prior art keywords
emotion
target
behavior
limb
images
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811519898.8A
Other languages
Chinese (zh)
Inventor
陈奕丹
谢利民
莫磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Yuntian Lifei Technology Co Ltd
Original Assignee
Chengdu Yuntian Lifei Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Yuntian Lifei Technology Co Ltd filed Critical Chengdu Yuntian Lifei Technology Co Ltd
Priority to CN201811519898.8A priority Critical patent/CN109766759A/en
Publication of CN109766759A publication Critical patent/CN109766759A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The embodiments of the present application provide an emotion recognition method and a related product. The method includes: acquiring a video clip and an audio clip of a target user within a specified time period; analyzing the video clip to obtain the target limb behavior and target facial expression of the target user; performing parameter extraction on the audio clip to obtain the target voice characteristic parameters of the target user; and deciding the target emotion of the target user according to the target limb behavior, the facial expression, and the voice characteristic parameters. With the embodiments of the present application, limb behavior, facial expression, and voice features can be parsed from video; each of these three dimensions reflects the user's emotion to a certain extent, and the user's emotion is decided jointly from the three dimensions, so the user's emotion can be accurately recognized.

Description

Emotion recognition method and related product
Technical Field
The application relates to the technical field of video monitoring, in particular to an emotion recognition method and a related product.
Background
With the rapid development of the economy, society, and culture, domestic and international exchange is increasing day by day, more and more people are moving into cities from elsewhere, and the resulting population growth accelerates urbanization and poses greater challenges for city management. At present, a person's expression is mainly recognized by acquiring image information of the person through a camera, and the person's emotion is then analyzed from the expression; however, this single-dimension judgment yields low emotion recognition accuracy.
Disclosure of Invention
The embodiments of the application provide an emotion recognition method and a related product, which can accurately recognize the emotion of a user.
In a first aspect, an embodiment of the present application provides an emotion recognition method, including:
acquiring a video clip and an audio clip aiming at a target user in a specified time period;
analyzing the video segments to obtain target limb behaviors and target facial expressions of the target user;
extracting parameters of the audio clip to obtain target voice characteristic parameters of the target user;
and deciding the target emotion of the target user according to the target limb behaviors, the facial expression and the voice characteristic parameters.
Optionally, the method further comprises:
performing identity verification on the target user;
after the target user passes the identity verification, acquiring an indoor map;
marking the position of the target user in the indoor map to obtain a target area where the target user is located;
determining a target control parameter corresponding to the target emotion and used for controlling at least one piece of intelligent home equipment corresponding to the target area according to a mapping relation between preset emotion and the control parameter;
and adjusting the at least one piece of intelligent household equipment according to the target control parameter.
Further optionally, the authenticating the target user includes:
acquiring a first face image of the target user;
performing image quality evaluation on the first face image to obtain a target image quality evaluation value;
determining a target matching threshold corresponding to the target image quality evaluation value according to a mapping relation between a preset image quality evaluation value and the matching threshold;
extracting the contour of the first face image to obtain a first peripheral contour;
extracting feature points of the first face image to obtain a first feature point set;
matching the first peripheral outline with a second peripheral outline of the preset face template to obtain a first matching value;
matching the first characteristic point set with a second characteristic point set of the preset face template to obtain a second matching value;
determining a target matching value according to the first matching value and the second matching value;
and when the target matching value is larger than the target matching threshold value, confirming that the target user identity authentication is passed.
A second aspect of embodiments of the present application provides an emotion recognition apparatus, including:
an acquisition unit configured to acquire a video clip and an audio clip for the target user for a specified time period;
the analysis unit is used for analyzing the video segments to obtain target limb behaviors and target facial expressions of the target users;
the extraction unit is used for extracting parameters of the audio clips to obtain target voice characteristic parameters of the target user;
and the decision unit is used for deciding the target emotion of the target user according to the target limb behaviors, the facial expression and the voice characteristic parameters.
In a third aspect, embodiments of the present application provide an emotion recognition apparatus, including a processor, a memory, a communication interface, and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the processor, and the program includes instructions for performing the steps in the first aspect of the embodiments of the present application.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program for electronic data exchange, where the computer program enables a computer to perform some or all of the steps described in the first aspect of the embodiment of the present application.
In a fifth aspect, embodiments of the present application provide a computer program product, where the computer program product includes a non-transitory computer-readable storage medium storing a computer program, where the computer program is operable to cause a computer to perform some or all of the steps as described in the first aspect of the embodiments of the present application. The computer program product may be a software installation package.
The embodiment of the application has the following beneficial effects:
According to the emotion recognition method and related products provided by the embodiments of the present application, a video clip and an audio clip of a target user within a specified time period are acquired; the video clip is analyzed to obtain the target limb behavior and target facial expression of the target user; parameter extraction is performed on the audio clip to obtain the target voice characteristic parameters of the target user; and the target emotion of the target user is decided according to the target limb behavior, the facial expression, and the voice characteristic parameters. In this way, limb behavior, facial expression, and voice features are parsed from the video; each of these three dimensions reflects the user's emotion to a certain extent, and the user's emotion is decided jointly from the three dimensions, so the user's emotion can be accurately recognized.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present application, and that other drawings can be obtained by those skilled in the art based on these drawings without creative effort.
Fig. 1A is a schematic flowchart of an embodiment of an emotion recognition method provided in an embodiment of the present application;
fig. 1B is a schematic illustration of another emotion recognition method provided in an embodiment of the present application;
FIG. 2 is a schematic flow chart of an embodiment of another emotion recognition method provided by an embodiment of the present application;
fig. 3 is a schematic structural diagram of an embodiment of an emotion recognition device provided in an embodiment of the present application;
fig. 4 is a schematic structural diagram of another emotion recognition device provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," "third," and "fourth," etc. in the description and claims of this application and in the accompanying drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
The emotion recognition apparatus described in the embodiments of the present application may include a smartphone (e.g., an Android phone, an iOS phone, a Windows phone, etc.), a tablet computer, a palmtop computer, a notebook computer, a video matrix, a monitoring platform, a mobile Internet device (MID), or a wearable device. These are merely examples rather than an exhaustive list, and the apparatus is not limited to the foregoing devices; of course, the emotion recognition apparatus may also be a server.
It should be noted that, the emotion recognition apparatus in the embodiment of the present application may be connected to a plurality of cameras, each of which may be used to capture a video image, and each of which may have a position mark corresponding to the camera, or may have a number corresponding to the camera. Typically, the camera may be located in a public place, such as a school, museum, intersection, pedestrian street, office building, garage, airport, hospital, subway station, bus station, supermarket, hotel, entertainment venue, and the like. After the camera shoots the video image, the video image can be stored in a memory of a system where the emotion recognition device is located. The memory may store a plurality of image libraries, each image library may contain different video images of the same person, and of course, each image library may also be used to store video images of an area or video images captured by a specific camera.
Further optionally, in this embodiment of the application, each frame of video image shot by a camera corresponds to one piece of attribute information, where the attribute information is at least one of the following: the shooting time of the video image, the position of the video image, the attribute parameters of the video image (format, size, resolution, etc.), the number of the video image, and the person attributes in the video image. The person attributes in the video image may include, but are not limited to: the number of persons in the video image, the positions of the persons, the angles of the persons, their ages, the image quality, and the like.
It should be further noted that the video image acquired by each camera is usually a dynamic face image. Therefore, in the embodiments of the present application, angle constraints may be specified for the face image, and the angle information may include, but is not limited to, the horizontal rotation angle, pitch angle, or tilt angle. For example, it may be required that the dynamic face image data have an interocular distance of not less than 30 pixels, with more than 60 pixels recommended. The horizontal rotation angle should not exceed ±30 degrees, the pitch angle should not exceed ±20 degrees, and the tilt angle should not exceed ±45 degrees; it is recommended that the horizontal rotation angle not exceed ±15 degrees, the pitch angle not exceed ±10 degrees, and the tilt angle not exceed ±15 degrees. The face image may also be screened for occlusion by other objects: in general, the main area of the face should not be blocked by accessories such as dark sunglasses, masks, or exaggerated jewelry; of course, dust on the camera may also cause the face image to be partially blocked. The image format of the video image in the embodiments of the present application may include, but is not limited to, BMP, JPEG, JPEG2000, PNG, and the like. The size of the video image may be 10-30 KB, and each video image may also correspond to information such as the shooting time, the unified serial number of the camera that shot the video image, and a link to the panoramic image corresponding to the face image (a feature correspondence file is established between the face image and the global image).
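For illustration only, the sketch below shows how the capture constraints described above might be checked before a face image is passed on for recognition; the FaceMeta fields and the helper names are hypothetical, and only the numeric thresholds come from the ranges stated above.

```python
# Minimal sketch of the face-image capture constraints described above.
# The FaceMeta structure is an illustrative assumption, not part of the application.
from dataclasses import dataclass

@dataclass
class FaceMeta:
    interocular_px: float   # distance between eye centers, in pixels
    yaw_deg: float          # horizontal rotation angle
    pitch_deg: float
    tilt_deg: float

def passes_capture_constraints(m: FaceMeta) -> bool:
    """Hard limits: interocular distance >= 30 px, |yaw| <= 30, |pitch| <= 20, |tilt| <= 45 degrees."""
    return (m.interocular_px >= 30
            and abs(m.yaw_deg) <= 30
            and abs(m.pitch_deg) <= 20
            and abs(m.tilt_deg) <= 45)

def meets_recommended_quality(m: FaceMeta) -> bool:
    """Recommended limits: interocular distance > 60 px, |yaw| <= 15, |pitch| <= 10, |tilt| <= 15 degrees."""
    return (m.interocular_px > 60
            and abs(m.yaw_deg) <= 15
            and abs(m.pitch_deg) <= 10
            and abs(m.tilt_deg) <= 15)
```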
Please refer to fig. 1A, which is a flowchart illustrating an embodiment of a method for emotion recognition according to an embodiment of the present application. The emotion recognition method described in this embodiment includes the following steps:
101. video segments and audio segments for a target user are obtained for a specified time period.
The specified time period can be set by the user or by system default. The video clip and the audio clip can correspond to the same physical scene and the same specified time period. The target user can be any user; the target user can be tracked and filmed to obtain the video clip, while audio is recorded at the same site to obtain the audio clip.
102. And analyzing the video segments to obtain the target limb behaviors and the target facial expressions of the target users.
The video clip captures a series of limb actions of the target user, so limb-action recognition can be performed on it to obtain the target limb behavior, which may comprise at least one action. In this embodiment, the limb behavior may be at least one of the following: walking, standing with hands on hips, running, raising a hand, typing on a keyboard, thinking, holding hands, fighting, and the like, which is not limited herein. Certainly, the video clip may further include a human face, and expression recognition may be performed on the face to obtain the target facial expression. In this embodiment of the application, the expression may be at least one of the following: joy, anger, sorrow, happiness, depression, irritability, surprise, embarrassment, excitement, tension, and the like, but is not limited thereto.
Optionally, in the step 102, analyzing the video segment to obtain the body behavior and the facial expression of the target user may include the following steps:
21. analyzing the video clips to obtain multi-frame video images;
22. carrying out image segmentation on the multi-frame video image to obtain a plurality of target images, wherein each target image is a human body image of the target user;
23. performing behavior recognition according to the target images to obtain the target limb behaviors;
24. carrying out face recognition on the target images to obtain a plurality of face images;
25. and performing expression recognition on the plurality of facial images to obtain a plurality of expressions, and taking the expression with the largest occurrence frequency in the plurality of expressions as the target facial expression.
In a specific implementation, the video clip may be parsed into multiple frames of video images. Since not every frame contains a human body, image segmentation may be performed on the multiple frames to obtain a plurality of target images, each containing the whole-body image of the user. Behavior recognition may then be performed on the plurality of target images to obtain the target limb behavior; specifically, the plurality of target images may be input into a preset neural network model (set by the user or by default, for example a convolutional neural network model) to obtain at least one limb behavior. Face recognition may also be performed on the plurality of target images to obtain a plurality of face images, and expression recognition may be performed on the face images; different face images may correspond to different expressions, so a plurality of expressions is obtained, and the expression that occurs most frequently among them is taken as the target facial expression. In this way, the limb behavior can be analyzed from a series of video images, and the facial expression of the user can be accurately identified.
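As a rough illustration of steps 21 to 25, the sketch below parses a clip into frames, keeps the segmented person regions, recognizes behavior over the sequence, and takes the most frequent expression as the target facial expression. The functions detect_person_region, behavior_model, and expression_model are hypothetical stand-ins for whatever segmentation and neural network models are actually deployed.

```python
# Minimal sketch of steps 21-25: parse frames, segment person regions,
# recognize limb behavior, and majority-vote the facial expression.
from collections import Counter
import cv2  # OpenCV, used here only to read frames from the video clip

def analyze_video_clip(path, detect_person_region, behavior_model, expression_model):
    # Step 21: parse the video clip into frames.
    frames = []
    cap = cv2.VideoCapture(path)
    ok, frame = cap.read()
    while ok:
        frames.append(frame)
        ok, frame = cap.read()
    cap.release()

    # Step 22: keep only the frames in which a human body was segmented.
    target_images = [img for img in (detect_person_region(f) for f in frames) if img is not None]

    # Step 23: behavior recognition over the sequence of target images.
    target_limb_behaviors = behavior_model(target_images)

    # Steps 24-25: per-image expression recognition, then take the most frequent expression.
    expressions = [expression_model(img) for img in target_images]
    target_facial_expression = Counter(expressions).most_common(1)[0][0] if expressions else None

    return target_limb_behaviors, target_facial_expression
```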
103. And extracting parameters of the audio clip to obtain target voice characteristic parameters of the target user.
In this embodiment, the voice characteristic parameters may include at least one of the following: speed of speech, intonation, keywords, timbre, and the like, which is not limited herein; the timbre is associated with the frequency of the voice. Some information contained in the speech may indirectly reflect the user's mood: for example, if the user says "beautiful", it may indicate that the user's mood is one of liking or happiness; for another example, if the user says "embarrassing", it may indicate that the user is embarrassed.
Optionally, the target speech feature parameters include at least one of: target speed of speech, target intonation, at least one target keyword; in step 103, extracting parameters of the audio segment to obtain the target speech feature parameters of the target user, the method may include the following steps:
31. performing semantic analysis on the audio clip to obtain a plurality of characters;
32. determining pronunciation duration corresponding to the characters, and determining the speed of speech according to the characters and the pronunciation duration;
or,
33. performing character segmentation on the characters to obtain a plurality of keywords;
34. matching the plurality of keywords with a preset keyword set to obtain at least one target keyword;
or,
35. extracting a waveform image of the voice segment;
36. and analyzing the oscillogram to obtain the target intonation.
In a specific implementation, semantic analysis may be performed on the audio clip to obtain a plurality of characters, each corresponding to a pronunciation duration; the speed of speech can then be determined from the plurality of characters and their total pronunciation duration. Further, character segmentation may be performed on the plurality of characters to obtain a plurality of keywords, each of which can be understood as a word. In this embodiment, a preset keyword set containing at least one keyword may be stored in advance, and the segmented keywords are matched against the preset keyword set to obtain at least one successfully matched target keyword. For the intonation, the waveform of the voice segment may be extracted and analyzed to obtain the target intonation, for example by taking the average amplitude as the target intonation. In this way, characteristic parameters of multiple dimensions can be obtained by analyzing the voice segment.
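The sketch below shows one plausible realization of steps 31 to 36; the transcription helper, the keyword set, and the use of average absolute amplitude as the intonation cue are assumptions consistent with, but not mandated by, the description above.

```python
# Minimal sketch of steps 31-36: speed of speech, keyword matching, and intonation from the waveform.
import numpy as np

PRESET_KEYWORDS = frozenset({"beautiful", "embarrassing"})  # illustrative keyword set

def extract_speech_features(audio_samples, sample_rate, transcribe_with_durations,
                            preset_keywords=PRESET_KEYWORDS):
    # Steps 31-32: characters and their pronunciation durations -> speed of speech.
    chars, durations = transcribe_with_durations(audio_samples, sample_rate)
    total_duration = sum(durations)
    speed_of_speech = len(chars) / total_duration if total_duration > 0 else 0.0  # characters per second

    # Steps 33-34: segment into words and match against the preset keyword set.
    words = "".join(chars).split()  # naive segmentation; a real system would use a tokenizer
    target_keywords = [w for w in words if w in preset_keywords]

    # Steps 35-36: analyze the waveform, here using average absolute amplitude as the intonation value.
    target_intonation = float(np.mean(np.abs(audio_samples)))

    return speed_of_speech, target_keywords, target_intonation
```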
104. And deciding the target emotion of the target user according to the target limb behaviors, the facial expression and the voice characteristic parameters.
The limb behavior, the facial expression, and the voice characteristic parameters all reflect the user's emotion to a certain extent, so a decision can be made across these three dimensions to obtain the final target emotion.
Optionally, in the step 104, deciding the target emotion of the target user according to the target limb behavior, the target facial expression and the target voice feature parameter may include the following steps:
41. determining a first emotion set corresponding to the target limb behavior, wherein the first emotion set comprises at least one emotion, and each emotion corresponds to a first emotion probability value;
42. determining a second emotion set corresponding to the target facial expression, wherein the second emotion set comprises at least one emotion, and each emotion corresponds to a second emotion probability value;
43. determining a third emotion set corresponding to the target voice characteristic parameter, wherein the third emotion set comprises at least one emotion, and each emotion corresponds to a third emotion probability value;
44. acquiring a first weight corresponding to the body behavior, a second weight corresponding to the facial expression and a third weight corresponding to the voice characteristic parameter;
45. determining score values of each type of emotion according to the first weight, the second weight, the third weight, the first emotion set, the second emotion set and the third emotion set to obtain a plurality of score values;
46. and selecting a maximum value from the plurality of score values, and taking the emotion corresponding to the maximum value as the target emotion.
As shown in fig. 1B, each body behavior may correspond to at least one emotion, each facial expression may correspond to at least one emotion, each voice feature parameter may also correspond to at least one emotion, each emotion corresponds to an emotion probability value, and a final target emotion may be determined according to all the emotions and the emotion probability values.
Specifically, different limb behaviors may correspond to different emotions, so the emotion recognition apparatus may pre-store a mapping relationship between limb behaviors and emotions and use it to determine at least one emotion corresponding to the target limb behavior; this at least one emotion forms the first emotion set, and each emotion in it corresponds to a first emotion probability value, which may be preset or set by system default. Likewise, different facial expressions may correspond to different emotions, so the apparatus may pre-store a mapping relationship between facial expressions and emotions and use it to determine at least one emotion corresponding to the target facial expression; these emotions form the second emotion set, and each corresponds to a second emotion probability value, which may be preset or set by system default. In addition, each voice characteristic parameter may also correspond to different emotions, so the apparatus may pre-store a mapping relationship between voice characteristic parameters and emotions and use it to determine at least one emotion corresponding to the target voice characteristic parameter; these emotions form the third emotion set, and each corresponds to a third emotion probability value, which may be preset or set by system default. Further, a first weight corresponding to the limb behavior, a second weight corresponding to the facial expression, and a third weight corresponding to the voice characteristic parameter may be obtained; these weights may be preset, and the first weight + the second weight + the third weight = 1. The score value of each type of emotion can then be determined from the weights and the emotion probability values, for example by accumulating weight × probability over the three dimensions, so that a score value is obtained for each emotion. The maximum value is selected from these score values, and the emotion corresponding to the maximum value is taken as the target emotion. In this way, the final emotion is determined from the three dimensions of limb behavior, facial expression, and voice; since emotion recognition based on a single dimension has a large error, combining multiple dimensions weakens the error introduced by any single dimension and improves emotion recognition accuracy.
The first, second, and third emotion probability values can be obtained from big data: specifically, the limb behaviors, facial expressions, and voices of a large number of users are collected, and the probability values of the emotions corresponding to each limb behavior, facial expression, and voice characteristic are analyzed.
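A minimal sketch of the weighted three-dimension decision in steps 41 to 46 follows; the weights, emotion sets, and probability values are illustrative only.

```python
# Minimal sketch of steps 41-46: combine the three emotion sets with per-dimension weights
# and pick the emotion with the maximum score.
def decide_target_emotion(first_set, second_set, third_set, w1=0.3, w2=0.4, w3=0.3):
    """Each *_set maps emotion name -> probability value for one dimension
    (limb behavior, facial expression, voice). The three weights sum to 1."""
    scores = {}
    for weight, emotion_set in ((w1, first_set), (w2, second_set), (w3, third_set)):
        for emotion, prob in emotion_set.items():
            scores[emotion] = scores.get(emotion, 0.0) + weight * prob
    # Step 46: the emotion with the maximum score value is the target emotion.
    return max(scores, key=scores.get)

# Usage example with made-up probability values:
target = decide_target_emotion(
    first_set={"tension": 0.6, "anger": 0.4},
    second_set={"tension": 0.5, "surprise": 0.5},
    third_set={"anger": 0.7, "tension": 0.3},
)
```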
Further optionally, the target limb behavior comprises at least one limb behavior; in step 23, the behavior recognition is performed according to the target images to obtain the target limb behaviors, which may be implemented as follows:
inputting the target images into a preset neural network model to obtain at least one limb behavior and the identification probability corresponding to each limb behavior;
then, in step 41, determining a first emotion set corresponding to the target limb behavior, where the first emotion set includes at least one emotion, and each emotion corresponds to a first emotion probability value, which may include the following steps:
411. determining the emotion corresponding to each limb behavior in the at least one limb behavior according to a preset mapping relation between the limb behavior and the emotion to obtain at least one emotion, wherein each emotion corresponds to a preset probability value;
412. and calculating according to the recognition probability corresponding to each body behavior and a preset probability value corresponding to each emotion in the at least one emotion to obtain at least one first emotion probability value, wherein each emotion corresponds to one first emotion probability value.
In a specific implementation, the plurality of target images can be input into a preset neural network model, which performs behavior recognition on each target image, so that at least one limb behavior is obtained and each limb behavior corresponds to a recognition probability. The emotion recognition apparatus may pre-store a mapping relationship between preset limb behaviors and emotions, and determine, according to the mapping relationship, the emotion corresponding to each of the at least one limb behavior to obtain at least one emotion, where each emotion corresponds to a preset probability value. A calculation is then performed based on the recognition probability corresponding to each limb behavior and the preset probability value corresponding to each emotion to obtain at least one first emotion probability value; that is, the first emotion probability value is the product of the recognition probability and the preset probability value. In this way, the final emotion probability value can be determined from both the recognition confidence of the limb behavior and the probability of the emotion, which helps analyze the user's emotion accurately.
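For illustration, a small sketch of steps 411 and 412 follows; the behavior-to-emotion mapping table and the probability values in it are assumptions.

```python
# Minimal sketch of steps 411-412: map recognized limb behaviors to emotions and
# multiply each emotion's preset probability value by the behavior's recognition probability.
BEHAVIOR_TO_EMOTION = {
    "typing": ("tension", 0.4),        # (emotion, preset probability value)
    "hands_on_hips": ("anger", 0.7),
    "running": ("excitement", 0.6),
}

def first_emotion_probabilities(recognized_behaviors):
    """recognized_behaviors: list of (behavior, recognition_probability) pairs
    returned by the preset neural network model."""
    first_set = {}
    for behavior, rec_prob in recognized_behaviors:
        if behavior in BEHAVIOR_TO_EMOTION:
            emotion, preset_prob = BEHAVIOR_TO_EMOTION[behavior]
            # First emotion probability value = recognition probability x preset probability value.
            first_set[emotion] = max(first_set.get(emotion, 0.0), rec_prob * preset_prob)
    return first_set
```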
Of course, in the above steps 42 to 43, the face image or the voice may be input into a neural network model, and the corresponding emotion is obtained through the neural network model, and the specific idea may refer to the above steps 411 to 412, which is not described again.
For example, the emotion recognition apparatus may capture information about a test subject through a camera, including a snapshot taken by a snapshot camera and a video containing the test subject over a certain period. The snapshot is used to identify the facial expression information of the test subject, while the video content can be used to identify the subject's limb behavior and can also be used to verify the subject's facial expression. Specifically, the face information captured by the snapshot camera may be analyzed, and the limb behavior of the test subject within the video clip may be analyzed; a deep-learning pre-trained model may then take the facial expression information and the limb behavior as inputs simultaneously and output a multi-class emotion probability, which serves as a preliminary emotion judgment of the test subject. Of course, the emotion categories may be determined according to the specific situation, for example happiness, excitement, tension, anger, and the like, which is not limited herein. In addition, the emotion recognition apparatus may use a recording device (e.g., a microphone) to obtain an audio clip of the test subject over the same period; the audio content is used to recognize the subject's speech, speed of speech, and intonation, which are then input into a deep-learning pre-trained model that outputs a multi-class emotion probability, giving a preliminary emotion judgment based on the subject's audio. Finally, the three preliminary multi-class emotion probability results are combined, and a final integrated emotion judgment is output based on a deep learning model.
Further optionally, after the step 104, the following steps may be further included:
a1, performing identity authentication on the target user;
a2, obtaining an indoor map after the target user passes the identity verification;
a3, marking the position of the target user in the indoor map to obtain a target area where the target user is located;
a4, determining a target control parameter corresponding to the target emotion and used for controlling at least one piece of intelligent home equipment corresponding to the target area according to a mapping relation between preset emotion and the control parameter;
a5, adjusting the at least one smart home device according to the target control parameter.
The emotion recognition apparatus may pre-store a mapping relationship between preset emotions and control parameters. The smart home device may be at least one of the following: a smart air conditioner, a smart humidifier, a smart speaker, a smart lamp, a smart curtain, a smart massage chair, a smart television, and the like, which is not limited herein. In a specific implementation, the identity of the target user can be verified; after the identity verification is passed, an indoor map of the current scene can be obtained, and the position of the target user is marked in the indoor map to obtain the target area where the target user is located. According to the pre-stored mapping relationship between preset emotions and control parameters, the target control parameter corresponding to the target emotion and used for controlling at least one smart home device corresponding to the target area can be determined, where the control parameter may be at least one of the following: a temperature adjustment parameter, a humidity adjustment parameter, a speaker playback parameter (e.g., volume, song, sound effect), a brightness or color temperature adjustment parameter, a curtain control parameter (degree of closure), a massage chair control parameter (massage mode, massage duration), or a television control parameter (volume, curvature, brightness, color temperature, etc.). The at least one smart home device is then adjusted according to the target control parameter. In this way, different environments can be configured for different emotions, which helps ease the user's mood and improves the user experience.
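The sketch below illustrates steps A4 and A5 under an assumed emotion-to-parameter mapping and a hypothetical device interface; none of the parameter names or values are prescribed by the application.

```python
# Minimal sketch of steps A4-A5: look up control parameters for the target emotion and
# apply them to the smart home devices in the target area.
EMOTION_TO_CONTROL = {
    "tension": {"light_color_temp_k": 2700, "speaker_playlist": "calm", "ac_temp_c": 24},
    "irritability": {"light_brightness": 0.4, "speaker_volume": 0.3, "curtain_closure": 0.8},
    "happiness": {"speaker_playlist": "upbeat", "light_brightness": 0.9},
}

def adjust_smart_home(target_emotion, devices_in_area):
    """devices_in_area: iterable of device objects exposing supports(name) and
    set_parameter(name, value) methods (a hypothetical interface)."""
    params = EMOTION_TO_CONTROL.get(target_emotion, {})
    for device in devices_in_area:
        for name, value in params.items():
            if device.supports(name):
                device.set_parameter(name, value)
```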
Further optionally, the step a1, performing identity verification on the target user, may include the following steps:
a11, acquiring a first face image of the target user;
a12, carrying out image quality evaluation on the first face image to obtain a target image quality evaluation value;
a13, determining a target matching threshold corresponding to the target image quality evaluation value according to a preset mapping relation between the image quality evaluation value and the matching threshold;
a14, extracting the contour of the first face image to obtain a first peripheral contour;
a15, extracting characteristic points of the first face image to obtain a first characteristic point set;
a16, matching the first peripheral outline with a second peripheral outline of the preset face template to obtain a first matching value;
a17, matching the first feature point set with a second feature point set of the preset face template to obtain a second matching value;
a18, determining a target matching value according to the first matching value and the second matching value;
a19, confirming that the target user identity verification is passed when the target matching value is larger than the target matching threshold value.
The emotion recognition apparatus may pre-store a preset face template. In face recognition, success or failure depends to a great extent on the quality of the face image; therefore, in this embodiment of the application, a dynamic matching threshold may be used: if the image quality is good, the matching threshold can be raised, and if the quality is poor, the matching threshold can be lowered, since in a dark environment the quality of a captured image is not necessarily good and the matching threshold can be adjusted accordingly. The emotion recognition apparatus may also store a mapping relationship between preset image quality evaluation values and matching thresholds, and determine, according to this mapping relationship, the target matching threshold corresponding to the target image quality evaluation value. On this basis, contour extraction is performed on the first face image to obtain a first peripheral contour, and feature point extraction is performed on the first face image to obtain a first feature point set. The first peripheral contour is matched with a second peripheral contour of the preset face template to obtain a first matching value, and the first feature point set is matched with a second feature point set of the preset face template to obtain a second matching value. The target matching value is then determined from the first matching value and the second matching value; for example, a mapping relationship between environmental parameters and weight-coefficient pairs may be stored in advance, a first weight coefficient corresponding to the first matching value and a second weight coefficient corresponding to the second matching value are obtained from it, and the target matching value is computed as the weighted combination of the two matching values. Finally, when the target matching value is greater than the target matching threshold, it is confirmed that the first face image successfully matches the preset face template; otherwise, face recognition is deemed to have failed. In this way, the face matching process is adjusted dynamically, which helps improve face recognition efficiency in specific environments.
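A minimal sketch of this verification flow (steps A11 to A19) with a dynamic matching threshold is given below; the quality-to-threshold table, the weight coefficients, and the matching helpers are assumptions rather than the application's prescribed implementation.

```python
# Minimal sketch of identity verification with a dynamic matching threshold (steps A11-A19).
def pick_matching_threshold(quality_score):
    # Higher image quality -> stricter (higher) matching threshold; values are illustrative.
    if quality_score >= 0.8:
        return 0.85
    if quality_score >= 0.5:
        return 0.75
    return 0.65

def verify_identity(face_image, template, quality_eval, extract_contour, extract_keypoints,
                    match_contours, match_keypoints, w_contour=0.4, w_points=0.6):
    quality = quality_eval(face_image)            # target image quality evaluation value
    threshold = pick_matching_threshold(quality)  # target matching threshold

    first_match = match_contours(extract_contour(face_image), template.contour)
    second_match = match_keypoints(extract_keypoints(face_image), template.keypoints)

    # Target matching value as a weighted combination of the two matching values.
    target_match = w_contour * first_match + w_points * second_match
    return target_match > threshold
```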
In addition, the contour extraction algorithm may be at least one of the following: the Hough transform, the Canny operator, and the like; and the feature point extraction algorithm may be at least one of the following: Harris corner detection, the Scale-Invariant Feature Transform (SIFT), and the like, which is not limited herein.
Alternatively, in the step a12, the image quality evaluation is performed on the first face image to obtain the target image quality evaluation value, and the method may be implemented as follows:
and carrying out image quality evaluation on the first face image by adopting at least one image quality evaluation index to obtain a target image quality evaluation value.
The image quality evaluation index may include, but is not limited to: mean gray scale, mean square error, entropy, edge preservation, signal-to-noise ratio, and the like. It can be defined that the larger the resulting image quality evaluation value is, the better the image quality is.
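As an illustration only, the sketch below combines a few such indices into a single evaluation value in which larger means better; the specific indices, normalizations, and weights are assumptions.

```python
# Minimal sketch of an image quality evaluation combining mean gray level, contrast, and entropy.
import numpy as np

def image_quality_score(gray_image: np.ndarray) -> float:
    """gray_image: 2-D uint8 array. Returns a score where a larger value means better quality."""
    img = gray_image.astype(np.float64)

    mean_gray = img.mean() / 255.0          # brightness, normalized to [0, 1]
    contrast = min(img.std() / 128.0, 1.0)  # spread of gray levels, capped at 1

    # Shannon entropy of the gray-level histogram, normalized by the 8-bit maximum.
    hist, _ = np.histogram(img, bins=256, range=(0, 256))
    p = hist[hist > 0] / hist.sum()
    entropy = float(-(p * np.log2(p)).sum()) / 8.0

    # Simple weighted combination; a deployed system would calibrate these weights.
    return 0.3 * mean_gray + 0.3 * contrast + 0.4 * entropy
```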
According to the emotion recognition method described above, a video clip and an audio clip of the target user within the specified time period are acquired, the video clip is analyzed to obtain the target limb behavior and target facial expression of the target user, parameter extraction is performed on the audio clip to obtain the target voice characteristic parameters of the target user, and the target emotion of the target user is decided according to the target limb behavior, the facial expression, and the voice characteristic parameters; since each of these three dimensions reflects the user's emotion to a certain extent, deciding the emotion jointly from all three enables the user's emotion to be recognized accurately.
In accordance with the above, please refer to fig. 2, which is a flowchart illustrating an embodiment of an emotion recognition method according to an embodiment of the present application. The emotion recognition method described in this embodiment includes the following steps:
201. video segments and audio segments for a target user are obtained for a specified time period.
202. And analyzing the video segments to obtain the target limb behaviors and the target facial expressions of the target users.
203. And extracting parameters of the audio clip to obtain target voice characteristic parameters of the target user.
204. And deciding the target emotion of the target user according to the target limb behaviors, the facial expression and the voice characteristic parameters.
205. And performing identity verification on the target user.
206. And acquiring an indoor map after the target user passes the identity verification.
207. And marking the position of the target user in the indoor map to obtain a target area where the target user is located.
208. And determining a target control parameter corresponding to the target emotion and used for controlling at least one piece of intelligent household equipment corresponding to the target area according to a mapping relation between preset emotion and the control parameter.
209. And adjusting the at least one piece of intelligent household equipment according to the target control parameter.
The emotion recognition method described in the above steps 201-209 may refer to the corresponding steps of the emotion recognition method described in fig. 1A.
It can be seen that, with the emotion recognition method of this embodiment of the application, a video clip and an audio clip of a target user within a specified time period are acquired, the video clip is analyzed to obtain the target limb behavior and target facial expression of the target user, parameter extraction is performed on the audio clip to obtain the target voice characteristic parameters of the target user, and the target emotion of the target user is decided according to the target limb behavior, the facial expression, and the voice characteristic parameters. The target user is then authenticated; after the authentication is passed, an indoor map is obtained, the position of the target user is marked in the indoor map to obtain the target area where the target user is located, the target control parameter corresponding to the target emotion and used for controlling at least one smart home device corresponding to the target area is determined according to the mapping relationship between preset emotions and control parameters, and the at least one smart home device is adjusted according to the target control parameter. In this way, limb behavior, facial expression, and voice features are parsed from the video; each of these three dimensions reflects the user's emotion to a certain extent, so deciding the emotion jointly from the three dimensions allows the user's emotion to be recognized accurately, and adjusting the smart home devices in the environment according to the emotion also helps ease the user's mood and improves the user experience.
In accordance with the above, the following is a device for implementing the emotion recognition method, specifically as follows:
please refer to fig. 3, which is a schematic structural diagram of an embodiment of an emotion recognition apparatus according to an embodiment of the present application. The emotion recognition apparatus described in this embodiment includes: the obtaining unit 301, the analyzing unit 302, the extracting unit and the deciding unit 304 are specifically as follows:
an acquisition unit 301 configured to acquire a video clip and an audio clip for a target user in a specified time period;
an analysis unit 302, configured to analyze the video segment to obtain a target limb behavior and a target facial expression of the target user;
an extracting unit 303, configured to perform parameter extraction on the audio segment to obtain a target speech feature parameter of the target user;
a decision unit 304, configured to decide a target emotion of the target user according to the target limb behavior, the facial expression, and the voice feature parameter.
The obtaining unit 301 may be configured to implement the method described in the step 101, the analyzing unit 302 may be configured to implement the method described in the step 102, the extracting unit 303 may be configured to implement the method described in the step 103, the deciding unit 304 may be configured to implement the method described in the step 104, and so on.
With the emotion recognition apparatus described above, a video clip and an audio clip of a target user within a specified time period are acquired, the video clip is analyzed to obtain the target limb behavior and target facial expression of the target user, parameter extraction is performed on the audio clip to obtain the target voice characteristic parameters of the target user, and the target emotion of the target user is decided according to the target limb behavior, the facial expression, and the voice characteristic parameters. In this way, limb behavior, facial expression, and voice features are parsed from the video; each of these three dimensions reflects the user's emotion to a certain extent, and deciding the emotion jointly from the three dimensions enables the user's emotion to be recognized accurately.
In a possible example, in the analyzing the video segment to obtain the body behavior and the facial expression of the target user, the analyzing unit 302 is specifically configured to:
analyzing the video clips to obtain multi-frame video images;
carrying out image segmentation on the multi-frame video image to obtain a plurality of target images, wherein each target image is a human body image of the target user;
performing behavior recognition according to the target images to obtain the target limb behaviors;
carrying out face recognition on the target images to obtain a plurality of face images;
and performing expression recognition on the plurality of facial images to obtain a plurality of expressions, and taking the expression with the largest occurrence frequency in the plurality of expressions as the target facial expression.
In one possible example, in the aspect of deciding the target emotion of the target user according to the target limb behavior, the target facial expression and the target voice feature parameter, the decision unit 304 is specifically configured to:
determining a first emotion set corresponding to the target limb behavior, wherein the first emotion set comprises at least one emotion, and each emotion corresponds to a first emotion probability value;
determining a second emotion set corresponding to the target facial expression, wherein the second emotion set comprises at least one emotion, and each emotion corresponds to a second emotion probability value;
determining a third emotion set corresponding to the target voice characteristic parameter, wherein the third emotion set comprises at least one emotion, and each emotion corresponds to a third emotion probability value;
acquiring a first weight corresponding to the body behavior, a second weight corresponding to the facial expression and a third weight corresponding to the voice characteristic parameter;
determining score values of each type of emotion according to the first weight, the second weight, the third weight, the first emotion set, the second emotion set and the third emotion set to obtain a plurality of score values;
and selecting a maximum value from the plurality of score values, and taking the emotion corresponding to the maximum value as the target emotion.
In one possible example, the target limb behavior comprises at least one limb behavior;
in the aspect of performing behavior recognition according to the target images to obtain the target limb behaviors, the analysis unit 302 is specifically configured to:
inputting the target images into a preset neural network model to obtain at least one limb behavior and the identification probability corresponding to each limb behavior;
in the determining of the first emotion set corresponding to the target limb behavior, where the first emotion set includes at least one emotion, and each emotion corresponds to a first emotion probability value, the decision unit is specifically configured to:
determining the emotion corresponding to each limb behavior in the at least one limb behavior according to a preset mapping relation between the limb behavior and the emotion to obtain at least one emotion, wherein each emotion corresponds to a preset probability value;
and calculating according to the recognition probability corresponding to each body behavior and a preset probability value corresponding to each emotion in the at least one emotion to obtain at least one first emotion probability value, wherein each emotion corresponds to one first emotion probability value.
In one possible example, the target limb behavior comprises at least one limb behavior;
in the aspect of performing behavior recognition according to the target images to obtain the target limb behaviors, the analysis unit 302 is specifically configured to:
inputting the target images into a preset neural network model to obtain at least one limb behavior and the identification probability corresponding to each limb behavior;
the determining a first emotion set corresponding to the target limb behavior, the first emotion set comprising at least one emotion, each emotion corresponding to a first emotion probability value, includes:
determining the emotion corresponding to each limb behavior in the at least one limb behavior according to a preset mapping relation between the limb behavior and the emotion to obtain at least one emotion, wherein each emotion corresponds to a preset probability value;
and calculating according to the recognition probability corresponding to each body behavior and a preset probability value corresponding to each emotion in the at least one emotion to obtain at least one first emotion probability value, wherein each emotion corresponds to one first emotion probability value.
In one possible example, the target speech feature parameters include at least one of: target speed of speech, target intonation, at least one target keyword;
in the aspect of extracting parameters of the audio segment to obtain the target speech feature parameters of the target user, the extracting unit 303 is specifically configured to:
performing semantic analysis on the audio clip to obtain a plurality of characters;
determining pronunciation duration corresponding to the characters, and determining the speed of speech according to the characters and the pronunciation duration;
or,
performing character segmentation on the characters to obtain a plurality of keywords;
matching the plurality of keywords with a preset keyword set to obtain at least one target keyword;
or,
extracting a waveform image of the voice segment;
and analyzing the oscillogram to obtain the target intonation.
It can be understood that the functions of each program module of the emotion recognition apparatus in this embodiment may be specifically implemented according to the method in the above method embodiment, and the specific implementation process may refer to the relevant description of the above method embodiment, which is not described herein again.
In accordance with the above, please refer to fig. 4, which is a schematic structural diagram of an embodiment of an emotion recognition apparatus provided in an embodiment of the present application. The emotion recognition apparatus described in this embodiment includes: at least one input device 1000; at least one output device 2000; at least one processor 3000, e.g., a CPU; and a memory 4000, the input device 1000, the output device 2000, the processor 3000, and the memory 4000 being connected by a bus 5000.
The input device 1000 may be a touch panel, a physical button, or a mouse.
The output device 2000 may be a display screen.
The memory 4000 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory). The memory 4000 is used for storing a set of program codes, and the input device 1000, the output device 2000 and the processor 3000 are used for calling the program codes stored in the memory 4000 to execute the following operations:
the processor 3000 is configured to:
acquiring a video clip and an audio clip aiming at a target user in a specified time period;
analyzing the video segments to obtain target limb behaviors and target facial expressions of the target user;
extracting parameters of the audio clip to obtain target voice characteristic parameters of the target user;
and deciding the target emotion of the target user according to the target limb behaviors, the facial expression and the voice characteristic parameters.
Therefore, with the emotion recognition apparatus provided by this embodiment of the application, a video clip and an audio clip of the target user within a specified time period are acquired, the video clip is analyzed to obtain the target limb behavior and target facial expression of the target user, parameter extraction is performed on the audio clip to obtain the target voice characteristic parameters of the target user, and the target emotion of the target user is decided according to the target limb behavior, the target facial expression, and the voice characteristic parameters; deciding the emotion jointly from these three dimensions enables the user's emotion to be recognized accurately.
In one possible example, in analyzing the video segment to obtain the body behavior and the facial expression of the target user, the processor 3000 is specifically configured to:
analyzing the video clips to obtain multi-frame video images;
carrying out image segmentation on the multi-frame video image to obtain a plurality of target images, wherein each target image is a human body image of the target user;
performing behavior recognition according to the target images to obtain the target limb behaviors;
carrying out face recognition on the target images to obtain a plurality of face images;
and performing expression recognition on the plurality of facial images to obtain a plurality of expressions, and taking the expression with the largest occurrence frequency in the plurality of expressions as the target facial expression.
In one possible example, in said deciding the target emotion of the target user according to the target limb behavior, the target facial expression and the target voice feature parameter, the processor 3000 is specifically configured to:
determining a first emotion set corresponding to the target limb behavior, wherein the first emotion set comprises at least one emotion, and each emotion corresponds to a first emotion probability value;
determining a second emotion set corresponding to the target facial expression, wherein the second emotion set comprises at least one emotion, and each emotion corresponds to a second emotion probability value;
determining a third emotion set corresponding to the target voice characteristic parameter, wherein the third emotion set comprises at least one emotion, and each emotion corresponds to a third emotion probability value;
acquiring a first weight corresponding to the body behavior, a second weight corresponding to the facial expression and a third weight corresponding to the voice characteristic parameter;
determining score values of each type of emotion according to the first weight, the second weight, the third weight, the first emotion set, the second emotion set and the third emotion set to obtain a plurality of score values;
and selecting a maximum value from the plurality of score values, and taking the emotion corresponding to the maximum value as the target emotion.
In one possible example, the target limb behavior comprises at least one limb behavior;
in the aspect of performing behavior recognition according to the target images to obtain the target limb behaviors, the processor 3000 is specifically configured to:
inputting the target images into a preset neural network model to obtain at least one limb behavior and the identification probability corresponding to each limb behavior;
in the determining the first emotion set corresponding to the target limb behavior, where the first emotion set includes at least one emotion, and each emotion corresponds to a first emotion probability value, the processor 3000 is specifically configured to:
determining the emotion corresponding to each limb behavior in the at least one limb behavior according to a preset mapping relation between the limb behavior and the emotion to obtain at least one emotion, wherein each emotion corresponds to a preset probability value;
and calculating according to the recognition probability corresponding to each body behavior and a preset probability value corresponding to each emotion in the at least one emotion to obtain at least one first emotion probability value, wherein each emotion corresponds to one first emotion probability value.
In one possible example, the target voice characteristic parameters include at least one of the following: a target speech rate, a target intonation, and at least one target keyword;
in the aspect of extracting parameters from the audio clip to obtain the target voice characteristic parameters of the target user, the processor 3000 is specifically configured to:
performing semantic analysis on the audio clip to obtain a plurality of characters;
determining a pronunciation duration corresponding to the plurality of characters, and determining the target speech rate according to the plurality of characters and the pronunciation duration;
or,
performing character segmentation on the characters to obtain a plurality of keywords;
matching the plurality of keywords with a preset keyword set to obtain at least one target keyword;
or,
extracting a waveform diagram of the audio clip;
and analyzing the waveform diagram to obtain the target intonation.
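A rough sketch of these three alternatives, assuming a transcript is already available from speech recognition and treating simple waveform statistics as a stand-in for intonation analysis; none of these choices (characters-per-second rate, whitespace word segmentation, energy and zero-crossing features) are fixed by the application itself:

```python
import numpy as np

def speech_rate(transcript: str, pronunciation_duration_s: float) -> float:
    """Target speech rate: number of recognized characters per second of speech."""
    return len(transcript) / pronunciation_duration_s if pronunciation_duration_s > 0 else 0.0

def match_keywords(transcript: str, preset_keywords: set) -> list:
    """Segment the transcript and keep only the words found in the preset keyword set."""
    return [word for word in transcript.split() if word in preset_keywords]

def intonation_features(samples: np.ndarray, sample_rate: int) -> dict:
    """Crude intonation proxy from the waveform: average energy and zero-crossing rate."""
    samples = samples.astype(np.float64)
    energy = float(np.mean(samples ** 2))
    zero_crossings = int(np.count_nonzero(np.diff(np.sign(samples))))
    return {"energy": energy,
            "zero_crossing_rate": zero_crossings / (len(samples) / sample_rate)}
```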
Embodiments of the present application also provide a computer storage medium, where the computer storage medium may store a program, and the program, when executed, performs some or all of the steps of any one of the emotion recognition methods described in the above method embodiments.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, an apparatus (device), or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, and optical storage) having computer-usable program code embodied therein. A computer program stored on or distributed via a suitable medium, supplied together with or as part of other hardware, may also take other distributed forms, such as via the Internet or other wired or wireless telecommunication systems.
Although the present application has been described in conjunction with specific features and embodiments thereof, it will be evident that various modifications and combinations can be made thereto without departing from the spirit and scope of the application. Accordingly, the specification and figures are merely exemplary of the present application as defined in the appended claims and are intended to cover any and all modifications, variations, combinations, or equivalents within the scope of the present application. It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (10)

1. A method of emotion recognition, comprising:
acquiring a video clip and an audio clip for a target user in a specified time period;
analyzing the video clip to obtain a target limb behavior and a target facial expression of the target user;
performing parameter extraction on the audio clip to obtain target voice characteristic parameters of the target user;
and deciding a target emotion of the target user according to the target limb behavior, the target facial expression and the target voice characteristic parameters.
2. The method of claim 1, wherein the analyzing the video clip to obtain the target limb behavior and the target facial expression of the target user comprises:
parsing the video clip to obtain multi-frame video images;
carrying out image segmentation on the multi-frame video images to obtain a plurality of target images, wherein each target image is a human body image of the target user;
performing behavior recognition on the plurality of target images to obtain the target limb behavior;
carrying out face recognition on the plurality of target images to obtain a plurality of facial images;
and performing expression recognition on the plurality of facial images to obtain a plurality of expressions, and taking the most frequently occurring expression among the plurality of expressions as the target facial expression.
3. The method of claim 1 or 2, wherein the deciding the target emotion of the target user according to the target limb behavior, the target facial expression and the target voice characteristic parameters comprises:
determining a first emotion set corresponding to the target limb behavior, wherein the first emotion set comprises at least one emotion, and each emotion corresponds to a first emotion probability value;
determining a second emotion set corresponding to the target facial expression, wherein the second emotion set comprises at least one emotion, and each emotion corresponds to a second emotion probability value;
determining a third emotion set corresponding to the target voice characteristic parameter, wherein the third emotion set comprises at least one emotion, and each emotion corresponds to a third emotion probability value;
acquiring a first weight corresponding to the target limb behavior, a second weight corresponding to the target facial expression and a third weight corresponding to the target voice characteristic parameters;
determining a score value for each emotion according to the first weight, the second weight, the third weight, the first emotion set, the second emotion set and the third emotion set to obtain a plurality of score values;
and selecting a maximum value from the plurality of score values, and taking the emotion corresponding to the maximum value as the target emotion.
4. The method of claim 3, wherein the target limb behavior comprises at least one limb behavior;
the performing behavior recognition on the plurality of target images to obtain the target limb behavior comprises:
inputting the plurality of target images into a preset neural network model to obtain the at least one limb behavior and a recognition probability corresponding to each limb behavior;
the determining a first emotion set corresponding to the target limb behavior, the first emotion set comprising at least one emotion, each emotion corresponding to a first emotion probability value, includes:
determining the emotion corresponding to each limb behavior in the at least one limb behavior according to a preset mapping relation between the limb behavior and the emotion to obtain at least one emotion, wherein each emotion corresponds to a preset probability value;
and calculating at least one first emotion probability value according to the recognition probability corresponding to each limb behavior and the preset probability value corresponding to each emotion in the at least one emotion, wherein each emotion corresponds to one first emotion probability value.
5. The method according to any one of claims 1-4, wherein the target voice characteristic parameters comprise at least one of the following: a target speech rate, a target intonation, and at least one target keyword;
the performing parameter extraction on the audio clip to obtain the target voice characteristic parameters of the target user comprises:
performing semantic analysis on the audio clip to obtain a plurality of characters;
determining a pronunciation duration corresponding to the plurality of characters, and determining the target speech rate according to the plurality of characters and the pronunciation duration;
or,
performing character segmentation on the plurality of characters to obtain a plurality of keywords;
matching the plurality of keywords with a preset keyword set to obtain the at least one target keyword;
or,
extracting a waveform diagram of the audio clip;
and analyzing the waveform diagram to obtain the target intonation.
6. An emotion recognition apparatus, comprising:
an acquisition unit, used for acquiring a video clip and an audio clip for a target user in a specified time period;
an analysis unit, used for analyzing the video clip to obtain a target limb behavior and a target facial expression of the target user;
an extraction unit, used for performing parameter extraction on the audio clip to obtain target voice characteristic parameters of the target user;
and a decision unit, used for deciding a target emotion of the target user according to the target limb behavior, the target facial expression and the target voice characteristic parameters.
7. The apparatus according to claim 6, wherein, in the aspect of analyzing the video clip to obtain the target limb behavior and the target facial expression of the target user, the analysis unit is specifically configured to:
parsing the video clip to obtain multi-frame video images;
carrying out image segmentation on the multi-frame video images to obtain a plurality of target images, wherein each target image is a human body image of the target user;
performing behavior recognition on the plurality of target images to obtain the target limb behavior;
carrying out face recognition on the plurality of target images to obtain a plurality of facial images;
and performing expression recognition on the plurality of facial images to obtain a plurality of expressions, and taking the most frequently occurring expression among the plurality of expressions as the target facial expression.
8. The apparatus according to claim 6 or 7, wherein, in the aspect of deciding the target emotion of the target user according to the target limb behavior, the target facial expression and the target voice characteristic parameters, the decision unit is specifically configured to:
determining a first emotion set corresponding to the target limb behavior, wherein the first emotion set comprises at least one emotion, and each emotion corresponds to a first emotion probability value;
determining a second emotion set corresponding to the target facial expression, wherein the second emotion set comprises at least one emotion, and each emotion corresponds to a second emotion probability value;
determining a third emotion set corresponding to the target voice characteristic parameter, wherein the third emotion set comprises at least one emotion, and each emotion corresponds to a third emotion probability value;
acquiring a first weight corresponding to the target limb behavior, a second weight corresponding to the target facial expression and a third weight corresponding to the target voice characteristic parameters;
determining a score value for each emotion according to the first weight, the second weight, the third weight, the first emotion set, the second emotion set and the third emotion set to obtain a plurality of score values;
and selecting a maximum value from the plurality of score values, and taking the emotion corresponding to the maximum value as the target emotion.
9. The apparatus of claim 8, wherein the target limb behavior comprises at least one limb behavior;
in the aspect of performing behavior recognition on the plurality of target images to obtain the target limb behavior, the analysis unit is specifically configured to:
inputting the plurality of target images into a preset neural network model to obtain the at least one limb behavior and a recognition probability corresponding to each limb behavior;
in the determining of the first emotion set corresponding to the target limb behavior, where the first emotion set includes at least one emotion, and each emotion corresponds to a first emotion probability value, the decision unit is specifically configured to:
determining the emotion corresponding to each limb behavior in the at least one limb behavior according to a preset mapping relation between the limb behavior and the emotion to obtain at least one emotion, wherein each emotion corresponds to a preset probability value;
and calculating at least one first emotion probability value according to the recognition probability corresponding to each limb behavior and the preset probability value corresponding to each emotion in the at least one emotion, wherein each emotion corresponds to one first emotion probability value.
10. A computer-readable storage medium storing a computer program for execution by a processor to implement the method of any one of claims 1-5.
CN201811519898.8A 2018-12-12 2018-12-12 Emotion identification method and Related product Pending CN109766759A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811519898.8A CN109766759A (en) 2018-12-12 2018-12-12 Emotion identification method and Related product

Publications (1)

Publication Number Publication Date
CN109766759A true CN109766759A (en) 2019-05-17

Family

ID=66450503

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811519898.8A Pending CN109766759A (en) 2018-12-12 2018-12-12 Emotion identification method and Related product

Country Status (1)

Country Link
CN (1) CN109766759A (en)

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105829995A (en) * 2013-10-22 2016-08-03 谷歌公司 Capturing media content in accordance with a viewer expression
KR20160049191A (en) * 2014-10-27 2016-05-09 조민권 Wearable device
CN106874265A (en) * 2015-12-10 2017-06-20 深圳新创客电子科技有限公司 A kind of content outputting method matched with user emotion, electronic equipment and server
CN107180179A (en) * 2017-04-28 2017-09-19 广东欧珀移动通信有限公司 Solve lock control method and Related product
CN107220591A (en) * 2017-04-28 2017-09-29 哈尔滨工业大学深圳研究生院 Multi-modal intelligent mood sensing system
CN107085380A (en) * 2017-06-13 2017-08-22 北京金茂绿建科技有限公司 A kind of intelligent domestic system customer location determination methods and electronic equipment
CN107679473A (en) * 2017-09-22 2018-02-09 广东欧珀移动通信有限公司 Solve lock control method and Related product
CN107679481A (en) * 2017-09-27 2018-02-09 广东欧珀移动通信有限公司 Solve lock control method and Related product
CN107862265A (en) * 2017-10-30 2018-03-30 广东欧珀移动通信有限公司 Image processing method and related product
CN107678291A (en) * 2017-10-31 2018-02-09 珠海格力电器股份有限公司 control method and device for indoor environment
CN107808146A (en) * 2017-11-17 2018-03-16 北京师范大学 A kind of multi-modal emotion recognition sorting technique
CN108764010A (en) * 2018-03-23 2018-11-06 姜涵予 Emotional state determines method and device
CN108960145A (en) * 2018-07-04 2018-12-07 北京蜂盒科技有限公司 Facial image detection method, device, storage medium and electronic equipment
CN108985212A (en) * 2018-07-06 2018-12-11 深圳市科脉技术股份有限公司 Face identification method and device

Cited By (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110175565A (en) * 2019-05-27 2019-08-27 北京字节跳动网络技术有限公司 The method and apparatus of personage's emotion for identification
CN110215218A (en) * 2019-06-11 2019-09-10 北京大学深圳医院 A kind of wisdom wearable device and its mood identification method based on big data mood identification model
CN110287912A (en) * 2019-06-28 2019-09-27 广东工业大学 Method, apparatus and medium are determined based on the target object affective state of deep learning
CN110442867A (en) * 2019-07-30 2019-11-12 腾讯科技(深圳)有限公司 Image processing method, device, terminal and computer storage medium
CN110517085A (en) * 2019-08-27 2019-11-29 新华网股份有限公司 It generates and shows method for reporting, electronic equipment and computer readable storage medium
CN110517085B (en) * 2019-08-27 2022-06-07 新华网股份有限公司 Display report generation method and device, electronic equipment and storage medium
CN111077780A (en) * 2019-12-23 2020-04-28 湖北理工学院 Intelligent window adjusting method and device based on neural network
CN110991427A (en) * 2019-12-25 2020-04-10 北京百度网讯科技有限公司 Emotion recognition method and device for video and computer equipment
CN113128284A (en) * 2019-12-31 2021-07-16 上海汽车集团股份有限公司 Multi-mode emotion recognition method and device
CN111311466A (en) * 2020-01-23 2020-06-19 深圳市大拿科技有限公司 Safety control method and device
CN111311466B (en) * 2020-01-23 2024-03-19 深圳市大拿科技有限公司 Safety control method and device
CN111476217A (en) * 2020-05-27 2020-07-31 上海乂学教育科技有限公司 Intelligent learning system and method based on emotion recognition
CN111860451A (en) * 2020-08-03 2020-10-30 宿州小马电子商务有限公司 Game interaction method based on facial expression recognition
CN111899038A (en) * 2020-08-11 2020-11-06 中国工商银行股份有限公司 5G network-based non-contact loan auxiliary auditing method and device
CN112220479A (en) * 2020-09-04 2021-01-15 陈婉婷 Genetic algorithm-based examined individual emotion judgment method, device and equipment
CN112329586A (en) * 2020-10-30 2021-02-05 中国平安人寿保险股份有限公司 Client return visit method and device based on emotion recognition and computer equipment
CN112596405A (en) * 2020-12-17 2021-04-02 深圳市创维软件有限公司 Control method, device and equipment of household appliance and computer readable storage medium
CN112596405B (en) * 2020-12-17 2024-06-04 深圳市创维软件有限公司 Control method, device, equipment and computer readable storage medium for household appliances
CN114681258A (en) * 2020-12-25 2022-07-01 深圳Tcl新技术有限公司 Method for adaptively adjusting massage mode and massage equipment
CN114681258B (en) * 2020-12-25 2024-04-30 深圳Tcl新技术有限公司 Method for adaptively adjusting massage mode and massage equipment
CN112733649A (en) * 2020-12-30 2021-04-30 平安科技(深圳)有限公司 Method for identifying user intention based on video image and related equipment
CN112733649B (en) * 2020-12-30 2023-06-20 平安科技(深圳)有限公司 Method and related equipment for identifying user intention based on video image
CN113766710A (en) * 2021-05-06 2021-12-07 深圳市杰理微电子科技有限公司 Intelligent desk lamp control method based on voice detection and related equipment
CN113766710B (en) * 2021-05-06 2023-12-01 深圳市杰理微电子科技有限公司 Intelligent desk lamp control method based on voice detection and related equipment
CN113221821A (en) * 2021-05-28 2021-08-06 中国工商银行股份有限公司 Business data pushing method and device and server
CN113506341A (en) * 2021-08-03 2021-10-15 深圳创维-Rgb电子有限公司 Auxiliary teaching method and system
CN113705467A (en) * 2021-08-30 2021-11-26 平安科技(深圳)有限公司 Temperature adjusting method and device based on image recognition, electronic equipment and medium
CN113705467B (en) * 2021-08-30 2024-05-07 平安科技(深圳)有限公司 Temperature adjusting method and device based on image recognition, electronic equipment and medium
CN113656635A (en) * 2021-09-03 2021-11-16 咪咕音乐有限公司 Video color ring back tone synthesis method, device, equipment and computer readable storage medium
CN113656635B (en) * 2021-09-03 2024-04-09 咪咕音乐有限公司 Video color ring synthesis method, device, equipment and computer readable storage medium
CN114119960A (en) * 2021-11-12 2022-03-01 深圳依时货拉拉科技有限公司 Character relation recognition method and device, computer equipment and storage medium
CN114095782A (en) * 2021-11-12 2022-02-25 广州博冠信息科技有限公司 Video processing method and device, computer equipment and storage medium
CN115131878A (en) * 2022-08-30 2022-09-30 深圳市心流科技有限公司 Method, device, terminal and storage medium for determining motion of intelligent artificial limb

Similar Documents

Publication Publication Date Title
CN109766759A (en) Emotion identification method and Related product
CN110246512B (en) Sound separation method, device and computer readable storage medium
CN112088402B (en) Federated neural network for speaker recognition
CN108009521B (en) Face image matching method, device, terminal and storage medium
CN107492379B (en) Voiceprint creating and registering method and device
US9430766B1 (en) Gift card recognition using a camera
US10275672B2 (en) Method and apparatus for authenticating liveness face, and computer program product thereof
WO2017185630A1 (en) Emotion recognition-based information recommendation method and apparatus, and electronic device
US6959099B2 (en) Method and apparatus for automatic face blurring
KR101617649B1 (en) Recommendation system and method for video interesting section
CN109271533A (en) A kind of multimedia document retrieval method
CN111191073A (en) Video and audio recognition method, device, storage medium and device
JP7412496B2 (en) Living body (liveness) detection verification method, living body detection verification system, recording medium, and training method for living body detection verification system
CN112309365A (en) Training method and device of speech synthesis model, storage medium and electronic equipment
KR20150107499A (en) Object recognition apparatus and control method thereof
CN113223560A (en) Emotion recognition method, device, equipment and storage medium
CN111613227A (en) Voiceprint data generation method and device, computer device and storage medium
CN111382655A (en) Hand-lifting behavior identification method and device and electronic equipment
CN105989000B (en) Audio-video copy detection method and device
CN110516426A (en) Identity identifying method, certification terminal, device and readable storage medium storing program for executing
CN111506183A (en) Intelligent terminal and user interaction method
CN103984415B (en) A kind of information processing method and electronic equipment
JP6855737B2 (en) Information processing equipment, evaluation systems and programs
CN109815359B (en) Image retrieval method and related product
CN110992930A (en) Voiceprint feature extraction method and device, terminal and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20190517)