CN115047824A - Digital twin multimodal device control method, storage medium, and electronic apparatus - Google Patents

Digital twin multimodal device control method, storage medium, and electronic apparatus

Info

Publication number: CN115047824A
Authority: CN (China)
Prior art keywords: emotion, target object, target, image, current
Prior art date
Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Application number: CN202210601439.4A
Other languages: Chinese (zh)
Inventors: 邓邱伟, 魏玉琼, 栾天祥, 王凯, 贾基东, 王迪, 张丽
Current Assignee (the listed assignees may be inaccurate): Qingdao Haier Technology Co Ltd; Qingdao Haier Intelligent Home Appliance Technology Co Ltd; Haier Smart Home Co Ltd
Original Assignee: Qingdao Haier Technology Co Ltd; Qingdao Haier Intelligent Home Appliance Technology Co Ltd; Haier Smart Home Co Ltd
Priority date (the priority date is an assumption and is not a legal conclusion)
Filing date
Publication date
Application filed by Qingdao Haier Technology Co Ltd, Qingdao Haier Intelligent Home Appliance Technology Co Ltd, and Haier Smart Home Co Ltd
Priority to CN202210601439.4A
Publication of CN115047824A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G05: CONTROLLING; REGULATING
    • G05B: CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B19/00: Programme-control systems
    • G05B19/02: Programme-control systems electric
    • G05B19/18: Numerical control [NC], i.e. automatically operating machines, in particular machine tools, e.g. in a manufacturing environment, so as to execute positioning, movement or co-ordinated operations by means of programme data in numerical form
    • G05B19/19: Numerical control [NC], i.e. automatically operating machines, in particular machine tools, e.g. in a manufacturing environment, so as to execute positioning, movement or co-ordinated operations by means of programme data in numerical form, characterised by positioning or contouring control systems, e.g. to control position from one programmed point to another or to control movement along a programmed continuous path
    • G: PHYSICS
    • G05: CONTROLLING; REGULATING
    • G05B: CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B2219/00: Program-control systems
    • G05B2219/30: Nc systems
    • G05B2219/35: Nc in input of data, input till input file format
    • G05B2219/35349: Display part, programmed locus and tool path, traject, dynamic locus

Abstract

The application discloses a digital twin multimodal device control method, a storage medium, and an electronic apparatus, and relates to the technical field of smart homes. The method includes the following steps: performing emotional state recognition on a target object based on a target object image and/or object voice data to obtain the current emotional state of the target object; acquiring the historical emotional state of the target object in a historical time period when the current emotional state belongs to a negative state; determining the target device operation to be executed by the target smart device according to the current emotional state and the historical emotional state; and controlling the target smart device to execute the target device operation. This technical scheme solves the problem in the related art that, when a device is controlled to execute a device operation based only on a recognized emotion category, the sustainability of the device operation is poor because feedback information from the user cannot be obtained in time.

Description

Digital twin multimodal device control method, storage medium, and electronic apparatus
Technical Field
The application relates to the technical field of smart homes, and in particular to a digital twin multimodal device control method, a storage medium, and an electronic apparatus.
Background
Currently, the smart device may perform emotion recognition according to a user voice or text, and perform a corresponding device operation based on the recognized emotion. However, after the device operation is executed, if the user does not make a voice or input a text, the smart device cannot obtain an operation effect of the device operation, that is, cannot obtain an effect of adjusting the emotion of the user by executing the device operation.
For example, when the user's emotion is recognized as sad, the user may be soothed by performing voice interaction or playing a song. Such soothing can only temporarily relieve the user's negative emotion, and if the user no longer makes a sound, the device cannot perceive whether the user's emotion has returned to calm.
Therefore, in the method for controlling the device to execute the device operation based on the recognized emotion category in the related art, the sustainability of the device operation is poor due to the fact that the feedback information of the user cannot be timely obtained.
Disclosure of Invention
The application aims to provide a digital twin multimodal device control method, a storage medium, and an electronic apparatus, so as to at least solve the problem in the related art that, in a method for controlling a device to execute a device operation based on a recognized emotion category, the sustainability of the device operation is poor because feedback information from the user cannot be obtained in time.
According to an aspect of an embodiment of the present application, there is provided a digital twin multimodal apparatus control method, including: performing emotion state recognition on a target object based on the target object image and/or the object voice data to obtain the current emotion state of the target object; under the condition that the current emotional state belongs to a negative state, acquiring a historical emotional state of the target object in a historical time period; determining target equipment operation to be executed by the target intelligent equipment according to the current emotional state and the historical emotional state; and controlling the target intelligent equipment to execute the target equipment operation.
According to another aspect of the embodiments of the application, a digital twin multimodal device control apparatus is provided, including a first recognition unit, a first acquisition unit, a determining unit, and a control unit. The first recognition unit is configured to perform emotional state recognition on a target object based on a target object image and/or object voice data to obtain the current emotional state of the target object; the first acquisition unit is configured to acquire the historical emotional state of the target object in a historical time period if the current emotional state belongs to a negative state; the determining unit is configured to determine the target device operation to be executed by the target smart device according to the current emotional state and the historical emotional state; and the control unit is configured to control the target smart device to execute the target device operation.
In one exemplary embodiment, the apparatus further comprises: a second obtaining unit, configured to obtain an image of a target object acquired by an image acquisition component before performing emotional state recognition on the target object based on an image of the target object and/or object voice data to obtain a current emotional state of the target object, where the image of the target object is an image of an object face including the target object; and the third acquisition unit is used for acquiring the object voice data acquired by the voice acquisition component, wherein the object voice data is the voice data sent by the target object.
In one exemplary embodiment, the apparatus further comprises: a second recognition unit configured to perform face region recognition on the target object image after acquiring the target object image acquired by the image acquisition unit, to obtain a set of face images; and the third identification unit is used for respectively carrying out object identification on each face image in the group of face images to obtain object information of a group of objects, wherein the group of objects comprises the target object.
In one exemplary embodiment, the apparatus further comprises: the adjusting unit is used for adjusting the image acquisition component from a closed state to an open state under the condition that the distance between the target object and the target intelligent equipment is detected to be smaller than or equal to a preset distance threshold before the emotion state of the target object is identified based on the image and/or the voice data of the target object to obtain the current emotion state of the target object; and the acquisition unit is used for acquiring the image of the target object through the image acquisition component under the condition that the target part of the target object in the image to be acquired of the image acquisition component meets a preset condition, so as to obtain the image of the target object.
In one exemplary embodiment, the acquisition unit includes: the first acquisition module is used for acquiring an image of the target object through the image acquisition component under the condition that an object part of the target object, which is positioned in the image to be acquired, comprises an object face and the ratio of the area of the object face to the area of the image to be acquired is greater than or equal to a target ratio to obtain an image of the target object; or, the second acquisition module is configured to acquire an image of the target object through the image acquisition component under the condition that an object part of the target object located in the image to be acquired includes an object face and an object hand, so as to obtain the target object image.
In one exemplary embodiment, the first recognition unit includes: a first identification module, configured to perform emotion category recognition on the object face image to obtain the current emotion category of the target object, where the current emotional state includes the current emotion category; a second identification module, configured to perform part action recognition on the part image of a preset part when the target object image also includes the part image of the preset part of the target object, to obtain a target part action; and a first determining module, configured to determine the emotional intensity matched with the target part action as the current emotional intensity of the target object, where the current emotional state further includes the current emotional intensity.
In one exemplary embodiment, the preset parts comprise a subject hand and a subject torso, and the target part motion comprises a target hand motion and a target body motion; the first determining module includes: a first determination submodule configured to determine a weighted sum of an emotional intensity matched to the target hand action and an emotional intensity matched to the target body action as the current emotional intensity of the target object.
In one exemplary embodiment, the first recognition unit includes: the third recognition module is used for respectively recognizing the emotional states of the target object based on the target object image and the object voice data to obtain a plurality of emotional states; and the fusion module is used for carrying out emotional state fusion on the multiple emotional states to obtain the current emotional state of the target object.
In one exemplary embodiment, the third identifying module comprises: a first identification submodule, configured to perform emotion state identification on a target face image of the target object, so as to obtain a first emotion state, where the first emotion state includes a first emotion category, and the target object image includes the target face image; and the second recognition submodule is used for carrying out emotion state recognition on the voice data of the object to obtain a second emotion state, wherein the second emotion state comprises a second emotion category.
In one exemplary embodiment, the fusion module includes: a second determining sub-module, configured to determine either of the first emotion category and the second emotion category as the current emotion category when the first emotion category and the second emotion category are consistent; and a third determining sub-module, configured to, when the first emotion category and the second emotion category are inconsistent, determine as the current emotion category whichever of the first emotion category and the second emotion category has the higher confidence, or the one of the first emotion category and the second emotion category that matches the part state of the preset part of the target object.
In an exemplary embodiment, the current emotional state further comprises a current emotional intensity of the target object; the fusion module further comprises: a fourth determining sub-module, configured to determine, as the current emotional intensity, the first emotional intensity in the second emotional state if the first emotional category and the second emotional category are consistent; or, a fifth determining sub-module, configured to determine, as the current emotional intensity, a weighted sum of a first emotional intensity and a second emotional intensity when the target object image further includes a part image of a preset part of the target object, where the first emotional intensity is an emotional intensity included in the second emotional state, and the second emotional intensity is an emotional intensity recognized from the part image of the preset part.
In an exemplary embodiment, the current emotional state further comprises a current emotional intensity of the target object; the fusion module further comprises: a sixth determining sub-module, configured to determine, as the current emotion intensity, an emotion intensity recognized from a part image of a preset part of the target object when the first emotion category and the second emotion category are inconsistent, the current emotion category is the first emotion category, and the target object image further includes the part image of the preset part of the target object; a seventh determining sub-module, configured to determine, as the current emotion intensity, the first emotion intensity in the second emotion state if the first emotion category and the second emotion category are not consistent and the current emotion category is the second emotion category; an eighth determining sub-module, configured to determine, as the current emotional intensity, a weighted sum of a first emotional intensity and a second emotional intensity when the first emotional category and the second emotional category are inconsistent, the current emotional category is the second emotional category, and the target object image further includes a location image of a preset location of the target object, where the first emotional intensity is an emotional intensity included in the second emotional state, and the second emotional intensity is an emotional intensity identified from the location image of the preset location.
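As an illustrative aid only, the fusion rules described in the preceding embodiments can be summarized in the following Python sketch; the function name, data structures, and default weights are assumptions made for readability and are not the patented implementation.

```python
# Hypothetical sketch of the emotion-state fusion rules described above.
# All names, structures, and weights are illustrative assumptions.

def fuse_emotional_states(face_state, voice_state, part_intensity=None,
                          part_state_matches=None, w_voice=0.5, w_part=0.5):
    """Fuse the face-based (first) and voice-based (second) emotional states.

    face_state:  dict with 'category' and 'confidence'
    voice_state: dict with 'category', 'confidence' and 'intensity'
    part_intensity: optional intensity recognized from the preset-part image
    part_state_matches: optional callable choosing the category that matches the part state
    """
    if face_state["category"] == voice_state["category"]:
        category = face_state["category"]
        # Intensity: the voice-derived intensity, or a weighted sum with the
        # part-derived intensity when a preset-part image is available.
        if part_intensity is None:
            intensity = voice_state["intensity"]
        else:
            intensity = w_voice * voice_state["intensity"] + w_part * part_intensity
    else:
        # Categories disagree: prefer the higher-confidence category, or the one
        # matching the preset-part state if such a check is provided.
        if part_state_matches is not None:
            category = part_state_matches(face_state["category"], voice_state["category"])
        else:
            category = max(face_state, voice_state, key=lambda s: s["confidence"])["category"]

        if category == face_state["category"] and part_intensity is not None:
            intensity = part_intensity
        elif part_intensity is not None:
            intensity = w_voice * voice_state["intensity"] + w_part * part_intensity
        else:
            intensity = voice_state["intensity"]

    return {"category": category, "intensity": intensity}
```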
In one exemplary embodiment, the determining unit further includes: a second determination module, configured to determine, if it is determined that the target object is in a negative emotion according to the current emotional state and the historical emotional state, and an emotional intensity of the negative emotion is reduced, that a device operation to be performed by the target smart device is a first device operation; a third determination module, configured to determine, if it is determined that the target object is in a negative emotion according to the current emotional state and the historical emotional state, and an emotional intensity of the negative emotion is increased, that a device operation to be performed by the target smart device is a second device operation; and the fourth determination module is used for determining that the equipment operation to be executed by the target intelligent equipment is a third equipment operation under the condition that the target object is determined to be in a negative emotion according to the current emotion state and the historical emotion state, and the duration of the negative emotion intensity greater than or equal to a preset intensity threshold reaches a preset duration threshold.
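To make the decision rules of the determining unit concrete, the following sketch shows how the trend of a negative emotion could select a device operation; the operation names, threshold values, and record format are assumptions for demonstration, not values from the disclosure.

```python
# Illustrative sketch of the decision rules above; operation names and
# thresholds are assumed for demonstration.
def choose_device_operation(current, history, intensity_threshold=7,
                            duration_threshold_s=600):
    """current: dict with 'intensity' and 'timestamp' (seconds since epoch);
    history: list of such dicts ordered from oldest to newest."""
    if not history:
        return "first_device_operation"            # e.g. play soothing music

    previous_intensity = history[-1]["intensity"]

    # Negative emotion easing off: keep the current soothing strategy.
    if current["intensity"] < previous_intensity:
        return "first_device_operation"

    # Sustained strong negative emotion: walk back while intensity stays high.
    sustained_since = None
    for record in [current] + history[::-1]:
        if record["intensity"] >= intensity_threshold:
            sustained_since = record["timestamp"]
        else:
            break
    if (sustained_since is not None
            and current["timestamp"] - sustained_since >= duration_threshold_s):
        return "third_device_operation"            # e.g. prompt another person to help

    # Negative emotion worsening: switch to an enhanced soothing operation.
    if current["intensity"] > previous_intensity:
        return "second_device_operation"

    return "first_device_operation"
```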
According to another aspect of the embodiments of the present application, there is also provided a computer-readable storage medium, in which a computer program is stored, wherein the computer program is configured to execute the above-mentioned digital twin multimodal apparatus control method when running.
According to another aspect of the embodiments of the present application, there is also provided an electronic apparatus, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the above digital twin multimodal device control method through the computer program.
In the embodiments of the application, the current emotional state of a specific object is recognized, and, when the current emotional state is a negative state, a corresponding device operation is executed for that object in combination with the object's historical emotional state. Specifically, emotional state recognition is performed on the target object based on the target object image and/or object voice data to obtain the current emotional state of the target object; when the current emotional state belongs to a negative state, the historical emotional state of the target object in a historical time period is acquired; the target device operation to be executed by the target smart device is determined according to the current emotional state and the historical emotional state; and the target smart device is controlled to execute the target device operation. When the current emotional state of a specific object is detected to be negative, the device operation executed by the smart device is determined according to both the current and the historical emotional states, and the influence of previously executed device operations on the user's emotional state is reflected by the combination of the historical and current emotional states. This achieves the purpose of obtaining the user's feedback information in time, achieves the technical effect of improving the sustainability of device operations, and solves the problem in the related art that the sustainability of device operations is poor because user feedback cannot be obtained in time when a device is controlled to execute an operation based only on a recognized emotion category.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly described below; those skilled in the art can obtain other drawings from these drawings without inventive effort.
FIG. 1 is a schematic diagram of a hardware environment of an alternative digital twin multimodal device control method according to an embodiment of the present application;
FIG. 2 is a flow diagram illustrating an alternative digital twin multimodal device control method according to an embodiment of the application;
FIG. 3 is a schematic flow diagram of an alternative digital twin multimodal device control method according to an embodiment of the application;
FIG. 4 is a block diagram of an alternative digital twin multimodal device control apparatus according to an embodiment of the present application;
fig. 5 is a block diagram of an alternative electronic device according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
According to an aspect of an embodiment of the present application, a digital twin multimodal device control method is provided. The digital twin multimodal device control method is widely applied to whole-house intelligent digital control application scenarios such as Smart Home, smart home device ecosystems, and Intelligent House ecosystems. Optionally, in the present embodiment, the digital twin multimodal device control method may be applied to a hardware environment constituted by the terminal device 102 and the server 104 as shown in fig. 1. As shown in fig. 1, the server 104 is connected to the terminal device 102 through a network and may be configured to provide a service (e.g., an application service) for the terminal or a client installed on the terminal; a database may be set up on the server or independently of the server to provide a data storage service for the server 104, and a cloud computing and/or edge computing service may be configured on the server or independently of the server to provide a data computing service for the server 104.
The network may include, but is not limited to, at least one of: a wired network, a wireless network. The wired network may include, but is not limited to, at least one of: a wide area network, a metropolitan area network, a local area network. The wireless network may include, but is not limited to, at least one of: WIFI (Wireless Fidelity), Bluetooth. The terminal device 102 may be, but is not limited to, a PC, a mobile phone, a tablet computer, a smart air conditioner, a smart range hood, a smart refrigerator, a smart oven, a smart stove, a smart washing machine, a smart water heater, a smart laundry device, a smart dishwasher, a smart projection device, a smart TV, a smart clothes-drying rack, a smart curtain, smart audio-visual equipment, a smart socket, a smart sound system, a smart speaker, a smart fresh-air device, smart kitchen-and-bath equipment, a smart bathroom device, a sweeping robot, a window-cleaning robot, a mopping robot, a smart air purifier, a smart steam oven, a smart microwave oven, a smart kitchen appliance, a smart purifier, a smart water dispenser, a smart door lock, and the like.
The digital twin multi-modal device control method according to the embodiment of the present application may be executed by the server 104, the terminal 102, or both the server 104 and the terminal 102. The terminal 102 may execute the digital twin multimodal device control method according to the embodiment of the present application by a client installed thereon.
Taking the server 104 as an example to execute the digital twin multimodal device control method in the embodiment, fig. 2 is a schematic flow chart of an optional digital twin multimodal device control method according to the embodiment of the present application, and as shown in fig. 2, the flow chart of the method may include the following steps:
step S202, based on the target object image and/or the object voice data, the emotion state of the target object is identified, and the current emotion state of the target object is obtained.
The digital twin multimodal device control method in this embodiment may control the target smart device to perform a corresponding device operation based on the emotional state of the target object, so as to adjust the emotional state of the target object in a scene. The target object may be a user of the target smart device, the target smart device may be a smart home device such as a smart refrigerator or a smart speaker, and the executed device operation may include a TTS (Text To Speech) voice playing operation and may also include other device operations, which is not limited in this embodiment.
The target smart device may be a digital twin multimodal device, i.e., a simulated smart device that makes full use of data such as a physical model, sensor updates, and operation history, integrates multi-disciplinary, multi-physical-quantity, multi-scale, and multi-probability information, and completes simulation mapping in a virtual space, thereby reflecting the full life-cycle process of the physical smart device to which it corresponds.
The target smart device may acquire object data of the target object, where the object data of the target object may include at least one of a target object image and object voice data, for example, the target object image of the target object may be face image data, limb image data, and the like of the target user, and the object voice data of the target object may be voice data of the target user and the like, and may also include other object data, which is not limited in this embodiment. The target smart device may acquire the object data of the target object in one or more ways, which may include but is not limited to one of the following: the object data of the target object is acquired by the acquisition component on the target intelligent device, the object data of the target object is acquired by other devices associated with the target intelligent device, and the object data of the target object may also be acquired in other manners.
The target smart device can upload the object data of the target object to the server, and the server can receive the object data and perform emotional state recognition based on it to obtain the current emotional state of the target object. The current emotional state of the target object may belong to one of a positive state, a neutral state, and a negative state. It may include the current emotion category of the target object, such as happy, neutral, sad, surprised, afraid, angry, or disgusted, and may further include the current emotional intensity, i.e., the degree of intensity of the emotion, which may be represented by an emotion grade, for example grade 0 to grade 10, where grade 0 represents the mildest emotion and grade 10 represents the most intense emotion, or by other representations.
The server may perform emotional state recognition based on the object data of the target object in one or more ways to obtain the current emotional state. One way is to input the object data of the target object into an emotional state recognition model and obtain the current emotional state output by the model. Another way is to call an interface of an emotion recognition algorithm through a corresponding API (Application Programming Interface) and obtain, through that interface, the current emotional state corresponding to the object data. Emotional state recognition may also be performed on the object data in other ways, which is not limited in this embodiment.
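A minimal sketch of this step is given below, assuming that a locally loaded recognition model and a remote recognition endpoint are both available as hypothetical interfaces; the model object, endpoint URL, and field names are invented for illustration and do not refer to a real library or service.

```python
# Hypothetical sketch: `local_model`, the API endpoint, and the payload fields
# are assumptions for illustration only.
import json
import urllib.request


def recognize_emotion(object_image_bytes=None, voice_bytes=None,
                      local_model=None, api_url=None):
    """Return a dict such as {'category': 'sad', 'intensity': 6, 'confidence': 0.91}."""
    if local_model is not None:
        # Option 1: feed the object data into an emotional state recognition model.
        return local_model.predict(image=object_image_bytes, audio=voice_bytes)

    # Option 2: call an emotion-recognition interface through an API.
    payload = json.dumps({
        "image": object_image_bytes.hex() if object_image_bytes else None,
        "audio": voice_bytes.hex() if voice_bytes else None,
    }).encode("utf-8")
    request = urllib.request.Request(api_url, data=payload,
                                     headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```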
And step S204, acquiring the historical emotional state of the target object in the historical time period under the condition that the current emotional state belongs to the passive state.
If the current emotional state of the target object belongs to a negative state, the target smart device may be controlled to perform a corresponding device operation to adjust the negative state of the target object. At present, most smart home devices perform one-time emotion recognition according to the user's voice or text and then perform a corresponding soothing operation; for example, when the user's emotion is recognized as low and sad, voice or songs can be played for soothing. However, such soothing can only temporarily relieve the user's negative emotion; if the user does not make any more sound, the device cannot perceive whether the user's emotion has really returned to calm, and if the user remains in a low, negative mood for a long time, the user's work and life are likely to be adversely affected.
In order to at least partially solve the above problem, in this embodiment, if the current emotional state of the target object belongs to a negative state, the server may obtain a historical emotional state of the target object in a historical time period, detect a change in a historical emotion of the target object, and make a corresponding feedback based on the emotion change of the target object, which may ensure that the feedback made can better soothe the negative emotion of the target object.
The server may obtain the historical emotional state of the target object in the historical time period in one or more manners, may obtain the historical emotional state of the target object in the historical time period from the database according to the object information (e.g., name, etc.) of the target object, may obtain the historical emotional state of the target object from the target smart device or other associated smart devices, and may also obtain the historical emotional state of the target object in other manners, which is not limited in this embodiment.
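The history lookup could, under the assumption of a simple per-object record store, look like the following sketch; the record format and field names are invented for illustration.

```python
# Illustrative sketch: the in-memory record store and field names are assumptions.
from datetime import datetime, timedelta


def get_historical_states(records, object_name, window=timedelta(hours=1), now=None):
    """records: list of dicts such as
    {'object': 'Xiaoming', 'category': 'sad', 'intensity': 6, 'time': datetime(...)}.
    Returns the records for object_name that fall inside the historical time window."""
    now = now or datetime.now()
    return [r for r in records
            if r["object"] == object_name and now - r["time"] <= window]
```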
And step S206, determining target equipment operation to be executed by the target intelligent equipment according to the current emotional state and the historical emotional state.
After obtaining the historical emotional state of the target object in the historical time period, the server can determine the target device operation to be executed by the target smart device according to the current emotional state and the historical emotional state. There may be one or more target device operations, which may include, but are not limited to, one of: playing soothing music, continuing to play a soothing NLP voice, playing an enhanced soothing NLP voice, and the like. Other device operations are also possible, for example, sending a prompt message to a preset device to prompt other objects to soothe the negative emotion of the target object, which is not limited in this embodiment.
Step S208, the control target smart device executes the target device operation.
After determining the target device operation to be executed, the server may control the target intelligent device to execute the corresponding target device operation. Optionally, the server may send a device operation instruction to the target smart device to instruct the target smart device to perform the target device operation. After the target intelligent device receives the device operation instruction, the target intelligent device can execute the target device operation indicated by the device operation instruction.
Optionally, the target smart device may re-acquire the current emotional state of the target object, re-determine a next target device operation to be executed by the target smart device in combination with the re-acquired current emotional state and the historical emotional state, and continue to execute the next target device operation after the target device operation is executed.
It should be noted that the above steps S202 to S208 may also be executed by the target smart device or the target smart device in combination with the server, and the above is only an exemplary description executed by the server and does not limit the execution subject of the digital twin multimodal device control method in the present embodiment.
Through the steps S202 to S208, emotion state recognition is carried out on the target object based on the target object image and/or the object voice data, and the current emotion state of the target object is obtained; under the condition that the current emotional state belongs to a negative state, acquiring the historical emotional state of the target object in a historical time period; determining target equipment operation to be executed by the target intelligent equipment according to the current emotional state and the historical emotional state; the method for controlling the target intelligent device to execute the target device operation solves the problem that the sustainability of the device operation is poor due to the fact that the feedback information of the user cannot be timely obtained in the method for controlling the device to execute the device operation based on the recognized emotion category in the related art, and improves the sustainability of the device operation.
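Putting steps S202 to S208 together, a minimal control-loop sketch might look like the following; the helper callables stand in for the hypothetical sketches given elsewhere in this description, and the negative-category list is an assumption.

```python
# Illustrative end-to-end sketch of steps S202-S208; the injected callables and
# the category names are assumptions for demonstration.
def control_cycle(recognize, get_history, choose_operation, execute,
                  object_image, voice_data, object_name):
    """recognize, get_history, choose_operation, execute are injected callables,
    for example the hypothetical sketches given earlier in this description."""
    negative = {"sad", "angry", "afraid", "disgusted"}   # assumed category names

    # S202: recognize the current emotional state from image and/or voice data.
    current = recognize(object_image, voice_data)
    if current["category"] not in negative:
        return None                      # not a negative state: nothing to adjust

    # S204: acquire the historical emotional states of the target object.
    history = get_history(object_name)

    # S206: determine the target device operation from the current and historical states.
    operation = choose_operation(current, history)

    # S208: control the target smart device to execute the target device operation.
    execute(operation)
    return operation
```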
In an exemplary embodiment, before performing emotional state recognition on the target object based on the target object image and/or the object voice data to obtain a current emotional state of the target object, the method further includes:
s11, acquiring a target object image acquired by the image acquisition component, wherein the target object image is an image containing a target face of the target object;
and S12, acquiring the object voice data acquired by the voice acquisition component, wherein the object voice data is the voice data sent by the target object.
The object data of the target object may include at least one of an object image and an object voice of the target object. Alternatively, in the present embodiment, the object data of the target object may include a target object image, which is an image containing the object face of the target object, and target voice data, which is voice data uttered by the target object.
The target smart device can be provided with an image acquisition component and a voice acquisition component, and the target object image and object voice data of the target object can be acquired through the image acquisition component and the voice acquisition component respectively. The image acquisition component can be a camera, a webcam, or the like, and the voice acquisition component can be a sound pickup, a microphone, or the like. The collected target object image and object voice data can be sent to the server so that the server can obtain the object data of the target object.
For example, the target smart device may be a smart device with a human-body sensor, a camera, and a sound pickup component. The human-body sensor is configured to trigger and start the camera when it detects that a human body is approaching, and a sound pickup component such as a microphone may also be started (the sound pickup component may remain always on, based on user authorization). The target smart device may be a smart speaker with a screen, a smart refrigerator with a screen, and the like. The target smart device can acquire the user's facial image and the voice data uttered by the user through the camera and the sound pickup component arranged on it, respectively.
According to the embodiment, the emotion state recognition is performed by acquiring the facial image of the user and the voice data sent by the user, so that the emotion state recognition accuracy can be improved, and the sustainability of the device to execute device operation can be improved.
In an exemplary embodiment, after acquiring the target object image acquired by the image acquisition component, the method further includes:
s21, carrying out face region identification on the target object image to obtain a group of face images;
s22, performing object recognition on each of a group of face images to obtain object information of a group of objects, wherein the group of objects includes the target object.
The emotion recognition approaches in the related art fail to associate the emotions of multiple present users with the users' identities and thus fail to track the subsequent emotion of a specific user. In this embodiment, identity recognition may be performed based on the face image, and the object corresponding to the face image may be determined. For the target object image, the server may perform face region recognition on the target object image of the target object, resulting in a group of face images.
For the group of face images, the server may perform object recognition on each face image respectively to obtain object information of a group of objects. The group of objects includes the target object and may also include objects other than the target object; for those other objects, emotion tracking may be performed in a manner similar to that for the target object, so as to control the target smart device or other smart devices to perform corresponding device operations. Here, the number of objects included in the group of objects is smaller than or equal to the number of face images in the group of face images (some face images may not yield a recognized object).
The way of performing object recognition separately for each face image may be: and respectively inputting each facial image into the face recognition model to obtain an object recognition result output by the face recognition model. Here, the input of the face recognition model is a user face image (i.e., each face image), and the output is user information of the current user (i.e., object information, which requires that the current user has registered a face in the smart device).
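As an illustrative sketch only: OpenCV's bundled Haar cascade is used below merely as one plausible face-region detector, and `identify` stands in for a hypothetical face recognition model that returns object information for registered users.

```python
# Illustrative sketch: the Haar cascade stands in for "face region recognition",
# and `identify` is a hypothetical face-recognition model returning object info.
import cv2


def recognize_objects(target_object_image_path, identify):
    image = cv2.imread(target_object_image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

    object_infos = []
    for (x, y, w, h) in faces:              # a group of face images
        face_image = image[y:y + h, x:x + w]
        info = identify(face_image)         # object info, e.g. a registered user name
        if info is not None:                # some faces may not match a registered object
            object_infos.append(info)
    return object_infos
```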
According to this embodiment, object recognition is performed on the face images to obtain the object information of each object, which makes it convenient to track the emotions of different users and improves the sustainability of device operations.
In an exemplary embodiment, before emotional state recognition is performed on the target object based on the target object image and/or the object voice data to obtain the current emotional state of the target object, the method further includes:
s31, under the condition that the distance between the detected target object and the target intelligent device is smaller than or equal to a preset distance threshold, the image acquisition component is adjusted from a closed state to an open state;
and S32, under the condition that the target position of the target object in the image to be acquired of the image acquisition component meets the preset condition, acquiring the image of the target object by the image acquisition component.
In the present embodiment, the object data of the target object includes a target object image, which may be similar to the foregoing embodiments. To improve the accuracy of image acquisition, the target smart device may detect the distance to the target object. For example, a human body sensor may be disposed on the target smart device, and the human body sensor may be an infrared distance measuring sensor, and detect whether a distance between the target smart device and the target object is smaller than a preset distance threshold (which may be a sensing distance of the human body sensor). Under the condition that the distance between the target object and the target intelligent device is smaller than or equal to the preset distance threshold value, the image acquisition component can be controlled to be adjusted from the off state to the on state so as to acquire the image of the target object.
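A minimal sketch of this proximity trigger is given below, assuming hypothetical sensor and camera driver objects with `read_distance()`, `is_on()`, and `turn_on()` methods; the threshold value is also an assumption.

```python
# Illustrative sketch; the sensor and camera interfaces are assumptions.
PRESET_DISTANCE_THRESHOLD_M = 1.5   # assumed sensing distance of the human-body sensor


def maybe_activate_camera(distance_sensor, camera):
    """Switch the image acquisition component from off to on when the detected
    distance is smaller than or equal to the preset threshold."""
    distance = distance_sensor.read_distance()        # e.g. an infrared ranging reading
    if distance <= PRESET_DISTANCE_THRESHOLD_M and not camera.is_on():
        camera.turn_on()
        return True
    return False
```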
The target smart device could directly capture images of the target object to obtain a captured image, but emotional state recognition may fail if an image captured this way does not contain a part from which the emotional state can be recognized. In this embodiment, the image acquisition component therefore captures an image of the target object only when the object part of the target object located in the image to be captured satisfies a preset condition, so as to obtain the target object image. The object part may include the object face of the target object, and may also include the torso or fingers of the target object. The preset condition may include one or more conditions: for example, the image is captured when the image to be captured contains the object face of the target object, or when the image to be captured contains both the object face and the fingers of the target object; other conditions may also be used, which is not limited in this embodiment.
For example, the target smart device may capture an image of the user through the camera to obtain the captured image when it detects that the parts of the user located in the camera's image to be captured, such as the face, torso, and fingers, satisfy the preset condition (for example, the user's face, torso, and fingers are detected at the same time).
According to this embodiment, the image acquisition component is started when the distance between the user and the target smart device is detected to meet the target distance threshold, and the user's image is captured when the user's object part located in the camera's image to be captured is detected to meet the preset condition. This improves the usability of the captured image and the accuracy of emotional state recognition.
In an exemplary embodiment, capturing an image of the target object through the image capture component to obtain the target object image, in the case that the object part of the target object located in the image to be captured of the image capture component meets a preset condition, includes:
s41, when the target object is located in the image to be collected, the target part comprises an object face and the ratio of the area of the object face to the area of the image to be collected is greater than or equal to the target ratio, image collection is carried out on the target object through an image collection component to obtain a target object image; alternatively, the first and second electrodes may be,
and S42, under the condition that the target part of the target object in the image to be collected comprises the face and the hand of the target object, carrying out image collection on the target object through the image collection component to obtain the image of the target object.
In this embodiment, the preset condition may include one or more conditions, which may include but are not limited to at least one of the following:
and under the condition that the target part of the target object in the image to be acquired comprises the object face and the ratio of the area of the object face to the area of the image to be acquired is greater than or equal to the target ratio, acquiring the image of the target object by the image acquisition component to obtain the image of the target object.
For example, when the target smart device detects the face of the user and the ratio of the area of the face of the user to the area of the image to be acquired is greater than or equal to the target ratio, the camera may acquire the image of the user to obtain the target object image.
Under the condition that the target part of the target object, which is positioned in the image to be acquired, comprises the face and the hand of the target object, the image acquisition component acquires the image of the target object to obtain an image of the target object.
For example, when the target smart device detects that the image to be captured includes both the face and the hand of the user, the target smart device may capture an image of the user through the camera to obtain the target object image.
In addition, the target object image may be obtained by performing image acquisition on the target object through the image acquisition component under the condition that other conditions are satisfied, which is not limited in this embodiment.
Illustratively, when the user meets a specific condition, the camera is started to take a picture of the user so that the complete face can be captured. Whether to turn on the camera can be determined in the following ways: the ratio of the user's face area to the image area meets a certain threshold; or the user's face and fingers appear in the image at the same time.
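The two example trigger conditions above can be expressed as a simple predicate; the detection-result structure and the example ratio threshold below are assumptions for illustration.

```python
# Illustrative predicate for the capture conditions above; field names and the
# example ratio threshold are assumptions.
def should_capture(detections, frame_area, target_ratio=0.2):
    """detections: dict such as {'face_area': 12000.0, 'hand_detected': True}."""
    face_area = detections.get("face_area")
    if face_area is None:
        return False                                   # a complete face is required
    # Condition 1: the face occupies a large enough fraction of the frame.
    if face_area / frame_area >= target_ratio:
        return True
    # Condition 2: the face and a hand appear in the frame at the same time.
    return bool(detections.get("hand_detected"))
```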
By the embodiment, under the condition that the face proportion of the user is detected to be larger than the target threshold value or the face and the hands of the user are detected at the same time, the image of the user is acquired, the usability of the acquired object image can be improved, and the accuracy of emotion state identification can be improved.
In one exemplary embodiment, the target object image includes an object face image of the target object; the method for recognizing the emotion state of the target object based on the image and/or the voice data of the target object to obtain the current emotion state of the target object comprises the following steps:
s51, performing emotion type recognition on the object face image to obtain the current emotion type of the object, wherein the current emotion state comprises the current emotion type;
s52, under the condition that the target object image also comprises a position image of a preset position of the target object, performing position action recognition on the position image of the preset position to obtain a target position action;
and S53, determining the emotional intensity matched with the target part action as the current emotional intensity of the target object, where the current emotional state further includes the current emotional intensity.
In the present embodiment, the target object image may contain an object face image of the target object. The server can perform emotion type recognition on the object face image of the target object to obtain the current emotion type of the target object, and can also obtain the confidence corresponding to the current emotion type. The manner in which the server performs emotion category identification on the target face image of the target object to obtain the current emotion category of the target object is similar to the manner in which the server performs emotion state identification on the target image to obtain the current emotion state of the target object in the foregoing embodiment, and details are not described here.
The target object image may further include a position image of a preset portion of the target object, and the preset portion may be a portion capable of representing an emotional state of the target object, such as a trunk, a finger, and the like. The server can perform part action recognition on the part image of the preset part to obtain the target part action. The manner in which the server performs the part motion recognition on the part image of the preset part to obtain the target part motion is similar to the manner in which the server performs the emotion state recognition on the target object image to obtain the current emotion state of the target object in the foregoing embodiment, and details are not repeated here.
Optionally, the preset part may include the torso of the target object. In this case the target object image includes a whole-body image of the target object, and the whole-body image is transmitted to a body motion recognition model (the model may be located in the cloud) to obtain the body motion label of the target object. The body motion recognition model is a classification model for recognizing the stretch state of the user's posture, and the stretch-state labels include: curled up while squatting, standing with shoulders hunched, sitting relaxed, standing relaxed, and the like.
Optionally, the preset portion may include a hand of the target object, the target object image includes an object hand image (a two-hand image) of the target object, the object hand image is transmitted to a finger motion recognition model (the model may be located in the cloud), so as to obtain a gesture motion of the target object, where the finger motion recognition model is a classification model for recognizing a stretching state of a finger of the user, and the label of the stretching state includes: fist making, relaxing, etc.
For example, the smart device may recognize and segment the face region, the two-hand region, and the torso region of the captured whole-body image of the user to obtain three pictures, namely the user's face image, two-hand image, and whole-body image, and transmit them to the program module of the next processing step. The face must be completely captured; if the hands or the torso are not recognized, empty data can be transmitted in the next step instead.
The face image can be simultaneously transmitted to a face recognition model and a facial emotion recognition model in the cloud; the identity of the user (for example, the user Xiaoming) is obtained through the face recognition model, and the facial expression label (crying) and the confidence of the label are obtained through the facial emotion recognition model. The whole-body picture and the two-hand picture are respectively transmitted to the body motion recognition model and the finger motion recognition model in the cloud to obtain the user's body motion label (standing with shoulders hunched) and gesture motion (making a fist). If the body motion or finger motion picture is empty, the corresponding output is also empty.
The server may also determine the emotional intensity matched with the target part action as the current emotional intensity of the target object. The current emotional state includes a current emotional category and a current emotional intensity (which may be the severity of the current emotion) of the target subject. For the current emotion intensity, the current emotion intensity can be represented by an intensity numerical value, the intensity value range can be [0,10], the greater the intensity value is, the stronger the intensity of the current emotion is, wherein 0 represents the lowest emotion intensity, and 10 represents the highest emotion intensity.
For example, each body action tag may correspond to a degree of emotional excitement. The minimum value of the severity is 0 and the maximum value is 10. The input of the body action recognition model is a user body action picture, and the output is a label name and a corresponding severity. Each gesture action tag may correspond to a degree of emotional excitement. The minimum value of the severity is 0 and the maximum value is 10. The input of the finger motion recognition model is a user finger motion picture, and the output is a label name and a corresponding severity.
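For illustration only, the label-to-severity outputs described above could be represented by lookup tables such as the following; all label names and numeric values are invented examples, not values from the disclosure.

```python
# Invented example values: each body / gesture action label maps to an
# emotional severity on the 0-10 scale described above.
BODY_ACTION_SEVERITY = {
    "curled_up_squatting": 9,
    "standing_shoulders_hunched": 7,
    "sitting_relaxed": 2,
    "standing_relaxed": 1,
}

GESTURE_ACTION_SEVERITY = {
    "fist_clenched": 8,
    "fingers_relaxed": 1,
}


def severity_for(label, table):
    """Return the severity matched with a recognized action label (None if unknown)."""
    return table.get(label)
```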
According to the embodiment, the emotional state recognition is carried out on the face image of the object and the part action of the preset part, so that the efficiency and the accuracy of the emotional state recognition can be improved.
In one exemplary embodiment, the preset parts include a subject hand and a subject torso, and the target part motion includes a target hand motion and a target body motion. Correspondingly, the step of determining the emotion intensity matched with the target part action as the current emotion intensity of the target object comprises the following steps:
and S61, determining the weighted sum of the emotion intensity matched with the target hand motion and the emotion intensity matched with the target body motion as the current emotion intensity of the target object.
For the preset part, it may include a subject hand and a subject torso, and correspondingly, the target part motion may include a target hand motion and a target body motion. The target hand motion and the target body motion are similar to those described in the foregoing embodiments, wherein the target hand motion may include making a fist, relaxing fingers, etc., the target body motion may include squatting, standing, and shoulder curling, etc., and the present embodiment is not limited to the types of the target hand motion and the target body motion.
In this embodiment, the server may obtain the emotional intensity matched with the target hand motion and the emotional intensity matched with the target body motion, determine the emotional intensity matched with the target hand motion based on a preset corresponding relationship between the hand motion and the emotional intensity, determine the emotional intensity matched with the target body motion based on a preset corresponding relationship between the body motion and the emotional intensity, and may also obtain the emotional intensity matched with the target hand motion and the emotional intensity matched with the target body motion in other manners, which is not limited in this embodiment.
The server may determine either the emotional intensity matched with the target hand action or the emotional intensity matched with the target body action as the current emotional intensity of the target object. To improve the accuracy of emotion recognition, a weighted sum of the emotional intensity matched with the target hand action and the emotional intensity matched with the target body action may be determined as the current emotional intensity of the target object; for example, the average of the two may be used as the current emotional intensity.
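The weighted combination can be written directly; the default weights of 0.5 below reproduce the averaging example mentioned above and are otherwise an assumption.

```python
# Weighted sum of the two part-action intensities; the 0.5/0.5 default
# reproduces the averaging example mentioned above.
def combine_part_intensities(hand_intensity, body_intensity,
                             w_hand=0.5, w_body=0.5):
    return w_hand * hand_intensity + w_body * body_intensity

# e.g. combine_part_intensities(8, 7) -> 7.5 on the 0-10 scale
```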
Through the embodiment, the current emotion intensity is determined as the weighted sum of the emotion intensity matched with the hand action and the emotion intensity matched with the body action, so that the accuracy and the efficiency of emotion state recognition can be improved.
In an exemplary embodiment, performing emotional state recognition on the target object based on the target object image and/or the object voice data to obtain a current emotional state of the target object, includes:
s71, respectively carrying out emotion state recognition on the target object based on the target object image and the target voice data to obtain a plurality of emotion states;
and S72, performing emotional state fusion on the plurality of emotional states to obtain the current emotional state of the target object.
In the present embodiment, the object data of the target object may contain object data of a plurality of modalities, for example, an object image, object voice, and the like. In the case that the object data of the target object includes the target object image and the object voice data, the server respectively performs emotional state recognition on the target object based on the target object image and the object voice data to obtain a plurality of emotional states. Optionally, an emotional state may include an emotion category of the target object and an emotional intensity corresponding to the emotion category.
After obtaining the plurality of emotional states, the server may randomly select one emotional state from the plurality of emotional states as the current emotional state of the target object. In order to improve the accuracy of emotion state recognition, the server can perform emotion state fusion on multiple emotion states to obtain the current emotion state of the target object.
Through this embodiment, by recognizing and fusing object data of multiple modalities to obtain the user's current emotional state, the efficiency and convenience of emotional state recognition can be improved, as well as the sustainability of the device operation performed by the device.
In an exemplary embodiment, the emotional state recognition is performed on the target object based on the target object image and the target object voice data, respectively, to obtain a plurality of emotional states, including:
s81, performing emotion state recognition on the target face image of the target object to obtain a first emotion state, wherein the first emotion state comprises a first emotion category, and the target object image comprises the target face image;
s82, performing emotion state recognition on the object voice data of the target object to obtain a second emotion state, wherein the second emotion state includes a second emotion category.
In the present embodiment, the target object image may include an object face image of the target object. The server respectively performing emotion state recognition on the target object based on the target object image and the object voice data may include: performing emotion state recognition on the target face image and the object voice data of the target object respectively to obtain a first emotion state and a second emotion state. The first emotional state may comprise a first emotion category and the second emotional state may comprise a second emotion category.
The manner of performing emotional state recognition on the target face image of the target object is similar to that in the foregoing embodiment, and details are not repeated here. The manner in which the server recognizes the emotional state of the object voice data of the target object may be: inputting the object voice data into a speech emotion model, through which the emotion category of the target object is identified, and the emotional intensity of the target object may also be identified.
For example, the smart device receives a speech query Q1 spoken by the user and transmits it to the cloud, where the speech emotion model identifies the user's emotion category and degree (degree 0 being the lightest and degree 10 the heaviest). An example recognition result is: Xiaoming's emotion is aggrieved and agitated, with a degree of 7. The emotion and degree of the current user are then corrected according to the emotion judgment comprehensive model strategy by combining the analysis result of the user image with the emotion analysis result of the user's speech.
Here, it should be noted that the object voice data may be obtained by performing voice data acquisition through a sound receiving component of the target smart device after the target object wakes up the target smart device, or may be obtained by performing voice data acquisition through a sound receiving component of the target smart device in a non-wake state based on the authorization information of the user, which is not limited in this embodiment.
Through the embodiment, the emotion state recognition is performed on the object face image and the object voice data of the user, so that the convenience and the accuracy of the emotion state recognition can be improved.
In one exemplary embodiment, the current emotional state includes a current emotional category of the target object. Correspondingly, performing emotional state fusion on the multiple emotional states to obtain the current emotional state of the target object, including:
s91, determining any emotion category of the first emotion category and the second emotion category as a current emotion category when the first emotion category and the second emotion category are consistent;
s92, determining the emotion classification with high confidence coefficient in the first emotion classification and the second emotion classification as the current emotion classification under the condition that the first emotion classification and the second emotion classification are inconsistent; or determining the emotion category matched with the part state of the preset part of the target object in the first emotion category and the second emotion category as the current emotion category.
In this embodiment, the first emotion category is the emotion category recognized from the object face image, and the second emotion category is the emotion category recognized from the object voice data; the current emotional state includes the current emotion category of the target object. The server performing emotional state fusion on the plurality of emotional states to obtain the current emotional state of the target object may include: fusing the first emotion category and the second emotion category to obtain the current emotion category of the target object.
The first emotion category and the second emotion category may or may not be consistent, where consistent means that the two are identical or match (for example, both belong to positive emotions, or both belong to negative emotions). Fusing the first emotion category and the second emotion category may be performed based on whether they are consistent. In the case that the first emotion category and the second emotion category are consistent, either of them is determined as the current emotion category. In the case that they are inconsistent, the emotion category with the higher confidence coefficient among the first emotion category and the second emotion category is determined as the current emotion category, or the emotion category among the two that matches the part state of the preset part of the target object is determined as the current emotion category; the current emotion category may also be determined in other manners, which is not limited in this embodiment.
As an alternative embodiment, the confidence of the first emotion category may be determined from the object face image of the target object, that is, the first emotion category and its confidence are recognized from the object face image; the confidence of the second emotion category may be determined from the object voice data of the target object, that is, the second emotion category and its confidence are recognized from the object voice data. The higher the confidence of an emotion category, the more reliable it is; therefore, in the case that the first emotion category and the second emotion category are inconsistent, the emotion category with the higher confidence may be determined as the current emotion category.
As another optional embodiment, in a case that the target object image includes a part image of a preset part of the target object, the server may identify the part image of the preset part to obtain a part state of the preset part, and determine, as the current emotion category, an emotion category, which is matched with the preset part state of the target object, in the first emotion category and the second emotion category.
For example, the emotion judgment comprehensive model can, through a strategy, correct the situation in which the conclusions obtained by speech emotion recognition and image emotion recognition are inconsistent. The current emotion classification of the user is judged according to the classification result of the facial emotion recognition model and the classification result of the speech emotion recognition model. The judgment method for classifying the current emotion of the user is as follows: if the emotion label recognized from the face is consistent with the emotion label recognized from speech emotion, the current emotion of the user is considered to be that label. If the emotion labels recognized from the face and from speech emotion are inconsistent, the emotion of the user can be judged in two methods:
the first method is as follows: and judging the emotion of the user according to the label with higher confidence coefficient.
For example, if speech emotion recognition gives a confidence of 99% that the user is angry, while image recognition gives a confidence of 60% that the user's emotion is neutral, the true emotion of the user is considered to be angry.
The second method is as follows: judging by combining the body movements and finger movements.
For example, if speech emotion recognition determines that the user is in an angry state, but facial expression recognition shows that the user is smiling, the body posture is stretched and relaxed, and the fingers are relaxed, the speech emotion recognition is considered to be biased, and the user's emotion is corrected to neutral and calm.
Illustratively, if facial picture recognition identifies the current user as Xiaoming, the current facial expression is a crying, agitated state, the shoulders are shaking, the fingers are clenched into fists, the speech emotion is aggrieved and agitated, and the speech emotion recognition result does not conflict with the image emotion recognition result, then Xiaoming's emotion is considered to be aggrieved and agitated.
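The category-fusion strategy illustrated above can be sketched as follows; the function signature is an assumption, and the relaxed-posture check is a simplified stand-in for the body and finger action correction of the second method:

```python
from typing import Optional

def fuse_emotion_category(face_label: str, face_confidence: float,
                          speech_label: str, speech_confidence: float,
                          body_relaxed: Optional[bool] = None) -> str:
    """Fuse the facial and speech emotion categories into the current emotion category."""
    if face_label == speech_label:
        return face_label                          # consistent: either label can be used
    if body_relaxed is not None:
        # Second method: correct with body / finger actions when a part image is available;
        # a relaxed posture supports the calmer (facial) result, as in the example above.
        return face_label if body_relaxed else speech_label
    # First method: otherwise take the label with the higher confidence.
    return face_label if face_confidence >= speech_confidence else speech_label

# Example from the text: speech says "angry" (0.99), image says "neutral" (0.60) -> "angry"
print(fuse_emotion_category("neutral", 0.60, "angry", 0.99))
```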
Through the embodiment, the emotion categories identified by the object data with different dimensions are fused, the current emotion category of the user is determined, and convenience and efficiency of emotion identification can be improved.
In an exemplary embodiment, the current emotional state further comprises a current emotional intensity of the target object, which may be the degree of intensity of the current emotion. Correspondingly, performing emotional state fusion on the plurality of emotional states to obtain the current emotional state of the target object further comprises the following steps:
s101, determining the first emotion intensity in the second emotion state as the current emotion intensity under the condition that the first emotion category is consistent with the second emotion category; alternatively,
and S102, under the condition that the target object image further comprises a part image of a preset part of the target object, determining a weighted sum of a first emotion intensity and a second emotion intensity as the current emotion intensity, wherein the first emotion intensity is the emotion intensity contained in the second emotion state, and the second emotion intensity is the emotion intensity recognized from the part image of the preset part.
In this embodiment, when performing emotional state fusion on the plurality of emotional states, the server may also fuse the emotion intensities in the different emotional states. The manner of fusing the emotion intensities in different emotional states may be: determining the first emotion intensity in the second emotion state as the current emotion intensity in the case that the first emotion category and the second emotion category are consistent.
Alternatively, in the case that the first emotion category and the second emotion category are consistent, when the target object image further includes a part image of a preset part of the target object, a weighted sum (for example, an average) of the first emotion intensity and the second emotion intensity may be determined as the current emotion intensity. The first emotion intensity is the emotion intensity included in the second emotion state, and the second emotion intensity is the emotion intensity recognized from the part image of the preset part.
The server can identify the emotion intensity of the preset part of the target object to obtain a second emotion intensity matched with the state of the preset part of the target object. The manner in which the server performs emotion intensity recognition on the preset portion of the target object to obtain the second emotion intensity matched with the preset portion state of the target object is similar to the manner in which the server performs emotion state recognition on the target object image to obtain the current emotion state of the target object in the foregoing embodiment, and details are not repeated here.
For example, the following methods may be used to determine the degree of excitement of the user: if the voice emotion recognition result is consistent with the image emotion recognition result, the body action intensity of the user, the finger action intensity of the user and the voice emotion intensity can be averaged to obtain the emotion intensity of the user.
Through the embodiment, when the recognized various emotion types are consistent, the emotion intensity of the user is obtained by performing weighted summation on the emotion intensity (namely, the intensity) recognized based on the voice data and the emotion intensity recognized based on the part image of the preset part, and the accuracy of emotion state recognition can be improved.
In one exemplary embodiment, similar to the previous embodiment, the current emotional state further comprises a current emotional intensity of the target subject. Correspondingly, the method for fusing the emotional states of the multiple emotional states to obtain the current emotional state of the target object further comprises the following steps:
s111, determining the emotion intensity recognized from the part image of the preset part as the current emotion intensity under the conditions that the first emotion type and the second emotion type are inconsistent, the current emotion type is the first emotion type, and the target object image further comprises the part image of the preset part of the target object;
s112, determining the first emotion intensity in the second emotion state as the current emotion intensity under the condition that the first emotion category is inconsistent with the second emotion category and the current emotion category is the second emotion category;
and S113, under the condition that the first emotion category and the second emotion category are inconsistent, the current emotion category is the second emotion category, and the target object image further comprises a part image of a preset part of the target object, determining the weighted sum of the first emotion intensity and the second emotion intensity as the current emotion intensity, wherein the first emotion intensity is the emotion intensity included in the second emotion state, and the second emotion intensity is the emotion intensity recognized from the part image of the preset part.
In this embodiment, if the first emotion category and the second emotion category are not consistent, the current emotion intensity of the target object may be determined based on whether the current emotion category is the first emotion category or the second emotion category and whether the target object image includes a position image of a preset portion of the target object.
If the current emotion category is the first emotion category and the target object image includes a part image of a preset part of the target object, the emotion intensity of the target object may be determined in a manner similar to that in the foregoing embodiment, and the emotion intensity recognized from the part image of the preset part may be determined as the current emotion intensity. If there are a plurality of preset portions, a weighted sum of the emotional intensity recognized from the portion image of each preset portion may be determined as the current emotional intensity.
The first emotional intensity in the second emotional state may be determined as the current emotional intensity if the current emotional category is the second emotional category. In this case, the target object image may include a part image of the preset part of the target object, or may not include a part image of the preset part of the target object.
Alternatively, if the current emotion category is the second emotion category and the target object image further includes a part image of a preset part of the target object, the emotion intensity recognized from the part image of the preset part, that is, the second emotion intensity, may be determined in a manner similar to the foregoing, and the weighted sum of the first emotion intensity and the second emotion intensity may be determined as the current emotion intensity.
For example, the following methods may be used to determine the degree of excitement of the user:
if the voice emotion recognition result is consistent with the image emotion recognition result, averaging the body action intensity of the user, the finger action intensity of the user and the voice emotion intensity of the user to obtain the emotion intensity of the user;
if the voice emotion recognition result is inconsistent with the image emotion recognition result and, after judgment, the real emotion of the user is consistent with the voice emotion recognition result, the emotion intensity of the user is equal to the emotion intensity of the voice emotion recognition result;
if the voice emotion recognition result is inconsistent with the image emotion recognition result and, after judgment, the real emotion of the user is consistent with the facial emotion recognition result, the emotion intensity of the user is equal to the average of the user's body action intensity and the user's finger action intensity.
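A sketch of these three intensity-fusion rules; the function signature and example values are assumptions, but the branch structure follows the rules listed above:

```python
def fuse_emotion_intensity(speech_label: str, image_label: str, final_label: str,
                           speech_intensity: float,
                           body_intensity: float, finger_intensity: float) -> float:
    """Fuse speech, body and finger intensities into the user's emotional intensity."""
    if speech_label == image_label:
        # Rule 1: results consistent -> average of body, finger and speech intensities.
        return (body_intensity + finger_intensity + speech_intensity) / 3
    if final_label == speech_label:
        # Rule 2: inconsistent, speech result judged correct -> use the speech intensity.
        return speech_intensity
    # Rule 3: inconsistent, facial result judged correct -> average of body and finger.
    return (body_intensity + finger_intensity) / 2

# Consistent example: speech 7, body 7, finger 8 -> about 7.33
print(round(fuse_emotion_intensity("agitated", "agitated", "agitated", 7, 7, 8), 2))
```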
According to the embodiment, if the first emotion category and the second emotion category are inconsistent, the final emotion category and the emotion intensity recognized from the part image of the preset part are comprehensively considered to determine the current emotion intensity, and the emotion state recognition accuracy can be improved.
In an exemplary embodiment, determining a target device operation to be performed by the target smart device according to the current emotional state and the historical emotional state further includes:
s121, under the condition that the target object is determined to be in a negative emotion according to the current emotional state and the historical emotional state, and the emotional intensity of the negative emotion is reduced, determining that the device operation to be executed by the target intelligent device is a first device operation;
s122, under the condition that the target object is determined to be in a negative emotion according to the current emotional state and the historical emotional state, and the emotional intensity of the negative emotion is enhanced, determining that the device operation to be executed by the target intelligent device is a second device operation;
and S123, under the condition that the target object is determined to be in the negative emotion according to the current emotional state and the historical emotional state, and the duration of the negative emotion, the intensity of which is greater than or equal to the preset intensity threshold value, reaches the preset duration threshold value, determining that the device operation to be executed by the target intelligent device is the third device operation.
In this embodiment, the device operation to be performed by the target smart device may be determined based on the current emotional state and the historical emotional state. The target device operation includes at least a first device operation, a second device operation and a third device operation, where the first device operation may be an operation that maintains or reduces the degree of emotional soothing, the second device operation may be an operation that strengthens the degree of emotional soothing, and the third device operation may be an operation in which the target smart device gives a reminder that the target object has been in a negative emotion for a long time.
In the case where it is determined that the target object is in a negative emotion according to the current emotional state and the historical emotional state, and the emotional intensity of the negative emotion is reduced, the server may determine the first device operation as a device operation to be performed by the target smart device; in the case where it is determined that the target object is in a negative emotion according to the current emotional state and the historical emotional state, and the emotional intensity of the negative emotion is enhanced, the server may determine the second device operation as a device operation to be performed by the target smart device; and in the case that the target object is determined to be in a negative emotion according to the current emotional state and the historical emotional state, and the duration of the negative emotion, which is greater than or equal to the preset intensity threshold, reaches the preset duration threshold, the server may determine the third device operation as the device operation to be performed by the target smart device.
For example, according to the user identity, the user's historical emotional states and degrees over a past continuous period of time are queried from the cloud (for example, it is found that the user has been in negative, low states over the past three days, with degrees of 4, 5 and 6 respectively). Whether the user's emotion is changing in the positive direction is judged according to the user's current and historical emotional state labels and degrees. For example, the current user emotion is aggrieved and agitated with a degree of 7, so the negative emotional state has become more severe compared with the past three days.
If the user's negative emotion deepens, soothing and interaction with the user are strengthened by means such as guided communication; if the user's negative emotion is higher than a certain threshold (for example, a threshold of 9) and lasts for a period of time, the user is advised to consult a professional. If the user's negative emotion is relieved, normal emotion soothing means are used. For example, after Xiaoming's negative emotion deepens, a gentle, caring TTS voice can be used to guide Xiaoming to talk about what is bothering him, such as "Baby, what's wrong? Tell me about it."
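A sketch of the operation-selection logic of S121 to S123 combined with the thresholds mentioned in this example (the operation names, the duration threshold of three records and the counting of over-threshold records are assumptions):

```python
NEGATIVE = {"sad", "depressed", "aggrieved", "angry"}   # illustrative negative categories
INTENSITY_THRESHOLD = 9      # the "certain threshold" from the example above
DURATION_THRESHOLD = 3       # assumed number of records counted as "a period of time"

def choose_device_operation(current, history):
    """current: (category, intensity); history: list of (category, intensity), oldest first."""
    category, intensity = current
    if category not in NEGATIVE:
        return "normal interaction"
    records = history + [current]
    over_threshold = sum(1 for c, i in records if c in NEGATIVE and i >= INTENSITY_THRESHOLD)
    if over_threshold >= DURATION_THRESHOLD:
        return "third device operation"    # e.g. advise consulting a professional
    previous = history[-1][1] if history else intensity
    if intensity > previous:
        return "second device operation"   # negative emotion deepening: strengthen soothing
    return "first device operation"        # negative emotion easing: normal soothing

# Example from the text: degrees 4, 5, 6 over the past three days, 7 today -> strengthen soothing
print(choose_device_operation(("aggrieved", 7),
                              [("aggrieved", 4), ("aggrieved", 5), ("aggrieved", 6)]))
```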
For example, in a kindergarten scenario, the method can help identify and track children's emotions. When children quarrel or cry loudly over a small matter such as a toy or a snack, a scripted phrase can be broadcast to them through the smart device in time: "Friends should be courteous to each other; play with the toy together," giving the children appropriate comfort, and the subsequent emotions of the two children are tracked separately to identify how much the event has affected them. When it is found that the two children do not look happy whenever they see each other, the smart device can broadcast: "Don't forget the happy time you spent eating ice cream together just because of a quarrel over a toy," so as to ease the relationship between them.
For example, in a home scenario, the relationship between parents often affects a child's growth; for instance, frequent quarrels between parents adversely affect the child. When it is detected that a child is affected by a parental quarrel, the smart device can judge through facial expression recognition whether the child's mood is very low and broadcast a comforting voice message to the child. The smart device can also associate the child's historical emotion changes through the child's face recognition result, and when the child's emotion remains low continuously, the smart device can inform the parents in time and give scientific counseling advice.
Through this embodiment, the target operation corresponding to the target smart device is determined from the current emotional state and the historical emotional state, which can improve the accuracy and sustainability of the device operation performed by the device.
The digital twin multimodal apparatus control method in the embodiment of the present application is explained below with reference to an alternative example. In this optional example, the target smart device is a smart home device, the image capturing component is a camera, and the preset portion includes a trunk and a hand.
In the related art, a smart home device performs emotion recognition on the user by receiving the user's voice in real time, and two emotion recognition modes are commonly used: 1) analyzing the user's emotion from the speech signal waveform; 2) converting the user's speech into text and then recognizing the user's emotion through NLP (Natural Language Processing) technology. For example, if the text of the user's speech is "you are really annoying", it is detected that the user's emotion is angry, and at this time the smart home device may say some comforting words or play music to relieve the user's emotion.
However, in the above manner of controlling the smart home device to perform device operations, single-modality emotion recognition based on the user's voice makes it difficult to accurately recognize the emotions of users from whom voice data is difficult to obtain in time. In addition, the single-modality emotion recognition technology only recognizes and relieves the user's emotion at the current moment; once the user stops speaking, it cannot be judged whether the user's emotion has really been effectively relieved.
The invention provides a multi-modal user emotion recognition and tracking technology combining voice and images, that is, a multi-modal emotion recognition and tracking method based on the face, the human body, the fingers and the voice. It can judge the current emotional state of the user and provide psychological comfort in time for a user in a negative state, and it continues to recognize the user's emotional fluctuations through images and voice over the following days, judges whether the user's negative emotion has been relieved, and provides different soothing and encouragement in time according to the degree of relief, so as to ensure that the user's negative emotion is truly resolved and to avoid abnormal situations.
In addition, the multi-modal user emotion recognition and tracking technology provided in this optional example can accurately recognize the emotional state of each person in a multi-person scene and associate the emotional state with the identity of each person through the face, and when there are multiple users, the identity and the emotion of each user can be accurately associated. The historical emotional state of the user is analyzed through a tracking mechanism, and the psychological tendency of the user is judged, so that the AI (Artificial Intelligence) can further provide proper emotional comforting measures or give corresponding psychological persuasion.
The core building blocks of the multi-modal user emotion recognition and tracking technology in combination with speech and images provided in this alternative example may include the following models: 1. a speech emotion feature extraction model; 2. extracting a human face emotional characteristic model; 3. a body motion feature extraction model; 4. a finger action feature extraction model; 5. a mood judgment comprehensive model; 6. a face identity recognition model; 7. human face emotion correlation and user emotion tracking mechanisms.
The outputs of models 1, 2, 3 and 4 are used as feature vectors and sent to model 5, which recognizes and detects the user's emotion. Considering specific user groups, the emotion judgment comprehensive model should also be compatible with single-modality data of only voice or only images. After the user's emotion is judged, it is associated with the user's identity according to the user's face recognition result, the user's historical emotion changes are tracked, and corresponding feedback is made according to the changes in the user's emotion.
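A sketch of the face-emotion association and tracking mechanism, assuming a simple in-memory record store (a real deployment would persist this at the cloud, and the names are illustrative):

```python
from collections import defaultdict
from datetime import datetime
from typing import Optional

# user identity (from the face identity recognition model) -> time-ordered emotion records
emotion_history = defaultdict(list)

def record_emotion(user_id: str, category: str, intensity: float,
                   timestamp: Optional[datetime] = None) -> None:
    """Associate a fused emotion result with the recognized identity and store it."""
    emotion_history[user_id].append((timestamp or datetime.now(), category, intensity))

def recent_records(user_id: str, n: int = 3):
    """Return the last n records so the strategy can judge how the emotion is changing."""
    return emotion_history[user_id][-n:]

for degree in (4, 5, 6):                       # an illustrative three-day history
    record_emotion("xiaoming", "aggrieved", degree)
print(recent_records("xiaoming"))
```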
As shown in connection with fig. 3, when the user interacts with the smart device, the flow of the digital twin multimodal device control method in this alternative example may include the following steps:
step 1, when a human body sensor of the intelligent equipment recognizes that the distance from a user to the equipment is smaller than a certain threshold value, the intelligent equipment starts a camera.
And 2, when the user satisfies a specific condition (for example, the camera detects that the ratio of the area of the user's face to the image area meets a certain threshold, and the user's face and fingers appear in the camera at the same time), the smart device uses the camera to take a picture of the user; a complete picture of the user's face may be required. In addition, if the smart device detects that the user is speaking, steps 6 to 7 are triggered at the same time.
And 3, the intelligent equipment identifies and segments the face area, the hand area and the human body trunk area of the captured whole body image of the user to obtain three pictures of the face image, the two hand image and the whole body image of the user.
And 4, simultaneously transmitting the facial image to a face recognition model and a face emotion recognition model at the cloud, and obtaining the identity of the user, the facial expression label and the confidence coefficient of the label through the face recognition model.
And 5, respectively transmitting the whole-body picture and the two-hand picture to a body action recognition model and a finger action recognition model at the cloud end to obtain a body action label of the user and the gesture action of the user, and further determining the emotional intensity of the user.
And 6, when the intelligent equipment detects that the user speaks, acquiring a section of voice spoken by the user.
And 7, recognizing the emotion type and the intensity of the user by the received voice through the voice emotion recognition model at the cloud.
And 8, correcting the emotion and degree of the current user according to the emotion judgment comprehensive model strategy by combining the user's identity, facial expression label, body action recognition result and finger action recognition result with the emotion category and degree of the user's voice.
And 9, inquiring the historical emotional state and degree of the user in a past continuous period of time from the cloud according to the user identity.
And step 10, judging whether the emotion of the user changes to the positive direction or not according to the current and historical emotion state labels and the degrees of the user.
And step 11, if the user's negative emotion deepens, soothing and interaction with the user are strengthened by means such as guided communication; if the user's negative degree is greater than a certain threshold and lasts for a period of time, the user is advised to consult a professional. If the user's negative emotion is relieved, normal emotion soothing means are used.
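The eleven steps can be condensed into the following sketch; every helper below is a trivial stand-in returning example values, since the real recognition models run in the cloud as described, and all names and thresholds are assumptions:

```python
# Trivial stand-ins returning example values; real calls go to the cloud-side models.
def segment_user_image(frame): return "face_img", "hands_img", "body_img"        # step 3
def recognize_face(img): return "xiaoming", "aggrieved", 0.8                     # step 4
def recognize_body_action(img): return ("shoulders_hunched", 7)                  # step 5
def recognize_finger_action(img): return ("fist", 8)                             # step 5
def recognize_speech_emotion(audio): return ("aggrieved", 7, 0.9)                # step 7
def query_emotion_history(user_id): return [("aggrieved", 4), ("aggrieved", 5), ("aggrieved", 6)]

def judge_emotion(face_label, face_conf, body, fingers, speech):                 # step 8
    label = speech[0] if speech and speech[2] >= face_conf else face_label
    parts = [body[1], fingers[1]] + ([speech[1]] if speech else [])
    return label, sum(parts) / len(parts)

def choose_operation(current, history):                                          # steps 10-11
    return "strengthen soothing" if current[1] > history[-1][1] else "normal soothing"

def handle_interaction(distance_m, frame, speech_audio=None, distance_threshold=1.5):
    if distance_m > distance_threshold:          # step 1: user too far away, camera stays off
        return None
    face_img, hands_img, body_img = segment_user_image(frame)
    user_id, face_label, face_conf = recognize_face(face_img)
    body, fingers = recognize_body_action(body_img), recognize_finger_action(hands_img)
    speech = recognize_speech_emotion(speech_audio) if speech_audio else None    # step 6
    category, intensity = judge_emotion(face_label, face_conf, body, fingers, speech)
    history = query_emotion_history(user_id)                                     # step 9
    return choose_operation((category, intensity), history)

print(handle_interaction(0.8, frame="<whole-body image>", speech_audio=b"..."))
```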
By the optional example, the emotion of the user is recognized through data of two modes, namely the image (human face + human body + finger) and the voice, so that the emotion abnormality of the user can be found even if the user does not speak, and the emotion relieving capability of the AI is improved; in a multi-person scene, associating user identities and user emotions through a face emotion association and user emotion tracking mechanism, and giving corresponding emotion appeasing feedback to users in different emotion states; for a single user, historical emotion changes of the user are correlated through face emotion correlation and a user emotion tracking mechanism, corresponding emotion soothing measures are given for different emotion changes, and the generation of psychological abnormality is reduced.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (e.g., a ROM (Read-Only Memory)/RAM (Random Access Memory), a magnetic disk, an optical disk) and includes several instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the methods according to the embodiments of the present application.
According to another aspect of the embodiments of the present application, there is also provided a digital twin multimodal apparatus control device for implementing the above-described digital twin multimodal apparatus control method. Fig. 4 is a block diagram of an alternative digital twin multi-modality device control apparatus according to an embodiment of the present application, and as shown in fig. 4, the apparatus may include:
a first recognition unit 402, configured to perform emotion state recognition on a target object based on a target object image and/or object voice data, to obtain a current emotion state of the target object;
a first obtaining unit 404, connected to the first identifying unit 402, for obtaining a historical emotional state of the target object in a historical time period if the current emotional state belongs to a negative state;
a determining unit 406, connected to the first obtaining unit 404, configured to determine, according to the current emotional state and the historical emotional state, a target device operation to be performed by the target smart device;
and a control unit 408, connected to the determination unit 406, for controlling the target smart device to perform the target device operation.
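As a rough sketch, the four units of Fig. 4 can be composed as follows (the class and parameter names are illustrative assumptions, not an API defined by the application):

```python
class DigitalTwinMultimodalController:
    """Composes the four units of Fig. 4; names and wiring are illustrative only."""

    NEGATIVE = {"sad", "depressed", "aggrieved", "angry"}

    def __init__(self, recognize, get_history, decide, execute):
        self.recognize = recognize        # first recognition unit 402
        self.get_history = get_history    # first obtaining unit 404
        self.decide = decide              # determining unit 406
        self.execute = execute            # control unit 408

    def handle(self, image=None, speech=None):
        user_id, category, intensity = self.recognize(image, speech)      # S202
        if category not in self.NEGATIVE:
            return None                   # history is only consulted for negative states
        history = self.get_history(user_id)                               # S204
        operation = self.decide((category, intensity), history)           # S206
        self.execute(operation)                                           # S208
        return operation
```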
It should be noted that the first identifying unit 402 in this embodiment may be configured to execute the step S202, the first obtaining unit 404 in this embodiment may be configured to execute the step S204, the determining unit 406 in this embodiment may be configured to execute the step S206, and the control unit 408 in this embodiment may be configured to execute the step S208.
Through the module, emotion state recognition is carried out on the target object based on the target object image and/or the object voice data, and the current emotion state of the target object is obtained; under the condition that the current emotional state belongs to a passive state, acquiring a historical emotional state of the target object in a historical time period; determining target equipment operation to be executed by the target intelligent equipment according to the current emotional state and the historical emotional state; the method for controlling the target intelligent device to execute the target device operation solves the problem that the sustainability of the device operation is poor due to the fact that the feedback information of the user cannot be timely obtained in the method for controlling the device to execute the device operation based on the recognized emotion category in the related art, and improves the sustainability of the device operation.
In an exemplary embodiment, the apparatus further includes:
a second acquisition unit, configured to acquire the target object image acquired by the image acquisition component before performing emotional state recognition on the target object based on the target object image and/or the object voice data to obtain a current emotional state of the target object, where the target object image is an image including an object face of the target object;
and the third acquisition unit is used for acquiring the object voice data acquired by the voice acquisition component, wherein the object voice data is the voice data sent by the target object.
In an exemplary embodiment, the apparatus further includes:
a second recognition unit for performing face region recognition on the target object image after acquiring the target object image acquired by the image acquisition part to obtain a group of face images;
and a third identification unit, configured to perform object identification on each of a set of face images to obtain object information of a set of objects, where the set of objects includes the target object.
In an exemplary embodiment, the apparatus further includes:
the adjusting unit is used for adjusting the image acquisition part from a closed state to an open state under the condition that the distance between the target object and the target intelligent equipment is smaller than or equal to a preset distance threshold before the emotion state of the target object is identified based on the image and/or the voice data of the target object to obtain the current emotion state of the target object;
and the acquisition unit is used for acquiring the image of the target object through the image acquisition component under the condition that the target part of the target object in the image to be acquired of the image acquisition component meets the preset condition, so as to obtain the image of the target object.
In one exemplary embodiment, the acquisition unit includes:
the first acquisition module is used for acquiring an image of the target object through the image acquisition component under the condition that the target part of the target object in the image to be acquired comprises an object face and the ratio of the area of the object face to the area of the image to be acquired is greater than or equal to the target ratio to obtain a target object image; alternatively, the first and second electrodes may be,
and the second acquisition module is used for acquiring the image of the target object through the image acquisition component under the condition that the target part of the target object in the image to be acquired comprises the face and the hand of the target object, so as to obtain the image of the target object.
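A minimal sketch of the preset capture condition used by the two acquisition modules above (the 0.1 target ratio is an assumed example value):

```python
def should_capture(face_area: float, image_area: float,
                   face_detected: bool, hand_detected: bool,
                   target_ratio: float = 0.1) -> bool:
    """Capture if the face fills enough of the frame, or if face and hand are both visible."""
    if face_detected and image_area > 0 and face_area / image_area >= target_ratio:
        return True
    return face_detected and hand_detected

# Face occupies 12% of the image to be acquired -> capture is triggered
print(should_capture(face_area=120_000, image_area=1_000_000,
                     face_detected=True, hand_detected=False))
```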
In one exemplary embodiment, the first recognition unit includes:
the first identification module is used for carrying out emotion category identification on the face image of the object to obtain the current emotion category of the target object, wherein the current emotion state comprises the current emotion category;
the second recognition module is used for carrying out part action recognition on the part image of the preset part under the condition that the target object image also comprises the part image of the preset part of the target object to obtain a target part action;
and the first determination module is used for determining the emotion intensity matched with the action of the target part as the current emotion intensity of the target object, wherein the current emotion state further comprises the current emotion intensity.
In one exemplary embodiment, the preset parts comprise a subject hand and a subject torso, and the target part motion comprises a target hand motion and a target body motion; the first determining module includes:
and the first determination submodule is used for determining the weighted sum of the emotion intensity matched with the target hand action and the emotion intensity matched with the target body action as the current emotion intensity of the target object.
In one exemplary embodiment, the first recognition unit includes:
the third recognition module is used for respectively recognizing the emotional states of the target object based on the target object image and the target voice data to obtain various emotional states;
and the fusion module is used for carrying out emotional state fusion on the plurality of emotional states to obtain the current emotional state of the target object.
In one exemplary embodiment, the third identifying module includes:
the first recognition submodule is used for carrying out emotion state recognition on a target face image of a target object to obtain a first emotion state, wherein the first emotion state comprises a first emotion category, and the target object image comprises the target face image;
and the second identification submodule is used for carrying out emotion state identification on the object voice data to obtain a second emotion state, wherein the second emotion state comprises a second emotion category.
In one exemplary embodiment, the fusion module includes:
a second determination sub-module configured to determine, as the current emotion category, any one of the first emotion category and the second emotion category when the first emotion category and the second emotion category are consistent;
the third determining submodule is used for determining the emotion category with high confidence coefficient in the first emotion category and the second emotion category as the current emotion category under the condition that the first emotion category and the second emotion category are inconsistent; or determining the emotion category matched with the part state of the preset part of the target object in the first emotion category and the second emotion category as the current emotion category.
In an exemplary embodiment, the current emotional state further comprises a current emotional intensity of the target object; the fusion module further comprises:
the fourth determining submodule is used for determining the first emotion intensity in the second emotion state as the current emotion intensity under the condition that the first emotion category is consistent with the second emotion category; alternatively,
and a fifth determining sub-module, configured to determine, as the current emotional intensity, a weighted sum of a first emotional intensity and a second emotional intensity when the target object image further includes a part image of a preset part of the target object, where the first emotional intensity is the emotional intensity included in the second emotional state, and the second emotional intensity is the emotional intensity recognized from the part image of the preset part.
In one exemplary embodiment, the current emotional state further comprises a current emotional intensity of the target object; the fusion module further comprises:
a sixth determining sub-module, configured to determine, as the current emotion intensity, an emotion intensity recognized from the part image of the preset part when the first emotion category and the second emotion category are inconsistent, the current emotion category is the first emotion category, and the target object image further includes a part image of the preset part of the target object;
a seventh determining sub-module, configured to determine, as the current emotion intensity, the first emotion intensity in the second emotion state when the first emotion category and the second emotion category are inconsistent and the current emotion category is the second emotion category;
and an eighth determining sub-module, configured to determine, as the current emotion intensity, a weighted sum of the first emotion intensity and the second emotion intensity when the first emotion category and the second emotion category are inconsistent, the current emotion category is the second emotion category, and the target object image further includes a part image of a preset part of the target object, where the first emotion intensity is an emotion intensity included in the second emotion state, and the second emotion intensity is an emotion intensity recognized from the part image of the preset part.
In one exemplary embodiment, the determining unit further includes:
the second determination module is used for determining that the device operation to be executed by the target intelligent device is the first device operation under the condition that the target object is in the negative emotion and the emotional intensity of the negative emotion is reduced according to the current emotional state and the historical emotional state;
a third determining module, configured to determine, when it is determined that the target object is in a negative emotion and the emotional intensity of the negative emotion is enhanced according to the current emotional state and the historical emotional state, that a device operation to be performed by the target smart device is a second device operation;
and the fourth determination module is used for determining that the equipment operation to be executed by the target intelligent equipment is the third equipment operation under the condition that the target object is determined to be in the negative emotion according to the current emotional state and the historical emotional state, and the duration of the negative emotion greater than or equal to the preset intensity threshold reaches the preset duration threshold.
According to still another aspect of an embodiment of the present application, there is also provided a storage medium. Alternatively, in this embodiment, the storage medium may be configured to execute a program code of any one of the digital twin multimodal apparatus control methods described in the embodiments of the present application.
Optionally, in this embodiment, the storage medium may be located on at least one of a plurality of network devices in a network shown in the above embodiment.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps:
s1, recognizing the emotion state of the target object based on the target object image and/or the object voice data to obtain the current emotion state of the target object;
s2, acquiring the historical emotional state of the target object in the historical time period under the condition that the current emotional state belongs to the negative state;
s3, determining target equipment operation to be executed by the target intelligent equipment according to the current emotional state and the historical emotional state;
and S4, controlling the target intelligent device to execute the target device operation.
Optionally, the specific example in this embodiment may refer to the example described in the above embodiment, which is not described again in this embodiment.
Optionally, in this embodiment, the storage medium may include, but is not limited to: various media capable of storing program codes, such as a U disk, a ROM, a RAM, a removable hard disk, a magnetic disk, or an optical disk.
According to still another aspect of the embodiments of the present application, there is also provided an electronic apparatus for implementing the above digital twin multimodal device control method, which may be a server, a terminal, or a combination thereof.
Fig. 5 is a block diagram of an alternative electronic device according to an embodiment of the present application, as shown in fig. 5, including a processor 502, a communication interface 504, a memory 506, and a communication bus 508, wherein the processor 502, the communication interface 504, and the memory 506 are communicated with each other via the communication bus 508, and wherein,
a memory 506 for storing a computer program;
the processor 502, when executing the computer program stored in the memory 506, implements the following steps:
s1, recognizing the emotion state of the target object based on the target object image and/or the object voice data to obtain the current emotion state of the target object;
s2, acquiring the historical emotional state of the target object in the historical time period under the condition that the current emotional state belongs to the negative state;
s3, determining target equipment operation to be executed by the target intelligent equipment according to the current emotional state and the historical emotional state;
and S4, controlling the target intelligent device to execute the target device operation.
Alternatively, the communication bus may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 5, but this is not intended to represent only one bus or type of bus. The communication interface is used for communication between the electronic device and other equipment.
The memory may include RAM, and may also include non-volatile memory (non-volatile memory), such as at least one disk memory. Alternatively, the memory may be at least one memory device located remotely from the processor.
As an example, the memory 506 may include, but is not limited to, the first recognition unit 402, the first obtaining unit 404, the determining unit 406, and the control unit 408 of the digital twin multimodal apparatus control method device. In addition, other module units in the digital twin multimodal device control apparatus can be included, but not limited to, and are not described in detail in this example.
The processor may be a general-purpose processor, and may include but is not limited to: a CPU (Central Processing Unit), an NP (Network Processor), and the like; but also a DSP (Digital Signal Processing), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component.
Optionally, the specific examples in this embodiment may refer to the examples described in the above embodiments, and this embodiment is not described herein again.
It can be understood by those skilled in the art that the structure shown in fig. 5 is only an illustration, and the device implementing the digital twin multi-modal device control method may be a terminal device, and the terminal device may be a terminal device such as a smart phone (e.g., an Android phone, an iOS phone, etc.), a tablet computer, a palm computer, a Mobile Internet Device (MID), a PAD, and the like. Fig. 5 is a diagram illustrating a structure of the electronic device. For example, the electronic device may also include more or fewer components (e.g., network interfaces, display devices, etc.) than shown in FIG. 5, or have a different configuration than shown in FIG. 5.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing hardware associated with the terminal device, where the program may be stored in a computer-readable storage medium, and the storage medium may include: flash disk, ROM, RAM, magnetic or optical disk, and the like.
The above-mentioned serial numbers of the embodiments of the present application are merely for description, and do not represent the advantages and disadvantages of the embodiments.
The integrated unit in the above embodiments, if implemented in the form of a software functional unit and sold or used as a separate product, may be stored in the above computer-readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or a part of or all or part of the technical solution contributing to the prior art may be embodied in the form of a software product stored in a storage medium, and including instructions for causing one or more computer devices (which may be personal computers, servers, network devices, or the like) to execute all or part of the steps of the method described in the embodiments of the present application.
In the above embodiments of the present application, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed client may be implemented in other manners. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed coupling or direct coupling or communication connection between each other may be an indirect coupling or communication connection through some interfaces, units or modules, and may be electrical or in other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, and may also be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution provided in the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or at least two units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The foregoing is only a preferred embodiment of the present application and it should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the present application, and these improvements and modifications should also be considered as the protection scope of the present application.

Claims (15)

1. A digital twin multimodal device control method, comprising:
performing emotion state recognition on a target object based on a target object image and/or object voice data, to obtain a current emotion state of the target object;
under the condition that the current emotion state belongs to a negative state, acquiring a historical emotion state of the target object in a historical time period;
determining a target device operation to be executed by a target smart device according to the current emotion state and the historical emotion state;
and controlling the target smart device to execute the target device operation.
2. The method of claim 1, wherein before the performing emotion state recognition on the target object based on the target object image and/or the object voice data to obtain the current emotion state of the target object, the method further comprises:
acquiring the target object image captured by an image acquisition component, wherein the target object image is an image containing a target face of the target object;
and acquiring the object voice data captured by a voice acquisition component, wherein the object voice data is voice data uttered by the target object.
3. The method of claim 2, wherein after the acquiring the target object image acquired by the image acquisition component, the method further comprises:
performing face region recognition on the target object image to obtain a group of face images;
and performing object recognition on each face image in the group of face images respectively, to obtain object information of a group of objects, wherein the group of objects comprises the target object.
4. The method of claim 1, wherein before the performing emotion state recognition on the target object based on the target object image and/or the object voice data to obtain the current emotion state of the target object, the method further comprises:
under the condition that a distance between the target object and the target smart device is smaller than or equal to a preset distance threshold, switching an image acquisition component from an off state to an on state;
and under the condition that a target part of the target object located in an image to be acquired by the image acquisition component satisfies a preset condition, performing image acquisition on the target object through the image acquisition component to obtain the target object image.
5. The method of claim 4, wherein the performing image acquisition on the target object through the image acquisition component to obtain the target object image under the condition that the target part of the target object located in the image to be acquired by the image acquisition component satisfies the preset condition comprises:
under the condition that the target part of the target object located in the image to be acquired comprises an object face, and a ratio of an area of the object face to an area of the image to be acquired is greater than or equal to a target ratio, performing image acquisition on the target object through the image acquisition component to obtain the target object image; or,
under the condition that the target part of the target object located in the image to be acquired comprises an object face and an object hand, performing image acquisition on the target object through the image acquisition component to obtain the target object image.
6. The method of claim 1, wherein the target object image comprises an object face image of the target object, and the performing emotion state recognition on the target object based on the target object image and/or the object voice data to obtain the current emotion state of the target object comprises:
performing emotion category recognition on the object face image to obtain a current emotion category of the target object, wherein the current emotion state comprises the current emotion category;
under the condition that the target object image further comprises a part image of a preset part of the target object, performing part action recognition on the part image of the preset part to obtain a target part action;
and determining an emotion intensity matched with the target part action as a current emotion intensity of the target object, wherein the current emotion state further comprises the current emotion intensity.
7. The method of claim 6, wherein the preset part comprises an object hand and an object torso, and the target part action comprises a target hand action and a target body action; and the determining the emotion intensity matched with the target part action as the current emotion intensity of the target object comprises:
determining a weighted sum of an emotion intensity matched with the target hand action and an emotion intensity matched with the target body action as the current emotion intensity of the target object.
8. The method of claim 1, wherein the performing emotion state recognition on the target object based on the target object image and/or the object voice data to obtain the current emotion state of the target object comprises:
performing emotion state recognition on the target object based on the target object image and the object voice data respectively, to obtain a plurality of emotion states;
and performing emotion state fusion on the plurality of emotion states to obtain the current emotion state of the target object.
9. The method of claim 8, wherein the performing emotion state recognition on the target object based on the target object image and the object voice data respectively to obtain the plurality of emotion states comprises:
performing emotion state recognition on an object face image of the target object to obtain a first emotion state, wherein the target object image comprises the object face image, and the first emotion state comprises a first emotion category;
and performing emotion state recognition on the object voice data to obtain a second emotion state, wherein the second emotion state comprises a second emotion category.
10. The method of claim 9, wherein the current emotion state comprises a current emotion category of the target object, and the performing emotion state fusion on the plurality of emotion states to obtain the current emotion state of the target object comprises:
under the condition that the first emotion category is consistent with the second emotion category, determining either of the first emotion category and the second emotion category as the current emotion category;
and under the condition that the first emotion category is inconsistent with the second emotion category, determining, as the current emotion category, the one of the first emotion category and the second emotion category that has the higher confidence; or, determining, as the current emotion category, the one of the first emotion category and the second emotion category that matches a part state of a preset part of the target object.
11. The method of claim 9, wherein the current emotion state further comprises a current emotion intensity of the target object, and the performing emotion state fusion on the plurality of emotion states to obtain the current emotion state of the target object further comprises:
under the condition that the first emotion category is consistent with the second emotion category, determining a first emotion intensity in the second emotion state as the current emotion intensity; or,
under the condition that the target object image further comprises a part image of a preset part of the target object, determining a weighted sum of the first emotion intensity and a second emotion intensity as the current emotion intensity, wherein the first emotion intensity is the emotion intensity contained in the second emotion state, and the second emotion intensity is the emotion intensity recognized from the part image of the preset part.
12. The method of claim 9, wherein the current emotion state further comprises a current emotion intensity of the target object, and the performing emotion state fusion on the plurality of emotion states to obtain the current emotion state of the target object further comprises:
under the condition that the first emotion category is inconsistent with the second emotion category, the current emotion category is the first emotion category, and the target object image further comprises a part image of a preset part of the target object, determining the emotion intensity recognized from the part image of the preset part as the current emotion intensity;
under the condition that the first emotion category is inconsistent with the second emotion category and the current emotion category is the second emotion category, determining a first emotion intensity in the second emotion state as the current emotion intensity;
and under the condition that the first emotion category is inconsistent with the second emotion category, the current emotion category is the second emotion category, and the target object image further comprises a part image of a preset part of the target object, determining a weighted sum of the first emotion intensity and a second emotion intensity as the current emotion intensity, wherein the first emotion intensity is the emotion intensity contained in the second emotion state, and the second emotion intensity is the emotion intensity recognized from the part image of the preset part.
13. The method of any one of claims 1 to 12, wherein the determining the target device operation to be executed by the target smart device according to the current emotion state and the historical emotion state comprises:
under the condition that it is determined, according to the current emotion state and the historical emotion state, that the target object is in a negative emotion and the emotion intensity of the negative emotion is decreasing, determining the device operation to be executed by the target smart device as a first device operation;
under the condition that it is determined, according to the current emotion state and the historical emotion state, that the target object is in a negative emotion and the emotion intensity of the negative emotion is increasing, determining the device operation to be executed by the target smart device as a second device operation;
and under the condition that it is determined, according to the current emotion state and the historical emotion state, that the target object is in a negative emotion and a duration for which the emotion intensity of the negative emotion remains greater than or equal to a preset intensity threshold reaches a preset duration threshold, determining the device operation to be executed by the target smart device as a third device operation.
14. A computer-readable storage medium, comprising a stored program, wherein the program, when executed, performs the method of any one of claims 1 to 13.
15. An electronic apparatus, comprising a memory and a processor, wherein the memory stores a computer program, and the processor is configured to execute the method of any one of claims 1 to 13 by means of the computer program.
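The control flow recited in claim 1 can be summarized in a short sketch. Everything below is illustrative only: the emotion recognizer, the history store, the operation selector, and the device interface are hypothetical callables supplied by the caller, and the set of negative categories is an assumed example; none of these names come from the disclosure.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

NEGATIVE_CATEGORIES = {"sad", "angry", "anxious"}  # assumed label set, for illustration only


@dataclass
class EmotionState:
    category: str     # e.g. "sad"
    intensity: float  # normalized to [0, 1]


def run_control_step(image: object,
                     voice: object,
                     recognize: Callable[[object, object], EmotionState],
                     load_history: Callable[[], Sequence[EmotionState]],
                     choose_operation: Callable[[EmotionState, Sequence[EmotionState]], str],
                     execute: Callable[[str], None]) -> None:
    """Claim 1 flow: recognize the current emotion state; if negative, use the history to pick and run a device operation."""
    current = recognize(image, voice)
    if current.category not in NEGATIVE_CATEGORIES:
        return  # only a negative state triggers a device operation
    history = load_history()
    execute(choose_operation(current, history))


# Tiny demonstration with stubbed dependencies.
run_control_step(
    image=None, voice=None,
    recognize=lambda img, snd: EmotionState("sad", 0.8),
    load_history=lambda: [EmotionState("sad", 0.6)],
    choose_operation=lambda cur, hist: "play_soothing_music",
    execute=lambda op: print("executing:", op),
)
```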
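Claims 2 and 3 turn one captured frame into per-person information: first face regions, then object recognition on each face image. A minimal sketch with injected detector and identifier callables; the box format and return shapes are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, w, h), assumed face-region format


def identify_objects(frame: object,
                     detect_faces: Callable[[object], List[Box]],
                     identify_face: Callable[[object, Box], Dict[str, str]]) -> List[Dict[str, str]]:
    """Claim 3: face-region recognition on the frame, then object recognition on each face image."""
    face_boxes = detect_faces(frame)  # the group of face images, represented here as regions
    return [identify_face(frame, box) for box in face_boxes]


# Demonstration with stubbed detector and identifier.
people = identify_objects(
    frame=None,
    detect_faces=lambda f: [(10, 10, 64, 64), (120, 15, 60, 60)],
    identify_face=lambda f, box: {"object_id": f"user_at_{box[0]}_{box[1]}"},
)
print(people)  # [{'object_id': 'user_at_10_10'}, {'object_id': 'user_at_120_15'}]
```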
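The capture trigger of claims 4 and 5 can be expressed as a small rule: capture only when the detected face fills a sufficient share of the frame, or when a face and a hand are both present. This is a sketch under assumed inputs; the detection structure and the default target ratio are illustrative, not values from the disclosure.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Detection:
    """Assumed detector output: a labeled box in pixel coordinates."""
    label: str  # e.g. "face", "hand"
    x: int
    y: int
    w: int
    h: int


def should_capture(detections: List[Detection], frame_w: int, frame_h: int,
                   target_ratio: float = 0.05) -> bool:
    """Claim 5 conditions: a face covering >= target_ratio of the frame, or a face plus a hand."""
    faces = [d for d in detections if d.label == "face"]
    hands = [d for d in detections if d.label == "hand"]
    if not faces:
        return False
    if hands:
        return True  # object face and object hand both present
    frame_area = frame_w * frame_h
    return any((d.w * d.h) / frame_area >= target_ratio for d in faces)


if __name__ == "__main__":
    dets = [Detection("face", 100, 80, 200, 300)]
    print(should_capture(dets, 1280, 720))  # True: 200*300 / (1280*720) is about 0.065 >= 0.05
```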
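Claim 7 combines the intensity matched to the hand action and the intensity matched to the body action as a weighted sum. The sketch below shows only that arithmetic; the action-to-intensity tables and the 0.4/0.6 weights are invented for illustration.

```python
# Hypothetical lookup tables mapping recognized part actions to an emotion intensity in [0, 1].
HAND_ACTION_INTENSITY = {"clenched_fist": 0.9, "open_palm": 0.3, "none": 0.0}
BODY_ACTION_INTENSITY = {"slumped": 0.7, "upright": 0.2, "pacing": 0.8}

# Illustrative weights; the claim only requires a weighted sum, not these particular values.
HAND_WEIGHT, BODY_WEIGHT = 0.4, 0.6


def current_emotion_intensity(hand_action: str, body_action: str) -> float:
    """Claim 7: weighted sum of the hand-action intensity and the body-action intensity."""
    hand = HAND_ACTION_INTENSITY.get(hand_action, 0.0)
    body = BODY_ACTION_INTENSITY.get(body_action, 0.0)
    return HAND_WEIGHT * hand + BODY_WEIGHT * body


print(current_emotion_intensity("clenched_fist", "pacing"))  # 0.4*0.9 + 0.6*0.8, about 0.84
```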
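Claims 9 to 12 describe how the face-based and voice-based results are fused into one category and one intensity. The sketch below implements those branches in simplified form, assuming each modality is reduced to a (category, confidence, intensity) triple; the equal 0.5/0.5 weighting of the voice intensity and the part-image intensity, and the fallback when no part image exists, are assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ModalEmotion:
    category: str
    confidence: float
    intensity: float


def fuse(face: ModalEmotion, voice: ModalEmotion,
         part_intensity: Optional[float] = None,
         w_voice: float = 0.5, w_part: float = 0.5) -> Tuple[str, float]:
    """Return (current emotion category, current emotion intensity) per claims 10-12, simplified."""
    if face.category == voice.category:
        # Claims 10 and 11: same category, so take it; intensity comes from the voice (second) state,
        # optionally blended with the part-image intensity when a part image is available.
        if part_intensity is None:
            return face.category, voice.intensity
        return face.category, w_voice * voice.intensity + w_part * part_intensity

    # Claim 10: categories differ, so keep the higher-confidence category.
    category = face.category if face.confidence >= voice.confidence else voice.category
    # Claim 12: the intensity depends on which category won and whether a part image exists.
    if category == face.category:
        intensity = part_intensity if part_intensity is not None else face.intensity
    elif part_intensity is None:
        intensity = voice.intensity
    else:
        intensity = w_voice * voice.intensity + w_part * part_intensity
    return category, intensity


print(fuse(ModalEmotion("sad", 0.8, 0.6), ModalEmotion("angry", 0.6, 0.7), part_intensity=0.5))
# ('sad', 0.5): the face category wins on confidence and the part-image intensity is used.
```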
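Claim 13 selects among three device operations from the trend of the negative emotion. A minimal sketch, assuming the history is a time-ordered list of intensity samples taken at a fixed interval; the thresholds, the sampling interval, and the concrete operations named in the comments are placeholders, not values from the disclosure.

```python
from typing import Sequence


def decide_operation(current_intensity: float,
                     history_intensities: Sequence[float],
                     sample_interval_min: float = 10.0,
                     intensity_threshold: float = 0.7,
                     duration_threshold_min: float = 60.0) -> str:
    """Map the negative-emotion trend to one of three device operations (claim 13, simplified)."""
    samples = list(history_intensities) + [current_intensity]

    # Third operation: the intensity has stayed at or above the threshold for long enough.
    sustained = 0.0
    for value in reversed(samples):
        if value >= intensity_threshold:
            sustained += sample_interval_min
        else:
            break
    if sustained >= duration_threshold_min:
        return "third_device_operation"   # e.g. escalate to a stronger comforting action (placeholder)

    # First vs second operation: is the negative emotion easing or worsening?
    previous = samples[-2] if len(samples) >= 2 else current_intensity
    if current_intensity < previous:
        return "first_device_operation"   # e.g. keep the current light music playing (placeholder)
    return "second_device_operation"      # e.g. dim the lights and play soothing audio (placeholder)


history = [0.4, 0.5, 0.65]
print(decide_operation(0.8, history))  # "second_device_operation": the intensity is rising
```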
CN202210601439.4A 2022-05-30 2022-05-30 Digital twin multimodal device control method, storage medium, and electronic apparatus Pending CN115047824A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210601439.4A CN115047824A (en) 2022-05-30 2022-05-30 Digital twin multimodal device control method, storage medium, and electronic apparatus

Publications (1)

Publication Number Publication Date
CN115047824A true CN115047824A (en) 2022-09-13

Family

ID=83159710

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210601439.4A Pending CN115047824A (en) 2022-05-30 2022-05-30 Digital twin multimodal device control method, storage medium, and electronic apparatus

Country Status (1)

Country Link
CN (1) CN115047824A (en)

Patent Citations (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102968643A (en) * 2012-11-16 2013-03-13 华中科技大学 Multi-mode emotion recognition method based on Lie group theory
CN103024521A (en) * 2012-12-27 2013-04-03 深圳Tcl新技术有限公司 Program screening method, program screening system and television with program screening system
CN106650633A (en) * 2016-11-29 2017-05-10 上海智臻智能网络科技股份有限公司 Driver emotion recognition method and device
US20200121888A1 (en) * 2016-12-27 2020-04-23 Honda Motor Co., Ltd. Emotion improving apparatus and emotion improving method
CN107272607A (en) * 2017-05-11 2017-10-20 上海斐讯数据通信技术有限公司 A kind of intelligent home control system and method
CN108039988A (en) * 2017-10-31 2018-05-15 珠海格力电器股份有限公司 Equipment control process method and device
CN110085221A (en) * 2018-01-26 2019-08-02 上海智臻智能网络科技股份有限公司 Speech emotional exchange method, computer equipment and computer readable storage medium
CN108253596A (en) * 2018-01-31 2018-07-06 广东美的制冷设备有限公司 Air-conditioning method, air conditioner and computer readable storage medium
CN109684978A (en) * 2018-12-18 2019-04-26 深圳壹账通智能科技有限公司 Employees'Emotions monitoring method, device, computer equipment and storage medium
CN109871807A (en) * 2019-02-21 2019-06-11 百度在线网络技术(北京)有限公司 Face image processing process and device
CN110289000A (en) * 2019-05-27 2019-09-27 北京蓦然认知科技有限公司 A kind of audio recognition method, device
CN110262665A (en) * 2019-06-26 2019-09-20 北京百度网讯科技有限公司 Method and apparatus for output information
CN111125533A (en) * 2019-12-26 2020-05-08 珠海格力电器股份有限公司 Menu recommendation method and device and computer readable storage medium
CN113128534A (en) * 2019-12-31 2021-07-16 北京中关村科金技术有限公司 Method, device and storage medium for emotion recognition
CN111401198A (en) * 2020-03-10 2020-07-10 广东九联科技股份有限公司 Audience emotion recognition method, device and system
CN114305325A (en) * 2020-09-30 2022-04-12 华为云计算技术有限公司 Emotion detection method and device
CN112151034A (en) * 2020-10-14 2020-12-29 珠海格力电器股份有限公司 Voice control method and device of equipment, electronic equipment and storage medium
CN112764352A (en) * 2020-12-21 2021-05-07 深圳创维-Rgb电子有限公司 Household environment adjusting method and device, server and storage medium
CN113270087A (en) * 2021-05-26 2021-08-17 深圳传音控股股份有限公司 Processing method, mobile terminal and storage medium
CN113390170A (en) * 2021-06-08 2021-09-14 青岛海尔空调器有限总公司 Method and device for controlling air conditioner and air conditioner
CN114283804A (en) * 2021-12-24 2022-04-05 珠海格力电器股份有限公司 Control method and device for household appliance to output audio-visual information, storage medium and processor

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116909159A (en) * 2023-01-17 2023-10-20 广东维锐科技股份有限公司 Intelligent home control system and method based on mood index

Similar Documents

Publication Publication Date Title
CN111699528B (en) Electronic device and method for executing functions of electronic device
US20210081056A1 (en) Vpa with integrated object recognition and facial expression recognition
US10692495B2 (en) Method of receiving commands for activating voice-recognition service and electronic device for implementing same
US20220317641A1 (en) Device control method, conflict processing method, corresponding apparatus and electronic device
CN112513833A (en) Electronic device and method for providing artificial intelligence service based on presynthesized dialog
CN108363706A (en) The method and apparatus of human-computer dialogue interaction, the device interacted for human-computer dialogue
US11900959B2 (en) Speech emotion recognition method and apparatus
US10657960B2 (en) Interactive system, terminal, method of controlling dialog, and program for causing computer to function as interactive system
US20190325224A1 (en) Electronic device and method for controlling the electronic device thereof
CN108469772B (en) Control method and device of intelligent equipment
EP3923198A1 (en) Method and apparatus for processing emotion information
KR102515023B1 (en) Electronic apparatus and control method thereof
WO2018006374A1 (en) Function recommending method, system, and robot based on automatic wake-up
US20240095143A1 (en) Electronic device and method for controlling same
US10836044B2 (en) Robot control device and robot control method
CN112185389A (en) Voice generation method and device, storage medium and electronic equipment
US20210151154A1 (en) Method for personalized social robot interaction
KR20220078614A (en) Systems and Methods for Prediction and Recommendation Using Collaborative Filtering
CN115047824A (en) Digital twin multimodal device control method, storage medium, and electronic apparatus
KR102511517B1 (en) Voice input processing method and electronic device supportingthe same
US20200279559A1 (en) Information processing apparatus, information processing method, and program
US20210166685A1 (en) Speech processing apparatus and speech processing method
CN113985741A (en) Method and device for controlling equipment operation, storage medium and electronic device
US20230017927A1 (en) Electronic apparatus and method for controlling thereof
CN116418611A (en) Interaction method and device of intelligent equipment, storage medium and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination