CN112183197A - Method and device for determining working state based on digital person and storage medium - Google Patents

Method and device for determining working state based on digital person and storage medium

Info

Publication number
CN112183197A
Authority
CN
China
Prior art keywords
target
state
working state
voice
working
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010847552.1A
Other languages
Chinese (zh)
Other versions
CN112183197B (en)
Inventor
常向月
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Zhuiyi Technology Co Ltd
Original Assignee
Shenzhen Zhuiyi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Zhuiyi Technology Co Ltd
Priority to CN202010847552.1A
Publication of CN112183197A
Application granted
Publication of CN112183197B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2218/00Aspects of pattern recognition specially adapted for signal processing
    • G06F2218/08Feature extraction

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The application relates to a method, a device and a storage medium for determining a working state based on a digital person. The method comprises the following steps: acquiring a target voice of a target user and a target image captured while the target voice is uttered; analyzing the working state based on the face features corresponding to the target image to obtain a first working state corresponding to the target user; analyzing the working state based on the voice features corresponding to the target voice to obtain a second working state corresponding to the target user; and determining a target working state corresponding to the target user by combining the first working state and the second working state. The digital person can be represented by a virtual image, and the virtual image can be controlled to issue working state prompt information according to the target working state corresponding to the target user, so as to remind the user.

Description

Method and device for determining working state based on digital person and storage medium
Technical Field
The present application relates to the field of information technology, and in particular, to a method and an apparatus for determining a working state based on a digital person, and a storage medium.
Background
With the development of science and technology, many aspects of a user's daily life can be intelligently monitored or supervised by smart devices, for example monitoring the user's activity over a day, such as how many steps have been taken or how long the user has rested.
However, in many cases manual supervision is still required; for example, an employee's working state is judged manually from how well work targets have been completed, which makes determining the working state inefficient.
Disclosure of Invention
In view of the above, it is necessary to provide a method, an apparatus and a storage medium for determining an operating state based on a digital person.
A digital person-based work state determination method, the method comprising: acquiring a target voice of a target user and a target image corresponding to the target voice when the target voice is sent out; analyzing the working state based on the face features corresponding to the target image to obtain a first working state corresponding to the target user; analyzing the working state based on the voice characteristics corresponding to the target voice to obtain a second working state corresponding to the target user; and determining a target working state corresponding to the target user by combining the first working state and the second working state.
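For illustration only, the following Python sketch mirrors the data flow of these steps with toy, rule-based stand-ins for the image-based analysis, the voice-based analysis and the fusion step; the function names, thresholds and rules are assumptions made for the example and are not part of the claimed method.

from typing import List

def analyze_face_state(eye_closed_sequence: List[bool]) -> str:
    # Toy image-based analyzer: a long run of closed-eye frames suggests fatigue.
    longest_run = current_run = 0
    for closed in eye_closed_sequence:
        current_run = current_run + 1 if closed else 0
        longest_run = max(longest_run, current_run)
    return "fatigued" if longest_run >= 3 else "normal"

def analyze_voice_state(speech_rate: float, intonation_trend: str) -> str:
    # Toy voice-based analyzer: slow speech with falling intonation suggests fatigue.
    return "fatigued" if speech_rate < 2.0 and intonation_trend == "falling" else "normal"

def fuse_states(first_state: str, second_state: str) -> str:
    # Toy fusion rule: report fatigue only when both modalities agree.
    return "fatigued" if first_state == second_state == "fatigued" else "normal"

first = analyze_face_state([False, True, True, True, False])
second = analyze_voice_state(speech_rate=1.5, intonation_trend="falling")
print(fuse_states(first, second))  # -> fatigued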
In some embodiments, the analyzing the working state based on the facial features corresponding to the target image to obtain the first working state corresponding to the target user includes: acquiring the facial features corresponding to the target image, and processing the facial features by using a trained expression recognition model to obtain a target expression corresponding to the target user; and analyzing the working state according to the target expression to obtain a first working state corresponding to the target user.
In some embodiments, the target images are multiple, and the analyzing the working state based on the facial features corresponding to the target images to obtain the first working state corresponding to the target user includes: acquiring feature point positions respectively corresponding to a plurality of eye key feature points corresponding to the target image, and obtaining a target closing state corresponding to the eyes of the target user in the target image based on the position difference between the feature point positions; sequencing the closed states of the targets according to the acquisition sequence corresponding to the target images to obtain a closed state sequence; and analyzing the working state according to the closed state sequence to obtain a first working state corresponding to the target user.
In some embodiments, the analyzing the working state based on the voice feature corresponding to the target voice to obtain the second working state corresponding to the target user includes: acquiring voice attribute information corresponding to the target voice, wherein the voice attribute information comprises at least one of voice speed information or voice tone change information; and analyzing the working state based on the voice attribute information to obtain a second working state corresponding to the target user.
In some embodiments, the method further comprises: performing semantic emotion analysis on the target voice to obtain a target semantic emotion corresponding to the target voice; the analyzing the working state based on the voice attribute information to obtain a second working state corresponding to the target user includes: and analyzing the working state based on the voice attribute information and the target semantic emotion to obtain a second working state corresponding to the target user.
In some embodiments, the determining the target operating state corresponding to the target user according to the first operating state and the second operating state includes: counting the number of fatigued states among the first working state and the second working state; and when the counted number is greater than a preset threshold value, or the proportion corresponding to the counted number is greater than a preset proportion, determining that the target working state corresponding to the target user is fatigued.
In some embodiments, the method further comprises: acquiring a virtual user image corresponding to the target user; and controlling the virtual user image to send out working state prompt information according to the target working state.
In some embodiments, the controlling the virtual user image to send out the working state prompt message according to the target working state includes: acquiring a face adjustment parameter corresponding to the first working state; and performing image adjustment on the virtual user image according to the face adjustment parameter, and controlling the virtual user image subjected to image adjustment to send out working state prompt information according to the target working state.
In some embodiments, the determining the target operating state corresponding to the target user in combination with the first operating state and the second operating state includes: and inputting the first working state and the second working state into a comprehensive state determination model to obtain a target working state corresponding to the target user.
In some embodiments, the first working state is obtained by processing the face features using a first state determination model, and the second working state is obtained by processing the speech features using a second state determination model, and the method further includes: acquiring a first training sample, wherein the first training sample comprises training face features and a corresponding first state label, training voice features and a corresponding second state label, and a comprehensive state label; inputting the training face features into a first state determination model to be trained to obtain a first prediction state; inputting the training voice features into a second state determination model to be trained to obtain a second prediction state; inputting the first prediction state and the second prediction state into a comprehensive state determination model to be trained to obtain a third prediction state; obtaining a target model loss value based on the state difference between the first state label and the first prediction state, the state difference between the second state label and the second prediction state, and the state difference between the comprehensive state label and the third prediction state; and adjusting model parameters of the first state determination model to be trained, the second state determination model to be trained and the comprehensive state determination model to be trained based on the target model loss value to obtain the first state determination model, the second state determination model and the comprehensive state determination model.
A digital person-based work state determination apparatus, the apparatus comprising: the image and voice acquisition module is used for acquiring a target voice of a target user and a target image corresponding to the target voice; the first working state obtaining module is used for carrying out working state analysis based on the face features corresponding to the target image to obtain a first working state corresponding to the target user; a second working state obtaining module, configured to perform working state analysis based on a voice feature corresponding to the target voice to obtain a second working state corresponding to the target user; and the target working state determining module is used for determining a target working state corresponding to the target user by combining the first working state and the second working state.
In some embodiments, the first operating state obtaining module is configured to: acquiring the facial features corresponding to the target image, and processing the facial features by using a trained expression recognition model to obtain a target expression corresponding to the target user; and analyzing the working state according to the target expression to obtain a first working state corresponding to the target user.
In some embodiments, the target image is a plurality of images, and the first operating state obtaining module is configured to: acquiring feature point positions respectively corresponding to a plurality of eye key feature points corresponding to the target image, and obtaining a target closing state corresponding to the eyes of the target user in the target image based on the position difference between the feature point positions; sequencing the closed states of the targets according to the acquisition sequence corresponding to the target images to obtain a closed state sequence; and analyzing the working state according to the closed state sequence to obtain a first working state corresponding to the target user.
In some embodiments, the second operating state obtaining module is configured to: acquiring voice attribute information corresponding to the target voice, wherein the voice attribute information comprises at least one of voice speed information or voice tone change information; and analyzing the working state based on the voice attribute information to obtain a second working state corresponding to the target user.
In some embodiments, the apparatus further comprises: the semantic emotion analysis module is used for performing semantic emotion analysis on the target voice to obtain a target semantic emotion corresponding to the target voice; the second operating state obtaining module is configured to: and analyzing the working state based on the voice attribute information and the target semantic emotion to obtain a second working state corresponding to the target user.
In some embodiments, the first working state is a working state corresponding to each of a plurality of analysis dimensions, the second working state is a working state corresponding to each of a plurality of analysis dimensions, and the target working state determination module is configured to: count the number of fatigued states among the first working states and the second working states; and when the counted number is greater than a preset threshold value, or the proportion corresponding to the counted number is greater than a preset proportion, determine that the target working state corresponding to the target user is fatigued.
In some embodiments, the apparatus further comprises: the virtual user image acquisition module is used for acquiring the virtual user image corresponding to the target user; and the working state prompt information sending module is used for controlling the virtual user image to send out the working state prompt information according to the target working state.
In some embodiments, the working state prompt message issuing module is configured to: acquiring a face adjustment parameter corresponding to the first working state; and performing image adjustment on the virtual user image according to the face adjustment parameter, and controlling the virtual user image subjected to image adjustment to send out working state prompt information according to the target working state.
In some embodiments, the target operating state determination module is to: and inputting the first working state and the second working state into a comprehensive state determination model to obtain a target working state corresponding to the target user.
In some embodiments, the first working state is obtained by processing the face features using a first state determination model, and the second working state is obtained by processing the speech features using a second state determination model, and the apparatus further includes a model training module configured to: acquire a first training sample, wherein the first training sample comprises training face features and a corresponding first state label, training voice features and a corresponding second state label, and a comprehensive state label; input the training face features into a first state determination model to be trained to obtain a first prediction state; input the training voice features into a second state determination model to be trained to obtain a second prediction state; input the first prediction state and the second prediction state into a comprehensive state determination model to be trained to obtain a third prediction state; obtain a target model loss value based on the state difference between the first state label and the first prediction state, the state difference between the second state label and the second prediction state, and the state difference between the comprehensive state label and the third prediction state; and adjust model parameters of the first state determination model to be trained, the second state determination model to be trained and the comprehensive state determination model to be trained based on the target model loss value to obtain the first state determination model, the second state determination model and the comprehensive state determination model.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program: acquiring a target voice of a target user and a target image corresponding to the target voice when the target voice is sent out; analyzing the working state based on the face features corresponding to the target image to obtain a first working state corresponding to the target user; analyzing the working state based on the voice characteristics corresponding to the target voice to obtain a second working state corresponding to the target user; and determining a target working state corresponding to the target user by combining the first working state and the second working state.
In some embodiments, the analyzing the working state based on the facial features corresponding to the target image to obtain the first working state corresponding to the target user includes: acquiring the facial features corresponding to the target image, and processing the facial features by using a trained expression recognition model to obtain a target expression corresponding to the target user; and analyzing the working state according to the target expression to obtain a first working state corresponding to the target user.
In some embodiments, the target images are multiple, and the analyzing the working state based on the facial features corresponding to the target images to obtain the first working state corresponding to the target user includes: acquiring feature point positions respectively corresponding to a plurality of eye key feature points corresponding to the target image, and obtaining a target closing state corresponding to the eyes of the target user in the target image based on the position difference between the feature point positions; sequencing the closed states of the targets according to the acquisition sequence corresponding to the target images to obtain a closed state sequence; and analyzing the working state according to the closed state sequence to obtain a first working state corresponding to the target user.
In some embodiments, the analyzing the working state based on the voice feature corresponding to the target voice to obtain the second working state corresponding to the target user includes: acquiring voice attribute information corresponding to the target voice, wherein the voice attribute information comprises at least one of voice speed information or voice tone change information; and analyzing the working state based on the voice attribute information to obtain a second working state corresponding to the target user.
In some embodiments, the computer program further causes the processor to perform the steps of: performing semantic emotion analysis on the target voice to obtain a target semantic emotion corresponding to the target voice; the analyzing the working state based on the voice attribute information to obtain a second working state corresponding to the target user includes: and analyzing the working state based on the voice attribute information and the target semantic emotion to obtain a second working state corresponding to the target user.
In some embodiments, the determining the target operating state corresponding to the target user according to the first operating state and the second operating state includes: counting the number of fatigued states among the first working state and the second working state; and when the counted number is greater than a preset threshold value, or the proportion corresponding to the counted number is greater than a preset proportion, determining that the target working state corresponding to the target user is fatigued.
In some embodiments, the computer program further causes the processor to perform the steps of: acquiring a virtual user image corresponding to the target user; and controlling the virtual user image to send out working state prompt information according to the target working state.
In some embodiments, the controlling the virtual user image to send out the working state prompt message according to the target working state includes: acquiring a face adjustment parameter corresponding to the first working state; and performing image adjustment on the virtual user image according to the face adjustment parameter, and controlling the virtual user image subjected to image adjustment to send out working state prompt information according to the target working state.
In some embodiments, the determining the target operating state corresponding to the target user in combination with the first operating state and the second operating state includes: and inputting the first working state and the second working state into a comprehensive state determination model to obtain a target working state corresponding to the target user.
In some embodiments, the first operating state is a result of processing the facial features using a first state determination model, the second operating state is a result of processing the speech features using a second state determination model, and the computer program further causes the processor to perform the steps of: acquiring a first training sample, wherein the first training sample comprises training face features and a corresponding first state label, training voice features and a corresponding second state label, and a comprehensive state label; inputting the training face features into a first state determination model to be trained to obtain a first prediction state; inputting the training voice features into a second state determination model to be trained to obtain a second prediction state; inputting the first prediction state and the second prediction state into a comprehensive state determination model to be trained to obtain a third prediction state; obtaining a target model loss value based on the state difference between the first state label and the first prediction state, the state difference between the second state label and the second prediction state, and the state difference between the comprehensive state label and the third prediction state; and adjusting model parameters of the first state determination model to be trained, the second state determination model to be trained and the comprehensive state determination model to be trained based on the target model loss value to obtain the first state determination model, the second state determination model and the comprehensive state determination model.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of: acquiring a target voice of a target user and a target image corresponding to the target voice when the target voice is sent out; analyzing the working state based on the face features corresponding to the target image to obtain a first working state corresponding to the target user; analyzing the working state based on the voice characteristics corresponding to the target voice to obtain a second working state corresponding to the target user; and determining a target working state corresponding to the target user by combining the first working state and the second working state.
In some embodiments, the analyzing the working state based on the facial features corresponding to the target image to obtain the first working state corresponding to the target user includes: acquiring the facial features corresponding to the target image, and processing the facial features by using a trained expression recognition model to obtain a target expression corresponding to the target user; and analyzing the working state according to the target expression to obtain a first working state corresponding to the target user.
In some embodiments, the target images are multiple, and the analyzing the working state based on the facial features corresponding to the target images to obtain the first working state corresponding to the target user includes: acquiring feature point positions respectively corresponding to a plurality of eye key feature points corresponding to the target image, and obtaining a target closing state corresponding to the eyes of the target user in the target image based on the position difference between the feature point positions; sequencing the closed states of the targets according to the acquisition sequence corresponding to the target images to obtain a closed state sequence; and analyzing the working state according to the closed state sequence to obtain a first working state corresponding to the target user.
In some embodiments, the analyzing the working state based on the voice feature corresponding to the target voice to obtain the second working state corresponding to the target user includes: acquiring voice attribute information corresponding to the target voice, wherein the voice attribute information comprises at least one of voice speed information or voice tone change information; and analyzing the working state based on the voice attribute information to obtain a second working state corresponding to the target user.
In some embodiments, the computer program further causes the processor to perform the steps of: performing semantic emotion analysis on the target voice to obtain a target semantic emotion corresponding to the target voice; the analyzing the working state based on the voice attribute information to obtain a second working state corresponding to the target user includes: and analyzing the working state based on the voice attribute information and the target semantic emotion to obtain a second working state corresponding to the target user.
In some embodiments, the determining the target operating state corresponding to the target user according to the first operating state and the second operating state includes: counting the number of fatigued states among the first working state and the second working state; and when the counted number is greater than a preset threshold value, or the proportion corresponding to the counted number is greater than a preset proportion, determining that the target working state corresponding to the target user is fatigued.
In some embodiments, the computer program further causes the processor to perform the steps of: acquiring a virtual user image corresponding to the target user; and controlling the virtual user image to send out working state prompt information according to the target working state.
In some embodiments, the controlling the virtual user image to send out the working state prompt message according to the target working state includes: acquiring a face adjustment parameter corresponding to the first working state; and performing image adjustment on the virtual user image according to the face adjustment parameter, and controlling the virtual user image subjected to image adjustment to send out working state prompt information according to the target working state.
In some embodiments, the determining the target operating state corresponding to the target user in combination with the first operating state and the second operating state includes: and inputting the first working state and the second working state into a comprehensive state determination model to obtain a target working state corresponding to the target user.
In some embodiments, the first operating state is a result of processing the facial features using a first state determination model, the second operating state is a result of processing the speech features using a second state determination model, and the computer program further causes the processor to perform the steps of: acquiring a first training sample, wherein the first training sample comprises training face features and a corresponding first state label, training voice features and a corresponding second state label, and a comprehensive state label; inputting the training face features into a first state determination model to be trained to obtain a first prediction state; inputting the training voice features into a second state determination model to be trained to obtain a second prediction state; inputting the first prediction state and the second prediction state into a comprehensive state determination model to be trained to obtain a third prediction state; obtaining a target model loss value based on the state difference between the first state label and the first prediction state, the state difference between the second state label and the second prediction state, and the state difference between the comprehensive state label and the third prediction state; and adjusting model parameters of the first state determination model to be trained, the second state determination model to be trained and the comprehensive state determination model to be trained based on the target model loss value to obtain the first state determination model, the second state determination model and the comprehensive state determination model.
According to the method, the apparatus, the computer device and the storage medium for determining the working state based on the digital person, the voice of the target user and the image corresponding to that voice can be obtained; working state analysis is performed on the face features corresponding to the target image to obtain the first working state corresponding to the target user, and on the voice features corresponding to the voice to obtain the second working state corresponding to the target user; the target working state corresponding to the target user is then determined by combining the first working state and the second working state. The working state corresponding to the user can thus be determined accurately, which improves both the efficiency and the accuracy of determining the working state.
Drawings
FIG. 1 is a diagram of an application environment of a digital human-based work state determination method in some embodiments;
FIG. 2 is a schematic flow chart of a digital human-based work state determination method in some embodiments;
FIG. 3 is a flow diagram illustrating the steps of model training a comprehensive state determination model in some embodiments;
FIG. 4A is a schematic flow chart of a digital human-based work state determination method in some embodiments;
FIG. 4B is a schematic diagram of an interface for prompting a digital person according to a working state in some embodiments;
FIG. 5 is a block diagram of a digital human-based operation status determination apparatus in some embodiments;
FIG. 6 is a block diagram of a digital human-based operation status determination apparatus in some embodiments;
FIG. 7 is a diagram of the internal structure of a computer device in some embodiments.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The method for determining the working state based on the digital person can be applied to the application environment shown in fig. 1, in which the terminal 102 communicates with the server 104 via a network. The terminal 102 is located in the area where the target user is, for example a computer used by the target user, and a camera and a recording device, for example a microphone, may be installed on the terminal 102. When the user speaks, the terminal 102 may record audio and capture images so as to acquire the voice and the image of the target user in real time and send them to the server 104 in real time; the server executes the method for determining the working state based on the digital person provided in the embodiments of the present application and may send the obtained target working state back to the terminal 102. The terminal 102 may be, but is not limited to, a personal computer, a notebook computer, a smart phone, a tablet computer or a portable wearable device, and the server 104 may be implemented by an independent server or a server cluster formed by a plurality of servers.
It is understood that the method for determining the working state based on the digital person according to the embodiments of the present application may also be executed at the terminal 102. The digital person in the embodiments of the present application is a virtual person that can assist or replace a real person in performing a task; for example, a set of programs may be developed and executed to assist or replace a real supervisor in monitoring the working state of employees.
In some embodiments, as shown in fig. 2, there is provided a method for determining an operating state based on a digital person, which is described by taking the method as an example applied to the server in fig. 1, and includes the following steps:
Step 202, acquiring a target voice of a target user and a target image corresponding to the target voice.
The target user may be any user, for example a user who operates a terminal, such as an employee who uses a computer. The server can also control the terminal to acquire a face image of the user, detect from the face image whether the acquired face is consistent with the face corresponding to the account logged in on the terminal, and, if so, determine that the user is the target user. For example, for employees in an office, if each computer is logged in with an employee's account, the computer may collect face images of the person sitting at the desk, compare them with the preset face corresponding to the account logged in on that computer, and determine that the user is the target user when the comparison passes.
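As a hedged illustration of this identity check, the sketch below compares a face embedding extracted from the captured image with the embedding registered for the logged-in account using cosine similarity; the embedding dimensions, the threshold and the use of embeddings at all are assumptions made for the example, since the patent does not specify how faces are compared.

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_target_user(captured_embedding: np.ndarray,
                   registered_embedding: np.ndarray,
                   threshold: float = 0.8) -> bool:
    # The comparison "passes" when the captured face is close enough to the preset face.
    return cosine_similarity(captured_embedding, registered_embedding) >= threshold

# Example with made-up 4-dimensional embeddings (real embeddings have hundreds of dimensions).
captured = np.array([0.90, 0.10, 0.30, 0.20])
registered = np.array([0.88, 0.12, 0.31, 0.18])
print(is_target_user(captured, registered))  # -> True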
Specifically, when the remote-working software is started, the camera and the recording device of the terminal can be started automatically. When the target user speaks, the terminal can acquire the user's voice information, simultaneously record the target image captured while the user is speaking, and upload both to the server.
In some embodiments, the image capture device may be turned on only when the user is detected to be speaking, so as to reduce the consumption of terminal resources.
Step 204, analyzing the working state based on the face features corresponding to the target image to obtain a first working state corresponding to the target user.
The working state refers to the state during work. The working state can be measured by the degree of work fatigue, and the quantification of the fatigue degree can be set as required and divided into a plurality of levels, for example three levels: fatigued, slightly fatigued and energetic. The face features are features related to the face, and may include at least one of features corresponding to the eyes, the mouth or the nose, or may be morphological features obtained by combining the individual face features. The face features may include the positions of the respective feature points of the face and may also include features obtained from pixel values. The feature corresponding to the eyes may be, for example, at least one of open or closed. The feature corresponding to the nose may be, for example, at least one of inhaling or exhaling through the nose. The feature corresponding to the mouth may be, for example, at least one of open or closed. The first working state can be obtained according to a preset judgment rule or determined by an artificial intelligence model.
Specifically, after obtaining the target image, the server can perform face feature extraction on the target image to obtain the face features, and then analyze the working state according to the face features.
In some embodiments, the facial features may be extracted by a facial feature extraction model, and the facial feature extraction model may be a deep learning model. A plurality of face feature extraction models may be included, and for example, at least one of a model extracting a feature corresponding to an eye or a model extracting a feature corresponding to a mouth may be included.
In some embodiments, the server may obtain a facial feature corresponding to the target image, and process the facial feature by using the trained expression recognition model to obtain a target expression corresponding to the target user; and analyzing the working state according to the target expression to obtain a first working state corresponding to the target user.
The expression is the emotion shown on the face, and may be, for example, depression, excitement, anger or the like. The facial feature extraction model and the expression recognition model can be cascaded and obtained by joint training. For example, a training image may be input into the facial feature extraction model to obtain facial features, and the facial features may be input into the expression recognition model to obtain a predicted expression. A model loss value is obtained according to the difference between the predicted expression and the actual expression, and the parameters of the models are adjusted according to a gradient descent method; the difference between the predicted expression and the actual expression is positively correlated with the model loss value. In this way, the facial feature extraction model and the expression recognition model are obtained through joint training.
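A minimal PyTorch-style sketch of such joint training, assuming simple stand-in architectures (a small fully connected extractor and a linear expression classifier) and a cross-entropy loss as the measure of the difference between predicted and actual expression; none of these choices are specified by the patent.

import torch
import torch.nn as nn

feature_extractor = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 128), nn.ReLU())
expression_recognizer = nn.Linear(128, 5)  # assume 5 expression classes

criterion = nn.CrossEntropyLoss()          # loss grows with the prediction/label difference
optimizer = torch.optim.SGD(
    list(feature_extractor.parameters()) + list(expression_recognizer.parameters()),
    lr=0.01,
)

def joint_training_step(training_images: torch.Tensor, expression_labels: torch.Tensor) -> float:
    # One gradient-descent update applied to both cascaded models at once.
    face_features = feature_extractor(training_images)
    predicted_expressions = expression_recognizer(face_features)
    loss = criterion(predicted_expressions, expression_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example: a batch of 8 single-channel 64x64 face crops with random expression labels.
print(joint_training_step(torch.randn(8, 1, 64, 64), torch.randint(0, 5, (8,))))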
In some embodiments, the correspondence between expressions and working states may be preset; for example, the working state corresponding to an excited expression may be set as energetic, and the working state corresponding to a depressed expression as fatigued. Therefore, after the target expression is obtained, the first working state corresponding to the target user can be determined.
In some embodiments, the server obtains feature point positions corresponding to a plurality of eye key feature points corresponding to the target image, and obtains a target closed state corresponding to the eyes of the target user in the target image based on the position difference between the feature point positions; sequencing the closed states of the targets according to the acquisition sequence corresponding to the target images to obtain a closed state sequence; and analyzing the working state according to the closed state sequence to obtain a first working state corresponding to the target user.
The eye key feature points may be set as needed, and may include, for example, feature points on the upper eyelid and feature points on the lower eyelid. The position difference is the difference between feature point positions and may be represented by a distance. The target closed state may be open or closed: the distance between a feature point on the upper eyelid and a feature point on the lower eyelid may be obtained, and whether the eye is open or closed determined according to that distance. When the distance is greater than a first preset distance, the eye is determined to be open; when the distance is less than a second preset distance, the eye is determined to be closed, where the first preset distance is greater than or equal to the second preset distance. The acquisition order is the order in which the target images were captured, so target images acquired earlier are ordered before target images acquired later. Since there are a plurality of target images, the target closed states can be ordered according to the acquisition order to obtain the closed state sequence. For example, if 5 images were captured in sequence with target closed states open, closed, open, closed and open respectively, the target closed states are arranged in that order.
The server can analyze the closed state sequence to obtain the first working state corresponding to the target user. For example, a change rule of the states in the closed state sequence may be determined, and the corresponding working state determined according to the change rule and a preset judgment rule. For instance, the state may be determined to be fatigued when, in the closed state sequence, the duration of a continuous run of closed states exceeds a preset duration, for example when the eyes stay closed for more than 1 minute, and determined to be normal otherwise. Alternatively, the number of continuous closed runs whose duration exceeds the preset duration may be counted, and if that number is greater than a preset number, the working state is determined to be fatigued. For example, if the eyes were closed for more than 1 minute on 5 occasions and the preset number is 3, the working state is determined to be fatigued.
In some embodiments, when the number of closed states in the closed state sequence is greater than a preset number, or their proportion is greater than a preset proportion, the working state is determined to be fatigued; otherwise it is determined to be normal or energetic.
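A sketch of the eye-based analysis under stated assumptions: eyelid key points are given as (x, y) coordinates, target images are sampled at a known interval, and the thresholds (2 pixels for "closed", 60 seconds for a long closure, 3 allowed long closures) are illustrative values rather than values fixed by the patent.

from typing import List, Tuple

def eye_closed(upper_lid: Tuple[float, float], lower_lid: Tuple[float, float],
               close_threshold: float = 2.0) -> bool:
    # The eye counts as closed when the upper- and lower-eyelid key points are close together.
    dx, dy = upper_lid[0] - lower_lid[0], upper_lid[1] - lower_lid[1]
    return (dx * dx + dy * dy) ** 0.5 < close_threshold

def first_working_state(closed_sequence: List[bool], frame_interval_s: float,
                        long_close_s: float = 60.0, max_long_closes: int = 3) -> str:
    # Fatigued when the eyes stay closed longer than long_close_s more than max_long_closes times.
    long_closes = 0
    run = 0
    for closed in closed_sequence + [False]:  # trailing sentinel flushes the final run
        if closed:
            run += 1
        else:
            if run * frame_interval_s > long_close_s:
                long_closes += 1
            run = 0
    return "fatigued" if long_closes > max_long_closes else "normal"

print(eye_closed(upper_lid=(10.0, 20.0), lower_lid=(10.0, 21.0)))  # -> True (eyelids 1 px apart)

# Example: one image per second; the eyes stay closed for five separate 70-second stretches.
sequence = ([True] * 70 + [False] * 10) * 5
print(first_working_state(sequence, frame_interval_s=1.0))  # 5 long closures > 3 -> fatigued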
In some embodiments, the target image may be extracted according to a preset time interval or a preset image interval, for example, one video frame may be extracted from a video obtained by video capturing of a user every 3 video frames to perform the analysis of the working state.
Step 206, analyzing the working state based on the voice characteristics corresponding to the target voice to obtain a second working state corresponding to the target user.
The voice features are features representing characteristics of the voice and may include, for example, at least one of intonation or speech rate. Intonation refers to the rise and fall of pitch within a sentence; it may, for example, be rising, falling or flat. The change in the frequency of the voice can be counted to obtain the intonation feature. One or more voice features may be obtained, and the second working state may be obtained by combining a plurality of voice features. The second working state can be obtained according to a preset judgment rule or determined by an artificial intelligence model.
Specifically, the server may perform voice feature recognition on the target voice using natural language processing techniques to obtain a set of voice features. For example, the server obtains corresponding voice attribute information based on the target voice, the voice attribute information including at least one of speech rate information or intonation change information. The intonation change information may be computed in units of a preset time length: for example, the average voice frequency in each time period of the preset length is calculated, and the intonation is determined according to the change of the average voice frequency between adjacent time periods. For example, assuming the preset time length is 1 second, the average voice frequency of the 1st second, the 2nd second and the 3rd second may be acquired, and when the average frequency increases continuously, the intonation change information is rising. The correspondence between voice features and working states can be preset, and the second working state determined according to the voice features corresponding to the target voice.
In some embodiments, the server may obtain voice attribute information corresponding to the target voice, where the voice attribute information includes at least one of voice speed information or voice tone variation information; and analyzing the working state based on the voice attribute information to obtain a second working state corresponding to the target user.
Specifically, the speech rate information refers to how fast the user speaks. A correspondence between voice attributes and working states can be set, and the second working state corresponding to the target user obtained from this correspondence. Alternatively, the voice attributes can be compared with preset voice attributes for the target user, and the second working state determined from the comparison result. For example, the voice attribute information corresponding to each working state of the target user may be preset, so that the second working state can be determined from the voice attribute information of the target voice. For instance, it may be preset that when the target user is fatigued, the speech rate is lower than a first rate and the intonation gradually falls; then, when the voice attribute information of the target voice shows a speech rate lower than the first rate and a gradually falling intonation, the second working state corresponding to the target user is determined to be fatigued. Analyzing the working state through the voice attribute information allows the user's working state to be determined accurately.
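The sketch below assumes a pitch (fundamental frequency) contour has already been extracted at a known frame rate and that the number of recognized words and the utterance duration are available; the 1-second averaging window follows the example above, while the 2-words-per-second speed threshold is an illustrative assumption.

import numpy as np

def intonation_trend(pitch_hz: np.ndarray, frames_per_second: int) -> str:
    # Average the pitch contour per second and look at how the averages move.
    seconds = len(pitch_hz) // frames_per_second
    means = [pitch_hz[i * frames_per_second:(i + 1) * frames_per_second].mean()
             for i in range(seconds)]
    diffs = np.diff(means)
    if np.all(diffs > 0):
        return "rising"
    if np.all(diffs < 0):
        return "falling"
    return "mixed"

def second_working_state(word_count: int, duration_s: float, pitch_hz: np.ndarray,
                         frames_per_second: int, slow_rate: float = 2.0) -> str:
    # Fatigued when speech is slower than slow_rate words per second and intonation keeps falling.
    speech_rate = word_count / duration_s
    if speech_rate < slow_rate and intonation_trend(pitch_hz, frames_per_second) == "falling":
        return "fatigued"
    return "normal"

# Example: 6 words in 5 seconds, pitch sliding from about 180 Hz down to 140 Hz (100 frames/s).
pitch = np.linspace(180.0, 140.0, 500)
print(second_working_state(6, 5.0, pitch, frames_per_second=100))  # -> fatigued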
In some embodiments, the digital person-based work state determination method further comprises: performing semantic emotion analysis on the target voice to obtain a target semantic emotion corresponding to the target voice; analyzing the working state based on the voice attribute information, and obtaining a second working state corresponding to the target user comprises: and analyzing the working state based on the voice attribute information and the target semantic emotion to obtain a second working state corresponding to the target user.
Here, semantic emotion means the emotion expressed by the meaning of a sentence, and may be positive or negative. Semantic emotions may be recognized, for example, by a semantic emotion recognition model. The semantic emotion recognition model is obtained through supervised training: training sentences and their corresponding labels (semantic emotions) are obtained, the training sentences are input into the semantic emotion recognition model to be trained, and the model outputs predicted semantic emotions. A model loss value is obtained from the difference between the predicted semantic emotion and the label, and the model parameters are adjusted in the direction that decreases the model loss value until the model converges; the convergence condition may be that the model loss value is smaller than a preset threshold. The difference between the predicted semantic emotion and the label is positively correlated with the model loss value: the larger the difference, the larger the loss.
Specifically, the server may recognize the target speech to obtain a target sentence, input the target sentence into the trained semantic emotion recognition model, and perform semantic emotion recognition on the target sentence with that model to obtain the target semantic emotion. A correspondence between combinations of voice attribute information and semantic emotion, on the one hand, and working states, on the other, can be preset, so that the second working state can be obtained from the voice attribute information and the target semantic emotion. For example, the working state may be determined to be fatigued when the speech rate is lower than a preset speech rate and the target semantic emotion is negative.
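As a sketch of such a preset correspondence, assuming just two speech-rate bands and two semantic emotions; the table entries and the rate threshold are illustrative assumptions, not values from the patent.

# Hypothetical correspondence between (speech-rate band, semantic emotion) and the second working state.
STATE_TABLE = {
    ("slow", "negative"): "fatigued",
    ("slow", "positive"): "normal",
    ("fast", "negative"): "normal",
    ("fast", "positive"): "energetic",
}

def second_state_with_emotion(speech_rate: float, semantic_emotion: str,
                              preset_rate: float = 2.0) -> str:
    band = "slow" if speech_rate < preset_rate else "fast"
    return STATE_TABLE[(band, semantic_emotion)]

print(second_state_with_emotion(1.4, "negative"))  # slow speech + negative emotion -> fatigued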
Step 208, determining a target working state corresponding to the target user by combining the first working state and the second working state.
Specifically, the target working state is obtained by combining the first working state and the second working state. For example, when the first working state comprises a working state for each of a plurality of analysis dimensions and the second working state likewise comprises a working state for each of a plurality of analysis dimensions, the number of states among the first working states and the second working states that are fatigued can be counted; when that number is greater than a preset threshold value, or the proportion corresponding to that number is greater than a preset proportion, the target working state corresponding to the target user is determined to be fatigued.
The proportion corresponding to the counted number is the number of fatigued states divided by the total number of first working states and second working states. For example, if there are n first working states and m second working states, of which k are fatigued, the corresponding proportion is k/(n + m). An analysis dimension is a dimension along which the working state is analyzed; for the face features, the analysis dimensions may include expression, eyes and so on, so that a first working state corresponding to the expression and a first working state corresponding to the eyes can be obtained. For the voice features, a working state corresponding to the speech rate and a working state corresponding to the intonation can be obtained. The preset threshold and the preset proportion may be set as required, for example a preset threshold of 3 and a preset proportion of 60%. For example, suppose the first working state corresponding to the expression is fatigued, the first working state corresponding to the eyes is normal, the second working state corresponding to the speech rate is fatigued and the second working state corresponding to the intonation is fatigued; then 3 of the 4 states are fatigued, a proportion of 75%, which is greater than the preset proportion of 60%, so the target working state is fatigued. In the embodiments of the present application, this multi-dimensional analysis improves the accuracy of the determined working state.
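A short sketch of this counting rule; the per-dimension states, the threshold of 3 and the proportion of 60% follow the example in the preceding paragraph.

from typing import List

def target_working_state(first_states: List[str], second_states: List[str],
                         preset_threshold: int = 3, preset_ratio: float = 0.6) -> str:
    all_states = first_states + second_states
    fatigued = sum(1 for state in all_states if state == "fatigued")  # k in the text above
    ratio = fatigued / len(all_states)                                # k / (n + m)
    if fatigued > preset_threshold or ratio > preset_ratio:
        return "fatigued"
    return "normal"

# Dimensions from the example: expression and eyes (image), speech rate and intonation (voice).
first = ["fatigued", "normal"]
second = ["fatigued", "fatigued"]
print(target_working_state(first, second))  # 3 of 4 fatigued, 75% > 60% -> fatigued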
In some embodiments, the first working state comprises a working state for each of a plurality of analysis dimensions and the second working state likewise comprises a working state for each of a plurality of analysis dimensions, and the server inputs the first working state and the second working state into a comprehensive state determination model to obtain the target working state corresponding to the target user. The comprehensive state determination model is a pre-trained model used to determine the target working state by synthesizing the first working state and the second working state, and may be obtained through supervised training. For example, a set of working states for model training and manually labelled working state labels can be obtained; each working state in the set is input into the comprehensive state determination model to be trained, the model outputs a predicted comprehensive working state, a model loss value is obtained from the difference between the predicted comprehensive working state and the working state label, and the model parameters are adjusted according to a gradient descent algorithm until the model converges, yielding the trained comprehensive state determination model.
In some embodiments, the first working state is obtained by processing the face features with a first state determination model and the second working state is obtained by processing the voice features with a second state determination model. As shown in fig. 3, the method for determining the working state based on the digital person includes a step of training the comprehensive state determination model, which includes:
step S302, a first training sample is obtained, wherein the first training sample comprises training voice features and corresponding second state labels, training face features and corresponding first state labels, and comprehensive state labels.
The training samples are samples used for model training. The training voice features in the first training sample are obtained by performing feature extraction on training voice, and the training face features are obtained by performing feature extraction on training images. The training voice is voice used for model training, and the training images are images used for model training. For the training voice and training images in the same training sample, the training images are images of a user collected while the user utters the training voice. The first state label, the second state label and the comprehensive state label may be manually labelled.
Specifically, the server may obtain a training voice and a corresponding training image acquired when the training voice is uttered, and perform feature extraction on the training voice to obtain a training voice feature. And extracting the face features of the training images to obtain training face features. The server can output training voice and training images to the terminal, the terminal receives state labeling operation, and the terminal responds to the state labeling operation to obtain a second state label, a first state label and a comprehensive state label.
Step S304, inputting the training face features into a first state determination model to be trained to obtain a first prediction state.
Specifically, the first state determination model is a model for processing face features, and may be, for example, a neural network model. The first prediction state is the working state output by the first state determination model after it processes the training face features. The server can input the training face features into the first state determination model to be trained, which processes them with its model parameters to predict the first prediction state.
Step S306, inputting the training voice characteristics into a second state determination model to be trained to obtain a second prediction state.
Specifically, the second state determination model is a model for processing voice features, and may likewise be, for example, a neural network model. The second prediction state is the working state output by the second state determination model after it processes the training voice features. The server may input the training voice features into the second state determination model to be trained, which processes them with its model parameters to predict the second prediction state.
In some embodiments, there are a plurality of first state determination models and a plurality of second state determination models; for example, the plurality of first state determination models may use different model structures, and the plurality of second state determination models may also use different model structures.
Step S308, inputting the first prediction state and the second prediction state into a comprehensive state determination model to be trained to obtain a third prediction state.
Specifically, when there are a plurality of first state determination models and a plurality of second state determination models, a plurality of first prediction states and a plurality of second prediction states are obtained. Each of the first prediction states and each of the second prediction states is used as a feature and input into the comprehensive state determination model, which processes the input features with its model parameters to obtain the third prediction state.
Step S310, a target model loss value is obtained based on the state difference between the first state label and the first prediction state, the state difference between the second state label and the second prediction state, and the state difference between the comprehensive state label and the third prediction state.
Specifically, the model loss value is positively correlated with the difference: the larger the difference, the larger the model loss value. The model loss value may be, for example, a cross entropy loss or a mean square error (MSE). A first model loss value may be obtained from the state difference between the first state label and the first prediction state, a second model loss value from the state difference between the second state label and the second prediction state, and a third model loss value from the state difference between the comprehensive state label and the third prediction state. The target model loss value is then obtained by weighted summation of the first, second and third model loss values, where the weight of each model loss value can be set as required.
Step S312, model parameters of the first state determination model to be trained, the second state determination model to be trained and the comprehensive state determination model to be trained are adjusted based on the target model loss value, and the first state determination model, the second state determination model and the comprehensive state determination model are obtained.
Specifically, after a target model loss value is obtained, back propagation is performed according to the target model loss value, model parameters of the first state determination model to be trained, the second state determination model to be trained and the comprehensive state determination model to be trained are adjusted, and the trained first state determination model, the trained second state determination model and the trained comprehensive state determination model are obtained.
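The following is a minimal sketch of one joint training step covering steps S302 to S312. PyTorch, the stand-in linear models, the feature sizes and the loss weights are all assumptions for illustration only; the actual state determination models may be arbitrary neural networks.

```python
# Illustrative sketch only: one joint training step with a weighted target loss.
import torch
import torch.nn as nn

face_model = nn.Linear(128, 3)    # stand-in for the first state determination model
voice_model = nn.Linear(64, 3)    # stand-in for the second state determination model
comp_model = nn.Linear(6, 3)      # stand-in for the comprehensive state determination model

params = (list(face_model.parameters()) + list(voice_model.parameters())
          + list(comp_model.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-3)
criterion = nn.CrossEntropyLoss()
w1, w2, w3 = 1.0, 1.0, 1.0        # per-loss weights, "set as required"

def train_step(face_feat, voice_feat, first_label, second_label, comp_label):
    first_pred = face_model(face_feat)                    # S304: first prediction state
    second_pred = voice_model(voice_feat)                 # S306: second prediction state
    comp_in = torch.cat([first_pred, second_pred], dim=1)
    third_pred = comp_model(comp_in)                      # S308: third prediction state
    loss = (w1 * criterion(first_pred, first_label)       # S310: weighted target loss
            + w2 * criterion(second_pred, second_label)
            + w3 * criterion(third_pred, comp_label))
    optimizer.zero_grad()
    loss.backward()                                       # S312: back propagation
    optimizer.step()
    return loss.item()

# Toy batch with assumed feature sizes and three-class state labels.
train_step(torch.randn(8, 128), torch.randn(8, 64),
           torch.randint(0, 3, (8,)), torch.randint(0, 3, (8,)),
           torch.randint(0, 3, (8,)))
```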
In the embodiment of the application, the first state determination model, the second state determination model and the comprehensive state determination model are obtained by joint training. The target model loss value combines the state difference between the first state label and the first prediction state, the state difference between the second state label and the second prediction state, and the state difference between the comprehensive state label and the third prediction state, so that during back propagation the parameter adjustment is driven by this combined total loss, and the trained models achieve a good recognition effect.
In the above digital person-based working state determination method, the voice of the target user and the image corresponding to that voice can be acquired; working state analysis is performed on the face features corresponding to the target image to obtain the first working state corresponding to the target user, and on the voice features corresponding to the voice to obtain the second working state corresponding to the target user; and the target working state corresponding to the target user is determined by combining the first working state and the second working state. The working state of the user can thus be determined accurately, improving both the efficiency and the accuracy of working state determination.
In some embodiments, as shown in fig. 4A, the digital person-based work state determination method further includes:
step S402, obtaining the virtual user image corresponding to the target user.
Specifically, a virtual user image is a user image that is generated virtually rather than being a real image of the user; for example, it may be a cartoon image of the user. The virtual user image may be generated according to characteristics of the user, for example its hair style may be determined according to the hair style of the target user, so that the virtual user image better matches the characteristics of the target user. The virtual user image may also be preset.
In some embodiments, a face adjustment parameter corresponding to the first working state may also be obtained; and performing image adjustment on the virtual user image according to the face adjustment parameters, and controlling the image-adjusted virtual user image to send out working state prompt information according to the target working state.
Specifically, a face adjustment parameter is a parameter for adjusting the face, and a correspondence between working states and face adjustment parameters is preset, so that the virtual user image can be adjusted according to the face adjustment parameter. For example, assuming the first working state is tired, the parameters corresponding to a tired face are obtained, such as a parameter that makes the eyes droop and a parameter that makes the brows furrow, and the face of the virtual user image is adjusted to drooping eyes and furrowed brows accordingly. Since the first working state is obtained from the face features, it reflects the working state of the target user as shown by appearance; obtaining the face adjustment parameters from the first working state therefore adjusts the virtual user image with parameters matched to the user's current facial appearance, making the prompt more effective. For example, the prompt information can be played as voice while the image-adjusted virtual user image is displayed.
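A small sketch of looking up the face adjustment parameters from the first working state is given below; the parameter names and values are hypothetical, since the embodiment only states that the correspondence between working states and face adjustment parameters is preset.

```python
# Illustrative sketch only: preset mapping from working state to face adjustment parameters.
FACE_ADJUSTMENTS = {
    "tired":  {"eyelid_openness": 0.3, "brow_furrow": 0.7},   # drooping eyes, furrowed brows
    "normal": {"eyelid_openness": 0.9, "brow_furrow": 0.0},
}

def adjust_avatar(avatar_params, first_working_state):
    """Apply the preset face adjustment parameters for the given working state."""
    adjusted = dict(avatar_params)
    adjusted.update(FACE_ADJUSTMENTS.get(first_working_state, {}))
    return adjusted

avatar = {"eyelid_openness": 0.9, "brow_furrow": 0.0, "hair_style": "short"}
print(adjust_avatar(avatar, "tired"))  # the virtual user image now shows a tired face
```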
And S404, controlling the virtual user image to send out working state prompt information according to the target working state.
Specifically, the working state prompt information may be presented in the form of voice, text or action. For example, the working state prompt information may be "you are currently in a tired state, please take a rest". The prompt may also be given by the virtual user image performing an action corresponding to the target working state, for example, when tired, controlling the virtual user image to make a dozing or yawning action.
In the embodiment of the application, the working state prompt information is sent out through the virtual user image corresponding to the target user, so that the current target working state of the target user can be visually prompted, and the prompting efficiency is improved.
In some embodiments, when the working state of the user is detected to be tired, a rest prompt can be issued to prompt the user to rest. For example, when the eye-closing duration of the target user exceeds a preset duration, or the number of eye closures exceeds a preset number, the working state of the user is determined to be tired.
In some embodiments, when the working state of the user is detected to be energetic, an encouragement prompt may be issued to encourage the user.
In some embodiments, the working time of an employee can be monitored and reminders issued, using the length of time the face appears in the video as the employee's working duration. If the employee's working duration has not reached the required office duration, a working duration prompt can be issued automatically; if the working duration exceeds a preset duration, for example 10 hours, an off-duty prompt can be issued automatically.
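Purely as an illustration, these working-time reminders could be implemented roughly as follows; the 8-hour office duration is an assumption, while the 10-hour overtime threshold is taken from the example in the text.

```python
# Illustrative sketch only: reminders based on how long the face appears in the video.
def working_time_prompts(face_presence_hours, office_hours=8.0, overtime_hours=10.0):
    """Use the time the face appears in the video as the employee's working duration."""
    prompts = []
    if face_presence_hours < office_hours:
        prompts.append("Working duration has not yet reached the office duration.")
    if face_presence_hours > overtime_hours:
        prompts.append("Working duration exceeds the preset duration, please go off duty.")
    return prompts

print(working_time_prompts(10.5))  # -> off-duty prompt
```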
In some embodiments, during working state determination, the digital person may collect voice and images over a preset duration for each working state analysis; the preset duration is, for example, 30 minutes, i.e. the digital person analyses the working state every 30 minutes.
In some embodiments, each time the working state analysis is performed, the working state obtained by the previous analysis (referred to as the forward working state) may also be obtained, and the current target working state corresponding to the target user is determined with reference to it. For example, if the forward working state was tired, the probability that the current state is tired is higher; if the previous analysis found a good state, the probability that the current state is good is also higher. The result can thus be made more accurate through this multi-stage analysis.
The forward working state is determined from the voice preceding the target voice (referred to as the forward voice) and the image corresponding to the forward voice when it was uttered (referred to as the forward image). The number of tired states in the forward working state (referred to as the forward tired number) may also be used: when the forward working state is tired and the number of tired states obtained this time is larger than the forward tired number, the target user is determined to be tired. For example, if 3 working states were tired when the working state was determined last time and 5 working states are tired this time, the current working state of the target user is tired. In the embodiment of the application, using the previously obtained working state to assist the current analysis makes the result more accurate.
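A minimal sketch of this comparison with the forward working state, matching the 3-to-5 example above, might be:

```python
# Illustrative sketch only: confirming tiredness against the previous (forward) analysis.
def confirm_tired(forward_state, forward_tired_count, current_tired_count):
    """Judge the target user tired when the previous analysis was tired and the
    number of tired sub-states has grown since then."""
    return forward_state == "tired" and current_tired_count > forward_tired_count

print(confirm_tired("tired", 3, 5))  # -> True, matching the example in the text
```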
The scheme in the embodiment of the application can be used for status prompts to employees and can be applied to remote office monitoring to improve employees' remote working efficiency. Before remote office monitoring, face recognition can be performed for verification, to check whether the person at the desk is the employee. The daily working state of employees cannot otherwise be guaranteed: an employee may work too long without resting in time, leading to low efficiency. Therefore, computer vision technology can be used to detect the employee's expression and eye information, and speech recognition and natural language processing technology can be used to obtain the employee's speech and its semantic information. By performing multi-level analysis on this combined information, the employee's working state can be judged accurately, and the employee can be prompted with encouragement or a rest reminder according to the working state, improving the employee's remote working efficiency.
The working state determination method may be executed once every preset duration, for example every 20 minutes, so that the employee can be reminded several times during the working day. A summary report may be generated from the target working state obtained each time, showing the periods in which the working state was tired and summarizing the working state over the day, for example which periods the employee tends to be tired in, so that the employee can adjust their working pattern the next day.
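As an illustration only, the periodic results could be aggregated into such a summary report as follows; the record format and report fields are assumptions.

```python
# Illustrative sketch only: summarizing the periodic working state analyses for one day.
from collections import Counter

def summarize_day(records):
    """records: list of (time_string, target_working_state), collected e.g. every 20 minutes."""
    tired_periods = [t for t, state in records if state == "tired"]
    state_counts = Counter(state for _, state in records)
    return {"tired_periods": tired_periods, "state_counts": dict(state_counts)}

day = [("09:00", "normal"), ("14:20", "tired"), ("14:40", "tired"), ("16:00", "normal")]
print(summarize_day(day))  # shows the periods in which the working state was tired
```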
In some embodiments, the appearance time of the face of the employee in the video can be calculated, and the working time of the employee can be calculated according to the appearance time.
Fig. 4B is a schematic interface diagram of a digital person issuing prompts according to the working state in some embodiments. The digital person is represented by the virtual user image. As shown in the left diagram of fig. 4B, when the user is in a normal working state, the user can work on the working interface, for example writing and debugging code, while the digital person remains hidden, so that the user can work normally with minimal interference. When the user is detected to be in a tired working state, the digital person is awakened and displayed on the working interface, and issues the working state prompt "you are currently in a tired state, please take a rest" to remind the user to rest.
It should be understood that, although the steps in the above flowcharts are shown in an order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated otherwise herein, there is no strict restriction on the execution order, and the steps may be performed in other orders. Moreover, at least some of the steps in the above flowcharts may include several sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments; their execution order is not necessarily sequential, and they may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
In some embodiments, as shown in fig. 5, there is provided a digital person-based work state determination apparatus including: an image and voice obtaining module 502, a first working state obtaining module 504, a second working state obtaining module 506 and a target working state determining module 508, wherein:
an image and voice acquiring module 502, configured to acquire a target voice of a target user and a target image corresponding to the target voice.
The first working state obtaining module 504 is configured to perform working state analysis based on the facial features corresponding to the target image to obtain a first working state corresponding to the target user.
A second working state obtaining module 506, configured to perform working state analysis based on the voice feature corresponding to the target voice, so as to obtain a second working state corresponding to the target user.
And a target working state determining module 508, configured to determine a target working state corresponding to the target user according to the first working state and the second working state.
In some embodiments, the first operating state obtaining module is configured to: acquiring facial features corresponding to a target image, and processing the facial features by using a trained expression recognition model to obtain a target expression corresponding to a target user; and analyzing the working state according to the target expression to obtain a first working state corresponding to the target user.
In some embodiments, the target image is plural, and the first operation state obtaining module is configured to: acquiring feature point positions respectively corresponding to a plurality of eye key feature points corresponding to a target image, and obtaining a target closed state corresponding to eyes of a target user in the target image based on position difference between the feature point positions; sequencing the closed states of the targets according to the acquisition sequence corresponding to the target images to obtain a closed state sequence; and analyzing the working state according to the closed state sequence to obtain a first working state corresponding to the target user.
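As an illustrative sketch only, the closed state of an eye could be derived from the key feature point positions using an eye-aspect-ratio style measure; this particular measure and its threshold are assumptions, since the embodiment only requires a position difference between the feature points.

```python
# Illustrative sketch only: eye closed/open state from six assumed eye landmarks per image.
import math

def eye_closed(eye_points, close_threshold=0.2):
    """eye_points: six (x, y) landmarks around one eye, ordered p1..p6."""
    p1, p2, p3, p4, p5, p6 = eye_points
    vertical = (math.dist(p2, p6) + math.dist(p3, p5)) / 2.0   # lid-to-lid distances
    horizontal = math.dist(p1, p4)                              # eye-corner distance
    return (vertical / horizontal) < close_threshold            # small ratio => eye closed

def closed_state_sequence(frames):
    """Order the per-image closed states by acquisition order to form the state sequence."""
    return [eye_closed(points) for points in frames]
```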
In some embodiments, the second operating state obtaining module is configured to: acquiring voice attribute information corresponding to target voice, wherein the voice attribute information comprises at least one of voice speed information or voice tone change information; and analyzing the working state based on the voice attribute information to obtain a second working state corresponding to the target user.
In some embodiments, the apparatus further comprises: the semantic emotion analysis module is used for performing semantic emotion analysis on the target voice to obtain a target semantic emotion corresponding to the target voice; the second operating state obtaining module is configured to: and analyzing the working state based on the voice attribute information and the target semantic emotion to obtain a second working state corresponding to the target user.
In some embodiments, the first working state is a working state corresponding to each of the plurality of analysis dimensions, the second working state is a working state corresponding to each of the plurality of analysis dimensions, and the target working state determination module is configured to: calculating the number of the fatigue states in the first working state and the second working state; and when the state quantity is greater than a preset threshold value or the proportion corresponding to the state quantity is greater than a preset proportion, determining that the target working state corresponding to the target user is exhausted.
In some embodiments, as shown in fig. 6, the digital person-based work state determination apparatus further includes: an avatar obtaining module 602, configured to obtain an avatar corresponding to a target user; and a working state prompt message sending module 604, configured to control the virtual user image to send out a working state prompt message according to the target working state.
In some embodiments, the working state prompt message issuing module is configured to: acquiring a face adjustment parameter corresponding to the first working state; and performing image adjustment on the virtual user image according to the face adjustment parameters, and controlling the image-adjusted virtual user image to send out working state prompt information according to the target working state.
In some embodiments, the target operating state determination module is to: and inputting the first working state and the second working state into the comprehensive state determination model to obtain a target working state corresponding to the target user.
In some embodiments, the first working state is obtained by processing the face features using a first state determination model, the second working state is obtained by processing the voice features using a second state determination model, and the apparatus further comprises a model training module configured to: acquire a first training sample, wherein the first training sample comprises training face features and a corresponding first state label, training voice features and a corresponding second state label, and a comprehensive state label; input the training face features into a first state determination model to be trained to obtain a first prediction state; input the training voice features into a second state determination model to be trained to obtain a second prediction state; input the first prediction state and the second prediction state into a comprehensive state determination model to be trained to obtain a third prediction state; obtain a target model loss value based on the state difference between the first state label and the first prediction state, the state difference between the second state label and the second prediction state, and the state difference between the comprehensive state label and the third prediction state; and adjust the model parameters of the first state determination model to be trained, the second state determination model to be trained and the comprehensive state determination model to be trained based on the target model loss value, to obtain the first state determination model, the second state determination model and the comprehensive state determination model.
For specific limitations of the digital person-based working state determination apparatus, reference may be made to the above limitations of the digital person-based working state determination method, which will not be repeated here. Each module in the above apparatus may be implemented wholly or partially by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to each module.
In some embodiments, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 7. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing data determined based on the working state of the digital person. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a digital person based work status determination method.
Those skilled in the art will appreciate that the structure shown in fig. 7 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In some embodiments, there is provided a computer device comprising a memory and a processor, the memory having stored therein a computer program that when executed by the processor performs the steps of: acquiring a target voice of a target user and a target image corresponding to the target voice when the target voice is sent out; analyzing the working state based on the face features corresponding to the target image to obtain a first working state corresponding to the target user; analyzing the working state based on the voice characteristics corresponding to the target voice to obtain a second working state corresponding to the target user; and determining a target working state corresponding to the target user by combining the first working state and the second working state.
In some embodiments, there is provided a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of: acquiring a target voice of a target user and a target image corresponding to the target voice when the target voice is sent out; analyzing the working state based on the face features corresponding to the target image to obtain a first working state corresponding to the target user; analyzing the working state based on the voice characteristics corresponding to the target voice to obtain a second working state corresponding to the target user; and determining a target working state corresponding to the target user by combining the first working state and the second working state.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments can be implemented by a computer program instructing relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to memory, storage, database or other media used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM may take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described, but as long as such combinations are not contradictory, they should be considered within the scope of this specification. The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but this should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and improvements without departing from the concept of the present application, and these fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (15)

1. A method for determining a working state based on a digital person, the method comprising:
acquiring a target voice of a target user and a target image corresponding to the target voice when the target voice is sent out;
analyzing the working state based on the face features corresponding to the target image to obtain a first working state corresponding to the target user;
analyzing the working state based on the voice characteristics corresponding to the target voice to obtain a second working state corresponding to the target user;
and determining a target working state corresponding to the target user by combining the first working state and the second working state.
2. The method according to claim 1, wherein the analyzing the working state based on the face feature corresponding to the target image to obtain the first working state corresponding to the target user comprises:
acquiring the facial features corresponding to the target image, and processing the facial features by using a trained expression recognition model to obtain a target expression corresponding to the target user;
and analyzing the working state according to the target expression to obtain a first working state corresponding to the target user.
3. The method according to claim 1, wherein the number of the target images is multiple, and the analyzing the working state based on the face features corresponding to the target images to obtain the first working state corresponding to the target user comprises:
acquiring feature point positions respectively corresponding to a plurality of eye key feature points corresponding to the target image, and obtaining a target closing state corresponding to the eyes of the target user in the target image based on the position difference between the feature point positions;
sequencing the closed states of the targets according to the acquisition sequence corresponding to the target images to obtain a closed state sequence;
and analyzing the working state according to the closed state sequence to obtain a first working state corresponding to the target user.
4. The method according to claim 1, wherein the analyzing the working state based on the voice feature corresponding to the target voice to obtain the second working state corresponding to the target user comprises:
acquiring voice attribute information corresponding to the target voice, wherein the voice attribute information comprises at least one of voice speed information or voice tone change information;
and analyzing the working state based on the voice attribute information to obtain a second working state corresponding to the target user.
5. The method of claim 4, further comprising:
performing semantic emotion analysis on the target voice to obtain a target semantic emotion corresponding to the target voice;
the analyzing the working state based on the voice attribute information to obtain a second working state corresponding to the target user includes:
and analyzing the working state based on the voice attribute information and the target semantic emotion to obtain a second working state corresponding to the target user.
6. The method according to claim 1, wherein the first working state is a working state corresponding to each of a plurality of analysis dimensions, the second working state is a working state corresponding to each of a plurality of analysis dimensions, and determining the target working state corresponding to the target user according to the first working state and the second working state comprises:
calculating the number of the fatigue states in the first working state and the second working state;
and when the state quantity is greater than a preset threshold value or the proportion corresponding to the state quantity is greater than a preset proportion, determining that the target working state corresponding to the target user is exhausted.
7. The method of claim 1, further comprising:
acquiring a virtual user image corresponding to the target user;
and controlling the virtual user image to send out working state prompt information according to the target working state.
8. The method of claim 7, wherein said controlling the avatar to issue an operation state prompt message according to the target operation state comprises:
acquiring a face adjustment parameter corresponding to the first working state;
and performing image adjustment on the virtual user image according to the face adjustment parameter, and controlling the virtual user image subjected to image adjustment to send out working state prompt information according to the target working state.
9. The method of claim 1, wherein the determining the target operating state corresponding to the target user in combination with the first operating state and the second operating state comprises:
and inputting the first working state and the second working state into a comprehensive state determination model to obtain a target working state corresponding to the target user.
10. The method of claim 9, wherein the first operating state is obtained by processing the face features using a first state determination model, and wherein the second operating state is obtained by processing the speech features using a second state determination model, the method further comprising:
acquiring a first training sample, wherein the first training sample comprises training face features and a corresponding first state label, training voice features and a corresponding second state label, and a comprehensive state label;
inputting the training face features into a first state determination model to be trained to obtain a first prediction state;
inputting the training voice features into a second state determination model to be trained to obtain a second prediction state;
inputting the first prediction state and the second prediction state into a comprehensive state determination model to be trained to obtain a third prediction state;
obtaining a target model loss value based on the state difference between the first state label and the first predicted state, the state difference between the second state label and the second predicted state, and the state difference between the integrated state label and the third predicted state;
and adjusting model parameters of the first state determination model to be trained, the second state determination model to be trained and the comprehensive state determination model to be trained based on the target model loss value to obtain the first state determination model, the second state determination model and the comprehensive state determination model.
11. An apparatus for determining a digital person-based operation state, the apparatus comprising:
the image and voice acquisition module is used for acquiring a target voice of a target user and a target image corresponding to the target voice;
the first working state obtaining module is used for carrying out working state analysis based on the face features corresponding to the target image to obtain a first working state corresponding to the target user;
a second working state obtaining module, configured to perform working state analysis based on a voice feature corresponding to the target voice to obtain a second working state corresponding to the target user;
and the target working state determining module is used for determining a target working state corresponding to the target user by combining the first working state and the second working state.
12. The apparatus of claim 11, wherein the first operating state obtaining module is configured to:
acquiring the facial features corresponding to the target image, and processing the facial features by using a trained expression recognition model to obtain a target expression corresponding to the target user;
and analyzing the working state according to the target expression to obtain a first working state corresponding to the target user.
13. The apparatus of claim 11, wherein the target image is plural, and the first operation state obtaining module is configured to:
acquiring feature point positions respectively corresponding to a plurality of eye key feature points corresponding to the target image, and obtaining a target closing state corresponding to the eyes of the target user in the target image based on the position difference between the feature point positions;
sequencing the closed states of the targets according to the acquisition sequence corresponding to the target images to obtain a closed state sequence;
and analyzing the working state according to the closed state sequence to obtain a first working state corresponding to the target user.
14. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor realizes the steps of the method of any one of claims 1 to 10 when executing the computer program.
15. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 10.
CN202010847552.1A 2020-08-21 Working state determining method and device based on digital person and storage medium Active CN112183197B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010847552.1A CN112183197B (en) 2020-08-21 Working state determining method and device based on digital person and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010847552.1A CN112183197B (en) 2020-08-21 Working state determining method and device based on digital person and storage medium

Publications (2)

Publication Number Publication Date
CN112183197A true CN112183197A (en) 2021-01-05
CN112183197B CN112183197B (en) 2024-06-25

Family


Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109766766A (en) * 2018-12-18 2019-05-17 深圳壹账通智能科技有限公司 Employee work condition monitoring method, device, computer equipment and storage medium
US20190266999A1 (en) * 2018-02-27 2019-08-29 Microsoft Technology Licensing, Llc Empathetic personal virtual digital assistant
CN110399837A (en) * 2019-07-25 2019-11-01 深圳智慧林网络科技有限公司 User emotion recognition methods, device and computer readable storage medium
CN110807388A (en) * 2019-10-25 2020-02-18 深圳追一科技有限公司 Interaction method, interaction device, terminal equipment and storage medium
WO2020098074A1 (en) * 2018-11-12 2020-05-22 平安科技(深圳)有限公司 Face sample picture marking method and apparatus, computer device, and storage medium
CN111368609A (en) * 2018-12-26 2020-07-03 深圳Tcl新技术有限公司 Voice interaction method based on emotion engine technology, intelligent terminal and storage medium
CN111382648A (en) * 2018-12-30 2020-07-07 广州市百果园信息技术有限公司 Method, device and equipment for detecting dynamic facial expression and storage medium

Similar Documents

Publication Publication Date Title
CN107818798B (en) Customer service quality evaluation method, device, equipment and storage medium
US20190341025A1 (en) Integrated understanding of user characteristics by multimodal processing
US20210012777A1 (en) Context acquiring method and device based on voice interaction
WO2020135194A1 (en) Emotion engine technology-based voice interaction method, smart terminal, and storage medium
US20160086090A1 (en) Method and apparatus for tailoring the output of an intelligent automated assistant to a user
Lee et al. Modeling mutual influence of interlocutor emotion states in dyadic spoken interactions.
CN108197115A (en) Intelligent interactive method, device, computer equipment and computer readable storage medium
CN110570873B (en) Voiceprint wake-up method and device, computer equipment and storage medium
CN111461176A (en) Multi-mode fusion method, device, medium and equipment based on normalized mutual information
CN110705349A (en) Customer satisfaction recognition method, device, terminal and medium based on micro expression
KR101984283B1 (en) Automated Target Analysis System Using Machine Learning Model, Method, and Computer-Readable Medium Thereof
CN108920640A (en) Context acquisition methods and equipment based on interactive voice
CN110480656B (en) Accompanying robot, accompanying robot control method and accompanying robot control device
CN112818742A (en) Expression ability dimension evaluation method and device for intelligent interview
CN112926525A (en) Emotion recognition method and device, electronic equipment and storage medium
CN110705428A (en) Facial age recognition system and method based on impulse neural network
KR102670492B1 (en) Method and apparatus for psychological counselingusing artificial intelligence
JP4631464B2 (en) Physical condition determination device and program thereof
CN111339940B (en) Video risk identification method and device
Chiba et al. User modeling by using bag-of-behaviors for building a dialog system sensitive to the interlocutor’s internal state
CN112183197B (en) Working state determining method and device based on digital person and storage medium
CN112183197A (en) Method and device for determining working state based on digital person and storage medium
CN109389493A (en) Customized test question mesh input method, system and equipment based on speech recognition
CN112699236B (en) Deepfake detection method based on emotion recognition and pupil size calculation
CN114492579A (en) Emotion recognition method, camera device, emotion recognition device and storage device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant