CN117635785B - Method and system for generating worker protection digital person - Google Patents

Method and system for generating worker protection digital person

Info

Publication number
CN117635785B
CN117635785B (application CN202410095801.4A)
Authority
CN
China
Prior art keywords: representing, word, video, worker, emotion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410095801.4A
Other languages
Chinese (zh)
Other versions
CN117635785A (en)
Inventor
屠静
王亚
赵策
苏岳
万晶晶
李伟伟
颉彬
周勤民
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhuoshi Future Beijing technology Co ltd
Original Assignee
Zhuoshi Future Beijing technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhuoshi Future Beijing technology Co ltd
Priority to CN202410095801.4A
Publication of CN117635785A
Application granted
Publication of CN117635785B
Legal status: Active

Classifications

    • G06T 13/40: 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G06F 16/3343: Query execution using phonetics
    • G06F 16/338: Presentation of query results
    • G06F 16/353: Clustering; Classification into predefined classes
    • G06F 18/253: Fusion techniques of extracted features
    • G06F 40/211: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F 40/30: Semantic analysis
    • G06N 3/006: Artificial life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • G06N 3/042: Knowledge-based neural networks; Logical representations of neural networks
    • G06N 3/0442: Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G06N 3/0455: Auto-encoder networks; Encoder-decoder networks
    • G06N 5/022: Knowledge engineering; Knowledge acquisition
    • G06N 5/041: Abduction
    • G06T 13/205: 3D [Three Dimensional] animation driven by audio data
    • G10L 13/027: Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
    • G10L 13/033: Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L 13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique, using neural networks
    • G16H 70/00: ICT specially adapted for the handling or processing of medical references
    • G16H 80/00: ICT specially adapted for facilitating communication between medical practitioners or patients, e.g. for collaborative diagnosis, therapy or health monitoring

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Acoustics & Sound (AREA)
  • Biomedical Technology (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Human Computer Interaction (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Epidemiology (AREA)
  • Public Health (AREA)
  • Primary Health Care (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Pathology (AREA)

Abstract

The invention provides a method and a system for generating a caregiver digital person, relating to the technical field of data processing. The method comprises the following steps: acquiring caregiver professional knowledge and elderly-interest knowledge, and constructing a knowledge base; constructing a response model and training it with the knowledge base; collecting output results of the trained response model and performing emotion classification on them by combining a graph neural network with a syntax tree to obtain a plurality of emotion types; recording video samples and extracting video frames and audio streams from them; receiving a voice request of the care receiver, converting it into a text request, inputting the text request into the response model, and obtaining a target output result; analysing the target emotion type of the target output result; constructing a dual-mode encoding and decoding network and fusing the corresponding video frames and audio streams to obtain a caregiver digital person with audio features and video features; and having the caregiver digital person broadcast the target output result. The fidelity of the caregiver digital person and the conversation experience of the care receiver are thereby improved.

Description

Method and system for generating worker protection digital person
Technical Field
The invention relates to the technical field of data processing, and in particular to a method and a system for generating a caregiver digital person.
Background
With the growing trend of population ageing, more and more elderly people need long-term care and medical services. An ageing population means that more elderly people require care services while the number of young care workers is comparatively small, so market demand is difficult to meet. The shortage of elderly care has a significant impact on society and on families: it increases the burden on families and makes it difficult for some elderly people to obtain the care services they need, which may lead to more health problems and hospitalisations.
At present, with the progress of technology and in order to cope with the shortage of care workers, caregiver digital persons have appeared. However, existing caregiver digital persons usually only provide the elderly being cared for with emotionless answers and have no personality characteristics, so communication barriers arise between the digital person and the elderly, which conflicts with the elderly's need for emotional communication; moreover, real caregivers, lacking professional medical knowledge, often cannot accurately answer the medical consultations of the elderly being cared for.
Disclosure of Invention
The invention provides a method and a system for generating a caregiver digital person, which are used to solve the technical problems in the prior art that a caregiver digital person can only give the elderly being cared for emotionless answers and has no personality characteristics, so that communication barriers arise between the digital person and the elderly, which conflicts with the elderly's need for emotional communication, and that real caregivers, lacking professional knowledge, cannot accurately answer the medical consultations of the elderly being cared for.
The technical scheme provided by the invention is as follows:
First aspect
The invention provides a method for generating a caregiver digital person, which comprises the following steps:
S1: acquiring caregiver professional knowledge and elderly-interest knowledge, and constructing a knowledge base;
S2: constructing a response model by combining a BERT model and a graph neural network, and training the response model with the knowledge base;
S3: collecting output results of the trained response model, and performing emotion classification on the output results by combining the graph neural network and a syntax tree to obtain a plurality of emotion types;
S4: recording video samples of the caregiver to be simulated for each emotion type, and extracting video frames and audio streams of the different emotion types from the video samples;
S5: receiving a voice request of the care receiver, converting the voice request into a text request with a speech recognition tool, inputting the text request into the response model, and obtaining a target output result through the response model;
S6: analysing the target emotion type of the target output result;
S7: constructing a dual-mode encoding and decoding network with multi-layer LSTM, and fusing the video frames and audio streams of the corresponding emotion type according to the target emotion type to obtain a caregiver digital person with audio features and video features;
S8: broadcasting the target output result through the caregiver digital person.
Second aspect
The invention provides a caregiver digital person generating system, which comprises a processor and a memory for storing executable instructions of the processor; the processor is configured to invoke the instructions stored in the memory to perform the caregiver digital person generating method of the first aspect.
Compared with the prior art, the technical scheme has at least the following beneficial effects:
In the invention, a response model with a BERT model and a graph neural network is constructed and trained on professional knowledge and elderly-interest knowledge. BERT can capture the complex relations between words and sentences and thereby understand the user's emotional expression more accurately, giving the model multi-level semantic understanding and context awareness and providing more intelligent and accurate natural language processing. The specific emotion classification of the output result, namely positive, neutral or negative emotion, is then analysed with the graph neural network: the syntax tree provides the dependency information between words in the text, and the graph neural network extracts features at multiple levels, so the model can gradually understand the global structure and local features of the text and capture the emotional signals in the text more accurately. Finally, according to the output result of the response model, a caregiver digital person with emotion is built that is closer to the caregiver to be simulated in expression and audio, so that the generated caregiver digital person is more vivid, can effectively and accurately answer the various questions of the elderly being cared for, improves the satisfaction of the care receiver, and provides a better human-machine conversation experience.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a method for generating a caregiver digital person according to the present invention;
FIG. 2 is a schematic structural diagram of a response model according to the present invention;
FIG. 3 is a schematic structural diagram of a dual-mode encoding and decoding network according to the present invention.
Detailed Description
The technical scheme of the invention is described below with reference to the accompanying drawings.
In embodiments of the invention, words such as "exemplary," "such as" and the like are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, the term use of an example is intended to present concepts in a concrete fashion. Furthermore, in embodiments of the present invention, the meaning of "and/or" may be that of both, or may be that of either, optionally one of both.
In the embodiments of the present invention, "image" and "picture" may be sometimes used in combination, and it should be noted that the meaning of the expression is consistent when the distinction is not emphasized. "of", "corresponding (corresponding, relevant)" and "corresponding (corresponding)" are sometimes used in combination, and it should be noted that the meaning of the expression is consistent when the distinction is not emphasized.
In the embodiment of the present invention, sometimes a subscript such as W1 may be wrongly expressed in a non-subscript form such as W1, and the meaning of the subscript is consistent when the distinction is not emphasized.
In order to make the technical problems, technical solutions and advantages to be solved more apparent, the following detailed description will be given with reference to the accompanying drawings and specific embodiments.
The embodiment of the invention provides a method for generating a caregiver digital person, which can be implemented by a caregiver digital person generating device; the generating device can be a terminal or a server. The processing flow of the method for generating a caregiver digital person shown in FIG. 1 can comprise the following steps:
S1: Acquire caregiver professional knowledge and elderly-interest knowledge, and construct a knowledge base.
Caregiver professional knowledge refers to professional information and skills related to medical treatment, nursing, health management and geriatric care. Such knowledge includes, but is not limited to, the following. Diseases and health status: knowledge of common diseases, common health problems of the elderly, and their symptoms, treatment and care methods. Drug knowledge: familiarity with the use, dosage, side effects and interactions of various drugs, to ensure correct medication by the elderly. Nursing skills: mastery of care skills, including measuring blood pressure, assisting the elderly to move, hygiene care and wound care. Nutritional knowledge: understanding of the dietary needs of the elderly, including the management of special diets, insufficient intake and unbalanced diets. Psychological support: understanding of the mental health needs of the elderly, such as the management of loneliness, depression and anxiety. Elderly-interest knowledge refers to knowledge and understanding of the interests, hobbies, demands and preferences of the elderly; it helps to build personalised services for the caregiver digital person so that it can meet the needs of the elderly. Elderly-interest knowledge includes, but is not limited to, the following. Hobbies and interests: the elderly may have various interests such as reading, gardening, travel, music and art. Dietary preferences: knowledge of the eating preferences of the elderly, including particularly favoured foods and foods to be avoided. Daily activities: knowledge of the elderly's daily activities and habits, in order to provide customised advice and services. Training the subsequent model with this knowledge base helps to give the caregiver digital person higher practicality and personalisation, to provide better support and advice, to help the elderly manage their health and life better, to improve service efficiency and quality, and to meet the needs of the elderly.
In one possible embodiment, S1 specifically includes:
S101: acquiring caregiver professional knowledge and elderly-interest knowledge from caregiver associations and elderly welfare organisations, respectively;
S102: extracting the question content and the corresponding answer content according to the layout structure of the caregiver professional knowledge and the elderly-interest knowledge;
S103: constructing a knowledge base from the question content and the answer content (a minimal construction sketch is given below).
Referring to fig. 2 of the drawings, there is shown a schematic structural diagram of a response model provided by the present invention.
In FIG. 2, in addition to the BERT layer and the graph neural network layer, the response model includes the necessary input layer, max-pooling layer, fully connected layer and output layer, with the softmax activation layer embedded in the output layer. The input layer is the beginning of the model: it accepts the original text input (which may be a conversation turn or a text request) and prepares the text data in a form suitable for processing by the model, converting it into a numerical representation the model can understand. In the emotion analysis task, the max-pooling layer extracts the most important information from the text encoding; by applying a max-pooling operation to different parts of the encoding, key features in the text can be captured, which helps the emotion analysis model understand the input text better. The fully connected layer is a neural network layer that further processes the features obtained from the max-pooling layer; it typically comprises a number of neurons, each connected to every neuron of the previous layer, and its task is to map the extracted features to a suitable output space in preparation for emotion classification, learning how best to combine the input features for classification. The output layer defines the final output of the model; for the emotion analysis task it typically includes one neuron per emotion category, such as positive, neutral and negative emotion, and its goal is to translate the information passed from the previous layer into a probability distribution over the categories. The softmax activation layer embedded in the output layer converts the final model output into a probability distribution over the categories, which guides the final classification decision of the model and determines the likelihood that the input text belongs to the different emotion categories.
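By way of illustration only, the layer stack described above (input, BERT layer, graph neural network layer, max pooling, fully connected layer, output layer with softmax) might be sketched as follows in PyTorch. This is not the disclosed model itself: the checkpoint name, the hidden dimensions and the single round of graph aggregation are assumptions made for the example.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class ResponseModel(nn.Module):
    """Input -> BERT layer -> graph neural network layer -> max pooling
    -> fully connected layer -> output layer with softmax activation."""

    def __init__(self, bert_name="bert-base-chinese", gnn_dim=256, num_classes=3):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)      # BERT layer
        hidden = self.bert.config.hidden_size
        self.gnn_proj = nn.Linear(hidden, gnn_dim)            # graph neural network layer
        self.fc = nn.Linear(gnn_dim, gnn_dim)                 # fully connected layer
        self.out = nn.Linear(gnn_dim, num_classes)            # output layer

    def forward(self, input_ids, attention_mask, adjacency):
        # adjacency: (batch, seq_len, seq_len) normalized word-relation graph
        h = self.bert(input_ids=input_ids,
                      attention_mask=attention_mask).last_hidden_state
        h = torch.relu(adjacency @ self.gnn_proj(h))          # one round of graph aggregation
        h, _ = h.max(dim=1)                                   # max pooling over the tokens
        h = torch.relu(self.fc(h))
        return torch.softmax(self.out(h), dim=-1)             # softmax activation layer
```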
S2: and constructing a response model by combining the BERT model and the graph neural network, and training the response model by utilizing a knowledge base.
The response model comprises a BERT layer and a graphic neural network layer, wherein the BERT layer is connected with the graphic neural network layer, and the graphic neural network layer is used for receiving output data of the BERT layer.
The BERT (Bidirectional Encoder Representations from Transformers) model is a pre-trained natural language processing model, which performs pre-training on large-scale text data to learn context relations between words and sentences, and the key features of BERT are a bidirectional (i.e. the information of context can be considered simultaneously) learning mode and a transducer architecture, which can encode the text data into word vectors with rich contexts. A graph neural network is a machine learning model used to process graph data. In graph data, the relationships between nodes may be complex, and the graph neural network may effectively capture dependencies and relationships between nodes. It includes two main parts of node representation learning and graph structure modeling, and is generally used for tasks such as node classification, graph classification, and link prediction on graph data.
It should be noted that BERT can capture rich semantic information in text data, so it can be used to understand natural language input of a user, which enables a response model to better understand questions or requests of a user, considering not only information of a single sentence, but also knowledge bases of the context information geriatric care field, which can be represented as graph data, where complex relationships exist between concepts and knowledge points. The graph neural network is suitable for processing such graph data with complex relationships, and by connecting the output of the BERT to the graph neural network layer, information in the knowledge base can be better understood and queried. Combining BERT and a graph neural network allows the answer model to provide personalized responses, as it can provide specific suggestions and information to each user based on contextual information and expertise in the knowledge base, which improves the accuracy and practicality of the answer model. The BERT is pre-trained on a large scale to provide general natural language understanding capability, which allows the response model to benefit from the pre-training knowledge of BERT, and then fine-tuning in a specific domain is performed through the graph neural network, so that the response model is quickly adapted to the field of geriatric care.
By combining the BERT model and the graphic neural network, the response model can better understand user input, query a knowledge base and provide personalized and high-quality response, and the combination fully utilizes the advantages of natural language understanding and graphic data processing and provides a powerful tool for intelligent response in the field of aged care.
Specifically, the training of the response model by the knowledge base includes the steps of: first, the knowledge base includes user questions and answers, related topics or pieces of information in the knowledge base. The corresponding questions and corresponding answers in the knowledge base are input into the answer model, and in the training process, the model gradually learns how to understand user inquiry, how to inquire the knowledge base to acquire relevant information and how to generate corresponding answers, and in the training process, the performance of the model is measured by using a loss function, and model parameters are continuously optimized through an optimization algorithm. After training, the performance of the model is evaluated, using the test dataset, and the accuracy and performance of the model in answering various questions is checked, and if necessary, the model is tuned, for example by fine tuning parameters or adding training data to improve performance. Once the model performs well in the assessment, it can be deployed into a practical application, integrating the trained model into a caretaker digital person or other service so that it can provide real-time answers based on user queries and information in the knowledge base. The knowledge base is maintained and updated periodically to ensure that the information therein remains up to date, and furthermore, iterations and updates of the model can be performed to improve performance and adapt to new situations.
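By way of illustration only, the training, evaluation and tuning cycle described above might be organised as in the following sketch. The batch format (a dict of tensors plus a "label" field), the batch size and the learning rate are assumptions, and the model is assumed to return class probabilities as in the sketch above.

```python
import torch
from torch.utils.data import DataLoader

def train_answer_model(model, train_set, val_set, epochs=5, lr=2e-5):
    """Optimize the model parameters against a loss function, then evaluate
    accuracy on a held-out split (the "test dataset" of the description)."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    train_loader = DataLoader(train_set, batch_size=16, shuffle=True)
    val_loader = DataLoader(val_set, batch_size=16)

    for epoch in range(epochs):
        model.train()
        for batch in train_loader:               # batches built from knowledge-base Q&A pairs
            inputs = {k: v.to(device) for k, v in batch.items() if k != "label"}
            labels = batch["label"].to(device)
            optimizer.zero_grad()
            probs = model(**inputs)              # model outputs class probabilities
            loss = torch.nn.functional.nll_loss(torch.log(probs.clamp_min(1e-9)), labels)
            loss.backward()
            optimizer.step()

        model.eval()                             # evaluation on the held-out split
        correct = total = 0
        with torch.no_grad():
            for batch in val_loader:
                inputs = {k: v.to(device) for k, v in batch.items() if k != "label"}
                labels = batch["label"].to(device)
                pred = model(**inputs).argmax(dim=-1)
                correct += (pred == labels).sum().item()
                total += labels.numel()
        print(f"epoch {epoch}: validation accuracy {correct / max(total, 1):.3f}")
```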
S3: and collecting output results of the trained response model, and carrying out emotion classification on the output results by combining a graphic neural network and a syntax tree to obtain a plurality of emotion types.
Among the emotion types include positive emotion, neutral emotion and negative emotion.
Specifically, first, the trained answer model generates answers to the user query, which are natural language sentences or paragraphs in text form. For each generated answer, a syntactic tree is a data structure representing its syntactic structure, which shows the dependencies, e.g., subject, verb, object, etc., between the individual words in the sentence, which may be generated by syntactic analysis techniques. Once the syntactic tree of answers is present, the next task is to categorize the answers for emotion, which is intended to determine the emotion or emotion type conveyed by the answer, e.g., positive, neutral, negative, etc., which may aid in understanding the emotion color of the answer, as well as the user's possible reactions. In the emotion classification process, the graph neural network is used for processing information in the syntax tree, and the graph neural network can effectively capture the dependency relationship and semantic information among words in the syntax tree, so that emotion expression and emotion context in an answer can be understood more accurately. After analysis in combination with the neural network of graphs and the syntactic tree, the answers can be classified into a plurality of emotion types, such as positive, neutral, and negative, so that the model can identify and label the different emotion elements present in the answers. The model can better understand emotion intonation of the answer by analyzing emotion colors of the answer, and can better process grammar structures and contexts by combining a graph neural network and a syntax tree so as to obtain an accurate emotion classification result.
In one possible embodiment, S3 specifically includes:
S301: Constructing a directed graph representing the word-sentence relationship trend in the output result.
In one possible implementation, S301 specifically includes:
S3011: Performing text preprocessing on the output result to obtain the sentence set S:

S = {S_1, S_2, ..., S_N},  S_i = {w_1, w_2, ..., w_n}

where S_N denotes the N-th of the N sentences of the output result and w_n denotes the n-th of the n words in the i-th sentence S_i;
S3012: Inputting the i-th sentence S_i into the BERT model and extracting the corresponding word vectors e_1, e_2, ..., e_n;
S3013: Constructing a word vector matrix E_0 from the word vectors:

E_0 = [e_1, e_2, ..., e_n] ∈ R^(n×d_0)

where R denotes the real number field, d_0 denotes the vector dimension, and the word vectors are taken from the last hidden-layer state of the BERT layer;
S3014: Taking, for an aspect word containing k words, the word vectors of those k words as the aspect word vector

E_t = {e_(t_1), e_(t_2), ..., e_(t_k)}

where an aspect word is a clause of a sentence comprising several words, e_(t_j) denotes the word vector of the j-th of the k words in the aspect word, e_i denotes the word vector of the i-th word of the sentence, and m denotes the total number of aspect words;
S3015: Converting the original sentence corresponding to the output result into a dependency syntax tree by means of a biaffine dependency parser, determining the aspect words in the original sentence, connecting each word of the original sentence with the aspect words based on the aspect word vectors, and constructing a word-vector relation graph with the dependency relations as edges, the words as nodes and the aspect words as roots;
S3016: Calculating the semantic distance between a word w_i and the associated aspect word from the minimum distances d_(ij) between w_i and the j-th of the l words of the aspect word in the sentence to which the aspect word belongs, where l denotes the number of words in the aspect word and T denotes a distance threshold; any minimum distance greater than the distance threshold is set to T;
S3017: Calculating all semantic distances in the word-vector relation graph to obtain a distance matrix D;
S3018: Combining the word vector matrix E_0 and the distance matrix D to construct a directed graph G representing the word-sentence relationship trend.
It should be noted that by combining the word vector matrix and the distance matrix a directed graph is constructed in which the words are nodes and the dependency relations are directed edges; the directed graph represents the semantic and dependency relations between the words in the text and helps to analyse the structure and associations of the text. The directed graph can be used for further text analysis, such as emotion analysis, to understand the emotional tendencies and emotional context in the text, so that the model understands the text more deeply and provides more accurate emotion classification and text analysis. One possible construction of the distance matrix is sketched below.
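By way of illustration only, the distance matrix of S3016 and S3017 (minimum tree distance to the aspect words, clipped at the threshold T, with every word connected to every aspect word) might be computed as in the following sketch; the dependency edges are assumed to come from an external parser, and the default threshold value is an assumption.

```python
import numpy as np
from collections import deque

def build_distance_matrix(num_words, dep_edges, aspect_idx, T=4):
    """Distance matrix D of the directed graph: for every word, the tree
    distance to the aspect words, truncated at the threshold T."""
    # adjacency list of the (undirected) dependency tree
    adj = [[] for _ in range(num_words)]
    for head, dep in dep_edges:
        adj[head].append(dep)
        adj[dep].append(head)

    def tree_dist(src):
        # breadth-first search distance from one aspect word to every word
        dist = np.full(num_words, np.inf)
        dist[src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if np.isinf(dist[v]):
                    dist[v] = dist[u] + 1
                    queue.append(v)
        return dist

    # semantic distance of every word to the aspect term: minimum tree distance
    # over the aspect words, with distances above the threshold set to T
    d_to_aspect = np.min([tree_dist(a) for a in aspect_idx], axis=0)
    d_to_aspect = np.minimum(d_to_aspect, T)

    # every context word is connected with every aspect word (aspect words as roots)
    D = np.zeros((num_words, num_words))
    for a in aspect_idx:
        D[:, a] = d_to_aspect
        D[a, :] = d_to_aspect
    return D
```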
S302: feature aggregation based on graph neural network is carried out on the directed graph.
In one possible implementation, S302 is specifically:
S3021: Feature aggregation is carried out on the directed graph with the graph neural network:

H^(l+1) = σ( D'^(-1/2) · A' · D'^(-1/2) · H^(l) · W^(l) )

where H^(l+1) denotes the feature output obtained by aggregation at the l-th layer of the graph neural network, W^(l) ∈ R^(d_l × d_(l+1)) denotes a learnable feature transfer matrix, d_l and d_(l+1) denote the feature-vector dimensions before and after the transfer, A' = D + I denotes the distance matrix after self-connection with I the identity matrix of corresponding dimension, σ denotes a nonlinear activation function, D'^(-1/2) · A' · D'^(-1/2) denotes the symmetric normalized version of the self-connected distance matrix, D' denotes the degree matrix corresponding to the self-connected distance matrix, and D'^(-1) · A' corresponds to a transition probability matrix in a Markov chain.
It should be noted that by using the graph neural network to integrate and update the features of the nodes of the directed graph constructed in S301 through multi-layer feature transfer and aggregation, the structure and relations of the text can be understood better, which facilitates the subsequent emotion classification or other text analysis tasks; the key advantage of the graph neural network is that it can capture the complex dependency relations and context information between nodes, thereby improving text understanding performance. A minimal sketch of such an aggregation layer is given below.
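By way of illustration only, one feature-aggregation layer of the kind written above (self-connection, symmetric normalisation, learnable transfer matrix, nonlinear activation) might look like the following PyTorch sketch.

```python
import torch
import torch.nn as nn

class GraphAggregationLayer(nn.Module):
    """One aggregation layer over the directed graph:
    H^(l+1) = sigma( D'^(-1/2) A' D'^(-1/2) H^(l) W^(l) ) with A' = D + I."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)    # learnable transfer matrix W^(l)

    def forward(self, H, D):
        # H: (n, in_dim) node features, D: (n, n) distance matrix of the graph
        A = D + torch.eye(D.size(0), device=D.device)      # self-connection A' = D + I
        deg = A.sum(dim=1).clamp(min=1e-12)                 # degrees of A'
        d_inv_sqrt = deg.pow(-0.5)
        A_norm = d_inv_sqrt.unsqueeze(1) * A * d_inv_sqrt.unsqueeze(0)  # symmetric normalisation
        return torch.relu(A_norm @ self.W(H))               # nonlinear activation
```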
S3022: and calculating the probability values of the characteristic aggregated characteristic output belonging to different emotion types through a soft activation layer of the response model.
In one possible implementation, S303 is specifically:
S3031: Inputting the feature output obtained by aggregation into the softmax activation layer, with the cross-entropy loss function L_CE as the objective function, and calculating the probability value p_(i,j) that the i-th aspect word t_i in the s-th sentence belongs to the j-th emotion type:

p_(i,j) = softmax( W_c · h_(t_i)^(L) + b_c )_j ,  j = 1, ..., C

L_CE = - Σ_(s=1)^(S) Σ_(i=1)^(M) y_i · log p_i

where h_(t_i)^(L) denotes the feature output of the aspect word t_i obtained by aggregation through the L layers of the graph neural network, softmax denotes the softmax activation layer, W_c denotes the network parameters of the softmax layer, C denotes the number of emotion types, b_c denotes the bias vector, y_i denotes the one-hot vector representation of the class label to which the i-th aspect word t_i belongs, and S and M denote the total number of sentences and the total number of aspect words, respectively.
The cross-entropy loss function is a common loss function for multi-class classification tasks. It measures the difference between the model's predicted probability distribution and the actual labels; here it is used as the objective function to measure the difference between the model's predicted probabilities and the actual emotion labels. The feature outputs are the node features aggregated through the layers of the graph neural network; in the emotion analysis task they are used to predict the probability that each aspect word belongs to the different emotion types. The softmax activation layer is an activation function for multi-class classification tasks that maps the feature outputs to a probability distribution in which each emotion type corresponds to one probability, so for each aspect word and each emotion type a corresponding emotion probability can be calculated. A minimal sketch of this classification head is given below.
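By way of illustration only, the softmax classification head and the cross-entropy objective described above might be realised as in the following sketch; the parameter names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AspectEmotionClassifier(nn.Module):
    """Map the aggregated feature of each aspect word to a probability
    distribution over the C emotion types."""

    def __init__(self, feat_dim, num_classes=3):
        super().__init__()
        self.proj = nn.Linear(feat_dim, num_classes)   # parameters W_c and bias b_c

    def forward(self, aspect_feats):
        # aspect_feats: (num_aspects, feat_dim), output of the last GNN layer
        return F.softmax(self.proj(aspect_feats), dim=-1)

def emotion_loss(probs, labels):
    """Cross entropy between the predicted probabilities and the class labels."""
    return F.nll_loss(torch.log(probs.clamp_min(1e-9)), labels)

# step S304: output the emotion type with the maximum probability value
# predicted_type = probs.argmax(dim=-1)
```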
S304: and outputting the emotion type corresponding to the maximum probability value.
It should be noted that when a single aspect word is taken as the root, the topology of the corresponding directed graph is a star network centred on that aspect word; considering that a sentence instance may contain several aspect words, it is impractical to construct an induced tree for each aspect word and to train them separately, so the invention provides a new directed-graph construction strategy based on the minimum tree distance. Specifically, all aspect words in the aspect word set are used as tree roots at the same time, all context words are connected with all aspect words at the same time, and the connecting edges carry the minimum-distance information; in other words, the induced structure based on the original dependency syntax tree of the sentence is a star-like network centred on the aspect word set, equivalent to the combination of the topologies corresponding to the individual aspect words. Compared with treating the distance information as an additional feature sequence, this directed-graph construction can encode the interaction between the aspect words and the context into the representation of the aspect words, thereby improving the subsequent aspect-level emotion classification performance. Compared with the traditional minimum-distance directed graph, the construction strategy proposed here avoids the larger analysis errors of a star-shaped directed graph; the directed-graph construction based on a syntax tree with all aspect words as roots can accomplish fine-grained text emotion analysis and improves the efficiency of emotion analysis, while the introduced relative position information supervises the sentence distance, which guarantees the syntactic quality and hence the accuracy of the emotion analysis.
S4: recording video samples of emotion types of a worker to be simulated, and extracting video frames and audio streams of different emotion types from the video samples.
In one possible embodiment, S4 specifically includes:
S401: recording video samples of the caregiver to be simulated for each emotion type with a camera, and labelling each video sample and the corresponding sound sample with its emotion type, wherein the video samples contain the whole face of the caregiver to be simulated;
S402: splitting the video into individual video frames with a video editing tool;
S403: extracting the facial feature points of the caregiver to be simulated in the video frames with the Dlib toolkit;
S404: synthesising the facial features through animation techniques to obtain video frames for the different emotion types;
S405: separating the audio from the video samples, sampling the separated audio, and extracting the sound frequency, volume and pitch for the different emotion types;
S406: synthesising the sound frequency, volume and pitch with a speech synthesis tool to obtain audio streams for the different emotion types.
It should be noted that video samples of a real caregiver are recorded and the multimodal data of the real caregiver are extracted so as to simulate the caregiver digital person in different emotional states: the different emotion types are labelled, the video is split into frames, the facial features are extracted, video frames for the different emotions are synthesised, the audio is separated from the video and the audio for the different emotions is sampled and synthesised, creating multimodal data with vivid facial expression and sound so that the caregiver digital person can express the various emotional states more vividly. A minimal extraction sketch is given below.
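By way of illustration only, the frame splitting, Dlib landmark extraction and audio separation of S402, S403 and S405 might be sketched as follows with OpenCV, dlib and moviepy; the landmark model file and the helper function names are assumptions.

```python
import cv2
import dlib
from moviepy.editor import VideoFileClip

detector = dlib.get_frontal_face_detector()
# standard 68-point landmark model shipped separately by dlib (assumed to be available)
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def extract_frames_and_landmarks(video_path, emotion_label):
    """Split a labelled video sample into frames and extract the facial
    landmark points of the caregiver in each frame (S402, S403)."""
    frames, landmarks = [], []
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()
    while ok:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = detector(gray)
        if faces:
            shape = predictor(gray, faces[0])
            landmarks.append([(shape.part(i).x, shape.part(i).y) for i in range(68)])
            frames.append(frame)
        ok, frame = cap.read()
    cap.release()
    return {"emotion": emotion_label, "frames": frames, "landmarks": landmarks}

def separate_audio(video_path, wav_path):
    """Separate the audio track of the video sample (S405) for later sampling
    of frequency, volume and pitch."""
    VideoFileClip(video_path).audio.write_audiofile(wav_path)
```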
S5: and receiving a voice request of the nursing object, extracting the voice request into a text request by using a voice synthesis tool, inputting the text request into a response model, and obtaining a target output result through the response model.
In one possible implementation, the speech synthesis tool includes OpenText, eSpeak, deepSpeech and PocketSphinx.
It should be noted that, the method of receiving the voice request of the nursing object may be a method that the nursing object interacts with the digital nursing staff through voice, such as raising a problem, raising a requirement, etc., then converting the voice request into a text request through a voice synthesis tool, and then inputting the text request into a response model of the nursing staff, where the model processes the text request using technologies such as BERT and a neural network, etc., and generates a target output result to meet the requirement of the nursing object, and this series of steps enables the nursing staff to interact with the nursing object in a natural language manner, understand the requirement and provide corresponding feedback.
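By way of illustration only, the speech-to-text step might be carried out with the PocketSphinx backend of the speech_recognition package, as in the following sketch; this is just one of several possible tools, and the helper names are assumptions.

```python
import speech_recognition as sr

def voice_request_to_text(wav_path):
    """Convert the care receiver's recorded voice request into a text request."""
    recognizer = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        audio = recognizer.record(source)
    # PocketSphinx runs offline; other backends of the package could be used instead
    return recognizer.recognize_sphinx(audio)

# the text request is then fed into the trained response model, e.g.
# text = voice_request_to_text("request.wav")
# probs = response_model(**tokenize(text))   # tokenize() is a hypothetical helper
```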
S6: and analyzing the target emotion type of the target output result.
Referring to fig. 3 of the specification, there is shown a schematic structural diagram of the dual-mode encoding and decoding network according to the present invention.
S7: Constructing a dual-mode encoding and decoding network with multi-layer LSTM, and fusing the video frames and audio streams of the corresponding emotion type according to the target emotion type to obtain a caregiver digital person with audio features and video features.
The dual-mode encoding and decoding network comprises an audio encoding module and a video encoding module, each built from a plurality of self-attention encoding layers, and a decoding module composed of a plurality of LSTM network layers and a self-attention mechanism layer; the audio encoding module and the video encoding module are connected to the decoding module, and the decoding module receives the output data of the audio encoding module and the video encoding module.
The dual-mode encoding and decoding network is a neural network structure for processing two types of data, audio and video, at the same time. The network used to generate the caregiver digital person has two encoding modules, one for the audio data and one for the video data, and the outputs of the two modules are fused in the decoding module to generate the final digital agent. The LSTM (long short-term memory) network is a variant of the recurrent neural network (RNN) dedicated to processing sequence data and able to capture long-term dependencies; the advantage of using LSTM network layers in the decoding module is that they model sequence data better, including time dependencies and long-term dependencies, and can learn and generate more realistic emotional expression, which gives the generated caregiver digital person better performance and flexibility in interaction and emotional communication. The self-attention mechanism is a neural network layer that allows the network to assign different attention weights at different time steps or spatial positions, so as to better capture the important information in the input data. Audio encoding module: the part of the network used to process and encode the audio data, extracting relevant features such as frequency, volume and pitch. Video encoding module: similar to the audio encoding module, it processes and encodes the video data and extracts video features such as facial expression. Decoding module: it receives the outputs of the audio encoding module and the video encoding module and fuses them to produce a caregiver digital person with audio and video characteristics.
Compared with traditional single-mode approaches (considering only audio or only video), the dual-mode encoding and decoding network can take audio and video data into account at the same time and thus provides richer information for generating a finer and more realistic digital agent. By processing audio and video data together, the network can better capture human emotion and dynamically adjust the fusion of audio and video according to the different emotion types, so that the generated digital agent is suited to specific emotional needs, which improves the interaction quality and personalisation. The neural network has learning ability and can be improved continuously with data, so the performance of the caregiver digital person can be continuously optimised to adapt to different emotional needs, which makes the approach more flexible than traditional methods; it is more comprehensive, can express emotion, is better suited to different emotional needs, and makes the generated caregiver digital person more interactive, adaptive and appealing than existing methods. A minimal architectural sketch is given below.
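By way of illustration only, the dual-mode encoding and decoding network described above (audio and video encoding modules built from self-attention layers and a decoding module built from LSTM layers and a self-attention layer) might be sketched as follows in PyTorch; all dimensions and layer counts are assumptions.

```python
import torch
import torch.nn as nn

class DualModalCodecNetwork(nn.Module):
    """Audio encoding module + video encoding module feeding a decoding module
    made of stacked LSTM layers followed by a self-attention layer."""

    def __init__(self, audio_dim=80, video_dim=512, model_dim=256, lstm_layers=2):
        super().__init__()
        self.audio_in = nn.Linear(audio_dim, model_dim)
        self.video_in = nn.Linear(video_dim, model_dim)
        audio_layer = nn.TransformerEncoderLayer(d_model=model_dim, nhead=4, batch_first=True)
        video_layer = nn.TransformerEncoderLayer(d_model=model_dim, nhead=4, batch_first=True)
        self.audio_encoder = nn.TransformerEncoder(audio_layer, num_layers=2)  # audio encoding module
        self.video_encoder = nn.TransformerEncoder(video_layer, num_layers=2)  # video encoding module
        self.decoder_lstm = nn.LSTM(2 * model_dim, model_dim,
                                    num_layers=lstm_layers, batch_first=True)  # multi-layer LSTM
        self.decoder_attn = nn.MultiheadAttention(model_dim, num_heads=4, batch_first=True)
        self.out = nn.Linear(model_dim, model_dim)  # frame-wise digital-person parameters

    def forward(self, audio_feats, video_feats):
        # audio_feats: (B, T, audio_dim), video_feats: (B, T, video_dim), time-aligned
        a = self.audio_encoder(self.audio_in(audio_feats))
        v = self.video_encoder(self.video_in(video_feats))
        joint = torch.cat([a, v], dim=-1)            # fuse the two modalities
        h, _ = self.decoder_lstm(joint)
        h, _ = self.decoder_attn(h, h, h)            # self-attention over the LSTM states
        return self.out(h)
```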
In one possible implementation, S7 specifically includes:
S701: Parsing the video frames X_v and the audio stream X_a with the multimodal feature fusion network to obtain the joint feature F:

F = Joint( softmax( cat(Q_a, Q_v) · cat(K_a, K_v)^T ) · cat(V_a, V_v) )

where Q_a, K_a, V_a denote the query, key and value vectors of the self-attention mechanism when the audio stream is parsed, Q_v, K_v, V_v denote the query, key and value vectors of the self-attention mechanism when the video frames are parsed, the symbol "Joint" denotes the fusion operation in the self-attention mechanism, the symbol "cat" denotes the connection (concatenation) operation in the self-attention mechanism, and softmax denotes the activation function of the softmax layer;
S702: Decoding the joint feature through the decoding module to obtain the caregiver digital person:

y_t = λ · p_att( y_t | F ⊕ y_(t-1) ) + (1 - λ) · p_lstm( y_t | F ⊕ y_(t-1) )

where y denotes the decoded output result of the caregiver digital person, p_att denotes the probability parameters provided by the self-attention decoder, p_lstm denotes the probability parameters provided by the LSTM network layer, y_(t-1) denotes the decoded output result at time t-1, λ denotes a hyper-parameter, and the symbol "⊕" denotes splicing.
It should be noted that the video frames and the audio stream are combined into a joint feature so as to describe the emotional expression of the caregiver digital person more completely; the multimodal feature fusion network uses the self-attention mechanism as a way of assigning weights between the different modalities, and through this mechanism the key video and audio features can be identified for the subsequent emotion synthesis. In the decoding process, the joint feature obtained above is translated by the decoding module into the actual output of the caregiver digital person, including speech, facial expression and action; the LSTM network helps to keep the timing information consistent, and the decoding module generates output results, which may be audio, video or other forms of interaction, from the input joint feature so as to meet the needs of the care receiver. This ensures the effective fusion of the video and audio features and the naturalness and continuity of the emotional expression; compared with the prior art, the multimodal caregiver digital person constructed in this way can provide a more diverse, vivid and natural interactive experience, better meets the needs of the care receiver, integrates multiple perception modalities (visual and auditory), and can maintain consistency in the output process, thereby providing an interaction that better matches human expectations. A minimal sketch of the blended decoding step is given below.
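By way of illustration only, the blending of the two probability sources in S702 (self-attention decoder and LSTM layer, weighted by the hyper-parameter lambda) might be written as in the following sketch.

```python
import torch

def combine_decoder_probs(p_att, p_lstm, lam=0.5):
    """Blend the probability parameters of the self-attention decoder and of
    the LSTM layer with the hyper-parameter lambda and pick the output."""
    # p_att, p_lstm: (batch, dim) distributions for the current decoding step,
    # both conditioned on the joint feature spliced with the previous output
    probs = lam * p_att + (1.0 - lam) * p_lstm
    return probs.argmax(dim=-1), probs
```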
S8: and broadcasting the target output result by the worker protecting digital person.
Compared with the prior art, the technical scheme has at least the following beneficial effects:
In the invention, a response model with a BERT model and a graph neural network is constructed and trained on professional knowledge and elderly-interest knowledge. BERT can capture the complex relations between words and sentences and thereby understand the user's emotional expression more accurately, giving the model multi-level semantic understanding and context awareness and providing more intelligent and accurate natural language processing. The specific emotion classification of the output result, namely positive, neutral or negative emotion, is then analysed with the graph neural network: the syntax tree provides the dependency information between words in the text, and the graph neural network extracts features at multiple levels, so the model can gradually understand the global structure and local features of the text and capture the emotional signals in the text more accurately. Finally, according to the output result of the response model, a caregiver digital person with emotion is built that is closer to the caregiver to be simulated in expression and audio, so that the generated caregiver digital person is more vivid, can effectively and accurately answer the various questions of the elderly being cared for, improves the satisfaction of the care receiver, and provides a better human-machine conversation experience.
The invention further provides a caregiver digital person generating system, which comprises a processor and a memory for storing executable instructions of the processor; the processor is configured to invoke the instructions stored in the memory to perform the caregiver digital person generating method described above.
The caregiver digital person generating system can perform the caregiver digital person generating method described above and achieve the same or similar technical effects, which are not repeated here to avoid repetition.
Compared with the prior art, the technical scheme has at least the following beneficial effects:
In the invention, a response model with a BERT model and a graph neural network is constructed and trained on professional knowledge and elderly-interest knowledge. BERT can capture the complex relations between words and sentences and thereby understand the user's emotional expression more accurately, giving the model multi-level semantic understanding and context awareness and providing more intelligent and accurate natural language processing. The specific emotion classification of the output result, namely positive, neutral or negative emotion, is then analysed with the graph neural network: the syntax tree provides the dependency information between words in the text, and the graph neural network extracts features at multiple levels, so the model can gradually understand the global structure and local features of the text and capture the emotional signals in the text more accurately. Finally, according to the output result of the response model, a caregiver digital person with emotion is built that is closer to the caregiver to be simulated in expression and audio, so that the generated caregiver digital person is more vivid, can effectively and accurately answer the various questions of the elderly being cared for, improves the satisfaction of the care receiver, and provides a better human-machine conversation experience.
The above embodiments may be implemented in whole or in part by software, hardware (e.g., circuitry), firmware, or any combination thereof. When implemented in software, the above-described embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product comprises one or more computer instructions or computer programs. When the computer instructions or computer programs are loaded or executed on a computer, the processes or functions described in accordance with embodiments of the present invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center by wire or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium. The semiconductor medium may be a solid state disk.
It should be understood that the term "and/or" is merely an association relationship describing the associated object, and means that three relationships may exist, for example, a and/or B may mean: there are three cases, a alone, a and B together, and B alone, wherein a, B may be singular or plural. In addition, the character "/" herein generally indicates that the associated object is an "or" relationship, but may also indicate an "and/or" relationship, and may be understood by referring to the context.
In the present invention, "at least one" means one or more, and "a plurality" means two or more. "at least one of" or the like means any combination of these items, including any combination of single item(s) or plural items(s). For example, at least one (one) of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, wherein a, b, c may be single or plural.
It should be understood that, in various embodiments of the present invention, the sequence numbers of the foregoing processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic thereof, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, specific working procedures of the apparatus, device and unit described above may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus, device and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another device, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, or the part thereof that contributes to the prior art, or a part of the technical solution, may essentially be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method for generating a worker protection digital person, comprising:
S1: acquiring professional knowledge of a worker and interest knowledge of the elderly, and constructing a knowledge base;
S2: constructing a response model by combining a BERT model and a graph neural network, and training the response model by using the knowledge base;
S3: collecting output results of the trained response model, and performing emotion classification on the output results by combining the graph neural network and a syntax tree to obtain a plurality of emotion types;
S4: recording video samples of the worker to be simulated for the emotion types, and extracting video frames and audio streams of different emotion types from the video samples;
S5: receiving a voice request of a care recipient, converting the voice request into a text request by using a speech-to-text tool, inputting the text request into the response model, and obtaining a target output result through the response model;
S6: analyzing the target emotion type of the target output result;
S7: constructing a dual-mode coding and decoding network with multi-layer LSTMs, and fusing the video frames of the corresponding emotion type with the audio stream according to the target emotion type to obtain a worker protection digital person with audio characteristics and video characteristics;
S8: broadcasting the target output result through the worker protection digital person.
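For illustration only, and not as part of the claimed method, the following Python sketch shows how steps S1 to S8 could be orchestrated; every class and method name used here (KnowledgeBase, ResponseModel, speech_to_text, sample_bank, etc.) is a hypothetical stand-in for the components detailed in the dependent claims.

    # Hypothetical orchestration of steps S1-S8; all collaborators are assumed interfaces.
    class CaregiverDigitalPersonPipeline:
        def __init__(self, knowledge_base, response_model, codec_network, speech_to_text, sample_bank):
            self.kb = knowledge_base        # S1: professional + interest knowledge base
            self.model = response_model     # S2: BERT + graph neural network response model
            self.codec = codec_network      # S7: dual-mode LSTM coding/decoding network
            self.stt = speech_to_text       # S5: speech-to-text front end
            self.samples = sample_bank      # S4: {emotion_type: (video_frames, audio_stream)}

        def handle_request(self, voice_request):
            text_request = self.stt(voice_request)            # S5: voice request -> text request
            answer = self.model.answer(text_request)          # S5: target output result
            emotion = self.model.classify_emotion(answer)     # S3/S6: target emotion type
            frames, audio = self.samples[emotion]             # S4: samples recorded for that emotion
            digital_person = self.codec.fuse(frames, audio)   # S7: fuse audio and video features
            digital_person.broadcast(answer)                  # S8: broadcast the target output result
            return digital_person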
2. The method for generating a worker protection digital person according to claim 1, wherein the step S1 specifically comprises:
S101: acquiring the professional knowledge of the worker and the interest knowledge of the elderly through a nursing association and an elderly welfare organization, respectively;
S102: disassembling question content and corresponding answer content according to the layout structure of the professional knowledge and the interest knowledge;
S103: constructing the knowledge base from the question content and the answer content.
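Purely as an illustration of S101 to S103, the minimal sketch below assumes the source material has already been disassembled into question/answer pairs; the dict keys used here are hypothetical and not taken from the patent.

    def build_knowledge_base(professional_docs, elderly_interest_docs):
        # Each input is assumed to be an iterable of {"question": ..., "answer": ...} dicts
        # obtained by disassembling the layout structure of the source documents (S102).
        knowledge_base = []
        for entry in list(professional_docs) + list(elderly_interest_docs):
            knowledge_base.append({
                "question": entry["question"].strip(),
                "answer": entry["answer"].strip(),
            })
        return knowledge_base   # S103: question/answer pairs used to train the response model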
3. The method for generating a worker protection digital person according to claim 1, wherein the step S3 specifically comprises:
S301: constructing a directed graph representing the relational trend of words and sentences in the output result;
S302: performing graph-neural-network-based feature aggregation on the directed graph;
S303: calculating, through a softmax activation layer of the response model, the probability values of the aggregated feature output belonging to the different emotion types;
S304: outputting the emotion type corresponding to the maximum probability value.
4. The method for generating a worker protection digital person according to claim 3, wherein the step S301 specifically comprises:
S3011: performing text preprocessing on the output result to obtain a sentence set $S$:
$S = \{S_1, S_2, \ldots, S_N\}$, $S_i = \{w_1, w_2, \ldots, w_n\}$,
wherein $S_N$ indicates that the output result contains $N$ sentences, and $w_n$ indicates that the $i$-th sentence $S_i$ contains $n$ words;
S3012: inputting the $i$-th sentence $S_i$ into the BERT model and extracting the corresponding word vectors $e_1, e_2, \ldots, e_n$;
S3013: constructing a word vector matrix $E_0$ based on the word vectors:
$E_0 = [e_1; e_2; \ldots; e_n] \in \mathbb{R}^{n \times d_0}$,
wherein $\mathbb{R}$ represents the real number field, $n$ represents the number of words, and $d_0$ represents the vector dimension, i.e. the dimension of the last hidden-layer state of the BERT layer;
S3014: taking, for the words concerned, the aspect word vector formed from an aspect word having $k$ words:
$E_a = \{e^a_1, e^a_2, \ldots, e^a_k\}$,
wherein an aspect word is a clause of a sentence comprising a plurality of words, $e^a_i$ represents the word vector of the $i$-th word, $m$ represents the total number of aspect word vectors among the aspect words, and $e^a_j$ represents the $j$-th word vector in the aspect word vector having $k$ words;
S3015: converting the original sentence corresponding to the output result into a dependency syntax tree by using a biaffine syntax parser, locating the aspect words in the original sentence, connecting each word of the original sentence with the aspect words based on the aspect word vectors, and constructing a word vector relation graph with the dependency relations as edges, the words as nodes and the aspect words as the root;
S3016: calculating the semantic distance $d_{ij}$ between a word and the associated aspect word:
$d_{ij} = \min\bigl(\operatorname{dist}(w_i, w_j),\ T\bigr)$,
wherein $l$ represents the number of words in the aspect word, $\operatorname{dist}(w_i, w_j)$ represents the minimum distance between a context word and the aspect word within the sentence to which the aspect word belongs, $w_i$ and $w_j$ represent the $i$-th word and the $j$-th word respectively, and $T$ represents a distance threshold, any minimum distance greater than the threshold being set to $T$;
S3017: calculating all semantic distances in the word vector relation graph to obtain a distance matrix $D = [d_{ij}]$;
S3018: combining the word vector matrix $E_0$ and the distance matrix $D$ to construct the directed graph representing the relational trend of words and sentences.
5. The method for generating a worker protection digital person according to claim 4, wherein S302 specifically comprises:
S3021: performing feature aggregation on the directed graph by using the graph neural network:
$H^{(l)} = \sigma\bigl(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l-1)} W^{(l)}\bigr)$,
wherein $H^{(l)}$ represents the feature output obtained by aggregation with the $l$-th layer of the graph neural network, $W^{(l)} \in \mathbb{R}^{d_{l-1} \times d_l}$ represents a learnable feature transfer matrix, $d_{l-1}$ and $d_l$ represent the feature vector dimensions before and after the transfer respectively, $\tilde{A} = D + I$ represents the distance matrix after self-connection, $I$ represents an identity matrix of the corresponding dimension, $\sigma$ represents a nonlinear activation function, $\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$ represents the symmetric normalized form of the self-connected distance matrix, $\tilde{D}$ represents the degree matrix corresponding to the self-connected distance matrix, and $\tilde{D}^{-1}\tilde{A}$ corresponds to a transition probability matrix in a Markov chain.
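As an illustrative sketch of the graph construction in S301 and the aggregation step S3021, the following Python/PyTorch code assumes the Hugging Face transformers package and a precomputed dependency-tree distance table; the checkpoint name, threshold value and pooling strategy are placeholders chosen for the example, not features of the claimed implementation.

    import numpy as np
    import torch
    from transformers import BertModel, BertTokenizerFast

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")   # placeholder checkpoint
    bert = BertModel.from_pretrained("bert-base-chinese")

    def word_vector_matrix(words):
        # Roughly S3012-S3013: one vector per word from BERT's last hidden layer (E0, shape n x d0).
        enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
        with torch.no_grad():
            hidden = bert(**enc).last_hidden_state[0]
        rows = []
        for i in range(len(words)):
            piece_idx = [t for t, w in enumerate(enc.word_ids(0)) if w == i]
            rows.append(hidden[piece_idx].mean(dim=0))        # mean-pool sub-word pieces
        return torch.stack(rows)

    def clipped_distance_matrix(tree_distances, T=4):
        # Roughly S3016-S3017: dependency-tree distances to the aspect word, clipped at threshold T.
        return torch.tensor(np.minimum(np.asarray(tree_distances, dtype=float), T), dtype=torch.float32)

    def gcn_layer(H_prev, D, W, activation=torch.relu):
        # S3021: A~ = D + I (self-connected distance matrix), symmetric normalization with the
        # degree matrix, then H = activation(D~^{-1/2} A~ D~^{-1/2} H_prev W).
        A_tilde = D + torch.eye(D.size(0))
        deg_inv_sqrt = torch.diag(A_tilde.sum(dim=1).clamp(min=1e-12).pow(-0.5))
        A_hat = deg_inv_sqrt @ A_tilde @ deg_inv_sqrt
        return activation(A_hat @ H_prev @ W)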
6. The method for generating a worker protection digital person according to claim 5, wherein S303 specifically comprises:
S3031: inputting the aggregated feature output into the softmax activation layer, taking the cross-entropy loss function $\mathcal{L}$ as the objective function, and calculating the probability value $p_{ij}$ that the $i$-th aspect word $t_i$ in the $s$-th sentence belongs to the $j$-th emotion type:
$p_{ij} = \operatorname{softmax}\bigl(W_p h^{(L)}_{t_i} + b_p\bigr)_j$, $\quad \mathcal{L} = -\sum_{s=1}^{S}\sum_{i=1}^{M}\sum_{j=1}^{C} y_{ij} \log p_{ij}$,
wherein $h^{(L)}_{t_i}$ represents the feature output of the aspect word $t_i$ obtained by aggregation through layer $L$ of the graph neural network, softmax represents the softmax activation layer, $W_p$ represents the network parameters of the softmax layer, $C$ represents the number of emotion types, $b_p$ represents a bias vector, $y_{ij}$ represents the $j$-th component of the one-hot vector of the class label to which the $i$-th aspect word $t_i$ belongs, and $S$ and $M$ represent the total number of sentences and the total number of aspect words, respectively.
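Again only as a sketch of S3031, assuming PyTorch; W_p, b_p and the integer label encoding are placeholders for the softmax-layer parameters described above.

    import torch
    import torch.nn.functional as F

    def emotion_probabilities(h_aspect_L, W_p, b_p):
        # Map the layer-L feature of one aspect word to a probability over the C emotion types.
        return torch.softmax(h_aspect_L @ W_p + b_p, dim=-1)

    def objective(all_logits, all_labels):
        # Cross-entropy objective summed over every aspect word in every sentence;
        # all_logits has shape (num_aspect_words, C), all_labels holds integer class ids.
        return F.cross_entropy(all_logits, all_labels, reduction="sum")

    def predicted_emotion(probabilities):
        # S304: the emotion type with the maximum probability value.
        return int(probabilities.argmax(dim=-1))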
7. The method for generating a worker protection digital person according to claim 1, wherein the step S4 specifically comprises:
S401: recording, through a camera, video samples of the worker to be simulated for the emotion types, and labeling each video sample and its corresponding sound sample with an emotion type, wherein the video samples contain the full face of the worker to be simulated;
S402: splitting the video into individual video frames with a video editing tool;
S403: extracting the facial features (the five sense organs) of the worker to be simulated from the video frames by using the Dlib tool;
S404: synthesizing the facial features through animation techniques to obtain video frames under the different emotion types;
S405: performing audio separation on the video samples, sampling the separated audio, and extracting the sound frequency, volume and pitch under the different emotion types;
S406: synthesizing the sound frequency, volume and pitch by using a speech synthesis tool to obtain the audio streams under the different emotion types.
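For S402 to S405, a rough sketch with OpenCV, Dlib and librosa is given below; it assumes the audio track has already been separated to a WAV file and uses Dlib's standard 68-point landmark model, both of which are choices made for illustration rather than requirements of the claim.

    import cv2
    import dlib
    import librosa
    import numpy as np

    detector = dlib.get_frontal_face_detector()
    predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # standard Dlib model file

    def facial_landmarks_per_frame(video_path):
        # Roughly S402-S403: split the video into frames and extract facial landmark points.
        cap = cv2.VideoCapture(video_path)
        landmarks = []
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            faces = detector(gray)
            if faces:
                shape = predictor(gray, faces[0])
                landmarks.append([(p.x, p.y) for p in shape.parts()])
        cap.release()
        return landmarks

    def audio_features(wav_path):
        # Roughly S405: sample the separated audio and estimate pitch (f0) and volume (RMS energy).
        y, sr = librosa.load(wav_path, sr=None)
        f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr)
        rms = librosa.feature.rms(y=y)[0]
        return {"sample_rate": sr, "pitch_hz": float(np.nanmean(f0)), "volume_rms": float(rms.mean())}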
8. The method for generating a worker protection digital person according to claim 1, wherein
the dual-mode coding and decoding network comprises a decoding module consisting of a plurality of LSTM network layers and a self-attention mechanism layer, and an audio coding module and a video coding module each formed by a plurality of self-attention coding layers; the audio coding module and the video coding module are connected to the decoding module, and the decoding module is configured to receive the output data of the audio coding module and the video coding module.
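A minimal PyTorch sketch of the claim-8 topology follows, under the assumption that both modalities are already projected to a common feature dimension; the layer sizes and the concatenation-style fusion are illustrative choices, not the claimed design.

    import torch
    import torch.nn as nn

    class DualModalCodec(nn.Module):
        # Two self-attention encoders (audio, video) feeding a decoder made of stacked
        # LSTM layers plus a self-attention layer, as outlined in claim 8.
        def __init__(self, d_model=256, n_heads=4, n_enc_layers=2, n_lstm_layers=2):
            super().__init__()
            self.audio_encoder = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_enc_layers)
            self.video_encoder = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_enc_layers)
            self.lstm = nn.LSTM(d_model, d_model, num_layers=n_lstm_layers, batch_first=True)
            self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

        def forward(self, audio_feats, video_feats):
            a = self.audio_encoder(audio_feats)          # (batch, T_audio, d_model)
            v = self.video_encoder(video_feats)          # (batch, T_video, d_model)
            joint = torch.cat([a, v], dim=1)             # fuse the two streams along the time axis
            decoded, _ = self.lstm(joint)                # multi-layer LSTM decoding
            out, _ = self.self_attn(decoded, decoded, decoded)  # self-attention over the decoded sequence
            return out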
9. The method for generating a worker protection digital person according to claim 8, wherein the step S7 specifically comprises:
S701: parsing the video frames and the audio stream through a multi-modal feature fusion network to obtain the joint feature $F$:
$F = \operatorname{Joint}\Bigl(\operatorname{cat}\bigl(\operatorname{softmax}\bigl(\tfrac{Q_a K_a^{\top}}{\sqrt{d_k}}\bigr)V_a,\ \operatorname{softmax}\bigl(\tfrac{Q_v K_v^{\top}}{\sqrt{d_k}}\bigr)V_v\bigr)\Bigr)$,
wherein $Q_a$, $K_a$ and $V_a$ respectively represent the query vector, key vector and value vector of the self-attention mechanism when parsing the audio stream, $Q_v$, $K_v$ and $V_v$ respectively represent the query vector, key vector and value vector of the self-attention mechanism when parsing the video frames, $d_k$ represents the dimension parameter of the key vectors, the symbol "Joint" represents the fusion operation in the self-attention mechanism, the symbol "cat" represents the connection operation in the self-attention mechanism, and softmax represents the activation function of the softmax activation layer;
S702: decoding the joint feature through the decoding module to obtain the worker protection digital person:
$y_t = \lambda\, p^{attn}\bigl(y_t \mid y_{t-1}, F\bigr) \,\oplus\, (1-\lambda)\, p^{lstm}\bigl(y_t \mid y_{t-1}, F\bigr)$,
wherein $y_t$ represents the decoding output result of the worker protection digital person, $p^{attn}$ represents the probability parameters provided by the self-attention mechanism decoder, $p^{lstm}$ represents the probability parameters provided by the LSTM network layers, $y_{t-1}$ represents the decoding output result at time $t-1$, $\lambda$ represents a hyper-parameter, and the symbol "$\oplus$" represents the splicing operation.
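As a hedged reading of S701 and S702, the per-modality scaled dot-product attention and the weighted mixing of the two decoder probability streams could be sketched as follows; the λ weighting is one interpretation of the hyper-parameter mentioned above, not the definitive decoding rule.

    import torch
    import torch.nn.functional as F

    def modal_attention(Q, K, V):
        # softmax(Q K^T / sqrt(d_k)) V for one modality, as in the S701 expression.
        d_k = K.size(-1)
        scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
        return F.softmax(scores, dim=-1) @ V

    def joint_feature(Qa, Ka, Va, Qv, Kv, Vv):
        # S701: attend within the audio stream and within the video frames, then connect ("cat").
        return torch.cat([modal_attention(Qa, Ka, Va), modal_attention(Qv, Kv, Vv)], dim=-1)

    def decode_step(p_attn, p_lstm, lam=0.5):
        # S702 (one reading): mix the self-attention decoder and LSTM probability streams
        # with a hyper-parameter, then pick the most probable output at this time step.
        p = lam * p_attn + (1.0 - lam) * p_lstm
        return p.argmax(dim=-1)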
10. A system for generating a worker protection digital person, comprising a processor and a memory for storing instructions executable by the processor; the processor is configured to invoke the instructions stored in the memory to perform the method for generating a worker protection digital person according to any one of claims 1 to 9.
CN202410095801.4A 2024-01-24 2024-01-24 Method and system for generating worker protection digital person Active CN117635785B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410095801.4A CN117635785B (en) 2024-01-24 2024-01-24 Method and system for generating worker protection digital person

Publications (2)

Publication Number Publication Date
CN117635785A CN117635785A (en) 2024-03-01
CN117635785B true CN117635785B (en) 2024-05-28

Family

ID=90016586

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410095801.4A Active CN117635785B (en) 2024-01-24 2024-01-24 Method and system for generating worker protection digital person

Country Status (1)

Country Link
CN (1) CN117635785B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118195428B (en) * 2024-05-20 2024-07-26 杭州慧言互动科技有限公司 Digital employee optimization system and method based on AI Agent

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109446331A (en) * 2018-12-07 2019-03-08 华中科技大学 A kind of text mood disaggregated model method for building up and text mood classification method
CN113779211A (en) * 2021-08-06 2021-12-10 华中科技大学 Intelligent question-answer reasoning method and system based on natural language entity relationship
CN113868400A (en) * 2021-10-18 2021-12-31 深圳追一科技有限公司 Method and device for responding to digital human questions, electronic equipment and storage medium
CN115834519A (en) * 2022-12-24 2023-03-21 北京蔚领时代科技有限公司 Intelligent question and answer method, device, server and storage medium
CN116226347A (en) * 2022-10-26 2023-06-06 中国科学院软件研究所 Fine granularity video emotion content question-answering method and system based on multi-mode data
CN116414959A (en) * 2023-02-23 2023-07-11 厦门黑镜科技有限公司 Digital person interaction control method and device, electronic equipment and storage medium
CN116524924A (en) * 2023-04-23 2023-08-01 厦门黑镜科技有限公司 Digital human interaction control method, device, electronic equipment and storage medium
US11727915B1 (en) * 2022-10-24 2023-08-15 Fujian TQ Digital Inc. Method and terminal for generating simulated voice of virtual teacher
CN116954418A (en) * 2023-07-28 2023-10-27 武汉市万睿数字运营有限公司 Exhibition hall digital person realization method, device, equipment and storage medium
CN117216209A (en) * 2023-09-04 2023-12-12 上海深至信息科技有限公司 Ultrasonic examination report reading system based on large language model
CN117234369A (en) * 2023-08-21 2023-12-15 华院计算技术(上海)股份有限公司 Digital human interaction method and system, computer readable storage medium and digital human equipment

Also Published As

Publication number Publication date
CN117635785A (en) 2024-03-01

Similar Documents

Publication Publication Date Title
WO2021104099A1 (en) Multimodal depression detection method and system employing context awareness
Wang et al. A meta-analysis of the predictability of LENA™ automated measures for child language development
CN110148318B (en) Digital teaching assistant system, information interaction method and information processing method
Narayanan et al. Behavioral signal processing: Deriving human behavioral informatics from speech and language
Lee et al. Study on emotion recognition and companion Chatbot using deep neural network
CN117635785B (en) Method and system for generating worker protection digital person
Griol et al. Mobile conversational agents for context-aware care applications
Franciscatto et al. Towards a speech therapy support system based on phonological processes early detection
CN113035232B (en) Psychological state prediction system, method and device based on voice recognition
Wilks et al. A prototype for a conversational companion for reminiscing about images
Qian et al. English language teaching based on big data analytics in augmentative and alternative communication system
Harati et al. Speech-based depression prediction using encoder-weight-only transfer learning and a large corpus
Sinclair et al. Using machine learning to predict children’s reading comprehension from linguistic features extracted from speech and writing.
Chandler et al. Machine learning for ambulatory applications of neuropsychological testing
US20230320642A1 (en) Systems and methods for techniques to process, analyze and model interactive verbal data for multiple individuals
Brinkschulte et al. The EMPATHIC project: building an expressive, advanced virtual coach to improve independent healthy-life-years of the elderly
Bhatia Using transfer learning, spectrogram audio classification, and MIT app inventor to facilitate machine learning understanding
Lin et al. Advancing naturalistic affective science with deep learning
Zhao et al. [Retracted] Standardized Evaluation Method of Pronunciation Teaching Based on Deep Learning
CN117711404A (en) Method, device, equipment and storage medium for evaluating oral-language review questions
Schipor et al. Towards a multimodal emotion recognition framework to be integrated in a Computer Based Speech Therapy System
Huang et al. Inferring Stressors from Conversation: Towards an Emotional Support Robot Companion
Du et al. Composite Emotion Recognition and Feedback of Social Assistive Robot for Elderly People
Bose Continuous emotion prediction from speech: Modelling ambiguity in emotion
Mallios Virtual doctor: an intelligent human-computer dialogue system for quick response to people in need

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant