CN117635785B - Method and system for generating a caregiver digital person - Google Patents
Method and system for generating a caregiver digital person
- Publication number
- CN117635785B (granted publication); application number CN202410095801.4A
- Authority
- CN
- China
- Prior art keywords
- representing
- word
- video
- worker
- emotion
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 238000000034 method Methods 0.000 title claims abstract description 55
- 230000008451 emotion Effects 0.000 claims abstract description 122
- 238000013528 artificial neural network Methods 0.000 claims abstract description 48
- 230000004044 response Effects 0.000 claims abstract description 43
- 230000000474 nursing effect Effects 0.000 claims abstract description 23
- 238000012549 training Methods 0.000 claims abstract description 15
- 239000013598 vector Substances 0.000 claims description 47
- 239000011159 matrix material Substances 0.000 claims description 24
- 230000006870 function Effects 0.000 claims description 19
- 230000014509 gene expression Effects 0.000 claims description 14
- 230000007246 mechanism Effects 0.000 claims description 14
- 230000004913 activation Effects 0.000 claims description 12
- 230000002776 aggregation Effects 0.000 claims description 12
- 238000004220 aggregation Methods 0.000 claims description 12
- 230000015572 biosynthetic process Effects 0.000 claims description 8
- 238000010586 diagram Methods 0.000 claims description 8
- 238000003786 synthesis reaction Methods 0.000 claims description 8
- 230000001815 facial effect Effects 0.000 claims description 7
- 230000004927 fusion Effects 0.000 claims description 7
- 230000015654 memory Effects 0.000 claims description 7
- 230000002194 synthesizing effect Effects 0.000 claims description 6
- 238000012546 transfer Methods 0.000 claims description 5
- 230000009977 dual effect Effects 0.000 claims description 3
- 238000005516 engineering process Methods 0.000 claims description 3
- 238000005070 sampling Methods 0.000 claims description 3
- 230000008520 organization Effects 0.000 claims description 2
- 238000007781 pre-processing Methods 0.000 claims description 2
- 238000000926 separation method Methods 0.000 claims description 2
- 230000007704 transition Effects 0.000 claims description 2
- 238000012545 processing Methods 0.000 abstract description 12
- 230000008569 process Effects 0.000 description 15
- 238000004458 analytical method Methods 0.000 description 14
- 238000004891 communication Methods 0.000 description 8
- 230000008901 benefit Effects 0.000 description 7
- 230000007935 neutral effect Effects 0.000 description 7
- 238000003860 storage Methods 0.000 description 7
- 230000003993 interaction Effects 0.000 description 6
- 230000001012 protector Effects 0.000 description 6
- 230000009286 beneficial effect Effects 0.000 description 5
- 238000004590 computer program Methods 0.000 description 4
- 238000010276 construction Methods 0.000 description 4
- 235000005911 diet Nutrition 0.000 description 4
- 230000037213 diet Effects 0.000 description 4
- 238000009826 distribution Methods 0.000 description 4
- 230000000694 effects Effects 0.000 description 4
- 238000003058 natural language processing Methods 0.000 description 4
- 230000004888 barrier function Effects 0.000 description 3
- 230000008878 coupling Effects 0.000 description 3
- 238000010168 coupling process Methods 0.000 description 3
- 238000005859 coupling reaction Methods 0.000 description 3
- 238000013461 design Methods 0.000 description 3
- 239000003814 drug Substances 0.000 description 3
- 229940079593 drug Drugs 0.000 description 3
- 230000008921 facial expression Effects 0.000 description 3
- 230000007774 longterm Effects 0.000 description 3
- 238000007726 management method Methods 0.000 description 3
- 210000002569 neuron Anatomy 0.000 description 3
- 238000011176 pooling Methods 0.000 description 3
- 241000282414 Homo sapiens Species 0.000 description 2
- 230000032683 aging Effects 0.000 description 2
- 230000002457 bidirectional effect Effects 0.000 description 2
- 239000003795 chemical substances by application Substances 0.000 description 2
- 201000010099 disease Diseases 0.000 description 2
- 208000037265 diseases, disorders, signs and symptoms Diseases 0.000 description 2
- 235000013305 food Nutrition 0.000 description 2
- 230000036541 health Effects 0.000 description 2
- 230000005802 health problem Effects 0.000 description 2
- 230000006698 induction Effects 0.000 description 2
- 230000003287 optical effect Effects 0.000 description 2
- 239000004065 semiconductor Substances 0.000 description 2
- 208000019901 Anxiety disease Diseases 0.000 description 1
- 206010003805 Autism Diseases 0.000 description 1
- 208000020706 Autistic disease Diseases 0.000 description 1
- 230000009471 action Effects 0.000 description 1
- 230000036506 anxiety Effects 0.000 description 1
- 238000013459 approach Methods 0.000 description 1
- 230000036772 blood pressure Effects 0.000 description 1
- 238000006243 chemical reaction Methods 0.000 description 1
- 239000003086 colorant Substances 0.000 description 1
- 238000013500 data storage Methods 0.000 description 1
- 230000002996 emotional effect Effects 0.000 description 1
- 238000010413 gardening Methods 0.000 description 1
- 230000003862 health status Effects 0.000 description 1
- 230000002452 interceptive effect Effects 0.000 description 1
- 230000007787 long-term memory Effects 0.000 description 1
- 238000010801 machine learning Methods 0.000 description 1
- 238000013507 mapping Methods 0.000 description 1
- 230000004630 mental health Effects 0.000 description 1
- 230000036651 mood Effects 0.000 description 1
- 235000016709 nutrition Nutrition 0.000 description 1
- 238000005457 optimization Methods 0.000 description 1
- 230000008447 perception Effects 0.000 description 1
- 235000017924 poor diet Nutrition 0.000 description 1
- 238000002360 preparation method Methods 0.000 description 1
- 230000000306 recurrent effect Effects 0.000 description 1
- 230000006403 short-term memory Effects 0.000 description 1
- 239000007787 solid Substances 0.000 description 1
- 238000006467 substitution reaction Methods 0.000 description 1
- 208000024891 symptom Diseases 0.000 description 1
- 238000012360 testing method Methods 0.000 description 1
- 230000000007 visual effect Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
- G06T13/20—3D [Three Dimensional] animation
- G06T13/40—3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3343—Query execution using phonetics
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/338—Presentation of query results
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/004—Artificial life, i.e. computing arrangements simulating life
- G06N3/006—Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/042—Knowledge-based neural networks; Logical representations of neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/0442—Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/02—Knowledge representation; Symbolic representation
- G06N5/022—Knowledge engineering; Knowledge acquisition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
- G06N5/041—Abduction
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
- G06T13/20—3D [Three Dimensional] animation
- G06T13/205—3D [Three Dimensional] animation driven by audio data
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/027—Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/033—Voice editing, e.g. manipulating the voice of the synthesiser
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H70/00—ICT specially adapted for the handling or processing of medical references
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H80/00—ICT specially adapted for facilitating communication between medical practitioners or patients, e.g. for collaborative diagnosis, therapy or health monitoring
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Acoustics & Sound (AREA)
- Biomedical Technology (AREA)
- Multimedia (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biophysics (AREA)
- Molecular Biology (AREA)
- Human Computer Interaction (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Epidemiology (AREA)
- Public Health (AREA)
- Primary Health Care (AREA)
- Signal Processing (AREA)
- Machine Translation (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Pathology (AREA)
Abstract
The invention provides a method and a system for generating a caregiver digital person, relating to the technical field of data processing. The method comprises the following steps: acquiring caregiver expertise and elderly-interest knowledge and constructing a knowledge base; constructing a response model and training it with the knowledge base; collecting the output results of the trained response model and performing emotion classification on them by combining a graph neural network with a syntax tree to obtain a plurality of emotion types; recording video samples and extracting video frames and audio streams from them; receiving a voice request from the care recipient, converting it into a text request, inputting the text request into the response model, and obtaining a target output result from the response model; analyzing the target emotion type; constructing a dual-modal encoder-decoder network and fusing the corresponding video frames and audio streams to obtain a caregiver digital person with audio and video characteristics; and having the caregiver digital person broadcast the target output result. The fidelity of the caregiver digital person and the conversation experience of the care recipient are improved.
Description
Technical Field
The invention relates to the technical field of data processing, and in particular to a method and a system for generating a caregiver digital person.
Background
With the growing aging of the population, more and more elderly people need long-term care and medical services. An aging population means that more elderly people require care services, while the number of young care workers is relatively small, so market demand is difficult to meet. The shortage of elderly care has a significant impact on society and families: it increases the burden on households and makes it difficult for some elderly people to obtain the care services they need, potentially leading to more health problems and hospitalizations.
At present, with technological progress, caregiver digital persons have emerged in response to the shortage of care workers. However, existing caregiver digital persons usually provide only emotionless answers to the elderly being cared for and lack personality characteristics, which creates a communication barrier between the digital person and the elderly and conflicts with the elderly's need for emotional communication. Moreover, real caregivers often cannot accurately answer the medical consultations of the elderly due to a lack of professional knowledge.
Disclosure of Invention
The invention provides a method and a system for generating a caregiver digital person, to solve the technical problems in the prior art that a caregiver digital person can only provide emotionless answers to the elderly being cared for and lacks personality characteristics, which creates a communication barrier between the digital person and the elderly and conflicts with the elderly's need for emotional communication, and that real caregivers cannot accurately answer the medical consultations of the elderly due to a lack of professional knowledge.
The technical scheme provided by the invention is as follows:
First aspect
The invention provides a method for generating a caregiver digital person, which comprises the following steps:
S1: acquiring caregiver expertise and elderly-interest knowledge, and constructing a knowledge base;
S2: constructing a response model by combining a BERT model and a graph neural network, and training the response model with the knowledge base;
S3: collecting output results of the trained response model, and performing emotion classification on the output results by combining the graph neural network and a syntax tree to obtain a plurality of emotion types;
S4: recording video samples of the emotion types of the caregiver to be simulated, and extracting video frames and audio streams of the different emotion types from the video samples;
S5: receiving a voice request from the care recipient, converting the voice request into a text request using a speech recognition tool, inputting the text request into the response model, and obtaining a target output result from the response model;
S6: analyzing the target emotion type of the target output result;
S7: constructing a dual-modal encoder-decoder network with multi-layer LSTMs, and fusing video frames and audio streams of the corresponding emotion type according to the target emotion type to obtain a caregiver digital person with audio characteristics and video characteristics;
S8: broadcasting the target output result through the caregiver digital person.
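For illustration only (this sketch is editorial and not part of the claimed method), the pipeline of steps S1 to S8 can be outlined in Python; every callable below is a hypothetical placeholder supplied by the implementer rather than an interface defined by this disclosure:

```python
# Hypothetical orchestration of steps S1-S8; the injected callables are placeholders.
def run_caregiver_pipeline(build_kb, train_model, classify_emotions, record_samples,
                           speech_to_text, analyse_emotion, fuse_modalities, voice_request):
    kb = build_kb()                                   # S1: expertise + elderly-interest knowledge base
    model = train_model(kb)                           # S2: BERT + graph neural network response model
    emotion_types = classify_emotions(model)          # S3: emotion-classify collected model outputs
    frames, audio = record_samples(emotion_types)     # S4: per-emotion video frames and audio streams
    text_request = speech_to_text(voice_request)      # S5: speech recognition
    target_output = model(text_request)               # S5: target output result
    target_emotion = analyse_emotion(target_output)   # S6: target emotion type
    digital_person = fuse_modalities(frames[target_emotion], audio[target_emotion])  # S7
    return digital_person, target_output              # S8: the digital person broadcasts target_output
```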
Second aspect
The invention provides a caregiver digital person generation system, comprising a processor and a memory for storing instructions executable by the processor; the processor is configured to invoke the instructions stored in the memory to perform the caregiver digital person generation method of the first aspect.
Compared with the prior art, the technical scheme has at least the following beneficial effects:
In the invention, a response model is constructed from a BERT model and a graph neural network and is trained on professional knowledge and interest knowledge. BERT can capture the complex relationships between words and sentences and thus understand the user's emotional expression more accurately, giving the model multi-level semantic understanding and context awareness and providing more intelligent and accurate natural language processing. The graph neural network is then combined with a syntax tree to analyze the specific emotion classification of the output result, namely positive, neutral or negative emotion: the syntax tree provides dependency information between the words in the text, and the graph neural network extracts features at multiple levels, so the model can progressively understand the global structure and local features of the text and capture emotional signals more accurately. Finally, based on the output result of the response model, a caregiver digital person with emotion is established that is closer to the caregiver to be simulated in expression and audio, so that the generated caregiver digital person is more lifelike, can effectively answer the various questions of the elderly being cared for in both image and sound, improves the satisfaction of the care recipient, and provides a better human-machine conversation experience.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a method for generating a caregiver digital person according to the present invention;
FIG. 2 is a schematic diagram of a response model according to the present invention;
FIG. 3 is a schematic structural diagram of the dual-modal encoder-decoder network according to the present invention.
Detailed Description
The technical scheme of the invention is described below with reference to the accompanying drawings.
In embodiments of the invention, words such as "exemplary," "such as" and the like are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, the term use of an example is intended to present concepts in a concrete fashion. Furthermore, in embodiments of the present invention, the meaning of "and/or" may be that of both, or may be that of either, optionally one of both.
In the embodiments of the present invention, "image" and "picture" may sometimes be used interchangeably; their meaning is consistent when the distinction is not emphasized. Likewise, "of", "corresponding" and "relevant" may sometimes be used interchangeably, with a consistent meaning when the distinction is not emphasized.
In the embodiments of the present invention, a subscript such as W₁ may sometimes be written in the non-subscript form W1; the meaning is consistent when the distinction is not emphasized.
In order to make the technical problems, technical solutions and advantages to be solved more apparent, the following detailed description will be given with reference to the accompanying drawings and specific embodiments.
The embodiment of the invention provides a method for generating a caregiver digital person, which can be implemented by a caregiver digital person generation apparatus; the apparatus may be a terminal or a server. The processing flow of the method for generating a caregiver digital person shown in FIG. 1 may include the following steps:
S1: Acquire caregiver expertise and elderly-interest knowledge, and construct a knowledge base.
Caregiver expertise refers to professional information and skills related to medical treatment, nursing, health management and geriatric care. Such knowledge includes, but is not limited to, the following. Diseases and health status: understanding various common diseases and common health problems of the elderly, together with their symptoms, treatment and care methods. Drug knowledge: familiarity with the use, dosage, side effects and interactions of various drugs, to ensure that the elderly use drugs correctly. Nursing skills: mastering care skills, including measuring blood pressure, assisting the elderly to move, hygiene care, wound care, and so on. Nutritional knowledge: understanding the dietary needs of the elderly, including the management of special, insufficient or poor diets. Psychological support: understanding the mental-health needs of the elderly, such as the management of loneliness, depression and anxiety. Elderly-interest knowledge refers to knowledge and understanding of the interests, hobbies, needs and preferences of the elderly; it helps build personalized services for the caregiver digital person so that it can meet the needs of the elderly. Elderly-interest knowledge includes, but is not limited to, the following. Hobbies and interests: the elderly may have various interests, such as reading, gardening, travel, music and art. Dietary preference: understanding the eating preferences of the elderly, including particularly preferred foods and foods to be avoided. Daily activities: understanding the daily activities and habits of the elderly in order to provide customized advice and services. Training the subsequent model with this knowledge base helps the caregiver digital person to be more practical and personalized, to provide better support and advice, to help the elderly better manage their health and life, to improve service efficiency and quality, and to meet the needs of the elderly.
In one possible embodiment, S1 specifically includes:
S101: acquiring caregiver expertise and elderly-interest knowledge through nursing associations and elderly welfare organizations, respectively;
S102: disassembling the question content and the corresponding answer content according to the layout structure of the caregiver expertise and the elderly-interest knowledge;
S103: constructing a knowledge base from the question content and the answer content.
Referring to FIG. 2 of the drawings, a schematic structural diagram of the response model provided by the present invention is shown.
In FIG. 2, in addition to the BERT layer and the graph neural network layer, the model includes the necessary input layer, max-pooling layer, fully connected layer and output layer, with the soft-activation layer embedded in the output layer. The input layer is the beginning of the model: it accepts the original text input (which may be a dialogue or a text request), and its task is to prepare the text data in a form suitable for processing by the model, converting the text into a numerical representation the model can understand. In the emotion analysis task, the max-pooling layer extracts the most important information from the text encoding; by applying a max-pooling operation to different parts of the encoding, key features in the text can be captured, which helps the emotion analysis model better understand the input text. The fully connected layer is a neural network layer that further processes the features obtained from the max-pooling layer; it typically comprises a plurality of neurons, each connected to every neuron of the previous layer, and its task is to map the extracted features to an appropriate output space in preparation for emotion classification, learning how best to combine the input features for classification. The output layer defines the final output of the model; for the emotion analysis task, it typically includes one neuron per emotion category, such as positive, neutral and negative, and its goal is to translate the information passed from the previous layer into a probability distribution over the categories. The soft-activation layer embedded in the output layer typically uses a softmax function to convert the final model output into a probability distribution over the categories, which guides the final classification decision of the model and determines how likely the input text is to belong to each emotion category.
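As an editorial illustration of the layer stack in FIG. 2, the following PyTorch sketch wires an input encoding, one graph-aggregation layer, max pooling, a fully connected layer and a softmax output together. The BERT layer is replaced by a plain embedding so the example stays self-contained; that substitution, and all dimensions, are assumptions of this sketch rather than parameters taught by the patent:

```python
# Minimal sketch of FIG. 2: input -> (stand-in for BERT) -> graph layer -> max pool -> FC -> softmax.
import torch
import torch.nn as nn

class EmotionHead(nn.Module):
    def __init__(self, vocab_size=1000, dim=64, num_classes=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)      # stand-in for the BERT layer output
        self.graph_w = nn.Linear(dim, dim)              # one graph-aggregation layer
        self.fc = nn.Linear(dim, num_classes)           # fully connected layer

    def forward(self, token_ids, adj):
        h = self.embed(token_ids)                       # (batch, n, dim) word vectors
        h = torch.relu(adj @ self.graph_w(h))           # neighborhood aggregation over the word graph
        pooled, _ = h.max(dim=1)                        # max pooling over the tokens
        return torch.softmax(self.fc(pooled), dim=-1)   # soft-activation layer: class probabilities

tokens = torch.randint(0, 1000, (2, 5))                 # 2 toy sentences, 5 tokens each
adj = torch.eye(5).expand(2, 5, 5)                      # toy adjacency (identity: no extra edges)
print(EmotionHead()(tokens, adj).shape)                 # torch.Size([2, 3])
```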
S2: and constructing a response model by combining the BERT model and the graph neural network, and training the response model by utilizing a knowledge base.
The response model comprises a BERT layer and a graphic neural network layer, wherein the BERT layer is connected with the graphic neural network layer, and the graphic neural network layer is used for receiving output data of the BERT layer.
The BERT (Bidirectional Encoder Representations from Transformers) model is a pre-trained natural language processing model. It is pre-trained on large-scale text data to learn the contextual relationships between words and sentences; its key features are a bidirectional learning mode (i.e., context from both directions is considered simultaneously) and the Transformer architecture, which allow it to encode text into context-rich word vectors. A graph neural network is a machine learning model for processing graph data. In graph data the relationships between nodes may be complex, and a graph neural network can effectively capture the dependencies and relationships between nodes. It comprises two main parts, node representation learning and graph structure modeling, and is generally used for tasks such as node classification, graph classification and link prediction on graph data.
It should be noted that BERT can capture rich semantic information in text data, so it can be used to understand the user's natural-language input; this enables the response model to better understand the user's questions or requests, considering not only the information of a single sentence but also the contextual information. The knowledge base in the geriatric-care field can be represented as graph data in which complex relationships exist between concepts and knowledge points. The graph neural network is well suited to processing such graph data with complex relationships, and by connecting the output of BERT to the graph neural network layer, the information in the knowledge base can be better understood and queried. Combining BERT and a graph neural network allows the response model to provide personalized responses, because it can give each user specific suggestions and information based on the contextual information and the expertise in the knowledge base, which improves the accuracy and practicality of the response model. BERT's large-scale pre-training provides general natural-language understanding capability, so the response model benefits from BERT's pre-trained knowledge; domain-specific fine-tuning is then carried out through the graph neural network, so that the response model quickly adapts to the field of geriatric care.
By combining the BERT model and the graph neural network, the response model can better understand user input, query the knowledge base and provide personalized, high-quality responses. This combination fully exploits the advantages of natural language understanding and graph data processing and provides a powerful tool for intelligent response in the field of elderly care.
Specifically, training the response model with the knowledge base includes the following steps. First, the knowledge base includes user questions and answers, together with related topics or pieces of information. The corresponding questions and answers in the knowledge base are fed into the response model; during training, the model gradually learns how to understand user queries, how to query the knowledge base for relevant information, and how to generate corresponding answers. During training, the performance of the model is measured with a loss function, and the model parameters are continuously optimized by an optimization algorithm. After training, the performance of the model is evaluated on a test dataset, checking the accuracy and performance of the model in answering various questions; if necessary, the model is tuned, for example by fine-tuning parameters or adding training data to improve performance. Once the model performs well in the evaluation, it can be deployed into a practical application, integrating the trained model into a caregiver digital person or other service so that it can provide real-time answers based on user queries and the information in the knowledge base. The knowledge base is maintained and updated periodically to ensure that the information remains up to date; in addition, the model can be iterated and updated to improve performance and adapt to new situations.
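A hedged sketch of this training loop, assuming the knowledge-base pairs have already been tokenized into tensors and that the response model exposes a standard callable interface (neither of which is prescribed by the patent), might look like:

```python
# Illustrative training loop: feed knowledge-base (question, answer) pairs to the response model.
import torch
import torch.nn as nn

def train_response_model(model, kb_pairs, epochs=3, lr=1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()                      # loss function measuring model performance
    for _ in range(epochs):
        for question_ids, answer_ids in kb_pairs:        # pre-tokenized (input, target) tensors
            logits = model(question_ids)                 # model's predicted answer distribution
            loss = loss_fn(logits, answer_ids)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                             # optimization algorithm updates parameters
    return model
```

After training, held-out pairs would serve as the test dataset for the evaluation and fine-tuning described above.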
S3: and collecting output results of the trained response model, and carrying out emotion classification on the output results by combining a graphic neural network and a syntax tree to obtain a plurality of emotion types.
Among the emotion types include positive emotion, neutral emotion and negative emotion.
Specifically, first, the trained answer model generates answers to the user query, which are natural language sentences or paragraphs in text form. For each generated answer, a syntactic tree is a data structure representing its syntactic structure, which shows the dependencies, e.g., subject, verb, object, etc., between the individual words in the sentence, which may be generated by syntactic analysis techniques. Once the syntactic tree of answers is present, the next task is to categorize the answers for emotion, which is intended to determine the emotion or emotion type conveyed by the answer, e.g., positive, neutral, negative, etc., which may aid in understanding the emotion color of the answer, as well as the user's possible reactions. In the emotion classification process, the graph neural network is used for processing information in the syntax tree, and the graph neural network can effectively capture the dependency relationship and semantic information among words in the syntax tree, so that emotion expression and emotion context in an answer can be understood more accurately. After analysis in combination with the neural network of graphs and the syntactic tree, the answers can be classified into a plurality of emotion types, such as positive, neutral, and negative, so that the model can identify and label the different emotion elements present in the answers. The model can better understand emotion intonation of the answer by analyzing emotion colors of the answer, and can better process grammar structures and contexts by combining a graph neural network and a syntax tree so as to obtain an accurate emotion classification result.
In one possible embodiment, S3 specifically includes:
S301: Construct a directed graph representing the relationships among the words and sentences in the output result.
In one possible implementation, S301 specifically includes:
S3011: Perform text preprocessing on the output result to obtain the sentence set S:

$S = \{S_1, S_2, \ldots, S_N\}, \qquad S_i = \{w_1, w_2, \ldots, w_n\}$

where N is the number of sentences in the output result and the i-th sentence $S_i$ contains n words $w_1, \ldots, w_n$;
S3012: Input the i-th sentence $S_i$ into the BERT model and extract the corresponding word vectors $e_1, e_2, \ldots, e_n$;
S3013: Construct the word vector matrix $E_0$ from the word vectors:

$E_0 = [e_1, e_2, \ldots, e_n] \in \mathbb{R}^{n \times d_0}$

where $\mathbb{R}$ denotes the real number field, $d_0$ the vector dimension, and each word vector is taken from the last hidden-layer state of the BERT layer;
S3014: Take the word vectors of the k words belonging to an aspect word as the aspect word vector

$t_i = \{e_i^1, e_i^2, \ldots, e_i^k\}$

where an aspect word is a sub-phrase of a sentence comprising a plurality of words, $e_i^j$ is the word vector of the j-th of the k words of the i-th aspect word, and m denotes the total number of aspect word vectors in the aspect words;
S3015: Convert the original sentence corresponding to the output result into a dependency syntax tree using a biaffine dependency parser, compute the aspect words in the original sentence, connect each word of the original sentence with the aspect words based on the aspect word vectors, and construct a word-vector relation graph with the dependency relations as edges, the words as nodes and the aspect words as roots;
S3016: Compute the semantic distance $d_i$ between a word and the related aspect word:

$d_i = \begin{cases} \min\limits_{1 \le j \le l} \operatorname{dist}(w_i, w_j), & \text{if } \min\limits_{1 \le j \le l} \operatorname{dist}(w_i, w_j) \le T \\ T, & \text{otherwise} \end{cases}$

where l is the number of words in the aspect word, $\operatorname{dist}(w_i, w_j)$ is the dependency-tree distance between the i-th word $w_i$ and the j-th word $w_j$ of the aspect word, the minimum over j is the minimum distance between a context word and the aspect word in the sentence to which the aspect word belongs, T is a distance threshold, and any minimum distance greater than the threshold is set to T;
S3017: Compute all semantic distances in the word-vector relation graph to obtain the distance matrix D:

$D = (d_{ij}) \in \mathbb{R}^{n \times n}$

where $d_{ij}$ is the semantic distance attached to the edge between words $w_i$ and $w_j$;
S3018: Combine the word vector matrix and the distance matrix into the directed graph $G = (E_0, D)$ representing the relationships among the words and sentences in the output result.
It should be noted that by combining the word vector matrix and the distance matrix, a directed graph is constructed in which the words are nodes and the dependency relations are directed edges. The directed graph represents the semantic and dependency relationships between the words in the text, which helps analyze the structure and associations of the text. The directed graph can be used for further text analysis, such as emotion analysis, to understand the emotional tendencies and emotional context of the text, so that the model understands the text more deeply and provides more accurate emotion classification and text analysis.
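The following NumPy sketch illustrates S3016 to S3018 only: it assembles a capped distance matrix D from caller-supplied dependency-tree distances and pairs it with a word-vector matrix. The toy distances and dimensions are assumptions; a real implementation would take the distances from the biaffine parser of S3015:

```python
# Illustrative construction of the word graph: node features E0 plus a capped distance matrix D.
import numpy as np

def build_distance_matrix(n_words, aspect_indices, tree_dist, T=4):
    """tree_dist[i][j]: dependency-tree distance between word i and word j (S3016), capped at T."""
    D = np.full((n_words, n_words), float(T))
    for i in range(n_words):
        for a in aspect_indices:
            d = min(tree_dist[i][a], T)        # distances above the threshold are set to T
            D[i, a] = D[a, i] = d              # connect every context word with every aspect word
    return D

E0 = np.random.randn(5, 8)                                        # toy word-vector matrix (5 words, dim 8)
tree_dist = [[abs(i - j) for j in range(5)] for i in range(5)]    # toy tree distances
D = build_distance_matrix(5, aspect_indices=[2], tree_dist=tree_dist)
graph = (E0, D)                                                   # directed graph: features + weighted edges
print(D)
```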
S302: feature aggregation based on graph neural network is carried out on the directed graph.
In one possible implementation, S302 is specifically:
S3021: Perform feature aggregation on the directed graph with the graph neural network:

$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A}\, \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right), \qquad \tilde{A} = D + I$

where $H^{(l)} \in \mathbb{R}^{n \times d_l}$ is the feature output obtained by aggregation at the l-th layer of the graph neural network (with $H^{(0)} = E_0$), $W^{(l)} \in \mathbb{R}^{d_l \times d_{l+1}}$ is a learnable feature transfer matrix, $d_l$ and $d_{l+1}$ are the feature vector dimensions before and after transfer, $\tilde{A}$ is the distance matrix after self-connection, I is the identity matrix of corresponding dimension, $\sigma$ is a nonlinear activation function, $\tilde{D}$ is the degree matrix corresponding to the self-connected distance matrix, $\tilde{D}^{-\frac{1}{2}} \tilde{A}\, \tilde{D}^{-\frac{1}{2}}$ is the symmetrically normalized form of the self-connected distance matrix, and $\tilde{D}^{-1}\tilde{A}$ can be read as the transition probability matrix of a Markov chain.
It should be noted that by using the graph neural network to integrate and update the features of the nodes of the directed graph constructed in S301 through multi-layer feature transfer and aggregation, the structure and relationships of the text can be better understood, which facilitates the subsequent emotion classification or other text analysis tasks. The key advantage of the graph neural network is that it can capture the complex dependencies and contextual information between nodes, thereby improving text-understanding performance.
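A minimal NumPy sketch of the aggregation rule of S3021 (symmetrically normalized propagation with a ReLU activation; the two-layer depth and all sizes are illustrative assumptions) is:

```python
# Illustrative GCN-style aggregation: H_{l+1} = relu(D^{-1/2} (A + I) D^{-1/2} H_l W_l).
import numpy as np

def gcn_layer(H, A, W):
    A_tilde = A + np.eye(A.shape[0])                       # self-connection
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_tilde.sum(axis=1)))
    A_hat = d_inv_sqrt @ A_tilde @ d_inv_sqrt              # symmetric normalization
    return np.maximum(A_hat @ H @ W, 0.0)                  # ReLU as the nonlinear activation

H0 = np.random.randn(5, 8)                                 # word-vector matrix E0 (5 words, dim 8)
A = np.random.rand(5, 5)                                   # toy weighted adjacency / distance matrix
W = np.random.randn(8, 8)                                  # learnable feature transfer matrix
H2 = gcn_layer(gcn_layer(H0, A, W), A, W)                  # multi-layer aggregation
print(H2.shape)                                            # (5, 8)
```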
S3022: and calculating the probability values of the characteristic aggregated characteristic output belonging to different emotion types through a soft activation layer of the response model.
In one possible implementation, S303 is specifically:
S3031: Input the aggregated feature output into the soft-activation layer and, taking the cross-entropy loss function $\mathcal{L}_{CE}$ as the objective function, calculate the probability $p_{i,j}$ that the i-th aspect word $t_i$ of the s-th sentence belongs to the j-th emotion type:

$p_{i,j} = \operatorname{softmax}\left(W_p\, h_{t_i}^{(L)} + b_p\right)_j, \qquad \mathcal{L}_{CE} = -\sum_{s=1}^{S}\sum_{i=1}^{M}\sum_{j=1}^{C} y_{i,j} \log p_{i,j}$

where $h_{t_i}^{(L)}$ is the feature output of the aspect word $t_i$ obtained by aggregation through the L layers of the graph neural network, softmax denotes the soft-activation layer, $W_p$ and the bias vector $b_p$ are the network parameters of the soft-activation layer, C is the number of emotion types, $y_i$ is the one-hot vector representation of the class label of the i-th aspect word $t_i$, and S and M are the total number of sentences and the total number of aspect words, respectively.
The cross-entropy loss function is a common loss function for multi-class classification tasks. It measures the difference between the model's predicted probability distribution and the actual labels; here it is used as the objective function to measure the difference between the model's predicted probabilities and the actual emotion labels. The feature outputs are the node features aggregated through the layers of the graph neural network; in the emotion analysis task they are used to predict the probability that each aspect word belongs to each emotion type. The soft-activation layer is an activation function for multi-class classification: it maps the feature outputs to a probability distribution in which each emotion type corresponds to one probability, so that for each aspect word and each emotion type the corresponding emotion probability can be calculated.
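As a small PyTorch illustration of S3031 and S304 (the dimensions and labels are toy values), the softmax head and cross-entropy objective can be written as:

```python
# Illustrative soft-activation head: emotion probabilities plus cross-entropy loss.
import torch
import torch.nn.functional as F

C = 3                                          # emotion types: positive / neutral / negative
h = torch.randn(4, 16)                         # aggregated features of 4 aspect words (dim 16)
W_p = torch.randn(C, 16, requires_grad=True)   # soft-activation layer weight
b_p = torch.zeros(C, requires_grad=True)       # bias vector

logits = h @ W_p.t() + b_p                     # W_p h + b_p
probs = torch.softmax(logits, dim=-1)          # p_{i,j}: probability of emotion type j
labels = torch.tensor([0, 2, 1, 0])            # class labels (index form of the one-hot vectors)
loss = F.cross_entropy(logits, labels)         # cross-entropy objective over the logits
predicted = probs.argmax(dim=-1)               # S304: emotion type with the maximum probability
print(probs.shape, float(loss), predicted)
```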
S304: and outputting the emotion type corresponding to the maximum probability value.
It should be noted that when a single aspect word is taken as the root, the topology of the corresponding directed graph is a star network centered on that aspect word. Considering that a sentence instance may contain several aspect words, it is impractical to construct an induction tree and train separately for each aspect word, so the invention provides a new directed-graph construction strategy based on minimum tree distance. Specifically, all aspect words in the aspect word set are used as tree roots at the same time, all context words are connected to all aspect words simultaneously, and the connecting edges carry the minimum distance information. In other words, the induced structure of the original sentence-based dependency syntax tree is a star-like network centered on the aspect word set, equivalent to the combination of the topologies corresponding to all aspects. Compared with treating the distance information as an additional feature sequence, this directed-graph construction encodes the interaction between the aspect words and the context into the representation of the aspect words, which improves the performance of subsequent aspect-level emotion classification. Compared with the traditional minimum-distance directed-graph method, the construction strategy proposed here avoids the larger analysis errors of a star-shaped directed graph. The directed-graph construction method based on a syntax tree with all aspect words as roots can perform fine-grained text emotion analysis and improves emotion analysis efficiency, while the relative position information introduced to supervise sentence distance guarantees syntactic quality and thus the accuracy of emotion analysis.
S4: recording video samples of emotion types of a worker to be simulated, and extracting video frames and audio streams of different emotion types from the video samples.
In one possible embodiment, S4 specifically includes:
S401: recording video samples of emotion types of the protector to be simulated by a camera, and marking the emotion types for each video sample and corresponding sound samples, wherein the video samples comprise the whole facial video of the protector to be simulated;
s402: dividing the video into individual video frames by a video editing tool;
s403: extracting facial five-sense features related to the worker to be simulated in the video frame by using Dlib tools;
S404: synthesizing facial features through an animation technology to obtain video frames under different emotion types;
S405: performing audio separation on the video sample, sampling the separated audio, and extracting sound frequency, volume and tone under different emotion types;
S406: and synthesizing the sound frequency, the sound volume and the pitch by using a voice synthesis tool to obtain audio streams under different emotion types.
It should be noted that video samples of a real caregiver are recorded and multi-modal data of the real caregiver are extracted to simulate the caregiver digital person in different emotional states: labelling the different emotion types, splitting the video into frames, extracting facial features, synthesizing video frames for the different emotions, separating the audio from the video, and sampling and synthesizing audio for the different emotions creates multi-modal data with lifelike facial expressions and sound, so that the caregiver digital person can express various emotional states more vividly.
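A hedged sketch of the frame and landmark extraction of S402 and S403 is given below; the video file name, the landmark-model path and the sampling step are assumptions of this sketch, and the audio separation and synthesis of S405 and S406 would be handled by separate audio tools:

```python
# Illustrative S402-S403: split a labelled sample video into frames and detect facial landmarks with Dlib.
import cv2
import dlib

def extract_frames_and_landmarks(video_path, predictor_path, step=10):
    detector = dlib.get_frontal_face_detector()
    predictor = dlib.shape_predictor(predictor_path)       # e.g. a 68-point landmark model (assumed file)
    frames, landmarks = [], []
    cap = cv2.VideoCapture(video_path)
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:                               # sample every `step`-th frame (S402)
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            for face in detector(gray):
                shape = predictor(gray, face)               # facial feature points (S403)
                pts = [(shape.part(i).x, shape.part(i).y) for i in range(shape.num_parts)]
                frames.append(frame)
                landmarks.append(pts)
        index += 1
    cap.release()
    return frames, landmarks

# Assumed usage: frames, marks = extract_frames_and_landmarks("caregiver_happy.mp4",
#                                                             "shape_predictor_68_face_landmarks.dat")
```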
S5: and receiving a voice request of the nursing object, extracting the voice request into a text request by using a voice synthesis tool, inputting the text request into a response model, and obtaining a target output result through the response model.
In one possible implementation, the speech synthesis tool includes OpenText, eSpeak, deepSpeech and PocketSphinx.
It should be noted that, the method of receiving the voice request of the nursing object may be a method that the nursing object interacts with the digital nursing staff through voice, such as raising a problem, raising a requirement, etc., then converting the voice request into a text request through a voice synthesis tool, and then inputting the text request into a response model of the nursing staff, where the model processes the text request using technologies such as BERT and a neural network, etc., and generates a target output result to meet the requirement of the nursing object, and this series of steps enables the nursing staff to interact with the nursing object in a natural language manner, understand the requirement and provide corresponding feedback.
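Purely as an illustration of the S5 and S6 flow (the recognizer, model and classifier are injected placeholders; the patent names tools such as DeepSpeech and PocketSphinx but does not prescribe an API), a handler could look like:

```python
# Illustrative S5-S6 handler: speech -> text -> response model -> target emotion.
def handle_voice_request(voice_audio, speech_to_text, response_model, emotion_classifier):
    text_request = speech_to_text(voice_audio)             # S5: speech recognition
    target_output = response_model(text_request)            # S5: answer from the BERT + GNN model
    target_emotion = emotion_classifier(target_output)      # S6: positive / neutral / negative
    return target_output, target_emotion

# Toy usage with trivial stand-ins:
out, emo = handle_voice_request(
    b"...audio bytes...",
    speech_to_text=lambda audio: "What should I eat today?",
    response_model=lambda text: "A light congee with vegetables would suit you well.",
    emotion_classifier=lambda text: "positive",
)
print(out, emo)
```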
S6: and analyzing the target emotion type of the target output result.
Referring to FIG. 3 of the specification, a schematic structural diagram of the dual-modal encoder-decoder network according to the present invention is shown.
S7: and constructing a dual-mode coding and decoding network with a multi-layer LSTM, and fusing video frames and audio streams of corresponding emotion types according to the target emotion types to obtain the worker digital person with audio characteristics and video characteristics.
The dual-mode coding and decoding network comprises a decoding module consisting of a plurality of LSTM network layers and self-attention mechanism layers and a plurality of self-attention coding layers, wherein the audio coding module and the video coding module are connected with the decoding module, and the decoding module is used for receiving output data of the audio coding module and the video coding module.
The dual-modal encoder-decoder network is a neural network structure for processing two types of data at the same time, namely audio and video data. The network used to generate the caregiver digital person has two encoding modules, one for audio data and the other for video data, and the outputs of the two modules are fused in the decoding module to generate the final digital agent. The LSTM (long short-term memory) network is a variant of the recurrent neural network (RNN) dedicated to processing sequence data and capable of capturing long-term dependencies; the advantage of using LSTM network layers in the decoding module is that they model sequence data better, including time dependencies and long-term dependencies, and can learn and generate more realistic emotional expressions, which gives the generated caregiver digital person better performance and flexibility in interaction and emotional communication. The self-attention mechanism is a neural network layer that allows the network to assign different attention weights to different time steps or spatial positions, so as to better capture the important information in the input data. Audio encoding module: the part of the network that processes and encodes the audio data; it extracts relevant audio features such as frequency, volume and pitch. Video encoding module: similar to the audio encoding module, it processes and encodes the video data and extracts video features such as facial expressions. Decoding module: this module receives the outputs of the audio encoding module and the video encoding module and fuses them to produce a caregiver digital person with audio and video characteristics.
Compared with traditional single-modal approaches (considering only audio or only video), the dual-modal encoder-decoder network can consider audio and video data at the same time, providing richer information to generate a finer and more realistic digital agent. By processing audio and video data together, the network can better capture human emotions and can dynamically adjust the way the audio and video are fused according to the emotion type, ensuring that the generated digital agent suits the specific emotional requirement and thereby improving interaction quality and personalization. Because the neural network has learning capability and keeps improving with data, the performance of the caregiver digital person can be continuously optimized to adapt to different emotional requirements, which makes the method more flexible than traditional approaches. The advantage of the dual-modal encoder-decoder network is that it is more comprehensive, can express emotion, adapts better to different emotional requirements and has learning capability, so that the generated caregiver digital person is more interactive and adaptive and therefore more attractive than existing methods.
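The following PyTorch sketch shows one way (an editorial assumption, not the patent's parameterization) to wire a self-attention encoder per modality to a multi-layer LSTM decoder over the concatenated features:

```python
# Minimal dual-modal encoder-decoder sketch: per-modality self-attention, concatenation, LSTM decoding.
import torch
import torch.nn as nn

class DualModalCodec(nn.Module):
    def __init__(self, audio_dim=40, video_dim=128, model_dim=64, out_dim=80):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, model_dim)
        self.video_proj = nn.Linear(video_dim, model_dim)
        self.audio_attn = nn.MultiheadAttention(model_dim, num_heads=4, batch_first=True)
        self.video_attn = nn.MultiheadAttention(model_dim, num_heads=4, batch_first=True)
        self.decoder = nn.LSTM(2 * model_dim, model_dim, num_layers=2, batch_first=True)
        self.out = nn.Linear(model_dim, out_dim)

    def forward(self, audio, video):
        a = self.audio_proj(audio)
        v = self.video_proj(video)
        a, _ = self.audio_attn(a, a, a)           # self-attention encoding of the audio stream
        v, _ = self.video_attn(v, v, v)           # self-attention encoding of the video frames
        joint = torch.cat([a, v], dim=-1)         # concatenate ("cat") the two modalities
        h, _ = self.decoder(joint)                # multi-layer LSTM decoding over time
        return self.out(h)                        # per-step output driving the digital person

audio = torch.randn(1, 20, 40)                    # 20 time steps of audio features
video = torch.randn(1, 20, 128)                   # 20 aligned video-frame features
print(DualModalCodec()(audio, video).shape)       # torch.Size([1, 20, 80])
```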
In one possible implementation, S7 specifically includes:
S701: Parse the video frames and the audio stream through the multi-modal feature fusion network to obtain the joint feature F:

$F = \operatorname{Joint}\left(\operatorname{cat}\left(\operatorname{softmax}\!\left(\frac{Q_a K_a^{\top}}{\sqrt{d_k}}\right)V_a,\; \operatorname{softmax}\!\left(\frac{Q_v K_v^{\top}}{\sqrt{d_k}}\right)V_v\right)\right)$

where $Q_a$, $K_a$ and $V_a$ are the query, key and value vectors of the self-attention mechanism when parsing the audio stream, $Q_v$, $K_v$ and $V_v$ are the query, key and value vectors of the self-attention mechanism when parsing the video frames, $d_k$ is the key-vector dimension, the symbol "Joint" denotes the fusion operation in the self-attention mechanism, the symbol "cat" denotes the concatenation operation in the self-attention mechanism, and softmax is the activation function of the soft-activation layer;
S702: Decode the joint feature through the decoding module to obtain the caregiver digital person:

$y_t = \lambda\, p_{\mathrm{att}}\!\left(y_t \mid y_{t-1}, F\right) \;\oplus\; (1-\lambda)\, p_{\mathrm{lstm}}\!\left(y_t \mid y_{t-1}, F\right)$

where y ($y_t$) is the decoded output result of the caregiver digital person, $p_{\mathrm{att}}$ is the probability parameter provided by the self-attention decoder, $p_{\mathrm{lstm}}$ is the probability parameter provided by the LSTM network layer, $y_{t-1}$ is the decoded output at time $t-1$, $\lambda$ is a hyper-parameter, and the symbol $\oplus$ denotes the splicing operation.
It should be noted that the video frames and audio streams are combined into a joint feature to describe the emotional expression of the caregiver digital person more completely. The multi-modal feature fusion network uses the self-attention mechanism as a way of assigning weights between the different modalities; through this mechanism the key video and audio features can be identified for the subsequent emotion synthesis. In the decoding process, having obtained the joint features, the decoding module translates these features into the actual output of the caregiver digital person, including speech, facial expressions and actions; the LSTM network helps maintain the consistency of timing information and processing time, and the decoding module generates output results from the input joint features, which may be audio, video or other forms of interaction, to meet the needs of the care recipient. This method ensures the effective fusion of the video and audio features and the naturalness and continuity of the emotional expression. Compared with the prior art, the multi-modal caregiver digital person constructed in this way can provide a more diverse, vivid and natural interaction experience, better meets the needs of the care recipient, integrates multiple perception modalities (visual and auditory), and maintains consistency in the output process, thereby providing an interaction that better matches human expectations.
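The toy PyTorch fragment below mirrors the S701/S702 expressions as reconstructed above: scaled dot-product attention per modality, concatenation into the joint feature, and a lambda-weighted splice of an attention-based term and an LSTM-based term. All tensors are random stand-ins and the reconstruction itself is an assumption, so this is only a reading aid:

```python
# Illustrative S701/S702: per-modality attention, joint feature F, lambda-weighted spliced output.
import torch

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = torch.softmax(Q @ K.transpose(-2, -1) / d_k ** 0.5, dim=-1)
    return scores @ V

Qa = Ka = Va = torch.randn(20, 64)        # audio-stream query / key / value
Qv = Kv = Vv = torch.randn(20, 64)        # video-frame query / key / value
F_joint = torch.cat([attention(Qa, Ka, Va), attention(Qv, Kv, Vv)], dim=-1)   # joint feature F

lam = 0.6                                  # hyper-parameter lambda
p_att = torch.softmax(torch.randn(20, 80), dim=-1)    # stand-in for the self-attention decoder term
p_lstm = torch.softmax(torch.randn(20, 80), dim=-1)   # stand-in for the LSTM decoder term
y = torch.cat([lam * p_att, (1 - lam) * p_lstm], dim=-1)   # the splice operation
print(F_joint.shape, y.shape)
```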
S8: and broadcasting the target output result by the worker protecting digital person.
Compared with the prior art, the technical scheme has at least the following beneficial effects:
In the invention, a response model is constructed from a BERT model and a graph neural network and is trained on professional knowledge and interest knowledge. BERT can capture the complex relationships between words and sentences and thus understand the user's emotional expression more accurately, giving the model multi-level semantic understanding and context awareness and providing more intelligent and accurate natural language processing. The graph neural network is then combined with a syntax tree to analyze the specific emotion classification of the output result, namely positive, neutral or negative emotion: the syntax tree provides dependency information between the words in the text, and the graph neural network extracts features at multiple levels, so the model can progressively understand the global structure and local features of the text and capture emotional signals more accurately. Finally, based on the output result of the response model, a caregiver digital person with emotion is established that is closer to the caregiver to be simulated in expression and audio, so that the generated caregiver digital person is more lifelike, can effectively answer the various questions of the elderly being cared for in both image and sound, improves the satisfaction of the care recipient, and provides a better human-machine conversation experience.
The invention also provides a caregiver digital person generation system, comprising a processor and a memory for storing instructions executable by the processor; the processor is configured to invoke the instructions stored in the memory to perform the caregiver digital person generation method described above.
The caregiver digital person generation system can execute the above method for generating a caregiver digital person and achieve the same or similar technical effects, which are not repeated here to avoid repetition.
The above embodiments may be implemented in whole or in part by software, hardware (e.g., circuitry), firmware, or any combination thereof. When implemented in software, the above embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product comprises one or more computer instructions or computer programs. When the computer instructions or computer programs are loaded or executed on a computer, the processes or functions described in the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired or wireless means (e.g., infrared, radio, or microwave). The computer-readable storage medium may be any available medium accessible by a computer, or a data storage device such as a server or data center containing one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, hard disk, or magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium. The semiconductor medium may be a solid-state drive.
It should be understood that the term "and/or" merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A alone, both A and B, or B alone, where A and B may each be singular or plural. In addition, the character "/" herein generally indicates an "or" relationship between the associated objects, but may also indicate an "and/or" relationship, as can be understood from the context.
In the present invention, "at least one" means one or more, and "a plurality" means two or more. "At least one of" and similar expressions refer to any combination of the listed items, including any combination of single items or plural items. For example, at least one of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b and c may each be singular or plural.
It should be understood that, in the various embodiments of the present invention, the sequence numbers of the foregoing processes do not imply an order of execution; the order of execution should be determined by their functions and internal logic, and the sequence numbers shall not constitute any limitation on the implementation of the embodiments of the present invention.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, specific working procedures of the apparatus, device and unit described above may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus, device and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division of the units is merely a logical functional division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another device, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
If the functions are implemented in the form of software functional units and sold or used as a stand-alone product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part thereof contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.
The foregoing is merely a specific embodiment of the present invention, and the present invention is not limited thereto; any variation or substitution that a person skilled in the art can readily conceive within the technical scope disclosed herein shall fall within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (10)
1. A method for generating a care worker digital person, comprising:
S1: acquiring care worker professional knowledge and elderly interest knowledge, and constructing a knowledge base;
S2: constructing a response model combining a BERT model and a graph neural network, and training the response model using the knowledge base;
S3: collecting output results of the trained response model, and performing emotion classification on the output results by combining the graph neural network with a syntax tree to obtain a plurality of emotion types;
S4: recording video samples of the care worker to be simulated for each of the emotion types, and extracting video frames and audio streams of the different emotion types from the video samples;
S5: receiving a voice request from the person under care, converting the voice request into a text request using a speech synthesis tool, inputting the text request into the response model, and obtaining a target output result from the response model;
S6: analyzing the target output result to determine its target emotion type;
S7: constructing a dual-modality encoding-decoding network with multi-layer LSTMs, and, according to the target emotion type, fusing the video frames of the corresponding emotion type with the audio stream to obtain a care worker digital person having audio features and video features;
S8: broadcasting the target output result through the care worker digital person.
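For orientation, the following is a minimal, runnable sketch of the S1-S8 flow in Python. It is illustrative only and not part of the claims: every function, class and answer string below is an assumed stand-in for the knowledge base, the BERT-plus-graph-neural-network response model, the emotion classifier and the fused digital person described above.

```python
# Illustrative stand-ins only; none of these stubs reproduce the claimed models.
from dataclasses import dataclass

def build_knowledge_base() -> dict:
    # S1: question/answer pairs drawn from care-work knowledge and elderly interests.
    return {"How should I take my blood-pressure pills?":
            "Take them after breakfast with a full glass of water."}

def respond(knowledge_base: dict, text_request: str) -> str:
    # S2/S5: stand-in for the trained BERT + graph neural network response model.
    return knowledge_base.get(text_request, "Let me check that with your doctor.")

def classify_emotion(answer: str) -> str:
    # S3/S6: stand-in for the syntax-tree + graph neural network emotion classifier.
    return "positive" if "water" in answer else "neutral"

@dataclass
class DigitalPerson:
    emotion: str  # S7: selects the fused per-emotion video frames and audio stream

    def broadcast(self, answer: str) -> str:
        # S8: broadcast the answer through the care worker digital person.
        return f"[{self.emotion} expression and voice] {answer}"

if __name__ == "__main__":
    kb = build_knowledge_base()                                          # S1
    answer = respond(kb, "How should I take my blood-pressure pills?")   # S5
    person = DigitalPerson(classify_emotion(answer))                     # S6-S7
    print(person.broadcast(answer))                                      # S8
```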
2. The method for generating a care worker digital person according to claim 1, wherein step S1 specifically comprises:
S101: acquiring the care worker professional knowledge and the elderly interest knowledge through a nursing association and an elderly welfare organization, respectively;
S102: decomposing the question content and the corresponding answer content according to the layout structure of the care worker professional knowledge and the elderly interest knowledge;
S103: constructing the knowledge base from the question content and the answer content.
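As a rough illustration of steps S102 and S103, the sketch below splits a plainly formatted knowledge document into question/answer pairs and stores them as a small knowledge base. The "Q:"/"A:" layout and the example questions are assumptions; the claim itself does not fix any particular document format.

```python
import re

def disassemble_qa(document: str) -> dict:
    """Split a 'Q: ... A: ...' formatted document into a question -> answer knowledge base."""
    pairs = re.findall(r"Q:\s*(.+?)\s*A:\s*(.+?)(?=\nQ:|\Z)", document, flags=re.S)
    return {q.strip(): a.strip() for q, a in pairs}

doc = """Q: How often should a bed-ridden patient be turned?
A: Roughly every two hours, to prevent pressure sores.
Q: What light exercise suits most elderly people?
A: Short daily walks and gentle stretching."""

knowledge_base = disassemble_qa(doc)  # S102: question content and answer content
print(knowledge_base)                 # S103: the resulting knowledge base
```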
3. The method for generating a care worker digital person according to claim 1, wherein step S3 specifically comprises:
S301: constructing a directed graph representing the word and sentence relationships in the output result;
S302: performing graph-neural-network-based feature aggregation on the directed graph;
S303: calculating, through a soft activation (softmax) layer of the response model, the probability values that the aggregated feature output belongs to the different emotion types;
S304: outputting the emotion type corresponding to the maximum probability value.
4. The method for generating a care worker digital person according to claim 3, wherein step S301 specifically comprises:
S3011: performing text preprocessing on the output result to obtain a sentence set S:
$S = \{S_1, S_2, \dots, S_N\}$, $S_i = \{w_1, w_2, \dots, w_n\}$;
wherein $S_N$ indicates that the output result contains N sentences, and $w_n$ indicates that the i-th sentence $S_i$ contains n words;
S3012: inputting the i-th sentence $S_i$ into the BERT model, and extracting the corresponding word vectors;
S3013: constructing a word vector matrix $E_0$ from the word vectors, with $E_0 \in \mathbb{R}^{n \times d_0}$;
wherein $\mathbb{R}$ represents the real number field and $d_0$ denotes the word vector dimension, i.e. the dimension of the last hidden-layer state of the BERT model;
S3014: taking the word vectors of the k words contained in an aspect word related to the sentence as the aspect word vector, wherein an aspect word is a clause of a sentence comprising a plurality of words, j denotes the j-th word in an aspect word containing k words, and m denotes the total number of aspect word vectors;
S3015: converting the original sentence corresponding to the output result into a dependency syntax tree using a biaffine dependency parser, locating the aspect words in the original sentence, connecting each word of the original sentence with the aspect words based on the aspect word vectors, and constructing a word-vector relation graph with the dependency relations as edges, the words as nodes, and the aspect words as the root;
S3016: calculating the semantic distance $d_i$ between each word and the related aspect word:
$d_i = \min\!\big(\min_{1 \le j \le l} \operatorname{dist}(w_i, a_j),\ T\big)$;
wherein l represents the number of words in the aspect word, $d_i$ represents the minimum distance between a context word and the aspect word within the sentence to which the aspect word belongs, $\operatorname{dist}(w_i, a_j)$ represents the distance between the i-th word $w_i$ and the j-th word $a_j$ of the aspect word, and T represents a distance threshold to which any minimum distance greater than T is truncated;
S3017: calculating all semantic distances in the word-vector relation graph to obtain a distance matrix $D$;
S3018: combining the word vector matrix and the distance matrix $D$ to construct the directed graph representing the word and sentence relationships.
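A small sketch of the distance computation of S3016 and S3017 is given below. It substitutes the off-the-shelf spaCy dependency parser for the biaffine parser named in the claim and, for one assumed aspect term and threshold T, computes the truncated minimum tree distance from every context word, i.e. one row of the distance matrix D.

```python
import networkx as nx
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def truncated_distances(sentence: str, aspect: str, T: int = 4) -> list:
    doc = nlp(sentence)
    # Dependency tree as an undirected graph: nodes are token indices, edges are head-child arcs.
    tree = nx.Graph()
    tree.add_edges_from((tok.i, tok.head.i) for tok in doc if tok.i != tok.head.i)
    aspect_ids = [tok.i for tok in doc if tok.text.lower() in aspect.lower().split()]
    row = []
    for tok in doc:
        # Minimum tree distance from this context word to any word of the aspect term,
        # truncated at the threshold T as in S3016.
        d = min(nx.shortest_path_length(tree, tok.i, a) for a in aspect_ids)
        row.append(min(d, T))
    return row

print(truncated_distances("The nurse gently reminded me to drink more water", "drink water"))
```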
5. The method for generating a care worker digital person according to claim 4, wherein S302 specifically comprises:
S3021: performing feature aggregation on the directed graph using the graph neural network:
$H^{(l)} = \sigma\!\big(\tilde{M}^{-\frac{1}{2}}\, \tilde{D}\, \tilde{M}^{-\frac{1}{2}}\, H^{(l-1)} W^{(l)}\big)$, with $\tilde{D} = D + I$;
wherein $H^{(l)}$ represents the feature output obtained by aggregation with the l-th layer of the graph neural network, $W^{(l)} \in \mathbb{R}^{d_{l-1} \times d_l}$ represents a learnable feature transfer matrix, $d_{l-1}$ and $d_l$ represent the feature vector dimensions before and after the transfer, $\tilde{D}$ represents the distance matrix after self-connection, I represents an identity matrix of the corresponding dimension, $\sigma$ represents a nonlinear activation function, $\tilde{M}^{-\frac{1}{2}}\tilde{D}\tilde{M}^{-\frac{1}{2}}$ represents the symmetrically normalized version of the self-connected distance matrix, $\tilde{M}$ represents the degree matrix corresponding to the self-connected distance matrix, and the normalized self-connected distance matrix plays the role of a transition probability matrix in a Markov chain.
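The aggregation of S3021 corresponds to one GCN-style layer with symmetric normalization. A minimal NumPy sketch follows; the four-node distance matrix and the feature dimensions are made-up example values.

```python
import numpy as np

def gcn_layer(H, D, W):
    D_tilde = D + np.eye(D.shape[0])                           # self-connected distance matrix D~ = D + I
    M_inv_sqrt = np.diag(1.0 / np.sqrt(D_tilde.sum(axis=1)))   # M~^(-1/2) from the degree matrix of D~
    D_norm = M_inv_sqrt @ D_tilde @ M_inv_sqrt                 # symmetric normalisation of D~
    return np.maximum(D_norm @ H @ W, 0.0)                     # ReLU as the nonlinear activation sigma

rng = np.random.default_rng(0)
D = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)   # toy 4-word distance matrix
H = rng.normal(size=(4, 8))                 # input word features H^(l-1), d_(l-1) = 8 (illustrative)
W = rng.normal(size=(8, 4))                 # learnable transfer matrix W^(l), d_l = 4 (illustrative)
print(gcn_layer(H, D, W).shape)             # (4, 4)
```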
6. The method for generating a care worker digital person according to claim 5, wherein S303 specifically comprises:
S3031: inputting the feature output obtained by aggregation into the soft activation layer, taking the cross-entropy loss function $\mathcal{L}$ as the objective function, and calculating the probability value $p_{ij}$ that the i-th aspect word $t_i$ in the s-th sentence belongs to the j-th class of emotion types:
$p_{ij} = \operatorname{softmax}\!\big(W_p\, h_{t_i}^{(L)} + b_p\big)_j$;
$\mathcal{L} = -\sum_{s=1}^{S}\sum_{i=1}^{M}\sum_{j=1}^{C} y_{ij}\,\log p_{ij}$;
wherein $h_{t_i}^{(L)}$ represents the feature output of aspect word $t_i$ obtained by aggregation through layer L of the graph neural network, softmax represents the soft activation layer, $W_p$ represents the network parameters of the soft activation layer, C represents the number of emotion types, $b_p$ represents the bias vector, $y_i$ represents the one-hot vector of the class label to which the i-th aspect word $t_i$ belongs, and S and M represent the total number of sentences and the total number of aspect words, respectively.
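A short PyTorch sketch of S3031 together with the selection step S304 follows: a linear layer plus softmax over the aggregated aspect-word features, a cross-entropy objective, and an argmax over the class probabilities. The feature size, the three emotion classes (positive, neutral, negative) and the batch of five aspect words are example assumptions.

```python
import torch
import torch.nn as nn

num_classes, feat_dim = 3, 16                      # positive / neutral / negative (C = 3), d = 16
head = nn.Linear(feat_dim, num_classes)            # W_p and b_p of the soft activation layer
features = torch.randn(5, feat_dim)                # aggregated features h_ti^(L) for five aspect words
labels = torch.tensor([0, 2, 1, 0, 2])             # class labels y_i (as indices rather than one-hot)

logits = head(features)
probs = torch.softmax(logits, dim=-1)              # p_ij: probability that aspect word i has emotion j
loss = nn.functional.cross_entropy(logits, labels) # cross-entropy objective over all aspect words
pred = probs.argmax(dim=-1)                        # S304: emotion type with the maximum probability
print(probs.shape, float(loss), pred.tolist())
```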
7. The method for generating a care worker digital person according to claim 1, wherein step S4 specifically comprises:
S401: recording, by a camera, video samples of the care worker to be simulated for each of the emotion types, and labelling each video sample and the corresponding sound sample with its emotion type, wherein the video samples contain the whole facial video of the care worker to be simulated;
S402: splitting the video into individual video frames with a video editing tool;
S403: extracting the facial features of the care worker to be simulated from the video frames using the Dlib toolkit;
S404: synthesizing the facial features through animation technology to obtain video frames under the different emotion types;
S405: separating the audio from the video samples, sampling the separated audio, and extracting the sound frequency, volume and pitch under the different emotion types;
S406: synthesizing the sound frequency, volume and pitch using the speech synthesis tool to obtain the audio streams under the different emotion types.
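Steps S402, S403 and S405 can be illustrated with common open-source tools, as in the sketch below (OpenCV for frame splitting, Dlib for facial landmarks, librosa for volume and pitch). The file names and the 68-landmark model path are assumptions, and the animation synthesis of S404 and the speech synthesis of S406 are not reproduced.

```python
import cv2
import dlib
import librosa
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # assumed local model file

def landmarks_per_frame(video_path: str) -> list:
    cap = cv2.VideoCapture(video_path)              # S402: split the video into individual frames
    landmarks = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        for face in detector(gray):                 # S403: facial feature points of the care worker
            pts = predictor(gray, face)
            landmarks.append([(pts.part(i).x, pts.part(i).y) for i in range(68)])
    cap.release()
    return landmarks

def audio_features(wav_path: str) -> tuple:
    y, sr = librosa.load(wav_path, sr=None)           # S405: audio already separated from the video
    volume = librosa.feature.rms(y=y)[0]              # frame-wise volume
    pitch = librosa.yin(y, fmin=80, fmax=400, sr=sr)  # frame-wise fundamental frequency (pitch)
    return float(np.mean(volume)), float(np.median(pitch))

print(len(landmarks_per_frame("happy_sample.mp4")))   # assumed per-emotion video sample
print(audio_features("happy_sample.wav"))             # assumed separated audio of the same sample
```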
8. The method for generating a care worker digital person according to claim 1, wherein
the dual-modality encoding-decoding network comprises a decoding module consisting of a plurality of LSTM network layers and a self-attention mechanism layer, and an audio encoding module and a video encoding module each formed by a plurality of self-attention encoding layers, wherein the audio encoding module and the video encoding module are connected with the decoding module, and the decoding module is configured to receive the output data of the audio encoding module and the video encoding module.
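A condensed PyTorch sketch of such a dual-modality encoding/decoding network is shown below: two stacks of self-attention encoding layers for audio and video, a self-attention layer, and stacked LSTM layers for decoding. The layer counts and feature sizes are illustrative assumptions (80 mel bands for audio, 68 x 2 landmark coordinates for video).

```python
import torch
import torch.nn as nn

class DualModalCodec(nn.Module):
    def __init__(self, audio_dim=80, video_dim=136, d_model=128, n_layers=2):
        super().__init__()
        self.audio_in = nn.Linear(audio_dim, d_model)
        self.video_in = nn.Linear(video_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.audio_enc = nn.TransformerEncoder(enc_layer, num_layers=n_layers)     # audio encoding module
        self.video_enc = nn.TransformerEncoder(enc_layer, num_layers=n_layers)     # video encoding module
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)  # self-attention layer
        self.lstm = nn.LSTM(d_model, d_model, num_layers=2, batch_first=True)      # multi-layer LSTM decoder

    def forward(self, audio, video):
        a = self.audio_enc(self.audio_in(audio))
        v = self.video_enc(self.video_in(video))
        joint = torch.cat([a, v], dim=1)              # concatenate the two modality encodings
        fused, _ = self.attn(joint, joint, joint)     # self-attention over the joint sequence
        out, _ = self.lstm(fused)                     # LSTM decoding of the fused sequence
        return out

model = DualModalCodec()
out = model(torch.randn(1, 50, 80), torch.randn(1, 50, 136))  # 50 audio frames, 50 video frames
print(out.shape)  # torch.Size([1, 100, 128])
```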
9. The method for generating a care worker digital person according to claim 8, wherein step S7 specifically comprises:
S701: parsing the video frames $F_v$ and the audio stream $F_a$ through a multi-modal feature fusion network to obtain the joint feature $F_{av}$:
$\operatorname{Att}_a = \operatorname{softmax}\!\left(\frac{Q_a K_a^{\top}}{\sqrt{d_k}}\right) V_a$;
$\operatorname{Att}_v = \operatorname{softmax}\!\left(\frac{Q_v K_v^{\top}}{\sqrt{d_k}}\right) V_v$;
$F_{av} = \operatorname{Joint}\!\big(\operatorname{cat}(\operatorname{Att}_a, \operatorname{Att}_v)\big)$;
wherein $Q_a$, $K_a$ and $V_a$ respectively represent the query, key and value vectors of the self-attention mechanism when parsing the audio stream, $Q_v$, $K_v$ and $V_v$ respectively represent the query, key and value vectors of the self-attention mechanism when parsing the video frames, $d_k$ represents the dimension of the key vector, the symbol "Joint" represents the fusion operation in the self-attention mechanism, the symbol "cat" represents the connection (concatenation) operation in the self-attention mechanism, and softmax represents the activation function of the soft activation layer;
S702: decoding the joint feature through the decoding module to obtain the care worker digital person:
$y_t = \lambda\, p_{att} \oplus (1-\lambda)\, p_{lstm} \oplus y_{t-1}$;
wherein $y_t$ represents the decoding output result of the care worker digital person, $p_{att}$ represents the probability parameters provided by the self-attention mechanism decoder, $p_{lstm}$ represents the probability parameters provided by the LSTM network layers, $y_{t-1}$ represents the decoding output result at time $t-1$, $\lambda$ represents a hyper-parameter, and the symbol $\oplus$ denotes stitching (concatenation).
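The fusion of S701 amounts to scaled dot-product self-attention applied separately to the audio and video features, followed by concatenation into the joint feature. A small PyTorch sketch with illustrative dimensions is given below; the exact Joint operation and the S702 decoding blend are not reproduced.

```python
import math
import torch

def self_attention(x: torch.Tensor) -> torch.Tensor:
    # softmax(Q K^T / sqrt(d_k)) V with Q = K = V = x, as in the fusion step of S701.
    d_k = x.size(-1)
    scores = torch.softmax(x @ x.transpose(-2, -1) / math.sqrt(d_k), dim=-1)
    return scores @ x

audio = torch.randn(1, 50, 128)    # encoded audio stream features
video = torch.randn(1, 50, 128)    # encoded video frame features
joint = torch.cat([self_attention(audio), self_attention(video)], dim=-1)  # joint feature
print(joint.shape)  # torch.Size([1, 50, 256])
```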
10. A system for generating a care worker digital person, comprising a processor and a memory for storing instructions executable by the processor; wherein the processor is configured to invoke the instructions stored in the memory to perform the care worker digital person generation method of any one of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410095801.4A CN117635785B (en) | 2024-01-24 | 2024-01-24 | Method and system for generating worker protection digital person |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410095801.4A CN117635785B (en) | 2024-01-24 | 2024-01-24 | Method and system for generating worker protection digital person |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117635785A CN117635785A (en) | 2024-03-01 |
CN117635785B true CN117635785B (en) | 2024-05-28 |
Family
ID=90016586
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410095801.4A Active CN117635785B (en) | 2024-01-24 | 2024-01-24 | Method and system for generating worker protection digital person |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117635785B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118195428B (en) * | 2024-05-20 | 2024-07-26 | 杭州慧言互动科技有限公司 | Digital employee optimization system and method based on AI Agent |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109446331A (en) * | 2018-12-07 | 2019-03-08 | 华中科技大学 | A kind of text mood disaggregated model method for building up and text mood classification method |
CN113779211A (en) * | 2021-08-06 | 2021-12-10 | 华中科技大学 | Intelligent question-answer reasoning method and system based on natural language entity relationship |
CN113868400A (en) * | 2021-10-18 | 2021-12-31 | 深圳追一科技有限公司 | Method and device for responding to digital human questions, electronic equipment and storage medium |
CN115834519A (en) * | 2022-12-24 | 2023-03-21 | 北京蔚领时代科技有限公司 | Intelligent question and answer method, device, server and storage medium |
CN116226347A (en) * | 2022-10-26 | 2023-06-06 | 中国科学院软件研究所 | Fine granularity video emotion content question-answering method and system based on multi-mode data |
CN116414959A (en) * | 2023-02-23 | 2023-07-11 | 厦门黑镜科技有限公司 | Digital person interaction control method and device, electronic equipment and storage medium |
CN116524924A (en) * | 2023-04-23 | 2023-08-01 | 厦门黑镜科技有限公司 | Digital human interaction control method, device, electronic equipment and storage medium |
US11727915B1 (en) * | 2022-10-24 | 2023-08-15 | Fujian TQ Digital Inc. | Method and terminal for generating simulated voice of virtual teacher |
CN116954418A (en) * | 2023-07-28 | 2023-10-27 | 武汉市万睿数字运营有限公司 | Exhibition hall digital person realization method, device, equipment and storage medium |
CN117216209A (en) * | 2023-09-04 | 2023-12-12 | 上海深至信息科技有限公司 | Ultrasonic examination report reading system based on large language model |
CN117234369A (en) * | 2023-08-21 | 2023-12-15 | 华院计算技术(上海)股份有限公司 | Digital human interaction method and system, computer readable storage medium and digital human equipment |
Patent Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109446331A (en) * | 2018-12-07 | 2019-03-08 | 华中科技大学 | A kind of text mood disaggregated model method for building up and text mood classification method |
CN113779211A (en) * | 2021-08-06 | 2021-12-10 | 华中科技大学 | Intelligent question-answer reasoning method and system based on natural language entity relationship |
CN113868400A (en) * | 2021-10-18 | 2021-12-31 | 深圳追一科技有限公司 | Method and device for responding to digital human questions, electronic equipment and storage medium |
US11727915B1 (en) * | 2022-10-24 | 2023-08-15 | Fujian TQ Digital Inc. | Method and terminal for generating simulated voice of virtual teacher |
CN116226347A (en) * | 2022-10-26 | 2023-06-06 | 中国科学院软件研究所 | Fine granularity video emotion content question-answering method and system based on multi-mode data |
CN115834519A (en) * | 2022-12-24 | 2023-03-21 | 北京蔚领时代科技有限公司 | Intelligent question and answer method, device, server and storage medium |
CN116414959A (en) * | 2023-02-23 | 2023-07-11 | 厦门黑镜科技有限公司 | Digital person interaction control method and device, electronic equipment and storage medium |
CN116524924A (en) * | 2023-04-23 | 2023-08-01 | 厦门黑镜科技有限公司 | Digital human interaction control method, device, electronic equipment and storage medium |
CN116954418A (en) * | 2023-07-28 | 2023-10-27 | 武汉市万睿数字运营有限公司 | Exhibition hall digital person realization method, device, equipment and storage medium |
CN117234369A (en) * | 2023-08-21 | 2023-12-15 | 华院计算技术(上海)股份有限公司 | Digital human interaction method and system, computer readable storage medium and digital human equipment |
CN117216209A (en) * | 2023-09-04 | 2023-12-12 | 上海深至信息科技有限公司 | Ultrasonic examination report reading system based on large language model |
Also Published As
Publication number | Publication date |
---|---|
CN117635785A (en) | 2024-03-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021104099A1 (en) | Multimodal depression detection method and system employing context awareness | |
Wang et al. | A meta-analysis of the predictability of LENA™ automated measures for child language development | |
CN110148318B (en) | Digital teaching assistant system, information interaction method and information processing method | |
Narayanan et al. | Behavioral signal processing: Deriving human behavioral informatics from speech and language | |
Lee et al. | Study on emotion recognition and companion Chatbot using deep neural network | |
CN117635785B (en) | Method and system for generating worker protection digital person | |
Griol et al. | Mobile conversational agents for context-aware care applications | |
Franciscatto et al. | Towards a speech therapy support system based on phonological processes early detection | |
CN113035232B (en) | Psychological state prediction system, method and device based on voice recognition | |
Wilks et al. | A prototype for a conversational companion for reminiscing about images | |
Qian et al. | English language teaching based on big data analytics in augmentative and alternative communication system | |
Harati et al. | Speech-based depression prediction using encoder-weight-only transfer learning and a large corpus | |
Sinclair et al. | Using machine learning to predict children’s reading comprehension from linguistic features extracted from speech and writing. | |
Chandler et al. | Machine learning for ambulatory applications of neuropsychological testing | |
US20230320642A1 (en) | Systems and methods for techniques to process, analyze and model interactive verbal data for multiple individuals | |
Brinkschulte et al. | The EMPATHIC project: building an expressive, advanced virtual coach to improve independent healthy-life-years of the elderly | |
Bhatia | Using transfer learning, spectrogram audio classification, and MIT app inventor to facilitate machine learning understanding | |
Lin et al. | Advancing naturalistic affective science with deep learning | |
Zhao et al. | [Retracted] Standardized Evaluation Method of Pronunciation Teaching Based on Deep Learning | |
CN117711404A (en) | Method, device, equipment and storage medium for evaluating oral-language review questions | |
Schipor et al. | Towards a multimodal emotion recognition framework to be integrated in a Computer Based Speech Therapy System | |
Huang et al. | Inferring Stressors from Conversation: Towards an Emotional Support Robot Companion | |
Du et al. | Composite Emotion Recognition and Feedback of Social Assistive Robot for Elderly People | |
Bose | Continuous emotion prediction from speech: Modelling ambiguity in emotion | |
Mallios | Virtual doctor: an intelligent human-computer dialogue system for quick response to people in need |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |