CN112002329A - Physical and mental health monitoring method and device and computer readable storage medium - Google Patents

Physical and mental health monitoring method and device and computer readable storage medium Download PDF

Info

Publication number
CN112002329A
Authority
CN
China
Prior art keywords
text
audio
physical
emotion
mental health
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010925877.7A
Other languages
Chinese (zh)
Other versions
CN112002329B (en)
Inventor
温馨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen TCL New Technology Co Ltd
Original Assignee
Shenzhen TCL New Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen TCL New Technology Co Ltd filed Critical Shenzhen TCL New Technology Co Ltd
Priority to CN202010925877.7A priority Critical patent/CN112002329B/en
Publication of CN112002329A publication Critical patent/CN112002329A/en
Application granted granted Critical
Publication of CN112002329B publication Critical patent/CN112002329B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • A - HUMAN NECESSITIES
    • A61 - MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B - DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B 5/00 - Measuring for diagnostic purposes; Identification of persons
    • A61B 5/02 - Detecting, measuring or recording pulse, heart rate, blood pressure or blood flow; Combined pulse/heart-rate/blood pressure determination; Evaluating a cardiovascular condition not otherwise provided for, e.g. using combinations of techniques provided for in this group with electrocardiography or electroauscultation; Heart catheters for measuring blood pressure
    • A61B 5/021 - Measuring pressure in heart or blood vessels
    • A - HUMAN NECESSITIES
    • A61 - MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B - DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B 5/00 - Measuring for diagnostic purposes; Identification of persons
    • A61B 5/02 - Detecting, measuring or recording pulse, heart rate, blood pressure or blood flow; Combined pulse/heart-rate/blood pressure determination; Evaluating a cardiovascular condition not otherwise provided for, e.g. using combinations of techniques provided for in this group with electrocardiography or electroauscultation; Heart catheters for measuring blood pressure
    • A61B 5/024 - Detecting, measuring or recording pulse rate or heart rate
    • A - HUMAN NECESSITIES
    • A61 - MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B - DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B 5/00 - Measuring for diagnostic purposes; Identification of persons
    • A61B 5/16 - Devices for psychotechnics; Testing reaction times; Devices for evaluating the psychological state
    • A61B 5/165 - Evaluating the state of mind, e.g. depression, anxiety
    • A - HUMAN NECESSITIES
    • A61 - MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B - DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B 5/00 - Measuring for diagnostic purposes; Identification of persons
    • A61B 5/48 - Other medical applications
    • A61B 5/4803 - Speech analysis specially adapted for diagnostic purposes
    • A - HUMAN NECESSITIES
    • A61 - MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B - DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B 5/00 - Measuring for diagnostic purposes; Identification of persons
    • A61B 5/68 - Arrangements of detecting, measuring or recording means, e.g. sensors, in relation to patient
    • A61B 5/6887 - Arrangements of detecting, measuring or recording means, e.g. sensors, in relation to patient mounted on external non-worn devices, e.g. non-medical devices
    • A - HUMAN NECESSITIES
    • A61 - MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B - DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B 5/00 - Measuring for diagnostic purposes; Identification of persons
    • A61B 5/68 - Arrangements of detecting, measuring or recording means, e.g. sensors, in relation to patient
    • A61B 5/6887 - Arrangements of detecting, measuring or recording means, e.g. sensors, in relation to patient mounted on external non-worn devices, e.g. non-medical devices
    • A61B 5/6891 - Furniture
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/279 - Recognition of textual entities
    • G06F 40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/30 - Semantic analysis
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/66 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for extracting parameters related to health condition

Abstract

The invention discloses a physical and mental health monitoring method, which comprises the following steps: receiving voice information, converting the voice information into text information, and generating a text emotional state according to the text information; extracting audio features of the voice information, and generating an audio emotional state according to the audio features; fusing the text emotion state and the audio emotion state to obtain a voice emotion state; and acquiring biological indexes, and combining the voice emotion state with the biological indexes to generate physical and mental health monitoring information. The invention also discloses a physical and mental health monitoring device, equipment and a computer readable storage medium. The invention realizes the comprehensive monitoring of the physical and mental health of the user without increasing the burden on the user.

Description

Physical and mental health monitoring method and device and computer readable storage medium
Technical Field
The present invention relates to the field of health monitoring, and in particular, to a method, an apparatus, a device and a computer-readable storage medium for physical and mental health monitoring.
Background
With the improvement of quality of life and the acceleration of the pace of life, people pay increasing attention to their health. Traditional health monitoring methods acquire a user's physical sign information through sensors worn by the user or implanted in the body. However, whether worn or implanted, such sensors place an extra burden on the user, and they can only acquire certain biological indexes of the user, so they cannot comprehensively monitor health (for example, mental health).
Disclosure of Invention
The invention mainly aims to provide a physical and mental health monitoring method, equipment and a computer readable storage medium, so as to solve the technical problems that traditional health monitoring methods place an extra burden on the user and cannot monitor health comprehensively.
In order to achieve the above object, the present invention provides a physical and mental health monitoring method, including the following steps:
receiving voice information, converting the voice information into text information, and generating a text emotional state according to the text information;
extracting audio features of the voice information, and generating an audio emotional state according to the audio features;
fusing the text emotion state and the audio emotion state to obtain a voice emotion state;
and acquiring biological indexes, and combining the voice emotion state with the biological indexes to generate physical and mental health monitoring information.
Optionally, the step of generating a text emotional state according to the text information includes:
performing word segmentation processing on the text information to obtain a target word;
acquiring emotion classification results corresponding to a preset voice classification model and a text database associated with each emotion classification result;
calculating the existence proportion of each target vocabulary in each text database and the sum of the existence proportions of all the target vocabularies in each text database;
and taking the emotion classification result associated with the text database with the largest sum of the existing ratios as the text emotion state.
Optionally, the step of generating a text emotional state according to the text information includes:
vectorizing the text information to obtain a text vector;
and inputting the text vector to a preset text emotion sensor to obtain a text emotion state corresponding to the text information.
Optionally, the preset textual emotion sensor includes: closed recurrence models and logistic regression models;
the step of inputting the text vector to a preset text emotion sensor to obtain a text emotion state corresponding to the text information comprises:
inputting the text vector into a coding module of the closed recurrence model to obtain a text coding vector;
decoding the text coding vector through a decoding module of the closed recurrence model to obtain emotional characteristics;
and inputting the emotion characteristics into the logistic regression model to perform emotion classification processing, and generating a text emotion state.
Optionally, the step of extracting the audio feature of the voice information and generating an audio emotional state according to the audio feature includes:
extracting the audio features of the voice information, and performing vectorization processing on the audio features to obtain audio vectors;
inputting the audio vector to a coding module of a preset sequence coding and decoding model to obtain an audio coding vector;
and decoding the audio coding vector through a decoding module of the preset sequence coding and decoding model to generate an audio emotional state.
Optionally, the step of fusing the text emotional state and the audio emotional state to obtain a speech emotional state includes:
inputting the text emotion state and the audio emotion state into a preset classification model, and sequentially passing through a full connection layer and a logistic regression layer in the preset classification model;
querying a classification result corresponding to the preset classification model, and acquiring a text probability value and an audio probability value associated with each classification result;
and calculating the numerical sum of the text probability value and the audio probability value associated with each classification result, and taking the classification result with the maximum numerical sum as the speech emotion state.
Optionally, after the step of fusing the text emotional state and the audio emotional state to obtain a speech emotional state, the method includes:
determining a target conversation state according to the text information and the voice emotion state, and searching a target utterance with the highest matching degree with the target conversation state in a preset conversation database;
scoring the target utterance to obtain a matching score;
and if the matching score is larger than a preset threshold value, outputting the target utterance.
Optionally, after the step of scoring the target utterance to obtain a match score, the method includes:
if the matching score is smaller than or equal to a preset threshold value, inputting the current conversation state into a preset voice response model to generate response voice;
and outputting the response voice.
In addition, to achieve the above object, the present invention also provides a physical and mental health monitoring apparatus, including: the monitoring system comprises a memory, a processor and a physical and mental health monitoring program which is stored on the memory and can run on the processor, wherein the physical and mental health monitoring program realizes the steps of the physical and mental health monitoring method when being executed by the processor.
In addition, to achieve the above object, the present invention further provides a computer readable storage medium, having a physical and mental health monitoring program stored thereon, wherein the physical and mental health monitoring program, when executed by a processor, implements the steps of the physical and mental health monitoring method as described above.
The embodiment of the invention provides a physical and mental health monitoring method, equipment and a computer readable storage medium. In the embodiment of the invention, after receiving voice information generated by a user, a physical and mental health monitoring program converts the voice information into text information and generates a corresponding text emotional state according to the text information. The program further extracts audio features of the voice information and generates an audio emotional state according to the extracted audio features. It then fuses the text emotional state and the audio emotional state to finally obtain the voice emotional state, which can represent the mental health condition of the user. Finally, by acquiring biological indexes of the user, the program combines the voice emotional state with the biological indexes to generate and output the physical and mental health monitoring information. Based on the voice information generated by the user and the user's biological indexes, the invention realizes comprehensive monitoring of the user's physical and mental health without adding any burden on the user.
Drawings
Fig. 1 is a schematic diagram of the hardware structure of a physical and mental health monitoring apparatus according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a physical and mental health monitoring method according to a first embodiment of the present invention;
fig. 3 is a schematic view of a physical and mental health monitoring process in a first embodiment of the physical and mental health monitoring method according to the present invention;
fig. 4 is a flowchart illustrating a physical and mental health monitoring method according to a second embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In the following description, suffixes such as "module", "component", or "unit" used to denote elements are adopted only to facilitate the description of the present invention and have no specific meaning in themselves. Thus, "module", "component" and "unit" may be used interchangeably.
The physical and mental health monitoring terminal (also called terminal, equipment or terminal equipment) in the embodiment of the invention may be a terminal with an information processing function, such as a PC, a smart phone, a tablet computer or a portable computer, or may be one of various patch sensors capable of acquiring bioelectrical signals or a device with a recording function.
As shown in fig. 1, the terminal may include: a processor 1001 (such as a CPU), a network interface 1004, a user interface 1003, a memory 1005 and a communication bus 1002. The communication bus 1002 is used to implement connection and communication between these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard); optionally, the user interface 1003 may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory). The memory 1005 may alternatively be a storage device separate from the processor 1001.
Optionally, the terminal may further include a camera, a radio frequency (RF) circuit, a sensor, an audio circuit, a WiFi module, and the like. The sensors include, for example, light sensors, motion sensors and other sensors. Specifically, the light sensor may include an ambient light sensor, which can adjust the brightness of the display screen according to the brightness of ambient light, and a proximity sensor, which can turn off the display screen and/or the backlight when the mobile terminal is moved to the ear. As one kind of motion sensor, the gravity acceleration sensor can detect the magnitude of acceleration in each direction (generally three axes) and can detect the magnitude and direction of gravity when the mobile terminal is stationary; it can be used for applications that recognize the attitude of the mobile terminal (such as horizontal and vertical screen switching, related games and magnetometer attitude calibration), vibration recognition related functions (such as a pedometer and tapping), and the like. Of course, the mobile terminal may also be configured with other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer and an infrared sensor, which are not described herein again.
Those skilled in the art will appreciate that the terminal structure shown in fig. 1 is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
As shown in fig. 1, the memory 1005, which is a kind of computer storage medium, may include therein an operating system, a network communication module, a user interface module, and a physical and mental health monitoring program.
In the terminal shown in fig. 1, the network interface 1004 is mainly used for connecting to a backend server and performing data communication with the backend server; the user interface 1003 is mainly used for connecting a client (user side) and performing data communication with the client; and the processor 1001 may be configured to call a physical and mental health monitoring program stored in the memory 1005, and when the physical and mental health monitoring program is executed by the processor, the physical and mental health monitoring program implements the operations of the physical and mental health monitoring method provided in the following embodiments.
Based on the above-mentioned physical and mental health monitoring device hardware structure, the embodiment of the physical and mental health monitoring method of the invention is provided.
Referring to fig. 2, in a first embodiment of the physical and mental health monitoring method of the present invention, the physical and mental health monitoring method includes:
and step S10, receiving voice information, converting the voice information into text information, and generating a text emotional state according to the text information.
The physical and mental health monitoring method in this embodiment is applied to a physical and mental health monitoring terminal. The physical and mental health monitoring terminal includes information processing devices such as personal computers and smart phones (hereinafter represented by a smart phone), various patch sensors capable of acquiring bioelectrical signals (such as heart rate and blood pressure), and devices equipped with a sound recorder capable of acquiring audio information (such as a smart sound box or a smart television with a voice interaction function). The voice information in this embodiment is acquired by a preset audio acquisition unit and sent to the device on which the physical and mental health monitoring program is installed, where the preset audio acquisition unit refers to a component that can acquire audio information, that is, a component with a recording function.
This embodiment provides a specific application scenario. In an existing smart television with a voice interaction function, an audio acquisition unit is installed inside the television. When the television is in a power-on state, the audio acquisition unit can acquire audio information in real time. The acquired audio information includes near-field audio information and far-field audio information: the near-field audio information refers to audio generated by the television itself, and the far-field audio information refers to audio generated far away from the audio acquisition unit (generally 2 to 8 meters). Near-field audio information in the audio information acquired by the audio acquisition unit can be filtered out by an existing active noise reduction technology to obtain the far-field audio information, from which the voice information (namely the voice information in this embodiment) is further extracted. The voice information is converted into text information by an existing speech-to-text technology, semantic information (including field, intention, topic and the like) corresponding to the text information is acquired, and the emotional state represented by the text information (namely the text emotional state in this embodiment) can be determined from the semantic information, where the emotional states include angry, happy, sad, neutral, surprised, disgusted, afraid, non-neutral and the like. For example, if a piece of text information converted from voice information is "I am in a good mood today", the text emotional state represented by the text information can be determined to be happy by acquiring the semantic information corresponding to the text information.
And step S20, extracting the audio features of the voice information, and generating an audio emotional state according to the audio features.
It can be known that, while converting the voice information into the text information, the physical and mental health monitoring program also extracts audio features of the voice information. The extracted audio features include speech rate, pitch and timbre, and by analyzing these audio features, the emotional state they represent (namely the audio emotional state in this embodiment) can be determined. For example, if the extracted audio features indicate a fast speech rate, a high pitch and a hoarse timbre, the audio emotional state represented by the audio features can be determined to be angry. Specifically, whether the speech rate is fast or the pitch is high can be judged by setting thresholds; for example, when the time interval between two adjacent sounds in the acquired voice information is shorter than a preset interval threshold, the speech rate is judged to be fast.
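As a minimal illustration of the thresholding described above, speech rate can be estimated from the intervals between adjacent sounds and pitch from the fundamental-frequency track. The threshold values and function names below are assumptions for illustration only; the patent does not specify them.

```python
import numpy as np

# Assumed thresholds for illustration only; the patent does not specify values.
FAST_INTERVAL_S = 0.15   # interval between adjacent sounds below this => "fast"
HIGH_PITCH_HZ = 250.0    # mean fundamental frequency above this => "high pitch"

def speech_rate_label(sound_onsets_s):
    """Label the speech rate from the onset times of adjacent sounds."""
    intervals = np.diff(np.asarray(sound_onsets_s, dtype=float))
    return "fast" if intervals.mean() < FAST_INTERVAL_S else "normal"

def pitch_label(f0_track_hz):
    """Label the pitch from a fundamental-frequency track (unvoiced frames = 0)."""
    f0 = np.asarray(f0_track_hz, dtype=float)
    voiced = f0[f0 > 0]
    return "high" if voiced.size and voiced.mean() > HIGH_PITCH_HZ else "normal"

# Example: densely spaced sounds and a high pitch track suggest an angry state.
print(speech_rate_label([0.00, 0.12, 0.22, 0.33]))   # -> fast
print(pitch_label([0, 260.0, 270.0, 0, 255.0]))      # -> high
```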
And step S30, fusing the text emotional state and the audio emotional state to obtain a voice emotional state.
It is also known that the generated text emotional state and the audio emotional state need to be fused to generate the final voice emotional state, i.e., the emotional state represented by the voice information. In general, the generated text emotional state and audio emotional state are the same; when they are not, the physical and mental health monitoring program may select either the text emotional state or the audio emotional state as the finally generated voice emotional state according to their priorities.
It can be known that, whether for the text emotional state or the audio emotional state, the representable emotional states include angry, happy, sad, neutral, surprised, disgusted, afraid, non-neutral and the like. In most cases, the generated text emotional state and audio emotional state are the same. When they are not the same, the physical and mental health monitoring program acquires the priorities of the text emotional state and the audio emotional state, and then determines, according to these priorities, whether to select the text emotional state or the audio emotional state as the finally generated voice emotional state. If the priority of the text emotional state is higher than that of the audio emotional state, the emotion represented by the text information is stronger, and the program selects the text emotional state as the finally generated voice emotional state; if the priority of the audio emotional state is higher than that of the text emotional state, the emotion represented by the audio features is stronger, and the program selects the audio emotional state as the finally generated voice emotional state.
And step S40, acquiring biological indexes, and combining the voice emotion state with the biological indexes to generate physical and mental health monitoring information.
It can be known that the physical and mental health monitoring program in this embodiment may also acquire various biological indexes of the user, such as blood pressure and heart rate, through a patch sensor. The patch sensor is a sensor in a special form that can be attached to a chair or a bed; compared with a worn sensor (such as a smart bracelet or a smart watch), it places no burden on the user. Finally, the physical and mental health monitoring program may periodically generate physical and mental health monitoring information formed by combining the voice emotional state with the biological indexes, and this physical and mental health monitoring information may also be sent to the user's smart phone for the user to view.
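A minimal sketch of combining the voice emotional state with biological indexes into a periodic monitoring record follows; the field names, types and output format are illustrative assumptions and are not defined by the patent.

```python
from dataclasses import dataclass, asdict
import json
import time

@dataclass
class MonitoringRecord:
    timestamp: float
    voice_emotion: str        # e.g. "sad", output of the fusion step
    heart_rate_bpm: float     # from the patch sensor
    blood_pressure_mmHg: tuple  # (systolic, diastolic)

def build_record(voice_emotion, heart_rate_bpm, blood_pressure_mmHg):
    """Combine the voice emotional state with the acquired biological indexes."""
    return MonitoringRecord(time.time(), voice_emotion, heart_rate_bpm, blood_pressure_mmHg)

record = build_record("sad", 88.0, (135, 85))
# The record could then be sent periodically to the user's smart phone for viewing.
print(json.dumps(asdict(record)))
```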
Specifically, the refinement of step S10 includes:
step a1, performing word segmentation processing on the text information to obtain a target word.
Step a2, obtaining emotion classification results corresponding to the preset voice classification models and a text database associated with each emotion classification result.
Step a3, calculating the existence proportion of each target vocabulary in each text database and the sum of the existence proportions of all the target vocabularies in each text database.
Step a4, taking the emotion classification result associated with the text database with the largest sum of existence proportions as the text emotional state.
In this embodiment, the text information is the text information converted from the voice information, and it is mostly a complete sentence. It can be understood that, by decomposing the complete sentence, the phrases constituting it (namely the target vocabulary in this embodiment) are easily obtained. The phrases constituting a complete sentence include the subject, predicate, object and so on in the grammatical structure. Obviously, the subject has little reference value for determining the emotion represented by the text, while the predicate and the object have different reference values, with the object generally having the larger one. The preset voice classification model (namely the model used in the semantic understanding module in fig. 3) determines a plurality of emotion classification results during modeling, and each emotion classification result corresponds to one text database (namely the text database associated with each emotion classification result in this embodiment). Most of the words in a text database are therefore objects; the words in the text database are segmented from predetermined complete sentences, and all segmented words are stored in the text database. The larger the existence proportion of a word in a text database, the greater the reference value of that word for generating the corresponding emotional state. For example, the word "开心" ("happy") occupies a large proportion in the text database corresponding to the emotional state "happy", because many sentences containing "开心" are judged to express the happy emotional state, so its proportion is highest in the text database associated with "happy". For the text information "真开心" ("really happy"), the target words "真" ("really") and "开心" ("happy") are obtained after word segmentation. The target word "真" is a degree adverb: its proportion is larger in polarized emotions (for example, happy and sad) and smaller in neutral emotions; the target word "开心" is obviously most relevant to the happy emotional state. As shown in Table 1, the numbers in the table are the existence proportions of the target words in the text database associated with each emotional state, and the emotional state with the largest sum of the existence proportions of "真" and "开心" is happy.
Word | Angry | Happy | Sad | Neutral | Surprised | Disgusted | Afraid | Non-neutral
"真" (really) | 0.05 | 0.05 | 0.05 | 0.001 | 0.05 | 0.05 | 0.05 | 0.005
"开心" (happy) | 0.003 | 0.1 | 0.001 | 0.01 | 0.008 | 0.002 | 0.001 | 0.005
TABLE 1
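A minimal sketch of the proportion-sum selection illustrated by Table 1 follows; the dictionaries restate the table values, and the variable and function names are assumptions for illustration.

```python
# Existence proportion of each target word in the text database associated with
# each emotion classification result (values taken from Table 1).
PROPORTIONS = {
    "真 (really)": {"angry": 0.05, "happy": 0.05, "sad": 0.05, "neutral": 0.001,
                    "surprised": 0.05, "disgusted": 0.05, "afraid": 0.05, "non-neutral": 0.005},
    "开心 (happy)": {"angry": 0.003, "happy": 0.1, "sad": 0.001, "neutral": 0.01,
                     "surprised": 0.008, "disgusted": 0.002, "afraid": 0.001, "non-neutral": 0.005},
}

def text_emotion_state(target_words):
    """Sum, per emotion, the existence proportions of all target words and
    return the emotion whose text database has the largest sum."""
    emotions = next(iter(PROPORTIONS.values())).keys()
    sums = {e: sum(PROPORTIONS[w][e] for w in target_words if w in PROPORTIONS)
            for e in emotions}
    return max(sums, key=sums.get)

print(text_emotion_state(["真 (really)", "开心 (happy)"]))  # -> happy (sum 0.15)
```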
Specifically, the refinement of step S10 further includes:
and b1, vectorizing the text information to obtain a text vector.
Step b2, inputting the text vector to a preset text emotion sensor to obtain a text emotion state corresponding to the text information.
Specifically, the refinement of step b2 includes:
and c1, inputting the text vector into the encoding module of the closed recurrence model to obtain a text encoding vector.
And c2, decoding the text coding vector through a decoding module of the closed recurrence model to obtain the emotional characteristics.
And c3, inputting the emotion characteristics into the logistic regression model to perform emotion classification processing, and generating a text emotion state.
In this embodiment, when the text information converted from the voice information is acquired, semantic information corresponding to the text information is further acquired. It can be understood that acquiring the semantic information corresponding to the text information is a process of splitting the text information, where the classification tags obtained by splitting include field, intention, topic and the like. The semantic information obtained by splitting the text information therefore consists of a plurality of tags, and vectorizing the semantic information to obtain the text vector essentially means treating each tag as a vector component. The preset text emotion sensor in this embodiment is a tool that can generate a text emotional state according to the text information; it uses a multi-layer GRU (gated recurrent unit) model, i.e. the closed recurrence model, which is a language model used to judge whether a sentence is reasonable. The GRU model includes an encoding module and a decoding module. The encoding module uses a bidirectional GRU, takes the text vector as input and outputs an encoding vector; the decoding module uses a bidirectional GRU combined with dot-product attention to obtain the emotional features, where dot-product attention means that, when a word is predicted or inferred, the strength of the relationship between that word and each word in the text is computed, and the weighted text vectors are then summed to obtain the predicted or inferred word. Therefore, after the text vector is input into the encoding module of the closed recurrence model, the text encoding vector is obtained; the text encoding vector is decoded by the decoding module of the closed recurrence model, and since the obtained emotional features may be of various types, they are further input into the logistic regression model for emotion classification, finally yielding the text emotional state.
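A minimal PyTorch-style sketch of a gated recurrent encoder-decoder with dot-product attention followed by a softmax ("logistic regression") classifier is given below, assuming the PyTorch library is available. The vocabulary size, layer sizes and attention pooling are simplified assumptions, not the patent's exact model.

```python
import torch
import torch.nn as nn

class TextEmotionSensor(nn.Module):
    """GRU encoder + GRU decoder with dot-product attention,
    followed by a linear/softmax ("logistic regression") classifier."""
    def __init__(self, vocab_size=5000, emb=128, hidden=128, n_emotions=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.encoder = nn.GRU(emb, hidden, bidirectional=True, batch_first=True)
        self.decoder = nn.GRU(2 * hidden, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, n_emotions)

    def forward(self, token_ids):                      # (batch, seq)
        enc, _ = self.encoder(self.embed(token_ids))   # text encoding vectors
        dec, _ = self.decoder(enc)                     # (batch, seq, hidden)
        # Dot-product attention against the last decoder state, then weighted sum.
        scores = torch.softmax(dec @ dec[:, -1:, :].transpose(1, 2), dim=1)
        emotion_feature = (scores * dec).sum(dim=1)    # (batch, hidden)
        return torch.softmax(self.classifier(emotion_feature), dim=-1)

probs = TextEmotionSensor()(torch.randint(0, 5000, (1, 12)))
print(probs.argmax(dim=-1))  # index of the predicted text emotional state
```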
Specifically, the refinement of step S20 includes:
and d1, extracting the audio features of the voice information, and performing vectorization processing on the audio features to obtain an audio vector.
And d2, inputting the audio vector to a coding module of a preset sequence coding and decoding model to obtain an audio coding vector.
And d3, decoding the audio coding vector through a decoding module of the preset sequence coding and decoding model to generate an audio emotional state.
The preset sequence coding and decoding model in this embodiment is the model used in the speech emotion sensor (namely the emotion sensing module in fig. 3). The speech emotion sensor is a tool that can generate an audio emotional state according to audio features, and the preset sequence coding and decoding model likewise includes an encoding module and a decoding module. After the physical and mental health monitoring program extracts the audio features of the voice information, it vectorizes the audio features to obtain audio vectors, inputs the audio vectors into the encoding module of the preset sequence coding and decoding model to obtain audio coding vectors, and decodes the audio coding vectors through the decoding module of the preset sequence coding and decoding model. Since there may be a plurality of candidate audio emotional states, the decoded features are further input into a logistic regression model for emotion classification, finally yielding the audio emotional state.
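A minimal sketch of the sequence encoder-decoder over frame-level audio feature vectors, again assuming PyTorch; the feature dimension, hidden size and pooling of the decoder output are illustrative assumptions rather than the patent's exact design.

```python
import torch
import torch.nn as nn

class AudioEmotionSensor(nn.Module):
    """Sequence encoder-decoder over frame-level audio feature vectors
    (e.g. speech rate, pitch, energy), ending in a softmax emotion classifier."""
    def __init__(self, feat_dim=40, hidden=64, n_emotions=8):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, n_emotions)

    def forward(self, frames):                 # (batch, n_frames, feat_dim)
        enc, _ = self.encoder(frames)          # audio coding vectors
        _, last = self.decoder(enc)            # decode; keep the final hidden state
        return torch.softmax(self.classifier(last[-1]), dim=-1)

audio_vectors = torch.randn(1, 200, 40)        # 200 frames of 40-dim audio features
print(AudioEmotionSensor()(audio_vectors).argmax(dim=-1))
```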
Specifically, the refinement of step S30 includes:
and e1, inputting the text emotion state and the audio emotion state into a preset classification model, and sequentially passing through a full connection layer and a logistic regression layer in the preset classification model.
And e2, querying the classification result corresponding to the preset classification model, and acquiring a text probability value and an audio probability value associated with each classification result.
And e3, calculating the numerical sum of the text probability value and the audio probability value associated with each classification result, and taking the classification result with the maximum numerical sum as the speech emotion state.
In this embodiment, the preset classification model is a model for handling a multi-classification problem. The text emotional state and the audio emotional state obtained in the above embodiments are input into the preset classification model in turn. It can be known that the preset classification model corresponds to a plurality of classification results, such as angry, happy, sad, neutral, surprised, disgusted, afraid and non-neutral. The full connection layer and the logistic regression layer in this embodiment belong to the preset classification model: the full connection layer is used to establish associations between the input parameters (i.e., the text emotional state and the audio emotional state) and each classification result, and the logistic regression layer is used to calculate the probability that the input parameters correspond to each classification result. When the text emotional state and the audio emotional state are input into the preset classification model in turn and pass through the full connection layer and the logistic regression layer, the preset classification model outputs two groups of probability values, corresponding respectively to the probabilities of the text emotional state under each classification result and the probabilities of the audio emotional state under each classification result, and the probability values in each group sum to 1.
This embodiment gives a specific application scenario. Assuming that the text emotional state is a and the audio emotional state is b, after a and b are input into the preset classification model, the obtained outputs are shown in Tables 2 and 3, which represent the classification results of the text emotional state and the audio emotional state respectively. The probability values under the same classification result are added to obtain a sum, and the classification result with the largest sum is selected as the output of the preset classification model (namely the speech emotional state in this embodiment).
Result | Angry | Happy | Sad | Neutral | Surprised | Disgusted | Afraid | Non-neutral
Probability | 0.02 | 0.01 | 0.8 | 0.01 | 0.01 | 0.01 | 0.13 | 0.01
TABLE 2
Result | Angry | Happy | Sad | Neutral | Surprised | Disgusted | Afraid | Non-neutral
Probability | 0.02 | 0.01 | 0.82 | 0.02 | 0.01 | 0.01 | 0.11 | 0.01
TABLE 3
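A minimal sketch of the fusion rule illustrated by Tables 2 and 3: the text and audio probability values under each classification result are summed, and the result with the largest sum is output. The probability values restate the tables; the names are illustrative.

```python
EMOTIONS = ["angry", "happy", "sad", "neutral", "surprised", "disgusted", "afraid", "non-neutral"]
text_probs  = [0.02, 0.01, 0.80, 0.01, 0.01, 0.01, 0.13, 0.01]   # Table 2
audio_probs = [0.02, 0.01, 0.82, 0.02, 0.01, 0.01, 0.11, 0.01]   # Table 3

def fuse(text_probs, audio_probs):
    """Return the classification result with the largest text + audio probability sum."""
    sums = [t + a for t, a in zip(text_probs, audio_probs)]
    return EMOTIONS[sums.index(max(sums))]

print(fuse(text_probs, audio_probs))  # -> sad (0.80 + 0.82 = 1.62)
```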
In this embodiment, after receiving voice information generated by a user, the physical and mental health monitoring program converts the voice information into text information and generates a corresponding text emotional state according to the text information. The program further extracts audio features of the voice information and generates an audio emotional state according to the extracted audio features. It then fuses the text emotional state and the audio emotional state to finally obtain the voice emotional state, which can represent the mental health condition of the user. Finally, by acquiring biological indexes of the user, the program combines the voice emotional state with the biological indexes, and generates and outputs the physical and mental health monitoring information. Based on the voice information generated by the user and the user's biological indexes, the invention realizes comprehensive monitoring of the user's physical and mental health without adding any burden on the user.
Further, referring to fig. 4, a second embodiment of the method for monitoring physical and mental health of the present invention is provided on the basis of the above-mentioned embodiment of the present invention.
This embodiment describes steps performed after step S30 of the first embodiment, and differs from the above-described embodiment of the present invention in the following:
and step S50, determining a target conversation state according to the text information and the voice emotion state, and searching a target utterance with the highest matching degree with the target conversation state in a preset conversation database.
It should be noted that the target dialog state in this embodiment means that the physical and mental health monitoring program gives different responses for different voice emotional states: if the voice emotional state is a negative emotion (e.g., sad or disgusted), the program gives a soothing response; if the voice emotional state is a positive emotion (e.g., happy), the program gives an echoing response. The difference between target dialog states is essentially a difference in the tone of the reply to the speech uttered by the user. It can be known that different target dialog states are matched with different dialogue databases (namely the preset dialogue database in this embodiment), and the preset dialogue database stores many utterances used for replying. The reply utterance with the highest matching degree with the target dialog state (namely the target utterance in this embodiment) is selected from the preset dialogue database. For example, in the product testing phase, developers determine the optimal reply sentence through a large number of voice reply experiments, and this optimal reply sentence is used as the utterance with the highest matching degree with a certain target dialog state, as in the utterance matching module in fig. 3. For example, when the physical and mental health monitoring program determines that the voice emotional state of the user is angry, the target dialog state is a soothing reply to anger; the program searches the preset dialogue database and determines that the sentence "Don't be angry, anger hurts the body" is the target utterance with the highest matching degree with the target dialog state, as in the utterance generation module in fig. 3.
And step S60, scoring the target utterance to obtain a matching score.
It is known that the determined target utterance is not necessarily the best reply utterance; considering individual differences, each person has a different degree of acceptance of the same utterance. Therefore, the target utterance needs to be scored after it is determined. The basis of the scoring may be user feedback, and if, according to the user's feedback, the score of the target utterance (i.e., the matching score) is smaller than a certain value, prompt information is output so that the program developers can update the utterances in the preset dialogue database.
And step S70, if the matching score is larger than a preset threshold value, outputting the target utterance.
Therefore, after the voice emotional state is generated, the speech uttered by the user can be replied to through the device with the voice interaction function, so that the physical and mental health of the user is comprehensively protected. For example, if it is determined, according to the voice emotional state, that the current emotional state of the user is angry, the physical and mental health monitoring program will query the preset dialogue database and obtain the utterance that best matches the current dialog state (namely the target utterance in this embodiment). For example, a piece of text information converted from voice information is "Today was a failure!"; by acquiring the semantic information corresponding to the text information and the audio features of the voice information, the voice emotional state corresponding to the voice information can be determined to be sad, and the target utterance queried by the physical and mental health monitoring program from the preset dialogue database is "Ten years of cultivation earns a shared boat crossing; a hundred years earns a shared pillow". The physical and mental health monitoring program scores the target utterance to obtain a matching score, and outputs the target utterance when the matching score is larger than the preset threshold. The scoring of the target utterance may be combined with user feedback, that is, the score of the target utterance may be adjusted according to the user's feedback.
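A minimal sketch of the target-utterance lookup and threshold check described above follows; the example database entries, scores and threshold value are illustrative assumptions, not contents of the patent.

```python
# Preset dialogue database: dialog state -> candidate reply utterances with matching scores.
DIALOGUE_DB = {
    ("angry", "soothe"): [("Don't be angry, anger hurts the body.", 0.9)],
    ("sad", "soothe"):   [("Ten years of cultivation earns a shared boat crossing.", 0.4)],
}
MATCH_THRESHOLD = 0.6  # assumed preset threshold

def reply(dialog_state):
    """Pick the best-matching utterance; output it only if its score exceeds the threshold."""
    candidates = DIALOGUE_DB.get(dialog_state, [])
    if not candidates:
        return None
    utterance, score = max(candidates, key=lambda c: c[1])
    return utterance if score > MATCH_THRESHOLD else None  # else fall back to the response model

print(reply(("angry", "soothe")))   # matched target utterance is output
print(reply(("sad", "soothe")))     # None -> generate a response via the preset voice response model
```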
Specifically, steps subsequent to step S60 include:
and c1, if the matching score is less than or equal to a preset threshold value, inputting the current conversation state into a preset voice response model to generate response voice.
And c2, outputting the response voice.
It should be noted that, if the score of the target utterance is lower than or equal to the preset threshold, it indicates that there may not be a matched reply in the preset dialogue database. In this case, the physical and mental health monitoring program can also input the current dialog state into the preset voice response model to generate a response voice. It is known that the preset voice response model is a machine learning model, and the response voice can easily be obtained with existing artificial intelligence technology; however, such a response also has a certain disadvantage, namely that the reply tends to be neutral and may not help adjust the physical and mental health of the user. The target utterance or the response voice may be output by being played through the smart television.
In this embodiment, after the voice emotional state is generated, the physical and mental health monitoring device can also respond to the physical and mental health condition of the user reflected by the voice emotional state, so that the physical and mental health of the user is comprehensively protected.
In addition, an embodiment of the present invention further provides a physical and mental health monitoring device, where the physical and mental health monitoring device includes:
the text emotional state generating module is used for receiving voice information, converting the voice information into text information and generating a text emotional state according to the text information;
the extraction module is used for extracting the audio features of the voice information and generating an audio emotional state according to the audio features;
the fusion module is used for fusing the text emotion state and the audio emotion state to obtain a voice emotion state;
and the generating module is used for acquiring biological indexes, combining the voice emotion state with the biological indexes and generating physical and mental health monitoring information.
Optionally, the text emotional state generating module includes:
the word segmentation unit is used for carrying out word segmentation processing on the text information to obtain a target word;
the emotion classification result acquisition unit is used for acquiring emotion classification results corresponding to the preset voice classification model and a text database associated with each emotion classification result;
the existence proportion calculation unit is used for calculating the existence proportion of each target vocabulary in each text database and the sum of the existence proportions of all the target vocabularies in each text database;
and the selecting unit is used for taking the emotion classification result associated with the text database with the largest existence ratio as the text emotion state.
Optionally, the text emotional state generating module includes:
the vectorization processing unit is used for vectorizing the text information to obtain a text vector;
and the first input unit is used for inputting the text vector to a preset text emotion sensor to obtain a text emotion state corresponding to the text information.
Optionally, the first input unit includes:
the second input unit is used for inputting the text vector to the coding module of the closed recurrence model to obtain a text coding vector;
the first decoding unit is used for decoding the text coding vector through a decoding module of the closed recurrence model to obtain emotional characteristics;
and the third input unit is used for inputting the emotion characteristics into the logistic regression model to perform emotion classification processing, and generating a text emotion state.
Optionally, the extraction module includes:
the extraction unit is used for extracting the audio features of the voice information and carrying out vectorization processing on the audio features to obtain audio vectors;
the fourth input unit is used for inputting the audio vector to a coding module of a preset sequence coding and decoding model to obtain an audio coding vector;
and the second decoding unit is used for decoding the audio coding vector through a decoding module of the preset sequence coding and decoding model to generate an audio emotional state.
Optionally, the fusion module includes:
the input unit is used for inputting the text emotion state and the audio emotion state into a preset classification model and sequentially passing through a full connection layer and a logistic regression layer in the preset classification model;
the query unit is used for querying the classification result corresponding to the preset classification model and acquiring a text probability value and an audio probability value associated with each classification result;
and the calculating unit is used for calculating the numerical sum of the text probability value and the audio probability value associated with each classification result, and taking the classification result with the maximum numerical sum as the speech emotion state.
Optionally, the physical and mental health monitoring device further includes:
the searching unit is used for determining a target conversation state according to the text information and the voice emotion state, and searching a preset conversation database for the target utterance with the highest matching degree with the target conversation state;
the scoring unit is used for scoring the target utterance to obtain a matching score;
and the first output unit is used for outputting the target utterance if the matching score is larger than a preset threshold.
Optionally, the physical and mental health monitoring device further includes:
a fifth input unit, configured to input the current dialog state to a preset voice response model if the matching score is less than or equal to a preset threshold, and generate a response voice;
and the second output unit is used for outputting the response voice.
The method executed by each program module can refer to each embodiment of the method of the present invention, and is not described herein again.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a tablet computer, etc.) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A physical and mental health monitoring method is characterized by comprising the following steps:
receiving voice information, converting the voice information into text information, and generating a text emotional state according to the text information;
extracting audio features of the voice information, and generating an audio emotional state according to the audio features;
fusing the text emotion state and the audio emotion state to obtain a voice emotion state;
and acquiring biological indexes, and combining the voice emotion state with the biological indexes to generate physical and mental health monitoring information.
2. The method for physical and mental health monitoring of claim 1, wherein the step of generating a textual emotional state based on the textual information comprises:
performing word segmentation processing on the text information to obtain a target word;
acquiring emotion classification results corresponding to a preset voice classification model and a text database associated with each emotion classification result;
calculating the existence proportion of each target vocabulary in each text database and the sum of the existence proportions of all the target vocabularies in each text database;
and taking the emotion classification result associated with the text database with the largest sum of the existing ratios as the text emotion state.
3. The method for physical and mental health monitoring of claim 1, wherein the step of generating a textual emotional state based on the textual information comprises:
vectorizing the text information to obtain a text vector;
and inputting the text vector to a preset text emotion sensor to obtain a text emotion state corresponding to the text information.
4. The method for monitoring physical and mental health of claim 3, wherein the predetermined textual emotion sensor comprises: closed recurrence models and logistic regression models;
the step of inputting the text vector to a preset text emotion sensor to obtain a text emotion state corresponding to the text information comprises:
inputting the text vector into a coding module of the closed recurrence model to obtain a text coding vector;
decoding the text coding vector through a decoding module of the closed recurrence model to obtain emotional characteristics;
and inputting the emotion characteristics into the logistic regression model to perform emotion classification processing, and generating a text emotion state.
5. The physical and mental health monitoring method of claim 1, wherein the step of extracting audio features of the voice information and generating an audio emotional state according to the audio features comprises:
extracting the audio features of the voice information, and performing vectorization processing on the audio features to obtain audio vectors;
inputting the audio vector to a coding module of a preset sequence coding and decoding model to obtain an audio coding vector;
and decoding the audio coding vector through a decoding module of the preset sequence coding and decoding model to generate an audio emotional state.
6. The physical and mental health monitoring method of claim 1, wherein the step of fusing the text emotion state and the audio emotion state to obtain a speech emotion state comprises:
inputting the text emotion state and the audio emotion state into a preset classification model, and passing them sequentially through a fully connected layer and a logistic regression layer of the preset classification model;
querying the classification results corresponding to the preset classification model, and acquiring the text probability value and the audio probability value associated with each classification result;
and calculating, for each classification result, the sum of its associated text probability value and audio probability value, and taking the classification result with the largest sum as the speech emotion state.
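A toy version of the fusion rule in claim 6, covering only the final lookup-and-sum step: the fully connected and logistic regression layers that would produce the per-class probabilities are assumed to have already run, and the class labels and values below are invented.

def fuse_emotions(text_probs, audio_probs):
    # sum the text and audio probability for every classification result,
    # then keep the result with the largest sum
    labels = set(text_probs) | set(audio_probs)
    summed = {k: text_probs.get(k, 0.0) + audio_probs.get(k, 0.0) for k in labels}
    return max(summed, key=summed.get)

text_probs  = {"calm": 0.20, "anxious": 0.50, "sad": 0.30}
audio_probs = {"calm": 0.40, "anxious": 0.45, "sad": 0.15}
print(fuse_emotions(text_probs, audio_probs))   # "anxious" (0.95 beats 0.60 and 0.45)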
7. The physical and mental health monitoring method of claim 1, wherein after the step of fusing the text emotion state and the audio emotion state to obtain a speech emotion state, the method further comprises:
determining a target conversation state according to the text information and the speech emotion state, and searching a preset conversation database for the target utterance with the highest degree of matching to the target conversation state;
scoring the target utterance to obtain a matching score;
and if the matching score is greater than a preset threshold, outputting the target utterance.
8. The physical and mental health monitoring method of claim 7, wherein after the step of scoring the target utterance to obtain a matching score, the method further comprises:
if the matching score is less than or equal to the preset threshold, inputting the current conversation state into a preset voice response model to generate a response voice;
and outputting the response voice.
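A combined sketch of the response logic in claims 7 and 8: retrieve the best-matching utterance from a preset conversation database and output it when its score clears the threshold, otherwise fall back to a generated response. The database contents, scoring function, and threshold value are illustrative assumptions.

MATCH_THRESHOLD = 0.6   # assumed value for the preset threshold

CONVERSATION_DB = {
    ("sleep", "anxious"): "It sounds like you are not sleeping well. Try winding down earlier tonight.",
    ("pain", "sad"): "I am sorry you are in pain. Shall I note this for your doctor?",
}

def match_score(dialogue_state, key):
    topic, emotion = key
    return ((0.5 if topic in dialogue_state["keywords"] else 0.0)
            + (0.5 if emotion == dialogue_state["emotion"] else 0.0))

def respond(dialogue_state, generate_response):
    best = max(CONVERSATION_DB, key=lambda k: match_score(dialogue_state, k))
    if match_score(dialogue_state, best) > MATCH_THRESHOLD:
        return CONVERSATION_DB[best]            # claim 7: output the matched target utterance
    return generate_response(dialogue_state)    # claim 8: fall back to the voice response model

state = {"keywords": ["sleep", "tired"], "emotion": "anxious"}
print(respond(state, lambda s: "Tell me more about how you are feeling."))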
9. A physical and mental health monitoring apparatus, comprising: a memory, a processor, and a physical and mental health monitoring program stored on the memory and executable on the processor, wherein the physical and mental health monitoring program, when executed by the processor, implements the steps of the physical and mental health monitoring method according to any one of claims 1 to 8.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a physical and mental health monitoring program which, when executed by a processor, implements the steps of the physical and mental health monitoring method according to any one of claims 1 to 8.
CN202010925877.7A 2020-09-03 2020-09-03 Physical and mental health monitoring method, equipment and computer readable storage medium Active CN112002329B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010925877.7A CN112002329B (en) 2020-09-03 2020-09-03 Physical and mental health monitoring method, equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010925877.7A CN112002329B (en) 2020-09-03 2020-09-03 Physical and mental health monitoring method, equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN112002329A true CN112002329A (en) 2020-11-27
CN112002329B CN112002329B (en) 2024-04-02

Family

ID=73468790

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010925877.7A Active CN112002329B (en) 2020-09-03 2020-09-03 Physical and mental health monitoring method, equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN112002329B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20120117041A (en) * 2011-04-14 2012-10-24 한국과학기술원 Method and system of synthesizing emotional speech based on personal prosody model and recording medium
US20130054244A1 (en) * 2010-08-31 2013-02-28 International Business Machines Corporation Method and system for achieving emotional text to speech
CN107714056A (en) * 2017-09-06 2018-02-23 上海斐讯数据通信技术有限公司 Wearable device for intelligent mood analysis and method for intelligent mood analysis
CN110085221A (en) * 2018-01-26 2019-08-02 上海智臻智能网络科技股份有限公司 Speech emotional exchange method, computer equipment and computer readable storage medium
CN110555204A (en) * 2018-05-31 2019-12-10 北京京东尚科信息技术有限公司 emotion judgment method and device
CN111223498A (en) * 2020-01-10 2020-06-02 平安科技(深圳)有限公司 Intelligent emotion recognition method and device and computer readable storage medium
WO2020135194A1 (en) * 2018-12-26 2020-07-02 深圳Tcl新技术有限公司 Emotion engine technology-based voice interaction method, smart terminal, and storage medium

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130054244A1 (en) * 2010-08-31 2013-02-28 International Business Machines Corporation Method and system for achieving emotional text to speech
KR20120117041A (en) * 2011-04-14 2012-10-24 한국과학기술원 Method and system of synthesizing emotional speech based on personal prosody model and recording medium
CN107714056A (en) * 2017-09-06 2018-02-23 上海斐讯数据通信技术有限公司 Wearable device for intelligent mood analysis and method for intelligent mood analysis
CN110085221A (en) * 2018-01-26 2019-08-02 上海智臻智能网络科技股份有限公司 Speech emotional exchange method, computer equipment and computer readable storage medium
CN110555204A (en) * 2018-05-31 2019-12-10 北京京东尚科信息技术有限公司 emotion judgment method and device
WO2020135194A1 (en) * 2018-12-26 2020-07-02 深圳Tcl新技术有限公司 Emotion engine technology-based voice interaction method, smart terminal, and storage medium
CN111368609A (en) * 2018-12-26 2020-07-03 深圳Tcl新技术有限公司 Voice interaction method based on emotion engine technology, intelligent terminal and storage medium
CN111223498A (en) * 2020-01-10 2020-06-02 平安科技(深圳)有限公司 Intelligent emotion recognition method and device and computer readable storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Kim, Tae-Ho; Cho, Sungjae; Choi, Shinkook; Park, Sejik; Lee, Soo-Young: "Emotional Voice Conversion Using Multitask Learning with Text-To-Speech", ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 9 August 2020 (2020-08-09), pages 7774 - 7778 *
Du, Man; Xu, Xueke; Du, Hui; Wu, Dayong; Liu, Yue; Cheng, Xueqi: "Emotion Word Vector Learning for Emotion Classification", Journal of Shandong University (Natural Science), no. 07, 14 June 2017 (2017-06-14), pages 56 - 62 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112749895A (en) * 2021-01-12 2021-05-04 深圳前海微众银行股份有限公司 Guest group index management method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN112002329B (en) 2024-04-02

Similar Documents

Publication Publication Date Title
US10977452B2 (en) Multi-lingual virtual personal assistant
US20210081056A1 (en) Vpa with integrated object recognition and facial expression recognition
KR102191425B1 (en) Apparatus and method for learning foreign language based on interactive character
EP3824462B1 (en) Electronic apparatus for processing user utterance and controlling method thereof
US11264008B2 (en) Method and electronic device for translating speech signal
US10468052B2 (en) Method and device for providing information
JP2017058673A (en) Dialog processing apparatus and method, and intelligent dialog processing system
KR102216768B1 (en) System and Method for Analyzing Emotion in Text using Psychological Counseling data
CN111967224A (en) Method and device for processing dialog text, electronic equipment and storage medium
CN110910903B (en) Speech emotion recognition method, device, equipment and computer readable storage medium
US10783329B2 (en) Method, device and computer readable storage medium for presenting emotion
KR101221188B1 (en) Assistive robot with emotional speech synthesizing function, method of synthesizing emotional speech for the assistive robot, and recording medium
US11568853B2 (en) Voice recognition method using artificial intelligence and apparatus thereof
CN112632242A (en) Intelligent conversation method and device and electronic equipment
CN112002329B (en) Physical and mental health monitoring method, equipment and computer readable storage medium
KR102297480B1 (en) System and method for structured-paraphrasing the unstructured query or request sentence
CN107943299B (en) Emotion presenting method and device, computer equipment and computer readable storage medium
CN110781329A (en) Image searching method and device, terminal equipment and storage medium
US20220100959A1 (en) Conversation support device, conversation support system, conversation support method, and storage medium
CN109359181B (en) Negative emotion reason identification method, device and computer-readable storage medium
US11922127B2 (en) Method for outputting text in artificial intelligence virtual assistant service and electronic device for supporting the same
JP2022032691A (en) Information processing device, program and information processing method
Abbas Improving Arabic Sign Language to support communication between vehicle drivers and passengers from deaf people
Zaid et al. Jewelry Shop Conversational Chatbot
KR20230000044A (en) Apparatus for automatically providing response and a method for controlling the same

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant