WO2020135194A1 - Emotion engine technology-based voice interaction method, smart terminal, and storage medium - Google Patents

Emotion engine technology-based voice interaction method, smart terminal, and storage medium

Info

Publication number
WO2020135194A1
Authority
WO
WIPO (PCT)
Prior art keywords
information
emotion
user
voice interaction
voice
Prior art date
Application number
PCT/CN2019/126443
Other languages
French (fr)
Chinese (zh)
Inventor
温馨
Original Assignee
深圳Tcl新技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳Tcl新技术有限公司
Publication of WO2020135194A1 publication Critical patent/WO2020135194A1/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174 - Facial expression recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/332 - Query formulation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 - Feature extraction; Face representation

Definitions

  • the present disclosure relates to the field of Internet interaction technology, and in particular, to a voice interaction method based on emotion engine technology, an intelligent terminal, and a storage medium.
  • freed from the constraints of auxiliary interactive devices such as the keyboard and mouse, the interaction method became more convenient, which also made the reform of mobile devices possible, so that the technology can exist in everyone's pocket.
  • artificial intelligence technology provides a more natural way of interaction, natural language conversation: users can interact with the machine and obtain information through natural language, and with conversational interaction as the core, the combination of voice technology, image technology, face recognition technology, and enhanced display technology enables the technology to exist in ubiquitous devices.
  • Conversational artificial intelligence is a major application of AI technology; it mainly refers to the use of speech recognition, semantic understanding, multi-turn dialogue, natural language understanding, and other technologies to allow users to communicate with robots in natural language.
  • however, the voice interaction between users and robots currently remains mainly a passive, task-style dialogue in which a fixed dialogue management mechanism is used to ask back or answer the user.
  • although this method can meet the user's basic dialogue needs, it cannot respond more intelligently based on the user's current emotions, which makes it inconvenient to use.
  • the technical problem to be solved by the present disclosure is, in view of the above-mentioned defects of the prior art, to provide a voice interaction method, an intelligent terminal, and a storage medium based on emotion engine technology, aiming to solve the problem in the prior art that the dialogue between the user and an intelligent robot follows a fixed response mode, so that the intelligent robot cannot make more intelligent responses based on the user's current emotions.
  • a voice interaction method based on emotion engine technology wherein the method includes:
  • the emotion of the user is calculated through the emotion recognition model, and an anthropomorphic voice interaction strategy is generated based on the emotion of the user, and the voice interaction information is output.
  • the voice interaction method based on the emotion engine technology wherein the step of obtaining voice information input by the user and obtaining face image information of the user specifically includes:
  • the voice interaction method based on the emotion engine technology, wherein the step of extracting emotion recognition features from the voice information and face image information, and inputting the extracted emotion recognition features into a preset emotion recognition model, specifically includes:
  • the other voice signal in the obtained voice information is used to extract the user's audio emotional state through a preset voice emotion sensor;
  • the obtained facial image information is used to extract the user's expression state through a preset expression recognition system
  • the text emotion state, audio emotion state and expression state are input to a preset emotion recognition model.
  • the voice interaction method based on the emotion engine technology, wherein the step of extracting the user's text emotional state from the text information specifically includes:
  • the sentence information and the user's personal information are input into a preset emotional state recognition model to identify the user's text emotional state.
  • the voice interaction method based on the emotion engine technology wherein the step of inputting the sentence information and the user's personal information into a preset emotion recognition model to recognize the user's textual emotion state specifically includes:
  • if the first confidence score is greater than the threshold, the first emotional state is used as the user's text emotional state; if the first confidence score is less than the threshold, the first emotional state and the second emotional state are dynamically sorted, and the user's text emotional state is determined according to the result of the dynamic sorting.
  • the sentence information includes: Chinese word segmentation information of the sentence, part-of-speech tagging information of the sentence after word segmentation processing, sentence sentence information of the sentence, and sentence2vector information of the sentence.
  • the parameters involved in the dynamic ranking include: text length, extracted keywords, text input by the user, and confidence scores of the first/second emotional states.
  • the voice interaction method based on the emotion engine technology, wherein the step of calculating the user's emotion through the emotion recognition model, generating an anthropomorphic voice interaction strategy based on the user's emotion, and outputting the voice interaction information specifically includes:
  • the emotion recognition model performs weighted calculation on the input text emotion state, audio emotion state and expression state to obtain the user's emotion
  • a dialogue generation model is used to generate voice interactive information with emotions, and output voice interactive information.
  • the emotion database includes a variety of emotions and emotion feature information corresponding to each emotion.
  • before the emotion recognition model performs the weighted calculation on the input text emotion state, audio emotion state, and expression state, the method includes:
  • different weights are set in advance for the text emotional state, audio emotional state, and expression state.
  • the voice interaction method based on the emotion engine technology, wherein the step of generating the voice interaction information with emotion through the dialogue generation model specifically includes:
  • the dialogue generation model receives the question information input by the user, and records the user's historical dialogue information, position change information, and mood change information;
  • the voice interaction information is also used to update the dialogue generation model.
  • the dialogue generation module is implemented by a three-layer recurrent neural network RNN architecture, and is based on the backpropagation algorithm.
  • the voice interaction method based on the emotion engine technology, wherein the step of calculating the user's emotion through the emotion recognition model, generating an anthropomorphic voice interaction strategy based on the user's emotion, and outputting it further includes:
  • An intelligent terminal, comprising: a processor and a storage medium communicatively connected to the processor, the storage medium being adapted to store a plurality of instructions; the processor being adapted to call the instructions in the storage medium to execute the steps of implementing the voice interaction method based on the emotion engine technology described in any one of the above.
  • a storage medium on which a plurality of instructions are stored, wherein the instructions are suitable for being loaded and executed by a processor to perform the steps of implementing any of the voice interaction methods based on the emotion engine technology described above.
  • the present disclosure analyzes the user's emotions and adds emotion to voice interaction, thereby shaping an emotional, intelligent voice interaction mode, enabling more interesting voice interaction between the user and the smart terminal, getting rid of the mechanized and passive communication mode of traditional voice interaction systems, and providing convenience for users.
  • FIG. 1 is a flowchart of a preferred embodiment of a voice interaction method based on emotion engine technology of the present disclosure.
  • FIG. 2 is a general control flowchart of the voice interaction method based on the emotion engine technology of the present disclosure.
  • FIG. 3 is a logic flow diagram of an emotion recognition system of a voice interaction method based on emotion engine technology of the present disclosure.
  • FIG. 4 is a functional schematic diagram of the smart terminal of the present disclosure.
  • the voice interaction method based on the emotion engine technology provided by the present disclosure can be applied to a terminal.
  • the terminal may be, but not limited to, various personal computers, notebook computers, mobile phones, tablet computers, in-vehicle computers, and portable wearable devices.
  • the terminal of the present disclosure uses a multi-core processor.
  • the processor of the terminal may be at least one of a central processing unit (Central Processing Unit, CPU), a graphics processing unit (Graphics Processing Unit, GPU), a video processing unit (Video Processing Unit, VPU), and the like.
  • the present disclosure provides a voice interaction method based on emotion engine technology. Specifically, as shown in FIG. 1, the method includes:
  • Step S100 Acquire voice information input by the user, and acquire face image information of the user.
  • Step S200 Extract emotion recognition features from the voice information and face image information, and input the extracted emotion recognition features to a preset emotion recognition model.
  • Step S300 Calculate the user's emotion through the emotion recognition model, and generate an anthropomorphic voice interaction strategy based on the user's emotion, and output it.
  • this embodiment provides a voice interaction method based on the emotion engine technology, which mainly analyzes the user's emotions and adds emotion to the voice interaction, thereby shaping an emotional, intelligent voice interaction method, getting rid of the mechanized and passive communication mode of traditional voice interaction systems, and providing convenience for users.
  • whether the user performs voice interaction is monitored in real time.
  • when it is detected that the user is performing voice interaction, the voice information input by the user is obtained through a preset remote device or remote-control pickup device; considering that the user's facial expressions also change in different emotional states, and that these changes likewise represent the user's emotional state, this embodiment presets a camera device, and the user's face image information is acquired in real time through the preset camera device while the user performs voice interaction. Combining the voice information with the user's face image information makes it possible to determine the user's current mood more accurately.
  • the acquired voice information includes the language and text content of the user's speech as well as the tone and speaking-rate information; for example, a happy expression in the user's language indicates that the user may currently be in a relatively happy state, while faster speech and a louder voice indicate that the user is in a more excited state.
  • certain words in the user's voice information can also indicate the user's current emotional state; for example, if the user's voice information contains the words "very annoying", this shows that the user is rather anxious. Therefore, in order to better analyze the voice information, as shown in FIG. 2, the obtained voice information is divided into two voice signals: one voice signal is converted into text information through a preset ASR (Automatic Speech Recognition) module, and the user's text emotional state is extracted from the text information; the other voice signal is passed through a preset voice emotion sensor to extract the user's audio emotional state. Since the user's facial expressions change under different emotional states, the user's expression state can be extracted from the obtained face image information through a preset expression recognition system. Finally, the extracted text emotional state, audio emotional state, and expression state are input to a preset emotion recognition module for emotion recognition, which can recognize the user's emotion more accurately.
  • extracting the user's text emotional state from the text information specifically includes the following steps:
  • Step 301 Extract sentence information according to user input.
  • Step 302 Acquire user personal information from the memory map.
  • Step 303 Input sentence information into the rule model, extract keywords, and obtain the user's first emotional state and first confidence score according to the keywords.
  • Step 304 Input sentence information and user information into the deep learning model to obtain the user's second emotional state and second confidence score.
  • Step 305 Determine whether the first confidence score is greater than a preset threshold. If not, perform step 307. If yes, perform step 306.
  • Step 306 Use the first emotional state as the user's text emotional state.
  • Step 307 Dynamically sort the first emotional state and the second emotional state, and make a decision according to the result of the dynamic sorting.
  • the sentence information in the above steps includes: Chinese word segmentation information of the sentence, part-of-speech tagging information after word segmentation, sentence pattern information of the sentence, sentence2vector information of the sentence, etc.; the user's personal information includes: name, gender, birthday, age, constellation, the user's psychological state and physiological state, etc.
  • the parameters involved in the dynamic sorting include: text length, extracted keywords, the text input by the user, the confidence scores of the first/second emotional states, etc. When the above first confidence score is less than the preset threshold, in this embodiment these parameters are used as inputs to the dynamic sorting model, the sorting result is influenced by assigning them different weights, and the user's text emotional state is finally determined according to the sorting result.
  • the parameter selection and weight adjustment of dynamic sequencing will be adjusted according to the performance of the overall model.
  • the method of extracting sentence information includes existing Chinese word segmentation information and part-of-speech tagging information technology, which will not be repeated here.
  • the emotion data of a plurality of users are collected and counted in advance to generate an emotion database.
  • in one implementation, the emotion database includes 22 human emotions, such as joy, anger, sorrow, and happiness, and also includes the emotional feature information corresponding to each emotion; for example, the feature information for the happy emotion in the database includes corresponding expression image data (such as raised mouth corners), corresponding high-frequency text (such as the words "happy" and "glad"), and corresponding tone and intonation information (such as a cheerful tone). Therefore, when the happy emotion is found in the emotion database, the corresponding emotional feature information can be obtained; similarly, the corresponding emotional state can also be found in the emotion database through the emotional feature information.
  • this embodiment inputs the acquired text emotional state, audio emotional state, and expression state into the emotion recognition model, performs a weighted calculation on them through the emotion recognition model, and compares and matches the calculation result with the preset emotion database to obtain the user's emotion.
  • the emotion recognition model is formed by inputting various collected text emotional states, audio emotional states, and expression states into the network model for deep learning and training in advance.
  • different weights may be set in advance for the text emotional state, audio emotional state, and expression state.
  • the text emotional state weight is set to 20%
  • the audio emotional state weight is set to 50%
  • the expression state weight is set to 30%; calculating according to the set weights yields the emotion closest to the user's current emotional state.
  • then, based on the obtained user emotion, it is compared and matched in the emotion database to obtain the emotional feature information corresponding to that emotion.
  • the emotional feature information is used for emotional intention decision-making and user portrait filling, so as to generate voice interaction information with emotions.
  • for example, when the user's emotion is calculated to be happy, the emotional feature information corresponding to happiness includes: frequently appearing words such as "happy" and "glad", expression images with raised mouth corners, and a cheerful intonation.
  • based on this emotional feature information, the user portrait and the user's current specific emotion can be determined, and the smart terminal can make a corresponding emotional intention decision, that is, the emotional feedback that the smart terminal needs to make according to the user's emotion;
  • it then produces response information carrying the corresponding emotion, that is, it likewise outputs response information with a pleasant emotion, so as to achieve a more humanized voice interaction.
  • when interacting through voice information, a dialogue generation module is used to generate the response.
  • the dialogue generation module receives the question information input by the user and records the user's historical dialogue information, position change information, and mood change information, then analyzes the user's personal information and activity status based on this information to obtain user portrait information; voice interaction information is generated based on the question information and the user portrait information (the user portrait information at this point has been filled in based on the emotional feature information corresponding to the user's emotion).
  • in this way, not only can emotional response information be produced according to the user's emotional state, but different voice interaction strategies can also be produced in real time according to changes in the user's emotions, and the emotions carried in the voice interaction strategy will also change in real time.
  • the dialogue generation module in this embodiment is implemented with a three-layer recurrent neural network (RNN) architecture and is based on the backpropagation (BP) algorithm.
  • the method provided in this embodiment further includes: feeding the voice interaction information back into the dialogue generation model, where a mixture of rules, machine learning, and deep learning techniques can be used to learn from and train on the voice interaction information, thereby updating the dialogue generation model so that it can better generate emotional voice response information.
  • the corresponding scene structured data is set according to the character characteristics corresponding to the different scenes.
  • the user's emotions and the obtained emotional intention decision results (that is, the emotional feedback made by the smart terminal according to the user's emotions) are used as the first input of the network model;
  • the custom scene structured data is used as the second input of the network model;
  • through the learning and training of the network model, an emotion engine model that outputs anthropomorphic voice interaction strategies in specific scenarios is obtained.
  • the emotion engine model enables the intelligent terminal to automatically output an anthropomorphic voice interaction strategy according to the specific scenario, achieving more intelligent and humanized voice interaction.
  • the present disclosure also provides an intelligent terminal, and a functional block diagram thereof may be shown in FIG. 4.
  • the intelligent terminal includes a processor, a memory, a network interface, a display screen, and a temperature sensor connected through a system bus.
  • the processor of the intelligent terminal is used to provide computing and control capabilities.
  • the memory of the intelligent terminal includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system and computer programs.
  • the internal memory provides an environment for the operating system and computer programs in the non-volatile storage medium.
  • the network interface of the intelligent terminal is used to communicate with external terminals through a network connection.
  • the computer program is executed by the processor to implement a voice interaction method based on the emotion engine technology.
  • the display screen of the intelligent terminal may be a liquid crystal display screen or an electronic ink display screen.
  • the temperature sensor of the intelligent terminal is set in the interior of the intelligent terminal in advance to detect the current operating temperature of the internal device.
  • FIG. 4 is only a block diagram of a part of the structure related to the disclosed solution, and does not constitute a limitation on the smart terminal to which the disclosed solution is applied.
  • the specific smart terminal may include more or fewer components than shown in the figures, combine certain components, or have a different arrangement of components.
  • an intelligent terminal which includes a memory and a processor, and a computer program is stored in the memory.
  • the processor executes the computer program, at least the following steps may be implemented:
  • the emotion of the user is calculated through the emotion recognition model, and an anthropomorphic voice interaction strategy is generated based on the emotion of the user, and the voice interaction information is output.
  • when the processor executes the computer program, the following can also be implemented: starting a preset monitoring program to monitor whether the user performs voice interaction; when it is detected that the user performs voice interaction, starting the preset remote device or remote-control pickup device to obtain the voice information input by the user, and starting a preset camera to obtain the user's face image information.
  • one voice signal of the acquired voice information is converted into text information through the preset ASR voice recognition module, and the user's text emotional state is extracted from the text information; the other voice signal is passed through the preset voice emotion sensor to extract the user's audio emotional state; the user's expression state is extracted from the obtained facial image information through the preset expression recognition system; and the extracted text emotional state, audio emotional state, and expression state are input to a preset emotion recognition module for emotion recognition.
  • the processor when the processor executes the computer program, it can also be realized: after acquiring text information of the user's voice interaction, extract sentence information according to the voice information input by the user, and obtain the user's personal information from the memory map; Information input rule model, extract keywords, and get the user's first emotional state and first confidence score according to the keywords; input sentence information and user information into the deep learning model to get the user's second emotional state and second confidence score ; Determine the size of the first confidence score and the preset threshold, when the first confidence score is greater than the preset threshold, the first emotional state is used as the user's emotional state; when the first confidence score is less than the preset threshold, will The first emotional state and the second emotional state are dynamically sorted, and decisions are made based on the results of the dynamic sorting.
  • when the processor executes the computer program, the following can also be implemented: collecting the emotion data of multiple users in advance to generate an emotion database; inputting the acquired text emotional state, audio emotional state, and expression state into the emotion recognition model and performing a weighted calculation to obtain the user's emotion; comparing and matching the user's emotion with the preset emotion database to obtain the corresponding emotional feature information; performing emotional intention decision-making and user portrait filling based on the obtained emotional feature information; and generating emotional voice interaction information through the dialogue generation model based on the obtained emotional intention decision result and the user portrait information.
  • the dialogue generation model receives the question information input by the user, records the user's historical dialogue information, position change information, and mood change information, analyzes the user's personal information and activity status, and obtains the user's portrait information; voice interaction information is generated according to the question information and the user portrait information, and this voice interaction information can also be used to update the dialogue generation model.
  • in this way, not only can emotional response information be produced according to the user's emotional state, but different voice interaction strategies can also be produced in real time according to changes in the user's emotions, and the emotions carried in the voice interaction strategy will also change in real time.
  • when the processor executes the computer program, the following can also be implemented: using the user's emotion and the obtained emotional intention decision result as the first input of the network model; using the custom scene structured data as the second input of the network model; and obtaining, through the learning and training of the network model, an emotion engine model that outputs an anthropomorphic voice interaction strategy in a specific scene.
  • the emotion engine model enables the intelligent terminal to automatically output an anthropomorphic voice interaction strategy according to the specific scene, achieving more intelligent and user-friendly voice interaction.
  • Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory can include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.
  • the present disclosure provides a voice interaction method based on the emotion engine technology.
  • the method includes: obtaining voice information input by a user and obtaining face image information of the user; extracting emotion recognition features from the voice information and the face image information, and inputting the extracted emotion recognition features into a preset emotion recognition model; and calculating the user's emotion through the emotion recognition model, generating an anthropomorphic voice interaction strategy based on the user's emotion, and outputting voice interaction information.
  • the present disclosure analyzes the user's emotions and adds emotions to the voice interaction, thereby shaping an emotional intelligent voice interaction mode, getting rid of the traditional voice interaction system's mechanized and passive communication mode, and providing convenience for users.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

Disclosed are an emotion engine technology-based voice interaction method, a storage medium, and a smart terminal. The method comprises: acquiring voice information input by a user and acquiring face image information of the user; extracting emotion recognition features from the voice information and face image information and inputting the extracted emotion recognition features into a preset emotion recognition model; calculating the emotion of the user by means of the emotion recognition model, generating a personified voice interaction strategy on the basis of the emotion of the user, and outputting voice interaction information. In the present disclosure, a user's emotion is analyzed, and the emotion is added into voice interaction, thereby developing an emotional intelligent voice interaction mode, getting rid of inflexible and passive communication modes of traditional voice interaction systems, and providing convenience for users.

Description

Voice interaction method, intelligent terminal and storage medium based on emotion engine technology
Priority
This PCT patent application claims priority to the Chinese patent application with application number 201811605103.5 and filing date December 26, 2018, and incorporates the technical solutions of that application.
Technical field
The present disclosure relates to the field of Internet interaction technology, and in particular, to a voice interaction method based on emotion engine technology, an intelligent terminal, and a storage medium.
Background
With the continuous innovation of human-computer interaction technology, the ways people interact are constantly changing, from the mouse, keyboard and remote control to the touch screen, and interaction is becoming ever simpler. In the era of the first computing platform, humans could interact with machines only through the keyboard and mouse; the technology of that period existed only in the computer room, and operation was very cumbersome. In the era of the second platform, computers added relatively friendly interactive interface designs; people no longer needed to enter commands at the DOS prompt and could interact with the computer through simple interface operations, so the interactive experience was greatly improved. In the era of the third platform, touch-screen technology rose, and people could complete interactive operations simply by moving their fingers; freed from the constraints of auxiliary interactive devices such as the keyboard and mouse, interaction became more convenient, which also made the reform of mobile devices possible, so that the technology can exist in everyone's pocket. The rise of artificial intelligence technology has made an even more natural way of interaction possible: natural language conversation. Users can interact with the machine and obtain information through natural language, and with conversational interaction as the core, the combination of voice technology, image technology, face recognition technology and enhanced display technology enables the technology to exist in ubiquitous devices.
Conversational artificial intelligence is a major application of AI technology; it mainly refers to the use of speech recognition, semantic understanding, multi-turn dialogue, natural language understanding, and other technologies to allow users to communicate with robots in natural language. However, voice interaction between users and robots currently remains largely a passive, task-style dialogue in which a fixed dialogue management mechanism is used to ask back or answer the user. Although this approach can meet the user's basic dialogue needs, it cannot respond more intelligently based on the user's current emotions, which makes it inconvenient to use.
Therefore, the existing technology still needs to be improved and developed.
Summary of the invention
The technical problem to be solved by the present disclosure is, in view of the above-mentioned defects of the prior art, to provide a voice interaction method, an intelligent terminal and a storage medium based on emotion engine technology, aiming to solve the problem in the prior art that the dialogue between the user and an intelligent robot follows a fixed response mode, so that the intelligent robot cannot make more intelligent responses based on the user's current emotions.
The technical solutions adopted by the present disclosure to solve the technical problems are as follows:
A voice interaction method based on emotion engine technology, wherein the method includes:
Obtain the voice information input by the user, and obtain the user's face image information;
Extract emotion recognition features from the voice information and face image information, and input the extracted emotion recognition features into a preset emotion recognition model;
The emotion of the user is calculated through the emotion recognition model, an anthropomorphic voice interaction strategy is generated based on the emotion of the user, and the voice interaction information is output.
The voice interaction method based on the emotion engine technology, wherein the step of obtaining voice information input by the user and obtaining face image information of the user specifically includes:
Obtain the voice information input by the user through the preset remote device or remote-control pickup device;
Obtain the user's face image information through a preset camera device.
The voice interaction method based on the emotion engine technology, wherein the step of extracting emotion recognition features from the voice information and face image information, and inputting the extracted emotion recognition features into a preset emotion recognition model, specifically includes:
Convert one voice signal in the acquired voice information into text information through the ASR voice recognition module, and extract the user's text emotional state from the text information;
The other voice signal in the obtained voice information is used to extract the user's audio emotional state through a preset voice emotion sensor;
The obtained facial image information is used to extract the user's expression state through a preset expression recognition system;
The text emotional state, audio emotional state and expression state are input to a preset emotion recognition model.
The voice interaction method based on the emotion engine technology, wherein the step of extracting the user's text emotional state from the text information specifically includes:
Perform feature extraction on the text information, extract sentence information, and obtain the user's personal information from a preset memory map according to the sentence information;
The sentence information and the user's personal information are input into a preset emotional state recognition model to identify the user's text emotional state.
The voice interaction method based on the emotion engine technology, wherein the step of inputting the sentence information and the user's personal information into a preset emotion recognition model to recognize the user's textual emotion state specifically includes:
Extract keywords from the sentence information, and obtain the user's first emotional state and first confidence score according to the keywords;
Input the sentence information and the user's personal information into the emotion recognition model to obtain the user's second emotional state and second confidence score;
Compare the first confidence score with a preset threshold;
If the first confidence score is greater than the threshold, the first emotional state is used as the user's text emotional state; if the first confidence score is less than the threshold, the first emotional state and the second emotional state are dynamically sorted, and the user's text emotional state is determined according to the result of the dynamic sorting.
In the voice interaction method based on the emotion engine technology, the sentence information includes: Chinese word segmentation information of the sentence, part-of-speech tagging information of the sentence after word segmentation processing, sentence pattern information of the sentence, and sentence2vector information of the sentence.
In the voice interaction method based on the emotion engine technology, the parameters involved in the dynamic sorting include: text length, extracted keywords, the text input by the user, and the confidence scores of the first/second emotional states.
The voice interaction method based on the emotion engine technology, wherein the step of calculating the user's emotion through the emotion recognition model, generating an anthropomorphic voice interaction strategy based on the user's emotion, and outputting the voice interaction information specifically includes:
The emotion recognition model performs weighted calculation on the input text emotional state, audio emotional state and expression state to obtain the user's emotion;
Compare and match the obtained emotion with the emotion feature information in the preset emotion database to obtain the corresponding emotion feature information;
Based on the obtained emotional feature information, perform emotional intention decision-making and user portrait filling;
Based on the obtained emotional intention decision result and user portrait information, a dialogue generation model is used to generate voice interaction information with emotions, and the voice interaction information is output.
The emotion database includes a variety of emotions and the emotion feature information corresponding to each emotion.
In the voice interaction method based on the emotion engine technology, before the emotion recognition model performs the weighted calculation on the input text emotional state, audio emotional state and expression state, the method includes:
Different weights are set in advance for the text emotional state, audio emotional state, and expression state.
The voice interaction method based on the emotion engine technology, wherein the step of generating the voice interaction information with emotion through the dialogue generation model specifically includes:
The dialogue generation model receives the question information input by the user, and records the user's historical dialogue information, position change information, and mood change information;
Analyze the user's personal information and activity status according to the historical dialogue information, the position change information, and the mood change information, to obtain user portrait information;
Generate voice interaction information according to the question information and the user portrait information; the voice interaction information is also used to update the dialogue generation model.
The voice interaction method based on the emotion engine technology, wherein the dialogue generation module is implemented with a three-layer recurrent neural network (RNN) architecture and is based on the backpropagation algorithm.
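As a rough illustration of the three-layer recurrent architecture and backpropagation-based training recited above, a dialogue generation module could be sketched as follows. This is only an editorial sketch under assumptions: the vocabulary size, dimensions, tokenization, and the way question information and user portrait information are encoded are not specified in this disclosure.

```python
# Rough sketch only: three stacked RNN layers trained with backpropagation.
# All sizes and the input encoding are assumptions, not taken from the disclosure.

import torch
import torch.nn as nn

class DialogueRNN(nn.Module):
    def __init__(self, vocab_size=8000, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Three recurrent layers, matching the "three-layer RNN" wording above.
        self.rnn = nn.RNN(embed_dim, hidden_dim, num_layers=3, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):
        x = self.embed(token_ids)   # (batch, seq, embed_dim)
        h, _ = self.rnn(x)          # (batch, seq, hidden_dim)
        return self.out(h)          # next-token scores per position

# Training would use ordinary backpropagation, e.g.:
#   logits = model(inputs)                                         # (B, L, V)
#   loss = nn.CrossEntropyLoss()(logits.transpose(1, 2), targets)  # targets: (B, L)
#   loss.backward(); optimizer.step()
```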
The voice interaction method based on the emotion engine technology, wherein the step of calculating the user's emotion through the emotion recognition model, generating an anthropomorphic voice interaction strategy based on the user's emotion, and outputting it further includes:
Use the user's emotion and the obtained emotional intention decision result as the first input of the network model;
Use the customized scene structured data as the second input of the network model;
Through the learning and training of the network model, an emotion engine model that outputs anthropomorphic voice interaction strategies under specific scenarios is obtained.
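The two-input arrangement recited in the preceding steps could be approximated by a small network such as the one below. This is an illustrative sketch only; the feature encodings, dimensions, and the space of output strategies are assumptions, not part of the disclosure.

```python
# Illustrative only: the first input encodes the user's emotion together with the
# emotional intention decision result, the second encodes the custom scene
# structured data; dimensions and the strategy space are assumed values.

import torch
import torch.nn as nn

class EmotionEngineModel(nn.Module):
    def __init__(self, emotion_dim=32, scene_dim=16,
                 hidden_dim=64, num_strategies=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(emotion_dim + scene_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_strategies),
        )

    def forward(self, emotion_and_intent, scene_features):
        # Concatenate the two inputs and score candidate anthropomorphic
        # voice interaction strategies for the given scene.
        x = torch.cat([emotion_and_intent, scene_features], dim=-1)
        return self.net(x)
```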
An intelligent terminal, comprising: a processor and a storage medium communicatively connected to the processor, the storage medium being adapted to store a plurality of instructions, and the processor being adapted to call the instructions in the storage medium to execute the steps of implementing the voice interaction method based on the emotion engine technology described in any one of the above.
A storage medium on which a plurality of instructions are stored, wherein the instructions are suitable for being loaded and executed by a processor to perform the steps of implementing any of the voice interaction methods based on the emotion engine technology described above.
Beneficial effects of the present disclosure: the present disclosure analyzes the user's emotions and adds emotion to voice interaction, thereby shaping an emotional, intelligent voice interaction mode, enabling more interesting voice interaction between the user and the smart terminal, getting rid of the mechanized and passive communication mode of traditional voice interaction systems, and providing convenience for users.
Brief description of the drawings
FIG. 1 is a flowchart of a preferred embodiment of the voice interaction method based on emotion engine technology of the present disclosure.
FIG. 2 is a general control flowchart of the voice interaction method based on the emotion engine technology of the present disclosure.
FIG. 3 is a logic flow diagram of the emotion recognition system of the voice interaction method based on emotion engine technology of the present disclosure.
FIG. 4 is a functional schematic diagram of the smart terminal of the present disclosure.
Detailed description
In order to make the purpose, technical solutions and advantages of the disclosure more clear and unambiguous, the disclosure will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present disclosure and are not intended to limit the present disclosure.
The technical solutions in the embodiments of the present disclosure will be described clearly and completely in conjunction with the drawings in the embodiments of the present disclosure. Obviously, the described embodiments are only a part of the embodiments of the present disclosure, not all of them. The following description of at least one exemplary embodiment is merely illustrative and in no way serves as any limitation on the present disclosure or its application or use. Based on the embodiments in the present disclosure, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present disclosure.
The voice interaction method based on the emotion engine technology provided by the present disclosure can be applied to a terminal. The terminal may be, but is not limited to, various personal computers, notebook computers, mobile phones, tablet computers, in-vehicle computers, and portable wearable devices. The terminal of the present disclosure uses a multi-core processor. The processor of the terminal may be at least one of a central processing unit (CPU), a graphics processing unit (GPU), a video processing unit (VPU), and the like.
The present disclosure provides a voice interaction method based on emotion engine technology. Specifically, as shown in FIG. 1, the method includes:
Step S100: Acquire voice information input by the user, and acquire face image information of the user.
Step S200: Extract emotion recognition features from the voice information and face image information, and input the extracted emotion recognition features to a preset emotion recognition model.
Step S300: Calculate the user's emotion through the emotion recognition model, generate an anthropomorphic voice interaction strategy based on the user's emotion, and output it.
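The three steps above can be pictured as one processing chain. The following minimal Python sketch is provided only as an editorial illustration; the component callables (acquire_voice, acquire_face_image, extract_features, recognize_emotion, generate_strategy, output) are hypothetical placeholders for the modules described in the remainder of this embodiment, not an implementation disclosed here.

```python
# Illustrative only: the callables are hypothetical stand-ins for the acquisition
# devices, feature extraction, emotion recognition model and strategy generation.

def voice_interaction_round(acquire_voice, acquire_face_image,
                            extract_features, recognize_emotion,
                            generate_strategy, output):
    voice = acquire_voice()                      # Step S100: voice input
    face = acquire_face_image()                  # Step S100: face image
    features = extract_features(voice, face)     # Step S200: emotion features
    emotion = recognize_emotion(features)        # Step S300: user's emotion
    reply = generate_strategy(emotion, voice)    # anthropomorphic strategy
    output(reply)                                # output voice interaction info
    return reply
```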
Since current voice interaction still stays at the level of passive, task-style dialogue, in which the user is asked back or answered through a fixed dialogue management mechanism, it is usually dull and uninteresting. To solve the above problems, this embodiment provides a voice interaction method based on the emotion engine technology, which mainly analyzes the user's emotions and adds emotion to the voice interaction, thereby shaping an emotional, intelligent voice interaction method, getting rid of the mechanized and passive communication mode of traditional voice interaction systems, and providing convenience for users.
Specifically, in this embodiment, whether the user is performing voice interaction is monitored in real time. When it is detected that the user is performing voice interaction, the voice information input by the user is obtained through a preset remote device or remote-control pickup device. Considering that the user's facial expressions also change in different emotional states, and that these changes likewise represent the user's emotional state, this embodiment presets a camera device, and the user's face image information is acquired in real time through the preset camera device while the user performs voice interaction. Combining the voice information with the user's face image information makes it possible to determine the user's current mood more accurately.
Further, the acquired voice information includes the language and text content of the user's speech as well as the tone and speaking-rate information; for example, a happy expression in the user's language indicates that the user may currently be in a relatively happy state, while faster speech and a louder voice indicate that the user is in a more excited state. In addition, certain words in the user's voice information can also indicate the user's current emotional state; for example, if the user's voice information contains the words "very annoying", this shows that the user is rather anxious. Therefore, in order to better analyze the voice information, as shown in FIG. 2, in this embodiment the obtained voice information is divided into two voice signals: one voice signal is converted into text information through a preset ASR (Automatic Speech Recognition) module, and the user's text emotional state is extracted from the text information; the other voice signal is passed through a preset voice emotion sensor to extract the user's audio emotional state. Since the user's facial expressions change under different emotional states, the user's expression state can be extracted from the obtained face image information through a preset expression recognition system. Finally, the extracted text emotional state, audio emotional state, and expression state are input to a preset emotion recognition module for emotion recognition, which can recognize the user's emotion more accurately.
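The two-path handling of the voice signal and the separate expression path described above can be sketched as follows. The five callables are assumed placeholders for the preset ASR module, voice emotion sensor, expression recognition system, and emotion recognition model; the disclosure does not provide code for them.

```python
# Minimal sketch, assuming the callables below stand in for the preset ASR module,
# voice emotion sensor, expression recognition system and emotion recognition model.

def extract_emotion_states(voice_signal, face_image,
                           asr_transcribe, text_emotion,
                           audio_emotion, face_expression, emotion_model):
    # Path 1: voice signal -> ASR text -> text emotional state
    text = asr_transcribe(voice_signal)
    text_state = text_emotion(text)
    # Path 2: the same voice signal -> audio emotional state (tone, rate)
    audio_state = audio_emotion(voice_signal)
    # Path 3: face image -> expression state
    expression_state = face_expression(face_image)
    # All three states are fed to the preset emotion recognition model
    return emotion_model(text_state, audio_state, expression_state)
```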
Specifically, as shown in FIG. 3, in this embodiment, extracting the user's text emotional state from the text information specifically includes the following steps:
Step 301: Extract sentence information according to the user's input.
Step 302: Acquire the user's personal information from the memory map.
Step 303: Input the sentence information into the rule model, extract keywords, and obtain the user's first emotional state and first confidence score according to the keywords.
Step 304: Input the sentence information and user information into the deep learning model to obtain the user's second emotional state and second confidence score.
Step 305: Determine whether the first confidence score is greater than a preset threshold. If not, perform step 307; if yes, perform step 306.
Step 306: Use the first emotional state as the user's text emotional state.
Step 307: Dynamically sort the first emotional state and the second emotional state, and make a decision according to the result of the dynamic sorting.
In one implementation, the sentence information in the above steps includes: Chinese word segmentation information of the sentence, part-of-speech tagging information after word segmentation, sentence pattern information of the sentence, sentence2vector information of the sentence, etc.; the user's personal information includes: name, gender, birthday, age, constellation, the user's psychological state and physiological state, etc. The parameters involved in the dynamic sorting include: text length, extracted keywords, the text input by the user, the confidence scores of the first/second emotional states, etc. When the above first confidence score is less than the preset threshold, in this embodiment these parameters are used as inputs to the dynamic sorting model, the sorting result is influenced by assigning them different weights, and the user's text emotional state is finally determined according to the sorting result. The parameter selection and weight adjustment of the dynamic sorting are adjusted according to the performance of the overall model. The methods for extracting sentence information include existing Chinese word segmentation and part-of-speech tagging techniques, which will not be repeated here.
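The decision logic of steps 301 to 307 can be sketched as below. This is a hedged illustration only: sentence_info is assumed to be a dictionary carrying the extracted text and keywords, the rule model and deep learning model are assumed to return an (emotional state, confidence score) pair, and the threshold and ranking terms are placeholder values, since the actual dynamic sorting parameters and weights are tuned against the overall model as noted above.

```python
# Illustrative sketch of steps 301-307; threshold and ranking terms are placeholders.

def decide_text_emotion(sentence_info, user_info, rule_model, dl_model,
                        threshold=0.8):
    first_state, first_score = rule_model(sentence_info)             # step 303
    second_state, second_score = dl_model(sentence_info, user_info)  # step 304

    if first_score > threshold:                                      # steps 305/306
        return first_state

    # Step 307: dynamic sorting of the two candidates. As an example, the
    # rule-model candidate is favoured when many keywords were matched, and the
    # deep-learning candidate when the text is long; the real parameters and
    # weights would be adjusted according to the overall model's performance.
    keyword_hits = len(sentence_info.get("keywords", []))
    text_len = len(sentence_info.get("text", ""))
    rule_rank = first_score + 0.1 * min(keyword_hits, 3)
    dl_rank = second_score + 0.1 * min(text_len / 50.0, 1.0)
    return first_state if rule_rank >= dl_rank else second_state
```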
Further, in this embodiment, emotion data from multiple users is collected in advance to build an emotion database. In one implementation, the emotion database contains 22 emotions covering human feelings such as joy, anger, sorrow, and happiness, and also stores the emotion feature information corresponding to each emotion. For example, the emotion feature information for "happy" in the database includes the corresponding expression image data (such as upturned corners of the mouth), the corresponding high-frequency words (such as "happy" or "glad"), and the corresponding tone and intonation information (such as a cheerful intonation). Therefore, once a happy emotion is found in the emotion database, the corresponding emotion feature information can be obtained; conversely, the corresponding emotional state can also be looked up in the database from the emotion feature information.
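For illustration, one way the emotion database entries described above could be organized is sketched below; the dictionary layout and the sample entries are assumptions, and the actual database is said to cover 22 emotions.

```python
# Assumed structure for the emotion database; only two illustrative entries shown.
EMOTION_DB = {
    "happy": {
        "expression_images": ["mouth_corners_up"],      # expression image data
        "high_frequency_words": ["happy", "glad"],      # high-frequency words
        "intonation": "cheerful",                        # tone / intonation info
    },
    "anxious": {
        "expression_images": ["brow_furrowed"],
        "high_frequency_words": ["annoyed", "worried"],
        "intonation": "tense",
    },
    # ... entries for the remaining emotions
}

def features_for(emotion: str) -> dict:
    """Look up emotion feature information from an emotion."""
    return EMOTION_DB.get(emotion, {})

def emotion_for(word: str) -> str:
    """Reverse lookup: find the emotion whose feature words contain a given word."""
    for emotion, features in EMOTION_DB.items():
        if word in features["high_frequency_words"]:
            return emotion
    return "neutral"

print(features_for("happy"))
print(emotion_for("worried"))  # -> "anxious"
```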
In specific implementations, considering that the user's voice information and facial image information may carry different weights in the final emotional state judgment in different application scenarios, this embodiment inputs the acquired text emotional state, audio emotional state, and expression state into the emotion recognition model, which performs a weighted calculation on them and compares and matches the result against the preset emotion database to obtain the user's emotion. Specifically, the emotion recognition model is obtained in advance by feeding collected text emotional states, audio emotional states, and expression states into a network model for deep learning and training. In this embodiment, different weights can be preset for the text emotional state, the audio emotional state, and the expression state; for example, the text emotional state may be weighted at 20%, the audio emotional state at 50%, and the expression state at 30%, and the calculation based on these weights yields the emotion closest to the user's current emotional state. The resulting user emotion is then compared and matched in the emotion database to obtain the corresponding emotion feature information, which is used for emotion intention decision-making and user portrait filling so as to generate voice interaction information carrying emotion. For example, when the user's emotion is calculated to be happy, the emotion feature information corresponding to "happy" includes frequently occurring words such as "happy" and "glad", expression images with upturned mouth corners, and a cheerful intonation; from this feature information the user portrait and the user's current specific emotion can be determined, the smart terminal can make the corresponding emotion intention decision (that is, the emotional feedback the smart terminal should give in response to the user's emotion), and it can produce response information carrying the corresponding emotion, that is, a reply that also carries a happy emotion, achieving a more humanized voice interaction.
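A minimal sketch of the weighted fusion step, using the example weights from the text (20% text, 50% audio, 30% expression), might look as follows; the per-emotion voting scheme is an assumption, since the disclosure only states that a weighted calculation is performed.

```python
# Example weights taken from the description; the voting scheme is assumed.
WEIGHTS = {"text": 0.2, "audio": 0.5, "expression": 0.3}

def fuse_emotions(text_state: str, audio_state: str, expression_state: str) -> str:
    """Accumulate weighted votes per candidate emotion and pick the strongest."""
    votes: dict[str, float] = {}
    for source, state in (("text", text_state),
                          ("audio", audio_state),
                          ("expression", expression_state)):
        votes[state] = votes.get(state, 0.0) + WEIGHTS[source]
    # The emotion with the highest weighted vote is taken as the user's emotion,
    # which is then matched against the emotion database.
    return max(votes, key=votes.get)

# Example: text and audio both suggest "happy", the face looks "excited"
print(fuse_emotions("happy", "happy", "excited"))  # -> "happy" (weight 0.7)
```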
Further, in this embodiment, a dialogue generation module is used to produce the reply during voice interaction. Specifically, the dialogue generation module receives the question information input by the user, records the user's historical dialogue information, position change information, and emotion change information, and then analyzes the user's personal information and activity state from this information to obtain user portrait information; voice interaction information is then generated from the question information and the user portrait information (the user portrait information at this point having been derived from the emotion feature information corresponding to the user's emotion). It can be seen that this embodiment not only produces response information carrying emotion according to the user's emotional state, but also adjusts the voice interaction strategy in real time as the user's emotion changes, so that the emotion carried by the strategy also changes in real time. In one implementation, the dialogue generation module is implemented with a three-layer recurrent neural network (RNN) architecture trained with the backpropagation (BP) algorithm. In one implementation, the more complete the user information in the dialogue generation model, the more accurate the voice interaction information; therefore, the method provided by this embodiment further includes adding the voice interaction information to the dialogue generation model, where a mixture of rule-based, machine learning, and deep learning techniques can be used to store the voice interaction information and to continue learning and training the dialogue generation model, thereby updating it so that it generates emotional voice response information more effectively.
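As an illustrative sketch only, assuming PyTorch, a three-layer RNN reply generator trained with backpropagation as described above might be set up as follows; the vocabulary size, layer widths, and training data are invented.

```python
# Assumed sketch of a three-layer RNN dialogue model trained with backpropagation.
import torch
import torch.nn as nn

class DialogueRNN(nn.Module):
    def __init__(self, vocab_size: int = 5000, embed_dim: int = 128,
                 hidden_dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Three stacked recurrent layers, per the "three-layer RNN" description
        self.rnn = nn.RNN(embed_dim, hidden_dim, num_layers=3, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(token_ids)
        h, _ = self.rnn(x)
        return self.out(h)          # next-token logits for each position

model = DialogueRNN()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

# One illustrative backpropagation step on dummy token sequences
inputs = torch.randint(0, 5000, (2, 10))      # batch of 2 input sequences
targets = torch.randint(0, 5000, (2, 10))     # next-token targets
optimizer.zero_grad()
logits = model(inputs)
loss = loss_fn(logits.reshape(-1, 5000), targets.reshape(-1))
loss.backward()
optimizer.step()
```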
Further, in this embodiment, considering that the character attributes of the interacting party differ between scenarios, corresponding scene structured data is configured according to the character attributes of each scenario. After the user's emotion is obtained, the user's emotion and the resulting emotion intention decision result (that is, the emotional feedback the smart terminal should give in response to the user's emotion) are used as the first input of a network model, and the customized scene structured data is used as the second input; through the learning and training of the network model, an emotion engine model is obtained that outputs an anthropomorphic voice interaction strategy for a specific scenario. This emotion engine model enables the smart terminal to automatically output an anthropomorphic voice interaction strategy according to the specific scenario, achieving more intelligent and humanized voice interaction.
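A hedged sketch of a two-input network of the kind described above is given below, again assuming PyTorch; the feature dimensions, the number of candidate strategies, and the architecture itself are assumptions rather than details taken from the disclosure.

```python
# Assumed sketch of an emotion engine model with two inputs:
# (1) user emotion + emotion intention decision, (2) scene structured data.
import torch
import torch.nn as nn

class EmotionEngine(nn.Module):
    def __init__(self, emotion_dim: int = 24, scene_dim: int = 16,
                 num_strategies: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(emotion_dim + scene_dim, 64),
            nn.ReLU(),
            nn.Linear(64, num_strategies),   # one logit per interaction strategy
        )

    def forward(self, emotion_intent: torch.Tensor,
                scene_data: torch.Tensor) -> torch.Tensor:
        # First input: emotion + intention decision; second input: scene data
        x = torch.cat([emotion_intent, scene_data], dim=-1)
        return self.net(x)

engine = EmotionEngine()
strategy_logits = engine(torch.rand(1, 24), torch.rand(1, 16))
strategy = strategy_logits.argmax(dim=-1)    # index of the chosen strategy
print(strategy)
```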
Based on the above embodiment, the present disclosure further provides a smart terminal, whose functional block diagram may be as shown in FIG. 4. The smart terminal includes a processor, a memory, a network interface, a display screen, and a temperature sensor connected through a system bus. The processor of the smart terminal provides computing and control capabilities. The memory of the smart terminal includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program, and the internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The network interface of the smart terminal communicates with external terminals through a network connection. When the computer program is executed by the processor, it implements a voice interaction method based on emotion engine technology. The display screen of the smart terminal may be a liquid crystal display or an electronic ink display, and the temperature sensor is arranged inside the smart terminal in advance to detect the current operating temperature of the internal devices.
Those skilled in the art will understand that the block diagram shown in FIG. 4 is only a block diagram of part of the structure related to the solution of the present disclosure and does not limit the smart terminal to which the solution is applied; a specific smart terminal may include more or fewer components than shown in the figure, may combine certain components, or may have a different arrangement of components.
In one embodiment, a smart terminal is provided, including a memory and a processor; a computer program is stored in the memory, and when the processor executes the computer program, at least the following steps can be implemented:
acquiring voice information input by a user, and acquiring face image information of the user;
extracting emotion recognition features from the voice information and the face image information, and inputting the extracted emotion recognition features into a preset emotion recognition model;
calculating the user's emotion through the emotion recognition model, generating an anthropomorphic voice interaction strategy based on the user's emotion, and outputting voice interaction information.
In one of the embodiments, when executing the computer program the processor can further implement: starting a preset listening program to monitor whether the user initiates voice interaction; when voice interaction is detected, starting a preset remote device or remote-control pickup device to acquire the voice information input by the user, and starting a preset camera to acquire the user's face information; splitting the acquired voice information into two voice signals, converting one voice signal into text information through the preset ASR speech recognition module and extracting the user's text emotional state from the text information, and extracting the user's audio emotional state from the other voice signal through the preset voice emotion perceptron; extracting the user's expression state from the acquired face image information through the preset expression recognition system; and inputting the extracted text emotional state, audio emotional state, and expression state into the preset emotion recognition module for emotion recognition.
In one of the embodiments, when executing the computer program the processor can further implement: after the text information of the user's voice interaction is obtained, extracting sentence information from the voice information input by the user and acquiring the user's personal information from the memory map; inputting the sentence information into the rule model, extracting keywords, and obtaining the user's first emotional state and first confidence score from the keywords; inputting the sentence information and the user information into the deep learning model to obtain the user's second emotional state and second confidence score; comparing the first confidence score with the preset threshold, and when the first confidence score is greater than the preset threshold, taking the first emotional state as the user's emotional state; and when the first confidence score is less than the preset threshold, dynamically ranking the first emotional state and the second emotional state and making the decision according to the ranking result.
In one of the embodiments, when executing the computer program the processor can further implement: collecting emotion data of multiple users in advance to generate the emotion database; inputting the acquired text emotional state, audio emotional state, and expression state into the emotion recognition model and performing the weighted calculation to obtain the user's emotion; comparing and matching the user's emotion against the preset emotion database to obtain the corresponding emotion feature information; performing emotion intention decision-making and user portrait filling based on the obtained emotion feature information; and generating voice interaction information carrying emotion through the dialogue generation model according to the obtained emotion intention decision result and the user portrait information. In the specific voice interaction process, the dialogue generation model receives the question information input by the user, records the user's historical dialogue information, position change information, and emotion change information, analyzes the user's personal information and activity state, and obtains user portrait information; voice interaction information is generated from the question information and the user portrait information, and this voice interaction information can also be used to update the dialogue generation model. In this embodiment, not only can response information carrying emotion be produced according to the user's emotional state, but different voice interaction strategies can also be produced in real time as the user's emotion changes, and the emotion carried by the voice interaction strategy changes in real time accordingly.
In one of the embodiments, when executing the computer program the processor can further implement: using the user's emotion and the obtained emotion intention decision result as the first input of the network model; using the customized scene structured data as the second input of the network model; and, through the learning and training of the network model, obtaining the emotion engine model that outputs an anthropomorphic voice interaction strategy for a specific scenario, enabling the smart terminal to automatically output an anthropomorphic voice interaction strategy according to the specific scenario and achieving more intelligent and humanized voice interaction.
A person of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be completed by instructing the relevant hardware through a computer program, and the computer program can be stored in a non-volatile computer-readable storage medium; when executed, the computer program may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database, or other media used in the embodiments provided by the present disclosure may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or an external cache. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
In summary, the present disclosure provides a voice interaction method based on emotion engine technology. The method includes: acquiring voice information input by a user and acquiring face image information of the user; extracting emotion recognition features from the voice information and the face image information, and inputting the extracted emotion recognition features into a preset emotion recognition model; and calculating the user's emotion through the emotion recognition model, generating an anthropomorphic voice interaction strategy based on the user's emotion, and outputting voice interaction information. By analyzing the user's emotion and adding emotion to the voice interaction, the present disclosure shapes an emotional, intelligent voice interaction mode, breaks away from the mechanized, passive communication mode of traditional voice interaction systems, and provides convenience for users.
It should be understood that the application of the present disclosure is not limited to the above examples. A person of ordinary skill in the art can make improvements or modifications according to the above description, and all such improvements and modifications shall fall within the scope of protection of the claims appended to the present disclosure.

Claims (15)

  1. A voice interaction method based on emotion engine technology, wherein the method comprises:
    acquiring voice information input by a user, and acquiring face image information of the user;
    extracting emotion recognition features from the voice information and the face image information, and inputting the extracted emotion recognition features into a preset emotion recognition model; and
    calculating the user's emotion through the emotion recognition model, generating an anthropomorphic voice interaction strategy based on the user's emotion, and outputting voice interaction information.
  2. The voice interaction method based on emotion engine technology according to claim 1, wherein the step of acquiring the voice information input by the user and acquiring the face image information of the user specifically comprises:
    acquiring the voice information input by the user through a preset remote device or a remote-control pickup device; and
    acquiring the face image information of the user through a preset camera device.
  3. The voice interaction method based on emotion engine technology according to claim 1, wherein the step of extracting emotion recognition features from the voice information and the face image information and inputting the extracted emotion recognition features into the preset emotion recognition model specifically comprises:
    converting one voice signal in the acquired voice information into text information through an ASR speech recognition module, and extracting the user's text emotional state from the text information;
    extracting the user's audio emotional state from the other voice signal in the acquired voice information through a preset voice emotion perceptron;
    extracting the user's expression state from the acquired face image information through a preset expression recognition system; and
    inputting the text emotional state, the audio emotional state, and the expression state into the preset emotion recognition model.
  4. The voice interaction method based on emotion engine technology according to claim 3, wherein the step of extracting the user's text emotional state from the text information specifically comprises:
    performing feature extraction on the text information to extract sentence information, and acquiring the user's personal information from a preset memory map according to the sentence information; and
    inputting the sentence information and the user's personal information into the preset emotion recognition model to identify the user's text emotional state.
  5. The voice interaction method based on emotion engine technology according to claim 4, wherein the step of inputting the sentence information and the user's personal information into the preset emotion recognition model to identify the user's text emotional state specifically comprises:
    extracting keywords from the sentence information, and obtaining the user's first emotional state and first confidence score according to the keywords;
    inputting the sentence information and the user's personal information into the emotion recognition model to obtain the user's second emotional state and second confidence score;
    comparing the first confidence score with a preset threshold; and
    if the first confidence score is greater than the threshold, taking the first emotional state as the user's text emotional state; if the first confidence score is less than the threshold, dynamically ranking the first emotional state and the second emotional state and determining the user's text emotional state according to the ranking result.
  6. The voice interaction method based on emotion engine technology according to claim 5, wherein the sentence information comprises: Chinese word segmentation information of the sentence, part-of-speech tagging information of the sentence after word segmentation, sentence pattern information of the sentence, and sentence2vector information of the sentence.
  7. The voice interaction method based on emotion engine technology according to claim 5, wherein the parameters involved in the dynamic ranking comprise: text length, extracted keywords, text input by the user, and the confidence scores of the first and second emotional states.
  8. The voice interaction method based on emotion engine technology according to claim 1, wherein the step of calculating the user's emotion through the emotion recognition model, generating an anthropomorphic voice interaction strategy based on the user's emotion, and outputting voice interaction information specifically comprises:
    performing, by the emotion recognition model, a weighted calculation on the input text emotional state, audio emotional state, and expression state to obtain the user's emotion;
    comparing and matching the obtained emotion with emotion feature information in a preset emotion database to obtain the corresponding emotion feature information;
    performing emotion intention decision-making and user portrait filling based on the obtained emotion feature information to obtain an emotion intention decision result; and
    generating voice interaction information carrying emotion through a dialogue generation model according to the obtained emotion intention decision result and the user portrait information, and outputting the voice interaction information.
  9. The voice interaction method based on emotion engine technology according to claim 8, wherein the emotion database comprises multiple emotions and the emotion feature information corresponding to each emotion.
  10. The voice interaction method based on emotion engine technology according to claim 1, wherein before the emotion recognition model performs the weighted calculation on the input text emotional state, audio emotional state, and expression state, the method comprises:
    presetting different weights for the text emotional state, the audio emotional state, and the expression state.
  11. The voice interaction method based on emotion engine technology according to claim 9, wherein the step of generating voice interaction information carrying emotion through the dialogue generation model specifically comprises:
    receiving, by the dialogue generation model, the question information input by the user, and recording the user's historical dialogue information, position change information, and emotion change information;
    analyzing the user's personal information and activity state according to the historical dialogue information, the position change information, and the emotion change information to obtain user portrait information; and
    generating voice interaction information according to the question information and the user portrait information, wherein the voice interaction information is further used to update the dialogue generation model.
  12. The voice interaction method based on emotion engine technology according to claim 11, wherein the dialogue generation model is implemented with a three-layer recurrent neural network (RNN) architecture and is based on a backpropagation algorithm.
  13. The voice interaction method based on emotion engine technology according to claim 9, wherein the step of calculating the user's emotion through the emotion recognition model, generating an anthropomorphic voice interaction strategy based on the user's emotion, and outputting voice interaction information comprises:
    using the user's emotion and the obtained emotion intention decision result as the first input of a network model;
    using customized scene structured data as the second input of the network model; and
    obtaining, through the learning and training of the network model, an emotion engine model that outputs an anthropomorphic voice interaction strategy in a specific scenario.
  14. A smart terminal, comprising: a processor and a storage medium communicatively connected to the processor, wherein the storage medium is adapted to store a plurality of instructions, and the processor is adapted to call the instructions in the storage medium to perform the steps of the voice interaction method based on emotion engine technology according to any one of claims 1-13.
  15. A storage medium storing a plurality of instructions, wherein the instructions are adapted to be loaded and executed by a processor to perform the steps of the voice interaction method based on emotion engine technology according to any one of claims 1-13.
PCT/CN2019/126443 2018-12-26 2019-12-19 Emotion engine technology-based voice interaction method, smart terminal, and storage medium WO2020135194A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811605103.5A CN111368609B (en) 2018-12-26 2018-12-26 Speech interaction method based on emotion engine technology, intelligent terminal and storage medium
CN201811605103.5 2018-12-26

Publications (1)

Publication Number Publication Date
WO2020135194A1 true WO2020135194A1 (en) 2020-07-02

Family

ID=71128377

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/126443 WO2020135194A1 (en) 2018-12-26 2019-12-19 Emotion engine technology-based voice interaction method, smart terminal, and storage medium

Country Status (2)

Country Link
CN (1) CN111368609B (en)
WO (1) WO2020135194A1 (en)


Families Citing this family (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111881348A (en) * 2020-07-20 2020-11-03 百度在线网络技术(北京)有限公司 Information processing method, device, equipment and storage medium
CN111897434A (en) * 2020-08-05 2020-11-06 上海永骁智能技术有限公司 System, method, and medium for signal control of virtual portrait
CN112183197A (en) * 2020-08-21 2021-01-05 深圳追一科技有限公司 Method and device for determining working state based on digital person and storage medium
CN112235183B (en) * 2020-08-29 2021-11-12 上海量明科技发展有限公司 Communication message processing method and device and instant communication client
CN112148850A (en) * 2020-09-08 2020-12-29 北京百度网讯科技有限公司 Dynamic interaction method, server, electronic device and storage medium
CN112185422B (en) * 2020-09-14 2022-11-08 五邑大学 Prompt message generation method and voice robot thereof
CN112083806B (en) * 2020-09-16 2021-10-26 华南理工大学 Self-learning emotion interaction method based on multi-modal recognition
CN112297023B (en) * 2020-10-22 2022-04-05 新华网股份有限公司 Intelligent accompanying robot system
CN112455370A (en) * 2020-11-24 2021-03-09 一汽奔腾轿车有限公司 Emotion management and interaction system and method based on multidimensional data arbitration mechanism
CN112379780B (en) * 2020-12-01 2021-10-26 宁波大学 Multi-mode emotion interaction method, intelligent device, system, electronic device and medium
CN112633172B (en) * 2020-12-23 2023-11-14 平安银行股份有限公司 Communication optimization method, device, equipment and medium
CN112735440A (en) * 2020-12-30 2021-04-30 北京瞰瞰科技有限公司 Vehicle-mounted intelligent robot interaction method, robot and vehicle
CN114745349B (en) * 2021-01-08 2023-12-26 上海博泰悦臻网络技术服务有限公司 Comment method, electronic equipment and computer readable storage medium
CN113822967A (en) * 2021-02-09 2021-12-21 北京沃东天骏信息技术有限公司 Man-machine interaction method, device, system, electronic equipment and computer medium
CN112967725A (en) * 2021-02-26 2021-06-15 平安科技(深圳)有限公司 Voice conversation data processing method and device, computer equipment and storage medium
CN112990301A (en) * 2021-03-10 2021-06-18 深圳市声扬科技有限公司 Emotion data annotation method and device, computer equipment and storage medium
CN113270087A (en) * 2021-05-26 2021-08-17 深圳传音控股股份有限公司 Processing method, mobile terminal and storage medium
CN113434647B (en) * 2021-06-18 2024-01-12 竹间智能科技(上海)有限公司 Man-machine interaction method, system and storage medium
CN113852524A (en) * 2021-07-16 2021-12-28 天翼智慧家庭科技有限公司 Intelligent household equipment control system and method based on emotional characteristic fusion
CN113380271B (en) * 2021-08-12 2021-12-21 明品云(北京)数据科技有限公司 Emotion recognition method, system, device and medium
CN113687744B (en) * 2021-08-19 2022-01-18 北京智精灵科技有限公司 Man-machine interaction device for emotion adjustment
CN113580166B (en) * 2021-08-20 2023-11-28 安徽淘云科技股份有限公司 Interaction method, device, equipment and storage medium of anthropomorphic robot
CN113707185B (en) * 2021-09-17 2023-09-12 卓尔智联(武汉)研究院有限公司 Emotion recognition method and device and electronic equipment
CN114115533A (en) * 2021-11-11 2022-03-01 北京萌特博智能机器人科技有限公司 Intelligent interaction method and device
CN114237395A (en) * 2021-12-14 2022-03-25 北京百度网讯科技有限公司 Information processing method, information processing device, electronic equipment and storage medium
CN114516341A (en) * 2022-04-13 2022-05-20 北京智科车联科技有限公司 User interaction method and system and vehicle
CN114999534A (en) * 2022-06-10 2022-09-02 中国第一汽车股份有限公司 Method, device and equipment for controlling playing of vehicle-mounted music and storage medium
CN115238111B (en) * 2022-06-15 2023-11-14 荣耀终端有限公司 Picture display method and electronic equipment
CN115204127B (en) * 2022-09-19 2023-01-06 深圳市北科瑞声科技股份有限公司 Form filling method, device, equipment and medium based on remote flow adjustment
CN115334205B (en) * 2022-10-11 2022-12-27 北京资采信息技术有限公司 Voice outbound system and method adopting deep learning
CN115431288B (en) * 2022-11-10 2023-01-31 深圳市神州云海智能科技有限公司 Guide robot for emotion feedback and information interaction based on multi-element fusion information
CN116820250B (en) * 2023-08-29 2023-11-17 小舟科技有限公司 User interaction method and device based on meta universe, terminal and readable storage medium


Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106773923B (en) * 2016-11-30 2020-04-21 北京光年无限科技有限公司 Multi-mode emotion data interaction method and device for robot
CN107220591A (en) * 2017-04-28 2017-09-29 哈尔滨工业大学深圳研究生院 Multi-modal intelligent mood sensing system
CN107301168A (en) * 2017-06-01 2017-10-27 深圳市朗空亿科科技有限公司 Intelligent robot and its mood exchange method, system
CN107243905A (en) * 2017-06-28 2017-10-13 重庆柚瓣科技有限公司 Mood Adaptable System based on endowment robot

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106203344A (en) * 2016-07-12 2016-12-07 北京光年无限科技有限公司 A kind of Emotion identification method and system for intelligent robot
CN106570496A (en) * 2016-11-22 2017-04-19 上海智臻智能网络科技股份有限公司 Emotion recognition method and device and intelligent interaction method and device
CN107944008A (en) * 2017-12-08 2018-04-20 神思电子技术股份有限公司 A kind of method that Emotion identification is carried out for natural language
CN109036405A (en) * 2018-07-27 2018-12-18 百度在线网络技术(北京)有限公司 Voice interactive method, device, equipment and storage medium

Cited By (44)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111696556B (en) * 2020-07-13 2023-05-16 上海茂声智能科技有限公司 Method, system, equipment and storage medium for analyzing user dialogue emotion
CN111696556A (en) * 2020-07-13 2020-09-22 上海茂声智能科技有限公司 Method, system, equipment and storage medium for analyzing user conversation emotion
CN111858892A (en) * 2020-07-24 2020-10-30 中国平安人寿保险股份有限公司 Voice interaction method, device, equipment and medium based on knowledge graph
CN111858892B (en) * 2020-07-24 2023-09-29 中国平安人寿保险股份有限公司 Voice interaction method, device, equipment and medium based on knowledge graph
CN111897933A (en) * 2020-07-27 2020-11-06 腾讯科技(深圳)有限公司 Emotional dialogue generation method and device and emotional dialogue model training method and device
CN111897933B (en) * 2020-07-27 2024-02-06 腾讯科技(深圳)有限公司 Emotion dialogue generation method and device and emotion dialogue model training method and device
CN111883127A (en) * 2020-07-29 2020-11-03 百度在线网络技术(北京)有限公司 Method and apparatus for processing speech
CN112151027B (en) * 2020-08-21 2024-05-03 深圳追一科技有限公司 Method, device and storage medium for querying specific person based on digital person
CN112151027A (en) * 2020-08-21 2020-12-29 深圳追一科技有限公司 Specific person inquiry method, device and storage medium based on digital person
CN112002329A (en) * 2020-09-03 2020-11-27 深圳Tcl新技术有限公司 Physical and mental health monitoring method and device and computer readable storage medium
CN112002329B (en) * 2020-09-03 2024-04-02 深圳Tcl新技术有限公司 Physical and mental health monitoring method, equipment and computer readable storage medium
CN112034989A (en) * 2020-09-04 2020-12-04 华人运通(上海)云计算科技有限公司 Intelligent interaction system
CN112185389A (en) * 2020-09-22 2021-01-05 北京小米松果电子有限公司 Voice generation method and device, storage medium and electronic equipment
CN112100337B (en) * 2020-10-15 2024-03-05 平安科技(深圳)有限公司 Emotion recognition method and device in interactive dialogue
CN112100337A (en) * 2020-10-15 2020-12-18 平安科技(深圳)有限公司 Emotion recognition method and device in interactive conversation
CN112232276A (en) * 2020-11-04 2021-01-15 赵珍 Emotion detection method and device based on voice recognition and image recognition
CN112232276B (en) * 2020-11-04 2023-10-13 上海企创信息科技有限公司 Emotion detection method and device based on voice recognition and image recognition
CN112418034A (en) * 2020-11-12 2021-02-26 元梦人文智能国际有限公司 Multi-modal emotion recognition method and device, electronic equipment and storage medium
CN112687260A (en) * 2020-11-17 2021-04-20 珠海格力电器股份有限公司 Facial-recognition-based expression judgment voice recognition method, server and air conditioner
CN112650399A (en) * 2020-12-22 2021-04-13 科大讯飞股份有限公司 Expression recommendation method and device
CN112650399B (en) * 2020-12-22 2023-12-01 科大讯飞股份有限公司 Expression recommendation method and device
CN112785667A (en) * 2021-01-25 2021-05-11 北京有竹居网络技术有限公司 Video generation method, device, medium and electronic equipment
CN113269406A (en) * 2021-05-06 2021-08-17 京东数字科技控股股份有限公司 Method and device for evaluating online service, computer equipment and storage medium
CN113488024B (en) * 2021-05-31 2023-06-23 杭州摸象大数据科技有限公司 Telephone interrupt recognition method and system based on semantic recognition
CN113488024A (en) * 2021-05-31 2021-10-08 杭州摸象大数据科技有限公司 Semantic recognition-based telephone interruption recognition method and system
CN113645364A (en) * 2021-06-21 2021-11-12 国网浙江省电力有限公司金华供电公司 Intelligent voice outbound method facing power dispatching
CN113645364B (en) * 2021-06-21 2023-08-22 国网浙江省电力有限公司金华供电公司 Intelligent voice outbound method for power dispatching
CN113609851A (en) * 2021-07-09 2021-11-05 浙江连信科技有限公司 Psychological idea cognitive deviation identification method and device and electronic equipment
CN114416934A (en) * 2021-12-24 2022-04-29 北京百度网讯科技有限公司 Multi-modal dialog generation model training method and device and electronic equipment
CN114533063B (en) * 2022-02-23 2023-10-27 金华高等研究院(金华理工学院筹建工作领导小组办公室) Multi-source monitoring combined emotion computing system and method
CN114533063A (en) * 2022-02-23 2022-05-27 金华高等研究院(金华理工学院筹建工作领导小组办公室) Multi-source monitoring combined emotion calculation system and method
CN115830171A (en) * 2023-02-17 2023-03-21 深圳前海深蕾半导体有限公司 Image generation method based on artificial intelligence drawing, display device and storage medium
CN116030811A (en) * 2023-03-22 2023-04-28 广州小鹏汽车科技有限公司 Voice interaction method, vehicle and computer readable storage medium
CN116030811B (en) * 2023-03-22 2023-06-30 广州小鹏汽车科技有限公司 Voice interaction method, vehicle and computer readable storage medium
CN116643675A (en) * 2023-07-27 2023-08-25 苏州创捷传媒展览股份有限公司 Intelligent interaction system based on AI virtual character
CN116643675B (en) * 2023-07-27 2023-10-03 苏州创捷传媒展览股份有限公司 Intelligent interaction system based on AI virtual character
CN116821287B (en) * 2023-08-28 2023-11-17 湖南创星科技股份有限公司 Knowledge graph and large language model-based user psychological portrait system and method
CN116821287A (en) * 2023-08-28 2023-09-29 湖南创星科技股份有限公司 Knowledge graph and large language model-based user psychological portrait system and method
CN116935480B (en) * 2023-09-18 2023-12-29 四川天地宏华导航设备有限公司 Emotion recognition method and device
CN116935480A (en) * 2023-09-18 2023-10-24 四川天地宏华导航设备有限公司 Emotion recognition method and device
CN117153151A (en) * 2023-10-09 2023-12-01 广州易风健康科技股份有限公司 Emotion recognition method based on user intonation
CN117153151B (en) * 2023-10-09 2024-05-07 广州易风健康科技股份有限公司 Emotion recognition method based on user intonation
CN117371338A (en) * 2023-12-07 2024-01-09 浙江宇宙奇点科技有限公司 AI digital person modeling method and system based on user portrait
CN117371338B (en) * 2023-12-07 2024-03-22 浙江宇宙奇点科技有限公司 AI digital person modeling method and system based on user portrait

Also Published As

Publication number Publication date
CN111368609A (en) 2020-07-03
CN111368609B (en) 2023-10-17

Similar Documents

Publication Publication Date Title
WO2020135194A1 (en) Emotion engine technology-based voice interaction method, smart terminal, and storage medium
CN108334583B (en) Emotion interaction method and device, computer readable storage medium and computer equipment
CN108227932B (en) Interaction intention determination method and device, computer equipment and storage medium
CN110688911B (en) Video processing method, device, system, terminal equipment and storage medium
JP7022062B2 (en) VPA with integrated object recognition and facial expression recognition
CN105843381B (en) Data processing method for realizing multi-modal interaction and multi-modal interaction system
WO2020147428A1 (en) Interactive content generation method and apparatus, computer device, and storage medium
US9501743B2 (en) Method and apparatus for tailoring the output of an intelligent automated assistant to a user
WO2021000497A1 (en) Retrieval method and apparatus, and computer device and storage medium
CN110110169A (en) Man-machine interaction method and human-computer interaction device
KR102448382B1 (en) Electronic device for providing image related with text and operation method thereof
US20180129647A1 (en) Systems and methods for dynamically collecting and evaluating potential imprecise characteristics for creating precise characteristics
WO2020211820A1 (en) Method and device for speech emotion recognition
CN106502382B (en) Active interaction method and system for intelligent robot
CN110399837A (en) User emotion recognition methods, device and computer readable storage medium
KR101984283B1 (en) Automated Target Analysis System Using Machine Learning Model, Method, and Computer-Readable Medium Thereof
Hema et al. Emotional speech recognition using cnn and deep learning techniques
KR102222911B1 (en) System for Providing User-Robot Interaction and Computer Program Therefore
Verkholyak et al. Modeling short-term and long-term dependencies of the speech signal for paralinguistic emotion classification
CN114676259B (en) Conversation emotion recognition method based on causal perception interactive network
CN110909218A (en) Information prompting method and system in question-answering scene
CN110931002B (en) Man-machine interaction method, device, computer equipment and storage medium
CN109961152B (en) Personalized interaction method and system of virtual idol, terminal equipment and storage medium
Du et al. Composite Emotion Recognition and Feedback of Social Assistive Robot for Elderly People
US20240038225A1 (en) Gestural prompting based on conversational artificial intelligence

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19904022

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19904022

Country of ref document: EP

Kind code of ref document: A1