WO2023137922A1 - Voice message generation method and apparatus, computer device and storage medium - Google Patents

Voice message generation method and apparatus, computer device and storage medium

Info

Publication number
WO2023137922A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
message
voice
expression
text message
Prior art date
Application number
PCT/CN2022/090752
Other languages
French (fr)
Chinese (zh)
Inventor
郑喜民
贾云舒
舒畅
陈又新
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2023137922A1 publication Critical patent/WO2023137922A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/24 Speech recognition using non-acoustical features
    • G10L15/25 Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Definitions

  • the present application relates to the technical field of artificial intelligence, and in particular to a voice message generation method and device, computer equipment, and storage media.
  • A chatbot is an artificial intelligence system that communicates with people through such communication channels.
  • Chatbots are divided into two types: active interaction and passive interaction.
  • Active interaction is initiated by the robot, which interacts with humans by sharing or recommending hotspot information that users are interested in.
  • In passive interaction, the user initiates a dialogue, and the machine understands the dialogue and responds accordingly.
  • Most chatbots that users come into contact with belong to the passive interaction type.
  • The interactive function of current passive-interaction chatbots is relatively limited: they can only respond according to the text recognized from the user's voice, and this single recognition method often affects the accuracy of the voice reply messages the chatbot generates.
  • In a first aspect, an embodiment of the present application proposes a method for generating a voice message based on expression recognition, the method comprising:
  • acquiring voice data and a corresponding facial image; performing speech recognition on the voice data to obtain a text message; performing expression recognition on the facial image to obtain an expression message;
  • inputting the text message and the expression message into a first model, the first model obtaining an answer text message according to the text message and the expression message; and
  • performing voice conversion on the answer text message to obtain a corresponding answer voice message.
  • In a second aspect, an embodiment of the present application proposes a voice message generation device based on expression recognition, comprising:
  • a data collection module, configured to acquire voice data and a corresponding facial image;
  • a speech recognition module, configured to perform speech recognition on the voice data to obtain a text message;
  • an expression recognition module, configured to perform expression recognition on the facial image to obtain an expression message;
  • a text message acquisition module, configured to input the text message and the expression message into a first model, the first model obtaining an answer text message according to the text message and the expression message; and
  • a voice message acquisition module, configured to perform voice conversion on the answer text message to obtain a corresponding answer voice message.
  • In a third aspect, an embodiment of the present application proposes a computer device, the computer device comprising a memory and a processor, wherein a program is stored in the memory, and when the program is executed by the processor, the processor is configured to execute a voice message generation method based on expression recognition, the method comprising:
  • acquiring voice data and a corresponding facial image; performing speech recognition on the voice data to obtain a text message; performing expression recognition on the facial image to obtain an expression message;
  • inputting the text message and the expression message into a first model, the first model obtaining an answer text message according to the text message and the expression message; and performing voice conversion on the answer text message to obtain a corresponding answer voice message.
  • In a fourth aspect, an embodiment of the present application proposes a storage medium, the storage medium being a computer-readable storage medium storing computer-executable instructions, the computer-executable instructions being used to cause a computer to execute a voice message generation method based on expression recognition, the method comprising:
  • acquiring voice data and a corresponding facial image; performing speech recognition on the voice data to obtain a text message; performing expression recognition on the facial image to obtain an expression message;
  • inputting the text message and the expression message into a first model, the first model obtaining an answer text message according to the text message and the expression message; and performing voice conversion on the answer text message to obtain a corresponding answer voice message.
  • The expression recognition-based voice message generation method and device, computer device, and storage medium proposed in the embodiments of the present application acquire voice data and a corresponding facial image, perform speech recognition on the voice data to obtain a text message, and perform expression recognition on the facial image to obtain an expression message; the text message and the expression message are input into the first model, which obtains an answer text message based on them, and finally the answer text message undergoes voice conversion to obtain the corresponding answer voice message.
  • In this way, the facial image is added to the chat robot's input.
  • By recognizing the facial image, the current situation can be judged more accurately; the first model obtains the answer text message according to the text message and the expression message, and the answer text message is converted into a voice reply message, thereby improving the accuracy of the voice reply message.
  • Fig. 1 is the first flowchart of the voice message generation method based on expression recognition provided by the embodiment of the present application;
  • Fig. 2 is the flowchart of step S200 in Fig. 1;
  • Fig. 3 is a flowchart of step S300 in Fig. 1;
  • Fig. 4 is the flowchart of step S330 in Fig. 3;
  • Fig. 5 is a flowchart of step S500 in Fig. 1;
  • FIG. 6 is a second flow chart of a voice message generation method based on facial expression recognition provided by an embodiment of the present application.
  • FIG. 7 is a flow chart of the actual application of the facial expression recognition-based voice message generation method provided by the embodiment of the present application.
  • FIG. 8 is a block diagram of a module structure of a voice message generation device based on expression recognition provided by an embodiment of the present application.
  • FIG. 9 is a schematic diagram of a hardware structure of a computer device provided by an embodiment of the present application.
  • Artificial intelligence: a new technical science that studies and develops theories, methods, technologies, and application systems for simulating, extending, and expanding human intelligence. Artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce intelligent machines that can respond in a way similar to human intelligence. Research in this field includes robotics, speech recognition, image recognition, natural language processing, and expert systems. Artificial intelligence can simulate the information processes of human consciousness and thinking. It is also the theory, method, technology, and application system of using digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • Chatbot: a computer program that communicates via dialogue or text, able to simulate human conversation and pass the Turing test. Chatbots can be used for practical purposes such as customer service or information acquisition. Some chatbots are equipped with natural language processing systems, but most simple systems only extract keywords from the input and then find the most suitable response sentences from a database. Chatbots are part of virtual assistants, such as Google Assistant, that can interface with many organizations' apps, websites, and instant messaging platforms (e.g., Facebook Messenger). Non-assistant applications include chat rooms for entertainment, research, specific product promotion, and social bots.
  • Convolutional neural network (CNN): a type of feedforward neural network that includes convolution computations and has a deep structure; it is one of the representative algorithms of deep learning.
  • A convolutional neural network has representation-learning ability and can perform shift-invariant classification of input information according to its hierarchical structure. With the introduction of deep learning theory and the improvement of numerical computing equipment, convolutional neural networks have developed rapidly and are applied in computer vision, natural language processing, and other fields. The convolutional neural network imitates the biological visual perception mechanism and can perform both supervised and unsupervised learning.
  • The parameter sharing of convolution kernels in the hidden layers and the sparsity of inter-layer connections enable a convolutional neural network to learn grid-like topological features, such as pixels and audio, with a small amount of computation, a stable effect, and no additional feature-engineering requirements on the data.
  • Recurrent neural network (RNN): a class of recursive neural networks that takes sequence data as input, performs recursion in the evolution direction of the sequence, and connects all nodes (recurrent units) in a chain.
  • LSTM: a common type of recurrent neural network.
  • Recurrent neural networks have memory, parameter sharing, and Turing completeness, so they have certain advantages in learning the nonlinear characteristics of sequences.
  • Recurrent neural networks are used in natural language processing (NLP), such as speech recognition, language modeling, and machine translation, and are also used in various kinds of time-series forecasting.
  • A recurrent neural network combined with a convolutional neural network can deal with computer vision problems involving sequence input.
  • Gated recurrent unit (GRU): a gating mechanism in recurrent neural networks that, like other gating mechanisms, aims to solve the vanishing/exploding gradient problem of standard RNNs while retaining the long-term information of the sequence. GRUs perform as well as LSTMs on many sequence tasks such as speech recognition, but have fewer parameters, containing only a reset gate and an update gate.
  • CTC: Connectionist Temporal Classification.
  • Region of interest (ROI): in machine vision and image processing, a region to be processed that is outlined from the image in the form of a box, circle, ellipse, irregular polygon, etc.
  • OpenCV: a cross-platform computer vision and machine learning software library released under the Apache 2.0 license (open source) that can run on Linux, Windows, Android, and Mac OS. It is lightweight and efficient, consisting of a series of C functions and a small number of C++ classes, and it implements many general-purpose algorithms in image processing and computer vision. OpenCV is written in C++ and has C++, Python, Java, and MATLAB interfaces; support is also provided for languages such as C#, Ch, Ruby, and Go. It is mainly oriented toward real-time vision applications and uses MMX and SSE instructions when available.
  • VGG model (Visual Geometry Group network): this network is related work from ILSVRC 2014; its main contribution is to show that increasing network depth can, to a certain extent, affect the network's final performance.
  • VGG has two structures, VGG16 and VGG19. There is no essential difference between the two; only the network depth differs.
  • One improvement of VGG16 over AlexNet is to use several consecutive 3x3 convolution kernels instead of the larger convolution kernels in AlexNet (11x11, 7x7, 5x5).
  • For a given receptive field, using stacked small convolution kernels is better than using one large convolution kernel, because multiple nonlinear layers increase the depth of the network, allowing it to learn more complex patterns at a relatively small cost (fewer parameters).
  • Embedding: a vector representation in which a low-dimensional vector represents an object, which can be a word, a commodity, a movie, etc. The nature of an embedding vector is that objects whose vectors are close in distance have similar meanings.
  • Embedding is essentially a mapping from a semantic space to a vector space that maintains, as much as possible, the relationships the original samples have in the semantic space; for example, two words with close semantics are positioned close together in the vector space.
  • Embedding can encode an object with a low-dimensional vector while retaining its meaning. It is often used in machine learning: in building a model, an object is encoded as a low-dimensional dense vector and then passed to a DNN, which improves efficiency.
  • Cross entropy: an important concept in Shannon information theory, mainly used to measure the difference between two probability distributions.
  • The performance of language models is usually measured by cross entropy and perplexity.
  • The meaning of cross entropy is the difficulty of recognizing text with the model, or, from a compression point of view, how many bits are used on average to encode each word.
  • The meaning of perplexity is the average number of branches with which the model represents this text; its reciprocal can be regarded as the average probability of each word.
  • Smoothing refers to assigning a probability value to unobserved N-gram combinations to ensure that a word sequence can always obtain a probability value from the language model.
  • Commonly used smoothing techniques are Good-Turing estimation, deleted interpolation smoothing, Katz smoothing, and Kneser-Ney smoothing.
  • jieba: an open-source Chinese word segmenter (also known as the "Jieba" segmenter).
  • Chinese word segmentation is a basic step in Chinese text processing and a basic module of Chinese human-computer natural language interaction.
  • The jieba segmenter is commonly used for word segmentation.
  • It uses dynamic programming to find the maximum-probability path, finding the maximum segmentation combination based on word frequency.
  • For words not in its dictionary, it uses an HMM model based on the word-forming ability of Chinese characters, together with the Viterbi algorithm.
  • jieba supports three segmentation modes: the precise mode, which tries to cut the sentence most accurately and is suitable for text analysis; the full mode, which scans all the words in the sentence that can form words, and is very fast but cannot resolve ambiguity; and the search engine mode, which, based on the precise mode, further segments long words to improve recall, and is suitable for word segmentation in search engines.
  • Analyzer (tokenizer): a component that handles word segmentation, generally comprising three parts: character filters, a tokenizer (which splits text into terms according to rules), and token filters. Character filters preprocess the original text, for example removing HTML or special characters; the tokenizer splits text into terms according to rules; token filters process the resulting terms, including lowercasing, deleting stopwords, adding synonyms, and so on.
  • Encoder and decoder: the encoder converts the input sequence into a fixed-length vector; the decoder converts the previously generated fixed vector into an output sequence. The input sequence can be text, voice, image, or video; the output sequence can be text or image.
  • word2vec: a group of related models used to generate word vectors. These models are shallow, two-layer neural networks trained to reconstruct linguistic word contexts: the network takes words as input and is trained to guess the words in adjacent positions. Under word2vec's bag-of-words assumption, the order of words is unimportant. After training, the word2vec model can map each word to a vector that represents word-to-word relationships; this vector is the hidden layer of the neural network.
  • The attention mechanism enables a neural network to focus on a subset of its inputs (or features) and select specific inputs; it can be applied to any type of input regardless of its shape.
  • The attention mechanism is a resource allocation scheme, the main means of addressing information overload, allocating computing resources to more important tasks.
  • Seq2Seq: an important RNN model, also known as the encoder-decoder model, which can be understood as an N-to-M model.
  • The model consists of two parts: the encoder encodes sequence information of any length into a vector c.
  • The decoder, after obtaining the context vector c, decodes the information and outputs it as a sequence.
  • The standard Fourier transform is only suitable for stationary signals; the short-time Fourier transform (STFT) is one of the tools used when that assumption fails.
  • A signal such as a porpoise whistle is non-stationary: its frequency characteristics change with time. To capture such time-varying features, time-frequency analysis of the signal is required.
  • Commonly used methods include the short-time Fourier transform, the wavelet transform, and the Hilbert-Huang transform.
  • Mel-frequency cepstrum: a linear transformation of the logarithmic energy spectrum based on the nonlinear Mel scale of sound frequency.
  • Mel-frequency cepstral coefficients (MFCCs) are the coefficients that make up the Mel-frequency cepstrum. They are derived from the cepstrum of an audio segment. The difference between the ordinary cepstrum and the Mel-frequency cepstrum is that the frequency bands of the latter are equally spaced on the Mel scale, which approximates the human auditory system more closely than the linearly spaced frequency bands used in the ordinary cepstrum. Such a nonlinear representation can yield better representations of sound signals in multiple domains.
  • Griffin-Lim: a vocoder, often used in speech synthesis, that converts the acoustic parameters generated by a speech synthesis system into a speech waveform. This vocoder requires no training and does not need to predict the phase spectrum; instead, it estimates the phase information from the relationships between frames, thereby reconstructing the speech waveform.
  • Softmax classifier: the generalization of the logistic regression classifier to multiple classes; its output is the probability of belonging to each of the different categories.
  • As noted above, a chatbot is an artificial intelligence system that communicates with people through communication channels, and the interaction function of passive-interaction chatbots is currently limited: they can only respond to the text recognized from the user's voice, and this single recognition method often affects the accuracy of the voice reply messages the chatbot generates.
  • Embodiments of the present application provide a voice message generation method and device based on expression recognition, a computer device, and a storage medium, which can improve the accuracy of the generated voice reply message. Specifically, the following embodiments are used for illustration.
  • the expression recognition-based voice message generation method provided in the embodiment of the present application relates to the field of artificial intelligence.
  • the voice message generation method based on facial expression recognition provided by the embodiment of the present application can be applied to a terminal, can also be applied to a server, and can also be software running on the terminal or the server.
  • the terminal can be a smart phone, a tablet computer, a notebook computer, a desktop computer, or a smart watch;
  • the server end can be configured as an independent physical server, as a server cluster or distributed system composed of multiple physical servers, or as a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN, big data, and artificial intelligence platforms; the server is not limited to the above forms.
  • the embodiments of the present application can be used in many general-purpose or special-purpose computer system environments or configurations. Examples: personal computers, server computers, handheld or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics devices, network PCs, minicomputers, mainframe computers, distributed computing environments including any of the above systems or devices, etc.
  • This application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer.
  • program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types.
  • the application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network.
  • program modules may be located in both local and remote computer storage media including storage devices.
  • the method for generating a voice message based on expression recognition includes, but is not limited to, steps S100 to S500.
  • Step S100: acquiring voice data and a corresponding facial image.
  • Step S200: performing speech recognition on the voice data to obtain a text message.
  • Step S300: performing expression recognition on the facial image to obtain an expression message.
  • Step S400: inputting the text message and the expression message into the first model, the first model obtaining an answer text message according to the text message and the expression message.
  • Step S500: performing voice conversion on the answer text message to obtain the corresponding answer voice message.
  • The voice data sent by the user, that is, the content of the user's speech to the chat robot, is captured through a microphone; at the same time, a camera captures an image of the user speaking, specifically the user's facial image.
  • Some images captured by the camera may not contain the user's face region, or may not contain only the user's face region. In this case, the captured images need to be further screened; specifically, images that do not contain the user's facial region may be deleted.
  • The CascadeClassifier function in the open-source OpenCV library can also be used to automatically detect all face regions in a picture, realizing face detection and localization of the image.
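  • By way of illustration only, a minimal sketch of this screening step using OpenCV's CascadeClassifier is shown below; the cascade file, thresholds, and helper name are assumptions, not taken from the patent.

```python
# Hypothetical sketch of the face-screening step with OpenCV's CascadeClassifier.
# The cascade file and detection thresholds are illustrative assumptions.
import cv2

def detect_face_regions(image_path):
    image = cv2.imread(image_path)
    if image is None:
        return []  # unreadable capture: treat as containing no face region
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    # detectMultiScale returns one (x, y, w, h) box per detected face region.
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return [image[y:y + h, x:x + w] for (x, y, w, h) in faces]
```

  • Images for which this function finds no face region would be the ones deleted during screening.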
  • In step S200 of some embodiments, after the voice data from the user is collected, the voice data needs to be converted into text to obtain a text message.
  • In step S300 of some embodiments, after the user's facial image is collected, expression classification needs to be performed on the facial image; for example, it is necessary to determine which expression a certain facial image shows and to generate a corresponding expression message, such as a text vector or an image vector corresponding to the expression, which is used to make the first model generate the reply text message.
  • The expressions can be divided into happy, sad, angry, neutral, surprised, and scared.
  • In step S400 of some embodiments, the text message and the expression message are input into the first model, and the first model obtains a reply text message according to them.
  • In step S500 of some embodiments, voice conversion is performed on the answer text message to obtain the corresponding answer voice message.
  • The chat robot then gives the user a corresponding voice answer according to the voice message.
  • step S200 specifically includes, but is not limited to, steps S210 to S250.
  • Step S210: performing an integral transform on the time-domain signal of the speech data to obtain the frequency-domain signal.
  • Step S220: constructing a planar space from the time-domain signal and the frequency-domain signal.
  • Step S230: performing, through the first neural network, a convolution operation on the voice data in the planar space to obtain the voice sequence and its sequence length.
  • Step S240: slicing the speech sequence according to the sequence length to obtain multiple slice sequences.
  • Step S250: performing text conversion on the multiple slice sequences through the second neural network to obtain the text message.
  • In step S210 of some embodiments, an integral transform is applied to the time-domain signal of the voice data to obtain the frequency-domain signal.
  • The integral transform can be a Fourier transform, which converts a time-domain signal that is difficult to process into a frequency-domain signal that is easy to analyze.
  • The function of the fast Fourier transform is to transform the time-domain digital signal into the frequency domain and analyze the positions with higher energy there; these positions may be the frequency bands where the sounds that need attention are located.
  • In step S220 of some embodiments, the time-domain signal and the frequency-domain signal are combined into a two-dimensional space, that is, a planar space.
  • In step S230 of some embodiments, the first neural network performs a convolution operation on the voice data in the planar space to obtain the voice sequence and the sequence length.
  • The first neural network is composed of multiple CNNs and is used to perform the convolution operation on the speech data to obtain a speech sequence and its length.
  • In step S240 of some embodiments, the speech sequence is sliced according to the sequence length; in this way, the speech data is modeled in segments.
  • The speech sequence can be cut into multiple slices to obtain slice sequences; for example, cutting the speech sequence into N slices yields N slice sequences.
  • In step S250 of some embodiments, text conversion is performed on the multiple slice sequences through the second neural network to obtain the text message.
  • The second neural network may be an RNN built from multiple GRU units; the N slices obtained in step S240 are used as the N inputs of the RNN, and the text message output by the RNN is obtained, completing the conversion of voice data into a text message.
  • The gradients of a plain RNN are prone to decay or explode. Although clipping gradients can cope with explosion, it cannot solve gradient decay, which in practice makes it difficult for an RNN to capture dependencies across large time-step distances in a time series.
  • The embodiment of the present application therefore adopts GRU units in the RNN, which better capture dependencies across large time-step distances and control the flow of information, achieving a better training effect and making the converted text messages more accurate.
  • The first neural network and the second neural network together form a voice model, and the voice model converts voice data into text messages.
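  • As a rough, non-authoritative sketch of that voice model (the patent does not give layer sizes), the first network can be a small CNN stack over the time-frequency plane and the second an RNN built from GRU units; all dimensions below are invented for illustration.

```python
# Illustrative PyTorch sketch of the voice model: a CNN front end (the "first
# neural network") whose output is sliced along time and fed to a GRU-based RNN
# (the "second neural network"). Layer sizes and vocabulary are assumptions.
import torch
import torch.nn as nn

class VoiceModel(nn.Module):
    def __init__(self, n_freq_bins=128, hidden=256, vocab_size=5000):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.rnn = nn.GRU(32 * n_freq_bins, hidden, num_layers=2,
                          batch_first=True)
        self.proj = nn.Linear(hidden, vocab_size)

    def forward(self, spectrogram):            # (batch, 1, time, n_freq_bins)
        feats = self.cnn(spectrogram)          # (batch, 32, time, n_freq_bins)
        b, c, t, f = feats.shape
        # Each time step becomes one "slice" of the speech sequence.
        slices = feats.permute(0, 2, 1, 3).reshape(b, t, c * f)
        out, _ = self.rnn(slices)              # N slices -> N RNN inputs
        return self.proj(out)                  # per-slice text-token logits
```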
  • A loss function is used to optimize the speech model; specifically, the CTC loss function is used.
  • The loss function is shown in formula (1): L(X, Z) = -ln p(Z|X), where X represents a given segment of speech, Z represents the text corresponding to X, p represents probability, and p(Z|X) is obtained by summing, over the alignment paths that collapse to Z, the product (∏) of the per-frame probabilities. Minimizing the loss thus maximizes the probability of the correct text.
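  • A hedged usage sketch of this objective with PyTorch's built-in CTC loss follows; the batch size, sequence lengths, and vocabulary size are made up for illustration.

```python
# Illustrative use of torch.nn.CTCLoss to optimize the speech model; minimizing
# the loss maximizes p(Z|X), the probability of the correct text Z given speech X.
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0)  # CTC reserves index 0 here for the "blank" label
log_probs = torch.randn(50, 4, 5000, requires_grad=True).log_softmax(2)  # (time, batch, vocab)
targets = torch.randint(1, 5000, (4, 12))          # the text Z for each clip X
input_lengths = torch.full((4,), 50, dtype=torch.long)
target_lengths = torch.full((4,), 12, dtype=torch.long)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # gradients flow back toward the voice model's outputs
```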
  • step S300 specifically includes, but is not limited to, steps S310 to S330.
  • Step S310: performing self-attention screening on the facial image through the third neural network to obtain transformation parameters.
  • Step S320: warping the facial image according to the transformation parameters to obtain a transformed image.
  • Step S330: performing expression recognition on the facial image and the transformed image through the fourth neural network to obtain the expression message.
  • The facial image is screened by self-attention through the third neural network to obtain a transformation parameter, namely the parameter θ.
  • The third neural network in this embodiment refers to the self-attention network, which consists of two convolutional layers and two fully connected layers and can locate key areas of the face. Different expressions have different key areas: when the user is angry, the key area of the facial expression is the eyebrows; when the user is happy, it is the mouth; when the user is surprised, it is the mouth, the eyes, and so on. Using the self-attention network therefore allows more accurate expression classification of facial images.
  • The facial image is warped according to the transformation parameter θ to obtain a transformed image.
  • The transformation parameters can take various forms. For example, if the transformation parameter is a transformation direction, and the direction is a 90-degree clockwise rotation, then the orientation of the facial image is transformed according to that parameter.
  • In step S330 of some embodiments, feature extraction is performed on the facial image to obtain the corresponding feature vector; the feature vector and the transformed image obtained in step S320 are input to the fourth neural network, which outputs the expression classification result.
  • The VGG-19 network can be used to extract features from facial images.
  • The classification messages described in this application can take various forms; for example, the output may be an expression image or expression text. If it is an expression image, the image is converted into a vector to obtain the expression message; if it is expression text, the text is converted into a vector to obtain the expression message.
  • the fourth neural network includes a convolutional layer, a fully connected layer, and a classifier.
  • step S330 specifically includes, but is not limited to, steps S331 to S333.
  • Step S331: performing feature extraction on the facial image and the transformed image through the convolutional layer to obtain a plurality of image feature vectors.
  • Step S332: splicing the plurality of image feature vectors through a fully connected layer to obtain an image splicing vector.
  • Step S333: the classifier performs expression classification on the image splicing vector to obtain the expression message.
  • In step S331 of some embodiments, the facial image and the transformed image are input to the convolutional layer of the fourth neural network, and feature extraction is performed through the convolutional layer to obtain a plurality of image feature vectors.
  • In step S332 of some embodiments, the multiple image feature vectors are input to the fully connected layer and spliced to obtain an image splicing vector.
  • In step S333 of some embodiments, the image splicing vector is input to the classifier, which outputs the expression classification result, and the expression message is obtained from the classification result.
  • the classifier referred to in this application may be a Softmax classifier or the like.
  • The third neural network and the fourth neural network together constitute an expression recognition model, which realizes expression classification of facial images.
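  • A simplified, non-authoritative sketch of the fourth neural network (steps S331 to S333) is given below: convolutional feature extraction for the facial image and the transformed image, splicing in a fully connected layer, and a softmax classifier over the six expression classes named above. All layer sizes are assumptions.

```python
# Illustrative PyTorch sketch of the fourth neural network: convolutional layer,
# fully connected splicing layer, and softmax classifier. Sizes are assumptions.
import torch
import torch.nn as nn

EXPRESSIONS = ["happy", "sad", "angry", "neutral", "surprised", "scared"]

class ExpressionClassifier(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4))
        self.fc = nn.Linear(2 * 16 * 4 * 4, feat_dim)  # splice both vectors
        self.head = nn.Linear(feat_dim, len(EXPRESSIONS))

    def forward(self, face, transformed):
        v1 = self.conv(face).flatten(1)         # feature vector of face image
        v2 = self.conv(transformed).flatten(1)  # feature vector of warped image
        spliced = torch.cat([v1, v2], dim=1)    # the "image splicing vector"
        logits = self.head(torch.relu(self.fc(spliced)))
        return logits.softmax(dim=1)            # probability per expression
```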
  • A loss function is used to optimize the expression recognition model, for example the cross-entropy loss function shown in formula (2): L = -(1/N) Σ_i Σ_{c=1}^{M} y_ic log(p_ic), where M is the number of categories, y_ic indicates whether category c is the real category of observation sample i, and p_ic is the predicted probability that sample i belongs to category c among the M categories.
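  • A tiny numeric illustration of formula (2) for a single sample (N = 1) with M = 6 expression classes; the probabilities are invented.

```python
# Cross entropy for one sample whose true class is "angry" (index 2).
import numpy as np

y = np.array([0, 0, 1, 0, 0, 0])                  # one-hot true category y_ic
p = np.array([0.05, 0.05, 0.7, 0.1, 0.05, 0.05])  # predicted probabilities p_ic
loss = -np.sum(y * np.log(p))                     # = -log(0.7), about 0.357
```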
  • Face photos, that is, the facial images mentioned in the embodiments of the present application, are subjected to feature extraction through VGG-19 to obtain the image feature vectors corresponding to the facial images; at the same time, the face photo is input to the self-attention network, which generates a parameter θ, and T_θ(G) is obtained according to θ.
  • T_θ(G) is equivalent to an affine transformation of the input, with θ as the transformation parameter; it generates a warped sample of the input face photo, that is, the transformed image, which helps the neural network find important expression-related areas of the face.
  • feature extraction is performed on the transformed image to obtain an image feature vector corresponding to the transformed image.
  • the image feature vector corresponding to the face image and the image feature vector corresponding to the transformed image are input to the two fully connected layers, and the fully connected layer outputs the classification result of the expression.
  • The embodiment of the present application introduces an attention mechanism that can locate different key areas of the face for different expressions, so that the neural network focuses on expression-related areas of the face, making expression recognition more accurate.
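  • If θ is read as the 2x3 matrix of an affine transformation, the warping step T_θ(G) can be sketched with PyTorch's grid-sampling utilities; the fixed θ below stands in for the value the self-attention network would predict.

```python
# Illustrative affine warp T_theta(G): theta here is a made-up 2x3 affine matrix
# standing in for the parameter predicted by the self-attention network.
import torch
import torch.nn.functional as F

face = torch.randn(1, 3, 224, 224)         # one face image (N, C, H, W)
theta = torch.tensor([[[1.0, 0.1, 0.0],
                       [0.0, 1.0, 0.0]]])  # mild shear, for illustration only
grid = F.affine_grid(theta, face.shape, align_corners=False)
transformed = F.grid_sample(face, grid, align_corners=False)  # warped sample
```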
  • step S500 specifically includes, but is not limited to, steps S510 to S550.
  • Step S510: performing voice conversion on the answer text message to obtain a preliminary voice message.
  • Step S520: transforming the preliminary voice message to obtain a spectrogram.
  • Step S530: extracting audio features from the spectrogram.
  • Step S540: decoding the audio features through the fifth neural network to obtain the audio data corresponding to each frame.
  • Step S550: synthesizing the audio data to obtain the corresponding answer voice message.
  • Voice conversion is performed on the reply text message to obtain a preliminary voice message; this conversion can be performed by text-to-speech software.
  • The preliminary voice message is transformed to obtain a spectrogram.
  • The preliminary voice message refers to the sound signal corresponding to the text message; this sound signal can be converted into a corresponding two-dimensional signal through the STFT to obtain a spectrogram.
  • The principle of the STFT is: divide a long signal into frames and apply a window, then perform a fast Fourier transform (FFT) on each frame, and finally stack the per-frame results along another dimension to obtain a picture-like two-dimensional signal, that is, the corresponding spectrogram.
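  • A minimal sketch of producing such a spectrogram with SciPy follows; the sample rate, window length, and overlap are assumptions, and the random signal merely stands in for the preliminary voice message.

```python
# Illustrative STFT spectrogram: frame the signal, window and FFT each frame,
# then stack the per-frame spectra into a 2-D array.
import numpy as np
from scipy.signal import stft

fs = 16000                          # assumed sample rate (Hz)
signal = np.random.randn(fs * 2)    # stand-in for the preliminary voice message
freqs, frames, Zxx = stft(signal, fs=fs, nperseg=512, noverlap=384)
spectrogram = np.abs(Zxx)           # magnitude spectrogram (freq x time)
```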
  • An encoder is used to extract MFCC audio features from the spectrogram.
  • The fifth neural network, based on the self-attention mechanism, is used to decode the audio features to obtain the audio data corresponding to each frame.
  • The fifth neural network is an RNN, specifically composed of two GRU network layers, where each GRU layer includes 256 GRU units.
  • In step S550 of some embodiments, when generating audio from the spectrum, the law of phase change between consecutive frames must be considered. Therefore, after the audio corresponding to each frame is obtained, the Griffin-Lim reconstruction algorithm is used to fine-tune the phase changes between consecutive frames and then generate consecutive frames of audio, yielding the corresponding answer voice message. It should be noted that when the phase change between consecutive frames is large, an intermediate phase needs to be obtained so that the phase change of consecutive audio frames does not become too large and degrade the generated reply voice message.
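  • A hedged sketch of the Griffin-Lim reconstruction step, here via librosa (the patent does not name a library); the test tone and iteration count are assumptions.

```python
# Illustrative Griffin-Lim reconstruction: estimate the missing phase from the
# relationships between adjacent frames, then rebuild a waveform.
import numpy as np
import librosa

tone = librosa.tone(440, sr=16000, length=16000)     # 1 s test tone, 440 Hz
magnitude = np.abs(librosa.stft(tone))               # magnitude spectrogram
waveform = librosa.griffinlim(magnitude, n_iter=32)  # iterative phase refinement
```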
  • The embodiment of the present application can also change output audio parameters, such as intonation, according to different expressions, so that the robot gives more appropriate answers.
  • A further step is included: building the first model, which specifically includes, but is not limited to, steps S610 to S650.
  • Step S610: acquiring a message data set.
  • Step S620: performing word segmentation on multiple question sample data to obtain multiple question word-segmentation data.
  • Step S630: performing word segmentation on multiple answer sample data to obtain multiple answer word-segmentation data.
  • Step S640: acquiring the first original model.
  • Step S650: training the first original model according to the multiple question word-segmentation data, the multiple answer word-segmentation data, and multiple preset expressions to obtain the first model.
  • a message data set used for model training is obtained.
  • The message data set includes a plurality of question sample data, a plurality of preset expressions, and a plurality of answer sample data; the question sample data and the preset expressions correspond one to one to form binding groups, and each binding group has a mapping relationship with answer sample data.
  • A Chinese word segmentation tool such as jieba or an Analyzer is used to perform word segmentation on the plurality of question sample data to obtain the plurality of question word-segmentation data.
  • Likewise, a Chinese word segmentation tool such as jieba or an Analyzer is used to perform word segmentation on the multiple answer sample data to obtain the multiple answer word-segmentation data.
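  • A small jieba example of this segmentation step follows; the sample question and answer are invented.

```python
# Illustrative word segmentation with jieba's precise mode.
import jieba

question = "今天天气怎么样"              # invented question sample
answer = "今天天气很好"                 # invented answer sample
question_tokens = jieba.lcut(question)  # e.g. ['今天', '天气', '怎么样']
answer_tokens = jieba.lcut(answer)      # e.g. ['今天', '天气', '很', '好']
```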
  • The first original model is obtained; it may specifically be a Seq2seq model that has not yet been trained.
  • the first original model is trained according to a plurality of question word segmentation data, a plurality of answer word segmentation data and a plurality of preset expressions to obtain a first model.
  • Step S650 specifically includes, but is not limited to, the following steps, culminating in updating the first original model according to the loss value to obtain the first model.
  • The plurality of question word-segmentation data and the plurality of answer word-segmentation data are input into an encoder for first encoding to obtain sample encoding data.
  • Here the encoder refers to word2vec, and the generated sample encoding data are word embedding vectors.
  • The plurality of preset expressions are input into word2vec for second encoding to obtain expression encoding data.
  • The sample encoding data and the expression encoding data are input into the Seq2seq model for training.
  • The sample encoding data and the expression encoding data are spliced through the Seq2seq model to obtain sample splicing data, which is input to the decoder for decoding to obtain sample decoding data; according to the sample splicing data and the sample decoding data, the loss function of the first original model, for example a cross-entropy loss function, is calculated to obtain a loss value; the first original model is then updated according to the loss value to obtain the first model.
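  • The two encoding steps can be sketched as follows; the tiny corpus, vector size, and the random table standing in for the expression encoding are all invented for illustration.

```python
# Illustrative first and second encodings: word2vec embeds the segmented tokens,
# a separate (here random) table encodes the preset expressions, and the two
# codes are spliced before being fed to the Seq2seq decoder.
import numpy as np
from gensim.models import Word2Vec

corpus = [["今天", "天气", "怎么样"], ["今天", "天气", "很", "好"]]
w2v = Word2Vec(sentences=corpus, vector_size=64, min_count=1)  # first encoding
sample_encoding = np.stack([w2v.wv[t] for t in corpus[0]])     # word embeddings

expressions = ["happy", "sad", "angry", "neutral", "surprised", "scared"]
rng = np.random.default_rng(0)
expr_table = {e: rng.normal(size=64) for e in expressions}     # second encoding
spliced = np.concatenate([sample_encoding,
                          expr_table["happy"][None, :]], axis=0)  # splicing
```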
  • The embodiment of the present application also uses an attention model to focus on key positions of the question.
  • The present application uses multiple modules to realize the expression recognition-based voice message generation method.
  • The modules include: a speech recognition module, an expression recognition module, a text understanding module, and a voice conversion module.
  • The specific method includes: the speech recognition module recognizes the voice information of the user speaking to the chat robot and converts it into corresponding text.
  • The camera captures an image of the user speaking and crops the face area to obtain a face-area image, which is then input to the expression recognition module; the expression recognition module recognizes the corresponding expression.
  • The text obtained by the speech recognition module and the expression obtained by the expression recognition module are input into the text understanding module, which generates a text answer according to the text and the expression.
  • Finally, the text answer is input into the voice conversion module to generate a voice answer, completing the expression recognition-based voice message generation process.
  • The voice message generation method based on expression recognition proposed in the embodiment of the present application acquires voice data and its corresponding facial image, performs speech recognition on the voice data to obtain a text message, and performs expression recognition on the facial image to obtain an expression message; the text message and the expression message are input into the first model, which obtains the answer text message according to them, and finally voice conversion is performed on the answer text message to obtain the corresponding answer voice message.
  • In this way, the facial image is added to the chat robot's input.
  • By recognizing the facial image, the current situation can be judged more accurately; the first model obtains the answer text message according to the text message and the expression message, and the answer text message is converted into a voice reply message, thereby improving the accuracy of the voice reply message.
  • The embodiment of the present application also provides a voice message generation device based on expression recognition, with which the above voice message generation method can be realized.
  • The device includes: a data acquisition module 710, a voice recognition module 720, an expression recognition module 730, a text message acquisition module 740, and a voice message acquisition module 750. The data acquisition module 710 is used to acquire voice data and the corresponding facial image; the voice recognition module 720 is used to perform speech recognition on the voice data to obtain a text message; the expression recognition module 730 is used to perform expression recognition on the facial image to obtain an expression message; the text message acquisition module 740 is used to input the text message and the expression message to the first model, which obtains an answer text message according to the text message and the expression message; and the voice message acquisition module 750 is used to perform voice conversion on the answer text message to obtain the corresponding answer voice message.
  • In this way, the facial image is added to the chat robot's input.
  • The answer text message is obtained by the first model according to the text message and the expression message, and the answer text message is converted into a voice reply message, thereby improving the accuracy of the voice reply message.
  • The expression recognition-based voice message generation device of this embodiment is used to execute the expression recognition-based voice message generation method of the above embodiment; its specific processing is the same as that of the method and is not repeated here.
  • The embodiment of the present application also provides a computer device, including:
  • at least one processor; and
  • a memory storing instructions executable by the at least one processor, such that when the at least one processor executes the instructions, a voice message generation method based on expression recognition is implemented.
  • The voice message generation method includes: acquiring voice data and a corresponding facial image; performing speech recognition on the voice data to obtain a text message; performing expression recognition on the facial image to obtain an expression message; inputting the text message and the expression message to the first model, which obtains an answer text message according to the text message and the expression message; and performing voice conversion on the answer text message to obtain the corresponding answer voice message.
  • the computer device includes: a processor 810 , a memory 820 , an input/output interface 830 , a communication interface 840 and a bus 850 .
  • The processor 810 can be implemented by a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is used to execute related programs to realize the technical solutions provided by the embodiments of the present application.
  • the memory 820 may be implemented in the form of a read-only memory (Read Only Memory, ROM), a static storage device, a dynamic storage device, or a random access memory (Random Access Memory, RAM).
  • the memory 820 can store an operating system and other application programs.
  • the relevant program codes are stored in the memory 820, and are called by the processor 810 to execute the expression recognition-based voice message generation method of the embodiments of the present application;
  • the input/output interface 830 is used to realize information input and output
  • the communication interface 840 is used to realize communication between this device and other devices; communication can be realized in a wired manner (such as USB or network cable) or wirelessly (such as a mobile network, Wi-Fi, or Bluetooth); and
  • bus 850 to transfer information between various components of the device (eg, processor 810, memory 820, input/output interface 830, and communication interface 840);
  • the processor 810 , the memory 820 , the input/output interface 830 and the communication interface 840 are connected to each other within the device through the bus 850 .
  • An embodiment of the present application further provides a storage medium, the storage medium is a computer-readable storage medium, and the computer-readable storage medium stores computer-executable instructions.
  • The computer-executable instructions are used to make the computer execute a voice message generation method based on expression recognition: acquiring voice data and a corresponding facial image; performing speech recognition on the voice data to obtain a text message; performing expression recognition on the facial image to obtain an expression message; inputting the text message and the expression message to the first model, which obtains an answer text message; and performing voice conversion on the answer text message to obtain the corresponding answer voice message.
  • the computer-readable storage medium may be non-volatile or volatile.
  • memory can be used to store non-transitory software programs and non-transitory computer-executable programs.
  • the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage devices.
  • the memory optionally includes memory located remotely from the processor, and these remote memories may be connected to the processor via a network. Examples of the aforementioned networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • The flowcharts shown in FIG. 1 to FIG. 7 do not limit the embodiments of the present application; the embodiments may include more or fewer steps than illustrated, combine some steps, or use different steps.
  • The device embodiments described above are only illustrative. The units described as separate components may or may not be physically separated, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.
  • If the integrated unit is realized in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • Based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, can be embodied in the form of a software product.
  • The computer software product is stored in a storage medium and includes multiple instructions to make a computer device (which can be a personal computer, a server, a network device, etc.) execute all or part of the steps of the methods described in the embodiments of this application.
  • The aforementioned storage media include: USB flash drives, removable hard disks, read-only memory (ROM), random access memory (RAM), magnetic disks, optical discs, and other media that can store programs.

Abstract

Embodiments of the present application provide a voice message generation method and apparatus based on facial expression recognition, a computer device, and a storage medium, and relate to the technical field of artificial intelligence. The voice message generation method based on facial expression recognition comprises: obtaining voice data and a corresponding facial image, performing voice recognition on the voice data to obtain a text message, and performing facial expression recognition on the facial image to obtain a facial expression message; inputting the text message and the facial expression message into a first model, and obtaining an answer text message by means of the first model according to the text message and the facial expression message; and finally performing voice conversion on the answer text message to obtain a corresponding answer voice message. According to the embodiments of the present application, the facial image is added into the chat robot's input; by recognizing the facial image, the current scene can be determined more accurately; the first model obtains the answer text message according to the text message and the facial expression message; and the answer text message is converted into a voice reply message, such that the accuracy of the voice reply message is improved.

Description

语音消息生成方法和装置、计算机设备、存储介质Voice message generation method and device, computer equipment, storage medium
本申请要求于2022年01月18日提交中国专利局、申请号为202210057040.4,发明名称为“语音消息生成方法和装置、计算机设备、存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application claims the priority of the Chinese patent application with the application number 202210057040.4 submitted to the China Patent Office on January 18, 2022, and the invention title is "Voice Message Generation Method and Device, Computer Equipment, Storage Medium", the entire content of which is incorporated in this application by reference.
技术领域technical field
本申请涉及人工智能技术领域,尤其涉及一种语音消息生成方法和装置、计算机设备、存储介质。The present application relates to the technical field of artificial intelligence, and in particular to a voice message generation method and device, computer equipment, and storage media.
背景技术Background technique
随着计算机技术的发展,如即时通信工具、手机短信等通讯手段日渐风行。基于这些通讯手段,除了实现人与人之间的沟通交流外,也使得人与人工智能系统之间的沟通交流成为可能。例如,聊天机器人就是一种借助于通讯手段实现与人沟通交流的人工智能系统。With the development of computer technology, communication means such as instant messaging tools and mobile phone text messages are becoming more and more popular. Based on these means of communication, in addition to realizing communication between people, it also makes it possible to communicate between people and artificial intelligence systems. For example, a chatbot is an artificial intelligence system that communicates with people by means of communication.
At present, chatbots fall into two types: active interaction and passive interaction. In active interaction, the robot initiates the exchange, interacting with humans by sharing or recommending trending information of interest to the user. In passive interaction, the user initiates the dialogue, and the machine understands it and responds accordingly.
Most chatbots that users currently encounter are of the passive interaction type. The inventors realized that the interaction capability of current passive chatbots is rather limited: they can only answer according to the text recognized from the user's speech, and this single recognition method often compromises the accuracy of the voice reply messages the chatbot generates.
Technical Problem
The technical problem of the prior art recognized by the inventors is as follows: the interaction capability of current passive chatbots is limited to answering according to the text recognized from the user's speech, and this single recognition method often compromises the accuracy of the generated voice reply messages.
Technical Solution
In a first aspect, an embodiment of the present application provides a voice message generation method based on expression recognition, the method comprising:
obtaining voice data and a corresponding facial image;
performing voice recognition on the voice data to obtain a text message;
performing expression recognition on the facial image to obtain an expression message;
inputting the text message and the expression message into a first model, the first model obtaining an answer text message according to the text message and the expression message; and
performing voice conversion on the answer text message to obtain a corresponding answer voice message.
In a second aspect, an embodiment of the present application provides a voice message generation apparatus based on expression recognition, comprising:
a data acquisition module, configured to obtain voice data and a corresponding facial image;
a voice recognition module, configured to perform voice recognition on the voice data to obtain a text message;
an expression recognition module, configured to perform expression recognition on the facial image to obtain an expression message;
a text message acquisition module, configured to input the text message and the expression message into a first model, the first model obtaining an answer text message according to the text message and the expression message; and
a voice message acquisition module, configured to perform voice conversion on the answer text message to obtain a corresponding answer voice message.
In a third aspect, an embodiment of the present application provides a computer device comprising a memory and a processor, wherein the memory stores a program, and when the program is executed by the processor, the processor performs a voice message generation method based on expression recognition, the method comprising:
obtaining voice data and a corresponding facial image;
performing voice recognition on the voice data to obtain a text message;
performing expression recognition on the facial image to obtain an expression message;
inputting the text message and the expression message into a first model, the first model obtaining an answer text message according to the text message and the expression message; and
performing voice conversion on the answer text message to obtain a corresponding answer voice message.
In a fourth aspect, an embodiment of the present application provides a storage medium, which is a computer-readable storage medium storing computer-executable instructions that cause a computer to execute a voice message generation method based on expression recognition, the method comprising:
obtaining voice data and a corresponding facial image;
performing voice recognition on the voice data to obtain a text message;
performing expression recognition on the facial image to obtain an expression message;
inputting the text message and the expression message into a first model, the first model obtaining an answer text message according to the text message and the expression message; and
performing voice conversion on the answer text message to obtain a corresponding answer voice message.
Beneficial Effects
The expression recognition-based voice message generation method and apparatus, computer device, and storage medium provided in the embodiments of the present application obtain voice data and a corresponding facial image, perform voice recognition on the voice data to obtain a text message, and perform expression recognition on the facial image to obtain an expression message; the text message and the expression message are input into a first model, which obtains an answer text message from them; finally, voice conversion is performed on the answer text message to obtain a corresponding answer voice message. By adding the facial image to the chatbot, the current scene can be judged more accurately through recognition of the facial image; the first model derives the answer text message from both the text message and the expression message, and the answer text message is converted into a voice reply message, thereby improving the accuracy of the voice reply message.
Brief Description of the Drawings
The accompanying drawings are provided for a further understanding of the technical solution of the present application and constitute a part of the specification. Together with the embodiments, they serve to explain the technical solution of the present application and do not limit it.
Fig. 1 is a first flowchart of the expression recognition-based voice message generation method provided by an embodiment of the present application;
Fig. 2 is a flowchart of step S200 in Fig. 1;
Fig. 3 is a flowchart of step S300 in Fig. 1;
Fig. 4 is a flowchart of step S330 in Fig. 3;
Fig. 5 is a flowchart of step S500 in Fig. 1;
Fig. 6 is a second flowchart of the expression recognition-based voice message generation method provided by an embodiment of the present application;
Fig. 7 is a flowchart of a practical application of the expression recognition-based voice message generation method provided by an embodiment of the present application;
Fig. 8 is a block diagram of the module structure of the expression recognition-based voice message generation apparatus provided by an embodiment of the present application;
Fig. 9 is a schematic diagram of the hardware structure of a computer device provided by an embodiment of the present application.
Embodiments of the Present Invention
To make the purpose, technical solution, and advantages of the present application clearer, the present application is further described in detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are intended only to explain the present application, not to limit it.
It should be noted that although functional modules are divided in the device schematic and a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed with a module division different from that of the device, or in an order different from that of the flowcharts. The terms "first", "second", and the like in the specification, claims, and drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field to which this application belongs. The terms used herein are only for the purpose of describing the embodiments of the present application and are not intended to limit it.
First, several terms involved in this application are explained:
Artificial intelligence (AI): a new technical science that studies and develops theories, methods, technologies, and application systems for simulating, extending, and expanding human intelligence. As a branch of computer science, artificial intelligence attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can respond in a manner similar to human intelligence; research in this field includes robotics, speech recognition, image recognition, natural language processing, and expert systems. Artificial intelligence can simulate the information processes of human consciousness and thinking. It is also the theory, method, technology, and application system of using digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results.
Chatbot (Chatterbot): a computer program that converses via dialogue or text, able to simulate human conversation and pass the Turing test. Chatbots can serve practical purposes such as customer service or information acquisition. Some chatbots carry natural language processing systems, but most simple systems only extract keywords from the input and retrieve the most suitable response from a database. Chatbots form part of virtual assistants (such as Google Assistant) and can connect with many organizations' applications, websites, and instant messaging platforms (e.g., Facebook Messenger). Non-assistant applications include chat rooms for entertainment, research and specific product promotion, and social bots.
Convolutional neural network (CNN): a class of feedforward neural networks that involve convolution computation and have a deep structure, and one of the representative algorithms of deep learning. Convolutional neural networks have representation learning capability and can perform shift-invariant classification of input information according to their hierarchical structure. With the introduction of deep learning theory and improvements in numerical computing hardware, convolutional neural networks have developed rapidly and are applied in computer vision, natural language processing, and other fields. Modeled on the biological mechanism of visual perception, they support both supervised and unsupervised learning; the sharing of convolution kernel parameters within hidden layers and the sparsity of inter-layer connections enable a convolutional neural network to learn grid-like topology features, such as pixels and audio, with a small amount of computation, with stable effect and without additional feature engineering requirements on the data.
Recurrent neural network (RNN): a class of recursive neural networks that take sequence data as input, recurse along the direction of sequence evolution, and have all nodes (recurrent units) connected in a chain; the bidirectional recurrent neural network (Bi-RNN) and the long short-term memory network (LSTM) are common recurrent neural networks. Recurrent neural networks have memory, parameter sharing, and Turing completeness, and therefore hold certain advantages when learning the nonlinear characteristics of sequences. They are applied in natural language processing (NLP), for example speech recognition, language modeling, and machine translation, and are also used in various kinds of time series forecasting. Recurrent neural networks that incorporate convolutional neural network constructions can handle computer vision problems involving sequence input.
Gated recurrent unit (GRU): a gating mechanism in recurrent neural networks. Like other gating mechanisms, it aims to solve the vanishing/exploding gradient problem of standard RNNs while retaining the long-term information of the sequence. GRU performs as well as LSTM on many sequence tasks such as speech recognition, but has fewer parameters, containing only a reset gate and an update gate.
CTC (Connectionist temporal classification): a loss function for sequence labeling problems, mainly used to handle the alignment of input and output labels. Traditional sequence labeling algorithms require the input and output symbols to be perfectly aligned at every moment, whereas CTC expands the label set by adding a blank element. After a sequence is labeled with the expanded label set, every predicted sequence that can be converted into the true sequence through the mapping function is a correct prediction; that is, the predicted sequence can be obtained without data alignment. The objective function maximizes the sum of the probabilities of all correct predicted sequences, and a forward-backward algorithm is used when searching for them.
Region of interest (ROI): in machine vision and image processing, the region to be processed, outlined on the image in the form of a box, circle, ellipse, irregular polygon, or the like.
OpenCV: a cross-platform computer vision and machine learning software library released under the Apache 2.0 (open source) license that runs on Linux, Windows, Android, and Mac OS. It is lightweight and efficient, consists of a series of C functions and a small number of C++ classes, provides interfaces for languages such as Python, Ruby, and MATLAB, and implements many general-purpose algorithms in image processing and computer vision. OpenCV is written in C++; it has C++, Python, Java, and MATLAB interfaces, supports Windows, Linux, Android, and Mac OS, leans mainly toward real-time vision applications, takes advantage of MMX and SSE instructions when available, and now also offers support for C#, Ch, Ruby, and Go.
VGG model (Visual Geometry Group Network): a network from work related to ILSVRC 2014, whose main contribution was to demonstrate that increasing network depth can, to a certain extent, affect the network's final performance. VGG has two structures, VGG16 and VGG19, which do not differ in essence but only in network depth. One improvement of VGG16 over AlexNet is the replacement of AlexNet's larger convolution kernels (11x11, 7x7, 5x5) with several consecutive 3x3 kernels. For a given receptive field (the local size of the input image related to the output), stacked small convolution kernels are preferable to a large kernel, because multiple nonlinear layers increase network depth to ensure that more complex patterns are learned, at a relatively small cost (fewer parameters).
Embedding: a vector representation in which an object is represented by a low-dimensional vector; the object may be a word, a product, a movie, and so on. The property of the embedding vector is that objects whose vectors are close in distance have similar meanings: for example, the distance between embedding("The Avengers") and embedding("Iron Man") is very small, while the distance between embedding("The Avengers") and embedding("Gone with the Wind") is larger. An embedding is essentially a mapping from semantic space to vector space that preserves, as far as possible, the relationship of the original samples in the semantic space; for example, two semantically close words are also relatively close in the vector space. Embeddings can encode objects with low-dimensional vectors while retaining their meaning, and are often applied in machine learning: in the process of building a model, an object is encoded as a low-dimensional dense vector and then passed to a DNN to improve efficiency.
Cross entropy: an important concept in Shannon's information theory, mainly used to measure the difference between two probability distributions. The performance of a language model is usually measured by cross entropy and perplexity. Cross entropy can be interpreted as the difficulty of recognizing a text with the model or, from a compression point of view, how many bits are needed on average to encode each word. Perplexity indicates the average number of branches the model assigns to the text; its reciprocal can be regarded as the average probability of each word. Smoothing refers to assigning a probability value to unobserved N-gram combinations so that a word sequence can always obtain a probability value from the language model. Commonly used smoothing techniques include Turing estimation, deleted interpolation smoothing, Katz smoothing, and Kneser-Ney smoothing.
jieba tokenizer: the jieba ("stutter") tokenizer is an open-source Chinese word segmenter. Chinese word segmentation is a basic step in Chinese text processing and a basic module of Chinese human-computer natural language interaction; Chinese natural language processing usually requires word segmentation first, and the jieba tokenizer is commonly used for it. The jieba algorithm uses a prefix dictionary to achieve efficient word-graph scanning, generating a directed acyclic graph (DAG) of all possible word formations of the Chinese characters in a sentence; dynamic programming then finds the maximum-probability path and the maximum segmentation combination based on word frequency. For unregistered words, an HMM model based on the word-forming capability of Chinese characters is adopted, using the Viterbi algorithm. jieba supports three segmentation modes: the first is precise mode, which tries to cut the sentence most accurately and is suitable for text analysis; the second is full mode, which scans out all words in the sentence that can form words, which is very fast but cannot resolve ambiguity; the third is search engine mode, which, based on precise mode, further segments long words to improve recall and is suitable for search engine segmentation.
Analyzer tokenizer: an Analyzer is a component specializing in tokenization, generally comprising three parts: Character Filters, a Tokenizer (splitting into words according to rules), and Token Filters. Character Filters mainly preprocess the original text, for example removing HTML and special characters; the Tokenizer splits the text into words according to rules; Token Filters process the resulting words, including lowercasing, removing stopwords, and adding synonyms.
Encoder: encoding converts an input sequence into a fixed-length vector; decoding (decoder) converts the previously generated fixed vector back into an output sequence. The input sequence may be text, speech, an image, or video; the output sequence may be text or an image.
word2vec (word to vector): a group of related models used to produce word vectors. These models are shallow two-layer neural networks trained to reconstruct linguistic word contexts. The network is represented by words and must guess the input words at adjacent positions; under the bag-of-words assumption in word2vec, word order is unimportant. After training, the word2vec model can be used to map each word to a vector representing word-to-word relationships; this vector is the hidden layer of the neural network.
Self-attention mechanism (Attention Mechanism): an attention mechanism gives a neural network the ability to focus on a subset of its inputs (or features) by selecting specific inputs, and can be applied to any type of input regardless of its shape. Under limited computing power, the attention mechanism is a resource allocation scheme and the main means of solving the problem of information overload, allocating computing resources to the more important tasks.
Seq2Seq: an important RNN model, also known as the Encoder-Decoder model, which can be understood as an N×M model. The model contains two parts: the Encoder encodes the sequence information, compressing a sequence of arbitrary length into a vector c; the Decoder then decodes the context vector c and outputs it as a sequence.
Short-time Fourier transform (STFT): the ordinary Fourier transform is only suitable for stationary signals, whereas, for example, the whistle signals of dolphins are non-stationary signals whose frequency characteristics vary over time. To capture such time-varying features, time-frequency analysis of the signal is required; the short-time Fourier transform, the wavelet transform, the Hilbert-Huang transform, and the like are commonly used.
Mel-frequency cepstrum: a linear transform of the logarithmic energy spectrum based on the nonlinear mel scale of sound frequency. Mel-frequency cepstral coefficients (MFCCs) are the coefficients that make up the mel-frequency cepstrum, derived from the cepstrum of an audio segment. The difference between the cepstrum and the mel-frequency cepstrum is that the frequency bands of the latter are equally spaced on the mel scale, which approximates the human auditory system more closely than the linearly spaced frequency bands used in the normal log cepstrum. Such a nonlinear representation allows sound signals to be better represented in multiple domains.
Griffin-Lim: a vocoder, often used in speech synthesis to convert the acoustic parameters generated by a speech synthesis system into a speech waveform. This vocoder requires no training and no prior knowledge of the phase spectrum; instead, it estimates the phase information from the relationship between frames, thereby reconstructing the speech waveform.
Softmax classifier: the generalization of the logistic regression classifier to multiple classes; its output is the probability of belonging to each of the different categories.
With the development of computer technology, communication means such as instant messaging tools and mobile phone text messages have become increasingly popular. Beyond enabling communication between people, these means also make communication between people and artificial intelligence systems possible; for example, a chatbot is an artificial intelligence system that communicates with people through such means. Most chatbots that users currently encounter are of the passive interaction type, whose interaction capability is rather limited: they can only answer according to the text recognized from the user's speech, and this single recognition method often compromises the accuracy of the voice reply messages the chatbot generates.
On this basis, the embodiments of the present application provide a voice message generation method and apparatus based on expression recognition, a computer device, and a storage medium, which can improve the accuracy of the generated voice reply messages.
The embodiments of the present application provide a voice message generation method and apparatus based on expression recognition, a computer device, and a storage medium, which are specifically described through the following embodiments, beginning with the expression recognition-based voice message generation method.
The expression recognition-based voice message generation method provided in the embodiments of the present application relates to the field of artificial intelligence. The method may be applied in a terminal or on a server side, or implemented as software running on a terminal or server. In some embodiments, the terminal may be a smartphone, a tablet computer, a laptop, a desktop computer, a smart watch, or the like; the server side may be configured as an independent physical server, as a server cluster or distributed system composed of multiple physical servers, or as a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms; the software may be an application implementing the expression recognition-based voice message generation method, but the method is not limited to the above forms.
The embodiments of the present application can be used in numerous general-purpose or special-purpose computer system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments including any of the above systems or devices. The application may be described in the general context of computer-executable instructions, such as program modules, executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices linked through a communications network; in such environments, program modules may be located in both local and remote computer storage media, including storage devices.
Referring to Fig. 1, the expression recognition-based voice message generation method according to the first aspect of the embodiments of the present application includes, but is not limited to, steps S100 to S500.
Step S100, obtaining voice data and a corresponding facial image;
Step S200, performing voice recognition on the voice data to obtain a text message;
Step S300, performing expression recognition on the facial image to obtain an expression message;
Step S400, inputting the text message and the expression message into a first model, the first model obtaining an answer text message according to the text message and the expression message;
Step S500, performing voice conversion on the answer text message to obtain a corresponding answer voice message.
In step S100 of some embodiments, the voice data uttered by the user, i.e., what the user says to the chatbot, is captured through a microphone; at the same time, a camera captures images of the user speaking, specifically images of the user's face. In practical applications, some images captured by the camera may not contain the user's facial region, or may contain more than just the facial region, in which case the captured images need further screening. Specifically, images that do not contain the user's facial region may be deleted. To further improve the accuracy of expression recognition, a region of interest of the image, such as the face region, may also be detected, the face region being the region that the expressions in the embodiments of the present application need to focus on.
In some embodiments, the CascadeClassifier function in the open-source OpenCV library can also be used to automatically detect all face regions in a picture, thereby realizing face detection and localization of the image.
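For illustration only (not part of the original application), the following is a minimal sketch of how such face detection and localization might be done with OpenCV's CascadeClassifier; the cascade file, the threshold values, and the function name detect_face_regions are illustrative assumptions.

```python
import cv2

def detect_face_regions(image):
    """Return the cropped face regions of a BGR image (the list may be empty)."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Haar cascade for frontal faces shipped with opencv-python
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    # one (x, y, w, h) box per detected face; parameter values are illustrative
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return [image[y:y + h, x:x + w] for (x, y, w, h) in faces]
```

Frames for which the returned list is empty can then simply be discarded, matching the screening step described above.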
In step S200 of some embodiments, after the voice data uttered by the user is collected, the voice data also needs to be converted into text to obtain a text message.
In step S300 of some embodiments, after the user's facial image is collected, expression classification needs to be performed on the facial image: for example, determining which expression a given facial image shows and generating a corresponding expression message from that expression, such as a text vector or an image vector corresponding to the expression, for the first model to use in generating the answer text message. In the embodiments of the present application, expressions may be classified as happy, sad, angry, neutral, surprised, scared, and the like.
In step S400 of some embodiments, the text message and the expression message are input into the first model, and the first model obtains the answer text message according to the text message and the expression message.
In step S500 of some embodiments, voice conversion is performed on the answer text message to obtain the corresponding answer voice message; after the answer voice message is generated, the chatbot gives the user a corresponding spoken answer according to the voice message.
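As a reading aid only, the following hedged sketch shows how the outputs of steps S100 to S500 feed into one another, assuming the four processing stages are available as callables; all of the names are hypothetical.

```python
# all component names below are hypothetical; the sketch only illustrates
# the data flow between the five steps described above
def generate_answer_voice(voice_data, face_image, speech_to_text,
                          recognize_expression, first_model, text_to_speech):
    text_message = speech_to_text(voice_data)                     # step S200
    expression_message = recognize_expression(face_image)         # step S300
    answer_text = first_model(text_message, expression_message)   # step S400
    return text_to_speech(answer_text)                            # step S500
```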
In some embodiments, as shown in Fig. 2, step S200 specifically includes, but is not limited to, steps S210 to S250.
Step S210, performing an integral transform on the time-domain signal of the voice data to obtain a frequency-domain signal;
Step S220, constructing a planar space according to the time-domain signal and the frequency-domain signal;
Step S230, performing a convolution operation on the voice data in the planar space through a first neural network to obtain a voice sequence and its sequence length;
Step S240, slicing the voice sequence according to the sequence length to obtain multiple slice sequences;
Step S250, performing text conversion on the multiple slice sequences through a second neural network to obtain the text message.
In step S210 of some embodiments, an integral transform is performed on the time-domain signal of the voice data to obtain a frequency-domain signal. In the embodiments of the present application, the integral transform may be the Fourier transform, which converts the time-domain signal, otherwise difficult to process, into a frequency-domain signal that is easy to analyze. The function of the fast Fourier transform is to transform the digital signal from the time domain into the frequency domain, where the positions of higher energy can be analyzed; these positions may be the frequency bands in which the sounds of interest lie.
In step S220 of some embodiments, the time-domain signal and the frequency-domain signal are combined into a two-dimensional space, that is, a planar space.
In step S230 of some embodiments, a convolution operation is performed on the voice data in the planar space through the first neural network to obtain a voice sequence and its sequence length. The first neural network is composed of multiple CNNs and is used to perform the convolution operation on the voice data, yielding the voice sequence and the length of that sequence.
In step S240 of some embodiments, the voice sequence is sliced according to the sequence length. Specifically, the voice data is modeled, and during modeling the voice sequence can be cut into multiple slices to obtain slice sequences; for example, cutting the voice sequence into N slices yields N slice sequences.
In step S250 of some embodiments, text conversion is performed on the multiple slice sequences through the second neural network to obtain the text message. Specifically, the second neural network may be an RNN employing multiple GRU units; the N slices obtained in step S240 serve as the N inputs of the RNN, and the text message output by the RNN is obtained, completing the conversion of voice data into a text message. It should be noted that when the number of time steps is large or the time step is small, the gradients of an RNN tend to decay or explode. Although gradient clipping can cope with gradient explosion, it cannot solve gradient decay, which makes it difficult in practice for an RNN to capture dependencies over large time-step distances in a time series. For this reason, the embodiments of the present application use GRU units in the RNN, which better capture dependencies over large time-step distances and control the flow of information, achieving a better model training effect and making the converted text message more accurate.
In some embodiments, the first neural network and the second neural network can form a speech model capable of converting voice data into a text message. To further improve the training effect of the speech model, a loss function can be used to optimize it, for example the CTC loss function shown in formula (1), where X denotes a given segment of speech, Z denotes the text corresponding to X, Π denotes the product operation, p denotes probability, p(Z|X) denotes the probability of outputting Z given X, and L denotes the loss over the output probabilities of the texts Z corresponding to the inputs X. Minimizing the loss function amounts to maximizing the product of these probabilities; specifically, the mapping applies the strategy of de-duplicating the same letter when it appears multiple times in a row, together with the strategy of removing blanks.
$$L = -\ln \prod_{(X,Z)} p(Z|X) = -\sum_{(X,Z)} \ln p(Z|X) \qquad (1)$$
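For illustration, the following is a hedged PyTorch sketch of such a speech model: a CNN over the (time, frequency) plane as the first neural network, GRU layers over the slice sequence as the second neural network, and a CTC loss as in formula (1). The layer sizes, vocabulary size, and input shape are assumptions rather than values from the application.

```python
import torch
import torch.nn as nn

class SpeechModel(nn.Module):
    def __init__(self, n_freq=80, hidden=256, vocab_size=5000):
        super().__init__()
        # first neural network: convolution over the (time, frequency) plane
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1), nn.ReLU())
        # second neural network: GRU units over the resulting slice sequence
        self.gru = nn.GRU(32 * (n_freq // 4), hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, vocab_size + 1)  # +1 for the CTC blank

    def forward(self, spec):                    # spec: (batch, 1, time, n_freq)
        feats = self.conv(spec)                 # (batch, 32, T', n_freq // 4)
        b, c, t, f = feats.shape
        slices = feats.permute(0, 2, 1, 3).reshape(b, t, c * f)  # the N slices
        out, _ = self.gru(slices)               # each slice is one GRU input
        return self.fc(out).log_softmax(-1)     # (batch, T', vocab_size + 1)

model = SpeechModel()
spec = torch.randn(4, 1, 200, 80)               # dummy spectrogram batch
log_probs = model(spec).transpose(0, 1)         # CTC wants (T', batch, classes)
targets = torch.randint(0, 5000, (4, 30))       # dummy transcripts
loss = nn.CTCLoss(blank=5000)(                  # blank is the extra last class
    log_probs, targets,
    torch.full((4,), log_probs.size(0), dtype=torch.long),
    torch.full((4,), 30, dtype=torch.long))
```

Minimizing this loss maximizes the product of the probabilities p(Z|X); decoding then de-duplicates consecutive repetitions and removes blanks, as described above.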
In some embodiments, as shown in Fig. 3, step S300 specifically includes, but is not limited to, steps S310 to S330.
Step S310, performing self-attention screening on the facial image through a third neural network to obtain transformation parameters;
Step S320, warping the facial image according to the transformation parameters to obtain a transformed image;
Step S330, performing expression recognition on the facial image and the transformed image through a fourth neural network to obtain the expression message.
In step S310 of some embodiments, self-attention screening is performed on the facial image through the third neural network to obtain the transformation parameter, i.e., the parameter θ. In the embodiments of the present application, the third neural network refers to a self-attention network composed of two convolutional layers and two fully connected layers, which can locate the key regions of the face. Different expressions have different key regions: for example, when the user is angry, the key region of the facial expression is the brows and eyes; when the user is happy, it is the mouth; when the user is surprised, it is the mouth, the eyes, and so on. Using a self-attention network enables more accurate expression classification of facial images.
In step S320 of some embodiments, the facial image is warped according to the transformation parameter θ to obtain the transformed image. The transformation parameters can take various forms: for example, if the transformation parameter is a transformation direction, and the direction is a 90-degree clockwise rotation, the orientation of the facial image can be transformed accordingly; if the transformation parameter is a vertical flip, the facial image can be mirror-flipped accordingly, and so on, thereby determining which regions of the facial image belong to the key regions related to the expression.
In step S330 of some embodiments, feature extraction is performed on the facial image to obtain the corresponding feature vector, and the feature vector together with the transformed image obtained in step S320 is input into the fourth neural network, which outputs the expression classification result. In practical applications, a VGG-19 network can be used to extract features from the facial image. It should be noted that the classification message described in this application can take multiple forms, for example the form of an expression image or of expression text: if it is an expression image, the image is converted into a vector to obtain the expression message; if it is expression text, the text is converted into a vector to obtain the expression message.
In some embodiments, the fourth neural network includes a convolutional layer, a fully connected layer, and a classifier. As shown in Fig. 4, step S330 specifically includes, but is not limited to, steps S331 to S333.
Step S331, performing feature extraction on the facial image and the transformed image through the convolutional layer to obtain multiple image feature vectors;
Step S332, splicing the multiple image feature vectors through the fully connected layer to obtain an image splicing vector;
Step S333, performing expression classification on the image splicing vector through the classifier to obtain the expression message.
In step S331 of some embodiments, the facial image and the transformed image are input into the convolutional layer of the fourth neural network, and feature extraction is performed on them through the convolutional layer to obtain multiple image feature vectors.
In step S332 of some embodiments, the multiple image feature vectors are input into the fully connected layer, which splices them to obtain the image splicing vector.
In step S333 of some embodiments, the image splicing vector is input into the classifier, which outputs the expression classification result, and the expression message is obtained from the classification result. In practical applications, the classifier referred to in this application may be a Softmax classifier or the like.
In some embodiments, the third neural network and the fourth neural network can form an expression recognition model capable of classifying the expressions of facial images. To further improve the training effect of the expression recognition model, a loss function can be used to optimize it, for example the cross-entropy loss function shown in formula (2), where M is the number of categories, y_ic indicates the true category, and p_ic is the predicted probability that observed sample i belongs to category c among the M categories (with N denoting the number of observed samples).
$$L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{M} y_{ic} \log(p_{ic}) \qquad (2)$$
In some embodiments, feature extraction is performed through VGG-19 on the face photo, i.e., the facial image mentioned in the embodiments of the present application, to obtain the image feature vector corresponding to the facial image; at the same time, the face photo is input into the self-attention network to generate a parameter θ, from which T_θ(G) is obtained. Here T_θ(G) is equivalent to applying an affine transformation to the input, with θ as the transformation parameters; this amounts to generating a warped sample of the input face photo, i.e., the transformed image, which helps the neural network find the important expression-related regions of the face. Next, feature extraction is performed on the transformed image to obtain its corresponding image feature vector. Finally, the image feature vector of the facial image and that of the transformed image are input into two fully connected layers, which output the expression classification result. The embodiments of the present application introduce an attention mechanism that can locate different key regions of the face according to different expressions, making the neural network focus on the expression-related regions of the face and making expression recognition more accurate.
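A hedged PyTorch sketch of this flow follows: a small localization network (standing in for the two-convolution, two-FC self-attention network) predicts the affine parameters θ, the face is warped as T_θ(G) through an affine sampling grid, and the features of the original and warped images are concatenated and classified. The tiny feature extractor merely stands in for VGG-19, and all layer sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpressionModel(nn.Module):
    def __init__(self, n_classes=6):
        super().__init__()
        # third neural network: two conv + two FC layers predicting theta;
        # in practice the last layer is usually initialized to the identity
        self.loc = nn.Sequential(
            nn.Conv2d(3, 8, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(8, 16, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(16 * 4 * 4, 64), nn.ReLU(),
            nn.Linear(64, 6))                        # 6 affine parameters
        # small CNN standing in for the VGG-19 feature extractor
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # fourth-network head: splicing + fully connected layers + classifier
        self.head = nn.Sequential(
            nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, n_classes))

    def forward(self, face):                         # face: (batch, 3, H, W)
        theta = self.loc(face).view(-1, 2, 3)        # transformation params
        grid = F.affine_grid(theta, face.size(), align_corners=False)
        warped = F.grid_sample(face, grid, align_corners=False)  # T_theta(G)
        spliced = torch.cat([self.features(face),
                             self.features(warped)], dim=1)
        return self.head(spliced)                    # logits per expression
```

Training would apply the cross-entropy of formula (2) (e.g., nn.CrossEntropyLoss) to the returned logits.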
In some embodiments, as shown in Fig. 5, step S500 specifically includes, but is not limited to, steps S510 to S550.
Step S510, performing voice conversion on the answer text message to obtain a preliminary voice message;
Step S520, transforming the preliminary voice message to obtain a spectrogram;
Step S530, extracting the audio features of the spectrogram;
Step S540, decoding the audio features through a fifth neural network model to obtain the audio data corresponding to each frame;
Step S550, synthesizing the audio data to obtain the corresponding answer voice message.
In step S510 of some embodiments, voice conversion is performed on the answer text message to obtain a preliminary voice message; in practical applications, the conversion can be performed by software such as OCR text recognition.
In step S520 of some embodiments, the preliminary voice message is transformed to obtain a spectrogram. Specifically, the preliminary voice message refers to the sound signal corresponding to the answer text message, and the STFT can convert this sound signal into a corresponding two-dimensional signal, yielding the spectrogram. The principle of the STFT is to divide a long signal into frames, apply windows, perform a Fourier transform (FFT) on each frame, and finally stack the per-frame results along another dimension to obtain a two-dimensional signal form similar to an image, thereby obtaining the corresponding spectrogram.
In step S530 of some embodiments, an encoder is used to extract the MFCC audio features of the spectrogram.
In step S540 of some embodiments, a fifth neural network based on the self-attention mechanism is used to decode the audio features to obtain the audio data corresponding to each frame. Specifically, the fifth neural network is an RNN composed of two GRU network layers, each containing 256 GRU units.
In step S550 of some embodiments, since generating audio from the spectrum requires considering the pattern of phase changes between consecutive frames, after the audio corresponding to each frame is obtained, the Griffin-Lim reconstruction algorithm is used to fine-tune the phase changes between consecutive frames and then generate consecutive frames of audio, obtaining the corresponding answer voice message. It should be noted that when the phase change between consecutive frames is large, an intermediate phase needs to be found so that the phase change of the consecutive audio frames is not too large, which would otherwise affect the quality of the generated answer voice message. In addition, the embodiments of the present application can also vary output audio parameters such as intonation according to different expressions, so that the robot gives more fitting answers.
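For illustration, the following sketch reproduces steps S520, S530, and S550 with the librosa library (an assumption; the application names no library): STFT for the spectrogram, MFCC feature extraction, and Griffin-Lim phase reconstruction. The parameter values are illustrative.

```python
import numpy as np
import librosa

def spectrogram_and_back(y, sr=16000, n_fft=1024, hop=256):
    # step S520: frame and window the signal, FFT each frame, and stack the
    # frames along another dimension to get a spectrogram (STFT magnitude)
    spec = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))
    # step S530: MFCC audio features derived from the spectrogram
    mel = librosa.feature.melspectrogram(S=spec ** 2, sr=sr)
    mfcc = librosa.feature.mfcc(S=librosa.power_to_db(mel), sr=sr, n_mfcc=13)
    # step S550: Griffin-Lim estimates the phase from the relationship
    # between frames and reconstructs a waveform; no training is required
    y_rec = librosa.griffinlim(spec, n_iter=32, hop_length=hop)
    return spec, mfcc, y_rec

# usage on one second of noise as a stand-in input (illustrative only)
y = np.random.randn(16000).astype(np.float32)
spec, mfcc, y_rec = spectrogram_and_back(y)
```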
In some embodiments, as shown in Fig. 6, before step S400, the method further includes a step of constructing the first model, which specifically includes, but is not limited to, steps S610 to S650.
Step S610, obtaining a message data set;
Step S620, segmenting multiple question sample data to obtain multiple question word-segmentation data;
Step S630, segmenting multiple answer sample data to obtain multiple answer word-segmentation data;
Step S640, obtaining a first original model;
Step S650, training the first original model according to the multiple question word-segmentation data, the multiple answer word-segmentation data, and multiple preset expressions to obtain the first model.
In step S610 of some embodiments, a message data set used for model training is obtained. The message data set includes multiple question sample data, multiple preset expressions, and multiple answer sample data; the question sample data and the preset expressions correspond one to one to form binding groups, and each binding group has a mapping relationship with answer sample data.
In step S620 of some embodiments, the Chinese word segmentation tool jieba or Analyzer is used to segment the multiple question sample data to obtain multiple question word-segmentation data.
在一些实施例的步骤S630中,采用中文分词工具jieba或者Analyzer对多个回答样本数据进行分词处理,得到多个回答分词数据。In step S630 of some embodiments, a Chinese word segmentation tool jieba or Analyzer is used to perform word segmentation processing on multiple answer sample data to obtain multiple answer word segmentation data.
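Both segmentation steps can be illustrated with the jieba tool named above; the sample sentences are invented for illustration:

```python
import jieba

question_samples = ["今天过得怎么样", "你喜欢什么电影"]
answer_samples = ["听起来你心情不错，继续保持！", "我喜欢科幻电影。"]

# jieba.lcut returns the segmentation of a sentence as a list of tokens.
question_tokens = [jieba.lcut(q) for q in question_samples]
answer_tokens = [jieba.lcut(a) for a in answer_samples]
```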
在一些实施例的步骤S640中,获取第一原始模型,其中第一原始模型具体可以为Seq2seq模型,该模型还没有经过训练。In step S640 of some embodiments, the first original model is obtained, where the first original model may specifically be a Seq2seq model, which has not been trained.
在一些实施例的步骤S650中,根据多个问题分词数据、多个回答分词数据和多个预设表情对第一原始模型进行训练,得到第一模型。In step S650 of some embodiments, the first original model is trained according to a plurality of question word segmentation data, a plurality of answer word segmentation data and a plurality of preset expressions to obtain a first model.
在一些实施例中,步骤S650还包括但不限于如下步骤:In some embodiments, step S650 also includes but not limited to the following steps:
将多个问题分词数据和多个回答分词数据输入至编码器进行第一编码,得到样本编码数据;Inputting a plurality of question word segmentation data and a plurality of answer word segmentation data into an encoder for first encoding to obtain sample encoding data;
将多个预设表情输入至编码器进行第二编码,得到表情编码数据;Inputting a plurality of preset expressions into the encoder for second encoding to obtain expression encoding data;
对样本编码数据和表情编码数据进行拼接,得到样本拼接数据;Splicing the sample coded data and the expression coded data to obtain the sample spliced data;
将样本拼接数据输入至解码器进行解码,得到样本解码数据;Input the sample splicing data to the decoder for decoding to obtain sample decoded data;
根据样本拼接数据和样本解码数据,计算第一原始模型的损失函数,得到损失值;Calculate the loss function of the first original model according to the sample splicing data and the sample decoding data to obtain a loss value;
根据损失值更新第一原始模型,得到第一模型。The first original model is updated according to the loss value to obtain the first model.
更具体地，将多个问题分词数据和多个回答分词数据输入至编码器进行第一编码，得到样本编码数据。其中，编码器指的是word2vec，所生成的样本编码数据为词嵌入向量。同时，将多个预设表情输入至word2vec进行第二编码，得到表情编码数据。接着，将样本编码数据和表情编码数据输入至Seq2seq模型中，进行训练。具体地，通过Seq2seq模型对样本编码数据和表情编码数据进行拼接，得到样本拼接数据，将样本拼接数据输入至解码器进行解码，得到样本解码数据；根据样本拼接数据和样本解码数据，计算第一原始模型的损失函数，例如交叉熵损失函数，得到损失值；根据损失值更新第一原始模型，得到第一模型。为了解决Seq2seq中解码器只接受编码器最后一个输出，而远离了之前的输出导致的信息丢失问题，本申请实施例还使用了attention模型，将注意力集中在问题的一些关键位置。More specifically, a plurality of question word segmentation data and a plurality of answer word segmentation data are input into an encoder for first encoding to obtain sample encoding data. Here, the encoder refers to word2vec, and the generated sample encoding data are word embedding vectors. At the same time, a plurality of preset expressions are input into word2vec for second encoding to obtain expression encoding data. Next, the sample encoding data and the expression encoding data are input into the Seq2seq model for training. Specifically, the sample encoding data and the expression encoding data are spliced through the Seq2seq model to obtain sample spliced data, and the sample spliced data are input to the decoder for decoding to obtain sample decoded data; according to the sample spliced data and the sample decoded data, the loss function of the first original model, for example a cross-entropy loss function, is calculated to obtain a loss value; and the first original model is updated according to the loss value to obtain the first model. To address the information loss caused by the Seq2seq decoder accepting only the encoder's final output and thus drifting away from earlier outputs, the embodiments of the present application also use an attention model to focus attention on key positions of the question.
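The following PyTorch sketch shows one plausible shape for this training step under the stated design (word embeddings for tokens and expressions, splicing of the expression encoding, an attention-equipped decoder, and a cross-entropy loss). All dimensions, the vocabulary size, and the expression count are assumptions, and a pre-trained word2vec table could be loaded into the embedding layer:

```python
import torch
import torch.nn as nn

class ExpressionSeq2seq(nn.Module):
    def __init__(self, vocab_size=30000, emb_dim=300, hid_dim=256, n_expr=7):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)   # stand-in for word2vec
        self.expr_embed = nn.Embedding(n_expr, emb_dim)  # expression encoding
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        # Attention lets the decoder attend to every encoder position instead
        # of only the encoder's final output.
        self.attn = nn.MultiheadAttention(hid_dim, num_heads=1, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, question, expression, answer_in):
        # question: (B, Tq) token ids; expression: (B,) ids; answer_in: (B, Ta).
        q = self.embed(question)
        e = self.expr_embed(expression).unsqueeze(1)              # (B, 1, emb_dim)
        enc_out, enc_h = self.encoder(torch.cat([e, q], dim=1))   # splice expression
        dec_out, _ = self.decoder(self.embed(answer_in), enc_h)
        ctx, _ = self.attn(dec_out, enc_out, enc_out)
        return self.out(ctx)                                      # (B, Ta, vocab_size)

# Training step: cross-entropy between decoded logits and the target answer.
# loss = nn.CrossEntropyLoss()(logits.reshape(-1, 30000), target.reshape(-1))
```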
在一些实施例中，如图7所示，本申请采用了多个模块以实现基于表情识别的语音消息生成方法的过程。具体地，模块包括：语音识别模块、表情识别模块、文本理解模块和语音转换模块，具体地，方法包括：语音识别模块识别用户对聊天机器人说话的语音信息，且将语音信息转换成对应的文本。与此同时，摄像头获取用户说话时的图像，并捕捉人脸区域，得到人脸区域图像，将人脸区域图像输入至表情识别模块，由表情识别模块识别出对应的表情。将语音识别模块得到的文本以及表情识别模块得到的表情输入至文本理解模块中，由文本理解模块根据文本和表情生成文本回答。将文本回答输入至语音转换模块中，生成语音回答，由此完成基于表情识别的语音消息生成方法的过程。In some embodiments, as shown in FIG. 7 , the present application adopts multiple modules to realize the process of the voice message generation method based on expression recognition. Specifically, the modules include: a speech recognition module, an expression recognition module, a text understanding module, and a voice conversion module. The method proceeds as follows: the speech recognition module recognizes the voice information the user speaks to the chat robot and converts the voice information into corresponding text. At the same time, the camera captures an image of the user while speaking, extracts the face region to obtain a face region image, and inputs the face region image to the expression recognition module, which recognizes the corresponding expression. The text obtained by the speech recognition module and the expression obtained by the expression recognition module are input into the text understanding module, which generates a text answer according to the text and the expression. The text answer is input into the voice conversion module to generate a voice answer, thereby completing the process of the expression recognition-based voice message generation method.
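A hypothetical orchestration of the four modules in FIG. 7; the module objects and their method names are invented for illustration and do not come from this application:

```python
def generate_voice_reply(user_audio, camera_frame,
                         asr, expression_recognizer, text_understander, tts):
    # Speech recognition module: user's speech -> text.
    text = asr.transcribe(user_audio)
    # Camera frame -> face region image -> expression label.
    face_region = expression_recognizer.detect_face(camera_frame)
    expression = expression_recognizer.classify(face_region)
    # Text understanding module: (text, expression) -> text answer.
    answer_text = text_understander.answer(text, expression)
    # Voice conversion module: text answer -> voice answer.
    return tts.synthesize(answer_text)
```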
本申请实施例提出的基于表情识别的语音消息生成方法,通过获取语音数据及其对应的面部图像,对语音数据进行语音识别得到文本消息,并对面部图像进行表情识别得到表情消息;将文本消息和表情消息输入至第一模型,由第一模型根据文本消息和表情消息得到回答文本消息,最后对回答文本消息进行语音转换,得到对应的回答语音消息。本申请实施例将面部图像加入到聊天机器人中,通过对面部图像的识别,能够更加精准判断出当前的情景,并由第一模型根据文本消息和表情消息得到回答文本消息,且将回答文本消息转换成语音回复消息,进而提高语音回复消息的准确率。The voice message generation method based on expression recognition proposed in the embodiment of the present application obtains voice data and its corresponding facial image, conducts voice recognition on the voice data to obtain a text message, and performs expression recognition on the facial image to obtain an expression message; the text message and the expression message are input into the first model, and the first model obtains the answer text message according to the text message and the expression message, and finally performs voice conversion on the answer text message to obtain the corresponding answer voice message. In the embodiment of the present application, the face image is added to the chat robot. Through the recognition of the face image, the current situation can be judged more accurately, and the answer text message is obtained by the first model according to the text message and the emoticon message, and the answer text message is converted into a voice reply message, thereby improving the accuracy of the voice reply message.
本申请实施例还提供一种基于表情识别的语音消息生成装置，如图8所示，可以实现上述基于表情识别的语音消息生成方法，该基于表情识别的语音消息生成装置包括：数据采集模块710、语音识别模块720、表情识别模块730、文本消息获取模块740和语音消息获取模块750，其中数据采集模块710用于获取语音数据及其对应的面部图像；语音识别模块720用于对语音数据进行语音识别得到文本消息；表情识别模块730用于对面部图像进行表情识别得到表情消息；文本消息获取模块740用于将文本消息和表情消息输入至第一模型，第一模型根据文本消息和表情消息得到回答文本消息；语音消息获取模块750用于对回答文本消息进行语音转换，得到对应的回答语音消息。本申请实施例将面部图像加入到聊天机器人中，通过对面部图像的识别，能够更加精准判断出当前的情景，并由第一模型根据文本消息和表情消息得到回答文本消息，且将回答文本消息转换成语音回复消息，进而提高语音回复消息的准确率。The embodiments of the present application also provide a voice message generation device based on expression recognition. As shown in FIG. 8 , it can implement the above voice message generation method based on expression recognition. The device includes: a data acquisition module 710, a voice recognition module 720, an expression recognition module 730, a text message acquisition module 740, and a voice message acquisition module 750, wherein the data acquisition module 710 is used to acquire voice data and its corresponding facial image; the voice recognition module 720 is used to perform voice recognition on the voice data to obtain a text message; the expression recognition module 730 is used to perform expression recognition on the facial image to obtain an expression message; the text message acquisition module 740 is used to input the text message and the expression message into the first model, and the first model obtains an answer text message according to the text message and the expression message; and the voice message acquisition module 750 is used to perform voice conversion on the answer text message to obtain a corresponding answer voice message. In the embodiments of the present application, the facial image is added to the chat robot; through recognition of the facial image, the current situation can be judged more accurately, the answer text message is obtained by the first model according to the text message and the expression message, and the answer text message is converted into a voice reply message, thereby improving the accuracy of the voice reply message.
本申请实施例的基于表情识别的语音消息生成装置用于执行上述实施例中的基于表情识别的语音消息生成方法,其具体处理过程与上述实施例中的基于表情识别的语音消息生成方法相同,此处不再一一赘述。The voice message generation device based on expression recognition in the embodiment of the present application is used to execute the method for generating a voice message based on expression recognition in the above embodiment, and its specific processing process is the same as the method for generating a voice message based on expression recognition in the above embodiment, and will not be repeated here.
本申请实施例还提供了一种计算机设备,包括:The embodiment of the present application also provides a computer device, including:
至少一个处理器,以及,at least one processor, and,
与至少一个处理器通信连接的存储器;其中,memory communicatively coupled to at least one processor; wherein,
存储器存储有指令，指令被至少一个处理器执行，以使至少一个处理器执行指令时实现一种基于表情识别的语音消息生成方法，该语音消息生成方法包括：获取语音数据及其对应的面部图像；对语音数据进行语音识别得到文本消息；对面部图像进行表情识别得到表情消息；将文本消息和表情消息输入至第一模型，第一模型根据文本消息和表情消息得到回答文本消息；对回答文本消息进行语音转换，得到对应的回答语音消息。The memory stores instructions, and the instructions are executed by the at least one processor, so that when the at least one processor executes the instructions, a voice message generation method based on expression recognition is implemented. The voice message generation method includes: acquiring voice data and its corresponding facial image; performing voice recognition on the voice data to obtain a text message; performing expression recognition on the facial image to obtain an expression message; inputting the text message and the expression message into a first model, the first model obtaining an answer text message according to the text message and the expression message; and performing voice conversion on the answer text message to obtain a corresponding answer voice message.
下面结合图9对计算机设备的硬件结构进行详细说明。该计算机设备包括:处理器810、存储器820、输入/输出接口830、通信接口840和总线850。The hardware structure of the computer device will be described in detail below in conjunction with FIG. 9 . The computer device includes: a processor 810 , a memory 820 , an input/output interface 830 , a communication interface 840 and a bus 850 .
处理器810，可以采用通用的中央处理器(Central Processing Unit，CPU)、微处理器、应用专用集成电路(Application Specific Integrated Circuit，ASIC)、或者一个或多个集成电路等方式实现，用于执行相关程序，以实现本申请实施例所提供的技术方案；The processor 810 can be implemented by a general-purpose central processing unit (Central Processing Unit, CPU), a microprocessor, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), or one or more integrated circuits, and is used to execute related programs to realize the technical solutions provided by the embodiments of the present application;
存储器820,可以采用只读存储器(Read Only Memory,ROM)、静态存储设备、动态存储设备或者随机存取存储器(Random Access Memory,RAM)等形式实现。存储器820可以存储操作系统和其他应用程序,在通过软件或者固件来实现本说明书实施例所提供的技术方案时,相关的程序代码保存在存储器820中,并由处理器810来调用执行本申请实施例的基于表情识别的语音消息生成方法;The memory 820 may be implemented in the form of a read-only memory (Read Only Memory, ROM), a static storage device, a dynamic storage device, or a random access memory (Random Access Memory, RAM). The memory 820 can store an operating system and other application programs. When implementing the technical solutions provided by the embodiments of this specification through software or firmware, the relevant program codes are stored in the memory 820, and are called by the processor 810 to execute the expression recognition-based voice message generation method of the embodiments of the present application;
输入/输出接口830,用于实现信息输入及输出;The input/output interface 830 is used to realize information input and output;
通信接口840,用于实现本设备与其他设备的通信交互,可以通过有线方式(例如USB、网线等)实现通信,也可以通过无线方式(例如移动网络、WIFI、蓝牙等)实现通信;和The communication interface 840 is used to realize the communication interaction between the device and other devices, and the communication can be realized through a wired method (such as USB, network cable, etc.), or can be realized through a wireless method (such as a mobile network, WIFI, Bluetooth, etc.); and
总线850，在设备的各个组件（例如处理器810、存储器820、输入/输出接口830和通信接口840）之间传输信息；the bus 850 transfers information between the various components of the device (e.g., the processor 810, the memory 820, the input/output interface 830 and the communication interface 840);
其中处理器810、存储器820、输入/输出接口830和通信接口840通过总线850实现彼此之间在设备内部的通信连接。The processor 810 , the memory 820 , the input/output interface 830 and the communication interface 840 are connected to each other within the device through the bus 850 .
本申请实施例还提供一种存储介质，该存储介质是计算机可读存储介质，该计算机可读存储介质存储有计算机可执行指令，该计算机可执行指令用于使计算机执行一种基于表情识别的语音消息生成方法，该语音消息生成方法包括：获取语音数据及其对应的面部图像；对语音数据进行语音识别得到文本消息；对面部图像进行表情识别得到表情消息；将文本消息和表情消息输入至第一模型，第一模型根据文本消息和表情消息得到回答文本消息；对回答文本消息进行语音转换，得到对应的回答语音消息。An embodiment of the present application further provides a storage medium, which is a computer-readable storage medium storing computer-executable instructions. The computer-executable instructions are used to cause a computer to execute a voice message generation method based on expression recognition, the method including: acquiring voice data and its corresponding facial image; performing voice recognition on the voice data to obtain a text message; performing expression recognition on the facial image to obtain an expression message; inputting the text message and the expression message into a first model, the first model obtaining an answer text message according to the text message and the expression message; and performing voice conversion on the answer text message to obtain a corresponding answer voice message.
所述计算机可读存储介质可以是非易失性,也可以是易失性。存储器作为一种非暂态计算机可读存储介质,可用于存储非暂态软件程序以及非暂态性计算机可执行程序。此外,存储器可以包括高速随机存取存储器,还可以包括非暂态存储器,例如至少一个磁盘存储器件、闪存器件、或其他非暂态固态存储器件。在一些实施方式中,存储器可选包括相对于处理器远程设置的存储器,这些远程存储器可以通过网络连接至该处理器。上述网络的实例包括但不限于互联网、企业内部网、局域网、移动通信网及其组合。The computer-readable storage medium may be non-volatile or volatile. As a non-transitory computer-readable storage medium, memory can be used to store non-transitory software programs and non-transitory computer-executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage devices. In some embodiments, the memory optionally includes memory located remotely from the processor, and these remote memories may be connected to the processor via a network. Examples of the aforementioned networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
本申请实施例描述的实施例是为了更加清楚的说明本申请实施例的技术方案,并不构成对于本申请实施例提供的技术方案的限定,本领域技术人员可知,随着技术的演变和新应用场景的出现,本申请实施例提供的技术方案对于类似的技术问题,同样适用。The embodiments described in the embodiments of the present application are to illustrate the technical solutions of the embodiments of the present application more clearly, and do not constitute a limitation to the technical solutions provided by the embodiments of the present application. Those skilled in the art know that with the evolution of technology and the emergence of new application scenarios, the technical solutions provided by the embodiments of the present application are also applicable to similar technical problems.
本领域技术人员可以理解的是,图1至图7中示出的技术方案并不构成对本申请实施例的限定,可以包括比图示更多或更少的步骤,或者组合某些步骤,或者不同的步骤。Those skilled in the art can understand that the technical solutions shown in FIG. 1 to FIG. 7 do not limit the embodiments of the present application, and may include more or fewer steps than those shown in the illustrations, or combine some steps, or different steps.
以上所描述的装置实施例仅仅是示意性的,其中作为分离部件说明的单元可以是或者也可以不是物理上分开的,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。The device embodiments described above are only illustrative, and the units described as separate components may or may not be physically separated, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
本领域普通技术人员可以理解,上文中所公开方法中的全部或某些步骤、系统、设备中的功能模块/单元可以被实施为软件、固件、硬件及其适当的组合。Those of ordinary skill in the art can understand that all or some of the steps in the methods disclosed above, the functional modules/units in the system, and the device can be implemented as software, firmware, hardware, and an appropriate combination thereof.
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。In addition, each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit. The above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.
所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括多指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(Read-Only Memory,简称ROM)、随机存取存储器(Random Access Memory,简称RAM)、磁碟或者光盘等各种可以存储程序的介质。If the integrated unit is realized in the form of a software function unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on such an understanding, the technical solution of the present application essentially or the part that contributes to the prior art or all or part of the technical solution can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes multiple instructions to make a computer device (which can be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the method described in each embodiment of the application. The aforementioned storage media include: U disk, mobile hard disk, read-only memory (Read-Only Memory, referred to as ROM), random access memory (Random Access Memory, referred to as RAM), magnetic disks or optical discs and other media that can store programs.
以上参照附图说明了本申请实施例的优选实施例,并非因此局限本申请实施例的权利范围。本领域技术人员不脱离本申请实施例的范围和实质内所作的任何修改、等同替换和改进,均应在本申请实施例的权利范围之内。The preferred embodiments of the embodiments of the present application have been described above with reference to the accompanying drawings, which does not limit the scope of rights of the embodiments of the present application. Any modifications, equivalent replacements and improvements made by those skilled in the art without departing from the scope and essence of the embodiments of the present application shall fall within the scope of rights of the embodiments of the present application.

Claims (20)

  1. 一种基于表情识别的语音消息生成方法,其中,包括:A method for generating voice messages based on facial expression recognition, comprising:
    获取语音数据及其对应的面部图像；acquiring voice data and its corresponding facial image;
    对所述语音数据进行语音识别得到文本消息;performing speech recognition on the speech data to obtain a text message;
    对所述面部图像进行表情识别得到表情消息；performing expression recognition on the facial image to obtain an expression message;
    将所述文本消息和所述表情消息输入至第一模型，所述第一模型根据所述文本消息和所述表情消息得到回答文本消息；inputting the text message and the expression message into a first model, the first model obtaining an answer text message according to the text message and the expression message;
    对所述回答文本消息进行语音转换，得到对应的回答语音消息。performing voice conversion on the answer text message to obtain a corresponding answer voice message.
  2. 根据权利要求1所述的方法,其中,所述对所述语音数据进行语音识别得到文本消息,包括:The method according to claim 1, wherein said performing speech recognition on said speech data to obtain a text message comprises:
    对所述语音数据的时域信号进行积分变换得到频域信号；performing an integral transform on the time-domain signal of the speech data to obtain a frequency-domain signal;
    根据所述时域信号和所述频域信号,构建平面空间;constructing a planar space according to the time domain signal and the frequency domain signal;
    通过第一神经网络,在所述平面空间中对所述语音数据进行卷积运算,得到语音序列和序列长度;Carrying out a convolution operation on the voice data in the planar space through the first neural network to obtain a voice sequence and sequence length;
    根据所述序列长度对所述语音序列进行切片,得到多个切片序列;Slicing the speech sequence according to the sequence length to obtain a plurality of slice sequences;
    通过第二神经网络对多个所述切片序列进行文本转换,得到所述文本消息。performing text conversion on multiple slice sequences through the second neural network to obtain the text message.
  3. 根据权利要求1所述的方法,其中,所述对所述面部图像进行表情识别得到表情消息,包括:The method according to claim 1, wherein said performing expression recognition on said facial image to obtain an expression message comprises:
    通过第三神经网络对所述面部图像进行自注意力筛选，得到变换参数；performing self-attention screening on the facial image through the third neural network to obtain transformation parameters;
    根据所述变换参数对所述面部图像进行扭曲变换,得到变换图像;performing warping transformation on the facial image according to the transformation parameters to obtain a transformed image;
    通过第四神经网络对所述面部图像和所述变换图像进行表情识别，得到所述表情消息。performing expression recognition on the facial image and the transformed image through the fourth neural network to obtain the expression message.
  4. 根据权利要求3所述的方法，其中，所述第四神经网络包括卷积层、全连接层和分类器；所述通过第四神经网络对所述面部图像和所述变换图像进行表情识别，得到表情消息，包括：The method according to claim 3, wherein the fourth neural network comprises a convolutional layer, a fully connected layer and a classifier; and said performing expression recognition on the facial image and the transformed image through the fourth neural network to obtain an expression message comprises:
    通过所述卷积层对所述面部图像和所述变换图像进行特征提取,得到多个图像特征向量;performing feature extraction on the face image and the transformed image through the convolution layer to obtain a plurality of image feature vectors;
    通过所述全连接层对多个所述图像特征向量进行拼接,得到图像拼接向量;splicing a plurality of the image feature vectors through the fully connected layer to obtain an image splicing vector;
    通过所述分类器对所述图像拼接向量进行表情分类，得到所述表情消息。performing expression classification on the image splicing vector through the classifier to obtain the expression message.
  5. 根据权利要求1所述的方法，其中，在所述将所述文本消息和所述表情消息输入至第一模型，所述第一模型根据所述文本消息和所述表情消息得到回答文本消息之前，包括：The method according to claim 1, wherein before inputting the text message and the expression message into the first model and obtaining, by the first model, an answer text message according to the text message and the expression message, the method comprises:
    获取消息数据集；其中，所述消息数据集包括多个问题样本数据、多个预设表情和多个回答样本数据，所述问题样本数据和所述预设表情一一对应以形成绑定组，每个所述绑定组与所述回答样本数据具有映射关系；acquiring a message data set; wherein the message data set includes a plurality of question sample data, a plurality of preset expressions and a plurality of answer sample data, the question sample data and the preset expressions are in one-to-one correspondence to form binding groups, and each of the binding groups has a mapping relationship with the answer sample data;
    对多个所述问题样本数据进行分词,得到多个问题分词数据;Segmenting a plurality of question sample data to obtain a plurality of question word segmentation data;
    对多个所述回答样本数据进行分词,得到多个回答分词数据;Segmenting a plurality of said answer sample data to obtain a plurality of answer word segmentation data;
    获取第一原始模型;obtaining a first primitive model;
    根据多个所述问题分词数据、多个所述回答分词数据和多个所述预设表情对所述第一原始模型进行训练，得到所述第一模型。training the first original model according to the plurality of question word segmentation data, the plurality of answer word segmentation data and the plurality of preset expressions to obtain the first model.
  6. 根据权利要求5所述的方法，其中，所述第一原始模型包括编码器和解码器；所述根据多个所述问题分词数据、多个所述回答分词数据和多个所述预设表情对所述第一原始模型进行训练，得到第一模型，包括：The method according to claim 5, wherein the first original model comprises an encoder and a decoder; and said training the first original model according to the plurality of question word segmentation data, the plurality of answer word segmentation data and the plurality of preset expressions to obtain the first model comprises:
    将多个所述问题分词数据和多个所述回答分词数据输入至所述编码器进行第一编码,得到样本编码数据;Inputting a plurality of the question word segmentation data and a plurality of the answer word segmentation data into the encoder for first encoding to obtain sample encoding data;
    将多个所述预设表情输入至所述编码器进行第二编码,得到表情编码数据;Inputting a plurality of preset expressions into the encoder for second encoding to obtain expression encoding data;
    对所述样本编码数据和所述表情编码数据进行拼接,得到样本拼接数据;Splicing the sample coded data and the expression coded data to obtain sample spliced data;
    将所述样本拼接数据输入至所述解码器进行解码，得到样本解码数据；inputting the sample spliced data into the decoder for decoding to obtain sample decoded data;
    根据所述样本拼接数据和所述样本解码数据，计算所述第一原始模型的损失函数，得到损失值；calculating a loss function of the first original model according to the sample spliced data and the sample decoded data to obtain a loss value;
    根据所述损失值更新所述第一原始模型,得到第一模型。The first original model is updated according to the loss value to obtain a first model.
  7. 根据权利要求1至6任一项所述的方法，其中，所述对所述回答文本消息进行语音转换，得到对应的回答语音消息，包括：The method according to any one of claims 1 to 6, wherein said performing voice conversion on the answer text message to obtain a corresponding answer voice message comprises:
    对所述回答文本消息进行语音转换，得到初步语音消息；performing voice conversion on the answer text message to obtain a preliminary voice message;
    对所述初步语音消息进行变换,得到声谱图;transforming the preliminary voice message to obtain a spectrogram;
    提取所述声谱图的音频特征;extracting audio features of the spectrogram;
    通过第五神经网络模型对所述音频特征进行解码,得到每一帧对应的音频数据;Decoding the audio feature by a fifth neural network model to obtain audio data corresponding to each frame;
    将所述音频数据进行合成处理,得到对应的回答语音消息。The audio data is synthesized to obtain a corresponding reply voice message.
  8. 一种基于表情识别的语音消息生成装置,其中,包括:A voice message generation device based on facial expression recognition, comprising:
    数据采集模块,用于获取语音数据及其对应的面部图像;Data collection module, is used for obtaining voice data and its corresponding facial image;
    语音识别模块,用于对所述语音数据进行语音识别得到文本消息;A voice recognition module, configured to perform voice recognition on the voice data to obtain a text message;
    表情识别模块,用于对所述面部图像进行表情识别得到表情消息;An expression recognition module, configured to perform expression recognition on the facial image to obtain an expression message;
    文本消息获取模块，用于将所述文本消息和所述表情消息输入至第一模型，所述第一模型根据所述文本消息和所述表情消息得到回答文本消息；a text message acquisition module, configured to input the text message and the expression message into the first model, the first model obtaining an answer text message according to the text message and the expression message; and
    语音消息获取模块，用于对所述回答文本消息进行语音转换，得到对应的回答语音消息。a voice message acquisition module, configured to perform voice conversion on the answer text message to obtain a corresponding answer voice message.
  9. 一种计算机设备,其中,所述计算机设备包括存储器和处理器,其中,所述存储器中存储有程序,所述程序被所述处理器执行时所述处理器用于执行一种基于表情识别的语音消息生成方法,所述语音消息生成方法包括:A computer device, wherein the computer device includes a memory and a processor, wherein a program is stored in the memory, and when the program is executed by the processor, the processor is used to perform a voice message generation method based on facial expression recognition, the voice message generation method comprising:
    获取语音数据及其对应的面部图像；acquiring voice data and its corresponding facial image;
    对所述语音数据进行语音识别得到文本消息;performing speech recognition on the speech data to obtain a text message;
    对所述面部图像进行表情识别得到表情消息；performing expression recognition on the facial image to obtain an expression message;
    将所述文本消息和所述表情消息输入至第一模型，所述第一模型根据所述文本消息和所述表情消息得到回答文本消息；inputting the text message and the expression message into a first model, the first model obtaining an answer text message according to the text message and the expression message;
    对所述回答文本消息进行语音转换，得到对应的回答语音消息。performing voice conversion on the answer text message to obtain a corresponding answer voice message.
  10. 根据权利要求9所述的一种计算机设备，其中，所述对所述语音数据进行语音识别得到文本消息，包括：The computer device according to claim 9, wherein said performing speech recognition on the speech data to obtain a text message comprises:
    对所述语音数据的时域信号进行积分变换得到频域信号；performing an integral transform on the time-domain signal of the speech data to obtain a frequency-domain signal;
    根据所述时域信号和所述频域信号,构建平面空间;constructing a planar space according to the time domain signal and the frequency domain signal;
    通过第一神经网络,在所述平面空间中对所述语音数据进行卷积运算,得到语音序列和序列长度;Carrying out a convolution operation on the voice data in the planar space through the first neural network to obtain a voice sequence and sequence length;
    根据所述序列长度对所述语音序列进行切片,得到多个切片序列;Slicing the speech sequence according to the sequence length to obtain a plurality of slice sequences;
    通过第二神经网络对多个所述切片序列进行文本转换,得到所述文本消息。performing text conversion on multiple slice sequences through the second neural network to obtain the text message.
  11. 根据权利要求9所述的一种计算机设备，其中，所述对所述面部图像进行表情识别得到表情消息，包括：The computer device according to claim 9, wherein said performing expression recognition on the facial image to obtain an expression message comprises:
    通过第三神经网络对所述面部图像进行自注意力筛选，得到变换参数；performing self-attention screening on the facial image through the third neural network to obtain transformation parameters;
    根据所述变换参数对所述面部图像进行扭曲变换,得到变换图像;performing warping transformation on the facial image according to the transformation parameters to obtain a transformed image;
    通过第四神经网络对所述面部图像和所述变换图像进行表情识别，得到所述表情消息。performing expression recognition on the facial image and the transformed image through the fourth neural network to obtain the expression message.
  12. 根据权利要求11所述的一种计算机设备，其中，所述第四神经网络包括卷积层、全连接层和分类器；所述通过第四神经网络对所述面部图像和所述变换图像进行表情识别，得到表情消息，包括：The computer device according to claim 11, wherein the fourth neural network comprises a convolutional layer, a fully connected layer and a classifier; and said performing expression recognition on the facial image and the transformed image through the fourth neural network to obtain an expression message comprises:
    通过所述卷积层对所述面部图像和所述变换图像进行特征提取,得到多个图像特征向量;performing feature extraction on the face image and the transformed image through the convolution layer to obtain a plurality of image feature vectors;
    通过所述全连接层对多个所述图像特征向量进行拼接,得到图像拼接向量;splicing a plurality of the image feature vectors through the fully connected layer to obtain an image splicing vector;
    通过所述分类器对所述图像拼接向量进行表情分类，得到所述表情消息。performing expression classification on the image splicing vector through the classifier to obtain the expression message.
  13. 根据权利要求9所述的一种计算机设备，其中，在所述将所述文本消息和所述表情消息输入至第一模型，所述第一模型根据所述文本消息和所述表情消息得到回答文本消息之前，包括：The computer device according to claim 9, wherein before inputting the text message and the expression message into the first model and obtaining, by the first model, an answer text message according to the text message and the expression message, the method comprises:
    获取消息数据集；其中，所述消息数据集包括多个问题样本数据、多个预设表情和多个回答样本数据，所述问题样本数据和所述预设表情一一对应以形成绑定组，每个所述绑定组与所述回答样本数据具有映射关系；acquiring a message data set; wherein the message data set includes a plurality of question sample data, a plurality of preset expressions and a plurality of answer sample data, the question sample data and the preset expressions are in one-to-one correspondence to form binding groups, and each of the binding groups has a mapping relationship with the answer sample data;
    对多个所述问题样本数据进行分词,得到多个问题分词数据;Segmenting a plurality of question sample data to obtain a plurality of question word segmentation data;
    对多个所述回答样本数据进行分词,得到多个回答分词数据;Segmenting a plurality of said answer sample data to obtain a plurality of answer word segmentation data;
    获取第一原始模型;obtaining a first primitive model;
    根据多个所述问题分词数据、多个所述回答分词数据和多个所述预设表情对所述第一原始模型进行训练，得到所述第一模型。training the first original model according to the plurality of question word segmentation data, the plurality of answer word segmentation data and the plurality of preset expressions to obtain the first model.
  14. 根据权利要求13所述的一种计算机设备，其中，所述第一原始模型包括编码器和解码器；所述根据多个所述问题分词数据、多个所述回答分词数据和多个所述预设表情对所述第一原始模型进行训练，得到第一模型，包括：The computer device according to claim 13, wherein the first original model comprises an encoder and a decoder; and said training the first original model according to the plurality of question word segmentation data, the plurality of answer word segmentation data and the plurality of preset expressions to obtain the first model comprises:
    将多个所述问题分词数据和多个所述回答分词数据输入至所述编码器进行第一编码,得到样本编码数据;Inputting a plurality of the question word segmentation data and a plurality of the answer word segmentation data into the encoder for first encoding to obtain sample encoding data;
    将多个所述预设表情输入至所述编码器进行第二编码,得到表情编码数据;Inputting a plurality of preset expressions into the encoder for second encoding to obtain expression encoding data;
    对所述样本编码数据和所述表情编码数据进行拼接,得到样本拼接数据;Splicing the sample coded data and the expression coded data to obtain sample spliced data;
    将所述样本拼接数据输入至所述解码器进行解码，得到样本解码数据；inputting the sample spliced data into the decoder for decoding to obtain sample decoded data;
    根据所述样本拼接数据和所述样本解码数据，计算所述第一原始模型的损失函数，得到损失值；calculating a loss function of the first original model according to the sample spliced data and the sample decoded data to obtain a loss value;
    根据所述损失值更新所述第一原始模型,得到第一模型。The first original model is updated according to the loss value to obtain a first model.
  15. 一种存储介质，所述存储介质为计算机可读存储介质，其中，所述计算机可读存储介质存储有计算机程序，在所述计算机程序被计算机执行时，所述计算机用于执行一种基于表情识别的语音消息生成方法，所述语音消息生成方法包括：A storage medium, the storage medium being a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program; when the computer program is executed by a computer, the computer is caused to execute a voice message generation method based on expression recognition, the voice message generation method comprising:
    获取语音数据及其对应的面部图像；acquiring voice data and its corresponding facial image;
    对所述语音数据进行语音识别得到文本消息;performing speech recognition on the speech data to obtain a text message;
    对所述面部图像进行表情识别得到表情消息；performing expression recognition on the facial image to obtain an expression message;
    将所述文本消息和所述表情消息输入至第一模型，所述第一模型根据所述文本消息和所述表情消息得到回答文本消息；inputting the text message and the expression message into a first model, the first model obtaining an answer text message according to the text message and the expression message;
    对所述回答文本消息进行语音转换，得到对应的回答语音消息。performing voice conversion on the answer text message to obtain a corresponding answer voice message.
  16. 根据权利要求15所述的一种存储介质，其中，所述对所述语音数据进行语音识别得到文本消息，包括：The storage medium according to claim 15, wherein said performing speech recognition on the speech data to obtain a text message comprises:
    对所述语音数据的时域信号进行积分变换得到频域信号；performing an integral transform on the time-domain signal of the speech data to obtain a frequency-domain signal;
    根据所述时域信号和所述频域信号,构建平面空间;constructing a planar space according to the time domain signal and the frequency domain signal;
    通过第一神经网络,在所述平面空间中对所述语音数据进行卷积运算,得到语音序列和序列长度;Carrying out a convolution operation on the voice data in the planar space through the first neural network to obtain a voice sequence and sequence length;
    根据所述序列长度对所述语音序列进行切片,得到多个切片序列;Slicing the speech sequence according to the sequence length to obtain a plurality of slice sequences;
    通过第二神经网络对多个所述切片序列进行文本转换,得到所述文本消息。performing text conversion on multiple slice sequences through the second neural network to obtain the text message.
  17. 根据权利要求15所述的一种存储介质，其中，所述对所述面部图像进行表情识别得到表情消息，包括：The storage medium according to claim 15, wherein said performing expression recognition on the facial image to obtain an expression message comprises:
    通过第三神经网络对所述面部图像进行自注意力筛选，得到变换参数；performing self-attention screening on the facial image through the third neural network to obtain transformation parameters;
    根据所述变换参数对所述面部图像进行扭曲变换,得到变换图像;performing warping transformation on the facial image according to the transformation parameters to obtain a transformed image;
    通过第四神经网络对所述面部图像和所述变换图像进行表情识别，得到所述表情消息。performing expression recognition on the facial image and the transformed image through the fourth neural network to obtain the expression message.
  18. 根据权利要求17所述的一种存储介质，其中，所述第四神经网络包括卷积层、全连接层和分类器；所述通过第四神经网络对所述面部图像和所述变换图像进行表情识别，得到表情消息，包括：The storage medium according to claim 17, wherein the fourth neural network comprises a convolutional layer, a fully connected layer and a classifier; and said performing expression recognition on the facial image and the transformed image through the fourth neural network to obtain an expression message comprises:
    通过所述卷积层对所述面部图像和所述变换图像进行特征提取,得到多个图像特征向量;performing feature extraction on the face image and the transformed image through the convolution layer to obtain a plurality of image feature vectors;
    通过所述全连接层对多个所述图像特征向量进行拼接,得到图像拼接向量;splicing a plurality of the image feature vectors through the fully connected layer to obtain an image splicing vector;
    通过所述分类器对所述图像拼接向量进行表情分类，得到所述表情消息。performing expression classification on the image splicing vector through the classifier to obtain the expression message.
  19. 根据权利要求15所述的一种存储介质，其中，在所述将所述文本消息和所述表情消息输入至第一模型，所述第一模型根据所述文本消息和所述表情消息得到回答文本消息之前，包括：The storage medium according to claim 15, wherein before inputting the text message and the expression message into the first model and obtaining, by the first model, an answer text message according to the text message and the expression message, the method comprises:
    获取消息数据集；其中，所述消息数据集包括多个问题样本数据、多个预设表情和多个回答样本数据，所述问题样本数据和所述预设表情一一对应以形成绑定组，每个所述绑定组与所述回答样本数据具有映射关系；acquiring a message data set; wherein the message data set includes a plurality of question sample data, a plurality of preset expressions and a plurality of answer sample data, the question sample data and the preset expressions are in one-to-one correspondence to form binding groups, and each of the binding groups has a mapping relationship with the answer sample data;
    对多个所述问题样本数据进行分词,得到多个问题分词数据;Segmenting a plurality of question sample data to obtain a plurality of question word segmentation data;
    对多个所述回答样本数据进行分词,得到多个回答分词数据;Segmenting a plurality of said answer sample data to obtain a plurality of answer word segmentation data;
    获取第一原始模型;obtaining a first primitive model;
    根据多个所述问题分词数据、多个所述回答分词数据和多个所述预设表情对所述第一原始模型进行训练，得到所述第一模型。training the first original model according to the plurality of question word segmentation data, the plurality of answer word segmentation data and the plurality of preset expressions to obtain the first model.
  20. 根据权利要求19所述的一种存储介质，其中，所述第一原始模型包括编码器和解码器；所述根据多个所述问题分词数据、多个所述回答分词数据和多个所述预设表情对所述第一原始模型进行训练，得到第一模型，包括：The storage medium according to claim 19, wherein the first original model comprises an encoder and a decoder; and said training the first original model according to the plurality of question word segmentation data, the plurality of answer word segmentation data and the plurality of preset expressions to obtain the first model comprises:
    将多个所述问题分词数据和多个所述回答分词数据输入至所述编码器进行第一编码,得到样本编码数据;Inputting a plurality of the question word segmentation data and a plurality of the answer word segmentation data into the encoder for first encoding to obtain sample encoding data;
    将多个所述预设表情输入至所述编码器进行第二编码,得到表情编码数据;Inputting a plurality of preset expressions into the encoder for second encoding to obtain expression encoding data;
    对所述样本编码数据和所述表情编码数据进行拼接,得到样本拼接数据;Splicing the sample coded data and the expression coded data to obtain sample spliced data;
    将所述样本拼接数据输入至所述解码器进行解码，得到样本解码数据；inputting the sample spliced data into the decoder for decoding to obtain sample decoded data;
    根据所述样本拼接数据和所述样本解码数据，计算所述第一原始模型的损失函数，得到损失值；calculating a loss function of the first original model according to the sample spliced data and the sample decoded data to obtain a loss value;
    根据所述损失值更新所述第一原始模型,得到第一模型。The first original model is updated according to the loss value to obtain a first model.
PCT/CN2022/090752 2022-01-18 2022-04-29 Voice message generation method and apparatus, computer device and storage medium WO2023137922A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210057040.4 2022-01-18
CN202210057040.4A CN114400005A (en) 2022-01-18 2022-01-18 Voice message generation method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2023137922A1 true WO2023137922A1 (en) 2023-07-27

Family

ID=81230639

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/090752 WO2023137922A1 (en) 2022-01-18 2022-04-29 Voice message generation method and apparatus, computer device and storage medium

Country Status (2)

Country Link
CN (1) CN114400005A (en)
WO (1) WO2023137922A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114400005A (en) * 2022-01-18 2022-04-26 平安科技(深圳)有限公司 Voice message generation method and device, computer equipment and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106127156A (en) * 2016-06-27 2016-11-16 上海元趣信息技术有限公司 Robot interactive method based on vocal print and recognition of face
CN108833941A (en) * 2018-06-29 2018-11-16 北京百度网讯科技有限公司 Man-machine dialogue system method, apparatus, user terminal, processing server and system
CN110457432A (en) * 2019-07-04 2019-11-15 平安科技(深圳)有限公司 Interview methods of marking, device, equipment and storage medium
CN110717514A (en) * 2019-09-06 2020-01-21 平安国际智慧城市科技股份有限公司 Session intention identification method and device, computer equipment and storage medium
CN112687260A (en) * 2020-11-17 2021-04-20 珠海格力电器股份有限公司 Facial-recognition-based expression judgment voice recognition method, server and air conditioner
CN113555027A (en) * 2021-07-26 2021-10-26 平安科技(深圳)有限公司 Voice emotion conversion method and device, computer equipment and storage medium
CN113704419A (en) * 2021-02-26 2021-11-26 腾讯科技(深圳)有限公司 Conversation processing method and device
CN114400005A (en) * 2022-01-18 2022-04-26 平安科技(深圳)有限公司 Voice message generation method and device, computer equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117122289A (en) * 2023-09-12 2023-11-28 中国人民解放军总医院第一医学中心 Pain assessment method
CN117122289B (en) * 2023-09-12 2024-03-19 中国人民解放军总医院第一医学中心 Pain assessment method

Also Published As

Publication number Publication date
CN114400005A (en) 2022-04-26

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22921353

Country of ref document: EP

Kind code of ref document: A1