WO2023137922A1 - Voice message generation method and apparatus, computer device and storage medium - Google Patents

Voice message generation method and apparatus, computer device and storage medium

Info

Publication number
WO2023137922A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
message
voice
expression
text message
Prior art date
Application number
PCT/CN2022/090752
Other languages
French (fr)
Chinese (zh)
Inventor
郑喜民
贾云舒
舒畅
陈又新
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2023137922A1 publication Critical patent/WO2023137922A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/24 Speech recognition using non-acoustical features
    • G10L15/25 Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Definitions

  • the present application relates to the technical field of artificial intelligence, and in particular to a voice message generation method and device, computer equipment, and storage media.
  • A chatbot is an artificial intelligence system that communicates with people through such communication channels.
  • Chatbots are divided into two types: active interaction and passive interaction.
  • Active interaction is initiated by the robot, which interacts with humans by sharing or recommending hotspot information that users are interested in.
  • In passive interaction, the user initiates a dialogue, and the machine understands the dialogue and responds accordingly.
  • Most chatbots that users come into contact with belong to the passive interaction type.
  • The interactive function of current passive-interaction chatbots is relatively limited: they can only respond according to the text recognized from the user's voice, and this single recognition method often affects the accuracy of the voice reply messages the chatbot generates.
  • In a first aspect, an embodiment of the present application proposes a method for generating a voice message based on expression recognition, the method comprising:
  • acquiring voice data and a corresponding facial image; performing speech recognition on the voice data to obtain a text message; performing expression recognition on the facial image to obtain an expression message;
  • inputting the text message and the expression message into a first model, the first model obtaining an answer text message according to the text message and the expression message; and
  • performing voice conversion on the answer text message to obtain a corresponding answer voice message.
  • In a second aspect, an embodiment of the present application proposes a voice message generation device based on expression recognition, comprising:
  • a data collection module, configured to acquire voice data and a corresponding facial image;
  • a speech recognition module, configured to perform speech recognition on the voice data to obtain a text message;
  • an expression recognition module, configured to perform expression recognition on the facial image to obtain an expression message;
  • a text message acquisition module, configured to input the text message and the expression message into a first model, the first model obtaining an answer text message according to the text message and the expression message; and
  • a voice message acquisition module, configured to perform voice conversion on the answer text message to obtain a corresponding answer voice message.
  • In a third aspect, an embodiment of the present application proposes a computer device, the computer device comprising a memory and a processor, wherein a program is stored in the memory, and when the program is executed by the processor, the processor is configured to execute a voice message generation method based on expression recognition, the method comprising:
  • acquiring voice data and a corresponding facial image; performing speech recognition on the voice data to obtain a text message; performing expression recognition on the facial image to obtain an expression message;
  • inputting the text message and the expression message into a first model, the first model obtaining an answer text message according to the text message and the expression message; and performing voice conversion on the answer text message to obtain a corresponding answer voice message.
  • In a fourth aspect, an embodiment of the present application proposes a storage medium, the storage medium being a computer-readable storage medium storing computer-executable instructions, the computer-executable instructions being used to cause a computer to execute a voice message generation method based on expression recognition, the method comprising:
  • acquiring voice data and a corresponding facial image; performing speech recognition on the voice data to obtain a text message; performing expression recognition on the facial image to obtain an expression message;
  • inputting the text message and the expression message into a first model, the first model obtaining an answer text message according to the text message and the expression message; and performing voice conversion on the answer text message to obtain a corresponding answer voice message.
  • The expression recognition-based voice message generation method and device, computer device, and storage medium proposed in the embodiments of the present application acquire voice data and a corresponding facial image, perform speech recognition on the voice data to obtain a text message, and perform expression recognition on the facial image to obtain an expression message; the text message and the expression message are input into the first model, which obtains an answer text message based on them, and finally the answer text message undergoes voice conversion to obtain the corresponding answer voice message.
  • In this way, the facial image is added to the chat robot's input.
  • By recognizing the facial image, the current situation can be judged more accurately; the first model obtains the answer text message according to the text message and the expression message, and the answer text message is converted into a voice reply message, thereby improving the accuracy of the voice reply message.
  • Fig. 1 is the first flowchart of the voice message generation method based on expression recognition provided by the embodiment of the present application;
  • Fig. 2 is the flowchart of step S200 in Fig. 1;
  • Fig. 3 is a flowchart of step S300 in Fig. 1;
  • Fig. 4 is the flowchart of step S330 in Fig. 3;
  • Fig. 5 is a flowchart of step S500 in Fig. 1;
  • FIG. 6 is a second flow chart of a voice message generation method based on facial expression recognition provided by an embodiment of the present application.
  • FIG. 7 is a flow chart of the actual application of the facial expression recognition-based voice message generation method provided by the embodiment of the present application.
  • FIG. 8 is a block diagram of a module structure of a voice message generation device based on expression recognition provided by an embodiment of the present application.
  • FIG. 9 is a schematic diagram of a hardware structure of a computer device provided by an embodiment of the present application.
  • Artificial intelligence: a new technical science that studies and develops theories, methods, technologies, and application systems for simulating, extending, and expanding human intelligence. Artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce intelligent machines that can respond in a way similar to human intelligence. Research in this field includes robotics, speech recognition, image recognition, natural language processing, and expert systems. Artificial intelligence can simulate the information processes of human consciousness and thinking. It is also the theory, method, technology, and application system of using digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • Chatbot: a computer program that communicates via dialogue or text, able to simulate human conversation and pass the Turing test. Chatbots can be used for practical purposes such as customer service or information acquisition. Some chatbots are equipped with natural language processing systems, but most simple systems only extract keywords from the input and then find the most suitable response sentences from a database. Chatbots are part of virtual assistants, such as Google Assistant, that can interface with many organizations' apps, websites, and instant messaging platforms (e.g., Facebook Messenger). Non-assistant applications include chat rooms for entertainment, research, specific product promotion, and social bots.
  • Convolutional neural network (CNN): a type of feedforward neural network that includes convolution computations and has a deep structure; it is one of the representative algorithms of deep learning.
  • A convolutional neural network has representation-learning ability and can perform shift-invariant classification of input information according to its hierarchical structure. With the introduction of deep learning theory and the improvement of numerical computing equipment, convolutional neural networks have developed rapidly and are applied in computer vision, natural language processing, and other fields. The convolutional neural network imitates the biological visual perception mechanism and can perform both supervised and unsupervised learning.
  • The parameter sharing of convolution kernels in the hidden layers and the sparsity of inter-layer connections enable a convolutional neural network to learn grid-like topological features, such as pixels and audio, with a small amount of computation, a stable effect, and no additional feature-engineering requirements on the data.
  • Recurrent neural network (RNN): a class of recursive neural networks that takes sequence data as input, performs recursion in the evolution direction of the sequence, and connects all nodes (recurrent units) in a chain.
  • LSTM: a common type of recurrent neural network.
  • Recurrent neural networks have memory, parameter sharing, and Turing completeness, so they have certain advantages in learning the nonlinear characteristics of sequences.
  • Recurrent neural networks are used in natural language processing (NLP), such as speech recognition, language modeling, and machine translation, and are also used in various kinds of time-series forecasting.
  • A recurrent neural network combined with a convolutional neural network can deal with computer vision problems involving sequence input.
  • Gated recurrent unit (GRU): a gating mechanism in recurrent neural networks that, like other gating mechanisms, aims to solve the vanishing/exploding gradient problem of standard RNNs while retaining the long-term information of the sequence. GRUs perform as well as LSTMs on many sequence tasks such as speech recognition, but have fewer parameters, containing only a reset gate and an update gate.
  • CTC: Connectionist Temporal Classification.
  • Region of interest (ROI): in machine vision and image processing, a region to be processed that is outlined from the image in the form of a box, circle, ellipse, irregular polygon, etc.
  • OpenCV: a cross-platform computer vision and machine learning software library released under the Apache 2.0 license (open source) that can run on Linux, Windows, Android, and Mac OS. It is lightweight and efficient, consisting of a series of C functions and a small number of C++ classes, and it implements many general-purpose algorithms in image processing and computer vision. OpenCV is written in C++ and has C++, Python, Java, and MATLAB interfaces; support is also provided for languages such as C#, Ch, Ruby, and Go. It is mainly oriented toward real-time vision applications and uses MMX and SSE instructions when available.
  • VGG model (Visual Geometry Group network): this network is related work from ILSVRC 2014; its main contribution is to show that increasing network depth can, to a certain extent, affect the network's final performance.
  • VGG has two structures, VGG16 and VGG19. There is no essential difference between the two; only the network depth differs.
  • One improvement of VGG16 over AlexNet is to use several consecutive 3x3 convolution kernels instead of the larger convolution kernels in AlexNet (11x11, 7x7, 5x5).
  • For a given receptive field, using stacked small convolution kernels is better than using one large convolution kernel, because multiple nonlinear layers increase the depth of the network, allowing it to learn more complex patterns at a relatively small cost (fewer parameters).
  • Embedding: a vector representation in which a low-dimensional vector represents an object, which can be a word, a commodity, a movie, etc. The nature of an embedding vector is that objects whose vectors are close in distance have similar meanings.
  • Embedding is essentially a mapping from a semantic space to a vector space that maintains, as much as possible, the relationships the original samples have in the semantic space; for example, two words with close semantics are positioned close together in the vector space.
  • Embedding can encode an object with a low-dimensional vector while retaining its meaning. It is often used in machine learning: in building a model, an object is encoded as a low-dimensional dense vector and then passed to a DNN, which improves efficiency.
  • Cross entropy: an important concept in Shannon information theory, mainly used to measure the difference between two probability distributions.
  • The performance of language models is usually measured by cross entropy and perplexity.
  • The meaning of cross entropy is the difficulty of recognizing text with the model, or, from a compression point of view, how many bits are used on average to encode each word.
  • The meaning of perplexity is the average number of branches with which the model represents this text; its reciprocal can be regarded as the average probability of each word.
  • Smoothing refers to assigning a probability value to unobserved N-gram combinations to ensure that a word sequence can always obtain a probability value from the language model.
  • Commonly used smoothing techniques are Good-Turing estimation, deleted interpolation smoothing, Katz smoothing, and Kneser-Ney smoothing.
  • jieba: an open-source Chinese word segmenter (also known as the "Jieba" segmenter).
  • Chinese word segmentation is a basic step in Chinese text processing and a basic module of Chinese human-computer natural language interaction.
  • The jieba segmenter is commonly used for word segmentation.
  • It uses dynamic programming to find the maximum-probability path, finding the maximum segmentation combination based on word frequency.
  • For words not in its dictionary, it uses an HMM model based on the word-forming ability of Chinese characters, together with the Viterbi algorithm.
  • jieba supports three segmentation modes: the precise mode, which tries to cut the sentence most accurately and is suitable for text analysis; the full mode, which scans all the words in the sentence that can form words, and is very fast but cannot resolve ambiguity; and the search engine mode, which, based on the precise mode, further segments long words to improve recall, and is suitable for word segmentation in search engines.
  • Analyzer (tokenizer): a component that handles word segmentation, generally comprising three parts: character filters, a tokenizer (which splits text into terms according to rules), and token filters. Character filters preprocess the original text, for example removing HTML or special characters; the tokenizer splits text into terms according to rules; token filters process the resulting terms, including lowercasing, deleting stopwords, adding synonyms, and so on.
  • Encoder and decoder: the encoder converts the input sequence into a fixed-length vector; the decoder converts the previously generated fixed vector into an output sequence. The input sequence can be text, voice, image, or video; the output sequence can be text or image.
  • word2vec: a group of related models used to generate word vectors. These models are shallow, two-layer neural networks trained to reconstruct linguistic word contexts: the network takes words as input and is trained to guess the words in adjacent positions. Under word2vec's bag-of-words assumption, the order of words is unimportant. After training, the word2vec model can map each word to a vector that represents word-to-word relationships; this vector is the hidden layer of the neural network.
  • The attention mechanism enables a neural network to focus on a subset of its inputs (or features) and select specific inputs; it can be applied to any type of input regardless of its shape.
  • The attention mechanism is a resource allocation scheme, the main means of addressing information overload, allocating computing resources to more important tasks.
  • Seq2Seq: an important RNN model, also known as the encoder-decoder model, which can be understood as an N-to-M model.
  • The model consists of two parts: the encoder encodes sequence information of any length into a vector c.
  • The decoder, after obtaining the context vector c, decodes the information and outputs it as a sequence.
  • The standard Fourier transform is only suitable for stationary signals; the short-time Fourier transform (STFT) is one of the tools used when that assumption fails.
  • A signal such as a porpoise whistle is non-stationary: its frequency characteristics change with time. To capture such time-varying features, time-frequency analysis of the signal is required.
  • Commonly used methods include the short-time Fourier transform, the wavelet transform, and the Hilbert-Huang transform.
  • Mel-frequency cepstrum: a linear transformation of the logarithmic energy spectrum based on the nonlinear Mel scale of sound frequency.
  • Mel-frequency cepstral coefficients (MFCCs) are the coefficients that make up the Mel-frequency cepstrum. They are derived from the cepstrum of an audio segment. The difference between the ordinary cepstrum and the Mel-frequency cepstrum is that the frequency bands of the latter are equally spaced on the Mel scale, which approximates the human auditory system more closely than the linearly spaced frequency bands used in the ordinary cepstrum. Such a nonlinear representation can yield better representations of sound signals in multiple domains.
  • Griffin-Lim: a vocoder, often used in speech synthesis, that converts the acoustic parameters generated by a speech synthesis system into a speech waveform. This vocoder requires no training and does not need to predict the phase spectrum; instead, it estimates the phase information from the relationships between frames, thereby reconstructing the speech waveform.
  • Softmax classifier: the generalization of the logistic regression classifier to multiple classes; its output is the probability of belonging to each of the different categories.
  • As noted above, a chatbot is an artificial intelligence system that communicates with people through communication channels, and the interaction function of passive-interaction chatbots is currently limited: they can only respond to the text recognized from the user's voice, and this single recognition method often affects the accuracy of the voice reply messages the chatbot generates.
  • Embodiments of the present application provide a voice message generation method and device based on expression recognition, a computer device, and a storage medium, which can improve the accuracy of the generated voice reply message. Specifically, the following embodiments are used for illustration.
  • the expression recognition-based voice message generation method provided in the embodiment of the present application relates to the field of artificial intelligence.
  • the voice message generation method based on facial expression recognition provided by the embodiment of the present application can be applied to a terminal, can also be applied to a server, and can also be software running on the terminal or the server.
  • the terminal can be a smart phone, a tablet computer, a notebook computer, a desktop computer, or a smart watch;
  • the server end can be configured as an independent physical server, as a server cluster or distributed system composed of multiple physical servers, or as a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN, big data, and artificial intelligence platforms; the server is not limited to the above forms.
  • the embodiments of the present application can be used in many general-purpose or special-purpose computer system environments or configurations. Examples: personal computers, server computers, handheld or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics devices, network PCs, minicomputers, mainframe computers, distributed computing environments including any of the above systems or devices, etc.
  • This application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer.
  • program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types.
  • the application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network.
  • program modules may be located in both local and remote computer storage media including storage devices.
  • the method for generating a voice message based on expression recognition includes, but is not limited to, steps S100 to S500.
  • Step S100: acquiring voice data and a corresponding facial image.
  • Step S200: performing speech recognition on the voice data to obtain a text message.
  • Step S300: performing expression recognition on the facial image to obtain an expression message.
  • Step S400: inputting the text message and the expression message into the first model, the first model obtaining an answer text message according to the text message and the expression message.
  • Step S500: performing voice conversion on the answer text message to obtain the corresponding answer voice message.
  • The voice data sent by the user, that is, the content of the user's speech to the chat robot, is captured through a microphone; at the same time, a camera captures an image of the user speaking, specifically the user's facial image.
  • Some images captured by the camera may not contain the user's face region, or may not contain only the user's face region. In this case, the captured images need to be further screened; specifically, images that do not contain the user's facial region may be deleted.
  • The CascadeClassifier function in the open-source OpenCV library can also be used to automatically detect all face regions in a picture, realizing face detection and localization of the image.
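  • By way of illustration only, a minimal sketch of this screening step using OpenCV's CascadeClassifier is shown below; the cascade file, thresholds, and helper name are assumptions, not taken from the patent.

```python
# Hypothetical sketch of the face-screening step with OpenCV's CascadeClassifier.
# The cascade file and detection thresholds are illustrative assumptions.
import cv2

def detect_face_regions(image_path):
    image = cv2.imread(image_path)
    if image is None:
        return []  # unreadable capture: treat as containing no face region
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    # detectMultiScale returns one (x, y, w, h) box per detected face region.
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return [image[y:y + h, x:x + w] for (x, y, w, h) in faces]
```

  • Images for which this function finds no face region would be the ones deleted during screening.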
  • In step S200 of some embodiments, after the voice data from the user is collected, the voice data needs to be converted into text to obtain a text message.
  • In step S300 of some embodiments, after the user's facial image is collected, expression classification needs to be performed on the facial image; for example, it is necessary to determine which expression a certain facial image shows and to generate a corresponding expression message, such as a text vector or an image vector corresponding to the expression, which is used to make the first model generate the reply text message.
  • The expressions can be divided into happy, sad, angry, neutral, surprised, and scared.
  • In step S400 of some embodiments, the text message and the expression message are input into the first model, and the first model obtains a reply text message according to them.
  • In step S500 of some embodiments, voice conversion is performed on the answer text message to obtain the corresponding answer voice message.
  • The chat robot then gives the user a corresponding voice answer according to the voice message.
  • step S200 specifically includes, but is not limited to, steps S210 to S250.
  • Step S210: performing an integral transform on the time-domain signal of the speech data to obtain the frequency-domain signal.
  • Step S220: constructing a planar space from the time-domain signal and the frequency-domain signal.
  • Step S230: performing, through the first neural network, a convolution operation on the voice data in the planar space to obtain the voice sequence and its sequence length.
  • Step S240: slicing the speech sequence according to the sequence length to obtain multiple slice sequences.
  • Step S250: performing text conversion on the multiple slice sequences through the second neural network to obtain the text message.
  • In step S210 of some embodiments, an integral transform is applied to the time-domain signal of the voice data to obtain the frequency-domain signal.
  • The integral transform can be a Fourier transform, which converts a time-domain signal that is difficult to process into a frequency-domain signal that is easy to analyze.
  • The function of the fast Fourier transform is to transform the time-domain digital signal into the frequency domain and analyze the positions with higher energy there; these positions may be the frequency bands where the sounds that need attention are located.
  • In step S220 of some embodiments, the time-domain signal and the frequency-domain signal are combined into a two-dimensional space, that is, a planar space.
  • In step S230 of some embodiments, the first neural network performs a convolution operation on the voice data in the planar space to obtain the voice sequence and the sequence length.
  • The first neural network is composed of multiple CNNs and is used to perform the convolution operation on the speech data to obtain a speech sequence and its length.
  • In step S240 of some embodiments, the speech sequence is sliced according to the sequence length; in this way, the speech data is modeled in segments.
  • The speech sequence can be cut into multiple slices to obtain slice sequences; for example, cutting the speech sequence into N slices yields N slice sequences.
  • In step S250 of some embodiments, text conversion is performed on the multiple slice sequences through the second neural network to obtain the text message.
  • The second neural network may be an RNN built from multiple GRU units; the N slices obtained in step S240 are used as the N inputs of the RNN, and the text message output by the RNN is obtained, completing the conversion of voice data into a text message.
  • The gradients of a plain RNN are prone to decay or explode. Although clipping gradients can cope with explosion, it cannot solve gradient decay, which in practice makes it difficult for an RNN to capture dependencies across large time-step distances in a time series.
  • The embodiment of the present application therefore adopts GRU units in the RNN, which better capture dependencies across large time-step distances and control the flow of information, achieving a better training effect and making the converted text messages more accurate.
  • The first neural network and the second neural network together form a voice model, and the voice model converts voice data into text messages.
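  • As a rough, non-authoritative sketch of that voice model (the patent does not give layer sizes), the first network can be a small CNN stack over the time-frequency plane and the second an RNN built from GRU units; all dimensions below are invented for illustration.

```python
# Illustrative PyTorch sketch of the voice model: a CNN front end (the "first
# neural network") whose output is sliced along time and fed to a GRU-based RNN
# (the "second neural network"). Layer sizes and vocabulary are assumptions.
import torch
import torch.nn as nn

class VoiceModel(nn.Module):
    def __init__(self, n_freq_bins=128, hidden=256, vocab_size=5000):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.rnn = nn.GRU(32 * n_freq_bins, hidden, num_layers=2,
                          batch_first=True)
        self.proj = nn.Linear(hidden, vocab_size)

    def forward(self, spectrogram):            # (batch, 1, time, n_freq_bins)
        feats = self.cnn(spectrogram)          # (batch, 32, time, n_freq_bins)
        b, c, t, f = feats.shape
        # Each time step becomes one "slice" of the speech sequence.
        slices = feats.permute(0, 2, 1, 3).reshape(b, t, c * f)
        out, _ = self.rnn(slices)              # N slices -> N RNN inputs
        return self.proj(out)                  # per-slice text-token logits
```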
  • A loss function is used to optimize the speech model; specifically, the CTC loss function is used.
  • The loss function is shown in formula (1): L(X, Z) = -ln p(Z|X), where X represents a given segment of speech, Z represents the text corresponding to X, p represents probability, and p(Z|X) is obtained by summing, over the alignment paths that collapse to Z, the product (∏) of the per-frame probabilities. Minimizing the loss thus maximizes the probability of the correct text.
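  • A hedged usage sketch of this objective with PyTorch's built-in CTC loss follows; the batch size, sequence lengths, and vocabulary size are made up for illustration.

```python
# Illustrative use of torch.nn.CTCLoss to optimize the speech model; minimizing
# the loss maximizes p(Z|X), the probability of the correct text Z given speech X.
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0)  # CTC reserves index 0 here for the "blank" label
log_probs = torch.randn(50, 4, 5000, requires_grad=True).log_softmax(2)  # (time, batch, vocab)
targets = torch.randint(1, 5000, (4, 12))          # the text Z for each clip X
input_lengths = torch.full((4,), 50, dtype=torch.long)
target_lengths = torch.full((4,), 12, dtype=torch.long)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # gradients flow back toward the voice model's outputs
```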
  • step S300 specifically includes, but is not limited to, steps S310 to S330.
  • Step S310: performing self-attention screening on the facial image through the third neural network to obtain transformation parameters.
  • Step S320: warping the facial image according to the transformation parameters to obtain a transformed image.
  • Step S330: performing expression recognition on the facial image and the transformed image through the fourth neural network to obtain the expression message.
  • The facial image is screened by self-attention through the third neural network to obtain a transformation parameter, namely the parameter θ.
  • The third neural network in this embodiment refers to the self-attention network, which consists of two convolutional layers and two fully connected layers and can locate key areas of the face. Different expressions have different key areas: when the user is angry, the key area of the facial expression is the eyebrows; when the user is happy, it is the mouth; when the user is surprised, it is the mouth, the eyes, and so on. Using the self-attention network therefore allows more accurate expression classification of facial images.
  • The facial image is warped according to the transformation parameter θ to obtain a transformed image.
  • The transformation parameters can take various forms. For example, if the transformation parameter is a transformation direction, and the direction is a 90-degree clockwise rotation, then the orientation of the facial image is transformed according to that parameter.
  • In step S330 of some embodiments, feature extraction is performed on the facial image to obtain the corresponding feature vector; the feature vector and the transformed image obtained in step S320 are input to the fourth neural network, which outputs the expression classification result.
  • The VGG-19 network can be used to extract features from facial images.
  • The classification messages described in this application can take various forms; for example, the output may be an expression image or expression text. If it is an expression image, the image is converted into a vector to obtain the expression message; if it is expression text, the text is converted into a vector to obtain the expression message.
  • the fourth neural network includes a convolutional layer, a fully connected layer, and a classifier.
  • step S330 specifically includes, but is not limited to, steps S331 to S333.
  • Step S331: performing feature extraction on the facial image and the transformed image through the convolutional layer to obtain a plurality of image feature vectors.
  • Step S332: splicing the plurality of image feature vectors through a fully connected layer to obtain an image splicing vector.
  • Step S333: the classifier performs expression classification on the image splicing vector to obtain the expression message.
  • In step S331 of some embodiments, the facial image and the transformed image are input to the convolutional layer of the fourth neural network, and feature extraction is performed through the convolutional layer to obtain a plurality of image feature vectors.
  • In step S332 of some embodiments, the multiple image feature vectors are input to the fully connected layer and spliced to obtain an image splicing vector.
  • In step S333 of some embodiments, the image splicing vector is input to the classifier, which outputs the expression classification result, and the expression message is obtained from the classification result.
  • the classifier referred to in this application may be a Softmax classifier or the like.
  • The third neural network and the fourth neural network together constitute an expression recognition model, which realizes expression classification of facial images.
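  • A simplified, non-authoritative sketch of the fourth neural network (steps S331 to S333) is given below: convolutional feature extraction for the facial image and the transformed image, splicing in a fully connected layer, and a softmax classifier over the six expression classes named above. All layer sizes are assumptions.

```python
# Illustrative PyTorch sketch of the fourth neural network: convolutional layer,
# fully connected splicing layer, and softmax classifier. Sizes are assumptions.
import torch
import torch.nn as nn

EXPRESSIONS = ["happy", "sad", "angry", "neutral", "surprised", "scared"]

class ExpressionClassifier(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4))
        self.fc = nn.Linear(2 * 16 * 4 * 4, feat_dim)  # splice both vectors
        self.head = nn.Linear(feat_dim, len(EXPRESSIONS))

    def forward(self, face, transformed):
        v1 = self.conv(face).flatten(1)         # feature vector of face image
        v2 = self.conv(transformed).flatten(1)  # feature vector of warped image
        spliced = torch.cat([v1, v2], dim=1)    # the "image splicing vector"
        logits = self.head(torch.relu(self.fc(spliced)))
        return logits.softmax(dim=1)            # probability per expression
```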
  • A loss function is used to optimize the expression recognition model, for example the cross-entropy loss function shown in formula (2): L = -(1/N) Σ_i Σ_{c=1}^{M} y_ic log(p_ic), where M is the number of categories, y_ic indicates whether category c is the real category of observation sample i, and p_ic is the predicted probability that sample i belongs to category c among the M categories.
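  • A tiny numeric illustration of formula (2) for a single sample (N = 1) with M = 6 expression classes; the probabilities are invented.

```python
# Cross entropy for one sample whose true class is "angry" (index 2).
import numpy as np

y = np.array([0, 0, 1, 0, 0, 0])                  # one-hot true category y_ic
p = np.array([0.05, 0.05, 0.7, 0.1, 0.05, 0.05])  # predicted probabilities p_ic
loss = -np.sum(y * np.log(p))                     # = -log(0.7), about 0.357
```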
  • Face photos, that is, the facial images mentioned in the embodiments of the present application, are subjected to feature extraction through VGG-19 to obtain the image feature vectors corresponding to the facial images; at the same time, the face photo is input to the self-attention network, which generates a parameter θ, and T_θ(G) is obtained according to θ.
  • T_θ(G) is equivalent to an affine transformation of the input, with θ as the transformation parameter; it generates a warped sample of the input face photo, that is, the transformed image, which helps the neural network find important expression-related areas of the face.
  • feature extraction is performed on the transformed image to obtain an image feature vector corresponding to the transformed image.
  • the image feature vector corresponding to the face image and the image feature vector corresponding to the transformed image are input to the two fully connected layers, and the fully connected layer outputs the classification result of the expression.
  • The embodiment of the present application introduces an attention mechanism that can locate different key areas of the face for different expressions, so that the neural network focuses on expression-related areas of the face, making expression recognition more accurate.
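  • If θ is read as the 2x3 matrix of an affine transformation, the warping step T_θ(G) can be sketched with PyTorch's grid-sampling utilities; the fixed θ below stands in for the value the self-attention network would predict.

```python
# Illustrative affine warp T_theta(G): theta here is a made-up 2x3 affine matrix
# standing in for the parameter predicted by the self-attention network.
import torch
import torch.nn.functional as F

face = torch.randn(1, 3, 224, 224)         # one face image (N, C, H, W)
theta = torch.tensor([[[1.0, 0.1, 0.0],
                       [0.0, 1.0, 0.0]]])  # mild shear, for illustration only
grid = F.affine_grid(theta, face.shape, align_corners=False)
transformed = F.grid_sample(face, grid, align_corners=False)  # warped sample
```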
  • step S500 specifically includes, but is not limited to, steps S510 to S550.
  • Step S510: performing voice conversion on the answer text message to obtain a preliminary voice message.
  • Step S520: transforming the preliminary voice message to obtain a spectrogram.
  • Step S530: extracting audio features from the spectrogram.
  • Step S540: decoding the audio features through the fifth neural network to obtain the audio data corresponding to each frame.
  • Step S550: synthesizing the audio data to obtain the corresponding answer voice message.
  • Voice conversion is performed on the reply text message to obtain a preliminary voice message; this conversion can be performed by text-to-speech software.
  • The preliminary voice message is transformed to obtain a spectrogram.
  • The preliminary voice message refers to the sound signal corresponding to the text message; this sound signal can be converted into a corresponding two-dimensional signal through the STFT to obtain a spectrogram.
  • The principle of the STFT is: divide a long signal into frames and apply a window, then perform a fast Fourier transform (FFT) on each frame, and finally stack the per-frame results along another dimension to obtain a picture-like two-dimensional signal, that is, the corresponding spectrogram.
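  • A minimal sketch of producing such a spectrogram with SciPy follows; the sample rate, window length, and overlap are assumptions, and the random signal merely stands in for the preliminary voice message.

```python
# Illustrative STFT spectrogram: frame the signal, window and FFT each frame,
# then stack the per-frame spectra into a 2-D array.
import numpy as np
from scipy.signal import stft

fs = 16000                          # assumed sample rate (Hz)
signal = np.random.randn(fs * 2)    # stand-in for the preliminary voice message
freqs, frames, Zxx = stft(signal, fs=fs, nperseg=512, noverlap=384)
spectrogram = np.abs(Zxx)           # magnitude spectrogram (freq x time)
```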
  • An encoder is used to extract MFCC audio features from the spectrogram.
  • The fifth neural network, based on the self-attention mechanism, is used to decode the audio features to obtain the audio data corresponding to each frame.
  • The fifth neural network is an RNN, specifically composed of two GRU network layers, where each GRU layer includes 256 GRU units.
  • In step S550 of some embodiments, when generating audio from the spectrum, the law of phase change between consecutive frames must be considered. Therefore, after the audio corresponding to each frame is obtained, the Griffin-Lim reconstruction algorithm is used to fine-tune the phase changes between consecutive frames and then generate consecutive frames of audio, yielding the corresponding answer voice message. It should be noted that when the phase change between consecutive frames is large, an intermediate phase needs to be obtained so that the phase change of consecutive audio frames does not become too large and degrade the generated reply voice message.
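  • A hedged sketch of the Griffin-Lim reconstruction step, here via librosa (the patent does not name a library); the test tone and iteration count are assumptions.

```python
# Illustrative Griffin-Lim reconstruction: estimate the missing phase from the
# relationships between adjacent frames, then rebuild a waveform.
import numpy as np
import librosa

tone = librosa.tone(440, sr=16000, length=16000)     # 1 s test tone, 440 Hz
magnitude = np.abs(librosa.stft(tone))               # magnitude spectrogram
waveform = librosa.griffinlim(magnitude, n_iter=32)  # iterative phase refinement
```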
  • The embodiment of the present application can also change output audio parameters, such as intonation, according to different expressions, so that the robot gives more appropriate answers.
  • A further step is included: building the first model, which specifically includes, but is not limited to, steps S610 to S650.
  • Step S610: acquiring a message data set.
  • Step S620: performing word segmentation on multiple question sample data to obtain multiple question word-segmentation data.
  • Step S630: performing word segmentation on multiple answer sample data to obtain multiple answer word-segmentation data.
  • Step S640: acquiring the first original model.
  • Step S650: training the first original model according to the multiple question word-segmentation data, the multiple answer word-segmentation data, and multiple preset expressions to obtain the first model.
  • a message data set used for model training is obtained.
  • The message data set includes a plurality of question sample data, a plurality of preset expressions, and a plurality of answer sample data; the question sample data and the preset expressions correspond one to one to form binding groups, and each binding group has a mapping relationship with answer sample data.
  • A Chinese word segmentation tool such as jieba or an Analyzer is used to perform word segmentation on the plurality of question sample data to obtain the plurality of question word-segmentation data.
  • Likewise, a Chinese word segmentation tool such as jieba or an Analyzer is used to perform word segmentation on the multiple answer sample data to obtain the multiple answer word-segmentation data.
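  • A small jieba example of this segmentation step follows; the sample question and answer are invented.

```python
# Illustrative word segmentation with jieba's precise mode.
import jieba

question = "今天天气怎么样"              # invented question sample
answer = "今天天气很好"                 # invented answer sample
question_tokens = jieba.lcut(question)  # e.g. ['今天', '天气', '怎么样']
answer_tokens = jieba.lcut(answer)      # e.g. ['今天', '天气', '很', '好']
```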
  • The first original model is obtained; it may specifically be a Seq2seq model that has not yet been trained.
  • the first original model is trained according to a plurality of question word segmentation data, a plurality of answer word segmentation data and a plurality of preset expressions to obtain a first model.
  • Step S650 specifically includes, but is not limited to, the following steps, culminating in updating the first original model according to the loss value to obtain the first model.
  • The plurality of question word-segmentation data and the plurality of answer word-segmentation data are input into an encoder for first encoding to obtain sample encoding data.
  • Here the encoder refers to word2vec, and the generated sample encoding data are word embedding vectors.
  • The plurality of preset expressions are input into word2vec for second encoding to obtain expression encoding data.
  • The sample encoding data and the expression encoding data are input into the Seq2seq model for training.
  • The sample encoding data and the expression encoding data are spliced through the Seq2seq model to obtain sample splicing data, which is input to the decoder for decoding to obtain sample decoding data; according to the sample splicing data and the sample decoding data, the loss function of the first original model, for example a cross-entropy loss function, is calculated to obtain a loss value; the first original model is then updated according to the loss value to obtain the first model.
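  • The two encoding steps can be sketched as follows; the tiny corpus, vector size, and the random table standing in for the expression encoding are all invented for illustration.

```python
# Illustrative first and second encodings: word2vec embeds the segmented tokens,
# a separate (here random) table encodes the preset expressions, and the two
# codes are spliced before being fed to the Seq2seq decoder.
import numpy as np
from gensim.models import Word2Vec

corpus = [["今天", "天气", "怎么样"], ["今天", "天气", "很", "好"]]
w2v = Word2Vec(sentences=corpus, vector_size=64, min_count=1)  # first encoding
sample_encoding = np.stack([w2v.wv[t] for t in corpus[0]])     # word embeddings

expressions = ["happy", "sad", "angry", "neutral", "surprised", "scared"]
rng = np.random.default_rng(0)
expr_table = {e: rng.normal(size=64) for e in expressions}     # second encoding
spliced = np.concatenate([sample_encoding,
                          expr_table["happy"][None, :]], axis=0)  # splicing
```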
  • The embodiment of the present application also uses an attention model to focus on key positions of the question.
  • The present application uses multiple modules to realize the expression recognition-based voice message generation method.
  • The modules include: a speech recognition module, an expression recognition module, a text understanding module, and a voice conversion module.
  • The specific method includes: the speech recognition module recognizes the voice information of the user speaking to the chat robot and converts it into corresponding text.
  • The camera captures an image of the user speaking and crops the face area to obtain a face-area image, which is then input to the expression recognition module; the expression recognition module recognizes the corresponding expression.
  • The text obtained by the speech recognition module and the expression obtained by the expression recognition module are input into the text understanding module, which generates a text answer according to the text and the expression.
  • Finally, the text answer is input into the voice conversion module to generate a voice answer, completing the expression recognition-based voice message generation process.
  • The voice message generation method based on expression recognition proposed in the embodiment of the present application acquires voice data and its corresponding facial image, performs speech recognition on the voice data to obtain a text message, and performs expression recognition on the facial image to obtain an expression message; the text message and the expression message are input into the first model, which obtains the answer text message according to them, and finally voice conversion is performed on the answer text message to obtain the corresponding answer voice message.
  • In this way, the facial image is added to the chat robot's input.
  • By recognizing the facial image, the current situation can be judged more accurately; the first model obtains the answer text message according to the text message and the expression message, and the answer text message is converted into a voice reply message, thereby improving the accuracy of the voice reply message.
  • The embodiment of the present application also provides a voice message generation device based on expression recognition, with which the above voice message generation method can be realized.
  • The device includes: a data acquisition module 710, a voice recognition module 720, an expression recognition module 730, a text message acquisition module 740, and a voice message acquisition module 750. The data acquisition module 710 is used to acquire voice data and the corresponding facial image; the voice recognition module 720 is used to perform speech recognition on the voice data to obtain a text message; the expression recognition module 730 is used to perform expression recognition on the facial image to obtain an expression message; the text message acquisition module 740 is used to input the text message and the expression message to the first model, which obtains an answer text message according to the text message and the expression message; and the voice message acquisition module 750 is used to perform voice conversion on the answer text message to obtain the corresponding answer voice message.
  • In this way, the facial image is added to the chat robot's input.
  • The answer text message is obtained by the first model according to the text message and the expression message, and the answer text message is converted into a voice reply message, thereby improving the accuracy of the voice reply message.
  • The expression recognition-based voice message generation device of this embodiment is used to execute the expression recognition-based voice message generation method of the above embodiment; its specific processing is the same as that of the method and is not repeated here.
  • The embodiment of the present application also provides a computer device, including:
  • at least one processor; and
  • a memory storing instructions executable by the at least one processor, such that when the at least one processor executes the instructions, a voice message generation method based on expression recognition is implemented.
  • The voice message generation method includes: acquiring voice data and a corresponding facial image; performing speech recognition on the voice data to obtain a text message; performing expression recognition on the facial image to obtain an expression message; inputting the text message and the expression message to the first model, which obtains an answer text message according to the text message and the expression message; and performing voice conversion on the answer text message to obtain the corresponding answer voice message.
  • the computer device includes: a processor 810 , a memory 820 , an input/output interface 830 , a communication interface 840 and a bus 850 .
  • The processor 810 can be implemented by a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is used to execute related programs to realize the technical solutions provided by the embodiments of the present application.
  • the memory 820 may be implemented in the form of a read-only memory (Read Only Memory, ROM), a static storage device, a dynamic storage device, or a random access memory (Random Access Memory, RAM).
  • the memory 820 can store an operating system and other application programs.
  • the relevant program codes are stored in the memory 820, and are called by the processor 810 to execute the expression recognition-based voice message generation method of the embodiments of the present application;
  • the input/output interface 830 is used to realize information input and output
  • the communication interface 840 is used to realize communication between this device and other devices; communication can be realized in a wired manner (such as USB or network cable) or wirelessly (such as a mobile network, Wi-Fi, or Bluetooth); and
  • bus 850 to transfer information between various components of the device (eg, processor 810, memory 820, input/output interface 830, and communication interface 840);
  • the processor 810 , the memory 820 , the input/output interface 830 and the communication interface 840 are connected to each other within the device through the bus 850 .
  • An embodiment of the present application further provides a storage medium, the storage medium is a computer-readable storage medium, and the computer-readable storage medium stores computer-executable instructions.
  • The computer-executable instructions are used to make the computer execute a voice message generation method based on expression recognition: acquiring voice data and a corresponding facial image; performing speech recognition on the voice data to obtain a text message; performing expression recognition on the facial image to obtain an expression message; inputting the text message and the expression message to the first model, which obtains an answer text message; and performing voice conversion on the answer text message to obtain the corresponding answer voice message.
  • the computer-readable storage medium may be non-volatile or volatile.
  • memory can be used to store non-transitory software programs and non-transitory computer-executable programs.
  • the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage devices.
  • the memory optionally includes memory located remotely from the processor, and these remote memories may be connected to the processor via a network. Examples of the aforementioned networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • The flowcharts shown in FIG. 1 to FIG. 7 do not limit the embodiments of the present application; the embodiments may include more or fewer steps than illustrated, combine some steps, or use different steps.
  • The device embodiments described above are only illustrative. The units described as separate components may or may not be physically separated, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.
  • If the integrated unit is realized in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • Based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, can be embodied in the form of a software product.
  • The computer software product is stored in a storage medium and includes multiple instructions to make a computer device (which can be a personal computer, a server, a network device, etc.) execute all or part of the steps of the methods described in the embodiments of this application.
  • The aforementioned storage media include: USB flash drives, removable hard disks, read-only memory (ROM), random access memory (RAM), magnetic disks, optical discs, and other media that can store programs.

Abstract

Embodiments of the present application provide a voice message generation method and apparatus based on facial expression recognition, a computer device, and a storage medium, and relate to the technical field of artificial intelligence. The voice message generation method based on facial expression recognition comprises: obtaining voice data and a corresponding facial image, performing voice recognition on the voice data to obtain a text message, and performing facial expression recognition on the facial image to obtain a facial expression message; inputting the text message and the facial expression message into a first model, and obtaining an answer text message by means of the first model according to the text message and the facial expression message; and finally performing voice conversion on the answer text message to obtain a corresponding answer voice message. According to the embodiments of the present application, the facial image is added into the chat robot's input; by recognizing the facial image, the current scene can be determined more accurately; the first model obtains the answer text message according to the text message and the facial expression message; and the answer text message is converted into a voice reply message, such that the accuracy of the voice reply message is improved.

Description

语音消息生成方法和装置、计算机设备、存储介质Voice message generation method and device, computer equipment, storage medium
本申请要求于2022年01月18日提交中国专利局、申请号为202210057040.4,发明名称为“语音消息生成方法和装置、计算机设备、存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application claims the priority of the Chinese patent application with the application number 202210057040.4 submitted to the China Patent Office on January 18, 2022, and the invention title is "Voice Message Generation Method and Device, Computer Equipment, Storage Medium", the entire content of which is incorporated in this application by reference.
技术领域technical field
本申请涉及人工智能技术领域,尤其涉及一种语音消息生成方法和装置、计算机设备、存储介质。The present application relates to the technical field of artificial intelligence, and in particular to a voice message generation method and device, computer equipment, and storage media.
背景技术Background technique
随着计算机技术的发展,如即时通信工具、手机短信等通讯手段日渐风行。基于这些通讯手段,除了实现人与人之间的沟通交流外,也使得人与人工智能系统之间的沟通交流成为可能。例如,聊天机器人就是一种借助于通讯手段实现与人沟通交流的人工智能系统。With the development of computer technology, communication means such as instant messaging tools and mobile phone text messages are becoming more and more popular. Based on these means of communication, in addition to realizing communication between people, it also makes it possible to communicate between people and artificial intelligence systems. For example, a chatbot is an artificial intelligence system that communicates with people by means of communication.
At present, chatbots fall into two types: active interaction and passive interaction. In active interaction, the robot initiates the exchange, interacting with humans by sharing or recommending trending information of interest to the user. In passive interaction, the user initiates the dialogue, and the machine understands it and responds accordingly.
Most chatbots that users currently encounter are of the passive interaction type. The inventors realized that the interaction capability of current passive chatbots is rather limited: they can only answer according to the text recognized from the user's speech, and this single recognition method often compromises the accuracy of the voice reply messages the chatbot generates.
Technical Problem
The technical problem of the prior art recognized by the inventors is as follows: the interaction capability of current passive chatbots is limited to answering according to the text recognized from the user's speech, and this single recognition method often compromises the accuracy of the generated voice reply messages.
Technical Solution
In a first aspect, an embodiment of the present application provides a voice message generation method based on expression recognition, the method comprising:
obtaining voice data and a corresponding facial image;
performing voice recognition on the voice data to obtain a text message;
performing expression recognition on the facial image to obtain an expression message;
inputting the text message and the expression message into a first model, the first model obtaining an answer text message according to the text message and the expression message; and
performing voice conversion on the answer text message to obtain a corresponding answer voice message.
In a second aspect, an embodiment of the present application provides a voice message generation apparatus based on expression recognition, comprising:
a data acquisition module, configured to obtain voice data and a corresponding facial image;
a voice recognition module, configured to perform voice recognition on the voice data to obtain a text message;
an expression recognition module, configured to perform expression recognition on the facial image to obtain an expression message;
a text message acquisition module, configured to input the text message and the expression message into a first model, the first model obtaining an answer text message according to the text message and the expression message; and
a voice message acquisition module, configured to perform voice conversion on the answer text message to obtain a corresponding answer voice message.
In a third aspect, an embodiment of the present application provides a computer device comprising a memory and a processor, wherein the memory stores a program, and when the program is executed by the processor, the processor performs a voice message generation method based on expression recognition, the method comprising:
obtaining voice data and a corresponding facial image;
performing voice recognition on the voice data to obtain a text message;
performing expression recognition on the facial image to obtain an expression message;
inputting the text message and the expression message into a first model, the first model obtaining an answer text message according to the text message and the expression message; and
performing voice conversion on the answer text message to obtain a corresponding answer voice message.
In a fourth aspect, an embodiment of the present application provides a storage medium, which is a computer-readable storage medium storing computer-executable instructions that cause a computer to execute a voice message generation method based on expression recognition, the method comprising:
obtaining voice data and a corresponding facial image;
performing voice recognition on the voice data to obtain a text message;
performing expression recognition on the facial image to obtain an expression message;
inputting the text message and the expression message into a first model, the first model obtaining an answer text message according to the text message and the expression message; and
performing voice conversion on the answer text message to obtain a corresponding answer voice message.
Beneficial Effects
The expression recognition-based voice message generation method and apparatus, computer device, and storage medium provided in the embodiments of the present application obtain voice data and a corresponding facial image, perform voice recognition on the voice data to obtain a text message, and perform expression recognition on the facial image to obtain an expression message; the text message and the expression message are input into a first model, which obtains an answer text message from them; finally, voice conversion is performed on the answer text message to obtain a corresponding answer voice message. By adding the facial image to the chatbot, the current scene can be judged more accurately through recognition of the facial image; the first model derives the answer text message from both the text message and the expression message, and the answer text message is converted into a voice reply message, thereby improving the accuracy of the voice reply message.
Brief Description of the Drawings
The accompanying drawings are provided for a further understanding of the technical solution of the present application and constitute a part of the specification. Together with the embodiments, they serve to explain the technical solution of the present application and do not limit it.
Fig. 1 is a first flowchart of the expression recognition-based voice message generation method provided by an embodiment of the present application;
Fig. 2 is a flowchart of step S200 in Fig. 1;
Fig. 3 is a flowchart of step S300 in Fig. 1;
Fig. 4 is a flowchart of step S330 in Fig. 3;
Fig. 5 is a flowchart of step S500 in Fig. 1;
Fig. 6 is a second flowchart of the expression recognition-based voice message generation method provided by an embodiment of the present application;
Fig. 7 is a flowchart of a practical application of the expression recognition-based voice message generation method provided by an embodiment of the present application;
Fig. 8 is a block diagram of the module structure of the expression recognition-based voice message generation apparatus provided by an embodiment of the present application;
Fig. 9 is a schematic diagram of the hardware structure of a computer device provided by an embodiment of the present application.
Embodiments of the Present Invention
To make the purpose, technical solution, and advantages of the present application clearer, the present application is further described in detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are intended only to explain the present application, not to limit it.
It should be noted that although functional modules are divided in the device schematic and a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed with a module division different from that of the device, or in an order different from that of the flowcharts. The terms "first", "second", and the like in the specification, claims, and drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field to which this application belongs. The terms used herein are only for the purpose of describing the embodiments of the present application and are not intended to limit it.
First, several terms involved in this application are explained:
Artificial intelligence (AI): a new technical science that studies and develops theories, methods, technologies, and application systems for simulating, extending, and expanding human intelligence. As a branch of computer science, artificial intelligence attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can respond in a manner similar to human intelligence; research in this field includes robotics, speech recognition, image recognition, natural language processing, and expert systems. Artificial intelligence can simulate the information processes of human consciousness and thinking. It is also the theory, method, technology, and application system of using digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results.
Chatbot (Chatterbot): a computer program that converses via dialogue or text, able to simulate human conversation and pass the Turing test. Chatbots can serve practical purposes such as customer service or information acquisition. Some chatbots carry natural language processing systems, but most simple systems only extract keywords from the input and retrieve the most suitable response from a database. Chatbots form part of virtual assistants (such as Google Assistant) and can connect with many organizations' applications, websites, and instant messaging platforms (e.g., Facebook Messenger). Non-assistant applications include chat rooms for entertainment, research and specific product promotion, and social bots.
Convolutional neural network (CNN): a class of feedforward neural networks that involve convolution computation and have a deep structure, and one of the representative algorithms of deep learning. Convolutional neural networks have representation learning capability and can perform shift-invariant classification of input information according to their hierarchical structure. With the introduction of deep learning theory and improvements in numerical computing hardware, convolutional neural networks have developed rapidly and are applied in computer vision, natural language processing, and other fields. Modeled on the biological mechanism of visual perception, they support both supervised and unsupervised learning; the sharing of convolution kernel parameters within hidden layers and the sparsity of inter-layer connections enable a convolutional neural network to learn grid-like topology features, such as pixels and audio, with a small amount of computation, with stable effect and without additional feature engineering requirements on the data.
Recurrent neural network (RNN): a class of recursive neural networks that take sequence data as input, recurse along the direction of sequence evolution, and have all nodes (recurrent units) connected in a chain; the bidirectional recurrent neural network (Bi-RNN) and the long short-term memory network (LSTM) are common recurrent neural networks. Recurrent neural networks have memory, parameter sharing, and Turing completeness, and therefore hold certain advantages when learning the nonlinear characteristics of sequences. They are applied in natural language processing (NLP), for example speech recognition, language modeling, and machine translation, and are also used in various kinds of time series forecasting. Recurrent neural networks that incorporate convolutional neural network constructions can handle computer vision problems involving sequence input.
Gated recurrent unit (GRU): a gating mechanism in recurrent neural networks. Like other gating mechanisms, it aims to solve the vanishing/exploding gradient problem of standard RNNs while retaining the long-term information of the sequence. GRU performs as well as LSTM on many sequence tasks such as speech recognition, but has fewer parameters, containing only a reset gate and an update gate.
CTC (Connectionist temporal classification): a loss function for sequence labeling problems, mainly used to handle the alignment of input and output labels. Traditional sequence labeling algorithms require the input and output symbols to be perfectly aligned at every moment, whereas CTC expands the label set by adding a blank element. After a sequence is labeled with the expanded label set, every predicted sequence that can be converted into the true sequence through the mapping function is a correct prediction; that is, the predicted sequence can be obtained without data alignment. The objective function maximizes the sum of the probabilities of all correct predicted sequences, and a forward-backward algorithm is used when searching for them.
Region of interest (ROI): in machine vision and image processing, the region to be processed, outlined on the image in the form of a box, circle, ellipse, irregular polygon, or the like.
OpenCV: a cross-platform computer vision and machine learning software library released under the Apache 2.0 (open source) license that runs on Linux, Windows, Android, and Mac OS. It is lightweight and efficient, consists of a series of C functions and a small number of C++ classes, provides interfaces for languages such as Python, Ruby, and MATLAB, and implements many general-purpose algorithms in image processing and computer vision. OpenCV is written in C++; it has C++, Python, Java, and MATLAB interfaces, supports Windows, Linux, Android, and Mac OS, leans mainly toward real-time vision applications, takes advantage of MMX and SSE instructions when available, and now also offers support for C#, Ch, Ruby, and Go.
VGG model (Visual Geometry Group Network): a network from work related to ILSVRC 2014, whose main contribution was to demonstrate that increasing network depth can, to a certain extent, affect the network's final performance. VGG has two structures, VGG16 and VGG19, which do not differ in essence but only in network depth. One improvement of VGG16 over AlexNet is the replacement of AlexNet's larger convolution kernels (11x11, 7x7, 5x5) with several consecutive 3x3 kernels. For a given receptive field (the local size of the input image related to the output), stacked small convolution kernels are preferable to a large kernel, because multiple nonlinear layers increase network depth to ensure that more complex patterns are learned, at a relatively small cost (fewer parameters).
Embedding: a vector representation in which an object is represented by a low-dimensional vector; the object may be a word, a product, a movie, and so on. The property of the embedding vector is that objects whose vectors are close in distance have similar meanings: for example, the distance between embedding("The Avengers") and embedding("Iron Man") is very small, while the distance between embedding("The Avengers") and embedding("Gone with the Wind") is larger. An embedding is essentially a mapping from semantic space to vector space that preserves, as far as possible, the relationship of the original samples in the semantic space; for example, two semantically close words are also relatively close in the vector space. Embeddings can encode objects with low-dimensional vectors while retaining their meaning, and are often applied in machine learning: in the process of building a model, an object is encoded as a low-dimensional dense vector and then passed to a DNN to improve efficiency.
Cross entropy: an important concept in Shannon's information theory, mainly used to measure the difference between two probability distributions. The performance of a language model is usually measured by cross entropy and perplexity. Cross entropy can be interpreted as the difficulty of recognizing a text with the model or, from a compression point of view, how many bits are needed on average to encode each word. Perplexity indicates the average number of branches the model assigns to the text; its reciprocal can be regarded as the average probability of each word. Smoothing refers to assigning a probability value to unobserved N-gram combinations so that a word sequence can always obtain a probability value from the language model. Commonly used smoothing techniques include Turing estimation, deleted interpolation smoothing, Katz smoothing, and Kneser-Ney smoothing.
jieba tokenizer: the jieba ("stutter") tokenizer is an open-source Chinese word segmenter. Chinese word segmentation is a basic step in Chinese text processing and a basic module of Chinese human-computer natural language interaction; Chinese natural language processing usually requires word segmentation first, and the jieba tokenizer is commonly used for it. The jieba algorithm uses a prefix dictionary to achieve efficient word-graph scanning, generating a directed acyclic graph (DAG) of all possible word formations of the Chinese characters in a sentence; dynamic programming then finds the maximum-probability path and the maximum segmentation combination based on word frequency. For unregistered words, an HMM model based on the word-forming capability of Chinese characters is adopted, using the Viterbi algorithm. jieba supports three segmentation modes: the first is precise mode, which tries to cut the sentence most accurately and is suitable for text analysis; the second is full mode, which scans out all words in the sentence that can form words, which is very fast but cannot resolve ambiguity; the third is search engine mode, which, based on precise mode, further segments long words to improve recall and is suitable for search engine segmentation.
Analyzer tokenizer: an Analyzer is a component specializing in tokenization, generally comprising three parts: Character Filters, a Tokenizer (splitting into words according to rules), and Token Filters. Character Filters mainly preprocess the original text, for example removing HTML and special characters; the Tokenizer splits the text into words according to rules; Token Filters process the resulting words, including lowercasing, removing stopwords, and adding synonyms.
Encoder: encoding converts an input sequence into a fixed-length vector; decoding (decoder) converts the previously generated fixed vector back into an output sequence. The input sequence may be text, speech, an image, or video; the output sequence may be text or an image.
word2vec (word to vector): a group of related models used to produce word vectors. These models are shallow two-layer neural networks trained to reconstruct linguistic word contexts. The network is represented by words and must guess the input words at adjacent positions; under the bag-of-words assumption in word2vec, word order is unimportant. After training, the word2vec model can be used to map each word to a vector representing word-to-word relationships; this vector is the hidden layer of the neural network.
Self-attention mechanism (Attention Mechanism): an attention mechanism gives a neural network the ability to focus on a subset of its inputs (or features) by selecting specific inputs, and can be applied to any type of input regardless of its shape. Under limited computing power, the attention mechanism is a resource allocation scheme and the main means of solving the problem of information overload, allocating computing resources to the more important tasks.
Seq2Seq: an important RNN model, also known as the Encoder-Decoder model, which can be understood as an N×M model. The model contains two parts: the Encoder encodes the sequence information, compressing a sequence of arbitrary length into a vector c; the Decoder then decodes the context vector c and outputs it as a sequence.
Short-time Fourier transform (STFT): the ordinary Fourier transform is only suitable for stationary signals, whereas, for example, the whistle signals of dolphins are non-stationary signals whose frequency characteristics vary over time. To capture such time-varying features, time-frequency analysis of the signal is required; the short-time Fourier transform, the wavelet transform, the Hilbert-Huang transform, and the like are commonly used.
Mel-frequency cepstrum: a linear transform of the logarithmic energy spectrum based on the nonlinear mel scale of sound frequency. Mel-frequency cepstral coefficients (MFCCs) are the coefficients that make up the mel-frequency cepstrum, derived from the cepstrum of an audio segment. The difference between the cepstrum and the mel-frequency cepstrum is that the frequency bands of the latter are equally spaced on the mel scale, which approximates the human auditory system more closely than the linearly spaced frequency bands used in the normal log cepstrum. Such a nonlinear representation allows sound signals to be better represented in multiple domains.
Griffin-Lim: a vocoder, often used in speech synthesis to convert the acoustic parameters generated by a speech synthesis system into a speech waveform. This vocoder requires no training and no prior knowledge of the phase spectrum; instead, it estimates the phase information from the relationship between frames, thereby reconstructing the speech waveform.
Softmax classifier: the generalization of the logistic regression classifier to multiple classes; its output is the probability of belonging to each of the different categories.
With the development of computer technology, communication means such as instant messaging tools and mobile phone text messages have become increasingly popular. Beyond enabling communication between people, these means also make communication between people and artificial intelligence systems possible; for example, a chatbot is an artificial intelligence system that communicates with people through such means. Most chatbots that users currently encounter are of the passive interaction type, whose interaction capability is rather limited: they can only answer according to the text recognized from the user's speech, and this single recognition method often compromises the accuracy of the voice reply messages the chatbot generates.
On this basis, the embodiments of the present application provide a voice message generation method and apparatus based on expression recognition, a computer device, and a storage medium, which can improve the accuracy of the generated voice reply messages.
The embodiments of the present application provide a voice message generation method and apparatus based on expression recognition, a computer device, and a storage medium, which are specifically described through the following embodiments, beginning with the expression recognition-based voice message generation method.
The expression recognition-based voice message generation method provided in the embodiments of the present application relates to the field of artificial intelligence. The method may be applied in a terminal or on a server side, or implemented as software running on a terminal or server. In some embodiments, the terminal may be a smartphone, a tablet computer, a laptop, a desktop computer, a smart watch, or the like; the server side may be configured as an independent physical server, as a server cluster or distributed system composed of multiple physical servers, or as a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms; the software may be an application implementing the expression recognition-based voice message generation method, but the method is not limited to the above forms.
The embodiments of the present application can be used in numerous general-purpose or special-purpose computer system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments including any of the above systems or devices. The application may be described in the general context of computer-executable instructions, such as program modules, executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices linked through a communications network; in such environments, program modules may be located in both local and remote computer storage media, including storage devices.
Referring to Fig. 1, the expression recognition-based voice message generation method according to the first aspect of the embodiments of the present application includes, but is not limited to, steps S100 to S500.
Step S100, obtaining voice data and a corresponding facial image;
Step S200, performing voice recognition on the voice data to obtain a text message;
Step S300, performing expression recognition on the facial image to obtain an expression message;
Step S400, inputting the text message and the expression message into a first model, the first model obtaining an answer text message according to the text message and the expression message;
Step S500, performing voice conversion on the answer text message to obtain a corresponding answer voice message.
In step S100 of some embodiments, the voice data uttered by the user, i.e., what the user says to the chatbot, is captured through a microphone; at the same time, a camera captures images of the user speaking, specifically images of the user's face. In practical applications, some images captured by the camera may not contain the user's facial region, or may contain more than just the facial region, in which case the captured images need further screening. Specifically, images that do not contain the user's facial region may be deleted. To further improve the accuracy of expression recognition, a region of interest of the image, such as the face region, may also be detected, the face region being the region that the expressions in the embodiments of the present application need to focus on.
In some embodiments, the CascadeClassifier function in the open-source OpenCV library can also be used to automatically detect all face regions in a picture, thereby realizing face detection and localization of the image.
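For illustration only (not part of the original application), the following is a minimal sketch of how such face detection and localization might be done with OpenCV's CascadeClassifier; the cascade file, the threshold values, and the function name detect_face_regions are illustrative assumptions.

```python
import cv2

def detect_face_regions(image):
    """Return the cropped face regions of a BGR image (the list may be empty)."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Haar cascade for frontal faces shipped with opencv-python
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    # one (x, y, w, h) box per detected face; parameter values are illustrative
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return [image[y:y + h, x:x + w] for (x, y, w, h) in faces]
```

Frames for which the returned list is empty can then simply be discarded, matching the screening step described above.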
In step S200 of some embodiments, after the voice data uttered by the user is collected, the voice data also needs to be converted into text to obtain a text message.
In step S300 of some embodiments, after the user's facial image is collected, expression classification needs to be performed on the facial image: for example, determining which expression a given facial image shows and generating a corresponding expression message from that expression, such as a text vector or an image vector corresponding to the expression, for the first model to use in generating the answer text message. In the embodiments of the present application, expressions may be classified as happy, sad, angry, neutral, surprised, scared, and the like.
In step S400 of some embodiments, the text message and the expression message are input into the first model, and the first model obtains the answer text message according to the text message and the expression message.
In step S500 of some embodiments, voice conversion is performed on the answer text message to obtain the corresponding answer voice message; after the answer voice message is generated, the chatbot gives the user a corresponding spoken answer according to the voice message.
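As a reading aid only, the following hedged sketch shows how the outputs of steps S100 to S500 feed into one another, assuming the four processing stages are available as callables; all of the names are hypothetical.

```python
# all component names below are hypothetical; the sketch only illustrates
# the data flow between the five steps described above
def generate_answer_voice(voice_data, face_image, speech_to_text,
                          recognize_expression, first_model, text_to_speech):
    text_message = speech_to_text(voice_data)                     # step S200
    expression_message = recognize_expression(face_image)         # step S300
    answer_text = first_model(text_message, expression_message)   # step S400
    return text_to_speech(answer_text)                            # step S500
```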
In some embodiments, as shown in Fig. 2, step S200 specifically includes, but is not limited to, steps S210 to S250.
Step S210, performing an integral transform on the time-domain signal of the voice data to obtain a frequency-domain signal;
Step S220, constructing a planar space according to the time-domain signal and the frequency-domain signal;
Step S230, performing a convolution operation on the voice data in the planar space through a first neural network to obtain a voice sequence and its sequence length;
Step S240, slicing the voice sequence according to the sequence length to obtain multiple slice sequences;
Step S250, performing text conversion on the multiple slice sequences through a second neural network to obtain the text message.
In step S210 of some embodiments, an integral transform is performed on the time-domain signal of the voice data to obtain a frequency-domain signal. In the embodiments of the present application, the integral transform may be the Fourier transform, which converts the time-domain signal, otherwise difficult to process, into a frequency-domain signal that is easy to analyze. The function of the fast Fourier transform is to transform the digital signal from the time domain into the frequency domain, where the positions of higher energy can be analyzed; these positions may be the frequency bands in which the sounds of interest lie.
In step S220 of some embodiments, the time-domain signal and the frequency-domain signal are combined into a two-dimensional space, that is, a planar space.
In step S230 of some embodiments, a convolution operation is performed on the voice data in the planar space through the first neural network to obtain a voice sequence and its sequence length. The first neural network is composed of multiple CNNs and is used to perform the convolution operation on the voice data, yielding the voice sequence and the length of that sequence.
In step S240 of some embodiments, the voice sequence is sliced according to the sequence length. Specifically, the voice data is modeled, and during modeling the voice sequence can be cut into multiple slices to obtain slice sequences; for example, cutting the voice sequence into N slices yields N slice sequences.
In step S250 of some embodiments, text conversion is performed on the multiple slice sequences through the second neural network to obtain the text message. Specifically, the second neural network may be an RNN employing multiple GRU units; the N slices obtained in step S240 serve as the N inputs of the RNN, and the text message output by the RNN is obtained, completing the conversion of voice data into a text message. It should be noted that when the number of time steps is large or the time step is small, the gradients of an RNN tend to decay or explode. Although gradient clipping can cope with gradient explosion, it cannot solve gradient decay, which makes it difficult in practice for an RNN to capture dependencies over large time-step distances in a time series. For this reason, the embodiments of the present application use GRU units in the RNN, which better capture dependencies over large time-step distances and control the flow of information, achieving a better model training effect and making the converted text message more accurate.
In some embodiments, the first neural network and the second neural network can form a speech model capable of converting voice data into a text message. To further improve the training effect of the speech model, a loss function can be used to optimize it, for example the CTC loss function shown in formula (1), where X denotes a given segment of speech, Z denotes the text corresponding to X, Π denotes the product operation, p denotes probability, p(Z|X) denotes the probability of outputting Z given X, and L denotes the loss over the output probabilities of the texts Z corresponding to the inputs X. Minimizing the loss function amounts to maximizing the product of these probabilities; specifically, the mapping applies the strategy of de-duplicating the same letter when it appears multiple times in a row, together with the strategy of removing blanks.
$$L = -\ln \prod_{(X,Z)} p(Z|X) = -\sum_{(X,Z)} \ln p(Z|X) \qquad (1)$$
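For illustration, the following is a hedged PyTorch sketch of such a speech model: a CNN over the (time, frequency) plane as the first neural network, GRU layers over the slice sequence as the second neural network, and a CTC loss as in formula (1). The layer sizes, vocabulary size, and input shape are assumptions rather than values from the application.

```python
import torch
import torch.nn as nn

class SpeechModel(nn.Module):
    def __init__(self, n_freq=80, hidden=256, vocab_size=5000):
        super().__init__()
        # first neural network: convolution over the (time, frequency) plane
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1), nn.ReLU())
        # second neural network: GRU units over the resulting slice sequence
        self.gru = nn.GRU(32 * (n_freq // 4), hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, vocab_size + 1)  # +1 for the CTC blank

    def forward(self, spec):                    # spec: (batch, 1, time, n_freq)
        feats = self.conv(spec)                 # (batch, 32, T', n_freq // 4)
        b, c, t, f = feats.shape
        slices = feats.permute(0, 2, 1, 3).reshape(b, t, c * f)  # the N slices
        out, _ = self.gru(slices)               # each slice is one GRU input
        return self.fc(out).log_softmax(-1)     # (batch, T', vocab_size + 1)

model = SpeechModel()
spec = torch.randn(4, 1, 200, 80)               # dummy spectrogram batch
log_probs = model(spec).transpose(0, 1)         # CTC wants (T', batch, classes)
targets = torch.randint(0, 5000, (4, 30))       # dummy transcripts
loss = nn.CTCLoss(blank=5000)(                  # blank is the extra last class
    log_probs, targets,
    torch.full((4,), log_probs.size(0), dtype=torch.long),
    torch.full((4,), 30, dtype=torch.long))
```

Minimizing this loss maximizes the product of the probabilities p(Z|X); decoding then de-duplicates consecutive repetitions and removes blanks, as described above.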
In some embodiments, as shown in Fig. 3, step S300 specifically includes, but is not limited to, steps S310 to S330.
Step S310, performing self-attention screening on the facial image through a third neural network to obtain transformation parameters;
Step S320, warping the facial image according to the transformation parameters to obtain a transformed image;
Step S330, performing expression recognition on the facial image and the transformed image through a fourth neural network to obtain the expression message.
In step S310 of some embodiments, self-attention screening is performed on the facial image through the third neural network to obtain the transformation parameter, i.e., the parameter θ. In the embodiments of the present application, the third neural network refers to a self-attention network composed of two convolutional layers and two fully connected layers, which can locate the key regions of the face. Different expressions have different key regions: for example, when the user is angry, the key region of the facial expression is the brows and eyes; when the user is happy, it is the mouth; when the user is surprised, it is the mouth, the eyes, and so on. Using a self-attention network enables more accurate expression classification of facial images.
In step S320 of some embodiments, the facial image is warped according to the transformation parameter θ to obtain the transformed image. The transformation parameters can take various forms: for example, if the transformation parameter is a transformation direction, and the direction is a 90-degree clockwise rotation, the orientation of the facial image can be transformed accordingly; if the transformation parameter is a vertical flip, the facial image can be mirror-flipped accordingly, and so on, thereby determining which regions of the facial image belong to the key regions related to the expression.
In step S330 of some embodiments, feature extraction is performed on the facial image to obtain the corresponding feature vector, and the feature vector together with the transformed image obtained in step S320 is input into the fourth neural network, which outputs the expression classification result. In practical applications, a VGG-19 network can be used to extract features from the facial image. It should be noted that the classification message described in this application can take multiple forms, for example the form of an expression image or of expression text: if it is an expression image, the image is converted into a vector to obtain the expression message; if it is expression text, the text is converted into a vector to obtain the expression message.
In some embodiments, the fourth neural network includes a convolutional layer, a fully connected layer, and a classifier. As shown in Fig. 4, step S330 specifically includes, but is not limited to, steps S331 to S333.
Step S331, performing feature extraction on the facial image and the transformed image through the convolutional layer to obtain multiple image feature vectors;
Step S332, splicing the multiple image feature vectors through the fully connected layer to obtain an image splicing vector;
Step S333, performing expression classification on the image splicing vector through the classifier to obtain the expression message.
In step S331 of some embodiments, the facial image and the transformed image are input into the convolutional layer of the fourth neural network, and feature extraction is performed on them through the convolutional layer to obtain multiple image feature vectors.
In step S332 of some embodiments, the multiple image feature vectors are input into the fully connected layer, which splices them to obtain the image splicing vector.
In step S333 of some embodiments, the image splicing vector is input into the classifier, which outputs the expression classification result, and the expression message is obtained from the classification result. In practical applications, the classifier referred to in this application may be a Softmax classifier or the like.
In some embodiments, the third neural network and the fourth neural network can form an expression recognition model capable of classifying the expressions of facial images. To further improve the training effect of the expression recognition model, a loss function can be used to optimize it, for example the cross-entropy loss function shown in formula (2), where M is the number of categories, y_ic indicates the true category, and p_ic is the predicted probability that observed sample i belongs to category c among the M categories (with N denoting the number of observed samples).
$$L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{M} y_{ic} \log(p_{ic}) \qquad (2)$$
In some embodiments, feature extraction is performed through VGG-19 on the face photo, i.e., the facial image mentioned in the embodiments of the present application, to obtain the image feature vector corresponding to the facial image; at the same time, the face photo is input into the self-attention network to generate a parameter θ, from which T_θ(G) is obtained. Here T_θ(G) is equivalent to applying an affine transformation to the input, with θ as the transformation parameters; this amounts to generating a warped sample of the input face photo, i.e., the transformed image, which helps the neural network find the important expression-related regions of the face. Next, feature extraction is performed on the transformed image to obtain its corresponding image feature vector. Finally, the image feature vector of the facial image and that of the transformed image are input into two fully connected layers, which output the expression classification result. The embodiments of the present application introduce an attention mechanism that can locate different key regions of the face according to different expressions, making the neural network focus on the expression-related regions of the face and making expression recognition more accurate.
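A hedged PyTorch sketch of this flow follows: a small localization network (standing in for the two-convolution, two-FC self-attention network) predicts the affine parameters θ, the face is warped as T_θ(G) through an affine sampling grid, and the features of the original and warped images are concatenated and classified. The tiny feature extractor merely stands in for VGG-19, and all layer sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpressionModel(nn.Module):
    def __init__(self, n_classes=6):
        super().__init__()
        # third neural network: two conv + two FC layers predicting theta;
        # in practice the last layer is usually initialized to the identity
        self.loc = nn.Sequential(
            nn.Conv2d(3, 8, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(8, 16, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(16 * 4 * 4, 64), nn.ReLU(),
            nn.Linear(64, 6))                        # 6 affine parameters
        # small CNN standing in for the VGG-19 feature extractor
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # fourth-network head: splicing + fully connected layers + classifier
        self.head = nn.Sequential(
            nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, n_classes))

    def forward(self, face):                         # face: (batch, 3, H, W)
        theta = self.loc(face).view(-1, 2, 3)        # transformation params
        grid = F.affine_grid(theta, face.size(), align_corners=False)
        warped = F.grid_sample(face, grid, align_corners=False)  # T_theta(G)
        spliced = torch.cat([self.features(face),
                             self.features(warped)], dim=1)
        return self.head(spliced)                    # logits per expression
```

Training would apply the cross-entropy of formula (2) (e.g., nn.CrossEntropyLoss) to the returned logits.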
In some embodiments, as shown in Fig. 5, step S500 specifically includes, but is not limited to, steps S510 to S550.
Step S510, performing voice conversion on the answer text message to obtain a preliminary voice message;
Step S520, transforming the preliminary voice message to obtain a spectrogram;
Step S530, extracting the audio features of the spectrogram;
Step S540, decoding the audio features through a fifth neural network model to obtain the audio data corresponding to each frame;
Step S550, synthesizing the audio data to obtain the corresponding answer voice message.
In step S510 of some embodiments, voice conversion is performed on the answer text message to obtain a preliminary voice message; in practical applications, the conversion can be performed by software such as OCR text recognition.
In step S520 of some embodiments, the preliminary voice message is transformed to obtain a spectrogram. Specifically, the preliminary voice message refers to the sound signal corresponding to the answer text message, and the STFT can convert this sound signal into a corresponding two-dimensional signal, yielding the spectrogram. The principle of the STFT is to divide a long signal into frames, apply windows, perform a Fourier transform (FFT) on each frame, and finally stack the per-frame results along another dimension to obtain a two-dimensional signal form similar to an image, thereby obtaining the corresponding spectrogram.
In step S530 of some embodiments, an encoder is used to extract the MFCC audio features of the spectrogram.
In step S540 of some embodiments, a fifth neural network based on the self-attention mechanism is used to decode the audio features to obtain the audio data corresponding to each frame. Specifically, the fifth neural network is an RNN composed of two GRU network layers, each containing 256 GRU units.
In step S550 of some embodiments, since generating audio from the spectrum requires considering the pattern of phase changes between consecutive frames, after the audio corresponding to each frame is obtained, the Griffin-Lim reconstruction algorithm is used to fine-tune the phase changes between consecutive frames and then generate consecutive frames of audio, obtaining the corresponding answer voice message. It should be noted that when the phase change between consecutive frames is large, an intermediate phase needs to be found so that the phase change of the consecutive audio frames is not too large, which would otherwise affect the quality of the generated answer voice message. In addition, the embodiments of the present application can also vary output audio parameters such as intonation according to different expressions, so that the robot gives more fitting answers.
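For illustration, the following sketch reproduces steps S520, S530, and S550 with the librosa library (an assumption; the application names no library): STFT for the spectrogram, MFCC feature extraction, and Griffin-Lim phase reconstruction. The parameter values are illustrative.

```python
import numpy as np
import librosa

def spectrogram_and_back(y, sr=16000, n_fft=1024, hop=256):
    # step S520: frame and window the signal, FFT each frame, and stack the
    # frames along another dimension to get a spectrogram (STFT magnitude)
    spec = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))
    # step S530: MFCC audio features derived from the spectrogram
    mel = librosa.feature.melspectrogram(S=spec ** 2, sr=sr)
    mfcc = librosa.feature.mfcc(S=librosa.power_to_db(mel), sr=sr, n_mfcc=13)
    # step S550: Griffin-Lim estimates the phase from the relationship
    # between frames and reconstructs a waveform; no training is required
    y_rec = librosa.griffinlim(spec, n_iter=32, hop_length=hop)
    return spec, mfcc, y_rec

# usage on one second of noise as a stand-in input (illustrative only)
y = np.random.randn(16000).astype(np.float32)
spec, mfcc, y_rec = spectrogram_and_back(y)
```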
In some embodiments, as shown in Fig. 6, before step S400, the method further includes a step of constructing the first model, which specifically includes, but is not limited to, steps S610 to S650.
Step S610, obtaining a message data set;
Step S620, segmenting multiple question sample data to obtain multiple question word-segmentation data;
Step S630, segmenting multiple answer sample data to obtain multiple answer word-segmentation data;
Step S640, obtaining a first original model;
Step S650, training the first original model according to the multiple question word-segmentation data, the multiple answer word-segmentation data, and multiple preset expressions to obtain the first model.
In step S610 of some embodiments, a message data set used for model training is obtained. The message data set includes multiple question sample data, multiple preset expressions, and multiple answer sample data; the question sample data and the preset expressions correspond one to one to form binding groups, and each binding group has a mapping relationship with answer sample data.
In step S620 of some embodiments, the Chinese word segmentation tool jieba or Analyzer is used to segment the multiple question sample data to obtain multiple question word-segmentation data.
在一些实施例的步骤S630中,采用中文分词工具jieba或者Analyzer对多个回答样本数据进行分词处理,得到多个回答分词数据。In step S630 of some embodiments, a Chinese word segmentation tool jieba or Analyzer is used to perform word segmentation processing on multiple answer sample data to obtain multiple answer word segmentation data.
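Both segmentation steps can be illustrated with the jieba tool named above; the sample sentences are invented for illustration:

```python
import jieba

question_samples = ["今天过得怎么样", "你喜欢什么电影"]
answer_samples = ["听起来你心情不错，继续保持！", "我喜欢科幻电影。"]

# jieba.lcut returns the segmentation of a sentence as a list of tokens.
question_tokens = [jieba.lcut(q) for q in question_samples]
answer_tokens = [jieba.lcut(a) for a in answer_samples]
```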
在一些实施例的步骤S640中,获取第一原始模型,其中第一原始模型具体可以为Seq2seq模型,该模型还没有经过训练。In step S640 of some embodiments, the first original model is obtained, where the first original model may specifically be a Seq2seq model, which has not been trained.
在一些实施例的步骤S650中,根据多个问题分词数据、多个回答分词数据和多个预设表情对第一原始模型进行训练,得到第一模型。In step S650 of some embodiments, the first original model is trained according to a plurality of question word segmentation data, a plurality of answer word segmentation data and a plurality of preset expressions to obtain a first model.
在一些实施例中,步骤S650还包括但不限于如下步骤:In some embodiments, step S650 also includes but not limited to the following steps:
将多个问题分词数据和多个回答分词数据输入至编码器进行第一编码,得到样本编码数据;Inputting a plurality of question word segmentation data and a plurality of answer word segmentation data into an encoder for first encoding to obtain sample encoding data;
将多个预设表情输入至编码器进行第二编码,得到表情编码数据;Inputting a plurality of preset expressions into the encoder for second encoding to obtain expression encoding data;
对样本编码数据和表情编码数据进行拼接,得到样本拼接数据;Splicing the sample coded data and the expression coded data to obtain the sample spliced data;
将样本拼接数据输入至解码器进行解码,得到样本解码数据;Input the sample splicing data to the decoder for decoding to obtain sample decoded data;
根据样本拼接数据和样本解码数据,计算第一原始模型的损失函数,得到损失值;Calculate the loss function of the first original model according to the sample splicing data and the sample decoding data to obtain a loss value;
根据损失值更新第一原始模型,得到第一模型。The first original model is updated according to the loss value to obtain the first model.
更具体地，将多个问题分词数据和多个回答分词数据输入至编码器进行第一编码，得到样本编码数据。其中，编码器指的是word2vec，所生成的样本编码数据为词嵌入向量。同时，将多个预设表情输入至word2vec进行第二编码，得到表情编码数据。接着，将样本编码数据和表情编码数据输入至Seq2seq模型中，进行训练。具体地，通过Seq2seq模型对样本编码数据和表情编码数据进行拼接，得到样本拼接数据，将样本拼接数据输入至解码器进行解码，得到样本解码数据；根据样本拼接数据和样本解码数据，计算第一原始模型的损失函数，例如交叉熵损失函数，得到损失值；根据损失值更新第一原始模型，得到第一模型。为了解决Seq2seq中解码器只接受编码器最后一个输出，而远离了之前的输出导致的信息丢失问题，本申请实施例还使用了attention模型，将注意力集中在问题的一些关键位置。More specifically, a plurality of question word segmentation data and a plurality of answer word segmentation data are input into an encoder for first encoding to obtain sample encoding data. Here, the encoder refers to word2vec, and the generated sample encoding data are word embedding vectors. At the same time, a plurality of preset expressions are input into word2vec for second encoding to obtain expression encoding data. Next, the sample encoding data and the expression encoding data are input into the Seq2seq model for training. Specifically, the sample encoding data and the expression encoding data are spliced through the Seq2seq model to obtain sample spliced data, and the sample spliced data are input to the decoder for decoding to obtain sample decoded data; according to the sample spliced data and the sample decoded data, the loss function of the first original model, for example a cross-entropy loss function, is calculated to obtain a loss value; and the first original model is updated according to the loss value to obtain the first model. To address the information loss caused by the Seq2seq decoder accepting only the encoder's final output and thus drifting away from earlier outputs, the embodiments of the present application also use an attention model to focus attention on key positions of the question.
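The following PyTorch sketch shows one plausible shape for this training step under the stated design (word embeddings for tokens and expressions, splicing of the expression encoding, an attention-equipped decoder, and a cross-entropy loss). All dimensions, the vocabulary size, and the expression count are assumptions, and a pre-trained word2vec table could be loaded into the embedding layer:

```python
import torch
import torch.nn as nn

class ExpressionSeq2seq(nn.Module):
    def __init__(self, vocab_size=30000, emb_dim=300, hid_dim=256, n_expr=7):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)   # stand-in for word2vec
        self.expr_embed = nn.Embedding(n_expr, emb_dim)  # expression encoding
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        # Attention lets the decoder attend to every encoder position instead
        # of only the encoder's final output.
        self.attn = nn.MultiheadAttention(hid_dim, num_heads=1, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, question, expression, answer_in):
        # question: (B, Tq) token ids; expression: (B,) ids; answer_in: (B, Ta).
        q = self.embed(question)
        e = self.expr_embed(expression).unsqueeze(1)              # (B, 1, emb_dim)
        enc_out, enc_h = self.encoder(torch.cat([e, q], dim=1))   # splice expression
        dec_out, _ = self.decoder(self.embed(answer_in), enc_h)
        ctx, _ = self.attn(dec_out, enc_out, enc_out)
        return self.out(ctx)                                      # (B, Ta, vocab_size)

# Training step: cross-entropy between decoded logits and the target answer.
# loss = nn.CrossEntropyLoss()(logits.reshape(-1, 30000), target.reshape(-1))
```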
在一些实施例中，如图7所示，本申请采用了多个模块以实现基于表情识别的语音消息生成方法的过程。具体地，模块包括：语音识别模块、表情识别模块、文本理解模块和语音转换模块，具体地，方法包括：语音识别模块识别用户对聊天机器人说话的语音信息，且将语音信息转换成对应的文本。与此同时，摄像头获取用户说话时的图像，并捕捉人脸区域，得到人脸区域图像，将人脸区域图像输入至表情识别模块，由表情识别模块识别出对应的表情。将语音识别模块得到的文本以及表情识别模块得到的表情输入至文本理解模块中，由文本理解模块根据文本和表情生成文本回答。将文本回答输入至语音转换模块中，生成语音回答，由此完成基于表情识别的语音消息生成方法的过程。In some embodiments, as shown in FIG. 7 , the present application adopts multiple modules to realize the process of the voice message generation method based on expression recognition. Specifically, the modules include: a speech recognition module, an expression recognition module, a text understanding module, and a voice conversion module. The method proceeds as follows: the speech recognition module recognizes the voice information the user speaks to the chat robot and converts the voice information into corresponding text. At the same time, the camera captures an image of the user while speaking, extracts the face region to obtain a face region image, and inputs the face region image to the expression recognition module, which recognizes the corresponding expression. The text obtained by the speech recognition module and the expression obtained by the expression recognition module are input into the text understanding module, which generates a text answer according to the text and the expression. The text answer is input into the voice conversion module to generate a voice answer, thereby completing the process of the expression recognition-based voice message generation method.
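A hypothetical orchestration of the four modules in FIG. 7; the module objects and their method names are invented for illustration and do not come from this application:

```python
def generate_voice_reply(user_audio, camera_frame,
                         asr, expression_recognizer, text_understander, tts):
    # Speech recognition module: user's speech -> text.
    text = asr.transcribe(user_audio)
    # Camera frame -> face region image -> expression label.
    face_region = expression_recognizer.detect_face(camera_frame)
    expression = expression_recognizer.classify(face_region)
    # Text understanding module: (text, expression) -> text answer.
    answer_text = text_understander.answer(text, expression)
    # Voice conversion module: text answer -> voice answer.
    return tts.synthesize(answer_text)
```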
本申请实施例提出的基于表情识别的语音消息生成方法,通过获取语音数据及其对应的面部图像,对语音数据进行语音识别得到文本消息,并对面部图像进行表情识别得到表情消息;将文本消息和表情消息输入至第一模型,由第一模型根据文本消息和表情消息得到回答文本消息,最后对回答文本消息进行语音转换,得到对应的回答语音消息。本申请实施例将面部图像加入到聊天机器人中,通过对面部图像的识别,能够更加精准判断出当前的情景,并由第一模型根据文本消息和表情消息得到回答文本消息,且将回答文本消息转换成语音回复消息,进而提高语音回复消息的准确率。The voice message generation method based on expression recognition proposed in the embodiment of the present application obtains voice data and its corresponding facial image, conducts voice recognition on the voice data to obtain a text message, and performs expression recognition on the facial image to obtain an expression message; the text message and the expression message are input into the first model, and the first model obtains the answer text message according to the text message and the expression message, and finally performs voice conversion on the answer text message to obtain the corresponding answer voice message. In the embodiment of the present application, the face image is added to the chat robot. Through the recognition of the face image, the current situation can be judged more accurately, and the answer text message is obtained by the first model according to the text message and the emoticon message, and the answer text message is converted into a voice reply message, thereby improving the accuracy of the voice reply message.
本申请实施例还提供一种基于表情识别的语音消息生成装置，如图8所示，可以实现上述基于表情识别的语音消息生成方法，该基于表情识别的语音消息生成装置包括：数据采集模块710、语音识别模块720、表情识别模块730、文本消息获取模块740和语音消息获取模块750，其中数据采集模块710用于获取语音数据及其对应的面部图像；语音识别模块720用于对语音数据进行语音识别得到文本消息；表情识别模块730用于对面部图像进行表情识别得到表情消息；文本消息获取模块740用于将文本消息和表情消息输入至第一模型，第一模型根据文本消息和表情消息得到回答文本消息；语音消息获取模块750用于对回答文本消息进行语音转换，得到对应的回答语音消息。本申请实施例将面部图像加入到聊天机器人中，通过对面部图像的识别，能够更加精准判断出当前的情景，并由第一模型根据文本消息和表情消息得到回答文本消息，且将回答文本消息转换成语音回复消息，进而提高语音回复消息的准确率。The embodiments of the present application also provide a voice message generation device based on expression recognition. As shown in FIG. 8 , it can implement the above voice message generation method based on expression recognition. The device includes: a data acquisition module 710, a voice recognition module 720, an expression recognition module 730, a text message acquisition module 740, and a voice message acquisition module 750, wherein the data acquisition module 710 is used to acquire voice data and its corresponding facial image; the voice recognition module 720 is used to perform voice recognition on the voice data to obtain a text message; the expression recognition module 730 is used to perform expression recognition on the facial image to obtain an expression message; the text message acquisition module 740 is used to input the text message and the expression message into the first model, and the first model obtains an answer text message according to the text message and the expression message; and the voice message acquisition module 750 is used to perform voice conversion on the answer text message to obtain a corresponding answer voice message. In the embodiments of the present application, the facial image is added to the chat robot; through recognition of the facial image, the current situation can be judged more accurately, the answer text message is obtained by the first model according to the text message and the expression message, and the answer text message is converted into a voice reply message, thereby improving the accuracy of the voice reply message.
本申请实施例的基于表情识别的语音消息生成装置用于执行上述实施例中的基于表情识别的语音消息生成方法,其具体处理过程与上述实施例中的基于表情识别的语音消息生成方法相同,此处不再一一赘述。The voice message generation device based on expression recognition in the embodiment of the present application is used to execute the method for generating a voice message based on expression recognition in the above embodiment, and its specific processing process is the same as the method for generating a voice message based on expression recognition in the above embodiment, and will not be repeated here.
本申请实施例还提供了一种计算机设备,包括:The embodiment of the present application also provides a computer device, including:
至少一个处理器,以及,at least one processor, and,
与至少一个处理器通信连接的存储器;其中,memory communicatively coupled to at least one processor; wherein,
存储器存储有指令，指令被至少一个处理器执行，以使至少一个处理器执行指令时实现一种基于表情识别的语音消息生成方法，该语音消息生成方法包括：获取语音数据及其对应的面部图像；对语音数据进行语音识别得到文本消息；对面部图像进行表情识别得到表情消息；将文本消息和表情消息输入至第一模型，第一模型根据文本消息和表情消息得到回答文本消息；对回答文本消息进行语音转换，得到对应的回答语音消息。The memory stores instructions, and the instructions are executed by the at least one processor, so that when the at least one processor executes the instructions, a voice message generation method based on expression recognition is implemented. The voice message generation method includes: acquiring voice data and its corresponding facial image; performing voice recognition on the voice data to obtain a text message; performing expression recognition on the facial image to obtain an expression message; inputting the text message and the expression message into a first model, the first model obtaining an answer text message according to the text message and the expression message; and performing voice conversion on the answer text message to obtain a corresponding answer voice message.
下面结合图9对计算机设备的硬件结构进行详细说明。该计算机设备包括:处理器810、存储器820、输入/输出接口830、通信接口840和总线850。The hardware structure of the computer device will be described in detail below in conjunction with FIG. 9 . The computer device includes: a processor 810 , a memory 820 , an input/output interface 830 , a communication interface 840 and a bus 850 .
处理器810，可以采用通用的中央处理器(Central Processing Unit，CPU)、微处理器、应用专用集成电路(Application Specific Integrated Circuit，ASIC)、或者一个或多个集成电路等方式实现，用于执行相关程序，以实现本申请实施例所提供的技术方案；The processor 810 can be implemented by a general-purpose central processing unit (Central Processing Unit, CPU), a microprocessor, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), or one or more integrated circuits, and is used to execute related programs to realize the technical solutions provided by the embodiments of the present application;
存储器820,可以采用只读存储器(Read Only Memory,ROM)、静态存储设备、动态存储设备或者随机存取存储器(Random Access Memory,RAM)等形式实现。存储器820可以存储操作系统和其他应用程序,在通过软件或者固件来实现本说明书实施例所提供的技术方案时,相关的程序代码保存在存储器820中,并由处理器810来调用执行本申请实施例的基于表情识别的语音消息生成方法;The memory 820 may be implemented in the form of a read-only memory (Read Only Memory, ROM), a static storage device, a dynamic storage device, or a random access memory (Random Access Memory, RAM). The memory 820 can store an operating system and other application programs. When implementing the technical solutions provided by the embodiments of this specification through software or firmware, the relevant program codes are stored in the memory 820, and are called by the processor 810 to execute the expression recognition-based voice message generation method of the embodiments of the present application;
输入/输出接口830,用于实现信息输入及输出;The input/output interface 830 is used to realize information input and output;
通信接口840,用于实现本设备与其他设备的通信交互,可以通过有线方式(例如USB、网线等)实现通信,也可以通过无线方式(例如移动网络、WIFI、蓝牙等)实现通信;和The communication interface 840 is used to realize the communication interaction between the device and other devices, and the communication can be realized through a wired method (such as USB, network cable, etc.), or can be realized through a wireless method (such as a mobile network, WIFI, Bluetooth, etc.); and
总线850，在设备的各个组件（例如处理器810、存储器820、输入/输出接口830和通信接口840）之间传输信息；the bus 850 transfers information between the various components of the device (e.g., the processor 810, the memory 820, the input/output interface 830 and the communication interface 840);
其中处理器810、存储器820、输入/输出接口830和通信接口840通过总线850实现彼此之间在设备内部的通信连接。The processor 810 , the memory 820 , the input/output interface 830 and the communication interface 840 are connected to each other within the device through the bus 850 .
本申请实施例还提供一种存储介质，该存储介质是计算机可读存储介质，该计算机可读存储介质存储有计算机可执行指令，该计算机可执行指令用于使计算机执行一种基于表情识别的语音消息生成方法，该语音消息生成方法包括：获取语音数据及其对应的面部图像；对语音数据进行语音识别得到文本消息；对面部图像进行表情识别得到表情消息；将文本消息和表情消息输入至第一模型，第一模型根据文本消息和表情消息得到回答文本消息；对回答文本消息进行语音转换，得到对应的回答语音消息。An embodiment of the present application further provides a storage medium, which is a computer-readable storage medium storing computer-executable instructions. The computer-executable instructions are used to cause a computer to execute a voice message generation method based on expression recognition, the method including: acquiring voice data and its corresponding facial image; performing voice recognition on the voice data to obtain a text message; performing expression recognition on the facial image to obtain an expression message; inputting the text message and the expression message into a first model, the first model obtaining an answer text message according to the text message and the expression message; and performing voice conversion on the answer text message to obtain a corresponding answer voice message.
所述计算机可读存储介质可以是非易失性,也可以是易失性。存储器作为一种非暂态计算机可读存储介质,可用于存储非暂态软件程序以及非暂态性计算机可执行程序。此外,存储器可以包括高速随机存取存储器,还可以包括非暂态存储器,例如至少一个磁盘存储器件、闪存器件、或其他非暂态固态存储器件。在一些实施方式中,存储器可选包括相对于处理器远程设置的存储器,这些远程存储器可以通过网络连接至该处理器。上述网络的实例包括但不限于互联网、企业内部网、局域网、移动通信网及其组合。The computer-readable storage medium may be non-volatile or volatile. As a non-transitory computer-readable storage medium, memory can be used to store non-transitory software programs and non-transitory computer-executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage devices. In some embodiments, the memory optionally includes memory located remotely from the processor, and these remote memories may be connected to the processor via a network. Examples of the aforementioned networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
本申请实施例描述的实施例是为了更加清楚的说明本申请实施例的技术方案,并不构成对于本申请实施例提供的技术方案的限定,本领域技术人员可知,随着技术的演变和新应用场景的出现,本申请实施例提供的技术方案对于类似的技术问题,同样适用。The embodiments described in the embodiments of the present application are to illustrate the technical solutions of the embodiments of the present application more clearly, and do not constitute a limitation to the technical solutions provided by the embodiments of the present application. Those skilled in the art know that with the evolution of technology and the emergence of new application scenarios, the technical solutions provided by the embodiments of the present application are also applicable to similar technical problems.
本领域技术人员可以理解的是,图1至图7中示出的技术方案并不构成对本申请实施例的限定,可以包括比图示更多或更少的步骤,或者组合某些步骤,或者不同的步骤。Those skilled in the art can understand that the technical solutions shown in FIG. 1 to FIG. 7 do not limit the embodiments of the present application, and may include more or fewer steps than those shown in the illustrations, or combine some steps, or different steps.
以上所描述的装置实施例仅仅是示意性的,其中作为分离部件说明的单元可以是或者也可以不是物理上分开的,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。The device embodiments described above are only illustrative, and the units described as separate components may or may not be physically separated, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
本领域普通技术人员可以理解,上文中所公开方法中的全部或某些步骤、系统、设备中的功能模块/单元可以被实施为软件、固件、硬件及其适当的组合。Those of ordinary skill in the art can understand that all or some of the steps in the methods disclosed above, the functional modules/units in the system, and the device can be implemented as software, firmware, hardware, and an appropriate combination thereof.
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。In addition, each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit. The above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.
所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括多指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(Read-Only Memory,简称ROM)、随机存取存储器(Random Access Memory,简称RAM)、磁碟或者光盘等各种可以存储程序的介质。If the integrated unit is realized in the form of a software function unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on such an understanding, the technical solution of the present application essentially or the part that contributes to the prior art or all or part of the technical solution can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes multiple instructions to make a computer device (which can be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the method described in each embodiment of the application. The aforementioned storage media include: U disk, mobile hard disk, read-only memory (Read-Only Memory, referred to as ROM), random access memory (Random Access Memory, referred to as RAM), magnetic disks or optical discs and other media that can store programs.
以上参照附图说明了本申请实施例的优选实施例,并非因此局限本申请实施例的权利范围。本领域技术人员不脱离本申请实施例的范围和实质内所作的任何修改、等同替换和改进,均应在本申请实施例的权利范围之内。The preferred embodiments of the embodiments of the present application have been described above with reference to the accompanying drawings, which does not limit the scope of rights of the embodiments of the present application. Any modifications, equivalent replacements and improvements made by those skilled in the art without departing from the scope and essence of the embodiments of the present application shall fall within the scope of rights of the embodiments of the present application.

Claims (20)

  1. 一种基于表情识别的语音消息生成方法,其中,包括:A method for generating voice messages based on facial expression recognition, comprising:
    获取语音数据及其对应的面部图像；acquiring voice data and its corresponding facial image;
    对所述语音数据进行语音识别得到文本消息;performing speech recognition on the speech data to obtain a text message;
    对所述面部图像进行表情识别得到表情消息；performing expression recognition on the facial image to obtain an expression message;
    将所述文本消息和所述表情消息输入至第一模型，所述第一模型根据所述文本消息和所述表情消息得到回答文本消息；inputting the text message and the expression message into a first model, the first model obtaining an answer text message according to the text message and the expression message;
    对所述回答文本消息进行语音转换，得到对应的回答语音消息。performing voice conversion on the answer text message to obtain a corresponding answer voice message.
  2. 根据权利要求1所述的方法,其中,所述对所述语音数据进行语音识别得到文本消息,包括:The method according to claim 1, wherein said performing speech recognition on said speech data to obtain a text message comprises:
    对所述语音数据的时域信号进行积分变换得到频域信号；performing an integral transform on the time-domain signal of the speech data to obtain a frequency-domain signal;
    根据所述时域信号和所述频域信号,构建平面空间;constructing a planar space according to the time domain signal and the frequency domain signal;
    通过第一神经网络,在所述平面空间中对所述语音数据进行卷积运算,得到语音序列和序列长度;Carrying out a convolution operation on the voice data in the planar space through the first neural network to obtain a voice sequence and sequence length;
    根据所述序列长度对所述语音序列进行切片,得到多个切片序列;Slicing the speech sequence according to the sequence length to obtain a plurality of slice sequences;
    通过第二神经网络对多个所述切片序列进行文本转换,得到所述文本消息。performing text conversion on multiple slice sequences through the second neural network to obtain the text message.
  3. 根据权利要求1所述的方法,其中,所述对所述面部图像进行表情识别得到表情消息,包括:The method according to claim 1, wherein said performing expression recognition on said facial image to obtain an expression message comprises:
    通过第三神经网络对所述面部图像进行自注意力筛选，得到变换参数；performing self-attention screening on the facial image through the third neural network to obtain transformation parameters;
    根据所述变换参数对所述面部图像进行扭曲变换,得到变换图像;performing warping transformation on the facial image according to the transformation parameters to obtain a transformed image;
    通过第四神经网络对所述面部图像和所述变换图像进行表情识别，得到所述表情消息。performing expression recognition on the facial image and the transformed image through the fourth neural network to obtain the expression message.
  4. 根据权利要求3所述的方法，其中，所述第四神经网络包括卷积层、全连接层和分类器；所述通过第四神经网络对所述面部图像和所述变换图像进行表情识别，得到表情消息，包括：The method according to claim 3, wherein the fourth neural network comprises a convolutional layer, a fully connected layer and a classifier; and said performing expression recognition on the facial image and the transformed image through the fourth neural network to obtain an expression message comprises:
    通过所述卷积层对所述面部图像和所述变换图像进行特征提取,得到多个图像特征向量;performing feature extraction on the face image and the transformed image through the convolution layer to obtain a plurality of image feature vectors;
    通过所述全连接层对多个所述图像特征向量进行拼接,得到图像拼接向量;splicing a plurality of the image feature vectors through the fully connected layer to obtain an image splicing vector;
    通过所述分类器对所述图像拼接向量进行表情分类，得到所述表情消息。performing expression classification on the image splicing vector through the classifier to obtain the expression message.
  5. 根据权利要求1所述的方法，其中，在所述将所述文本消息和所述表情消息输入至第一模型，所述第一模型根据所述文本消息和所述表情消息得到回答文本消息之前，包括：The method according to claim 1, wherein before inputting the text message and the expression message into the first model and obtaining, by the first model, an answer text message according to the text message and the expression message, the method comprises:
    获取消息数据集；其中，所述消息数据集包括多个问题样本数据、多个预设表情和多个回答样本数据，所述问题样本数据和所述预设表情一一对应以形成绑定组，每个所述绑定组与所述回答样本数据具有映射关系；acquiring a message data set; wherein the message data set includes a plurality of question sample data, a plurality of preset expressions and a plurality of answer sample data, the question sample data and the preset expressions are in one-to-one correspondence to form binding groups, and each of the binding groups has a mapping relationship with the answer sample data;
    对多个所述问题样本数据进行分词,得到多个问题分词数据;Segmenting a plurality of question sample data to obtain a plurality of question word segmentation data;
    对多个所述回答样本数据进行分词,得到多个回答分词数据;Segmenting a plurality of said answer sample data to obtain a plurality of answer word segmentation data;
    获取第一原始模型;obtaining a first primitive model;
    根据多个所述问题分词数据、多个所述回答分词数据和多个所述预设表情对所述第一原始模型进行训练，得到所述第一模型。training the first original model according to the plurality of question word segmentation data, the plurality of answer word segmentation data and the plurality of preset expressions to obtain the first model.
  6. 根据权利要求5所述的方法，其中，所述第一原始模型包括编码器和解码器；所述根据多个所述问题分词数据、多个所述回答分词数据和多个所述预设表情对所述第一原始模型进行训练，得到第一模型，包括：The method according to claim 5, wherein the first original model comprises an encoder and a decoder; and said training the first original model according to the plurality of question word segmentation data, the plurality of answer word segmentation data and the plurality of preset expressions to obtain the first model comprises:
    将多个所述问题分词数据和多个所述回答分词数据输入至所述编码器进行第一编码,得到样本编码数据;Inputting a plurality of the question word segmentation data and a plurality of the answer word segmentation data into the encoder for first encoding to obtain sample encoding data;
    将多个所述预设表情输入至所述编码器进行第二编码,得到表情编码数据;Inputting a plurality of preset expressions into the encoder for second encoding to obtain expression encoding data;
    对所述样本编码数据和所述表情编码数据进行拼接,得到样本拼接数据;Splicing the sample coded data and the expression coded data to obtain sample spliced data;
    将所述样本拼接数据输入至所述解码器进行解码，得到样本解码数据；inputting the sample spliced data into the decoder for decoding to obtain sample decoded data;
    根据所述样本拼接数据和所述样本解码数据，计算所述第一原始模型的损失函数，得到损失值；calculating a loss function of the first original model according to the sample spliced data and the sample decoded data to obtain a loss value;
    根据所述损失值更新所述第一原始模型,得到第一模型。The first original model is updated according to the loss value to obtain a first model.
  7. 根据权利要求1至6任一项所述的方法，其中，所述对所述回答文本消息进行语音转换，得到对应的回答语音消息，包括：The method according to any one of claims 1 to 6, wherein said performing voice conversion on the answer text message to obtain a corresponding answer voice message comprises:
    对所述回答文本消息进行语音转换，得到初步语音消息；performing voice conversion on the answer text message to obtain a preliminary voice message;
    对所述初步语音消息进行变换,得到声谱图;transforming the preliminary voice message to obtain a spectrogram;
    提取所述声谱图的音频特征;extracting audio features of the spectrogram;
    通过第五神经网络模型对所述音频特征进行解码,得到每一帧对应的音频数据;Decoding the audio feature by a fifth neural network model to obtain audio data corresponding to each frame;
    将所述音频数据进行合成处理,得到对应的回答语音消息。The audio data is synthesized to obtain a corresponding reply voice message.
  8. 一种基于表情识别的语音消息生成装置,其中,包括:A voice message generation device based on facial expression recognition, comprising:
    数据采集模块,用于获取语音数据及其对应的面部图像;Data collection module, is used for obtaining voice data and its corresponding facial image;
    语音识别模块,用于对所述语音数据进行语音识别得到文本消息;A voice recognition module, configured to perform voice recognition on the voice data to obtain a text message;
    表情识别模块,用于对所述面部图像进行表情识别得到表情消息;An expression recognition module, configured to perform expression recognition on the facial image to obtain an expression message;
    文本消息获取模块，用于将所述文本消息和所述表情消息输入至第一模型，所述第一模型根据所述文本消息和所述表情消息得到回答文本消息；a text message acquisition module, configured to input the text message and the expression message into the first model, the first model obtaining an answer text message according to the text message and the expression message; and
    语音消息获取模块，用于对所述回答文本消息进行语音转换，得到对应的回答语音消息。a voice message acquisition module, configured to perform voice conversion on the answer text message to obtain a corresponding answer voice message.
  9. 一种计算机设备,其中,所述计算机设备包括存储器和处理器,其中,所述存储器中存储有程序,所述程序被所述处理器执行时所述处理器用于执行一种基于表情识别的语音消息生成方法,所述语音消息生成方法包括:A computer device, wherein the computer device includes a memory and a processor, wherein a program is stored in the memory, and when the program is executed by the processor, the processor is used to perform a voice message generation method based on facial expression recognition, the voice message generation method comprising:
    获取语音数据及其对应的面部图像；acquiring voice data and its corresponding facial image;
    对所述语音数据进行语音识别得到文本消息;performing speech recognition on the speech data to obtain a text message;
    对所述面部图像进行表情识别得到表情消息；performing expression recognition on the facial image to obtain an expression message;
    将所述文本消息和所述表情消息输入至第一模型，所述第一模型根据所述文本消息和所述表情消息得到回答文本消息；inputting the text message and the expression message into a first model, the first model obtaining an answer text message according to the text message and the expression message;
    对所述回答文本消息进行语音转换，得到对应的回答语音消息。performing voice conversion on the answer text message to obtain a corresponding answer voice message.
  10. 根据权利要求9所述的一种计算机设备，其中，所述对所述语音数据进行语音识别得到文本消息，包括：The computer device according to claim 9, wherein said performing speech recognition on the speech data to obtain a text message comprises:
    对所述语音数据的时域信号进行积分变换得到频域信号；performing an integral transform on the time-domain signal of the speech data to obtain a frequency-domain signal;
    根据所述时域信号和所述频域信号,构建平面空间;constructing a planar space according to the time domain signal and the frequency domain signal;
    通过第一神经网络,在所述平面空间中对所述语音数据进行卷积运算,得到语音序列和序列长度;Carrying out a convolution operation on the voice data in the planar space through the first neural network to obtain a voice sequence and sequence length;
    根据所述序列长度对所述语音序列进行切片,得到多个切片序列;Slicing the speech sequence according to the sequence length to obtain a plurality of slice sequences;
    通过第二神经网络对多个所述切片序列进行文本转换,得到所述文本消息。performing text conversion on multiple slice sequences through the second neural network to obtain the text message.
  11. 根据权利要求9所述的一种计算机设备，其中，所述对所述面部图像进行表情识别得到表情消息，包括：The computer device according to claim 9, wherein said performing expression recognition on the facial image to obtain an expression message comprises:
    通过第三神经网络对所述面部图像进行自注意力筛选，得到变换参数；performing self-attention screening on the facial image through the third neural network to obtain transformation parameters;
    根据所述变换参数对所述面部图像进行扭曲变换,得到变换图像;performing warping transformation on the facial image according to the transformation parameters to obtain a transformed image;
    通过第四神经网络对所述面部图像和所述变换图像进行表情识别，得到所述表情消息。performing expression recognition on the facial image and the transformed image through the fourth neural network to obtain the expression message.
  12. 根据权利要求11所述的一种计算机设备，其中，所述第四神经网络包括卷积层、全连接层和分类器；所述通过第四神经网络对所述面部图像和所述变换图像进行表情识别，得到表情消息，包括：The computer device according to claim 11, wherein the fourth neural network comprises a convolutional layer, a fully connected layer and a classifier; and said performing expression recognition on the facial image and the transformed image through the fourth neural network to obtain an expression message comprises:
    通过所述卷积层对所述面部图像和所述变换图像进行特征提取,得到多个图像特征向量;performing feature extraction on the face image and the transformed image through the convolution layer to obtain a plurality of image feature vectors;
    通过所述全连接层对多个所述图像特征向量进行拼接,得到图像拼接向量;splicing a plurality of the image feature vectors through the fully connected layer to obtain an image splicing vector;
    通过所述分类器对所述图像拼接向量进行表情分类，得到所述表情消息。performing expression classification on the image splicing vector through the classifier to obtain the expression message.
  13. 根据权利要求9所述的一种计算机设备，其中，在所述将所述文本消息和所述表情消息输入至第一模型，所述第一模型根据所述文本消息和所述表情消息得到回答文本消息之前，包括：The computer device according to claim 9, wherein before inputting the text message and the expression message into the first model and obtaining, by the first model, an answer text message according to the text message and the expression message, the method comprises:
    获取消息数据集；其中，所述消息数据集包括多个问题样本数据、多个预设表情和多个回答样本数据，所述问题样本数据和所述预设表情一一对应以形成绑定组，每个所述绑定组与所述回答样本数据具有映射关系；acquiring a message data set; wherein the message data set includes a plurality of question sample data, a plurality of preset expressions and a plurality of answer sample data, the question sample data and the preset expressions are in one-to-one correspondence to form binding groups, and each of the binding groups has a mapping relationship with the answer sample data;
    对多个所述问题样本数据进行分词,得到多个问题分词数据;Segmenting a plurality of question sample data to obtain a plurality of question word segmentation data;
    对多个所述回答样本数据进行分词,得到多个回答分词数据;Segmenting a plurality of said answer sample data to obtain a plurality of answer word segmentation data;
    获取第一原始模型;obtaining a first primitive model;
    根据多个所述问题分词数据、多个所述回答分词数据和多个所述预设表情对所述第一原始模型进行训练，得到所述第一模型。training the first original model according to the plurality of question word segmentation data, the plurality of answer word segmentation data and the plurality of preset expressions to obtain the first model.
  14. 根据权利要求13所述的一种计算机设备，其中，所述第一原始模型包括编码器和解码器；所述根据多个所述问题分词数据、多个所述回答分词数据和多个所述预设表情对所述第一原始模型进行训练，得到第一模型，包括：The computer device according to claim 13, wherein the first original model comprises an encoder and a decoder; and said training the first original model according to the plurality of question word segmentation data, the plurality of answer word segmentation data and the plurality of preset expressions to obtain the first model comprises:
    将多个所述问题分词数据和多个所述回答分词数据输入至所述编码器进行第一编码,得到样本编码数据;Inputting a plurality of the question word segmentation data and a plurality of the answer word segmentation data into the encoder for first encoding to obtain sample encoding data;
    将多个所述预设表情输入至所述编码器进行第二编码,得到表情编码数据;Inputting a plurality of preset expressions into the encoder for second encoding to obtain expression encoding data;
    对所述样本编码数据和所述表情编码数据进行拼接,得到样本拼接数据;Splicing the sample coded data and the expression coded data to obtain sample spliced data;
    将所述样本拼接数据输入至所述解码器进行解码，得到样本解码数据；inputting the sample spliced data into the decoder for decoding to obtain sample decoded data;
    根据所述样本拼接数据和所述样本解码数据，计算所述第一原始模型的损失函数，得到损失值；calculating a loss function of the first original model according to the sample spliced data and the sample decoded data to obtain a loss value;
    根据所述损失值更新所述第一原始模型,得到第一模型。The first original model is updated according to the loss value to obtain a first model.
  15. 一种存储介质，所述存储介质为计算机可读存储介质，其中，所述计算机可读存储介质存储有计算机程序，在所述计算机程序被计算机执行时，所述计算机用于执行一种基于表情识别的语音消息生成方法，所述语音消息生成方法包括：A storage medium, the storage medium being a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program; when the computer program is executed by a computer, the computer is caused to execute a voice message generation method based on expression recognition, the voice message generation method comprising:
    获取语音数据及其对应的面部图像；acquiring voice data and its corresponding facial image;
    对所述语音数据进行语音识别得到文本消息;performing speech recognition on the speech data to obtain a text message;
    对所述面部图像进行表情识别得到表情消息；performing expression recognition on the facial image to obtain an expression message;
    将所述文本消息和所述表情消息输入至第一模型，所述第一模型根据所述文本消息和所述表情消息得到回答文本消息；inputting the text message and the expression message into a first model, the first model obtaining an answer text message according to the text message and the expression message;
    对所述回答文本消息进行语音转换，得到对应的回答语音消息。performing voice conversion on the answer text message to obtain a corresponding answer voice message.
  16. 根据权利要求15所述的一种存储介质，其中，所述对所述语音数据进行语音识别得到文本消息，包括：The storage medium according to claim 15, wherein said performing speech recognition on the speech data to obtain a text message comprises:
    对所述语音数据的时域信号进行积分变换得到频域信号；performing an integral transform on the time-domain signal of the speech data to obtain a frequency-domain signal;
    根据所述时域信号和所述频域信号,构建平面空间;constructing a planar space according to the time domain signal and the frequency domain signal;
    通过第一神经网络,在所述平面空间中对所述语音数据进行卷积运算,得到语音序列和序列长度;Carrying out a convolution operation on the voice data in the planar space through the first neural network to obtain a voice sequence and sequence length;
    根据所述序列长度对所述语音序列进行切片,得到多个切片序列;Slicing the speech sequence according to the sequence length to obtain a plurality of slice sequences;
    通过第二神经网络对多个所述切片序列进行文本转换,得到所述文本消息。performing text conversion on multiple slice sequences through the second neural network to obtain the text message.
  17. 根据权利要求15所述的一种存储介质，其中，所述对所述面部图像进行表情识别得到表情消息，包括：The storage medium according to claim 15, wherein said performing expression recognition on the facial image to obtain an expression message comprises:
    通过第三神经网络对所述面部图像进行自注意力筛选，得到变换参数；performing self-attention screening on the facial image through the third neural network to obtain transformation parameters;
    根据所述变换参数对所述面部图像进行扭曲变换,得到变换图像;performing warping transformation on the facial image according to the transformation parameters to obtain a transformed image;
    通过第四神经网络对所述面部图像和所述变换图像进行表情识别，得到所述表情消息。performing expression recognition on the facial image and the transformed image through the fourth neural network to obtain the expression message.
  18. 根据权利要求17所述的一种存储介质，其中，所述第四神经网络包括卷积层、全连接层和分类器；所述通过第四神经网络对所述面部图像和所述变换图像进行表情识别，得到表情消息，包括：The storage medium according to claim 17, wherein the fourth neural network comprises a convolutional layer, a fully connected layer and a classifier; and said performing expression recognition on the facial image and the transformed image through the fourth neural network to obtain an expression message comprises:
    通过所述卷积层对所述面部图像和所述变换图像进行特征提取,得到多个图像特征向量;performing feature extraction on the face image and the transformed image through the convolution layer to obtain a plurality of image feature vectors;
    通过所述全连接层对多个所述图像特征向量进行拼接,得到图像拼接向量;splicing a plurality of the image feature vectors through the fully connected layer to obtain an image splicing vector;
    通过所述分类器对所述图像拼接向量进行表情分类，得到所述表情消息。performing expression classification on the image splicing vector through the classifier to obtain the expression message.
  19. 根据权利要求15所述的一种存储介质，其中，在所述将所述文本消息和所述表情消息输入至第一模型，所述第一模型根据所述文本消息和所述表情消息得到回答文本消息之前，包括：The storage medium according to claim 15, wherein before inputting the text message and the expression message into the first model and obtaining, by the first model, an answer text message according to the text message and the expression message, the method comprises:
    获取消息数据集；其中，所述消息数据集包括多个问题样本数据、多个预设表情和多个回答样本数据，所述问题样本数据和所述预设表情一一对应以形成绑定组，每个所述绑定组与所述回答样本数据具有映射关系；acquiring a message data set; wherein the message data set includes a plurality of question sample data, a plurality of preset expressions and a plurality of answer sample data, the question sample data and the preset expressions are in one-to-one correspondence to form binding groups, and each of the binding groups has a mapping relationship with the answer sample data;
    对多个所述问题样本数据进行分词,得到多个问题分词数据;Segmenting a plurality of question sample data to obtain a plurality of question word segmentation data;
    对多个所述回答样本数据进行分词,得到多个回答分词数据;Segmenting a plurality of said answer sample data to obtain a plurality of answer word segmentation data;
    获取第一原始模型;obtaining a first primitive model;
    根据多个所述问题分词数据、多个所述回答分词数据和多个所述预设表情对所述第一原始模型进行训练，得到所述第一模型。training the first original model according to the plurality of question word segmentation data, the plurality of answer word segmentation data and the plurality of preset expressions to obtain the first model.
  20. 根据权利要求19所述的一种存储介质，其中，所述第一原始模型包括编码器和解码器；所述根据多个所述问题分词数据、多个所述回答分词数据和多个所述预设表情对所述第一原始模型进行训练，得到第一模型，包括：The storage medium according to claim 19, wherein the first original model comprises an encoder and a decoder; and said training the first original model according to the plurality of question word segmentation data, the plurality of answer word segmentation data and the plurality of preset expressions to obtain the first model comprises:
    将多个所述问题分词数据和多个所述回答分词数据输入至所述编码器进行第一编码,得到样本编码数据;Inputting a plurality of the question word segmentation data and a plurality of the answer word segmentation data into the encoder for first encoding to obtain sample encoding data;
    将多个所述预设表情输入至所述编码器进行第二编码,得到表情编码数据;Inputting a plurality of preset expressions into the encoder for second encoding to obtain expression encoding data;
    对所述样本编码数据和所述表情编码数据进行拼接,得到样本拼接数据;Splicing the sample coded data and the expression coded data to obtain sample spliced data;
    将所述样本拼接数据输入至所述解码器进行解码，得到样本解码数据；inputting the sample spliced data into the decoder for decoding to obtain sample decoded data;
    根据所述样本拼接数据和所述样本解码数据，计算所述第一原始模型的损失函数，得到损失值；calculating a loss function of the first original model according to the sample spliced data and the sample decoded data to obtain a loss value;
    根据所述损失值更新所述第一原始模型,得到第一模型。The first original model is updated according to the loss value to obtain a first model.
PCT/CN2022/090752 2022-01-18 2022-04-29 Voice message generation method and apparatus, computer device and storage medium WO2023137922A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210057040.4 2022-01-18
CN202210057040.4A CN114400005A (en) 2022-01-18 2022-01-18 Voice message generation method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2023137922A1 true WO2023137922A1 (en) 2023-07-27

Family

ID=81230639

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/090752 WO2023137922A1 (en) 2022-01-18 2022-04-29 Voice message generation method and apparatus, computer device and storage medium

Country Status (2)

Country Link
CN (1) CN114400005A (en)
WO (1) WO2023137922A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114400005A (en) * 2022-01-18 2022-04-26 平安科技(深圳)有限公司 Voice message generation method and device, computer equipment and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106127156A (en) * 2016-06-27 2016-11-16 上海元趣信息技术有限公司 Robot interactive method based on vocal print and recognition of face
CN108833941A (en) * 2018-06-29 2018-11-16 北京百度网讯科技有限公司 Man-machine dialogue system method, apparatus, user terminal, processing server and system
CN110457432A (en) * 2019-07-04 2019-11-15 平安科技(深圳)有限公司 Interview methods of marking, device, equipment and storage medium
CN110717514A (en) * 2019-09-06 2020-01-21 平安国际智慧城市科技股份有限公司 Session intention identification method and device, computer equipment and storage medium
CN112687260A (en) * 2020-11-17 2021-04-20 珠海格力电器股份有限公司 Facial-recognition-based expression judgment voice recognition method, server and air conditioner
CN113555027A (en) * 2021-07-26 2021-10-26 平安科技(深圳)有限公司 Voice emotion conversion method and device, computer equipment and storage medium
CN113704419A (en) * 2021-02-26 2021-11-26 腾讯科技(深圳)有限公司 Conversation processing method and device
CN114400005A (en) * 2022-01-18 2022-04-26 平安科技(深圳)有限公司 Voice message generation method and device, computer equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117122289A (en) * 2023-09-12 2023-11-28 中国人民解放军总医院第一医学中心 Pain assessment method
CN117122289B (en) * 2023-09-12 2024-03-19 中国人民解放军总医院第一医学中心 Pain assessment method

Also Published As

Publication number Publication date
CN114400005A (en) 2022-04-26

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22921353

Country of ref document: EP

Kind code of ref document: A1