CN110807388A - Interaction method, interaction device, terminal equipment and storage medium - Google Patents

Interaction method, interaction device, terminal equipment and storage medium

Info

Publication number
CN110807388A
Authority
CN
China
Prior art keywords
information
sign language
video
reply
image sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911024921.0A
Other languages
Chinese (zh)
Other versions
CN110807388B (en)
Inventor
金益欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Chase Technology Co Ltd
Shenzhen Zhuiyi Technology Co Ltd
Original Assignee
Shenzhen Chase Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Chase Technology Co Ltd
Priority to CN201911024921.0A
Publication of CN110807388A
Application granted
Publication of CN110807388B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/28 Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174 Facial expression recognition

Abstract

The embodiments of the application provide an interaction method, an interaction apparatus, a terminal device, and a storage medium. The method includes: when the current mode of the terminal device is a sign language recognition mode, acquiring sign language information and a face image sequence from a video to be processed; recognizing the sign language information to obtain text information, and performing emotion analysis on the face image sequence to obtain emotional features; determining semantic information of the video to be processed based on the text information and the emotional features, and acquiring reply sign language information corresponding to the semantic information; generating action parameters of the virtual intelligent customer service based on the reply sign language information; driving the actions of the virtual intelligent customer service based on the action parameters to generate a reply image sequence; and generating and outputting a reply video for the video to be processed based on the reply image sequence. By recognizing both the user's sign language information and the user's face, and determining semantic information from the recognized text information and emotional features, the method and device improve the accuracy of user intention recognition.

Description

Interaction method, interaction device, terminal equipment and storage medium
Technical Field
The present application relates to the field of human-computer interaction technologies, and in particular, to an interaction method, an interaction apparatus, a terminal device, and a storage medium.
Background
Customer service is a main channel through which enterprises obtain user feedback and resolve users' questions about their products. Traditional customer service is handled mainly by human agents, so enterprises' investment in customer service rises rapidly and roughly linearly as the customer service volume grows, and the expenditure becomes considerable. To address this problem, the current state-of-the-art approach is to introduce customer service robots so as to reduce the amount of manual customer service and the cost borne by enterprises. It is known that more than 20 million people in China have hearing or speech disabilities; however, customer service robots currently serve mainly users without such disabilities, and few of them are designed for these special groups, so it is difficult for such users to interact with customer service robots, which reduces the convenience of their interaction with customer service robots.
Disclosure of Invention
The embodiments of the present application provide an interaction method, an interaction apparatus, a terminal device, and a storage medium to solve the above problems.
In a first aspect, an embodiment of the present application provides an interaction method, which is applied to a terminal device, and the method includes: when the current mode of the terminal equipment is a sign language recognition mode, acquiring sign language information and a face image sequence in a video to be processed; recognizing the sign language information to obtain text information, and performing emotion analysis on the face image sequence to obtain emotion characteristics; determining semantic information of the video to be processed based on the text information and the emotional features, and acquiring reply sign language information corresponding to the semantic information; generating action parameters of the virtual intelligent customer service based on the reply sign language information; driving the action of the virtual intelligent customer service based on the action parameter to generate a reply image sequence, wherein the reply image sequence is composed of a plurality of frames of continuous action images generated by driving the virtual intelligent customer service; and generating and outputting a reply video aiming at the video to be processed based on the reply image sequence.
Optionally, the determining semantic information of the video to be processed based on the text information and the emotional features includes: inputting the text information into a first machine learning model to obtain semantic information corresponding to the text information; inputting the emotional features into a second machine learning model to obtain semantic information corresponding to the emotional features; and determining semantic information of the video to be processed based on the semantic information corresponding to the text information and the semantic information corresponding to the emotional features.
Optionally, after acquiring the sign language information and the face image sequence in the video to be processed, the method further includes: acquiring the quantity of sign language information in the video to be processed within a preset time period; and calculating the change speed of the sign language information in the video to be processed based on the preset time period and the quantity. The performing emotion analysis on the face image sequence to obtain the emotional features includes: performing emotion analysis on the face image sequence and the change speed to obtain the emotional features.
Optionally, after acquiring the sign language information and the face image sequence in the video to be processed, the method further includes: acquiring sign language information adjacent to the sign language information, and determining context semantic information based on the sign language information and the adjacent sign language information. The performing emotion analysis on the face image sequence to obtain the emotional features includes: performing emotion analysis on the face image and the context semantic information to obtain the emotional features.
Optionally, before the sign language information in the video to be processed is acquired when the current mode of the terminal device is the sign language recognition mode, the method further includes: acquiring the video to be processed; if the current mode of the terminal device is a non-sign-language recognition mode, judging, based on a first neural network model, whether the video to be processed contains sign language information; and when the video to be processed contains sign language information, switching the current mode of the terminal device to the sign language recognition mode.
Optionally, the facial image sequence includes a plurality of facial images, and performing emotion analysis on the facial image sequence to obtain emotion characteristics includes: extracting a face key point corresponding to each face image in the face image sequence; obtaining a feature vector corresponding to each face image based on each face image in the face image sequence and a face key point corresponding to each face image; and determining the emotion characteristics corresponding to the feature vectors according to a preset mapping relation to obtain the emotion characteristics corresponding to each face image in the face image sequence, wherein the preset mapping relation comprises the corresponding relation between a plurality of feature vectors and a plurality of emotion characteristics.
Optionally, the acquiring the reply sign language information corresponding to the semantic information includes: searching corresponding reply text information based on the text information and semantic information corresponding to the video to be processed; and inputting the reply text information into a second neural network model to obtain reply sign language information corresponding to the reply text information, wherein the second neural network model is obtained by taking the sample reply text information as input and the reply sign language information corresponding to the sample reply text information as output and training based on a machine learning algorithm.
In a second aspect, an embodiment of the present application provides an interaction apparatus, which is applied to a terminal device, and the apparatus includes: the information acquisition module is used for acquiring sign language information and a face image sequence in a video to be processed when the current mode of the terminal equipment is a sign language identification mode; the information identification module is used for identifying the sign language information to obtain text information and carrying out emotion analysis on the face image sequence to obtain emotion characteristics; the information determining module is used for determining semantic information of the video to be processed based on the text information and the emotional characteristics and acquiring reply sign language information corresponding to the semantic information; the parameter generation module is used for generating action parameters of the virtual intelligent customer service based on the reply sign language information; the sequence generation module is used for driving the action of the virtual intelligent customer service based on the action parameter to generate a reply image sequence, and the reply image sequence is formed by a plurality of continuous action images generated by driving the virtual intelligent customer service; and the video generation module is used for generating and outputting a reply video aiming at the video to be processed based on the reply image sequence.
Optionally, the facial image sequence includes a plurality of facial images, and the information recognition module includes: the key point extraction submodule is used for extracting a face key point corresponding to each face image in the face image sequence; the vector obtaining submodule is used for obtaining a feature vector corresponding to each face image based on each face image in the face image sequence and a face key point corresponding to each face image; and the feature determination submodule is used for determining the emotion features corresponding to the feature vectors according to a preset mapping relation to obtain the emotion features corresponding to each face image in the face image sequence, wherein the preset mapping relation comprises the corresponding relation between a plurality of feature vectors and a plurality of emotion features.
Optionally, the information determining module includes: the first semantic information obtaining submodule is used for inputting the text information into a first machine learning model and obtaining semantic information corresponding to the text information; the second semantic information obtaining submodule is used for inputting the emotion characteristics into a second machine learning model to obtain semantic information corresponding to the emotion characteristics; and the semantic information determining submodule is used for determining the semantic information of the video to be processed based on the semantic information corresponding to the text information and the semantic information corresponding to the emotion characteristics.
Optionally, the information determining module further includes: the information searching submodule is used for searching corresponding reply text information based on the text information and the semantic information corresponding to the video to be processed; and the sign language information obtaining sub-module is used for inputting the reply text information into a second neural network model and obtaining reply sign language information corresponding to the reply text information, wherein the second neural network model is obtained by taking sample reply text information as input and reply sign language information corresponding to the sample reply text information as output and training based on a machine learning algorithm.
Optionally, the interaction device further comprises: the quantity acquisition module is used for acquiring the quantity of sign language information in the video to be processed in a preset time period; the speed calculation module is used for calculating the change speed of the sign language information in the video to be processed based on the preset time period and the number; and the first characteristic acquisition module is used for carrying out emotion analysis on the human face image sequence and the change speed to acquire the emotion characteristics.
Optionally, the interaction device further comprises: the semantic information determining module is used for acquiring sign language information adjacent to the sign language information and determining context semantic information based on the sign language information and the adjacent sign language information; and the second characteristic acquisition module is used for carrying out emotion analysis on the face image and the context semantic information to acquire the emotion characteristics.
Optionally, the interaction device further comprises: the video acquisition module is used for acquiring a video to be processed; the information judgment module is used for judging whether the video to be processed contains sign language information or not based on a first neural network model if the current mode of the terminal equipment is a non-sign language recognition mode; and the mode switching module is used for switching the current mode of the terminal equipment into a sign language identification mode when the video to be processed contains sign language information.
In a third aspect, an embodiment of the present application provides a terminal device, including a memory and a processor, where the memory is coupled to the processor and stores instructions which, when executed by the processor, cause the processor to perform the method described above.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium, in which program code is stored, and the program code can be called by a processor to execute the method according to the first aspect.
The embodiments of the application provide an interaction method, an interaction apparatus, a terminal device, and a storage medium. When the current mode of the terminal device is a sign language recognition mode, sign language information and a face image sequence are acquired from a video to be processed; the sign language information is recognized to obtain text information, and emotion analysis is performed on the face image sequence to obtain emotional features; semantic information of the video to be processed is determined based on the text information and the emotional features, and reply sign language information corresponding to the semantic information is acquired; action parameters of the virtual intelligent customer service are generated based on the reply sign language information; the actions of the virtual intelligent customer service are driven based on the action parameters to generate a reply image sequence; and a reply video for the video to be processed is generated and output based on the reply image sequence. By recognizing both the user's sign language information and the user's face and determining semantic information from the recognized text information and emotional features, the method and device improve the accuracy of user intention recognition.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings required for describing the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application, and those skilled in the art can also obtain other drawings from these drawings without creative effort.
FIG. 1 is a schematic diagram of an application environment suitable for use in embodiments of the present application;
FIG. 2 is a flowchart illustrating an interaction method provided by an embodiment of the present application;
FIG. 3 is a diagram illustrating an example display of a virtual intelligent customer service provided by an embodiment of the present application;
FIG. 4 is a flowchart illustrating another interaction method provided by an embodiment of the present application;
FIG. 5 is a flowchart illustrating a further interaction method provided by an embodiment of the present application;
FIG. 6 is a flowchart illustrating a further interaction method provided by an embodiment of the present application;
FIG. 7 is a flowchart illustrating a further interaction method provided by an embodiment of the present application;
FIG. 8 is a flowchart illustrating yet another interaction method provided by an embodiment of the present application;
FIG. 9 is a block diagram illustrating the structure of an interaction apparatus provided by an embodiment of the present application;
FIG. 10 is a block diagram of a terminal device for executing an interaction method according to an embodiment of the present application;
FIG. 11 illustrates a storage unit for storing or carrying program code for implementing an interaction method according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
With the development of the internet and social media, in addition to industries that have traditionally had a strong demand for customer service (e.g., telecom operator customer service, bank customer service, online robots answering questions about government policies, etc.), some emerging industries such as mobile phones, automobiles, and express delivery are also beginning to introduce virtual customer service assistants (i.e., virtual intelligent customer service). When the virtual intelligent customer service converses with a user, it can express the reply to the user's inquiry in speech through a virtual character image, so that the user can intuitively see, on the human-computer interaction interface, a virtual customer service assistant with a virtual character image speaking, as if the user and the virtual customer service assistant were communicating face to face.
However, at present customer service robots mainly serve users without hearing or speech disabilities, and few customer service robots provide services for such special groups. Meanwhile, customer service robots that do interact with users by recognizing sign language information input by the users cannot accurately determine the users' intentions from the sign language information alone.
To solve the above problems, the inventor proposes the interaction method, interaction apparatus, terminal device, and storage medium of the embodiments of the present application, which recognize the user's sign language information and face, and determine semantic information according to the recognized text information and emotional features, thereby improving the accuracy of user intention recognition.
In order to better understand the interaction method, the interaction apparatus, the terminal device, and the storage medium provided in the embodiments of the present application, an application environment suitable for the embodiments of the present application is described below.
Referring to fig. 1, fig. 1 is a schematic diagram illustrating an application environment suitable for the embodiments of the present application. The interaction method provided by the embodiments of the present application can be applied to the polymorphic interaction system 100 shown in fig. 1. The polymorphic interaction system 100 includes a terminal device 110 and a server 120, and the server 120 is communicatively coupled to the terminal device 110. The server 120 may be a conventional server or a cloud server, which is not limited herein.
The terminal device 110 may be various electronic devices having a display screen and supporting data input, including but not limited to a smart phone, a tablet computer, a laptop portable computer, a desktop computer, a wearable electronic device, and the like. Specifically, the data input may be voice input based on a voice module provided on the terminal device 110, character input based on a character input module, or the like.
The terminal device 110 may have a client application installed on it, and the user may communicate with the server 120 through the client application (e.g., an application (APP), a WeChat applet, and the like). Specifically, the server 120 is installed with a corresponding server-side application. The user may register a user account with the server 120 through the client application and communicate with the server 120 based on that account; for example, the user logs in to the user account in the client application and inputs text or voice information through it. After receiving the information input by the user, the client application may send the information to the server 120, so that the server 120 can receive, process, and store it, and the server 120 may also return a corresponding output message to the terminal device 110 according to the received information.
In some embodiments, a client application may be used to provide customer service to a user and to carry out customer service communication with the user, and the client application may interact with the user based on a virtual robot. Specifically, the client application may receive information input by the user and respond to the information based on the virtual robot. The virtual robot is a software program based on visual graphics which, when executed, can present to the user a robot form that simulates biological behaviors or thoughts. The virtual robot may be a robot that simulates a real person, for example, a lifelike robot built according to the appearance of the user or of another person, or a robot with an animated appearance, for example, in the form of an animal or a cartoon character, which is not limited herein.
In some embodiments, after acquiring the reply information corresponding to the information input by the user, the terminal device 110 may display a virtual robot image corresponding to the reply information on the display screen of the terminal device 110 or another image output device connected to it (where the characteristics of the virtual robot image may include the gender of the virtual robot, the reply emotion corresponding to the reply audio, character traits, and the like). As one mode, while the virtual robot image is played, the audio corresponding to the virtual robot image may be played through the speaker of the terminal device 110 or another audio output device connected to it, and the text or graphics corresponding to the reply information may be displayed on the display screen of the terminal device 110, thereby realizing polymorphic interaction with the user across image, voice, text, and other modalities.
In some embodiments, the means for processing the information input by the user may also be disposed on the terminal device 110, so that the terminal device 110 can interact with the user without relying on establishing communication with the server 120, and in this case, the polymorphic interaction system 100 may only include the terminal device 110.
The above application environments are only examples for facilitating understanding, and it is to be understood that the embodiments of the present application are not limited to the above application environments.
The interaction method, the interaction apparatus, the terminal device, and the storage medium provided in the embodiments of the present application are described in detail below with specific embodiments.
Referring to fig. 2, fig. 2 is a flowchart illustrating an interaction method according to an embodiment of the present disclosure. The interaction method provided by the embodiment can be applied to terminal equipment with a display screen or other image output devices, and the terminal equipment can be electronic equipment such as a smart phone, a tablet personal computer and a wearable intelligent terminal.
In a specific embodiment, the interaction method may be applied to the interaction apparatus 200 shown in fig. 9 and the terminal device 110 shown in fig. 10. The flow shown in fig. 2 will be described in detail below. The interaction method may specifically include the following steps:
step S110: and when the current mode of the terminal equipment is a sign language recognition mode, acquiring sign language information and a face image sequence in the video to be processed.
In this embodiment, the terminal device may include multiple modes, where different modes correspond to different operations of the terminal device, for example, when the current mode of the terminal device is a voice recognition mode, the terminal device may collect voice information and recognize the voice information, so that a user may perform human-computer interaction through voice; when the current mode of the terminal equipment is a text recognition mode, the terminal equipment can acquire text information input by a user and interact with the user; when the current mode of the terminal equipment is a sign language recognition mode, sign language information and a face image sequence in a video to be processed can be acquired for recognition operation.
In some embodiments, the terminal device may select different modes by receiving an operation of a user. Specifically, the terminal device may select a corresponding mode based on a touch operation of the user on the interface, for example, when the user clicks an icon for voice recognition on the interface, the mode of the terminal device may be selected as the voice recognition mode. The terminal device may also determine a mode corresponding to the video by collecting the video containing the user and recognizing the video, for example, when it is recognized that the video contains sign language information, the mode of the terminal device may be selected as a sign language recognition mode.
As an implementation manner, when the current mode of the terminal device is a sign language recognition mode, in order to avoid false triggering operation caused by acquiring voice information, audio acquisition devices such as a microphone may be turned off, and only image acquisition devices such as a camera may be turned on to acquire sign language information and a face image sequence of a user, so that power consumption of the terminal device may be reduced.
In some embodiments, the video to be processed is a video stream including at least a hand of the user and a face of the user, and may be a video stream including only the upper body of the user or a video stream including the entire body of the user. The terminal equipment can acquire the video to be processed in various modes. In some embodiments, the video to be processed may be a video of the user, which is acquired by the terminal device in real time by using an image acquisition device such as a camera when the user interacts with the virtual smart customer service. Specifically, as a manner, when the application program corresponding to the virtual intelligent customer service is run in the system foreground of the terminal device, each hardware module of the terminal device may be called to collect the video of the user.
In some embodiments, after the terminal device acquires the video to be processed, and when the current mode of the terminal device is a sign language recognition mode, sign language information and a face image sequence in the video to be processed may be acquired. As an implementation, the video to be processed may be decomposed to extract sign language information and a sequence of face images. The sign language information may be a video image including a hand motion selected from the decomposed video images, and the face image sequence may be a video image including a face of the user selected from the decomposed video images.
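As a purely illustrative sketch (not the patent's implementation), the decomposition described above could be prototyped in Python roughly as follows; the face detector is a stock OpenCV Haar cascade, and contains_hand_motion is a hypothetical placeholder for whatever hand or gesture detector an actual implementation would use:

import cv2  # OpenCV is assumed to be available

face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def contains_hand_motion(frame) -> bool:
    # Placeholder: a real system would run a hand/gesture detector here.
    return True

def decompose_video(path: str):
    sign_frames, face_frames = [], []
    cap = cv2.VideoCapture(path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if len(face_detector.detectMultiScale(gray, 1.1, 4)) > 0:
            face_frames.append(frame)   # frame containing the user's face
        if contains_hand_motion(frame):
            sign_frames.append(frame)   # frame containing a hand action
    cap.release()
    return sign_frames, face_frames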
Step S120: and recognizing the sign language information to obtain text information, and performing emotion analysis on the face image sequence to obtain emotion characteristics.
In some embodiments, the sign language information may be input into a recognition model corresponding to the sign language information, and the sign language information is recognized based on the recognition model, and the text information corresponding to the sign language information is acquired.
As an implementation manner, the text information may be text information corresponding to the sign language information that is queried and obtained from a question-and-answer library based on the sign language information, where the question-and-answer library contains pre-stored sign language information and the pre-stored text information corresponding to it, and each piece of sign language information corresponds one-to-one to its matching text information. For example, the sign language information pre-stored in the question-and-answer library may correspond to a complete question such as "Is shipping free for the items I bought?", so that the text information corresponding to the sign language information can be acquired based on the sign language information.
In one embodiment, the text information may be obtained based on a question-answer model, and specifically, sign language information may be input into the question-answer model, and text information corresponding to the sign language information may be obtained through the question-answer model. The question-answer model may be obtained based on a large number of question-answer pairs, for example, a large number of question-answer videos obtained from communication records of a large number of human customer services may be used as training samples, sign language information is used as input, text information corresponding to the sign language information is used as expected output, and the question-answer model is obtained based on machine learning method training, so that the text information corresponding to the sign language information is obtained through the question-answer model.
In some embodiments, after the terminal device acquires the face image sequence in the video to be processed, emotion analysis may be performed on the face image sequence to acquire the emotional features of the user. The emotional features can be used to represent the emotion of the person in the face images. In some embodiments, the emotions represented by the emotional features may include positive emotions such as excitement, pleasure, happiness, satisfaction, relaxation, and calmness, and may also include negative emotions such as fatigue, boredom, depression, anger, and tension, which is not limited herein.
In some embodiments, the sequence of face images may be analyzed for mood by a deep learning technique. As one way, the face image sequence may be input into a trained emotion recognition model, to obtain emotion characteristics output by the emotion recognition model. Specifically, in some embodiments, the emotion recognition model may be obtained by training through a neural network in advance based on a human face image sequence when a large number of real persons speak and training samples of emotional features presented by human faces. The training samples can comprise input samples and output samples, the input samples can comprise a face image sequence, the output samples can be emotion characteristics of people in the images, and therefore the trained emotion recognition model can be used for outputting the emotion characteristics of the people in the images according to the obtained face image sequence.
The emotion recognition model may adopt machine learning models such as a Recurrent Neural Network (RNN) model, a Convolutional Neural Network (CNN) model, a bidirectional long-short-term memory neural network (BiLSTM) model, and a variational self-encoder (VAE) model, which are not limited herein. For example, the emotion recognition model may also be a variant or a combination of the above-described machine learning models, or the like.
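For illustration, a toy emotion recognition model in the spirit described above (a CNN frame encoder followed by a recurrent layer over the face image sequence) could be sketched as follows; the layer sizes and the 12-way output are assumptions made for the example, not the patent's design:

import torch
import torch.nn as nn

class EmotionRecognizer(nn.Module):
    def __init__(self, num_emotions: int = 12):
        super().__init__()
        self.frame_encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(8), nn.Flatten())          # -> (batch*seq, 16*8*8)
        self.temporal = nn.LSTM(input_size=16 * 8 * 8, hidden_size=64,
                                batch_first=True)
        self.classifier = nn.Linear(64, num_emotions)

    def forward(self, face_seq: torch.Tensor) -> torch.Tensor:
        # face_seq: (batch, seq_len, 3, height, width)
        b, t, c, h, w = face_seq.shape
        feats = self.frame_encoder(face_seq.view(b * t, c, h, w)).view(b, t, -1)
        out, _ = self.temporal(feats)
        return self.classifier(out[:, -1])                  # emotion logits per sequence

logits = EmotionRecognizer()(torch.rand(1, 5, 3, 64, 64))   # example: a 5-frame sequence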
Step S130: and determining semantic information of the video to be processed based on the text information and the emotional characteristics, and acquiring reply sign language information corresponding to the semantic information.
It can be understood that the same sentence conveys different semantics under different emotions. For example, for the same sentence "what does this mean", the semantic information understood under a negative emotion may be a challenge, venting, or the like, while the semantic information understood under a positive emotion may be a question, a consultation, or the like. Therefore, in the embodiment of the application, semantic understanding can be performed according to the text information and the emotional features the user shows when expressing the sentence, so that the user's intention can be determined accurately and the virtual intelligent customer service can reply with the corresponding reply sign language information. For example, the reply sign language information under a negative emotion may be "please calm down" or the like, and the reply sign language information under a positive emotion may be "XX means ..." or the like.
In some embodiments, after the terminal device obtains the text information and the emotional features, semantic information related to dialog, such as user intention and word slots, may be determined through a deep learning technique, so as to determine corresponding reply sign language information according to the semantic information. As one mode, the text information and the emotional features are input into the trained feature recognition model to obtain the semantic information output by the feature recognition model, and then the corresponding reply sign language information is generated according to the semantic information and the emotional features of the user. The feature recognition model can be obtained by taking a large amount of text information and emotional features as input samples and semantic information corresponding to the text information and the emotional features as output samples through neural network training.
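As a hedged illustration of how such a feature recognition model might fuse the two inputs, the sketch below concatenates a simple text encoding with an emotion embedding and maps the result to intent classes; the vocabulary size, embedding sizes, and number of intents are all assumptions for the example:

import torch
import torch.nn as nn

class FeatureRecognitionModel(nn.Module):
    def __init__(self, vocab_size=1000, num_emotions=12, num_intents=20):
        super().__init__()
        self.text_embed = nn.EmbeddingBag(vocab_size, 32)     # bag-of-words text encoder
        self.emotion_embed = nn.Embedding(num_emotions, 8)    # emotion feature encoder
        self.intent_head = nn.Linear(32 + 8, num_intents)     # semantic information (intent)

    def forward(self, token_ids: torch.Tensor, emotion_id: torch.Tensor):
        fused = torch.cat([self.text_embed(token_ids),
                           self.emotion_embed(emotion_id)], dim=-1)
        return self.intent_head(fused)

intent_logits = FeatureRecognitionModel()(torch.tensor([[3, 17, 42]]),   # token ids of the text
                                          torch.tensor([5]))             # emotion class id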
Step S140: and generating action parameters of the virtual intelligent customer service based on the reply sign language information.
In some implementations, the action parameters for the virtual smart customer service can be generated based on the reply sign language information.
As an implementation manner, a large amount of training sign language information and the action parameters corresponding to the training sign language information can be obtained in advance as a training sample set, and the training sample set is input into a machine learning model for training to obtain a neural network model corresponding to the action parameters, so that the reply sign language information can be input into the neural network model to obtain the action parameters of the virtual intelligent customer service. The neural network model may be a recurrent neural network (RNN), a long short-term memory (LSTM) network, or the like.
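By way of example only, an LSTM-based mapping from a reply sign language token sequence to per-step action parameters (e.g., one rotation angle per joint) could be sketched as follows; the dimensions and the joint-angle output format are assumptions, since the text only states that an RNN- or LSTM-type model may be used:

import torch
import torch.nn as nn

class ActionParameterModel(nn.Module):
    def __init__(self, vocab_size=500, num_joints=20):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 64)
        self.lstm = nn.LSTM(64, 128, batch_first=True)
        self.joint_angles = nn.Linear(128, num_joints)   # one rotation angle per key joint

    def forward(self, reply_sign_tokens: torch.Tensor) -> torch.Tensor:
        out, _ = self.lstm(self.embed(reply_sign_tokens))
        return self.joint_angles(out)                    # (batch, seq_len, num_joints)

angles = ActionParameterModel()(torch.tensor([[7, 21, 3]]))   # a three-step reply sequence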
Therefore, when receiving the inquiry of the user, the virtual intelligent customer service can inform the user of the reply content through the sign language. For example, when a user asks the direction of a store in sign language, the virtual smart customer service can inform the user of a specific route in sign language.
Step S150: and driving the action of the virtual intelligent customer service based on the action parameters to generate a reply image sequence, wherein the reply image sequence is formed by a plurality of continuous action images generated by driving the virtual intelligent customer service.
In the embodiment of the application, the actions of the virtual intelligent customer service can be driven through the action parameters, that is, the human body model of the virtual intelligent customer service can be driven to present different actions (mainly meaning that the upper body and limbs of the virtual intelligent customer service present different actions).
As an embodiment, the human body model of the virtual intelligent customer service may be a three-dimensional human body model created with three-dimensional modeling software, and therefore the human body model of the virtual intelligent customer service can be driven based on the action parameters so that the virtual intelligent customer service presents different actions. Specifically, information such as the rotation angle of each key joint can be parsed from the action parameters, and the corresponding joints in the human body model are driven to move according to this information, so that the virtual intelligent customer service presents different actions. Driving the actions of the virtual intelligent customer service through the action parameters yields behavior images of the virtual intelligent customer service, and a reply image sequence can then be generated from multiple consecutive frames of behavior images.
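The following minimal sketch illustrates this driving step as a data transformation: each per-frame set of joint angles is turned into one frame of the reply image sequence. render_pose is a hypothetical stand-in for the 3D human-model renderer implied above:

from typing import Dict, List

def render_pose(joint_angles: Dict[str, float]) -> str:
    # Placeholder: a real implementation would pose the 3D customer-service
    # model with these joint rotations and rasterize one frame here.
    return "frame:" + ",".join(f"{j}={a:.1f}" for j, a in sorted(joint_angles.items()))

def drive_virtual_agent(action_params: List[Dict[str, float]]) -> List[str]:
    # One action-parameter dict per output frame -> multi-frame reply image sequence.
    return [render_pose(p) for p in action_params]

reply_image_sequence = drive_virtual_agent([
    {"right_elbow": 45.0, "right_wrist": 10.0},
    {"right_elbow": 60.0, "right_wrist": 25.0},
])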
Step S160: based on the reply image sequence, a reply video for the video to be processed is generated and output.
As an embodiment, the reply video may be a video generated by the virtual intelligent customer service for the sign language information input by the user and used to give the user a corresponding reply. Further, a reply video for the video to be processed may be generated and output based on the reply image sequence. Specifically, a preset video may be obtained, where the preset video is a video prepared in advance for giving feedback to the user on the video to be processed and contains a preset reply image sequence; the preset reply image sequence in the preset video may then be replaced with the reply image sequence described above to generate the reply video for the video to be processed, which is then output and displayed to the user.
As an embodiment, the reply video may include a reply image sequence, that is, multiple frames of continuous behavior images generated by the virtual smart customer service are driven based on the motion parameter, for example, taking the display interface of the terminal device 110 shown in fig. 3 as an example, the user may initiate an inquiry at the terminal device 110 by sign language, after obtaining the sign language of the user, the customer service system identifies the inquiry content corresponding to the sign language and obtains corresponding reply sign language information, and then may generate the motion parameter of the virtual smart customer service 101 based on the reply sign language information, and drive the virtual smart customer service 101, so that the virtual smart customer service 101 replies to the user by the sign language.
As an embodiment, reply text information corresponding to the reply sign language information may be obtained, and video presentation information (e.g., subtitles in the video) may be obtained based on the reply text information, and then a reply video for the video to be processed may be generated and output based on the reply image sequence and the video presentation information. Further, when generating the reply video for the information to be processed, in order to synchronize the reply image sequence in the output reply video with the video presentation information, time stamp information may be respectively tagged to the reply image sequence and the video presentation information, so as to align the reply image sequence and the video presentation information based on the time stamp information when generating the reply video, thereby realizing content synchronization in the reply video.
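A minimal sketch of the timestamp-based alignment described above is shown below; the data layout (a start timestamp per frame and per subtitle segment) is an assumption for illustration:

from typing import List, Tuple

def align(frames: List[Tuple[float, str]],
          subtitles: List[Tuple[float, str]]) -> List[Tuple[str, str]]:
    video = []
    for t, frame in sorted(frames):
        current = ""                       # latest subtitle whose start time has been reached
        for start, text in sorted(subtitles):
            if start <= t:
                current = text
        video.append((frame, current))     # frame paired with its synchronized subtitle
    return video

reply_video = align(frames=[(0.0, "img0"), (0.5, "img1"), (1.0, "img2")],
                    subtitles=[(0.0, "Hello"), (0.8, "the parcel has shipped")])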
It should be noted that the image of the virtual intelligent customer service in fig. 3 is only an example, and the image of the virtual intelligent customer service may be diversified in actual implementation. As one way, when the user has turned on the video service function button, the virtual smart service can be displayed at the user side of the video service. Optionally, a place for displaying the virtual intelligent customer service may not be limited, for example, the virtual intelligent customer service may be displayed on a display interface of an APP client of a mobile phone, or displayed on a page of a website of an operator, or displayed on a display interface of a terminal device such as a customer service machine of a bank, and the like, and is not particularly limited.
According to the interaction method provided by this embodiment, when the current mode of the terminal device is the sign language recognition mode, sign language information and a face image sequence are acquired from the video to be processed; the sign language information is recognized to obtain text information, and emotion analysis is performed on the face image sequence to obtain emotional features; semantic information of the video to be processed is determined based on the text information and the emotional features, and reply sign language information corresponding to the semantic information is acquired; action parameters of the virtual intelligent customer service are generated based on the reply sign language information; the actions of the virtual intelligent customer service are driven based on the action parameters to generate a reply image sequence; and a reply video for the video to be processed is generated and output based on the reply image sequence. By recognizing both the user's sign language information and face and determining semantic information from the recognized text information and emotional features, the accuracy of user intention recognition is improved.
Referring to fig. 4, fig. 4 is a schematic flowchart illustrating another interaction method provided in the embodiment of the present application, where the method includes:
step S210: and when the current mode of the terminal equipment is a sign language recognition mode, acquiring sign language information and a face image sequence in the video to be processed.
For the detailed description of steps S210 to S220, refer to steps S110 to S120, which are not described herein again.
Step S220: and recognizing the sign language information to obtain text information, and performing emotion analysis on the face image sequence to obtain emotion characteristics.
In the embodiment of the application, the sign language information can be recognized to obtain text information, and emotion analysis can be performed on the face image sequence to obtain emotional features.
In some embodiments, the human face image sequence includes a plurality of human face images, and performing emotion analysis on the human face image sequence to obtain an emotional feature may include the following steps:
step S221: and extracting face key points corresponding to each face image in the face image sequence.
In the embodiment of the application, when the face has different expressions, the position distribution of the face key points is different, so that the face key points corresponding to each face image in the face image sequence can be extracted for emotion analysis, and the accuracy of emotion analysis is improved. The number of the face key points may be 68.
Step S222: and obtaining a feature vector corresponding to each face image based on each face image in the face image sequence and the face key point corresponding to each face image.
In some embodiments, a machine learning model may be used to obtain the feature vector corresponding to each face image based on each face image in the face image sequence and the face key points corresponding to each face image. Specifically, the machine learning model may encode the face image and the face key points corresponding to the face image separately to obtain a first feature vector and a second feature vector. The machine learning model can then align and concatenate the two feature vectors to generate a third feature vector. The machine learning model performs the above processing on each face image in the face image sequence and its corresponding face key points, thereby obtaining a feature sequence composed of third feature vectors as the actual input of the machine learning model. For example, the machine learning model may encode a face image and the 68 face key points corresponding to it into a feature vector a and a feature vector b, and then align and concatenate feature vector a and feature vector b into a feature vector c of the form [a, b]; after repeating this processing for the multiple face images and their corresponding face key points, a feature sequence composed of the feature vectors c can be obtained as the actual input of the machine learning model.
After the terminal device inputs each face image and the face key point corresponding to each face image into the machine learning model, a two-dimensional feature vector corresponding to each face image output by the machine learning model can be obtained, and the two-dimensional feature vector can be used for analyzing the emotional state of the user in the image.
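As a simple illustration of step S222, the sketch below encodes a face image and its 68 key points separately and concatenates the two vectors in the form [a, b]; the "encoders" here are trivial placeholders (channel means and flattening), not the machine learning model referred to above:

import numpy as np

def encode_image(face_image: np.ndarray) -> np.ndarray:
    # feature vector a: a crude global descriptor of the face image
    return face_image.mean(axis=(0, 1))            # per-channel means

def encode_keypoints(keypoints: np.ndarray) -> np.ndarray:
    # feature vector b: the 68 (x, y) key points flattened to length 136
    return keypoints.reshape(-1)

face = np.random.rand(64, 64, 3)                   # stand-in face image
points = np.random.rand(68, 2)                     # stand-in key points
feature_c = np.concatenate([encode_image(face), encode_keypoints(points)])   # [a, b]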
Step S223: and determining the emotion characteristics corresponding to the feature vectors according to a preset mapping relation to obtain the emotion characteristics corresponding to each face image in the face image sequence, wherein the preset mapping relation comprises the corresponding relation between a plurality of feature vectors and a plurality of emotion characteristics.
In some embodiments, the mapping relationship between the two-dimensional feature vectors and the emotional features may be embodied by an Arousal-Valence emotion model, where the two dimensions of the emotion feature vector correspond to the Arousal axis and the Valence axis, respectively. Specifically, the Arousal-Valence emotion space to which the two-dimensional feature vectors are mapped may be divided into 12 equal subspaces according to a designed method, with the 12 subspaces corresponding to 12 emotional states. Of these, the 12 emotional states are divided into 6 positive emotions (such as excited, happy, satisfied, relaxed, and calm) and 6 negative emotions (such as tired, bored, depressed, angry, and tense). Then, a coordinate point in the emotion space can be uniquely determined from the value of each dimension of the two-dimensional feature vector, and the emotional state corresponding to the two-dimensional feature vector can be determined by finding the emotional state corresponding to the subspace in which the coordinate point falls.
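For illustration, one way to realize the subspace lookup is to treat the 12 equal subspaces as 30-degree angular sectors around the origin of the Arousal-Valence plane, as in the sketch below. The sector partition, the ordering of the states, and the filler label "sad" (the translated text names only five negative emotions) are assumptions, not the patent's specification:

import math

POSITIVE = ["excited", "pleased", "happy", "satisfied", "relaxed", "calm"]
NEGATIVE = ["tired", "bored", "depressed", "angry", "tense", "sad"]
EMOTIONS = POSITIVE + NEGATIVE           # 12 states, one per 30-degree sector

def emotion_from_vector(valence: float, arousal: float) -> str:
    angle = math.degrees(math.atan2(arousal, valence)) % 360.0
    return EMOTIONS[int(angle // 30) % 12]

print(emotion_from_vector(0.8, 0.5))     # a point in the positive, high-arousal region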
Step S230: and inputting the text information into the first machine learning model to obtain semantic information corresponding to the text information.
In the embodiment of the application, the first machine learning model may be obtained by training a neural network on training samples consisting of a large amount of text information and the semantic information corresponding to that text information. It can be understood that the first machine learning model is a model for converting text information into corresponding semantic information. By inputting the previously acquired text information into the first machine learning model, the semantic information corresponding to the text information can be output by the first machine learning model.
Step S240: and inputting the emotional features into the second machine learning model to obtain semantic information corresponding to the emotional features.
In the embodiment of the present application, the second machine learning model may be obtained by training a neural network on training samples consisting of a large number of emotional features and the semantic information corresponding to those emotional features. It can be understood that the second machine learning model is a model for converting emotional features into corresponding semantic information. By inputting the previously acquired emotional features into the second machine learning model, the semantic information corresponding to the emotional features can be output by the second machine learning model.
Step S250: and determining semantic information of the video to be processed based on the semantic information corresponding to the text information and the semantic information corresponding to the emotional characteristics.
In the embodiment of the application, the semantic information of the video to be processed can be determined based on the semantic information corresponding to the text information and the semantic information corresponding to the emotional features. The semantic information corresponding to the text information may be the content conveyed by the sign language information, for example, "has this piece of clothing shipped"; the semantic information corresponding to the emotional features may be the user's emotion when inputting the sign language information, for example, an inquiring tone or an angry tone.
In some embodiments, semantic information that matches both the semantic information corresponding to the text information and the semantic information corresponding to the emotional features, that is, the semantic information of the video to be processed, may be looked up in a preset semantic recognition library, and the corresponding reply sign language information may then be obtained according to that semantic information. For example, the semantic information corresponding to "has this piece of clothing shipped" asked in an angry tone can be found in the semantic recognition library, and reply sign language information with a soothing tone can then be acquired according to that semantic information.
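A minimal sketch of such a library lookup is given below; the keys, values, and fallback are invented placeholders used only to illustrate combining the text semantics with the emotion semantics:

SEMANTIC_LIBRARY = {
    ("ask_shipping_status", "angry"): "reply_sign:apologize_and_give_tracking_info",
    ("ask_shipping_status", "inquiring"): "reply_sign:give_tracking_info",
}

def lookup_reply(text_semantics: str, emotion_semantics: str) -> str:
    # Fall back to a clarification request when no matching entry exists.
    return SEMANTIC_LIBRARY.get((text_semantics, emotion_semantics),
                                "reply_sign:ask_for_clarification")

print(lookup_reply("ask_shipping_status", "angry"))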
Step S260: and acquiring reply sign language information corresponding to the semantic information.
Step S270: and generating action parameters of the virtual intelligent customer service based on the reply sign language information.
Step S280: and driving the action of the virtual intelligent customer service based on the action parameters to generate a reply image sequence, wherein the reply image sequence is formed by a plurality of continuous action images generated by driving the virtual intelligent customer service.
Step S290: based on the reply image sequence, a reply video for the video to be processed is generated and output.
For detailed description of steps S260 to S290, please refer to steps S130 to S160, which are not described herein again.
According to the interaction method provided by this embodiment, when the current mode of the terminal device is the sign language recognition mode, sign language information and a face image sequence are acquired from the video to be processed; the sign language information is recognized to obtain text information, and emotion analysis is performed on the face image sequence to obtain emotional features; the text information is input into a first machine learning model to obtain semantic information corresponding to the text information; the emotional features are input into a second machine learning model to obtain semantic information corresponding to the emotional features, and the semantic information of the video to be processed is determined based on the semantic information corresponding to the text information and the semantic information corresponding to the emotional features; reply sign language information corresponding to the semantic information is acquired; action parameters of the virtual intelligent customer service are generated based on the reply sign language information; the actions of the virtual intelligent customer service are driven based on the action parameters to generate a reply image sequence; and a reply video for the video to be processed is generated and output based on the reply image sequence. The semantic information corresponding to the text information and the semantic information corresponding to the emotional features are obtained through two machine learning models respectively, and the semantic information of the video to be processed is determined based on both, which improves the accuracy of determining the semantic information and thus the accuracy of sign language recognition.
Referring to fig. 5, fig. 5 is a schematic flowchart illustrating a further interaction method provided in the embodiment of the present application, where the method includes:
step S310: and when the current mode of the terminal equipment is a sign language recognition mode, acquiring sign language information and a face image sequence in the video to be processed.
For detailed description of step S310, please refer to step S110, which is not described herein again.
Step S320: and acquiring the quantity of sign language information in the video to be processed in a preset time period.
It will be appreciated that the pace at which the user's hands move when using sign language reflects different emotions of the user. Therefore, in the embodiment of the application, the change speed of the sign language information in the video to be processed can be acquired. Specifically, the number of items of sign language information in the video to be processed within a preset time period may be acquired, for example, the number of items of sign language information within thirty seconds.
Step S330: and calculating the change speed of the sign language information in the video to be processed based on the preset time period and the number.
In some embodiments, the change speed of the sign language information in the video to be processed may be calculated based on the preset time period and the number of items of sign language information acquired within that period. For example, if sixty items of sign language information are acquired in thirty seconds, the change speed of the sign language information can be calculated to be two items of sign language information per second.
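By way of a non-limiting illustration, the change speed calculation is a simple division of the number of items by the preset time period, as sketched below using the example from the text.

```python
def sign_language_change_speed(num_items: int, period_seconds: float) -> float:
    """Items of sign language information per second within the preset time period."""
    if period_seconds <= 0:
        raise ValueError("period must be positive")
    return num_items / period_seconds

# Example from the text: sixty items in thirty seconds -> 2.0 items per second.
print(sign_language_change_speed(60, 30))
```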
Step S340: and recognizing the sign language information to obtain text information, and performing emotion analysis on the face image sequence and the change speed to obtain emotion characteristics.
In the embodiment of the application, the sign language information can be recognized to obtain the text information. It can be understood that users input sign language information at different speeds when their emotions differ; when the input speed is higher, the user may be anxious at that moment, and the corresponding intention may differ accordingly. Therefore, in order to determine the intention of the user more accurately, emotion analysis can be performed on the face image sequence and the change speed, so that more accurate emotional characteristics can be obtained and the intention of the user can be determined more accurately.
In some embodiments, the sequence of face images and the variation speed may be input into a trained emotion recognition model, resulting in emotion characteristics output by the emotion recognition model. The sequence of the face images and the change speed can also be respectively input into two different emotion recognition models to obtain two corresponding emotion characteristics output by the two different emotion recognition models, and then the two corresponding emotion characteristics are synthesized to obtain the final emotion characteristics.
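By way of a non-limiting illustration, one way of integrating the two emotion characteristics output by two different emotion recognition models is a weighted merge, sketched below; the weight and emotion labels are assumptions.

```python
def combine_emotion_features(face_scores: dict, speed_scores: dict, face_weight: float = 0.7) -> dict:
    """Merge per-emotion scores from two recognizers, weighting the face-based model higher."""
    emotions = set(face_scores) | set(speed_scores)
    return {
        e: face_weight * face_scores.get(e, 0.0) + (1 - face_weight) * speed_scores.get(e, 0.0)
        for e in emotions
    }

face_model_out = {"calm": 0.2, "anxious": 0.8}    # from the face image sequence
speed_model_out = {"calm": 0.1, "anxious": 0.9}   # from the change speed
combined = combine_emotion_features(face_model_out, speed_model_out)
print(max(combined.items(), key=lambda kv: kv[1]))  # final emotion characteristic
```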
In other embodiments, the emotion of the user may be determined according to the change speed, and when the change speed is high, the emotion of the user may be anxious or angry, and at this time, more accurate emotional characteristics may be obtained by screening in combination with analysis of key points of the face in the face image sequence.
Step S350: and determining semantic information of the video to be processed based on the text information and the emotional characteristics, and acquiring reply sign language information corresponding to the semantic information.
Step S360: and generating action parameters of the virtual intelligent customer service based on the reply sign language information.
Step S370: and driving the action of the virtual intelligent customer service based on the action parameters to generate a reply image sequence, wherein the reply image sequence is formed by a plurality of continuous action images generated by driving the virtual intelligent customer service.
Step S380: based on the reply image sequence, a reply video for the video to be processed is generated and output.
For the detailed description of steps S350 to S380, refer to steps S130 to S160, which are not described herein again.
According to the interaction method provided by the embodiment, when the current mode of the terminal equipment is the sign language recognition mode, sign language information and a face image sequence in a video to be processed are obtained; acquiring the quantity of sign language information in a video to be processed in a preset time period; calculating the change speed of sign language information in the video to be processed based on the preset time period and the number; recognizing sign language information to obtain text information, performing emotion analysis on a face image sequence and a change speed to obtain emotion characteristics, determining semantic information of a video to be processed based on the text information and the emotion characteristics, and obtaining reply sign language information corresponding to the semantic information; generating action parameters of the virtual intelligent customer service based on the reply sign language information; driving the action of the virtual intelligent customer service based on the action parameters to generate a reply image sequence; based on the reply image sequence, a reply video for the video to be processed is generated and output. The emotion characteristics are obtained by performing emotion analysis on the face image sequence and the change speed, so that more accurate emotion characteristics are obtained, and the accuracy of semantic information identification of the video to be processed is improved.
Referring to fig. 6, fig. 6 is a schematic flowchart illustrating a further interaction method according to an embodiment of the present application, where the method includes:
step S410: and when the current mode of the terminal equipment is a sign language recognition mode, acquiring sign language information and a face image sequence in the video to be processed.
For detailed description of step S410, please refer to step S110, which is not described herein again.
Step S420: acquiring sign language information adjacent to the sign language information, and determining context semantic information based on the sign language information and the adjacent sign language information.
It is understood that the mood of the user differs with the context. For example, for the same question "when will it ship?", if the previous sentence is "does this piece of clothing ship for free?", the corresponding tone is a peaceful one and the user's mood may be calm; if the previous sentence is "I placed my order several days ago", the corresponding tone may be a discontented one and the user's mood may be angry. Therefore, context semantics can be determined according to the context, so that more accurate emotional characteristics can be obtained. Accordingly, in the embodiment of the application, sign language information adjacent to the sign language information can be acquired, and context semantic information is determined based on the sign language information and the adjacent sign language information. The sign language information adjacent to the sign language information may be the previous sign language information, the subsequent sign language information, or both the previous and subsequent sign language information, which is not limited herein.
In some embodiments, the sign language information and sign language information adjacent to the sign language information may be input into a semantic recognition model, resulting in contextual semantic information output by the semantic recognition model. The semantic recognition model can be obtained by training a neural network model based on a large number of training samples of whole sign language information and context semantic information corresponding to the whole sign language information, wherein the whole sign language information comprises sign language information and sign language information adjacent to the sign language information.
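By way of a non-limiting illustration, forming the whole sign language information from adjacent items and passing it to the semantic recognition model can be sketched as follows; the rule-based stub stands in for the trained model, and the gloss names and labels are hypothetical.

```python
def whole_sign_language_information(current: list, previous: list = None, following: list = None) -> list:
    """Concatenate the current item with its adjacent items as the model input described above."""
    return (previous or []) + current + (following or [])

def context_semantic_model(glosses: list) -> str:
    """Stand-in for the trained semantic recognition model (rule-based, for illustration only)."""
    return "impatient_follow_up" if "ALREADY" in glosses and "DAYS" in glosses else "neutral_question"

whole = whole_sign_language_information(["SHIP", "WHEN"], previous=["ORDER", "DAYS", "ALREADY"])
print(context_semantic_model(whole))
```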
Step S430: and recognizing the sign language information to obtain text information, and performing emotion analysis on the face image sequence and the context semantic information to obtain emotion characteristics.
In the embodiment of the application, the sign language information can be recognized to obtain the text information. It can be understood that, because the context differs, the corresponding emotion of the user also differs, and emotion analysis can therefore be performed on the face image sequence and the context semantic information to obtain more accurate emotion characteristics.
In some embodiments, the face image sequence and the context semantic information may be input into a trained emotion recognition model to obtain the emotion characteristics output by the emotion recognition model. The face image sequence and the context semantic information can also be respectively input into two different emotion recognition models to obtain two corresponding emotion characteristics output by the two models, and the two corresponding emotion characteristics are then integrated to obtain the final emotion characteristics.
In some embodiments, emotion analysis may be performed on the face image sequence to obtain an emotion feature corresponding to the face image sequence, an emotion feature corresponding to the context semantic information may then be obtained according to the context semantic information, and the emotion feature corresponding to the face image sequence may be adjusted according to the emotion feature corresponding to the context semantic information, so as to obtain an emotion feature that integrates the analysis of the face image sequence and the context semantic information.
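By way of a non-limiting illustration, adjusting the emotion feature derived from the face image sequence according to the context-derived emotion feature can be sketched as a re-weighting followed by renormalization; the weighting scheme and labels are assumptions.

```python
def adjust_with_context(face_emotion: dict, context_emotion: dict) -> dict:
    """Re-weight face-derived emotion scores by the context-derived ones, then renormalize."""
    adjusted = {e: face_emotion.get(e, 0.0) * (0.5 + context_emotion.get(e, 0.0)) for e in face_emotion}
    total = sum(adjusted.values()) or 1.0
    return {e: score / total for e, score in adjusted.items()}

# Face analysis alone leans "calm", but an angry context shifts the integrated feature.
print(adjust_with_context({"calm": 0.6, "angry": 0.4}, {"calm": 0.1, "angry": 0.9}))
```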
Step S440: and determining semantic information of the video to be processed based on the text information and the emotional characteristics, and acquiring reply sign language information corresponding to the semantic information.
Step S450: and generating action parameters of the virtual intelligent customer service based on the reply sign language information.
Step S460: and driving the action of the virtual intelligent customer service based on the action parameters to generate a reply image sequence, wherein the reply image sequence is formed by a plurality of continuous action images generated by driving the virtual intelligent customer service.
Step S470: based on the reply image sequence, a reply video for the video to be processed is generated and output.
For the detailed description of steps S440 to S470, please refer to steps S130 to S160, which are not described herein again.
According to the interaction method provided by the embodiment, when the current mode of the terminal equipment is the sign language recognition mode, sign language information and a face image sequence in a video to be processed are obtained; the method comprises the steps of obtaining sign language information adjacent to sign language information, determining context semantic information based on the sign language information and the adjacent sign language information, identifying the sign language information to obtain text information, conducting emotion analysis on a face image sequence and the context semantic information to obtain emotion characteristics, determining semantic information of a video to be processed based on the text information and the emotion characteristics, obtaining reply sign language information corresponding to the semantic information, generating action parameters of a virtual intelligent customer service based on the reply sign language information, driving actions of the virtual intelligent customer service based on the action parameters to generate a reply image sequence, and generating and outputting a reply video aiming at the video to be processed based on the reply image sequence. Context semantic information is determined based on sign language information and adjacent sign language information, and emotion characteristics are obtained according to the face image sequence and the context semantic information, so that more accurate emotion characteristics are obtained, and accuracy of user intention identification is improved.
Referring to fig. 7, fig. 7 is a schematic flowchart illustrating a further interaction method provided in an embodiment of the present application, where the method includes:
step S510: and acquiring a video to be processed.
The video to be processed is a video stream including at least the hand of the user, and may be a video stream including only the upper body of the user or a video stream including the entire body of the user. The terminal equipment can acquire the video to be processed in various modes. In some embodiments, the video to be processed may be a video of the user, which is acquired by the terminal device in real time by using an image acquisition device such as a camera when the user interacts with the virtual smart customer service. Specifically, as a manner, when the application program corresponding to the virtual intelligent customer service is run in the system foreground of the terminal device, each hardware module of the terminal device may be called to collect the video of the user.
In other embodiments, the video to be processed may also be a recorded video, provided that the person in the recorded video is consistent with the current interactive object of the virtual intelligent customer service. As a mode, when the system foreground of the terminal device runs the application corresponding to the virtual intelligent customer service, the recorded video input by the user at the application interface corresponding to the virtual intelligent customer service can be acquired through the background of the application. The recorded video may be a video acquired from a third-party client program, or a recorded video downloaded from the internet or remotely. It can be understood that the source of the video to be processed is not limited, as long as the video to be processed includes the user who is currently interacting with the virtual intelligent customer service; the possible sources are not listed here one by one.
Step S520: and if the current mode of the terminal equipment is the non-sign language recognition mode, judging whether the video to be processed contains sign language information or not based on the first neural network model.
The terminal device comprises multiple modes. If the current mode of the terminal device is a non-sign language recognition mode (such as a voice recognition mode, an image recognition mode and the like), whether the video to be processed contains sign language information can be judged by recognizing the acquired video to be processed. Specifically, whether the video to be processed contains sign language information can be judged according to the first neural network model. The video to be processed can be decomposed into a plurality of images, and the first neural network model can be trained by taking training images as input and the sign language information corresponding to the training images as output. Therefore, the plurality of images decomposed from the video to be processed can be respectively input into the first neural network model, and whether sign language information is correspondingly output for each image is judged, so as to judge whether the video to be processed contains sign language information. The first neural network model may be an LSTM model.
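By way of a non-limiting illustration, the first neural network model described above could take per-frame features of the decomposed images and output whether the video contains sign language information; the sketch below uses PyTorch, and the feature dimension, layer sizes and decision logic are assumptions rather than values taken from this disclosure.

```python
import torch
import torch.nn as nn

class SignLanguagePresenceLSTM(nn.Module):
    """Illustrative first neural network model: per-frame features in,
    a sign-language / no-sign-language decision out."""

    def __init__(self, feature_dim: int = 128, hidden_dim: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, 2)  # 0: no sign language, 1: sign language

    def forward(self, frame_features: torch.Tensor) -> torch.Tensor:
        # frame_features: (batch, num_frames, feature_dim), one row per decomposed image.
        _, (h_n, _) = self.lstm(frame_features)
        return self.classifier(h_n[-1])

model = SignLanguagePresenceLSTM()
logits = model(torch.randn(1, 30, 128))  # 30 decomposed frames of dummy features
contains_sign_language = logits.argmax(dim=-1).item() == 1
print(contains_sign_language)
```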
As an embodiment, when the current mode of the terminal device is the non-sign language recognition mode and the video to be processed does not include the speech information, that is, when the video to be processed is understood to be silent, it may be determined whether the video to be processed includes sign language information based on the first neural network model.
As an embodiment, the video to be processed may include voice information. Before performing step S520, or while performing step S520, the voice information in the video to be processed may be recognized, and whether the recognized content corresponding to the voice information is meaningless content may be determined. The content can be determined to be meaningless by comparing it with content in a nonsense word stock. The voice information may also be detected by a noise detection tool, or whether the voice information is noise may be determined by detecting whether the volume of the voice information is less than a certain threshold; when the voice information is determined to be noise, the recognized content corresponding to the voice information may be determined to be meaningless content. Furthermore, whether a valid voice segment exists in the voice information can be detected through audio endpoint detection, and whether the voice segment is meaningless content can be judged.
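By way of a non-limiting illustration, the check for meaningless content could combine a volume threshold with a nonsense word stock as sketched below; the threshold value and the lexicon entries are assumptions.

```python
import numpy as np

def is_meaningless_speech(samples: np.ndarray, transcript: str,
                          volume_threshold: float = 0.01,
                          nonsense_lexicon=frozenset({"uh", "um", "hmm"})) -> bool:
    """Treat the audio as meaningless if it is quieter than a threshold (noise)
    or its recognized content only matches a nonsense word stock."""
    rms = float(np.sqrt(np.mean(np.square(samples)))) if samples.size else 0.0
    if rms < volume_threshold:  # too quiet: treat as noise
        return True
    words = transcript.lower().split()
    return not words or all(w in nonsense_lexicon for w in words)

print(is_meaningless_speech(np.zeros(16000), ""))  # one silent second of audio -> True
```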
As an implementation manner, if the collected voice information is meaningful content, the voice information may be recognized, and a voice interaction manner is adopted to interact with the user.
As an implementation manner, in order to avoid a situation in which certain motion information in the video to be processed is similar to sign language information and falsely triggers the sign language recognition mode, whether the video to be processed contains a plurality of items of sign language information, or a plurality of continuous items of sign language information within a period of time, may be detected, so as to determine more accurately whether the current user is a deaf-mute and thus whether to switch the current mode of the terminal device to the sign language recognition mode. Furthermore, the sign language information acquired within the period of time can be stored, and when the current mode of the terminal device is switched to the sign language recognition mode, the sign language information acquired within the period of time can be recognized.
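By way of a non-limiting illustration, a debouncing check requiring several consecutive detections before switching to the sign language recognition mode might look as follows; the required count is an assumption.

```python
def should_switch_to_sign_language_mode(per_frame_detections: list, required_consecutive: int = 5) -> bool:
    """Switch only after several consecutive frames containing sign language information,
    avoiding a false trigger by a single sign-language-like motion."""
    streak = 0
    for detected in per_frame_detections:
        streak = streak + 1 if detected else 0
        if streak >= required_consecutive:
            return True
    return False

print(should_switch_to_sign_language_mode([True, True, False, True, True, True, True, True]))
```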
Step S530: and when the video to be processed contains sign language information, switching the current mode of the terminal equipment into a sign language identification mode.
In the embodiment of the application, when the video to be processed contains sign language information, the current mode of the terminal device can be switched to a sign language identification mode.
In some embodiments, if the plurality of images decomposed from the video to be processed are respectively input into the first neural network model and sign language information is correspondingly output for each image, it can be determined that the video to be processed contains sign language information, and the current mode of the terminal device can then be switched to the sign language recognition mode. As an implementation manner, the video to be processed includes voice information, and after the voice information is recognized, when the recognized content corresponding to the voice information is meaningless content and the video to be processed contains sign language information, the current mode of the terminal device may be switched to the sign language recognition mode.
As an embodiment, when it is detected that the video to be processed contains a plurality of sign language information or a plurality of continuous sign language information within a period of time, it may be determined that the current user is a deaf-mute, and the current mode of the terminal device may be switched to the sign language identification mode.
Furthermore, after the current mode of the terminal device is switched to the sign language recognition mode, in order to avoid false triggering operation caused by collecting voice information, audio collecting devices such as a microphone can be turned off, and only image collecting devices such as a camera are turned on to collect sign language information of a user, so that the power consumption of the terminal device can be reduced.
In some embodiments, after step S530, steps S110 to S160 may be performed, or steps S210 to S290, or steps S310 to S380, or steps S410 to S470, which is not limited herein.
According to the interaction method provided by the embodiment, a video to be processed is obtained; if the current mode of the terminal device is a non-sign language recognition mode, whether the video to be processed contains sign language information is judged based on the first neural network model; and when the video to be processed contains sign language information, the current mode of the terminal device is switched to the sign language recognition mode. When it is determined, based on the first neural network model, that the video to be processed contains sign language information, the current mode of the terminal device is switched to the sign language recognition mode, so that the sign language recognition mode can be enabled by recognizing the video to be processed, the user does not need to switch to the sign language recognition mode manually, user operations are reduced, and the convenience of using the terminal device is improved.
Referring to fig. 8, fig. 8 is a schematic flowchart illustrating a further interaction method according to an embodiment of the present application, where the method includes:
step S610: and when the current mode of the terminal equipment is a sign language recognition mode, acquiring sign language information and a face image sequence in the video to be processed.
Step S620: and recognizing the sign language information to obtain text information, and performing emotion analysis on the face image sequence to obtain emotion characteristics.
For the detailed description of steps S610 to S620, please refer to steps S110 to S120, which are not described herein again.
Step S630: and determining semantic information of the video to be processed based on the text information and the emotional characteristics, and acquiring reply sign language information corresponding to the semantic information.
In the embodiment of the application, the semantic information of the video to be processed can be determined based on the text information and the emotional characteristics, and the reply sign language information corresponding to the semantic information is obtained.
In some embodiments, corresponding reply text information may be searched based on the text information and semantic information corresponding to the video to be processed, and then the reply text information may be input into a second neural network model, and reply sign language information corresponding to the reply text information may be obtained, wherein the second neural network model is obtained by training based on a machine learning algorithm with the sample reply text information as input and the reply sign language information corresponding to the sample reply text information as output.
In some embodiments, the reply text information may be reply text information corresponding to both the text information and the semantic information, found in a question-and-answer library based on the text information and the semantic information, where the question-and-answer library includes pre-stored text information and semantic information and pre-stored reply text information corresponding to them, and each pair of text information and semantic information corresponds one-to-one to the reply text information matched with it. For example, the pre-stored text information in the question-and-answer library may be a complete question such as "does your shop offer free shipping?", and the semantic information may be an inquiring tone, so that reply text information corresponding to both the text information and the semantic information may be acquired based on the text information and the semantic information.
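By way of a non-limiting illustration, the question-and-answer library lookup keyed on both the text information and the semantic information can be sketched as follows; the stored questions, tones and replies are hypothetical.

```python
# Hypothetical question-and-answer library: each (text information, semantic information) pair
# maps one-to-one to a matched reply text, as described above.
QUESTION_ANSWER_LIBRARY = {
    ("Does your shop offer free shipping?", "inquiring"): "Yes, this item ships for free.",
    ("When will my order ship?", "angry"): "We are sorry for the wait; your order ships today.",
}

def search_reply_text(text_information: str, semantic_information: str) -> str:
    return QUESTION_ANSWER_LIBRARY.get(
        (text_information, semantic_information),
        "Sorry, please describe your question in more detail.",
    )

print(search_reply_text("When will my order ship?", "angry"))
```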
In other embodiments, the reply text information may also be obtained based on a question-answer model, and specifically, the text information and the semantic information may be input into the question-answer model, and the reply text information corresponding to both the text information and the semantic information may be obtained through the question-answer model. The question-answer model may be obtained by training based on a large number of question-answer pairs, for example, a large number of question-answer videos obtained from communication records of a large number of human customer services may be used as training samples, text information and semantic information are used as inputs, reply text information corresponding to both the text information and the semantic information is used as an expected output, and the question-answer model is obtained by training based on a machine learning method, so that reply text information corresponding to both the text information and the semantic information is obtained through the question-answer model.
In other embodiments, the second neural network model may be obtained by training a neural network (which may specifically be an LSTM model) with training samples comprising a plurality of pieces of text information and the sign language information corresponding to the text information, for example a plurality of real-person sign language videos and the corresponding text information. It will be appreciated that the second neural network model is a model for converting the reply text information into corresponding reply sign language information. By inputting the previously acquired reply text information into the second neural network model, the reply sign language information corresponding to the reply text information can be output by the second neural network model.
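By way of a non-limiting illustration, the conversion of reply text information into reply sign language information is described above as a trained second neural network model; the sketch below substitutes a simple word-to-gloss lookup as a stand-in so the data flow is visible, and the gloss vocabulary is purely illustrative.

```python
# Stand-in for the second neural network model: a word-to-gloss lookup replaces the trained
# text-to-sign-language conversion for illustration only.
TEXT_TO_GLOSS = {"your": "YOU", "order": "ORDER", "ships": "SHIP", "today": "TODAY", "sorry": "SORRY"}

def reply_text_to_reply_sign_language(reply_text: str) -> list:
    words = reply_text.lower().strip(".!?").replace(",", "").split()
    return [TEXT_TO_GLOSS[w] for w in words if w in TEXT_TO_GLOSS]

print(reply_text_to_reply_sign_language("Sorry, your order ships today."))
```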
Step S640: and generating action parameters of the virtual intelligent customer service based on the reply sign language information.
Step S650: and driving the action of the virtual intelligent customer service based on the action parameters to generate a reply image sequence, wherein the reply image sequence is formed by a plurality of continuous action images generated by driving the virtual intelligent customer service.
Step S660: based on the reply image sequence, a reply video for the video to be processed is generated and output.
For the detailed description of steps S640 to S660, please refer to steps S140 to S160, which are not described herein again.
In the interaction method provided by the embodiment, when the current mode of the terminal device is a sign language recognition mode, sign language information and a face image sequence in a video to be processed are obtained, the sign language information is recognized to obtain text information, emotion analysis is performed on the face image sequence to obtain emotion characteristics, semantic information of the video to be processed is determined based on the text information and the emotion characteristics, reply sign language information corresponding to the semantic information is obtained, action parameters of the virtual intelligent customer service are generated based on the reply sign language information, actions of the virtual intelligent customer service are driven based on the action parameters to generate a reply image sequence, the reply image sequence is composed of multiple frames of continuous action images generated by driving the virtual intelligent customer service, and the reply video for the video to be processed is generated and output based on the reply image sequence. By recognizing the sign language information and the face image sequence and determining the semantics of the video to be processed according to the recognized text information and emotional characteristics, the intention of the user can be identified more accurately.
Referring to fig. 9, fig. 9 is a block diagram illustrating a structure of an interaction device 200 according to an embodiment of the present disclosure. As will be explained below with respect to the block diagram shown in fig. 9, the interaction device 200 includes: an information obtaining module 210, an information identifying module 220, an information determining module 230, a parameter generating module 240, a sequence generating module 250, and a video generating module 260, wherein:
the information obtaining module 210 is configured to obtain sign language information and a face image sequence in a video to be processed when a current mode of the terminal device is a sign language recognition mode.
And the information identification module 220 is configured to identify the sign language information to obtain text information, and perform emotion analysis on the face image sequence to obtain emotion characteristics.
Further, the facial image sequence includes a plurality of facial images, and the information recognition module 220 includes: the system comprises a key point extraction submodule, a vector obtaining submodule and a feature determination submodule, wherein:
and the key point extraction submodule is used for extracting the face key points corresponding to each face image in the face image sequence.
And the vector obtaining submodule is used for obtaining a feature vector corresponding to each face image based on each face image in the face image sequence and the face key point corresponding to each face image.
And the feature determination submodule is used for determining the emotion features corresponding to the feature vectors according to a preset mapping relation to obtain the emotion features corresponding to each face image in the face image sequence, wherein the preset mapping relation comprises the corresponding relation between a plurality of feature vectors and a plurality of emotion features.
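By way of a non-limiting illustration, the key point extraction, feature vector construction and preset mapping relation described for the information recognition module 220 can be sketched as follows; the chosen measurements, key point indices and mapping values are assumptions for illustration only.

```python
import numpy as np

def face_feature_vector(keypoints: np.ndarray) -> tuple:
    """Build a coarse feature vector from face key points (normalized mouth width
    and eyebrow span); the measurements and key point indices are illustrative."""
    mouth_width = np.linalg.norm(keypoints[0] - keypoints[1])
    brow_span = np.linalg.norm(keypoints[2] - keypoints[3])
    face_scale = np.linalg.norm(keypoints[4] - keypoints[5]) or 1.0
    return (round(float(mouth_width / face_scale), 1), round(float(brow_span / face_scale), 1))

# Preset mapping relation between feature vectors and emotion features (illustrative values).
PRESET_MAPPING = {(0.5, 0.3): "calm", (0.7, 0.2): "happy", (0.4, 0.4): "angry"}

def emotion_from_keypoints(keypoints: np.ndarray) -> str:
    return PRESET_MAPPING.get(face_feature_vector(keypoints), "neutral")

pts = np.array([[40, 80], [60, 80], [35, 30], [65, 30], [20, 50], [80, 50]], float)
print(emotion_from_keypoints(pts))
```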
And the information determining module 230 is configured to determine semantic information of the video to be processed based on the text information and the emotional features, and acquire reply sign language information corresponding to the semantic information.
Further, the information determining module 230 includes: a first semantic information obtaining submodule, a second semantic information obtaining submodule and a semantic information determining submodule, wherein:
and the first semantic information obtaining submodule is used for inputting the text information into the first machine learning model and obtaining semantic information corresponding to the text information.
And the second semantic information obtaining submodule is used for inputting the emotional characteristics into the second machine learning model to obtain semantic information corresponding to the emotional characteristics.
And the semantic information determining submodule is used for determining the semantic information of the video to be processed based on the semantic information corresponding to the text information and the semantic information corresponding to the emotion characteristics.
Further, the information determining module 230 further includes: the information search submodule and the sign language information acquisition submodule, wherein:
and the information searching submodule is used for searching corresponding reply text information based on the text information and the semantic information corresponding to the video to be processed.
And the sign language information obtaining submodule is used for inputting the reply text information into a second neural network model and obtaining the reply sign language information corresponding to the reply text information, wherein the second neural network model is obtained by taking the sample reply text information as input, taking the reply sign language information corresponding to the sample reply text information as output and training based on a machine learning algorithm.
And a parameter generating module 240, configured to generate an action parameter of the virtual smart customer service based on the reply sign language information.
And the sequence generating module 250 is used for driving the action of the virtual intelligent customer service based on the action parameter to generate a reply image sequence, wherein the reply image sequence is formed by a plurality of continuous action images generated by driving the virtual intelligent customer service.
And the video generating module 260 is used for generating and outputting a reply video aiming at the video to be processed based on the reply image sequence.
Further, the interaction apparatus 200 further includes: quantity acquisition module, speed calculation module and first characteristic acquisition module, wherein:
and the quantity acquisition module is used for acquiring the quantity of the sign language information in the video to be processed in a preset time period.
And the speed calculation module is used for calculating the change speed of the sign language information in the video to be processed based on the preset time period and the number.
And the first characteristic acquisition module is used for carrying out emotion analysis on the face image sequence and the change speed to acquire emotion characteristics.
Further, the interaction apparatus 200 further includes: semantic information determining module and second characteristic obtaining module, wherein:
and the semantic information determining module is used for acquiring sign language information adjacent to the sign language information and determining context semantic information based on the sign language information and the adjacent sign language information.
And the second characteristic acquisition module is used for carrying out emotion analysis on the face image and the context semantic information to acquire emotion characteristics.
Further, the interaction apparatus 200 further includes: a video acquisition module, an information judgment module and a mode switching module, wherein:
and the video acquisition module is used for acquiring the video to be processed.
And the information judgment module is used for judging whether the video to be processed contains sign language information or not based on the first neural network model if the current mode of the terminal equipment is the non-sign language recognition mode.
And the mode switching module is used for switching the current mode of the terminal equipment into a sign language identification mode when the video to be processed contains sign language information.
It can be clearly understood by those skilled in the art that the interaction device provided in the embodiment of the present application can implement each process in the foregoing method embodiments, and for convenience and simplicity of description, the specific working processes of the device and the module described above may refer to corresponding processes in the foregoing method embodiments, and are not described herein again.
In the embodiments provided in the present application, the coupling or direct coupling or communication connection between the modules shown or discussed may be indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.
In addition, each functional module in the embodiments of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode.
Referring to fig. 10, a block diagram of a terminal device 110 according to an embodiment of the present disclosure is shown. The terminal device 110 may be a terminal device capable of running an application, such as a smart phone, a tablet computer, an electronic book, or the like. Terminal device 110 in the present application may include one or more of the following components: a processor 111, a memory 112, and one or more applications, wherein the one or more applications may be stored in the memory 112 and configured to be executed by the one or more processors 111, the one or more programs configured to perform a method as described in the aforementioned method embodiments.
Processor 111 may include one or more processing cores. The processor 111 connects various parts within the overall terminal device 110 using various interfaces and lines, and performs various functions of the terminal device 110 and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 112 and calling data stored in the memory 112. Alternatively, the processor 111 may be implemented in hardware using at least one of Digital Signal Processing (DSP), field-programmable gate array (FPGA), and Programmable Logic Array (PLA). The processor 111 may integrate one or more of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, application programs and the like; the GPU is used for rendering and drawing display content; and the modem is used to handle wireless communications. It is understood that the modem may not be integrated into the processor 111, but may be implemented by a communication chip.
The memory 112 may include a Random Access Memory (RAM) or a read-only memory (ROM). The memory 112 may be used to store instructions, programs, code sets, or instruction sets. The memory 112 may include a stored program area and a stored data area, wherein the stored program area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playing function, an image playing function, etc.), instructions for implementing various method embodiments described below, and the like. The storage data area may also store data created by the terminal device 110 during use (e.g., phone book, audio-video data, chat log data), etc.
Referring to fig. 11, a block diagram of a computer-readable storage medium according to an embodiment of the present disclosure is shown. The computer-readable storage medium 300 has stored therein program code that can be called by a processor to execute the methods described in the above-described method embodiments.
The computer-readable storage medium 300 may be an electronic memory such as a flash memory, an electrically-erasable programmable read-only memory (EEPROM), an erasable programmable read-only memory (EPROM), a hard disk, or a ROM. Alternatively, the computer-readable storage medium 300 includes a non-volatile computer-readable medium (non-transitory computer-readable storage medium). The computer-readable storage medium 300 has storage space for program code 310 for performing any of the method steps described above. The program code can be read from or written to one or more computer program products. The program code 310 may be compressed, for example, in a suitable form.
To sum up, the interaction method, the interaction device, the terminal device and the storage medium provided by the embodiments of the present application include: when the current mode of the terminal equipment is a sign language recognition mode, acquiring sign language information and a face image sequence in a video to be processed; recognizing sign language information to obtain text information, and performing emotion analysis on a face image sequence to obtain emotion characteristics; determining semantic information of a video to be processed based on the text information and the emotional characteristics, and acquiring reply sign language information corresponding to the semantic information; generating action parameters of the virtual intelligent customer service based on the reply sign language information; driving the action of the virtual intelligent customer service based on the action parameters to generate a reply image sequence; based on the reply image sequence, a reply video for the video to be processed is generated and output. According to the method and the device, the sign language information and the face of the user are recognized, and the semantic information is determined according to the recognized text information and the emotion characteristics, so that the accuracy of user intention recognition is improved.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not necessarily depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (10)

1. An interaction method is applied to a terminal device, and comprises the following steps:
when the current mode of the terminal equipment is a sign language recognition mode, acquiring sign language information and a face image sequence in a video to be processed;
recognizing the sign language information to obtain text information, and performing emotion analysis on the face image sequence to obtain emotion characteristics;
determining semantic information of the video to be processed based on the text information and the emotional features, and acquiring reply sign language information corresponding to the semantic information;
generating action parameters of the virtual intelligent customer service based on the reply sign language information;
driving the action of the virtual intelligent customer service based on the action parameter to generate a reply image sequence, wherein the reply image sequence is composed of a plurality of frames of continuous action images generated by driving the virtual intelligent customer service;
and generating and outputting a reply video aiming at the video to be processed based on the reply image sequence.
2. The method according to claim 1, wherein the determining semantic information of the video to be processed based on the text information and the emotional features comprises:
inputting the text information into a first machine learning model to obtain semantic information corresponding to the text information;
inputting the emotional features into a second machine learning model to obtain semantic information corresponding to the emotional features;
and determining semantic information of the video to be processed based on the semantic information corresponding to the text information and the semantic information corresponding to the emotional features.
3. The method of claim 1, wherein after acquiring sign language information and a sequence of facial images in the video to be processed, the method further comprises:
acquiring the quantity of sign language information in a video to be processed in a preset time period;
calculating the change speed of sign language information in the video to be processed based on the preset time period and the number;
the emotion analyzing of the face image sequence to obtain emotion characteristics comprises the following steps:
and performing emotion analysis on the human face image sequence and the change speed to acquire the emotion characteristics.
4. The method of claim 1, wherein after acquiring sign language information and a sequence of facial images in the video to be processed, the method further comprises:
acquiring sign language information adjacent to the sign language information, and determining context semantic information based on the sign language information and the adjacent sign language information;
the emotion analyzing of the face image sequence to obtain emotion characteristics comprises the following steps:
and performing emotion analysis on the face image and the context semantic information to acquire the emotion characteristics.
5. The method according to any one of claims 1-4, wherein before the current mode of the terminal device is a sign language recognition mode and sign language information in the video to be processed is acquired, the method further comprises:
acquiring a video to be processed;
if the current mode of the terminal equipment is a non-sign language recognition mode, judging whether the video to be processed contains sign language information or not based on a first neural network model;
and when the video to be processed contains sign language information, switching the current mode of the terminal equipment into a sign language identification mode.
6. The method according to claim 1 or 2, wherein the face image sequence comprises a plurality of face images, and performing emotion analysis on the face image sequence to obtain emotion characteristics comprises:
extracting a face key point corresponding to each face image in the face image sequence;
obtaining a feature vector corresponding to each face image based on each face image in the face image sequence and a face key point corresponding to each face image;
and determining the emotion characteristics corresponding to the feature vectors according to a preset mapping relation to obtain the emotion characteristics corresponding to each face image in the face image sequence, wherein the preset mapping relation comprises the corresponding relation between a plurality of feature vectors and a plurality of emotion characteristics.
7. The method according to claim 1, wherein said obtaining reply sign language information corresponding to said semantic information comprises:
searching corresponding reply text information based on the text information and semantic information corresponding to the video to be processed;
and inputting the reply text information into a second neural network model to obtain reply sign language information corresponding to the reply text information, wherein the second neural network model is obtained by taking the sample reply text information as input and the reply sign language information corresponding to the sample reply text information as output and training based on a machine learning algorithm.
8. An interaction device, applied to a terminal device, the device comprising:
the information acquisition module is used for acquiring sign language information and a face image sequence in a video to be processed when the current mode of the terminal equipment is a sign language identification mode;
the information identification module is used for identifying the sign language information to obtain text information and carrying out emotion analysis on the face image sequence to obtain emotion characteristics;
the information determining module is used for determining semantic information of the video to be processed based on the text information and the emotional characteristics and acquiring reply sign language information corresponding to the semantic information;
the parameter generation module is used for generating action parameters of the virtual intelligent customer service based on the reply sign language information;
the sequence generation module is used for driving the action of the virtual intelligent customer service based on the action parameter to generate a reply image sequence, and the reply image sequence is formed by a plurality of continuous action images generated by driving the virtual intelligent customer service;
and the video generation module is used for generating and outputting a reply video aiming at the video to be processed based on the reply image sequence.
9. A terminal device comprising a memory and a processor, the memory being coupled to the processor and storing instructions which, when executed by the processor, cause the processor to perform the method of any one of claims 1-7.
10. A computer-readable storage medium, having stored thereon program code that can be invoked by a processor to perform the method according to any one of claims 1 to 7.
CN201911024921.0A 2019-10-25 2019-10-25 Interaction method, interaction device, terminal equipment and storage medium Active CN110807388B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911024921.0A CN110807388B (en) 2019-10-25 2019-10-25 Interaction method, interaction device, terminal equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911024921.0A CN110807388B (en) 2019-10-25 2019-10-25 Interaction method, interaction device, terminal equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110807388A true CN110807388A (en) 2020-02-18
CN110807388B CN110807388B (en) 2021-06-08

Family

ID=69489252

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911024921.0A Active CN110807388B (en) 2019-10-25 2019-10-25 Interaction method, interaction device, terminal equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110807388B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5734923A (en) * 1993-09-22 1998-03-31 Hitachi, Ltd. Apparatus for interactively editing and outputting sign language information using graphical user interface
CN110309254A (en) * 2018-03-01 2019-10-08 富泰华工业(深圳)有限公司 Intelligent robot and man-machine interaction method
CN109032356A (en) * 2018-07-27 2018-12-18 深圳绿米联创科技有限公司 Sign language control method, apparatus and system
CN109271018A (en) * 2018-08-21 2019-01-25 北京光年无限科技有限公司 Exchange method and system based on visual human's behavioral standard
CN109343695A (en) * 2018-08-21 2019-02-15 北京光年无限科技有限公司 Exchange method and system based on visual human's behavioral standard

Cited By (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113689879A (en) * 2020-05-18 2021-11-23 北京搜狗科技发展有限公司 Method, device, electronic equipment and medium for driving virtual human in real time
CN113689879B (en) * 2020-05-18 2024-05-14 北京搜狗科技发展有限公司 Method, device, electronic equipment and medium for driving virtual person in real time
CN111625098A (en) * 2020-06-01 2020-09-04 广州市大湾区虚拟现实研究院 Intelligent virtual avatar interaction method and device based on multi-channel information fusion
CN112183197A (en) * 2020-08-21 2021-01-05 深圳追一科技有限公司 Method and device for determining working state based on digital person and storage medium
CN112199522A (en) * 2020-08-27 2021-01-08 深圳一块互动网络技术有限公司 Interaction implementation method, terminal, server, computer equipment and storage medium
CN112199522B (en) * 2020-08-27 2023-07-25 深圳一块互动网络技术有限公司 Interactive implementation method, terminal, server, computer equipment and storage medium
CN112201252A (en) * 2020-10-10 2021-01-08 南京机电职业技术学院 Voice interaction learning and application system of express robot
CN112612878A (en) * 2020-12-17 2021-04-06 大唐融合通信股份有限公司 Customer service information providing method, electronic equipment and device
CN112612894A (en) * 2020-12-29 2021-04-06 平安科技(深圳)有限公司 Method and device for training intention recognition model, computer equipment and storage medium
CN112785667A (en) * 2021-01-25 2021-05-11 北京有竹居网络技术有限公司 Video generation method, device, medium and electronic equipment
CN113822967A (en) * 2021-02-09 2021-12-21 北京沃东天骏信息技术有限公司 Man-machine interaction method, device, system, electronic equipment and computer medium
CN112908315A (en) * 2021-03-10 2021-06-04 北京思图场景数据科技服务有限公司 Question-answer intention judgment method based on voice characteristics and voice recognition
CN113270087A (en) * 2021-05-26 2021-08-17 深圳传音控股股份有限公司 Processing method, mobile terminal and storage medium
CN113434647A (en) * 2021-06-18 2021-09-24 竹间智能科技(上海)有限公司 Man-machine interaction method, system and storage medium
CN113434647B (en) * 2021-06-18 2024-01-12 竹间智能科技(上海)有限公司 Man-machine interaction method, system and storage medium
CN113379879A (en) * 2021-06-24 2021-09-10 北京百度网讯科技有限公司 Interaction method, device, equipment, storage medium and computer program product
US11503361B1 (en) 2021-07-26 2022-11-15 Sony Group Corporation Using signing for input to search fields
CN113569031A (en) * 2021-07-30 2021-10-29 北京达佳互联信息技术有限公司 Information interaction method and device, electronic equipment and storage medium
CN113627301A (en) * 2021-08-02 2021-11-09 科大讯飞股份有限公司 Real-time video information extraction method, device and system
CN113627301B (en) * 2021-08-02 2023-10-31 科大讯飞股份有限公司 Real-time video information extraction method, device and system
CN113781876B (en) * 2021-08-05 2023-08-29 深兰科技(上海)有限公司 Method and device for converting text into sign language action video
CN113781876A (en) * 2021-08-05 2021-12-10 深兰科技(上海)有限公司 Method and device for converting text into sign language action video
CN113835522A (en) * 2021-09-10 2021-12-24 阿里巴巴达摩院(杭州)科技有限公司 Sign language video generation, translation and customer service method, device and readable medium
CN114170335B (en) * 2021-10-18 2022-10-04 深圳追一科技有限公司 Method and device for generating digital human video, computer equipment and storage medium
CN114170335A (en) * 2021-10-18 2022-03-11 深圳追一科技有限公司 Method and device for generating digital human video, computer equipment and storage medium
CN115086257A (en) * 2022-06-16 2022-09-20 平安银行股份有限公司 Human-computer customer service interaction method and device, terminal equipment and storage medium
CN116976821A (en) * 2023-08-03 2023-10-31 广东企企通科技有限公司 Enterprise problem feedback information processing method, device, equipment and medium
CN116976821B (en) * 2023-08-03 2024-02-13 广东企企通科技有限公司 Enterprise problem feedback information processing method, device, equipment and medium

Also Published As

Publication number Publication date
CN110807388B (en) 2021-06-08

Similar Documents

Publication Title
CN110807388B (en) Interaction method, interaction device, terminal equipment and storage medium
CN110647636B (en) Interaction method, interaction device, terminal equipment and storage medium
CN110688911B (en) Video processing method, device, system, terminal equipment and storage medium
CN110826441B (en) Interaction method, interaction device, terminal equipment and storage medium
CN110288077B (en) Method and related device for synthesizing speaking expression based on artificial intelligence
CN106653052B (en) Virtual human face animation generation method and device
CN112650831A (en) Virtual image generation method and device, storage medium and electronic equipment
CN112162628A (en) Multi-mode interaction method, device and system based on virtual role, storage medium and terminal
CN110599359B (en) Social contact method, device, system, terminal equipment and storage medium
CN110688008A (en) Virtual image interaction method and device
CN104598644B (en) Favorite label mining method and device
CN114401438B (en) Video generation method and device for virtual digital person, storage medium and terminal
CN111541908A (en) Interaction method, device, equipment and storage medium
CN110808034A (en) Voice conversion method, device, storage medium and electronic equipment
WO2022170848A1 (en) Human-computer interaction method, apparatus and system, electronic device and computer medium
CN111901627B (en) Video processing method and device, storage medium and electronic equipment
CN111538456A (en) Human-computer interaction method, device, terminal and storage medium based on virtual image
CN107393529A (en) Audio recognition method, device, terminal and computer-readable recording medium
US11455510B2 (en) Virtual-life-based human-machine interaction methods, apparatuses, and electronic devices
CN110955818A (en) Searching method, searching device, terminal equipment and storage medium
CN114490947A (en) Dialog service method, device, server and medium based on artificial intelligence
CN110674706B (en) Social contact method and device, electronic equipment and storage medium
CN116958342A (en) Method for generating actions of virtual image, method and device for constructing action library
CN110956958A (en) Searching method, searching device, terminal equipment and storage medium
CN111144125B (en) Text information processing method and device, terminal equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant