
Data processing method and device for data processing

Info

Publication number
CN110148406B
Authority
CN
China
Prior art keywords
target
sequence
image
mode
voice
Prior art date
Legal status
Active
Application number
CN201910295565.XA
Other languages
Chinese (zh)
Other versions
CN110148406A (en)
Inventor
樊博
孟凡博
刘恺
段文君
陈汉英
陈曦
陈伟
王砚峰
Current Assignee
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Sogou Hangzhou Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd, Sogou Hangzhou Intelligent Technology Co Ltd filed Critical Beijing Sogou Technology Development Co Ltd
Priority to CN201910295565.XA priority Critical patent/CN110148406B/en
Publication of CN110148406A publication Critical patent/CN110148406A/en
Application granted granted Critical
Publication of CN110148406B publication Critical patent/CN110148406B/en

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00 - Television systems
    • H04N7/14 - Systems for two-way working
    • H04N7/141 - Systems for two-way working between two video terminals, e.g. videophone
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/225 - Feedback of the input speech

Abstract

An embodiment of the invention provides a data processing method and apparatus, and a device for data processing. The method is used for processing question-answer interaction and specifically comprises the following steps: determining a target voice sequence and a target image sequence corresponding to a target entity image, where the mode corresponding to the target image sequence comprises a listening mode or an answer mode; during input of a question, the mode corresponding to the target image sequence is the listening mode, and after input of the question is completed, the mode corresponding to the target image sequence is the answer mode; and fusing the target voice sequence and the target image sequence to obtain a corresponding target video, so as to output the target video to a user. The embodiment of the invention can save labor cost, improve the working efficiency of related industries, and improve the intelligence of the target image sequence in a video interaction scene.

Description

Data processing method and device for data processing
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a data processing method and apparatus, and an apparatus for data processing.
Background
With the development of communication technology, video has become an important means for users to communicate over a network. At present, video customer service can provide remote face-to-face service, allowing smooth voice and video communication between customer service personnel and clients; video customer service can be applied in scenarios such as e-commerce websites, enterprise websites, remote education and training websites, video shopping guidance, and website monitoring.
In practical applications, however, video customer service consumes considerable labor from customer service personnel, which lowers the working efficiency of the customer service industry.
Disclosure of Invention
In view of the foregoing problems, embodiments of the present invention provide a data processing method, a data processing apparatus, and a device for data processing, which overcome the foregoing problems or at least partially solve the foregoing problems.
In order to solve the above problems, the present invention discloses a data processing method for processing question-answer interaction, wherein the method comprises:
determining a target voice sequence and a target image sequence corresponding to a target entity image, where the mode corresponding to the target image sequence comprises a listening mode or an answer mode; during input of a question, the mode corresponding to the target image sequence is the listening mode; or, after input of the question is completed, the mode corresponding to the target image sequence is the answer mode;
and fusing the target voice sequence and the target image sequence to obtain a corresponding target video so as to output the target video to a user.
In another aspect, the present invention discloses a data processing apparatus for processing question-answer interaction, the apparatus comprising:
the determining module is used for determining a target voice sequence and a target image sequence corresponding to the target entity image, where the mode corresponding to the target image sequence comprises a listening mode or an answer mode; during input of a question, the mode corresponding to the target image sequence is the listening mode; or, after input of the question is completed, the mode corresponding to the target image sequence is the answer mode; and
and the fusion module is used for fusing the target voice sequence and the target image sequence to obtain a corresponding target video so as to output the target video to a user.
In yet another aspect, the present invention discloses a device for data processing, used for processing question-answer interaction, the device comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, and the one or more programs include instructions for:
determining a target voice sequence and a target image sequence corresponding to a target entity image, where the mode corresponding to the target image sequence comprises a listening mode or an answer mode; during input of a question, the mode corresponding to the target image sequence is the listening mode; or, after input of the question is completed, the mode corresponding to the target image sequence is the answer mode;
and fusing the target voice sequence and the target image sequence to obtain a corresponding target video so as to output the target video to a user.
The embodiment of the invention has the following advantages:
the target voice sequence of the embodiment of the invention can be matched with the tone of the target sound-producing body, and the target image sequence can be obtained on the basis of the target entity image, so that the interaction of the target entity image according to the tone of the target sound-producing body can be realized through the obtained target video in the video interaction process; because the target video can be generated by a machine, compared with a manual video customer service, the method can save labor cost and improve the working efficiency of related industries.
In addition, in the embodiment of the present invention, in the input process of the problem, the mode corresponding to the target image sequence is an listening mode; or after the input of the question is completed, the mode corresponding to the target image sequence can be an answer mode; therefore, the intelligence of the target image sequence in the video interaction scene can be improved.
Drawings
FIG. 1 is a flow chart of steps of a first embodiment of a data processing method of the present invention;
FIG. 2 is a flow chart of the steps of a mode switching method of the present invention;
FIG. 3 is a flowchart illustrating steps of a second embodiment of a data processing method according to the present invention;
FIG. 4 is a block diagram of an embodiment of a data processing apparatus of the present invention;
FIG. 5 is a block diagram of an apparatus for data processing according to the present invention, embodied as a device; and
FIG. 6 is a block diagram of a server in some embodiments of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Aiming at the technical problem that video customer service consumes considerable labor of customer service personnel, the embodiment of the invention provides a scheme for generating a target video by a machine. The scheme is used for processing question-answer interaction and specifically includes: determining a target voice sequence and a target image sequence corresponding to a target entity image, where the mode corresponding to the target image sequence may specifically include a listening mode or an answer mode; during input of a question, the mode corresponding to the target image sequence is the listening mode; or, after input of the question is completed, the mode corresponding to the target image sequence is the answer mode; and fusing the target voice sequence and the target image sequence to obtain a corresponding target video, so as to output the target video to the user.
In the embodiment of the invention, the target voice sequence can match the timbre of the target sounding body, and the target image sequence can be obtained on the basis of the target entity image, so that during video interaction the obtained target video lets the target entity image interact in the timbre of the target sounding body; because the target video can be generated by a machine, compared with manual video customer service, this can save labor cost and improve the working efficiency of related industries.
The embodiment of the invention can be applied to video interaction scenes and is used for saving labor cost. The video interaction scene may include: video conference scenes, video customer service scenes, and the like. The video customer service can be applied to application scenes such as electronic commerce websites, enterprise websites, remote education and training websites, video shopping guide, website monitoring and the like.
In other words, the embodiment of the present invention may assign an image feature (entity state feature) corresponding to a mode to the target entity image to obtain the target image sequence.
In this embodiment of the present invention, the mode corresponding to the target image sequence may include an answer mode or a listening mode, which can improve the intelligence of the target image sequence in a video interaction scene.
The answer mode may refer to a mode of answering a question through a target video, which may correspond to a first entity state. In the answer mode, the target entity image corresponding to the target video can read the answer text corresponding to the question through the target voice sequence, and the emotion in the process of reading the answer text is expressed through the first entity state corresponding to the target image sequence.
The listening mode may refer to a mode of listening to a question being input by the user, and it may correspond to a second entity state. In the listening mode, the target entity image corresponding to the target video can express the emotion of the listening process through the second entity state corresponding to the target image sequence. The second entity state may include nodding features and the like. Alternatively, in the listening mode, listening-state text such as "mm-hmm" or "please continue" may also be expressed through the target voice sequence.
In the embodiment of the invention, during input of a question, the mode corresponding to the target image sequence is the listening mode; or, after input of the question is completed, the mode corresponding to the target image sequence may be the answer mode.
The embodiment of the invention can switch the mode corresponding to the target image sequence according to whether input of the question has been completed. Optionally, if no user input is received within a preset time period, input of the question may be considered complete.
In practical applications, TTS (Text To Speech) technology may be used to convert text into the target speech corresponding to a target voice sequence, where the target voice sequence may be represented as a waveform. It can be understood that a target voice sequence meeting the requirements can be obtained according to the speech synthesis parameters.
Alternatively, the speech synthesis parameters may include: at least one of a timbre parameter, a pitch parameter and a loudness parameter.
The timbre parameter refers to the distinctive character that sounds of different frequencies show in their waveforms; different sounding bodies generally correspond to different timbres, so a target voice sequence matching the timbre of the target sounding body can be obtained according to the timbre parameter. The target sounding body may be specified by the user; for example, the target sounding body may be a specified media worker. In practical applications, the timbre parameter of the target sounding body can be obtained from audio of a preset length recorded from the target sounding body.
The pitch parameter characterizes how high or low the sound is and is measured in frequency. The loudness parameter, also known as sound intensity or volume, refers to the strength of the sound and is measured in decibels (dB).
The embodiment of the invention can determine the target voice sequence corresponding to the target language feature in the following determination modes, where the target language feature corresponds to the question-related text:
Determination mode 1: search a first voice library for first voice units matching the target language feature, and splice the first voice units to obtain the target voice sequence.
Determination mode 2: determine a target acoustic feature corresponding to the target language feature, search a second voice library for second voice units matching the target acoustic feature, and splice the second voice units to obtain the target voice sequence.
The acoustic features may characterize speech from the acoustic (pronunciation) perspective.
The acoustic features may include, but are not limited to, the following:
prosodic features (suprasegmental features / supralinguistic features), specifically including duration-related features, fundamental-frequency-related features, energy-related features, and the like;
voice quality features;
spectrum-based correlation features, which manifest the correlation between changes in vocal tract shape and articulatory movements, mainly including: Linear Predictive Cepstral Coefficients (LPCC), Mel-Frequency Cepstral Coefficients (MFCC), and the like.
Determination mode 3: adopt an end-to-end speech synthesis method, where the source end of the end-to-end speech synthesis method may include the text, or the target language feature corresponding to the text, and the target end may be a target voice sequence in waveform form.
In an alternative embodiment of the present invention, the end-to-end speech synthesis method may employ a neural network, which may include a single-layer RNN (Recurrent Neural Network) and a dual activation layer for predicting a 16-bit speech output. The state of the RNN is divided into two parts: a first (upper 8-bit) state and a second (lower 8-bit) state. The first state and the second state are fed into their respective activation layers; the second state is obtained based on the first state, and the first state is obtained based on the 16 bits of the previous moment. By building the first state and the second state into the network structure, the neural network can speed up training, simplify the training process, and reduce the amount of computation, making the end-to-end speech synthesis method suitable for mobile terminals with limited computing resources, such as mobile phones (a minimal code sketch of this coarse/fine split is given after this list).
It can be understood that, according to practical application requirements, a person skilled in the art may adopt any one or a combination of the above determination modes 1 to 3; the embodiment of the present invention does not limit the specific process for determining the target voice sequence corresponding to the target language feature.
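For illustration only, the following is a minimal sketch of the coarse/fine split described for determination mode 3, written in Python with PyTorch (the framework, the GRU cell, the hidden size, and the sampling loop are all assumptions; the patent does not specify a concrete implementation). A single-layer RNN drives two output layers, one for the upper 8 bits and one for the lower 8 bits of each 16-bit sample, with the lower-8-bit prediction conditioned on the upper 8 bits just sampled.

```python
import torch
import torch.nn as nn

class CoarseFineRNN(nn.Module):
    """Single-layer RNN with two output heads: upper 8 bits (coarse) and lower 8 bits (fine)."""
    def __init__(self, hidden_size=256):
        super().__init__()
        self.rnn = nn.GRUCell(input_size=2, hidden_size=hidden_size)
        self.coarse_head = nn.Linear(hidden_size, 256)      # distribution over the upper 8 bits
        self.fine_head = nn.Linear(hidden_size + 1, 256)    # lower 8 bits, conditioned on coarse

    def step(self, prev_coarse, prev_fine, h):
        # prev_coarse / prev_fine: (batch, 1), previous sample halves scaled to [0, 1]
        h = self.rnn(torch.cat([prev_coarse, prev_fine], dim=-1), h)
        coarse = torch.distributions.Categorical(logits=self.coarse_head(h)).sample()
        # The "second" (lower 8-bit) prediction depends on the "first" (upper 8-bit) value.
        fine_in = torch.cat([h, coarse.unsqueeze(-1).float() / 255.0], dim=-1)
        fine = torch.distributions.Categorical(logits=self.fine_head(fine_in)).sample()
        return coarse, fine, h

model = CoarseFineRNN()
h = torch.zeros(1, 256)
coarse = torch.zeros(1, 1)
fine = torch.zeros(1, 1)
waveform = []
for _ in range(100):                                  # a few illustrative sampling steps
    c, f, h = model.step(coarse, fine, h)
    waveform.append(int(c) * 256 + int(f))            # recombine into one 16-bit sample
    coarse = c.unsqueeze(-1).float() / 255.0
    fine = f.unsqueeze(-1).float() / 255.0
```

With an untrained model this loop only produces noise; the point is to show how the two 8-bit halves are predicted separately and recombined into a 16-bit speech sample.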
The target image sequence may be used to characterize an entity image. Entities are distinguishable, independent things, and entities may include humans, robots, animals, plants, and the like. The embodiment of the invention mainly takes a person as an example to explain the target image sequence; target image sequences corresponding to other entities can be handled by analogy. The entity image corresponding to a person may be referred to as a portrait.
From an entity state perspective, the image features may include entity state features, which may reflect features of the image sequence in terms of the entity state.
Optionally, the entity status feature may include at least one of the following features:
an expression feature;
a lip feature; and
a limb feature.
Expression, that is, the expression of feeling and emotion, may refer to the thoughts and emotions conveyed on the face.
Expression features are typically directed at the entire face. Lip features are specific to the lips and are related to the text content, the voice, the pronunciation manner, and the like, so they can improve the naturalness of the expression conveyed by the image sequence.
Limb features convey a person's thoughts through the coordinated movement of body parts such as the head, eyes, neck, hands, elbows, arms, torso, hips, and feet, so as to express intent vividly. The limb features may include turning the head, shrugging the shoulders, gestures, and the like, and can improve the richness of the expression conveyed by the image sequence. For example, at least one arm hangs naturally when speaking, and at least one arm rests naturally on the abdomen when not speaking.
The data processing method provided by the embodiment of the invention can be applied to application environments corresponding to the client and the server, wherein the client and the server are positioned in a wired or wireless network, and the client and the server perform data interaction through the wired or wireless network.
Optionally, the client may run on a terminal, and the terminal specifically includes but is not limited to: smart phones, tablet computers, electronic book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop portable computers, car-mounted computers, desktop computers, set-top boxes, smart televisions, wearable devices, and the like.
The client is a program corresponding to the server and providing local service for the user. The client in the embodiment of the present invention may provide a target video, and the target video may be generated by the client or the server.
In one embodiment of the invention, the client can determine, through human-computer interaction, the target sounding body information and the target entity image information selected by the user, and upload them to the server so that the server generates a target video corresponding to the target sounding body and the target entity image; and the client may output the target video to the user.
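As an illustration of this client/server split, the sketch below sends the user's selections to a server and receives the generated result. The endpoint, field names, and the use of HTTP with JSON are assumptions made for this example; the patent does not prescribe a transport or payload format.

```python
import json
import urllib.request

def request_target_video(server_url, sounding_body_id, entity_image_id, question_text):
    # Hypothetical payload: the client forwards the user's choices to the server.
    payload = json.dumps({
        "target_sounding_body": sounding_body_id,   # whose timbre the target voice should match
        "target_entity_image": entity_image_id,     # whose image drives the target image sequence
        "question": question_text,
    }).encode("utf-8")
    req = urllib.request.Request(server_url, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())              # e.g. a link to the generated target video

# Example call against an assumed endpoint:
# result = request_target_video("https://example.com/api/target-video",
#                               "voice_001", "figure_007", "Why is my screen blue?")
```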
Method embodiment one
Referring to fig. 1, a flowchart of a first embodiment of a data processing method according to the present invention is shown, and is used for processing question-answer interaction, where the method specifically includes the following steps:
step 101, determining a target voice sequence and a target image sequence corresponding to a target entity image; the mode corresponding to the target image sequence may include a listening mode or an answer mode; during input of a question, the mode corresponding to the target image sequence may be the listening mode; or, after input of the question is completed, the mode corresponding to the target image sequence may be the answer mode;
and 102, fusing the target voice sequence and the target image sequence to obtain a corresponding target video so as to output the target video to a user.
In this embodiment of the present invention, the target entity image may be specified by a user, for example, the target entity image may be an image of a target entity, and the target entity may include: a known person (e.g., a host), although the target entity may be any entity, such as a robot, or a general person, etc.
The target sounding body and the target entity in the embodiment of the present invention may be the same; for example, the user uploads a first video, and the first video may include the voice of the target sounding body and the target entity image. Alternatively, the target sounding body and the target entity of the embodiment of the present invention may be different; for example, the user uploads a second video and a first audio, where the second video may include the target entity image and the first audio may include the voice of the target sounding body.
The embodiment of the invention can switch the mode corresponding to the target image sequence according to whether input of the question has been completed. Optionally, if no user input is received within a preset time period, input of the question may be considered complete.
In an optional embodiment of the present invention, the mode corresponding to the target image sequence may be switched according to transition image samples, so as to improve the smoothness of switching.
The transition image samples may include first transition image samples. A first transition image sample may include an image sample corresponding to the listening mode followed by an image sample corresponding to the answer mode; by learning the first transition image samples, the rule for switching from the listening mode to the answer mode can be obtained, so the fluency of switching from the listening mode to the answer mode can be improved.
The transition image samples may include second transition image samples. A second transition image sample may include an image sample corresponding to the answer mode followed by an image sample corresponding to the listening mode; by learning the second transition image samples, the rule for switching from the answer mode to the listening mode can be obtained, so the fluency of switching from the answer mode to the listening mode can be improved.
Referring to fig. 2, a flowchart illustrating steps of a mode switching method according to the present invention is shown, and is used for processing question-answer interaction, where the method specifically includes the following steps:
step 201, in a listening mode, playing a first target video and receiving a question input by a user;
the first target video may correspond to a listening mode, which may be derived from a first target speech sequence and a first target image sequence, which may correspond to a listening mode.
Step 202, judging whether the input of the problem is finished, if so, executing step 203, otherwise, returning to step 201;
step 203, setting a mode corresponding to the target image sequence as an answer mode, and playing a second target video;
and step 204, after the second target video is played, setting the mode corresponding to the target image sequence as a listening mode.
The second target video may correspond to the answer mode; it may be obtained from a second target voice sequence and a second target image sequence that correspond to the answer mode.
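The following is a minimal sketch of the switching flow of steps 201 to 204, written as a small state controller; the timeout value, class name, and method names are assumptions for illustration, and generating and playing the first and second target videos is left to the caller.

```python
import time

IDLE_TIMEOUT_S = 3.0  # assumed "preset time period" after which input counts as complete

class QAModeController:
    """Tracks whether the target image sequence should be in listening or answer mode."""
    def __init__(self):
        self.mode = "listening"
        self.question_parts = []
        self.last_input_time = None

    def on_user_input(self, text_chunk):
        # Step 201: keep listening while the question is still being entered.
        self.question_parts.append(text_chunk)
        self.last_input_time = time.monotonic()
        self.mode = "listening"

    def poll(self):
        # Steps 202/203: once no input arrives for IDLE_TIMEOUT_S, switch to answer mode.
        if (self.mode == "listening" and self.question_parts
                and self.last_input_time is not None
                and time.monotonic() - self.last_input_time > IDLE_TIMEOUT_S):
            self.mode = "answer"
            question, self.question_parts = " ".join(self.question_parts), []
            return question  # caller builds and plays the second target video for this question
        return None

    def on_answer_video_finished(self):
        # Step 204: return to listening mode after the answer video has been played.
        self.mode = "listening"
```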
It is understood that the above-mentioned output of the target video is only an alternative embodiment, and actually, the embodiment of the present invention may output a link of the target video to the user so that the user can determine whether to play the above-mentioned target video.
Optionally, the embodiment of the present invention may further output the target speech sequence or the link of the target speech sequence to the user.
Optionally, the embodiment of the present invention may further output a text related to the question to the user. The question-related text may include: answer text, or listening status text. The answer text may correspond to an answer mode, and the listening status text may correspond to a listening mode.
In an optional embodiment of the present invention, the question-answer interaction may correspond to a communication window, and at least one of the following items may be displayed in the communication window: a link to the target voice sequence, the answer text of the question, and a link to the target video. The link to the target video may be displayed in the identification area of the communication terminal, and the identification area may be used to display information such as the nickname, ID (Identity), and avatar of the communication terminal.
In an optional embodiment of the present invention, the determining, in step 101, a target voice sequence and a target image sequence corresponding to a target entity image may specifically include: and determining a target voice sequence and a target image sequence corresponding to the target entity image according to the question related text.
In practical applications, the question input by the user can be in voice form, text form, or picture form. Speech recognition techniques may be employed to convert a question in voice form into a question in text form. Alternatively, optical character recognition techniques may be employed to convert a question in picture form into a question in text form.
Optionally, the answer text determination process may include: determining a first representation vector corresponding to the question; determining a target preset question corresponding to the question according to the matching degree between the first representation vector and second representation vectors corresponding to preset questions; and determining the answer corresponding to the question according to the answer corresponding to the target preset question.
According to the embodiment of the invention, the target preset question can be determined according to the matching degree between the first representation vector corresponding to the question and the second representation vectors corresponding to the preset questions, and the answer text corresponding to the question is then determined according to the answer corresponding to the target preset question.
Because the target preset question is an existing question, its answer is usually reasonable and effective, and the target preset question matches the question; therefore the answer corresponding to the target preset question can be used as the basis for determining the answer corresponding to the question, which can improve the accuracy of the answer.
In the embodiment of the present invention, optionally, the preset questions and their corresponding answers may be stored in a knowledge base. The first representation vector may be matched with the second representation vectors corresponding to the preset questions in the knowledge base to obtain the corresponding matching degrees.
Optionally, embodiments of the present invention may convert text into a fixed-length vector representation to facilitate processing. The first representation vector may be used to represent the question, and the second representation vector may be used to represent a preset question. The first or second representation vector may be one-, two-, or three-dimensional.
The type of the first or second representation vector may include: one-hot vectors, word embedding vectors (Word Embedding), or high-level representation vectors. Word embedding finds a mapping or function that generates an expression of a word in a new space, and that expression is the word representation.
In an alternative embodiment of the present invention, the determining of the first representation vector may include: determining a word embedding vector corresponding to the question, and processing the word embedding vector with a neural network to obtain a high-level representation vector corresponding to the question. Processing the word embedding vector with the neural network can extract its deep-level features and thus improve the richness of the first representation vector. Optionally, the word embedding vector may be processed by a neural network such as a CNN (Convolutional Neural Network) or an LSTM (Long Short-Term Memory) network. The determination process of the second representation vector is similar to that of the first representation vector, so it is not repeated here; the two may refer to each other.
A similarity measure between the vectors can be used to determine the matching degree between the first representation vector and the second representation vector corresponding to a preset question. The similarity measure may include the cosine of the included angle, the Euclidean distance, and the like.
The embodiment of the invention can take one or more preset questions with the highest matching degree as the target preset questions.
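As a concrete illustration of this matching step, the sketch below encodes the question and the preset questions as vectors and ranks the preset questions by cosine similarity. The averaged-word-vector encoder is an assumption standing in for the word-embedding plus CNN/LSTM representation described above, and word_vectors is a hypothetical token-to-vector dictionary.

```python
import numpy as np

def encode(text, word_vectors, dim=100):
    # Average the word embedding vectors of the tokens (placeholder for a learned encoder).
    vecs = [word_vectors[w] for w in text.split() if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def best_preset_questions(question, knowledge_base, word_vectors, top_k=1):
    # knowledge_base: list of (preset_question, answer) pairs; in practice the second
    # representation vectors would be precomputed and stored with the knowledge base.
    q_vec = encode(question, word_vectors)
    scored = [(cosine(q_vec, encode(pq, word_vectors)), pq, ans)
              for pq, ans in knowledge_base]
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[:top_k]  # preset question(s) with the highest matching degree
```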
In an optional embodiment of the present invention, the first keyword corresponding to the question is matched with the second keyword corresponding to a preset question. Because the number of preset questions in the knowledge base is usually large, the embodiment of the invention can first screen the preset questions in the knowledge base based on the matching between the first keyword and the second keyword, and then determine the target preset question from the preset questions that pass the screening. The screening can reduce the amount of computation and thus increase the computation speed.
In one embodiment, the preset questions in the knowledge base can be screened based on matching between the first keyword and the second keyword; suppose the preset questions that pass the screening are first preset questions. The target preset question corresponding to the question can then be determined according to the matching degree between the first representation vector and the second representation vectors corresponding to the first preset questions. The matching condition between a first preset question and the question may include: the domain keywords match, and/or the intent keywords match, and/or the slot keywords match.
In this embodiment of the present invention, optionally, the first keyword corresponding to the question may specifically include:
a domain keyword; and/or
an intention keyword; and/or
a slot keyword.
In embodiments of the present invention, a domain may refer to a range of data. Alternatively, a domain may refer to an application scenario or category of data. Domains may include, but are not limited to: printers, computers, encyclopedias, news, music, video, movies, games, sports, e-commerce, educational learning, FM (Frequency Modulation), SMS (Short Messaging Service), controls, travel, books, weather, galleries, and the like. It can be appreciated that a domain can be subdivided into finer domains; for example, subdivisions of the encyclopedia domain may include the sense items corresponding to individual encyclopedia entries, and the like. Optionally, a domain may be related to a corresponding APP or service; the embodiment of the present invention does not limit the specific domain.
The embodiment of the invention can identify the domain keyword from the text corresponding to the question. Optionally, the text corresponding to the question may be segmented into words, and the segmentation result matched against domain keywords. Alternatively, a classification model may be used to determine the domain to which the question belongs.
The classification model may be a machine learning model. Broadly speaking, machine learning is a way of giving a machine the ability to learn, so that it can perform functions that direct programming cannot. In a practical sense, machine learning is a method of training a model with data and then using the model to make predictions. Machine learning methods may include: decision trees, linear regression, logistic regression, neural networks, k-nearest neighbors, and the like; it is to be understood that the embodiments of the present invention do not limit the specific machine learning method. The classification model described above may have domain classification capability.
Intent (Intent) is a determination of a sentence expressed by the user to determine what task the user wishes to accomplish. Optionally, the intent keywords corresponding to the question may be determined using a classification model.
Slot (Slot) is a definition for key information in a user expression. In the expression of an air ticket booking, for example, the slot positions may include: "departure time", "origin", "destination", etc. As another example, in the expression of a computer failure, a slot may include: a "blue screen", etc.
In the embodiment of the present invention, optionally, an intention keyword corresponding to the problem may be determined by using an intention extraction technology. Optionally, a slot filling technique may be used to determine a slot keyword corresponding to the problem. And will not be described in detail herein.
Any one of the domain keyword, the intention keyword, and the slot keyword can reflect information about the question, so any one or a combination of the domain keyword, the intention keyword, and the slot keyword can be used as the first keyword corresponding to the question.
Similarly, the second keyword may specifically include:
a domain keyword; and/or
an intention keyword; and/or
a slot keyword.
The embodiment of the invention can store the corresponding second keywords for the preset questions in the knowledge base.
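A minimal sketch of this two-stage lookup follows: preset questions are first screened by overlap of domain, intent, and slot keywords, and only the screened first preset questions go through the vector matching step. The keyword sets and the set-intersection test are simplifying assumptions; in the patent the keywords come from classification, intent extraction, and slot filling techniques.

```python
def keyword_match(first_keywords, second_keywords):
    # A preset question passes screening if domain, intent, or slot keywords overlap.
    return any(first_keywords.get(k) and first_keywords[k] & second_keywords.get(k, set())
               for k in ("domain", "intent", "slot"))

def screen_then_rank(question_keywords, question_vec, knowledge_base, similarity):
    # knowledge_base entries: (preset_question, preset_keywords, preset_vec, answer).
    candidates = [entry for entry in knowledge_base
                  if keyword_match(question_keywords, entry[1])]
    # Only the screened "first preset questions" go through the costlier vector match.
    return max(candidates, key=lambda e: similarity(question_vec, e[2]), default=None)

# Example with hypothetical data:
# q_kw = {"domain": {"computer"}, "intent": {"troubleshoot"}, "slot": {"blue screen"}}
# best = screen_then_rank(q_kw, q_vec, knowledge_base, cosine)
```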
In the embodiment of the present invention, the text may relate to at least two languages, such as at least two of Chinese, Japanese, Korean, English, French, German, Arabic, and the like. The target voice sequence and the target image sequence may also relate to at least two languages, so embodiments of the present invention can be applied to multilingual video interaction scenarios.
For example, in a video customer service scenario, the text may be a question text input by the user, and the question text may include a first language that is the user's native language and a second language that is not. For example, the question text relates to a computer failure and may include English text corresponding to the computer failure together with Chinese text summarized by the user.
As another example, in a video conference scenario, the text may be a conference presentation, and the conference presentation may include: multiple languages corresponding to the multilingual user.
It is understood that the text relating to at least two languages may be applied to any video interaction scenario, and the embodiment of the present invention is not limited to a specific video interaction scenario.
According to an embodiment, the determining a target voice sequence and a target image sequence corresponding to the target entity image may specifically include: determining a target voice sequence corresponding to the question-related text; and determining a target image sequence corresponding to the target voice sequence according to a mapping relation between a voice feature sequence and an image feature sequence, where the voice feature sequence and the image feature sequence in the mapping relation are aligned on the time axis, and the mapping relation is obtained from voice samples and image samples aligned on the time axis.
The speech feature sequence may include: language features and/or acoustic features.
The language features may include phoneme features. Phonemes are the smallest units of speech, divided according to the natural properties of speech; they are analyzed according to the articulatory actions within a syllable, with one action constituting one phoneme. Phonemes include vowels and consonants.
The acoustic features may characterize speech from the acoustic (pronunciation) perspective.
The acoustic features may include, but are not limited to, the following:
prosodic features (suprasegmental features / supralinguistic features), specifically including duration-related features, fundamental-frequency-related features, energy-related features, and the like;
voice quality features;
spectrum-based correlation features, which manifest the correlation between changes in vocal tract shape and articulatory movements, mainly including: Linear Predictive Cepstral Coefficients (LPCC), Mel-Frequency Cepstral Coefficients (MFCC), and the like.
The embodiment of the invention can obtain the mapping relation between the voice feature sequence and the image feature sequence according to voice samples and image samples aligned on the time axis.
There are regular correspondences between the voice feature sequence and the image feature sequence. For example, a particular phoneme feature corresponds to a particular lip feature; as another example, a particular prosodic feature corresponds to a particular expression feature; alternatively, a particular phoneme feature may correspond to particular limb features, and so on.
Therefore, the embodiment of the invention can obtain the mapping relation according to the voice sample and the image sample aligned by the time axis, so as to reflect the rule between the voice characteristic sequence and the image characteristic sequence through the mapping relation.
The rule between the voice feature sequence and the image feature sequence reflected by the mapping relation can be suitable for any language, and therefore the method can be suitable for texts corresponding to at least two languages.
The embodiment of the invention can utilize an end-to-end machine learning method to learn the voice sample and the image sample which are aligned with the time axis so as to obtain the mapping relation. The input of the end-to-end machine learning method can be a voice sequence, the output can be an image sequence, and the method can obtain the rule between the input characteristic and the output characteristic through the learning of training data.
Broadly speaking, machine learning is a way of giving a machine the ability to learn, so that it can perform functions that direct programming cannot. In a practical sense, machine learning is a method of training a model with data and then using the model to make predictions. Machine learning methods may include: decision trees, linear regression, logistic regression, neural networks, and the like; it is to be understood that the embodiments of the present invention do not limit the specific machine learning method.
The alignment of the voice sample and the image sample on the time axis can improve the synchronism between the voice feature and the image feature.
In one embodiment of the invention, the voice sample and the image sample may originate from the same video file, whereby an alignment of the voice sample and the image sample on the time axis may be achieved. For example, a recorded video file may be collected, which may include: the voice of the sounding body and the video picture of the sounding body.
In another embodiment of the invention, the voice sample and the image sample may originate from different files, in particular, the voice sample may originate from an audio file, the image sample may originate from a video file or an image file, and the image file may include: a plurality of frames of images. In this case, the voice samples and the image samples may be time-axis aligned to obtain time-axis aligned voice samples and image samples.
It should be understood that the end-to-end machine learning method is only an optional embodiment of the determination method of the mapping relationship, and actually, a person skilled in the art may determine the mapping relationship by using other methods according to the actual application requirements, for example, the other methods may be statistical methods, and the embodiment of the present invention does not limit the specific determination method of the mapping relationship.
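For illustration, the sketch below shows one end-to-end way to learn such a mapping from time-axis-aligned samples, using Python with PyTorch; the bidirectional LSTM, the feature dimensions (80-dimensional acoustic frames, 64-dimensional entity state features), and the MSE loss are all assumptions, not the patent's concrete model.

```python
import torch
import torch.nn as nn

class SpeechToImageFeatures(nn.Module):
    def __init__(self, speech_dim=80, image_feat_dim=64, hidden=256):
        super().__init__()
        self.encoder = nn.LSTM(speech_dim, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, image_feat_dim)  # per-frame entity state features

    def forward(self, speech_feats):
        # speech_feats: (batch, frames, speech_dim); the frames are already aligned with
        # the image frames on the time axis, so the output has the same number of frames.
        out, _ = self.encoder(speech_feats)
        return self.proj(out)

model = SpeechToImageFeatures()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# One training step on a time-axis-aligned (voice sample, image sample) pair.
speech_batch = torch.randn(8, 200, 80)   # e.g. 200 aligned frames of acoustic features
image_batch = torch.randn(8, 200, 64)    # 200 frames of entity state features (lip, expression...)
optimizer.zero_grad()
loss = nn.functional.mse_loss(model(speech_batch), image_batch)
loss.backward()
optimizer.step()
```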
In other words, the embodiment of the present invention may assign an image feature (entity state feature) corresponding to the target voice sequence to the target entity image to obtain the target image sequence. The target entity image may be specified by a user, for example, the target entity image may be an image of a known person (e.g., a host).
In summary, the target image sequence corresponding to the target speech sequence in the embodiment of the present invention is obtained according to a mapping relationship, and the rule between the speech feature sequence and the image feature sequence reflected by the mapping relationship can be applied to any language, so that the method and the device can be applied to texts corresponding to at least two languages.
According to another embodiment, the determining a target voice sequence and a target image sequence corresponding to the target entity image may specifically include: determining a duration feature corresponding to the question-related text; determining a target voice sequence corresponding to the question-related text according to the duration feature; and determining a target image sequence corresponding to the question-related text according to the duration feature, where the target image sequence is obtained from text samples and the image samples corresponding to the text samples.
There are regular correspondences between the text feature sequence and the image feature sequence. The text features may include phoneme features, and/or semantic features, and the like.
Phonemes are the smallest units of speech that are divided according to the natural properties of the speech, and are analyzed according to the pronunciation actions in the syllables, with one action constituting a phoneme. The phonemes may include: vowels and consonants. Optionally, the specific phoneme feature corresponds to a specific lip feature, an expressive feature or a limb feature, etc.
Semantics refers to the meanings of the concepts represented by the real-world objects that the text to be processed corresponds to, and the relationships among these meanings; it is the interpretation and logical representation of the text to be processed in a certain field. Optionally, particular semantic features correspond to particular limb features, and so on.
Therefore, the embodiment of the invention can obtain the mapping relation between the text feature sequence and the image feature sequence according to the text sample and the image sample corresponding to the text sample, so as to reflect the rule between the text feature sequence and the image feature sequence through the mapping relation.
The image samples corresponding to a text sample may include multiple frames of images captured while the text sample is being expressed (for example, spoken). The image samples corresponding to the text sample may be carried in a video sample, or may be carried as multiple frames of images. The image samples may correspond to the target entity image, and the target entity image may be specified by the user; for example, the target entity image may be the image of a known person (such as a host), or of course of any entity, such as a robot or an ordinary person.
The text samples may relate to at least two languages; therefore, the target image sequence obtained from the text samples and their image samples can be applied to texts to be processed that correspond to at least two languages.
The embodiment of the invention can utilize an end-to-end machine learning method to learn the text sample and the corresponding image sample so as to obtain the mapping relation. The input of the end-to-end machine learning method can be a text to be processed, the output can be a target image sequence, and the method can obtain rules between the input features and the output features through the learning of training data.
Broadly speaking, machine learning is a way of giving a machine the ability to learn, so that it can perform functions that direct programming cannot. In a practical sense, machine learning is a method of training a model with data and then using the model to make predictions. Machine learning methods may include: decision trees, linear regression, logistic regression, neural networks, and the like; it is to be understood that the embodiments of the present invention do not limit the specific machine learning method.
It should be understood that the end-to-end machine learning method is only an optional embodiment of the determination method of the mapping relationship, and actually, a person skilled in the art may determine the mapping relationship by using other methods according to the actual application requirements, for example, the other methods may be statistical methods, and the embodiment of the present invention does not limit the specific determination method of the mapping relationship.
In the embodiment of the invention, the duration feature corresponding to the text to be processed is used in both the determination of the target voice sequence and the determination of the target image sequence, and the duration feature can improve the synchronism between the target voice sequence and the target image sequence.
The duration feature may be used to characterize the durations of the phonemes corresponding to the text. The duration feature can capture the rise and fall, pauses, stress, and pacing of speech, which can improve the expressiveness and naturalness of the synthesized speech. Optionally, a duration model may be used to determine the duration feature corresponding to the answer text. The input of the duration model may be phoneme features with stress labels, and the output may be phoneme durations. The duration model may be obtained by learning from speech samples with timing and duration information; the embodiment of the present invention does not limit the specific duration model.
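The sketch below illustrates a duration model of this kind and how its output can place both sequences on a shared frame grid; the feature dimension, network shape, and Softplus output are assumptions, not the patent's concrete duration model.

```python
import torch
import torch.nn as nn

class DurationModel(nn.Module):
    def __init__(self, phoneme_feat_dim=32, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(phoneme_feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Softplus())   # duration in frames, always positive

    def forward(self, phoneme_feats):
        # phoneme_feats: (num_phonemes, phoneme_feat_dim); stress labels are encoded as features
        return self.net(phoneme_feats).squeeze(-1)

def expand_to_frames(phoneme_ids, durations):
    # Repeat each phoneme id by its predicted duration; the resulting frame count then
    # drives both the target voice sequence and the target image sequence, keeping them in sync.
    frames = []
    for pid, dur in zip(phoneme_ids, durations.round().int().tolist()):
        frames.extend([pid] * max(dur, 1))
    return frames

model = DurationModel()
feats = torch.randn(7, 32)              # 7 phonemes of the answer text (illustrative)
durations = model(feats)                # predicted duration per phoneme, in frames
frame_level = expand_to_frames(list(range(7)), durations)
```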
Expression features in different languages are usually different. The expression features may include: phonation method features, force and breath usage, lip features (such as mouth shape and mouth-shape posture), and the like. For example, phonation method features of Chinese may include a front-oral-cavity phonation method, in which the front part of the oral cavity is relatively tense and the phonation position is at the front of the oral cavity. As another example, phonation method features may include a back-oral-cavity phonation method, in which the back part of the oral cavity is relatively tense and open, and the phonation position is at the back of the oral cavity.
In step 103, the target image sequence corresponding to the answer text is obtained from text samples and the image samples corresponding to the text samples, and the languages corresponding to the text samples may include at least two languages; therefore, for the target image sequence obtained from the text samples and their image samples, the expression features corresponding to the target image sequence can match the at least two languages corresponding to the answer text. For example, the text to be processed relates to a first language and a second language, while the text samples relate to the first language, the second language, a third language, and so on.
In an optional embodiment of the present invention, determining the target image sequence corresponding to the answer text may specifically include: determining a target image feature sequence corresponding to the target text feature sequence according to the target text feature sequence corresponding to the answer text and the mapping relation between the text feature sequence and the image feature sequence, and then determining the target image sequence corresponding to the target image feature sequence.
The mapping relation between the text characteristic sequence and the image characteristic sequence can reflect the rule between the text characteristic sequence and the image characteristic sequence.
The text features may include a language feature and a duration feature. The image features are used to characterize the target entity image and may specifically include the foregoing entity state features.
In an optional embodiment of the present invention, determining the target image sequence corresponding to the target image feature sequence specifically includes: synthesizing the target entity image with the target image feature sequence to obtain the target image sequence, that is, endowing the target entity image with the target image feature sequence.
The target entity image may be specified by a user, for example, the target entity image may be an image of a known person (e.g., a host).
The target entity image does not carry an entity state, and the target entity image and the target image feature sequence are synthesized, so that the target image sequence carries an entity state matched with the text, and the naturalness and the richness of the entity state in the target video can be improved.
In the embodiment of the present invention, optionally, the three-dimensional model corresponding to the target entity image and the target image feature sequence may be synthesized to obtain the target image sequence. The three-dimensional model can be obtained by performing three-dimensional reconstruction on multi-frame target entity images.
In practical applications, entities usually exist as three-dimensional geometric bodies. A traditional two-dimensional planar image creates a visual sense of depth only through light-dark contrast and perspective, and cannot produce an appealing natural stereoscopic effect. A three-dimensional image has a spatial form similar to its prototype: it not only has the geometric characteristics of height, width, and depth, but also carries real and vivid state information, conveying a sense of reality that a flat photo cannot and giving people a feeling of closeness and vividness.
In computer graphics, an entity is typically represented by a three-dimensional model, i.e., a model of the corresponding entity in space, which can be displayed by a computer or another video device.
The features corresponding to the three-dimensional model may include geometric features, texture state, entity state features, and the like, and the entity state features may include expression features, lip features, limb features, and so on. The geometric features are usually represented by polygons or voxels; for example, polygons are used to express the geometric part of the three-dimensional model, that is, to express or approximate the curved surfaces of a solid. The basic object is the vertex in three-dimensional space: a straight line connecting two vertices is called an edge, and three vertices connected by three edges form a triangle, the simplest polygon in Euclidean space. Multiple triangles can compose more complex polygons or a single entity with more than three vertices. Quadrilaterals and triangles are the shapes most commonly used in polygon-based three-dimensional models; triangle meshes have become a popular choice for expressing three-dimensional models because their data structure is simple and they are easy for all graphics hardware to draw. Each triangle is a face, so it is also called a triangular patch.
The three-dimensional model can be densely aligned point-cloud data with preset entity states, and the preset entity states may include a neutral expression, closed lips, dropped arms, and the like.
The three-dimensional model corresponding to the target entity image is synthesized with the target image feature sequence; the synthesis can be achieved by modifying vertex positions on the three-dimensional model and the like, and the synthesis methods used may specifically include keyframe interpolation, parameterization, and so on. The keyframe interpolation method interpolates between the image features of key frames. The parameterization method describes changes in the entity state through parameters of the three-dimensional model, and different entity states are obtained by adjusting the parameters.
When the keyframe interpolation method is adopted, the embodiment of the invention can obtain the interpolation vector according to the target image feature sequence; when the parameterization method is adopted, the embodiment of the invention can obtain the parameter vector according to the target image feature sequence.
It should be understood that the above-mentioned key frame interpolation method and parameterization method are only used as alternative embodiments of the synthesis method, and in fact, those skilled in the art can adopt the required synthesis method according to the actual application requirements, and the embodiment of the present application does not impose any limitation on the specific synthesis method.
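As an illustration of the parameterization approach, the sketch below blends per-parameter vertex offsets onto a neutral, densely aligned point cloud, one parameter vector per frame; the array shapes and the linear blending are assumptions chosen to keep the example short.

```python
import numpy as np

def synthesize_frames(neutral_vertices, offset_basis, parameter_sequence):
    """neutral_vertices: (V, 3) densely aligned point cloud of the target entity image.
    offset_basis: (K, V, 3) vertex offsets, one per controllable entity-state parameter.
    parameter_sequence: (T, K) one parameter vector per output frame."""
    frames = []
    for params in parameter_sequence:
        # Each frame modifies the vertex positions as a weighted sum of the offsets.
        vertices = neutral_vertices + np.tensordot(params, offset_basis, axes=1)
        frames.append(vertices)
    return frames

# Illustrative shapes: 5000 vertices, 20 parameters, 250 frames.
V, K, T = 5000, 20, 250
frames = synthesize_frames(np.zeros((V, 3)),
                           np.random.randn(K, V, 3) * 0.01,
                           np.random.rand(T, K))
```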
In the embodiment of the invention, in the process of determining the image characteristics corresponding to the target image sequence, the rule between the text characteristic sequence and the image characteristic sequence is utilized. The image features may include: at least one of an expressive feature, a lip feature, and a limb feature.
In order to improve the accuracy of the image features corresponding to the target image sequence, the embodiment of the invention can also expand or adjust the image features corresponding to the target image sequence.
In an optional embodiment of the present invention, the limb feature corresponding to the target image sequence may be obtained according to a semantic feature corresponding to the text. Since the embodiment of the invention uses the semantic features corresponding to the text in the process of determining the limb features, the accuracy of the limb features can be improved.
In this embodiment of the present invention, optionally, any one of the direction, the position, the speed, and the strength of the limb feature is related to a semantic feature corresponding to the text.
Alternatively, the semantic features may be associated with emotional features, and the limb features can be classified according to the emotional features to obtain the limb feature corresponding to each emotional feature.
Optionally, the emotional features may include: positive, negative, or neutral, etc.
The location area of a limb feature may include: an upper area, a middle area, and a lower area. The upper area, above the shoulders, can express positive and affirmative emotional features such as ideal, hope, joy, and congratulation. The middle area, from the shoulders to the waist, can be used to narrate things and give explanations, expressing neutral emotion. The lower area, below the waist, can express negative emotions such as hate, objection, criticism, and disappointment.
In addition to the location area, the limb feature may include a direction. For example, with the palm facing up, a positive emotional feature may be expressed; as another example, with the palm facing down, a negative emotion may be expressed.
In the embodiment of the present invention, the types of semantic features may include: keywords, one-hot vectors, word embedding vectors (Word Embedding), and the like. Word embedding finds a mapping or function that generates a representation of each word in a new space, and this representation is the word representation.
According to the embodiment of the invention, the limb feature corresponding to the semantic feature of the text can be determined through the mapping relation between semantic features and limb features. The mapping relation between semantic features and limb features can be obtained through a statistical method or through an end-to-end method.
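Purely as a sketch of such a mapping relation, the example below uses hand-picked keywords as the semantic feature and a lookup table as the mapping; the keyword lists, region names, and return format are illustrative assumptions and stand in for the statistical or end-to-end methods mentioned above.

```python
# Illustrative keyword lists; a real system could instead use one-hot vectors,
# word embeddings, or an end-to-end model as the semantic feature.
POSITIVE = {"hope", "joy", "congratulate", "ideal"}
NEGATIVE = {"hate", "oppose", "criticize", "disappoint"}

# Emotion category -> (location area, palm direction) of the limb feature.
GESTURE_TABLE = {
    "positive": ("upper",  "palm_up"),    # above the shoulders
    "neutral":  ("middle", "palm_side"),  # shoulders to waist
    "negative": ("lower",  "palm_down"),  # below the waist
}

def limb_feature_for_text(text: str):
    """Map the (keyword-based) semantic features of a text to a limb feature."""
    words = set(text.lower().split())
    if words & POSITIVE:
        emotion = "positive"
    elif words & NEGATIVE:
        emotion = "negative"
    else:
        emotion = "neutral"
    area, direction = GESTURE_TABLE[emotion]
    return {"emotion": emotion, "area": area, "direction": direction}

print(limb_feature_for_text("We hope and congratulate you"))
# {'emotion': 'positive', 'area': 'upper', 'direction': 'palm_up'}
```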
The embodiment of the invention can realize the alignment of the target voice sequence and the target image sequence on the time axis through the voice sample and the image sample aligned on the time axis; or, the embodiment of the present invention may implement alignment of the target voice sequence and the target image sequence on the time axis through the duration feature. On the basis that the target voice sequence and the target image sequence are aligned on the time axis, the target voice sequence and the target image sequence can be fused to obtain a target video. Alternatively, a multi-modal fusion technique may be employed to fuse the target speech sequence and the target image sequence. It is understood that the embodiment of the present invention does not limit the specific fusion method.
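The following sketch illustrates, under assumed sample and frame rates, how duration features could keep the voice sequence and the image sequence aligned on a shared time axis before fusion; the function, constants, and resizing strategy are assumptions, and the final muxing into a video file is left to any multimedia tool.

```python
import numpy as np

SAMPLE_RATE = 16000   # audio samples per second (assumption)
FPS = 25              # image frames per second (assumption)

def align_by_duration(durations_sec, speech_chunks, image_chunks):
    """Align speech and image chunks on a shared time axis via duration features.

    durations_sec: per-unit (e.g. per-phoneme) durations in seconds.
    speech_chunks: list of 1-D arrays of audio samples, one per unit.
    image_chunks:  list of lists of image frames, one list per unit.
    Returns (audio, frames) covering the same total time.
    """
    audio, frames = [], []
    for dur, speech, images in zip(durations_sec, speech_chunks, image_chunks):
        n_samples = int(round(dur * SAMPLE_RATE))
        n_frames = max(1, int(round(dur * FPS)))
        # Repeat or trim each modality to the length dictated by the duration feature.
        speech = np.resize(np.asarray(speech, dtype=np.float32), n_samples)
        idx = np.linspace(0, len(images) - 1, n_frames).round().astype(int)
        audio.append(speech)
        frames.extend(images[i] for i in idx)
    return np.concatenate(audio), frames

# Tiny illustration: two units with durations 0.2 s and 0.4 s; the strings stand
# in for image frames. The aligned audio and frame list could then be muxed into
# the target video with any multimedia tool (e.g. ffmpeg); that step is omitted.
audio, frames = align_by_duration(
    [0.2, 0.4],
    [np.zeros(3000), np.zeros(7000)],
    [["img_a"], ["img_b1", "img_b2"]],
)
print(len(audio), len(frames))  # 9600 samples, 15 frames
```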
After the target video is obtained, the target video can be saved or output. For example, the server may send the target video to the client, cause the client to output the target video to the user, and so on.
To sum up, in the data processing method of the embodiment of the present invention, the target voice sequence can match the tone of the target sounding body, and the target image sequence is obtained on the basis of the target entity image, so that, through the obtained target video, the answer text can be expressed by the target entity image in the tone of the target sounding body. Since the target video can be generated by a machine, the generation time of the target video can be shortened and its timeliness improved, making it suitable for video interaction scenarios with high timeliness requirements, such as breaking news scenarios.
In addition, since the target entity image in the target video expresses the answer text in the tone of the target sounding body, compared with expressing the answer text manually, labor cost can be saved and the working efficiency of related industries can be improved.
In addition, the text sample may include text corresponding to at least two languages; therefore, the target image sequence obtained according to the text sample and its image sample can be applied to answer texts corresponding to at least two languages.
In addition, the duration feature corresponding to the answer text is used in determining both the target voice sequence and the target image sequence, which can improve the synchronization between the target voice sequence and the target image sequence.
Method embodiment two
Referring to fig. 3, a flowchart illustrating steps of a second embodiment of the data processing method according to the present invention is shown, and the processing for question-answer interaction may specifically include the following steps:
step 301, determining a target voice sequence and a target image sequence corresponding to a target entity image; the mode corresponding to the target image sequence may include: listening mode, or answering mode; in the input process of the question, the mode corresponding to the target image sequence can be a listening mode; or after the input of the question is completed, the mode corresponding to the target image sequence may be an answer mode;
step 302, compensating the boundary of a preset area in the target image sequence;
step 303, fusing the target voice sequence and the compensated target image sequence to obtain a corresponding target video, so as to output the target video to a user.
In the embodiment of the present invention, in the process of determining the target image sequence corresponding to the answer text, the three-dimensional model of the target entity image is usually used. Due to limitations of the reconstruction method of the three-dimensional model and of the method for synthesizing the three-dimensional model with the image feature sequence, details of the polygons of the three-dimensional model are easily missing, which makes the target entity image corresponding to the target image sequence incomplete, for example with part of the teeth or part of the nose missing.
The embodiment of the invention compensates the boundary of the preset area in the target image sequence, and can improve the integrity of the preset area.
The preset region may represent a part of a solid, such as a face or a limb, and accordingly, the preset region may specifically include at least one of the following regions:
a facial region;
a clothing region; and
a limb area.
In an embodiment of the present invention, the boundaries of the tooth region in the target image sequence are compensated to repair an incomplete tooth or supplement an absent tooth, so that the integrity of the tooth region can be improved.
In practical applications, a target entity image containing the complete preset region may be used as a reference to compensate the boundary of the preset region in the target image sequence.
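As a rough sketch of such compensation, the example below copies pixels of the preset region from a reference image of the target entity into a frame where the region is incomplete; the hole-detection rule, mask, and threshold are assumptions, and a real system might instead blend or inpaint along the boundary.

```python
import numpy as np

def compensate_region(frame, reference, region_mask):
    """Fill in missing detail of a preset region (e.g. teeth) in one frame.

    frame:       (H, W, 3) rendered frame that may miss detail in the region.
    reference:   (H, W, 3) image of the target entity with the complete region.
    region_mask: (H, W) boolean mask marking the preset region.
    A pixel is treated as "missing" if it is near black inside the region, and
    is then copied from the reference image.
    """
    frame = frame.copy()
    missing = region_mask & (frame.sum(axis=-1) < 30)  # crude hole detector (assumption)
    frame[missing] = reference[missing]
    return frame

# Hypothetical usage on a 4x4 toy image with a 2x2 "tooth" region.
frame = np.zeros((4, 4, 3), dtype=np.uint8)        # region rendered as black (missing)
reference = np.full((4, 4, 3), 255, dtype=np.uint8) # reference with complete region
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
print(compensate_region(frame, reference, mask)[1, 1])  # [255 255 255]
```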
It should be noted that, for simplicity of description, the method embodiments are described as a series of action combinations, but those skilled in the art should understand that the present invention is not limited by the described order of actions, because some steps may be performed in other orders or simultaneously according to the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are preferred embodiments, and the actions involved are not necessarily required by the present invention.
Device embodiment
Referring to fig. 4, a block diagram of a data processing apparatus according to an embodiment of the present invention is shown, and is used for processing question-answer interaction, where the processing may specifically include:
a determining module 401, configured to determine a target voice sequence and a target image sequence corresponding to a target entity image; the mode corresponding to the target image sequence may include: a listening mode, or an answering mode; in the input process of the question, the mode corresponding to the target image sequence is the listening mode; or after the input of the question is completed, the mode corresponding to the target image sequence is the answer mode; and
a fusion module 402, configured to fuse the target voice sequence and the target image sequence to obtain a corresponding target video, so as to output the target video to a user.
Optionally, the image feature corresponding to the target image sequence may include at least one of the following features:
an expression characteristic;
a lip feature; and
a limb characteristic.
Optionally, the determining module 401 may include:
and the question voice image sequence determining module is used for determining a target voice sequence and a target image sequence corresponding to the target entity image according to the question related text.
Optionally, the question voice image sequence determination module may include:
the first voice sequence determining module is used for determining a target voice sequence corresponding to the question related text;
the first image sequence determining module is used for determining a target image sequence corresponding to the target voice sequence according to the mapping relation between the voice characteristic sequence and the image characteristic sequence; the voice feature sequence and the image feature sequence in the mapping relation are aligned on a time axis; the mapping relation is obtained according to the voice sample and the image sample aligned by the time axis.
Optionally, the question voice image sequence determination module may include:
the time length characteristic determining module is used for determining time length characteristics corresponding to the question related texts;
the second voice sequence determination module is used for determining a target voice sequence corresponding to the question related text according to the duration characteristics;
the second image sequence determining module is used for determining a target image sequence corresponding to the question related text according to the duration characteristics; the target image sequence is obtained according to the text sample and the image sample corresponding to the text sample.
Optionally, the limb feature corresponding to the target image sequence is obtained according to the semantic feature corresponding to the question-related text.
Optionally, the apparatus may further include:
and the boundary compensation module is used for compensating the boundary of a preset area in the target image sequence before the fusion module fuses the target voice sequence and the target image sequence.
Optionally, the apparatus may further include:
the first output module is used for outputting the target video to a user; or
The second output module is used for outputting the link of the target video to a user; or
A third output module, configured to output the target voice sequence or a link of the target voice sequence to a user; or
And the fourth output module is used for outputting the question related text to the user.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 5 is a block diagram illustrating a structure of an apparatus for data processing as a device according to an exemplary embodiment. For example, the apparatus 900 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 5, apparatus 900 may include one or more of the following components: processing component 902, memory 904, power component 906, multimedia component 908, audio component 910, input/output (I/O) interface 912, sensor component 914, and communication component 916.
The processing component 902 generally controls overall operation of the device 900, such as operations associated with display, incoming calls, data communications, camera operations, and recording operations. Processing element 902 may include one or more processors 920 to execute instructions to perform all or a portion of the steps of the methods described above. Further, processing component 902 can include one or more modules that facilitate interaction between processing component 902 and other components. For example, the processing component 902 can include a multimedia module to facilitate interaction between the multimedia component 908 and the processing component 902.
The memory 904 is configured to store various types of data to support operation at the device 900. Examples of such data include instructions for any application or method operating on device 900, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 904 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 906 provides power to the various components of the device 900. The power components 906 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device 900.
The multimedia component 908 comprises a screen providing an output interface between the device 900 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gesture actions on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 908 includes a front-facing camera and/or a rear-facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the device 900 is in an operating mode, such as a shooting mode or a video mode. Each front-facing camera and rear-facing camera may be a fixed optical lens system or have focal length and optical zoom capability.
The audio component 910 is configured to output and/or input audio signals. For example, audio component 910 includes a Microphone (MIC) configured to receive external audio signals when apparatus 900 is in an operating mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 904 or transmitted via the communication component 916. In some embodiments, audio component 910 also includes a speaker for outputting audio signals.
I/O interface 912 provides an interface between processing component 902 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor component 914 includes one or more sensors for providing status assessments of various aspects of the apparatus 900. For example, the sensor assembly 914 may detect the open/closed state of the device 900 and the relative positioning of components, such as the display and keypad of the apparatus 900; the sensor assembly 914 may also detect a change in the position of the apparatus 900 or of a component of the apparatus 900, the presence or absence of user contact with the apparatus 900, the orientation or acceleration/deceleration of the apparatus 900, and a change in the temperature of the apparatus 900. The sensor assembly 914 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 914 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 914 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 916 is configured to facilitate communications between the apparatus 900 and other devices in a wired or wireless manner. The apparatus 900 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 916 receives a broadcast signal or broadcast associated information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communications component 916 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 900 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer readable storage medium comprising instructions, such as the memory 904 comprising instructions, executable by the processor 920 of the apparatus 900 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Fig. 6 is a block diagram of a server in some embodiments of the invention. The server 1900 may vary widely by configuration or performance and may include one or more Central Processing Units (CPUs) 1922 (e.g., one or more processors) and memory 1932, one or more storage media 1930 (e.g., one or more mass storage devices) storing applications 1942 or data 1944. Memory 1932 and storage medium 1930 can be, among other things, transient or persistent storage. The program stored in the storage medium 1930 may include one or more modules (not shown), each of which may include a series of instructions operating on a server. Still further, a central processor 1922 may be provided in communication with the storage medium 1930 to execute a series of instruction operations in the storage medium 1930 on the server 1900.
The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input-output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
A non-transitory computer readable storage medium in which instructions, when executed by a processor of an apparatus (device or server), enable the apparatus to perform a data processing method, the method comprising: determining a target voice sequence and a target image sequence corresponding to a target entity image; the mode corresponding to the target image sequence comprises: a listening mode, or an answering mode; in the input process of the question, the mode corresponding to the target image sequence is the listening mode; or after the input of the question is completed, the mode corresponding to the target image sequence is the answer mode; and fusing the target voice sequence and the target image sequence to obtain a corresponding target video, so as to output the target video to a user.
The embodiment of the invention discloses A1 and a data processing method, which is used for processing question-answer interaction and comprises the following steps:
determining a target voice sequence and a target image sequence corresponding to a target entity image; the mode corresponding to the target image sequence comprises: a listening mode, or an answering mode; in the input process of the question, the mode corresponding to the target image sequence is the listening mode; or after the input of the question is completed, the mode corresponding to the target image sequence is the answer mode;
and fusing the target voice sequence and the target image sequence to obtain a corresponding target video so as to output the target video to a user.
A2, according to the method in A1, the image features corresponding to the target image sequence include at least one of the following features:
an expression characteristic;
a lip feature; and
a limb characteristic.
A3, according to the method in A1, the determining the target voice sequence and the target image sequence corresponding to the target entity image includes:
and determining a target voice sequence and a target image sequence corresponding to the target entity image according to the question related text.
A4, according to the method in A3, the determining the target voice sequence and the target image sequence corresponding to the target entity image includes:
determining a target voice sequence corresponding to the question related text;
determining a target image sequence corresponding to the target voice sequence according to a mapping relation between the voice characteristic sequence and the image characteristic sequence; the voice feature sequence and the image feature sequence in the mapping relation are aligned on a time axis; the mapping relation is obtained according to the voice sample and the image sample aligned by the time axis.
A5, according to the method in A3, the determining the target voice sequence and the target image sequence corresponding to the target entity image includes:
determining a duration characteristic corresponding to the question related text;
determining a target voice sequence corresponding to the question related text according to the duration characteristics;
determining a target image sequence corresponding to the question related text according to the duration characteristics; the target image sequence is obtained according to the text sample and the image sample corresponding to the text sample.
A6, according to the method in A3, the corresponding limb features of the target image sequence are obtained according to the semantic features corresponding to the question-related text.
A7, the method according to any of A1 to A6, further comprising, before the fusing the target speech sequence and the target image sequence:
and compensating the boundary of a preset area in the target image sequence.
A8, the method of any one of A1 to A6, the method further comprising:
outputting the target video to a user; or
Outputting a link to the target video to a user; or
Outputting the target voice sequence or the link of the target voice sequence to a user; or
The question-related text is output to the user.
The embodiment of the invention discloses B9 and a data processing device, which is used for processing question-answer interaction, and the device comprises:
the determining module is used for determining a target voice sequence and a target image sequence corresponding to the target entity image; the mode corresponding to the target image sequence comprises: a listening mode, or an answering mode; in the input process of the question, the mode corresponding to the target image sequence is the listening mode; or after the input of the question is completed, the mode corresponding to the target image sequence is the answer mode; and
and the fusion module is used for fusing the target voice sequence and the target image sequence to obtain a corresponding target video so as to output the target video to a user.
B10, according to the device of B9, the image features corresponding to the target image sequence comprise at least one of the following features:
an expression characteristic;
a lip feature; and
a limb characteristic.
B11, the apparatus of B9, the means for determining comprising:
and the question voice image sequence determining module is used for determining a target voice sequence and a target image sequence corresponding to the target entity image according to the question related text.
B12, the apparatus of B11, the question speech image sequence determination module comprising:
the first voice sequence determining module is used for determining a target voice sequence corresponding to the question related text;
the first image sequence determining module is used for determining a target image sequence corresponding to the target voice sequence according to the mapping relation between the voice characteristic sequence and the image characteristic sequence; the voice feature sequence and the image feature sequence in the mapping relation are aligned on a time axis; the mapping relation is obtained according to the voice sample and the image sample aligned by the time axis.
B13, the apparatus of B11, the question speech image sequence determination module comprising:
the time length characteristic determining module is used for determining time length characteristics corresponding to the question related texts;
the second voice sequence determination module is used for determining a target voice sequence corresponding to the question related text according to the duration characteristics;
the second image sequence determining module is used for determining a target image sequence corresponding to the question related text according to the duration characteristics; the target image sequence is obtained according to the text sample and the image sample corresponding to the text sample.
B14, according to the device of B11, the corresponding limb feature of the target image sequence is obtained according to the semantic feature corresponding to the question related text.
B15, the apparatus according to any one of B9 to B14, further comprising:
and the boundary compensation module is used for compensating the boundary of a preset area in the target image sequence before the fusion module fuses the target voice sequence and the target image sequence.
B16, the apparatus according to any one of B9 to B14, further comprising:
the first output module is used for outputting the target video to a user; or
The second output module is used for outputting the link of the target video to a user; or
A third output module, configured to output the target voice sequence or a link of the target voice sequence to a user; or
And the fourth output module is used for outputting the question related text to the user.
The embodiment of the invention discloses C17, an apparatus for data processing, for processing of question-answer interaction, the apparatus comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory, and the one or more programs configured to be executed by the one or more processors comprise instructions for:
determining a target voice sequence and a target image sequence corresponding to a target entity image; the mode corresponding to the target image sequence comprises: a listening mode, or an answering mode; in the input process of the question, the mode corresponding to the target image sequence is the listening mode; or after the input of the question is completed, the mode corresponding to the target image sequence is the answer mode;
and fusing the target voice sequence and the target image sequence to obtain a corresponding target video so as to output the target video to a user.
C18, according to the device of C17, the image features corresponding to the target image sequence include at least one of the following features:
an expression characteristic;
a lip feature; and
a limb characteristic.
C19, the device according to C17, the determining the target voice sequence and the target image sequence corresponding to the target entity image comprises:
and determining a target voice sequence and a target image sequence corresponding to the target entity image according to the question related text.
C20, the device according to C19, the determining the target voice sequence and the target image sequence corresponding to the target entity image comprises:
determining a target voice sequence corresponding to the question related text;
determining a target image sequence corresponding to the target voice sequence according to a mapping relation between the voice characteristic sequence and the image characteristic sequence; the voice feature sequence and the image feature sequence in the mapping relation are aligned on a time axis; the mapping relation is obtained according to the voice sample and the image sample aligned by the time axis.
C21, the device according to C19, the determining the target voice sequence and the target image sequence corresponding to the target entity image comprises:
determining a duration characteristic corresponding to the question related text;
determining a target voice sequence corresponding to the question related text according to the duration characteristics;
determining a target image sequence corresponding to the question related text according to the duration characteristics; the target image sequence is obtained according to the text sample and the image sample corresponding to the text sample.
And C22, according to the device of C19, obtaining the corresponding limb characteristics of the target image sequence according to the semantic characteristics corresponding to the question related text.
C23, the apparatus according to any of C17 to C22, further comprising, before said fusing the target speech sequence and the target image sequence:
and compensating the boundary of a preset area in the target image sequence.
C24, the apparatus according to any one of C17 to C22, the apparatus further comprising:
outputting the target video to a user; or
Outputting a link to the target video to a user; or
Outputting the target voice sequence or the link of the target voice sequence to a user; or
The question-related text is output to the user.
Embodiments of the present invention disclose D25, a machine-readable medium having instructions stored thereon, which, when executed by one or more processors, cause an apparatus to perform a data processing method as described in one or more of A1 to A8.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
The data processing method, the data processing apparatus and the apparatus for data processing provided by the present invention are described in detail above, and specific examples are applied herein to illustrate the principles and embodiments of the present invention, and the description of the above embodiments is only used to help understand the method and the core idea of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (25)

1. A data processing method for processing question-answer interaction, the method comprising:
determining a target voice sequence and a target image sequence corresponding to a target entity image; the mode corresponding to the target image sequence comprises: a listening mode, or an answering mode; in the input process of the question, the mode corresponding to the target image sequence is a listening mode, and the target image sequence corresponds to a second entity state and is used for representing the entity state under the condition of outputting a listening state text; or after the input of the question is completed, the mode corresponding to the target image sequence is an answer mode, and the target image sequence corresponds to a first entity state and is used for representing the entity state under the condition of outputting an answer text;
fusing the target voice sequence and the target image sequence to obtain a corresponding target video so as to output the target video to a user; the target video includes: a first target video corresponding to the listening mode and a second target video corresponding to the answering mode;
switching the mode corresponding to the target image sequence according to learning on linking image samples; the linking image samples include: image samples in which an image sample corresponding to the listening mode and an image sample corresponding to the answering mode appear in sequence; the linking image samples further include: image samples in which an image sample corresponding to the answering mode and an image sample corresponding to the listening mode appear in sequence.
2. The method of claim 1, wherein the image features corresponding to the target image sequence comprise at least one of:
an expression characteristic;
a lip feature; and
a limb characteristic.
3. The method of claim 1, wherein the determining the target voice sequence and the target image sequence corresponding to the target entity image comprises:
and determining a target voice sequence and a target image sequence corresponding to the target entity image according to the question related text.
4. The method of claim 3, wherein the determining the target voice sequence and the target image sequence corresponding to the target entity image comprises:
determining a target voice sequence corresponding to the question related text;
determining a target image sequence corresponding to the target voice sequence according to a mapping relation between the voice characteristic sequence and the image characteristic sequence; the voice feature sequence and the image feature sequence in the mapping relation are aligned on a time axis; the mapping relation is obtained according to the voice sample and the image sample aligned by the time axis.
5. The method of claim 3, wherein the determining the target voice sequence and the target image sequence corresponding to the target entity image comprises:
determining a duration characteristic corresponding to the question related text;
determining a target voice sequence corresponding to the question related text according to the duration characteristics;
determining a target image sequence corresponding to the question related text according to the duration characteristics; the target image sequence is obtained according to the text sample and the image sample corresponding to the text sample.
6. The method according to claim 3, wherein the limb feature corresponding to the target image sequence is obtained according to a semantic feature corresponding to the question-related text.
7. The method according to any one of claims 1 to 6, wherein prior to said fusing said target speech sequence and said target image sequence, said method further comprises:
and compensating the boundary of a preset area in the target image sequence.
8. The method according to any one of claims 1 to 6, further comprising:
outputting the target video to a user; or
Outputting a link to the target video to a user; or
Outputting the target voice sequence or the link of the target voice sequence to a user; or
The question-related text is output to the user.
9. A data processing apparatus for processing a question-answer interaction, the apparatus comprising:
the determining module is used for determining a target voice sequence and a target image sequence corresponding to the target entity image; the mode corresponding to the target image sequence comprises: a listening mode, or an answering mode; in the input process of the question, the mode corresponding to the target image sequence is a listening mode, and the target image sequence corresponds to a second entity state and is used for representing the entity state under the condition of outputting a listening state text; or after the input of the question is completed, the mode corresponding to the target image sequence is an answer mode, and the target image sequence corresponds to a first entity state and is used for representing the entity state under the condition of outputting an answer text; and
the fusion module is used for fusing the target voice sequence and the target image sequence to obtain a corresponding target video so as to output the target video to a user; the target video includes: a first target video corresponding to the listening mode and a second target video corresponding to the answering mode;
switching the mode corresponding to the target image sequence according to learning on linking image samples; the linking image samples include: image samples in which an image sample corresponding to the listening mode and an image sample corresponding to the answering mode appear in sequence; the linking image samples further include: image samples in which an image sample corresponding to the answering mode and an image sample corresponding to the listening mode appear in sequence.
10. The apparatus of claim 9, wherein the image features corresponding to the target image sequence comprise at least one of:
an expression characteristic;
a lip feature; and
a limb characteristic.
11. The apparatus of claim 9, wherein the determining module comprises:
and the question voice image sequence determining module is used for determining a target voice sequence and a target image sequence corresponding to the target entity image according to the question related text.
12. The apparatus of claim 11, wherein the question speech image sequence determination module comprises:
the first voice sequence determining module is used for determining a target voice sequence corresponding to the question related text;
the first image sequence determining module is used for determining a target image sequence corresponding to the target voice sequence according to the mapping relation between the voice characteristic sequence and the image characteristic sequence; the voice feature sequence and the image feature sequence in the mapping relation are aligned on a time axis; the mapping relation is obtained according to the voice sample and the image sample aligned by the time axis.
13. The apparatus of claim 11, wherein the question speech image sequence determination module comprises:
the time length characteristic determining module is used for determining time length characteristics corresponding to the question related texts;
the second voice sequence determination module is used for determining a target voice sequence corresponding to the question related text according to the duration characteristics;
the second image sequence determining module is used for determining a target image sequence corresponding to the question related text according to the duration characteristics; the target image sequence is obtained according to the text sample and the image sample corresponding to the text sample.
14. The apparatus according to claim 11, wherein the limb feature corresponding to the target image sequence is obtained according to a semantic feature corresponding to the question-related text.
15. The apparatus of any of claims 9 to 14, further comprising:
and the boundary compensation module is used for compensating the boundary of a preset area in the target image sequence before the fusion module fuses the target voice sequence and the target image sequence.
16. The apparatus of any of claims 9 to 14, further comprising:
the first output module is used for outputting the target video to a user; or
The second output module is used for outputting the link of the target video to a user; or
A third output module, configured to output the target voice sequence or a link of the target voice sequence to a user; or
And the fourth output module is used for outputting the question related text to the user.
17. An apparatus for data processing, used for processing of question-answer interaction, the apparatus comprising a memory and one or more programs, wherein the one or more programs are stored in the memory, and wherein execution of the one or more programs by one or more processors comprises instructions for:
determining a target voice sequence and a target image sequence corresponding to a target entity image; the mode corresponding to the target image sequence comprises: a listening mode, or an answering mode; in the input process of the question, the mode corresponding to the target image sequence is a listening mode, and the target image sequence corresponds to a second entity state and is used for representing the entity state under the condition of outputting a listening state text; or after the input of the question is completed, the mode corresponding to the target image sequence is an answer mode, and the target image sequence corresponds to a first entity state and is used for representing the entity state under the condition of outputting an answer text;
fusing the target voice sequence and the target image sequence to obtain a corresponding target video so as to output the target video to a user; the target video includes: a first target video corresponding to the listening mode and a second target video corresponding to the answering mode;
switching the mode corresponding to the target image sequence according to learning on linking image samples; the linking image samples include: image samples in which an image sample corresponding to the listening mode and an image sample corresponding to the answering mode appear in sequence; the linking image samples further include: image samples in which an image sample corresponding to the answering mode and an image sample corresponding to the listening mode appear in sequence.
18. The apparatus of claim 17, wherein the image features corresponding to the target image sequence comprise at least one of:
an expression characteristic;
a lip feature; and
a limb characteristic.
19. The apparatus of claim 17, wherein the determining the target voice sequence and the target image sequence corresponding to the target entity image comprises:
and determining a target voice sequence and a target image sequence corresponding to the target entity image according to the question related text.
20. The apparatus of claim 19, wherein the determining the target voice sequence and the target image sequence corresponding to the target entity image comprises:
determining a target voice sequence corresponding to the question related text;
determining a target image sequence corresponding to the target voice sequence according to a mapping relation between the voice characteristic sequence and the image characteristic sequence; the voice feature sequence and the image feature sequence in the mapping relation are aligned on a time axis; the mapping relation is obtained according to the voice sample and the image sample aligned by the time axis.
21. The apparatus of claim 19, wherein the determining the target voice sequence and the target image sequence corresponding to the target entity image comprises:
determining a duration characteristic corresponding to the question related text;
determining a target voice sequence corresponding to the question related text according to the duration characteristics;
determining a target image sequence corresponding to the question related text according to the duration characteristics; the target image sequence is obtained according to the text sample and the image sample corresponding to the text sample.
22. The apparatus according to claim 19, wherein the limb feature corresponding to the target image sequence is obtained according to a semantic feature corresponding to the question-related text.
23. The apparatus of any of claims 17-22, wherein prior to the fusing the target speech sequence and the target image sequence, the apparatus is further configured to execute the one or more programs by one or more processors including instructions for:
and compensating the boundary of a preset area in the target image sequence.
24. The apparatus of any of claims 17-22, wherein the apparatus is further configured to execute the one or more programs by one or more processors including instructions for:
outputting the target video to a user; or
Outputting a link to the target video to a user; or
Outputting the target voice sequence or the link of the target voice sequence to a user; or
The question-related text is output to the user.
25. A machine-readable medium having stored thereon instructions which, when executed by one or more processors, cause an apparatus to perform a data processing method as claimed in one or more of claims 1 to 8.
CN201910295565.XA 2019-04-12 2019-04-12 Data processing method and device for data processing Active CN110148406B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910295565.XA CN110148406B (en) 2019-04-12 2019-04-12 Data processing method and device for data processing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910295565.XA CN110148406B (en) 2019-04-12 2019-04-12 Data processing method and device for data processing

Publications (2)

Publication Number Publication Date
CN110148406A CN110148406A (en) 2019-08-20
CN110148406B true CN110148406B (en) 2022-03-04

Family

ID=67588858

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910295565.XA Active CN110148406B (en) 2019-04-12 2019-04-12 Data processing method and device for data processing

Country Status (1)

Country Link
CN (1) CN110148406B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112784146A (en) * 2019-11-04 2021-05-11 北京搜狗科技发展有限公司 Data processing method and device and data processing device
CN111429267A (en) * 2020-03-26 2020-07-17 深圳壹账通智能科技有限公司 Face examination risk control method and device, computer equipment and storage medium
CN111696579B (en) * 2020-06-17 2022-10-28 厦门快商通科技股份有限公司 Speech emotion recognition method, device, equipment and computer storage medium
CN113642394A (en) * 2021-07-07 2021-11-12 北京搜狗科技发展有限公司 Action processing method, device and medium for virtual object

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103650002A (en) * 2011-05-06 2014-03-19 西尔股份有限公司 Video generation based on text
CN106648082A (en) * 2016-12-09 2017-05-10 厦门快商通科技股份有限公司 Intelligent service device capable of simulating human interactions and method
CN106910513A (en) * 2015-12-22 2017-06-30 微软技术许可有限责任公司 Emotional intelligence chat engine
CN107247750A (en) * 2017-05-26 2017-10-13 深圳千尘计算机技术有限公司 Artificial intelligence exchange method and system
CN108345667A (en) * 2018-02-06 2018-07-31 北京搜狗科技发展有限公司 A kind of searching method and relevant apparatus
CN109033423A (en) * 2018-08-10 2018-12-18 北京搜狗科技发展有限公司 Simultaneous interpretation caption presentation method and device, intelligent meeting method, apparatus and system

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7102615B2 (en) * 2002-07-27 2006-09-05 Sony Computer Entertainment Inc. Man-machine interface using a deformable device
WO2011036762A1 (en) * 2009-09-25 2011-03-31 株式会社東芝 Speech interaction device and program
WO2013137900A1 (en) * 2012-03-16 2013-09-19 Nuance Communictions, Inc. User dedicated automatic speech recognition
CN103345853A (en) * 2013-07-04 2013-10-09 昆明医科大学第二附属医院 Training system and method for attention dysfunction after cerebral injury
CN104866101B (en) * 2015-05-27 2018-04-27 世优(北京)科技有限公司 The real-time interactive control method and device of virtual objects
CN106653052B (en) * 2016-12-29 2020-10-16 Tcl科技集团股份有限公司 Virtual human face animation generation method and device
CN107065586B (en) * 2017-05-23 2020-02-07 中国科学院自动化研究所 Interactive intelligent home service system and method
CN107423809B (en) * 2017-07-07 2021-02-26 北京光年无限科技有限公司 Virtual robot multi-mode interaction method and system applied to video live broadcast platform
CN108052250A (en) * 2017-12-12 2018-05-18 北京光年无限科技有限公司 Virtual idol deductive data processing method and system based on multi-modal interaction
CN108010531B (en) * 2017-12-14 2021-07-27 南京美桥信息科技有限公司 Visual intelligent inquiry method and system
CN108446641A (en) * 2018-03-22 2018-08-24 深圳市迪比科电子科技有限公司 A method of degree of lip-rounding image identification system based on machine learning and passes through face line and identify sounding
CN109346076A (en) * 2018-10-25 2019-02-15 三星电子(中国)研发中心 Interactive voice, method of speech processing, device and system

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103650002A (en) * 2011-05-06 2014-03-19 西尔股份有限公司 Video generation based on text
CN106910513A (en) * 2015-12-22 2017-06-30 微软技术许可有限责任公司 Emotional intelligence chat engine
CN106648082A (en) * 2016-12-09 2017-05-10 厦门快商通科技股份有限公司 Intelligent service device capable of simulating human interactions and method
CN107247750A (en) * 2017-05-26 2017-10-13 深圳千尘计算机技术有限公司 Artificial intelligence exchange method and system
CN108345667A (en) * 2018-02-06 2018-07-31 北京搜狗科技发展有限公司 A kind of searching method and relevant apparatus
CN109033423A (en) * 2018-08-10 2018-12-18 北京搜狗科技发展有限公司 Simultaneous interpretation caption presentation method and device, intelligent meeting method, apparatus and system

Also Published As

Publication number Publication date
CN110148406A (en) 2019-08-20

Similar Documents

Publication Publication Date Title
CN110688911B (en) Video processing method, device, system, terminal equipment and storage medium
WO2022048403A1 (en) Virtual role-based multimodal interaction method, apparatus and system, storage medium, and terminal
CN110148406B (en) Data processing method and device for data processing
US20200279553A1 (en) Linguistic style matching agent
CN110162598B (en) Data processing method and device for data processing
KR101604593B1 (en) Method for modifying a representation based upon a user instruction
WO2021248473A1 (en) Personalized speech-to-video with three-dimensional (3d) skeleton regularization and expressive body poses
US20120130717A1 (en) Real-time Animation for an Expressive Avatar
JP7227395B2 (en) Interactive object driving method, apparatus, device, and storage medium
JP7193015B2 (en) Communication support program, communication support method, communication support system, terminal device and non-verbal expression program
US20230082830A1 (en) Method and apparatus for driving digital human, and electronic device
Katayama et al. Situation-aware emotion regulation of conversational agents with kinetic earables
WO2022242706A1 (en) Multimodal based reactive response generation
KR20220129989A (en) Avatar-based interaction service method and apparatus
CN114882862A (en) Voice processing method and related equipment
US20240022772A1 (en) Video processing method and apparatus, medium, and program product
CN115497448A (en) Method and device for synthesizing voice animation, electronic equipment and storage medium
CN110166844B (en) Data processing method and device for data processing
Feldman et al. Engagement with artificial intelligence through natural interaction models
JP2017182261A (en) Information processing apparatus, information processing method, and program
KR102138132B1 (en) System for providing animation dubbing service for learning language
Gonzalez et al. Passing an enhanced Turing test–interacting with lifelike computer representations of specific individuals
Verma et al. Animating expressive faces across languages
Kolivand et al. Realistic lip syncing for virtual character using common viseme set
Khan An Approach of Lip Synchronization With Facial Expression Rendering for an ECA

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20190829

Address after: 100084 Beijing, Zhongguancun East Road, building 1, No. 9, Sohu cyber building, room 9, room, room 01

Applicant after: BEIJING SOGOU TECHNOLOGY DEVELOPMENT Co.,Ltd.

Applicant after: SOGOU (HANGZHOU) INTELLIGENT TECHNOLOGY Co.,Ltd.

Address before: 100084 Beijing, Zhongguancun East Road, building 1, No. 9, Sohu cyber building, room 9, room, room 01

Applicant before: BEIJING SOGOU TECHNOLOGY DEVELOPMENT Co.,Ltd.

GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20220803

Address after: 100084. Room 9, floor 01, cyber building, building 9, building 1, Zhongguancun East Road, Haidian District, Beijing

Patentee after: BEIJING SOGOU TECHNOLOGY DEVELOPMENT Co.,Ltd.

Address before: 100084. Room 9, floor 01, cyber building, building 9, building 1, Zhongguancun East Road, Haidian District, Beijing

Patentee before: BEIJING SOGOU TECHNOLOGY DEVELOPMENT Co.,Ltd.

Patentee before: SOGOU (HANGZHOU) INTELLIGENT TECHNOLOGY Co.,Ltd.