CN110148406A - Data processing method and apparatus, and device for data processing - Google Patents

Data processing method and apparatus, and device for data processing

Info

Publication number
CN110148406A
CN110148406A
Authority
CN
China
Prior art keywords
target
sequence
image sequence
target image
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910295565.XA
Other languages
Chinese (zh)
Other versions
CN110148406B (en)
Inventor
樊博
孟凡博
刘恺
段文君
陈汉英
陈曦
陈伟
王砚峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd
Priority to CN201910295565.XA
Publication of CN110148406A
Application granted
Publication of CN110148406B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/225 - Feedback of the input speech
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 7/00 - Television systems
    • H04N 7/14 - Systems for two-way working
    • H04N 7/141 - Systems for two-way working between two video terminals, e.g. videophone

Abstract

Embodiments of the invention provide a data processing method and apparatus, and a device for data processing. The method is used for the processing of question-answer interaction and specifically includes: determining a target speech sequence and a target image sequence corresponding to a target entity image, where the mode corresponding to the target image sequence includes a listening mode or an answering mode; during the input of a question, the mode corresponding to the target image sequence is the listening mode; or, after the input of the question is completed, the mode corresponding to the target image sequence is the answering mode; and fusing the target speech sequence and the target image sequence to obtain a corresponding target video, so as to output the target video to a user. Embodiments of the invention can save labor costs, improve the working efficiency of related industries, and improve the intelligence of the target image sequence in video interaction scenarios.

Description

Data processing method and apparatus, and device for data processing
Technical field
The present invention relates to the field of computer technology, and in particular to a data processing method and apparatus, and a device for data processing.
Background technique
With the development of communication technology, communication over networks has become an important means for users. Currently, video customer service can realize remote "face-to-face" customer service, enabling smooth voice and video exchange between customer service staff and clients; video customer service can be applied to application scenarios such as e-commerce websites, enterprise websites, distance education, training websites, video shopping, video shopping guidance, and website monitoring.
In practical applications, video customer service consumes considerable labor costs of customer service staff, making the working efficiency of the customer service industry relatively low.
Summary of the invention
In view of the above problems, embodiments of the present invention propose a data processing method, a data processing apparatus, and a device for data processing that overcome, or at least partially solve, the above problems. Embodiments of the present invention can save labor costs, improve the working efficiency of related industries, and improve the intelligence of the target image sequence in video interaction scenarios.
To solve the above problems, the present invention discloses a data processing method for the processing of question-answer interaction, the method including:
determining a target speech sequence and a target image sequence corresponding to a target entity image, where the mode corresponding to the target image sequence includes a listening mode or an answering mode; during the input of a question, the mode corresponding to the target image sequence is the listening mode; or, after the input of the question is completed, the mode corresponding to the target image sequence is the answering mode; and
fusing the target speech sequence and the target image sequence to obtain a corresponding target video, so as to output the target video to a user.
In another aspect, the present invention discloses a data processing apparatus for the processing of question-answer interaction, the apparatus including:
a determining module, configured to determine a target speech sequence and a target image sequence corresponding to a target entity image, where the mode corresponding to the target image sequence includes a listening mode or an answering mode; during the input of a question, the mode corresponding to the target image sequence is the listening mode; or, after the input of the question is completed, the mode corresponding to the target image sequence is the answering mode; and
a fusion module, configured to fuse the target speech sequence and the target image sequence to obtain a corresponding target video, so as to output the target video to a user.
In yet another aspect, the present invention discloses a device for data processing, for the processing of question-answer interaction, the device including a memory and one or more programs, where the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for performing the following operations:
determining a target speech sequence and a target image sequence corresponding to a target entity image, where the mode corresponding to the target image sequence includes a listening mode or an answering mode; during the input of a question, the mode corresponding to the target image sequence is the listening mode; or, after the input of the question is completed, the mode corresponding to the target image sequence is the answering mode; and
fusing the target speech sequence and the target image sequence to obtain a corresponding target video, so as to output the target video to a user.
Embodiments of the present invention include the following advantages:
The target speech sequence of the embodiments of the present invention can match the timbre of a target speaker, and the target image sequence can be obtained on the basis of the target entity image. Thus, during video interaction, the obtained target video can interact through the target entity image with the timbre of the target speaker. Since the above target video can be generated by a machine, compared with manual video customer service, labor costs can be saved and the working efficiency of related industries can be improved.
Moreover, in the embodiments of the present invention, during the input of a question, the mode corresponding to the target image sequence is the listening mode; or, after the input of the question is completed, the mode corresponding to the target image sequence can be the answering mode; therefore, the intelligence of the target image sequence in video interaction scenarios can be improved.
Brief description of the drawings
Fig. 1 is a flow chart of the steps of Embodiment 1 of a data processing method of the present invention;
Fig. 2 is a flow chart of the steps of a mode switching method of the present invention;
Fig. 3 is a flow chart of the steps of Embodiment 2 of a data processing method of the present invention;
Fig. 4 is a structural block diagram of an embodiment of a data processing apparatus of the present invention;
Fig. 5 is a structural block diagram of a device for data processing of the present invention when implemented as equipment; and
Fig. 6 is a structural block diagram of a server side in some embodiments of the present invention.
Detailed description of the embodiments
In order to make the above objects, features, and advantages of the present invention clearer and easier to understand, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
To address the technical problem that video customer service consumes considerable labor costs of customer service staff, embodiments of the present invention provide a scheme for generating a target video by a machine. The scheme is used for the processing of question-answer interaction and may specifically include: determining a target speech sequence and a target image sequence corresponding to a target entity image, where the mode corresponding to the target image sequence may specifically include a listening mode or an answering mode; during the input of the question, the mode corresponding to the target image sequence is the listening mode; or, after the input of the question is completed, the mode corresponding to the target image sequence is the answering mode; and fusing the target speech sequence and the target image sequence to obtain a corresponding target video, so as to output the target video to the user.
The target speech sequence of the embodiments of the present invention can match the timbre of a target speaker, and the target image sequence can be obtained on the basis of the target entity image. Thus, during video interaction, the obtained target video can interact through the target entity image with the timbre of the target speaker. Since the above target video can be generated by a machine, compared with manual video customer service, labor costs can be saved and the working efficiency of related industries can be improved.
Embodiments of the present invention can be applied in video interaction scenarios to save labor costs. Video interaction scenarios may include video conference scenarios, video customer service scenarios, and so on. Video customer service can be applied to application scenarios such as e-commerce websites, enterprise websites, distance education, training websites, video shopping, video shopping guidance, and website monitoring.
The target image sequence of the embodiments of the present invention can be obtained on the basis of the target entity image; in other words, embodiments of the present invention can endow the target entity image with the image features (entity state features) corresponding to a mode, so as to obtain the target image sequence.
In the embodiments of the present invention, the mode corresponding to the target image sequence may include an answering mode or a listening mode, which can improve the intelligence of the target image sequence in video interaction scenarios.
The answering mode may refer to a mode in which a question is answered through the target video, and may correspond to a first entity state. In the answering mode, the target entity image corresponding to the target video can read aloud the answer text corresponding to the question through the target speech sequence, and express the emotion in the process of reading the answer text aloud through the first entity state corresponding to the target image sequence.
The listening mode may refer to a mode of listening while the user inputs a question, and may correspond to a second entity state. In the listening mode, the target entity image corresponding to the target video can express the emotion in the listening process through the second entity state corresponding to the target image sequence. The second entity state may include features such as nodding. Optionally, in the listening mode, listening-state text such as "uh-huh" or "please continue" can also be expressed through the target speech sequence.
In the embodiments of the present invention, during the input of the question, the mode corresponding to the target image sequence is the listening mode; or, after the input of the question is completed, the mode corresponding to the target image sequence can be the answering mode.
Embodiments of the present invention can switch the mode corresponding to the target image sequence according to whether the input of the question is completed. Optionally, if no input is received from the user within a preset duration, the input of the question can be considered completed.
In practical applications, TTS (Text To Speech) technology can be used to convert text into the target speech corresponding to the target speech sequence, and the target speech sequence can be characterized in the form of a waveform. It can be understood that a target speech sequence meeting the demand can be obtained according to speech synthesis parameters.
Optionally, the speech synthesis parameters may include at least one of a timbre parameter, a pitch parameter, and a loudness parameter.
The timbre parameter may refer to the distinguishing characteristic that the frequencies of different sounds show in terms of waveform. Different speakers usually correspond to different timbres, so a target speech sequence matching the timbre of the target speaker can be obtained according to the timbre parameter. The target speaker can be specified by the user; for example, the target speaker can be a specified media worker. In practical applications, the timbre parameter of the target speaker can be obtained from audio of the target speaker of a preset length.
The pitch parameter can characterize the tone and is measured by frequency. The loudness parameter, also called sound intensity or volume, may refer to the magnitude of the sound and is measured in decibels (dB).
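By way of illustration only, the following minimal Python sketch shows how such synthesis parameters might be represented and applied to an already synthesized waveform. The parameter names, the librosa-based pitch shifting, and the clipping step are assumptions of this sketch, not part of the disclosed method.

```python
# Sketch: representing timbre/pitch/loudness parameters and post-processing a
# TTS waveform with them. librosa and numpy are assumed available.
from dataclasses import dataclass

import librosa
import numpy as np


@dataclass
class SynthesisParams:
    timbre_id: str                   # identifies the target speaker's timbre (assumed key)
    pitch_shift_steps: float = 0.0   # pitch parameter, in semitones
    loudness_gain_db: float = 0.0    # loudness parameter, in decibels


def apply_params(waveform: np.ndarray, sr: int, params: SynthesisParams) -> np.ndarray:
    """Adjust pitch and loudness of a synthesized waveform."""
    if params.pitch_shift_steps:
        waveform = librosa.effects.pitch_shift(
            waveform, sr=sr, n_steps=params.pitch_shift_steps)
    gain = 10.0 ** (params.loudness_gain_db / 20.0)  # dB gain -> amplitude factor
    return np.clip(waveform * gain, -1.0, 1.0)
```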
Embodiments of the present invention can use the following determination methods to determine the target speech sequence corresponding to a target language feature, where the target language feature corresponds to the question-related text:
Determination method 1: searching a first speech library for first speech units matching the target language feature, and splicing the first speech units to obtain the target speech sequence.
Determination method 2: determining a target acoustic feature corresponding to the target language feature, searching a second speech library for second speech units matching the target acoustic feature, and splicing the second speech units to obtain the target speech sequence.
Acoustic features can characterize speech from the perspective of sound production.
Acoustic features may include, but are not limited to, the following features:
prosodic features (supra-segmental features / paralinguistic features), specifically including duration-related features, fundamental-frequency-related features, energy-related features, etc.;
voice quality features;
spectrum-based correlation analysis features, which embody the correlation between vocal tract shape changes and articulatory movements; currently used spectrum-based features mainly include Linear Prediction Cepstrum Coefficients (LPCC) and Mel-Frequency Cepstrum Coefficients (MFCC), as in the sketch following this list.
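As a non-limiting sketch of extracting such acoustic features, the following Python fragment computes MFCCs and a fundamental-frequency (prosodic) track with librosa; the file name, sample rate, and feature dimensions are illustrative assumptions.

```python
import librosa

# "speech_sample.wav" is a hypothetical recording of the target speaker.
y, sr = librosa.load("speech_sample.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # spectrum-based features, (13, n_frames)
# Fundamental-frequency track, one frame-level prosodic feature.
f0, voiced_flag, voiced_prob = librosa.pyin(y, fmin=60, fmax=400, sr=sr)
```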
Determination method 3: using an end-to-end speech synthesis method, where the source side of the end-to-end speech synthesis method may include text or the language feature corresponding to the text, and the target side can be the target speech sequence in waveform form.
In an optional embodiment of the present invention, the end-to-end speech synthesis method can use a neural network. The neural network may include a single-layer RNN (Recurrent Neural Network) and dual output layers, where the dual output layers are used to predict 16-bit speech output. The state of the RNN is divided into two parts: a first (high 8 bits) state and a second (low 8 bits) state. The first state and the second state are fed into their respective output layers; the second state is obtained based on the first state, and the first state is obtained based on the 16 bits of the previous moment. By designing the first state and the second state into the network structure, the neural network can accelerate training and simplify the training process, so the computation load of the neural network can be reduced, which in turn makes the end-to-end speech synthesis method suitable for mobile terminals with limited computing resources, such as mobile phones.
It can be understood that those skilled in the art can use any one or a combination of the above determination methods 1 to 3 according to practical application requirements; embodiments of the present invention place no restriction on the specific process of determining the target speech sequence corresponding to the target language feature.
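For illustration, the following minimal Python sketch shows determination method 1 only: splicing library speech units matched to language features. The phoneme keys and the zero-filled waveform fragments are placeholders; a real speech library would store recorded units and use a richer matching criterion.

```python
# Sketch of unit-selection concatenation (determination method 1).
import numpy as np

speech_library: dict[str, np.ndarray] = {
    # language feature (here: phoneme) -> recorded waveform fragment (stand-ins)
    "n": np.zeros(800), "i": np.zeros(1200), "h": np.zeros(600), "ao": np.zeros(1400),
}


def synthesize_by_concatenation(language_features: list[str]) -> np.ndarray:
    """Splice the first speech units matching each language feature."""
    units = [speech_library[f] for f in language_features if f in speech_library]
    return np.concatenate(units) if units else np.zeros(0)


target_speech = synthesize_by_concatenation(["n", "i", "h", "ao"])  # e.g. "ni hao"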
The target image sequence can be used to characterize an entity image. An entity is a thing that is distinguishable and exists independently; entities may include people, robots, animals, plants, etc. Embodiments of the present invention mainly take a person as an example to describe the target image sequence; the target image sequences corresponding to other entities can be treated analogously. The entity image corresponding to a person can be called a portrait.
From the perspective of entity state, the above image features may include entity state features, and entity state features can reflect the features of an image sequence in terms of entity state.
Optionally, the above entity state features may include at least one of the following features:
expression features;
lip features; and
limb features.
An expression conveys emotion and affection, and may refer to the thoughts and feelings shown on the face.
Expression features are usually directed at the entire face. Lip features can be specific to the lips and are related to the textual content, the speech, the manner of articulation, etc., so they can improve the naturalness of the expression corresponding to the image sequence.
Limb features can convey a character's thoughts through the coordinated activity of body parts such as the head, eyes, neck, hands, elbows, arms, torso, hips, and feet, so as to communicate views visually. Limb features may include turning the head, shrugging, gestures, etc., and can improve the richness of the expression corresponding to the image sequence. For example, at least one arm hangs down naturally when speaking, and at least one arm rests naturally on the abdomen when silent.
The data processing method provided by the embodiments of the present invention can be applied in an application environment corresponding to a client and a server side, where the client and the server side are located in a wired or wireless network, and the client exchanges data with the server side through the wired or wireless network.
Optionally, the client can run on a terminal. The above terminal specifically includes, but is not limited to: a smart phone, a tablet computer, an e-book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop portable computer, an in-vehicle computer, a desktop computer, a set-top box, a smart TV set, a wearable device, etc.
A client corresponds to a server side and provides local services for users. The client in the embodiments of the present invention can provide the target video, and the target video can be generated by the client or the server side; embodiments of the present invention place no restriction on the specific client.
In an embodiment of the present invention, the client can determine the target speaker information and the target entity image information selected by the user through human-computer interaction operations, and upload the target speaker information and the target entity image information to the server side, so that the server side generates the target video corresponding to the target speaker and the target entity image; moreover, the client can output the target video to the user.
Method Embodiment 1
Referring to Fig. 1, a flow chart of the steps of Embodiment 1 of a data processing method of the present invention is shown, which is used for the processing of question-answer interaction and may specifically include the following steps:
Step 101: determining a target speech sequence and a target image sequence corresponding to a target entity image; the mode corresponding to the target image sequence may include a listening mode or an answering mode; during the input of a question, the mode corresponding to the target image sequence can be the listening mode; or, after the input of the question is completed, the mode corresponding to the target image sequence can be the answering mode;
Step 102: fusing the target speech sequence and the target image sequence to obtain a corresponding target video, so as to output the target video to a user.
In the embodiments of the present invention, the target entity image can be specified by the user. For example, the target entity image can be the image of a target entity, and the target entity may include a celebrity figure (such as a host); of course, the target entity can be an arbitrary entity, such as a robot or an ordinary person.
The target speaker and the target entity of the embodiments of the present invention can be identical. For example, the user uploads a first video, which may include the speech of the target speaker and the target entity image. Alternatively, the target speaker and the target entity of the embodiments of the present invention can be different. For example, the user uploads a second video and a first audio, where the second video may include the target entity image and the first audio may include the speech of the target speaker.
Embodiments of the present invention can switch the mode corresponding to the target image sequence according to whether the input of the question is completed. Optionally, if no input is received from the user within a preset duration, the input of the question can be considered completed.
In an optional embodiment of the present invention, the mode corresponding to the target image sequence can be switched according to transition image samples, so as to improve the fluency of switching.
The transition image samples may include first transition image samples. A first transition image sample may include an image sample corresponding to the listening mode and an image sample corresponding to the answering mode that appear in succession. By learning from the first transition image samples, the rule of switching from the listening mode to the answering mode can be obtained, thereby improving the fluency of switching from the listening mode to the answering mode.
The transition image samples may include second transition image samples. A second transition image sample may include an image sample corresponding to the answering mode and an image sample corresponding to the listening mode that appear in succession. By learning from the second transition image samples, the rule of switching from the answering mode to the listening mode can be obtained, thereby improving the fluency of switching from the answering mode to the listening mode.
Referring to Fig. 2, a flow chart of the steps of a mode switching method of the present invention is shown, which is used for the processing of question-answer interaction and may specifically include the following steps (a sketch of this loop follows after the steps):
Step 201: in the listening mode, playing a first target video and receiving the question input by the user;
The first target video can correspond to the listening mode and can be obtained from a first target speech sequence and a first target image sequence, where the first target image sequence can correspond to the listening mode.
Step 202: judging whether the input of the question is completed; if so, executing step 203; otherwise, returning to step 201;
Step 203: setting the mode corresponding to the target image sequence to the answering mode, and playing a second target video;
Step 204: after the second target video finishes playing, setting the mode corresponding to the target image sequence back to the listening mode.
The second target video can correspond to the answering mode and can be obtained from a second target speech sequence and a second target image sequence, where the second target image sequence can correspond to the answering mode.
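The following minimal Python sketch models this switching loop, assuming the timeout heuristic described above for judging that the input of the question is completed. The helper functions and the two-second preset duration are hypothetical stand-ins, not part of the disclosure.

```python
# Sketch of the Fig. 2 mode switching loop with a timeout-based completion test.
import time
from enum import Enum, auto


class Mode(Enum):
    LISTENING = auto()
    ANSWERING = auto()


PRESET_DURATION = 2.0  # seconds of silence that end the question (assumed value)


def play_video(name: str) -> None:                 # stub for video playback
    print(f"[playing {name}]")


def receive_input(timeout: float) -> str | None:   # stub for a real input source
    time.sleep(timeout)
    return None


def answer_video(question: str) -> str:            # stub: second target video
    return f"answer_video({question!r})"


def interaction_loop() -> None:
    mode = Mode.LISTENING
    question_parts: list[str] = []
    last_input = time.monotonic()
    while True:
        if mode is Mode.LISTENING:
            play_video("first_target_video")       # step 201
            part = receive_input(timeout=0.1)
            if part:
                question_parts.append(part)
                last_input = time.monotonic()
            elif question_parts and time.monotonic() - last_input > PRESET_DURATION:
                mode = Mode.ANSWERING              # step 202 -> step 203
        else:
            play_video(answer_video(" ".join(question_parts)))  # step 203
            question_parts.clear()
            mode = Mode.LISTENING                  # step 204
```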
It can be understood that the above output of the target video is merely an optional embodiment. In fact, embodiments of the present invention can output a link to the target video to the user, so that the user determines whether to play the above target video.
Optionally, embodiments of the present invention can also output the target speech sequence, or a link to the target speech sequence, to the user.
Optionally, embodiments of the present invention can also output question-related text to the user. The question-related text may include answer text or listening-state text, where the answer text can correspond to the answering mode and the listening-state text can correspond to the listening mode.
In an optional embodiment of the present invention, the above question-answer interaction can correspond to a communication window, in which at least one of the following pieces of information can be displayed: a link to the target speech sequence, the answer text of the question, and a link to the target video. The link to the target video can be displayed in the identification area of the communication peer. The identification area can be used to display information such as the nickname, ID (Identity), and avatar of the communication peer.
In an optional embodiment of the present invention, step 101 of determining the target speech sequence and the target image sequence corresponding to the target entity image may specifically include: determining the target speech sequence and the target image sequence corresponding to the target entity image according to the question-related text.
In practical applications, the question input by the user can be in speech form, text form, or image form. Speech recognition technology can be used to convert a question in speech form into a question in text form. Alternatively, optical character recognition technology can be used to convert a question in image form into a question in text form.
Optionally, the determination process of the answer text may include: determining a first representation vector corresponding to the question; determining a target preset question corresponding to the question according to the matching degree between the first representation vector and the second representation vectors corresponding to preset questions; and determining the answer corresponding to the question according to the answer corresponding to the target preset question.
Embodiments of the present invention can determine the target preset question according to the matching degree between the first representation vector corresponding to the question and the second representation vectors corresponding to the preset questions, and then determine the answer text corresponding to the question according to the answer corresponding to the target preset question.
Since the target preset question is a curated question, its corresponding answer often has reasonableness and validity, and the target preset question matches the question; therefore, the answer corresponding to the target preset question can serve as the basis for determining the answer corresponding to the question, which in turn can improve the accuracy of the answer corresponding to the question.
In the embodiments of the present invention, optionally, preset questions and their corresponding answers can be saved through a knowledge base. The first representation vector can then be matched against the second representation vectors corresponding to the preset questions in the knowledge base, so as to obtain the corresponding matching degrees.
Optionally, embodiments of the present invention can convert text into a vector representation of fixed length for ease of processing. The first representation vector can be used to represent the question, and a second representation vector can be used to represent a preset question. The dimensionality of the first representation vector or the second representation vector can be one-dimensional, two-dimensional, or three-dimensional.
The type of the first representation vector or the second representation vector may include: a one-hot vector, a word embedding vector, or an advanced representation vector. Word embedding is finding a mapping or function that generates a representation in a new space; this representation is the word representation.
In an optional embodiment of the present invention, the determination process of the first representation vector may include: determining the word embedding vectors corresponding to the question, and processing the word embedding vectors with a neural network to obtain the advanced representation vector corresponding to the question. Processing the word embedding vectors with a neural network can extract deep features of the word embedding vectors, so the richness of the first representation vector can be improved. Optionally, neural networks such as a CNN (Convolutional Neural Network) or an LSTM (Long Short-Term Memory network) can be used to process the word embedding vectors. Since the determination process of a second representation vector is similar to that of the first representation vector, it is not repeated here; the two can refer to each other.
The matching degree between the first representation vector and the second representation vector corresponding to a preset question can be determined using a similarity measure between vectors. The above similarity measure may include cosine similarity, Euclidean distance, etc.
Embodiments of the present invention can take one or more preset questions with the highest matching degree as the target preset question.
In an optional embodiment of the present invention, the first keyword corresponding to the question matches the second keyword corresponding to the preset question. Since the number of preset questions in the knowledge base is usually large, embodiments of the present invention can first screen the preset questions in the knowledge base based on the matching between the first keyword and the second keyword, and then determine the target preset question from the preset questions that pass the screening. The above screening can reduce the computation load and in turn improve the computation speed.
In one embodiment, the preset questions in the knowledge base can be screened based on the matching between the first keyword and the second keyword; suppose the preset questions that pass the screening are first preset questions. The target preset question corresponding to the question can then be determined according to the matching degree between the first representation vector and the second representation vectors corresponding to the first preset questions. The matching condition between a first preset question and the question may include: matching domain keywords, and/or matching intent keywords, and/or matching slot keywords.
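By way of illustration, the following Python sketch combines the two stages: screening the preset questions in the knowledge base by keyword overlap, then selecting the target preset question by cosine similarity between representation vectors. The embedding function, the knowledge-base entries, and the answer strings are invented placeholders.

```python
# Sketch: keyword screening followed by representation-vector matching.
import numpy as np


def embed(text: str) -> np.ndarray:
    """Hypothetical fixed-length representation vector for a text."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(64)


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


knowledge_base = [  # illustrative entries: preset question, keywords, answer
    {"q": "computer blue screen on boot", "keywords": {"computer", "blue screen"},
     "answer": "Check the most recently installed driver."},
    {"q": "printer does not respond", "keywords": {"printer"},
     "answer": "Verify the printer is online and the queue is not paused."},
]


def find_answer(question: str, first_keywords: set[str]) -> str | None:
    # Stage 1: screen preset questions whose second keywords overlap.
    candidates = [e for e in knowledge_base if e["keywords"] & first_keywords]
    if not candidates:
        return None
    # Stage 2: pick the target preset question by representation-vector match.
    q_vec = embed(question)
    best = max(candidates, key=lambda e: cosine(q_vec, embed(e["q"])))
    return best["answer"]


print(find_answer("my computer shows a blue screen", {"computer", "blue screen"}))
```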
In the embodiments of the present invention, optionally, the first keyword corresponding to the question may specifically include:
a domain keyword; and/or
an intent keyword; and/or
a slot keyword.
In the embodiments of the present invention, a domain can indicate the scope of the data. Optionally, a domain can indicate the application scenario or category of the data. Domains may include, but are not limited to: printers, computers, encyclopedias, news, music, video, film and television, games, sports, e-commerce, education and learning, FM (Frequency Modulation), SMS (Short Messaging Service), control, travel, books, weather, image galleries, etc. It can be understood that a domain can be subdivided to obtain subdivided domains. For example, subdivided domains of the encyclopedia domain may include the word senses corresponding to a polysemous encyclopedia entry, etc. Optionally, a domain can be related to a corresponding APP or service; embodiments of the present invention place no restriction on specific domains.
Embodiments of the present invention can recognize domain keywords from the text corresponding to the question. Optionally, the text corresponding to the question can be segmented into words, and the word segmentation result can be matched against domain keywords. Alternatively, a classification model can be used to determine the domain to which the question belongs.
The above classification model can be a machine learning model. Broadly speaking, machine learning is a method that can endow a machine with the ability to learn, allowing it to accomplish functions that direct programming cannot. In a practical sense, machine learning is a method of training a model with data and then using the model for prediction. Machine learning methods may include: decision tree methods, linear regression methods, logistic regression methods, neural network methods, k-nearest-neighbor methods, etc. It can be understood that embodiments of the present invention place no restriction on specific machine learning methods. The above classification model can have the ability to classify domains.
An intent (Intent) is a judgment of a sentence expressed by the user, i.e., judging what kind of task the user wishes to complete. Optionally, a classification model can be used to determine the intent keyword corresponding to the question.
A slot (Slot) is the definition of key information in the user's expression. For example, in an expression for booking a flight ticket, the slots may include "departure time", "origin", "destination", etc. For another example, in an expression about a computer fault, the slots may include "blue screen", etc.
In the embodiments of the present invention, optionally, intent extraction technology can be used to determine the intent keyword corresponding to the question. Optionally, slot filling technology can be used to determine the slot keyword corresponding to the question. These are not repeated here.
Any of the domain keyword, intent keyword, and slot keyword can reflect the information of the question, so any one or a combination of the domain keyword, intent keyword, and slot keyword can serve as the first keyword corresponding to the question.
Similarly, the second keyword may specifically include:
a domain keyword; and/or
an intent keyword; and/or
a slot keyword.
Embodiments of the present invention can save the corresponding second keywords for the preset questions in the knowledge base, as in the sketch below.
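A minimal sketch of such stored keywords follows, modeling the first and second keywords as domain/intent/slot sets and treating any overlap as a match. The field names, example values, and the OR-combination are assumptions of this sketch, since the patent allows various "and/or" combinations.

```python
# Sketch of the domain/intent/slot keyword structure and its matching test.
from dataclasses import dataclass, field


@dataclass
class Keywords:
    domain: set[str] = field(default_factory=set)
    intent: set[str] = field(default_factory=set)
    slot: set[str] = field(default_factory=set)

    def matches(self, other: "Keywords") -> bool:
        # Any of the three keyword kinds overlapping counts as a match here;
        # stricter AND-combinations are equally possible under the patent.
        return bool(self.domain & other.domain
                    or self.intent & other.intent
                    or self.slot & other.slot)


first = Keywords(domain={"computer"}, intent={"troubleshoot"}, slot={"blue screen"})
second = Keywords(domain={"computer"}, slot={"blue screen"})
print(first.matches(second))  # True -> the preset question passes the screening
```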
In the embodiments of the present invention, the text can involve at least two languages, for example at least two of Chinese, Japanese, Korean, English, French, German, Arabic, and other languages. The target speech sequence and target image sequence then also involve at least two languages, so embodiments of the present invention can be applied to multilingual video interaction scenarios.
For example, in a video customer service scenario, the text can be the question text input by the user, and the question text may include a first language that is the user's mother tongue and a second language that is not. For example, if the question text relates to a computer fault, the question text may include the English text corresponding to the computer fault and the Chinese text summarized by the user.
For another example, in a video conference scenario, the text can be a conference speech transcript, which may include the multiple languages corresponding to multilingual users.
It can be understood that text involving at least two languages can be applied to arbitrary video interaction scenarios; embodiments of the present invention place no restriction on specific video interaction scenarios.
According to one embodiment, the determining of the target speech sequence and the target image sequence corresponding to the target entity image may specifically include: determining the target speech sequence corresponding to the question-related text; and determining the target image sequence corresponding to the target speech sequence according to a mapping relationship between speech feature sequences and image feature sequences. In the mapping relationship, the speech feature sequence is aligned with the image feature sequence on the time axis, and the mapping relationship is obtained from speech samples and image samples aligned on the time axis.
A speech feature sequence may include language features and/or acoustic features.
Language features may include phoneme features. A phoneme is the smallest speech unit divided according to the natural attributes of speech; analyzed by the articulatory actions within a syllable, one action constitutes one phoneme. Phonemes may include vowels and consonants.
Acoustic features can characterize speech from the perspective of sound production.
Acoustic features may include, but are not limited to, the following features:
prosodic features (supra-segmental features / paralinguistic features), specifically including duration-related features, fundamental-frequency-related features, energy-related features, etc.;
voice quality features;
spectrum-based correlation analysis features, which embody the correlation between vocal tract shape changes and articulatory movements; currently used spectrum-based features mainly include Linear Prediction Cepstrum Coefficients (LPCC) and Mel-Frequency Cepstrum Coefficients (MFCC).
Embodiments of the present invention can obtain the mapping relationship between speech feature sequences and image feature sequences from speech samples and image samples aligned on the time axis.
There are rules between speech feature sequences and image feature sequences. For example, a specific phoneme feature corresponds to a specific lip feature; for another example, a specific prosodic feature corresponds to a specific expression feature; or, a specific phoneme feature corresponds to a specific limb feature, etc.
Therefore, embodiments of the present invention can obtain the mapping relationship from speech samples and image samples aligned on the time axis, so as to reflect the rules between speech feature sequences and image feature sequences through the mapping relationship.
The rules between speech feature sequences and image feature sequences reflected by the mapping relationship can be applicable to arbitrary languages, and therefore can be applicable to text corresponding to at least two languages.
Embodiments of the present invention can use an end-to-end machine learning method to learn from speech samples and image samples aligned on the time axis, so as to obtain the above mapping relationship. The input of the end-to-end machine learning method can be a speech sequence, and the output can be an image sequence; through learning from training data, this method can obtain the rules between the input features and the output features.
Broadly speaking, machine learning is a method that can endow a machine with the ability to learn, allowing it to accomplish functions that direct programming cannot. In a practical sense, machine learning is a method of training a model with data and then using the model for prediction. Machine learning methods may include: decision tree methods, linear regression methods, logistic regression methods, neural network methods, etc. It can be understood that embodiments of the present invention place no restriction on specific machine learning methods.
The alignment of speech samples and image samples on the time axis can improve the synchronism between speech features and image features.
In an embodiment of the present invention, the speech samples and the image samples can come from the same video file, whereby the alignment of the speech samples and the image samples on the time axis can be realized. For example, recorded video files can be collected, and a video file may include the speech of a speaker and the video pictures of the speaker.
In another embodiment of the present invention, the speech samples and the image samples can come from different files. Specifically, the speech samples can come from an audio file, and the image samples can come from a video file or an image file, where an image file may include multiple frames of images. In such cases, time-axis alignment can be performed on the speech samples and the image samples to obtain speech samples and image samples aligned on the time axis.
It can be understood that the above end-to-end machine learning method is merely an optional embodiment of the method for determining the mapping relationship. In fact, those skilled in the art can use other methods to determine the mapping relationship according to practical application requirements; for example, another method can be a statistical method, etc. Embodiments of the present invention place no restriction on the specific method of determining the mapping relationship.
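For illustration, the following PyTorch sketch trains a sequence model from time-aligned speech features to image features. The LSTM architecture, the feature dimensions (e.g. 13 speech features in, 52 entity-state coefficients out), and the random stand-in tensors are assumptions of this sketch, since the disclosure leaves the concrete end-to-end method open.

```python
# Sketch: learning the speech-feature -> image-feature mapping end to end.
import torch
from torch import nn

SPEECH_DIM, IMAGE_DIM = 13, 52  # assumed feature sizes


class SpeechToImage(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.rnn = nn.LSTM(SPEECH_DIM, 128, batch_first=True)
        self.head = nn.Linear(128, IMAGE_DIM)

    def forward(self, speech_feats: torch.Tensor) -> torch.Tensor:
        # speech_feats: (batch, frames, SPEECH_DIM), time-aligned with targets
        hidden, _ = self.rnn(speech_feats)
        return self.head(hidden)                  # (batch, frames, IMAGE_DIM)


model = SpeechToImage()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# One training step on time-aligned (speech, image) sample pairs (stand-ins).
speech = torch.randn(8, 100, SPEECH_DIM)
image = torch.randn(8, 100, IMAGE_DIM)
loss = loss_fn(model(speech), image)
loss.backward()
optimizer.step()
```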
The target image sequence of the embodiments of the present invention can be obtained on the basis of the target entity image; in other words, embodiments of the present invention can endow the target entity image with the image features (entity state features) corresponding to the target speech sequence, so as to obtain the target image sequence. The target entity image can be specified by the user; for example, the target entity image can be the image of a celebrity figure (such as a host).
In summary, the target image sequence corresponding to the target speech sequence of the embodiments of the present invention is obtained according to the mapping relationship, and the rules between speech feature sequences and image feature sequences reflected by the mapping relationship can be applicable to arbitrary languages, and therefore can be applicable to text corresponding to at least two languages.
According to another embodiment, the determining of the target speech sequence and the target image sequence corresponding to the target entity image may specifically include: determining the duration features corresponding to the question-related text; determining the target speech sequence corresponding to the question-related text according to the duration features; and determining the target image sequence corresponding to the question-related text according to the duration features, where the target image sequence is obtained according to text samples and their corresponding image samples.
There are rules between text feature sequences and image feature sequences. Text features may include phoneme features and/or semantic features, etc.
A phoneme is the smallest speech unit divided according to the natural attributes of speech; analyzed by the articulatory actions within a syllable, one action constitutes one phoneme. Phonemes may include vowels and consonants. Optionally, a specific phoneme feature corresponds to a specific lip feature, expression feature, or limb feature, etc.
Semantics is the meaning of the real-world concepts represented by the things that the text to be processed corresponds to, the relationships between these meanings, and the interpretation and logical representation of the text to be processed in some domain. Optionally, a specific semantic feature corresponds to a specific limb feature, etc.
Therefore, embodiments of the present invention can obtain the mapping relationship between text feature sequences and image feature sequences according to text samples and their corresponding image samples, so as to reflect the rules between text feature sequences and image feature sequences through the mapping relationship.
The image samples corresponding to a text sample may include multiple frames of images in the process of expressing the text sample (such as reading the text sample aloud). The image samples corresponding to a text sample can be carried in a video sample; alternatively, the image samples corresponding to a text sample can be carried in multiple frames of images. The above image samples can correspond to the target entity image, and the target entity image can be specified by the user; for example, the target entity image can be the image of a celebrity figure (such as a host); of course, the target entity image can be the image of any entity, such as a robot or an ordinary person.
The above text samples may include all languages that the text to be processed involves; therefore, the target image sequence obtained according to the above text samples and their image samples can be applicable to text to be processed corresponding to at least two languages.
Embodiments of the present invention can use an end-to-end machine learning method to learn from text samples and their corresponding image samples, so as to obtain the above mapping relationship. The input of the end-to-end machine learning method can be the text to be processed, and the output can be the target image sequence; through learning from training data, this method can obtain the rules between the input features and the output features.
Broadly speaking, machine learning is a method that can endow a machine with the ability to learn, allowing it to accomplish functions that direct programming cannot. In a practical sense, machine learning is a method of training a model with data and then using the model for prediction. Machine learning methods may include: decision tree methods, linear regression methods, logistic regression methods, neural network methods, etc. It can be understood that embodiments of the present invention place no restriction on specific machine learning methods.
It can be understood that the above end-to-end machine learning method is merely an optional embodiment of the method for determining the mapping relationship. In fact, those skilled in the art can use other methods to determine the mapping relationship according to practical application requirements; for example, another method can be a statistical method, etc. Embodiments of the present invention place no restriction on the specific method of determining the mapping relationship.
In the embodiments of the present invention, the duration features corresponding to the text to be processed are used both in the determination of the target speech sequence and in the determination of the target image sequence; the duration features can improve the synchronism between the target speech sequence and the target image sequence.
Duration features can be used to characterize the duration of the phonemes corresponding to the text. Duration features can depict the cadence and rhythm in speech, which in turn can improve the expressiveness and naturalness of the synthesized speech. Optionally, a duration model can be used to determine the duration features corresponding to the answer text. The input of the duration model can be phoneme features with stress labels, and the output is phoneme durations. The duration model can be obtained by learning from speech samples carrying duration information; embodiments of the present invention place no restriction on the specific duration model.
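As a non-limiting sketch, the following Python fragment fits a simple regression duration model mapping phoneme features with stress labels to phoneme durations. The feature encoding, the toy training rows, and the choice of scikit-learn's gradient boosting regressor are assumptions, since the patent places no restriction on the specific duration model.

```python
# Sketch of a duration model: (phoneme features + stress label) -> duration.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Training rows: (phoneme id, stressed?, position in word) -> duration (frames).
X_train = np.array([[3, 1, 0], [7, 0, 1], [3, 0, 2], [9, 1, 0]])
y_train = np.array([14.0, 6.0, 9.0, 16.0])  # durations from labeled speech samples

duration_model = GradientBoostingRegressor().fit(X_train, y_train)

# Predict durations for the phoneme sequence of an answer text; these durations
# align the target speech sequence and the target image sequence on the time axis.
answer_phonemes = np.array([[3, 1, 0], [7, 0, 1]])
print(duration_model.predict(answer_phonemes))
```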
The expression characteristics of different languages are usually different. The above expression characteristics may include vocalization features, the use of force and breath, and lip features (such as mouth shape and mouth shape posture). For example, the vocalization features of Chinese may include front-of-mouth vocalization: the front of the oral cavity is relatively tense, and the place of sound production is in the front of the oral cavity. For another example, the vocalization features of another language such as English may include back-of-mouth vocalization: the back of the oral cavity is relatively tense and more open, and the place of sound production is in the back of the oral cavity.
In step 103, the target image sequence corresponding to the answer text is obtained according to text samples and their corresponding image samples, and the languages corresponding to the above text samples may include all languages that the answer text involves. Therefore, the target image sequence obtained according to the above text samples and their image samples can make the expression characteristics corresponding to the target image sequence adapt to the at least two languages corresponding to the answer text. For example, the above sample to be processed involves a first language and a second language, and the above text samples involve the first language, the second language, a third language, etc.
In an optional embodiment of the present invention, determining the target image sequence corresponding to the answer text may specifically include: determining the target image feature sequence corresponding to the target text feature sequence according to the target text feature sequence corresponding to the answer text and the mapping relationship between text feature sequences and image feature sequences, and then determining the target image sequence corresponding to the target image feature sequence.
The mapping relationship between text feature sequences and image feature sequences can reflect the rules between text feature sequences and image feature sequences.
Text features may include language features and duration features. Image features are used to characterize the target entity image and may specifically include the aforementioned entity state features.
In an optional embodiment of the present invention, the above determining of the target image sequence corresponding to the target image feature sequence may specifically include: synthesizing the target entity image and the target image feature sequence to obtain the target image sequence, whereby the target image feature sequence can be endowed to the target entity image.
The target entity image can be specified by the user; for example, the target entity image can be the image of a celebrity figure (such as a host).
The target entity image may carry no entity state; synthesizing the target entity image and the target image feature sequence can make the target image sequence carry the entity state matching the text, which in turn can improve the naturalness and richness of the entity state in the target video.
In the embodiments of the present invention, optionally, a three-dimensional model corresponding to the target entity image and the target image feature sequence can be synthesized to obtain the target image sequence. The three-dimensional model can be obtained by performing three-dimensional reconstruction on multiple frames of the target entity image.
In practical applications, entities usually exist in the form of three-dimensional geometric bodies. A traditional two-dimensional image creates a visual sense of spatial solidity through light-dark contrast and perspective relationships, but cannot produce a natural, immersive three-dimensional perception. The spatial modeling of a three-dimensional image is close to the prototype: it not only has the three-dimensional spatial geometric features of height, width, and depth, but also carries lifelike, true state information, providing the sense of reality that a flat picture cannot and giving a warm, lifelike feeling.
In computer graphics, a three-dimensional model is usually used to model an entity. A three-dimensional model corresponds to an entity in space and can be displayed by a computer or other video equipment.
The features corresponding to a three-dimensional model may include: geometric features, texture features, entity state features, etc.; entity state features may include expression features, lip features, limb features, etc. Geometric features are usually represented by polygons or voxels, where polygons express the geometric part of the three-dimensional model, i.e., the curved surfaces of the entity are represented or approximated with polygons. The basic objects are vertices in three-dimensional space: a straight line connecting two vertices is called an edge, and three vertices connected by three edges form a triangle, the simplest polygon in Euclidean space. Multiple triangles can compose more complex polygons or generate a single entity with more than three vertices. Quadrilaterals and triangles are the most common shapes in polygon-represented three-dimensional models. In the representation of three-dimensional models, the triangle-mesh three-dimensional model has become a popular choice because of features such as its simple data structure and ease of being drawn by all graphics hardware devices; each triangle is a surface, so a triangle is also called a triangular facet.
A three-dimensional model can have a default entity state and densely corresponding point cloud data; the default entity state may include a neutral expression, a closed-lip state, a drooping-arm state, etc.
Synthesizing the three-dimensional model corresponding to the target entity image with the target image feature sequence can be realized by modifying vertex positions on the three-dimensional model, etc. The synthesis methods used may specifically include keyframe interpolation methods, parameterization methods, etc. A keyframe interpolation method can interpolate between the image features of keyframes. A parameterization method can describe changes of entity state through the parameters of the three-dimensional model, and different entity states are obtained by adjusting these parameters.
When a keyframe interpolation method is used, embodiments of the present invention can obtain interpolation vectors according to the target image feature sequence (see the sketch after this discussion). When a parameterization method is used, embodiments of the present invention can obtain parameter vectors according to the target image feature sequence.
It can be understood that the above keyframe interpolation methods and parameterization methods are merely optional embodiments of the synthesis method. In fact, those skilled in the art can use a required synthesis method according to practical application requirements; the embodiments of the present application place no restriction on the specific synthesis method.
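By way of illustration of the keyframe interpolation method, the following Python sketch linearly interpolates image feature vectors between keyframes; the frame count and the feature layout are illustrative assumptions.

```python
# Sketch of keyframe interpolation over image feature vectors: entity state
# features are fixed at keyframes, intermediate frames are linearly blended.
import numpy as np


def interpolate_keyframes(keyframes: dict[int, np.ndarray], n_frames: int) -> np.ndarray:
    """keyframes maps frame index -> image feature vector (e.g. lip/expression)."""
    idx = sorted(keyframes)
    out = np.zeros((n_frames, keyframes[idx[0]].shape[0]))
    for a, b in zip(idx, idx[1:]):
        for t in range(a, b + 1):
            w = (t - a) / (b - a)  # linear interpolation weight
            out[t] = (1 - w) * keyframes[a] + w * keyframes[b]
    return out


# Two keyframes: closed lips at frame 0, open lips at frame 10.
features = interpolate_keyframes({0: np.zeros(4), 10: np.ones(4)}, n_frames=11)
print(features[5])  # halfway entity state: [0.5 0.5 0.5 0.5]
```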
In determining the image features corresponding to the target image sequence, the embodiment of the present invention exploits the regularities between text feature sequences and image feature sequences. The image features may include at least one of expression features, lip features, and limb features.
To improve the accuracy of the image features corresponding to the target image sequence, the embodiment of the present invention may also extend or adjust those image features.
In an optional embodiment of the present invention, the limb features corresponding to the target image sequence may be obtained according to the semantic features corresponding to the text. Because the semantic features of the text are used in determining the limb features, the accuracy of the limb features can be improved.
In the embodiment of the present invention, optionally, any of the direction, position, speed, and strength parameters of a limb feature is related to the semantic features corresponding to the text.
Optionally, the above semantic features may be related to emotional features. Limb features can be classified according to emotional features, so as to obtain the limb features corresponding to each class of emotional features.
Optionally, the emotional features may include: positive-affirmative, negative, neutral, and so on.
The position zones of limb features may include an upper zone, a middle zone, and a lower zone. The upper zone, above the shoulders, can express positive-affirmative emotional features such as ideals, hopes, happiness, and congratulations. The middle zone, from the shoulders to the waist, can describe things and explain reasoning, expressing neutral emotion. The lower zone, below the waist, can express negative emotions such as loathing, opposition, criticism, and disappointment.
Besides the position zone, limb features may also include direction. For example, palms turned upward can express a positive-affirmative emotional feature; palms turned downward can express a negative emotion.
In the embodiment of the present invention, the types of semantic features may include keywords, one-hot vectors, word-embedding vectors (Word Embedding), and so on. Word embedding finds a mapping or function that produces a representation in a new space, and that representation is the word representation.
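The difference between a one-hot vector and a word-embedding vector can be sketched as below; this is illustrative only, and the toy vocabulary, the random embedding matrix, and the dimension of 8 are assumptions made for this example:

```python
import numpy as np

vocab = {"hello": 0, "congratulate": 1, "oppose": 2}

def one_hot(word, vocab):
    # Sparse representation: one position set to 1, no similarity information.
    v = np.zeros(len(vocab))
    v[vocab[word]] = 1.0
    return v

# A word embedding is a learned mapping from word index to a dense vector;
# the matrix here is random merely to show the lookup mechanics.
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(vocab), 8))  # |V| x d

def word_embedding(word, vocab, matrix):
    return matrix[vocab[word]]  # dense d-dimensional word representation

print(one_hot("oppose", vocab))                                 # [0. 0. 1.]
print(word_embedding("oppose", vocab, embedding_matrix).shape)  # (8,)
```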
Through the mapping relations between semantic features and limb features, the embodiment of the present invention can determine the limb features corresponding to the semantic features of the text. The mapping relations between semantic features and limb features can be obtained by statistical methods, or by end-to-end methods.
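In the simplest case, a statistically obtained mapping from emotional features to limb features could be a lookup table like the hypothetical sketch below, which reuses the position zones and palm directions described above; the table contents are assumptions, not values taken from the disclosure:

```python
# Hypothetical lookup table from emotional feature class to limb features
# (position zone and palm direction), following the zones described above.
LIMB_MAP = {
    "positive": {"zone": "upper",  "palm": "up"},    # above the shoulders
    "neutral":  {"zone": "middle", "palm": "side"},  # shoulders to waist
    "negative": {"zone": "lower",  "palm": "down"},  # below the waist
}

def limb_features_for(emotion: str) -> dict:
    """Return the limb features associated with an emotional feature class."""
    return LIMB_MAP.get(emotion, LIMB_MAP["neutral"])

print(limb_features_for("positive"))  # {'zone': 'upper', 'palm': 'up'}
```

An end-to-end method would replace this table with a learned model mapping semantic feature vectors directly to limb-feature parameters.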
The embodiment of the present invention can realize the alignment of the target speech sequence and the target image sequence on the time axis through time-axis-aligned speech samples and image samples; alternatively, the alignment can be realized through duration features. On the basis that the target speech sequence and the target image sequence are aligned on the time axis, the two can be fused to obtain the target video. Optionally, multi-modal fusion technology may be used to fuse the target speech sequence and the target image sequence. It can be understood that the embodiment of the present invention places no restriction on the specific fusion method.
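Schematically, time-axis alignment and fusion might pair each video frame with its slice of audio samples, as in the sketch below; the sample rate, frame rate, and function name are assumptions, and a real system would then mux such pairs into a video container:

```python
import numpy as np

def fuse_on_time_axis(speech, image_frames, audio_rate=16000, video_fps=25):
    """Pair each video frame with its slice of audio samples so that the
    speech sequence and image sequence stay aligned on the time axis."""
    samples_per_frame = audio_rate // video_fps  # 640 audio samples per frame
    fused = []
    for i, frame in enumerate(image_frames):
        chunk = speech[i * samples_per_frame:(i + 1) * samples_per_frame]
        fused.append((chunk, frame))  # one aligned (audio, image) pair
    return fused

# One second of silence paired with 25 dummy frames.
pairs = fuse_on_time_axis(np.zeros(16000), [np.zeros((4, 4))] * 25)
print(len(pairs))  # 25
```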
After obtaining target video, target video can be saved or be exported.For example, server-side can be to client End sends target video, so that client exports target video etc. to user.
To sum up, in the data processing method of the embodiment of the present invention, the target speech sequence can match the timbre of the target utterance body, and the target image sequence can be obtained on the basis of the target entity image; the resulting target video can therefore realize having the target entity image express the answer text in the timbre of the target utterance body. Since the target video can be generated by machine, its generation time can be shortened and its timeliness improved, making the target video suitable for video interaction scenarios with high timeliness requirements, such as breaking-news scenarios.
Moreover, compared with expressing the answer text manually, having the target entity image in the target video express the answer text in the timbre of the target utterance body can save labor cost and improve the working efficiency of related industries.
In addition, the text samples may cover all the languages involved in the answer text; therefore, the target image sequence obtained from the text samples and their image samples can be applied to answer texts in at least two languages.
Also, the duration features corresponding to the answer text are used in determining both the target speech sequence and the target image sequence; these duration features can improve the synchronism between the target speech sequence and the target image sequence.
Method Embodiment Two
Referring to Fig. 3, a flow chart of the steps of Embodiment Two of a data processing method of the present invention is shown, for the processing of question-and-answer interaction; the method may specifically include the following steps:
Step 301: determine the target speech sequence and the target image sequence corresponding to the target entity image; the mode corresponding to the target image sequence may include: a listening mode or an answering mode; during the input of a question, the mode corresponding to the target image sequence may be the listening mode; alternatively, after the input of the question is completed, the mode corresponding to the target image sequence may be the answering mode;
Step 302: compensate the boundary of a preset region in the target image sequence;
Step 303: fuse the target speech sequence and the compensated target image sequence to obtain the corresponding target video, so as to output the target video to the user.
In determining the target image sequence corresponding to the answer text, the embodiment of the present invention usually uses a three-dimensional model of the target entity image, and the limitations of the model's reconstruction method and of the method for synthesizing the model with the image feature sequence easily cause missing detail in the model's polygons. As a result, the target entity image corresponding to the target image sequence may be incomplete in certain places; for example, teeth or the nose may be partially missing.
The embodiment of the present invention compensates the boundary of the preset region in the target image sequence, which can improve the completeness of the preset region.
The preset region can characterize a part of the entity, such as the face or the limbs; accordingly, the preset region may specifically include at least one of the following regions:
Facial region;
Clothing region; and
Limb region.
In an embodiment of the present invention, compensating the boundary of the tooth region in the target image sequence can repair incomplete teeth or supplement teeth that failed to appear, thereby improving the completeness of the tooth region.
In practical applications, the boundary of the preset region in the target image sequence can be compensated with reference to a target entity image that includes the complete preset region; the embodiment of the present invention places no restriction on the specific compensation process.
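One conceivable form of this compensation, sketched below, fills missing pixels of the preset region (e.g., the tooth region) from a reference image that contains the complete region; the masking scheme and the use of zero-valued pixels to mark missing detail are assumptions for illustration, not the patented procedure:

```python
import numpy as np

def compensate_region(frame, reference, region_mask):
    """Fill pixels of the preset region that are missing in the generated
    frame (here marked as exactly zero) with pixels from a reference image
    that contains the complete region."""
    out = frame.copy()
    missing = region_mask & (frame == 0)  # holes inside the preset region
    out[missing] = reference[missing]
    return out

# Toy 4x4 grayscale frame with a hole in the 2x2 "tooth region".
frame = np.full((4, 4), 200)
frame[1, 1] = 0                     # detail lost during synthesis
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True               # preset (tooth) region
reference = np.full((4, 4), 255)    # reference with the complete region
print(compensate_region(frame, reference, mask)[1, 1])  # 255
```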
It should be noted that the method embodiments are expressed as a series of action combinations for simplicity of description, but those skilled in the art should understand that the embodiments of the present invention are not limited by the described order of actions, because according to the embodiments of the present invention some steps may be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by the embodiments of the present invention.
Device Embodiment
Referring to Fig. 4, a structural block diagram of an embodiment of a data processing device of the present invention is shown, for the processing of question-and-answer interaction; the device may specifically include:
a determining module 401, configured to determine the target speech sequence and the target image sequence corresponding to the target entity image; the mode corresponding to the target image sequence may include: a listening mode or an answering mode; during the input of a question, the mode corresponding to the target image sequence is the listening mode; alternatively, after the input of the question is completed, the mode corresponding to the target image sequence is the answering mode; and
a fusion module 402, configured to fuse the target speech sequence and the target image sequence to obtain the corresponding target video, so as to output the target video to the user.
Optionally, the image features corresponding to the target image sequence may include at least one of the following features:
expression features;
lip features; and
limb features.
Optionally, the determining module 401 may include:
a question speech-image sequence determining module, configured to determine, according to question-related text, the target speech sequence and the target image sequence corresponding to the target entity image.
Optionally, the question speech-image sequence determining module may include:
a first speech sequence determining module, configured to determine the target speech sequence corresponding to the question-related text;
a first image sequence determining module, configured to determine, according to the mapping relations between speech feature sequences and image feature sequences, the target image sequence corresponding to the target speech sequence; in the mapping relations, the speech feature sequence and the image feature sequence are aligned on the time axis; the mapping relations are obtained from time-axis-aligned speech samples and image samples.
Optionally, the question speech-image sequence determining module may include:
a duration feature determining module, configured to determine the duration features corresponding to the question-related text;
a second speech sequence determining module, configured to determine, according to the duration features, the target speech sequence corresponding to the question-related text;
a second image sequence determining module, configured to determine, according to the duration features, the target image sequence corresponding to the question-related text; the target image sequence is obtained from text samples and their corresponding image samples.
Optionally, the limb features corresponding to the target image sequence are obtained according to the semantic features corresponding to the question-related text.
Optionally, the device may further include:
a boundary compensation module, configured to compensate the boundary of the preset region in the target image sequence before the fusion module fuses the target speech sequence and the target image sequence.
Optionally, the device may further include:
a first output module, configured to output the target video to the user; or
a second output module, configured to output a link to the target video to the user; or
a third output module, configured to output the target speech sequence, or a link to the target speech sequence, to the user; or
a fourth output module, configured to output the question-related text to the user.
As for the device embodiment, since it is basically similar to the method embodiments, the description is relatively simple; for relevant parts, refer to the description in the method embodiments.
Each embodiment in this specification is described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts among the embodiments can be referred to one another.
As for the devices in the above embodiments, the specific manner in which each module performs operations has been described in detail in the embodiments of the related method and will not be elaborated here.
Fig. 5 is a structural block diagram of a device 900 for data processing, as a piece of equipment, according to an exemplary embodiment. For example, the device 900 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, fitness equipment, a personal digital assistant, and so on.
Referring to Fig. 5, the device 900 may include one or more of the following components: a processing component 902, a memory 904, a power component 906, a multimedia component 908, an audio component 910, an input/output (I/O) interface 912, a sensor component 914, and a communication component 916.
The processing component 902 generally controls the overall operations of the device 900, such as operations associated with display, telephone calls, data communication, camera operation, and recording. The processing component 902 may include one or more processors 920 to execute instructions, so as to perform all or part of the steps of the above method. In addition, the processing component 902 may include one or more modules to facilitate interaction between the processing component 902 and the other components; for example, the processing component 902 may include a multimedia module to facilitate interaction between the multimedia component 908 and the processing component 902.
The memory 904 is configured to store various types of data to support the operation of the device 900. Examples of such data include instructions for any application or method operated on the device 900, contact data, phonebook data, messages, pictures, video, and so on. The memory 904 may be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as static random-access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk.
The power component 906 provides power to the various components of the device 900. The power component 906 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device 900.
The multimedia component 908 includes a screen providing an output interface between the device 900 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensors may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 908 includes a front camera and/or a rear camera. When the device 900 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or have focusing and optical zoom capability.
The audio component 910 is configured to output and/or input audio signals. For example, the audio component 910 includes a microphone (MIC) configured to receive external audio signals when the device 900 is in an operation mode, such as a call mode, a recording mode, or a voice recognition mode. The received audio signals may be further stored in the memory 904 or transmitted via the communication component 916. In some embodiments, the audio component 910 further includes a speaker for outputting audio signals.
The I/O interface 912 provides an interface between the processing component 902 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, and the like. These buttons may include, but are not limited to: a home button, volume buttons, a start button, and a lock button.
The sensor component 914 includes one or more sensors to provide state assessments of various aspects of the device 900. For example, the sensor component 914 may detect the open/closed state of the device 900 and the relative positioning of components (for example, the display and keypad of the device 900); the sensor component 914 may also detect a change in position of the device 900 or of one of its components, the presence or absence of user contact with the device 900, the orientation or acceleration/deceleration of the device 900, and changes in the temperature of the device 900. The sensor component 914 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 914 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 914 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 916 is configured to facilitate wired or wireless communication between the device 900 and other equipment. The device 900 can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 916 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 916 further includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented on the basis of radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the device 900 may be implemented by one or more application-specific integrated circuits (ASIC), digital signal processors (DSP), digital signal processing devices (DSPD), programmable logic devices (PLD), field-programmable gate arrays (FPGA), controllers, microcontrollers, microprocessors, or other electronic components, for executing the above method.
In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is also provided, such as the memory 904 including instructions; the above instructions can be executed by the processor 920 of the device 900 to complete the above method. For example, the non-transitory computer-readable storage medium may be a ROM, a random-access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Fig. 6 is a structural block diagram of a server in some embodiments of the present invention. The server 1900 may vary considerably depending on configuration or performance, and may include one or more central processing units (CPU) 1922 (for example, one or more processors), a memory 1932, and one or more storage media 1930 (for example, one or more mass storage devices) storing application programs 1942 or data 1944. The memory 1932 and the storage medium 1930 may be transient storage or persistent storage. The programs stored in the storage medium 1930 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the server. Further, the central processing unit 1922 may be configured to communicate with the storage medium 1930 and to execute, on the server 1900, the series of instruction operations in the storage medium 1930.
The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input/output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
Also provided is a non-transitory computer-readable storage medium; when the instructions in the storage medium are executed by a processor of a device (equipment or server), the device is enabled to execute a data processing method, the method comprising: determining the target speech sequence and the target image sequence corresponding to the target entity image, where the mode corresponding to the target image sequence includes: a listening mode or an answering mode; during the input of a question, the mode corresponding to the target image sequence is the listening mode; alternatively, after the input of the question is completed, the mode corresponding to the target image sequence is the answering mode; and fusing the target speech sequence and the target image sequence to obtain the corresponding target video, so as to output the target video to the user.
The embodiment of the present invention discloses A1, a data processing method for the processing of question-and-answer interaction, the method comprising:
determining the target speech sequence and the target image sequence corresponding to the target entity image; the mode corresponding to the target image sequence includes: a listening mode or an answering mode; during the input of a question, the mode corresponding to the target image sequence is the listening mode; alternatively, after the input of the question is completed, the mode corresponding to the target image sequence is the answering mode;
fusing the target speech sequence and the target image sequence to obtain the corresponding target video, so as to output the target video to the user.
A2. The method according to A1, wherein the image features corresponding to the target image sequence include at least one of the following features:
expression features;
lip features; and
limb features.
A3. The method according to A1, wherein the determining the target speech sequence and the target image sequence corresponding to the target entity image comprises:
determining, according to question-related text, the target speech sequence and the target image sequence corresponding to the target entity image.
A4. The method according to A3, wherein the determining the target speech sequence and the target image sequence corresponding to the target entity image comprises:
determining the target speech sequence corresponding to the question-related text;
determining, according to the mapping relations between speech feature sequences and image feature sequences, the target image sequence corresponding to the target speech sequence; in the mapping relations, the speech feature sequence and the image feature sequence are aligned on the time axis; the mapping relations are obtained from time-axis-aligned speech samples and image samples.
A5. The method according to A3, wherein the determining the target speech sequence and the target image sequence corresponding to the target entity image comprises:
determining the duration features corresponding to the question-related text;
determining, according to the duration features, the target speech sequence corresponding to the question-related text;
determining, according to the duration features, the target image sequence corresponding to the question-related text; the target image sequence is obtained from text samples and their corresponding image samples.
A6. The method according to A3, wherein the limb features corresponding to the target image sequence are obtained according to the semantic features corresponding to the question-related text.
A7. The method according to any one of A1 to A6, wherein before the fusing the target speech sequence and the target image sequence, the method further comprises:
compensating the boundary of the preset region in the target image sequence.
A8. The method according to any one of A1 to A6, wherein the method further comprises:
outputting the target video to the user; or
outputting a link to the target video to the user; or
outputting the target speech sequence, or a link to the target speech sequence, to the user; or
outputting the question-related text to the user.
The embodiment of the present invention discloses B9, a data processing device for the processing of question-and-answer interaction, the device comprising:
a determining module, configured to determine the target speech sequence and the target image sequence corresponding to the target entity image; the mode corresponding to the target image sequence includes: a listening mode or an answering mode; during the input of a question, the mode corresponding to the target image sequence is the listening mode; alternatively, after the input of the question is completed, the mode corresponding to the target image sequence is the answering mode; and
a fusion module, configured to fuse the target speech sequence and the target image sequence to obtain the corresponding target video, so as to output the target video to the user.
B10. The device according to B9, wherein the image features corresponding to the target image sequence include at least one of the following features:
expression features;
lip features; and
limb features.
B11. The device according to B9, wherein the determining module includes:
a question speech-image sequence determining module, configured to determine, according to question-related text, the target speech sequence and the target image sequence corresponding to the target entity image.
B12. The device according to B11, wherein the question speech-image sequence determining module includes:
a first speech sequence determining module, configured to determine the target speech sequence corresponding to the question-related text;
a first image sequence determining module, configured to determine, according to the mapping relations between speech feature sequences and image feature sequences, the target image sequence corresponding to the target speech sequence; in the mapping relations, the speech feature sequence and the image feature sequence are aligned on the time axis; the mapping relations are obtained from time-axis-aligned speech samples and image samples.
B13. The device according to B11, wherein the question speech-image sequence determining module includes:
a duration feature determining module, configured to determine the duration features corresponding to the question-related text;
a second speech sequence determining module, configured to determine, according to the duration features, the target speech sequence corresponding to the question-related text;
a second image sequence determining module, configured to determine, according to the duration features, the target image sequence corresponding to the question-related text; the target image sequence is obtained from text samples and their corresponding image samples.
B14. The device according to B11, wherein the limb features corresponding to the target image sequence are obtained according to the semantic features corresponding to the question-related text.
B15. The device according to any one of B9 to B14, wherein the device further comprises:
a boundary compensation module, configured to compensate the boundary of the preset region in the target image sequence before the fusion module fuses the target speech sequence and the target image sequence.
B16. The device according to any one of B9 to B14, wherein the device further comprises:
a first output module, configured to output the target video to the user; or
a second output module, configured to output a link to the target video to the user; or
a third output module, configured to output the target speech sequence, or a link to the target speech sequence, to the user; or
a fourth output module, configured to output the question-related text to the user.
The embodiment of the present invention discloses C17, a device for data processing, for the processing of question-and-answer interaction, the device comprising a memory and one or more programs, where the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for performing the following operations:
determining the target speech sequence and the target image sequence corresponding to the target entity image; the mode corresponding to the target image sequence includes: a listening mode or an answering mode; during the input of a question, the mode corresponding to the target image sequence is the listening mode; alternatively, after the input of the question is completed, the mode corresponding to the target image sequence is the answering mode;
fusing the target speech sequence and the target image sequence to obtain the corresponding target video, so as to output the target video to the user.
C18. The device according to C17, wherein the image features corresponding to the target image sequence include at least one of the following features:
expression features;
lip features; and
limb features.
C19. The device according to C17, wherein the determining the target speech sequence and the target image sequence corresponding to the target entity image comprises:
determining, according to question-related text, the target speech sequence and the target image sequence corresponding to the target entity image.
C20. The device according to C19, wherein the determining the target speech sequence and the target image sequence corresponding to the target entity image comprises:
determining the target speech sequence corresponding to the question-related text;
determining, according to the mapping relations between speech feature sequences and image feature sequences, the target image sequence corresponding to the target speech sequence; in the mapping relations, the speech feature sequence and the image feature sequence are aligned on the time axis; the mapping relations are obtained from time-axis-aligned speech samples and image samples.
C21. The device according to C19, wherein the determining the target speech sequence and the target image sequence corresponding to the target entity image comprises:
determining the duration features corresponding to the question-related text;
determining, according to the duration features, the target speech sequence corresponding to the question-related text;
determining, according to the duration features, the target image sequence corresponding to the question-related text; the target image sequence is obtained from text samples and their corresponding image samples.
C22. The device according to C19, wherein the limb features corresponding to the target image sequence are obtained according to the semantic features corresponding to the question-related text.
C23. The device according to any one of C17 to C22, wherein before the fusing the target speech sequence and the target image sequence, the operations further comprise:
compensating the boundary of the preset region in the target image sequence.
C24. The device according to any one of C17 to C22, wherein the operations further comprise:
outputting the target video to the user; or
outputting a link to the target video to the user; or
outputting the target speech sequence, or a link to the target speech sequence, to the user; or
outputting the question-related text to the user.
The embodiment of the present invention discloses D25, a machine-readable medium having instructions stored thereon which, when executed by one or more processors, cause a device to execute the data processing method according to one or more of A1 to A8.
Those skilled in the art will readily conceive of other embodiments of the present invention after considering the specification and practicing the invention disclosed herein. The present invention is intended to cover any variations, uses, or adaptive changes of the present invention that follow its general principles and include common knowledge or customary technical means in the art not disclosed in this disclosure. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the present invention being indicated by the following claims.
It should be understood that the present invention is not limited to the precise structures described above and shown in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present invention is limited only by the appended claims.
The foregoing are merely preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.
The data processing method, data processing device, and device for data processing provided by the present invention have been described in detail above. Specific examples are used herein to explain the principles and embodiments of the present invention, and the description of the above embodiments is only intended to help understand the method and core ideas of the present invention. Meanwhile, for those of ordinary skill in the art, there will be changes in the specific implementation and application scope according to the ideas of the present invention. In summary, the contents of this specification should not be construed as limiting the present invention.

Claims (10)

1. A data processing method, characterized in that it is used for the processing of question-and-answer interaction, the method comprising:
determining the target speech sequence and the target image sequence corresponding to the target entity image; the mode corresponding to the target image sequence includes: a listening mode or an answering mode; during the input of a question, the mode corresponding to the target image sequence is the listening mode; alternatively, after the input of the question is completed, the mode corresponding to the target image sequence is the answering mode;
fusing the target speech sequence and the target image sequence to obtain the corresponding target video, so as to output the target video to the user.
2. The method according to claim 1, characterized in that the image features corresponding to the target image sequence include at least one of the following features:
expression features;
lip features; and
limb features.
3. The method according to claim 1, characterized in that the determining the target speech sequence and the target image sequence corresponding to the target entity image comprises:
determining, according to question-related text, the target speech sequence and the target image sequence corresponding to the target entity image.
4. The method according to claim 3, characterized in that the determining the target speech sequence and the target image sequence corresponding to the target entity image comprises:
determining the target speech sequence corresponding to the question-related text;
determining, according to the mapping relations between speech feature sequences and image feature sequences, the target image sequence corresponding to the target speech sequence; in the mapping relations, the speech feature sequence and the image feature sequence are aligned on the time axis; the mapping relations are obtained from time-axis-aligned speech samples and image samples.
5. The method according to claim 3, characterized in that the determining the target speech sequence and the target image sequence corresponding to the target entity image comprises:
determining the duration features corresponding to the question-related text;
determining, according to the duration features, the target speech sequence corresponding to the question-related text;
determining, according to the duration features, the target image sequence corresponding to the question-related text; the target image sequence is obtained from text samples and their corresponding image samples.
6. The method according to claim 3, characterized in that the limb features corresponding to the target image sequence are obtained according to the semantic features corresponding to the question-related text.
7. The method according to any one of claims 1 to 6, characterized in that before the fusing the target speech sequence and the target image sequence, the method further comprises:
compensating the boundary of the preset region in the target image sequence.
8. A data processing device, characterized in that it is used for the processing of question-and-answer interaction, the device comprising:
a determining module, configured to determine the target speech sequence and the target image sequence corresponding to the target entity image; the mode corresponding to the target image sequence includes: a listening mode or an answering mode; during the input of a question, the mode corresponding to the target image sequence is the listening mode; alternatively, after the input of the question is completed, the mode corresponding to the target image sequence is the answering mode; and
a fusion module, configured to fuse the target speech sequence and the target image sequence to obtain the corresponding target video, so as to output the target video to the user.
9. A device for data processing, characterized in that it is used for the processing of question-and-answer interaction, the device comprising a memory and one or more programs, where the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for performing the following operations:
determining the target speech sequence and the target image sequence corresponding to the target entity image; the mode corresponding to the target image sequence includes: a listening mode or an answering mode; during the input of a question, the mode corresponding to the target image sequence is the listening mode; alternatively, after the input of the question is completed, the mode corresponding to the target image sequence is the answering mode;
fusing the target speech sequence and the target image sequence to obtain the corresponding target video, so as to output the target video to the user.
10. A machine-readable medium having instructions stored thereon which, when executed by one or more processors, cause a device to execute the data processing method according to one or more of claims 1 to 7.
CN201910295565.XA 2019-04-12 2019-04-12 Data processing method and device for data processing Active CN110148406B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910295565.XA CN110148406B (en) 2019-04-12 2019-04-12 Data processing method and device for data processing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910295565.XA CN110148406B (en) 2019-04-12 2019-04-12 Data processing method and device for data processing

Publications (2)

Publication Number Publication Date
CN110148406A true CN110148406A (en) 2019-08-20
CN110148406B CN110148406B (en) 2022-03-04

Family

ID=67588858

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910295565.XA Active CN110148406B (en) 2019-04-12 2019-04-12 Data processing method and device for data processing

Country Status (1)

Country Link
CN (1) CN110148406B (en)


Patent Citations (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040017473A1 (en) * 2002-07-27 2004-01-29 Sony Computer Entertainment Inc. Man-machine interface using a deformable device
JP5380543B2 (en) * 2009-09-25 2014-01-08 株式会社東芝 Spoken dialogue apparatus and program
CN103650002A (en) * 2011-05-06 2014-03-19 西尔股份有限公司 Video generation based on text
CN104488025A (en) * 2012-03-16 2015-04-01 纽昂斯通讯公司 User dedicated automatic speech recognition
CN103345853A (en) * 2013-07-04 2013-10-09 昆明医科大学第二附属医院 Training system and method for attention dysfunction after cerebral injury
CN104866101A (en) * 2015-05-27 2015-08-26 世优(北京)科技有限公司 Real-time interactive control method and real-time interactive control device of virtual object
CN106910513A (en) * 2015-12-22 2017-06-30 微软技术许可有限责任公司 Emotional intelligence chat engine
CN106648082A (en) * 2016-12-09 2017-05-10 厦门快商通科技股份有限公司 Intelligent service device capable of simulating human interactions and method
CN106653052A (en) * 2016-12-29 2017-05-10 Tcl集团股份有限公司 Virtual human face animation generation method and device
CN107065586A (en) * 2017-05-23 2017-08-18 中国科学院自动化研究所 Interactive intelligent home services system and method
CN107247750A (en) * 2017-05-26 2017-10-13 深圳千尘计算机技术有限公司 Artificial intelligence exchange method and system
CN107423809A (en) * 2017-07-07 2017-12-01 北京光年无限科技有限公司 The multi-modal exchange method of virtual robot and system applied to net cast platform
CN108052250A (en) * 2017-12-12 2018-05-18 北京光年无限科技有限公司 Virtual idol deductive data processing method and system based on multi-modal interaction
CN108010531A (en) * 2017-12-14 2018-05-08 南京美桥信息科技有限公司 A kind of visible intelligent inquiry method and system
CN108345667A (en) * 2018-02-06 2018-07-31 北京搜狗科技发展有限公司 A kind of searching method and relevant apparatus
CN108446641A (en) * 2018-03-22 2018-08-24 深圳市迪比科电子科技有限公司 A method of degree of lip-rounding image identification system based on machine learning and passes through face line and identify sounding
CN109033423A (en) * 2018-08-10 2018-12-18 北京搜狗科技发展有限公司 Simultaneous interpretation caption presentation method and device, intelligent meeting method, apparatus and system
CN109346076A (en) * 2018-10-25 2019-02-15 三星电子(中国)研发中心 Interactive voice, method of speech processing, device and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Quan Longzhe et al.: "Research on operation decision planning algorithm for somatosensory-controlled multi-arm greenhouse robots", Transactions of the Chinese Society for Agricultural Machinery *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112784146A (en) * 2019-11-04 2021-05-11 北京搜狗科技发展有限公司 Data processing method and device and data processing device
CN111429267A (en) * 2020-03-26 2020-07-17 深圳壹账通智能科技有限公司 Face examination risk control method and device, computer equipment and storage medium
CN111696579A (en) * 2020-06-17 2020-09-22 厦门快商通科技股份有限公司 Speech emotion recognition method, device, equipment and computer storage medium
CN111696579B (en) * 2020-06-17 2022-10-28 厦门快商通科技股份有限公司 Speech emotion recognition method, device, equipment and computer storage medium
CN113642394A (en) * 2021-07-07 2021-11-12 北京搜狗科技发展有限公司 Action processing method, device and medium for virtual object

Also Published As

Publication number Publication date
CN110148406B (en) 2022-03-04

Similar Documents

Publication Publication Date Title
CN110288077B (en) Method and related device for synthesizing speaking expression based on artificial intelligence
CN110688911B (en) Video processing method, device, system, terminal equipment and storage medium
CN103218842B (en) A kind of voice synchronous drives the method for the three-dimensional face shape of the mouth as one speaks and facial pose animation
US20200279553A1 (en) Linguistic style matching agent
WO2021036644A1 (en) Voice-driven animation method and apparatus based on artificial intelligence
US8224652B2 (en) Speech and text driven HMM-based body animation synthesis
Hong et al. Real-time speech-driven face animation with expressions using neural networks
KR101604593B1 (en) Method for modifying a representation based upon a user instruction
Deng et al. Expressive facial animation synthesis by learning speech coarticulation and expression spaces
CN110148406A (en) A kind of data processing method and device, a kind of device for data processing
Cosatto et al. Lifelike talking faces for interactive services
CN110162598A (en) A kind of data processing method and device, a kind of device for data processing
CN102568023A (en) Real-time animation for an expressive avatar
CN111145777A (en) Virtual image display method and device, electronic equipment and storage medium
CN110992927B (en) Audio generation method, device, computer readable storage medium and computing equipment
CN110322760A (en) Voice data generation method, device, terminal and storage medium
US20230082830A1 (en) Method and apparatus for driving digital human, and electronic device
Gibbon et al. Audio-visual and multimodal speech-based systems
WO2023207541A1 (en) Speech processing method and related device
Delgado et al. Spoken, multilingual and multimodal dialogue systems: development and assessment
CN110166844A (en) A kind of data processing method and device, a kind of device for data processing
Liu et al. Real-time speech-driven animation of expressive talking faces
Ding et al. Interactive multimedia mirror system design
Verma et al. Animating expressive faces across languages
Kolivand et al. Realistic lip syncing for virtual character using common viseme set

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20190829

Address after: 100084 Beijing, Zhongguancun East Road, building 1, No. 9, Sohu cyber building, room 9, room, room 01

Applicant after: BEIJING SOGOU TECHNOLOGY DEVELOPMENT Co.,Ltd.

Applicant after: SOGOU (HANGZHOU) INTELLIGENT TECHNOLOGY Co.,Ltd.

Address before: 100084 Beijing, Zhongguancun East Road, building 1, No. 9, Sohu cyber building, room 9, room, room 01

Applicant before: BEIJING SOGOU TECHNOLOGY DEVELOPMENT Co.,Ltd.

GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220803

Address after: 100084. Room 9, floor 01, cyber building, building 9, building 1, Zhongguancun East Road, Haidian District, Beijing

Patentee after: BEIJING SOGOU TECHNOLOGY DEVELOPMENT Co.,Ltd.

Address before: 100084. Room 9, floor 01, cyber building, building 9, building 1, Zhongguancun East Road, Haidian District, Beijing

Patentee before: BEIJING SOGOU TECHNOLOGY DEVELOPMENT Co.,Ltd.

Patentee before: SOGOU (HANGZHOU) INTELLIGENT TECHNOLOGY Co.,Ltd.