CN112750184A - Data processing, action driving and man-machine interaction method and equipment - Google Patents

Data processing, action driving and man-machine interaction method and equipment

Info

Publication number
CN112750184A
Authority
CN
China
Prior art keywords
audio
sequence segment
action
state information
candidate
Prior art date
Legal status
Granted
Application number
CN201911045674.2A
Other languages
Chinese (zh)
Other versions
CN112750184B (en)
Inventor
庄博宇
林冠芠
Current Assignee
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201911045674.2A
Publication of CN112750184A
Application granted
Publication of CN112750184B
Legal status: Active

Classifications

    • G06T 13/205: 3D [Three Dimensional] animation driven by audio data
    • G06F 3/011: Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F 3/017: Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G06T 13/40: 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G06F 2203/012: Walk-in-place systems for allowing a user to walk in a virtual environment while constraining him to a given position in the physical environment

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Processing Or Creating Images (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the application provides a method and equipment for data processing, action driving and man-machine interaction. The method comprises the following steps: intercepting a first audio sequence segment from a first audio frame in a first audio sequence; according to the first audio sequence segment, searching at least one candidate audio sequence segment similar to the first audio sequence segment in the first data set; and generating first action state information matched with the first audio frame according to at least one candidate action sequence segment respectively matched with the at least one candidate audio sequence segment. The technical scheme provided by the embodiment of the application effectively avoids complex training sample labeling work and reduces the action generation cost; moreover, the method has certain controllability on the quality of the generated motion, and further improves the cooperativity of the motion and the audio.

Description

Data processing, action driving and man-machine interaction method and equipment
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for data processing, motion driving, and human-computer interaction.
Background
Currently, driving two-dimensional or three-dimensional animation (e.g., an avatar, an animated character in an animated movie, an animated character in virtual reality, etc.) based on skeletal points is the mainstream animation rendering technology. The skin is the visible appearance of an ordinary animated character; through skeleton-point driving and rendering technology, it can perform various actions without the character having to be drawn frame by frame.
In the prior art, a machine learning model is usually trained to generate a facial skeleton point sequence, and the facial actions of the animation are then driven according to that skeleton point sequence. However, this approach relies on a large number of manually labeled samples to train the model, which is costly; moreover, with such a learning-based method, the quality of the generated facial skeleton point sequence is not guaranteed, so the audio and the motion are not well coordinated.
Disclosure of Invention
In view of the above, the present application is proposed to provide a data processing, motion-driven and human-computer interaction method and apparatus that solves the above problems, or at least partially solves the above problems.
Thus, in one embodiment of the present application, a data processing method is provided. The method comprises the following steps:
intercepting a first audio sequence segment from a first audio frame in a first audio sequence;
according to the first audio sequence segment, searching at least one candidate audio sequence segment similar to the first audio sequence segment in the first data set; wherein the first data set comprises a plurality of reserve audio sequences;
and generating first action state information matched with the first audio frame according to at least one candidate action sequence segment respectively matched with the at least one candidate audio sequence segment.
In another embodiment of the present application, a motion driven method is provided. The method comprises the following steps:
intercepting a first audio sequence segment from a first audio frame in a first audio sequence;
according to the first audio sequence segment, searching at least one candidate audio sequence segment similar to the first audio sequence segment in the first data set; wherein the first data set comprises a plurality of reserve audio sequences;
generating first action state information matched with the first audio frame according to at least one candidate action sequence segment respectively matched with the at least one candidate audio sequence segment;
and when the sound-emitting object emits the first audio frame in the first audio sequence, driving the action of the sound-emitting object according to the first action state information.
In another embodiment of the present application, a human-computer interaction method is provided. The method comprises the following steps:
receiving input information of a user;
generating a first audio sequence to be fed back by a feedback object according to the input information;
intercepting a first audio sequence segment from a first audio frame in the first audio sequence;
according to the first audio sequence segment, searching at least one candidate audio sequence segment similar to the first audio sequence segment in the first data set; wherein the first data set comprises a plurality of reserve audio sequences;
generating first action state information matched with the first audio frame according to at least one candidate action sequence segment respectively matched with the at least one candidate audio sequence segment;
and when the feedback object sends out the first audio frame in the first audio sequence, driving the feedback action of the feedback object according to the first action state information.
In an embodiment of the present application, an electronic device is provided. The electronic device includes: a memory and a processor, wherein,
the memory is used for storing programs;
the processor, coupled with the memory, to execute the program stored in the memory to:
intercepting a first audio sequence segment from a first audio frame in a first audio sequence;
according to the first audio sequence segment, searching at least one candidate audio sequence segment similar to the first audio sequence segment in the first data set; wherein the first data set comprises a plurality of reserve audio sequences;
and generating first action state information matched with the first audio frame according to at least one candidate action sequence segment respectively matched with the at least one candidate audio sequence segment.
In another embodiment of the present application, an electronic device is provided. The electronic device includes: a memory and a processor, wherein,
the memory is used for storing programs;
the processor, coupled with the memory, to execute the program stored in the memory to:
intercepting a first audio sequence segment from a first audio frame in a first audio sequence;
according to the first audio sequence segment, searching at least one candidate audio sequence segment similar to the first audio sequence segment in the first data set; wherein the first data set comprises a plurality of reserve audio sequences;
generating first action state information matched with the first audio frame according to at least one candidate action sequence segment respectively matched with the at least one candidate audio sequence segment;
and when the sound-emitting object emits the first audio frame in the first audio sequence, driving the action of the sound-emitting object according to the first action state information.
In another embodiment of the present application, an electronic device is provided. The electronic device includes: a memory and a processor, wherein,
the memory is used for storing programs;
the processor, coupled with the memory, to execute the program stored in the memory to:
receiving input information of a user;
generating a first audio sequence to be fed back by a feedback object according to the input information;
intercepting a first audio sequence segment from a first audio frame in the first audio sequence;
according to the first audio sequence segment, searching at least one candidate audio sequence segment similar to the first audio sequence segment in the first data set; wherein the first data set comprises a plurality of reserve audio sequences;
generating first action state information matched with the first audio frame according to at least one candidate action sequence segment respectively matched with the at least one candidate audio sequence segment;
and when the feedback object sends out the first audio frame in the first audio sequence, driving the feedback action of the feedback object according to the first action state information.
In yet another embodiment of the present application, a data processing method is provided. The method comprises the following steps:
determining a picture to be displayed;
acquiring a first audio sequence associated with the picture to be displayed;
and acquiring, according to the first audio sequence, a first action sequence of the animation in the picture that matches the first audio sequence.
In yet another embodiment of the present application, a data processing method is provided. The method comprises the following steps:
acquiring a video to be matched;
extracting the characteristics of the video to be matched to obtain video characteristics;
searching an audio data set according to the video features to obtain matched audio that matches the video features;
and adding the matched audio to the video to be matched to obtain an audio and video file.
By adopting the technical scheme provided by the embodiment of the application, the matched action sequence can be automatically generated for the unknown audio sequence. Compared with the prior art, the technical scheme provided by the embodiment of the application effectively avoids complex training sample labeling work and reduces the action generation cost; and for each audio frame in the audio sequence to be matched, at least one candidate audio sequence segment similar to the audio sequence segment in which each audio frame is located is searched in the first data set; the action state information matched with each audio frame in the audio sequence to be matched is generated through at least one candidate action sequence segment matched with at least one candidate audio sequence segment, so that certain controllability can be realized on the quality of generated actions, and the cooperativity of the actions and the audio is further improved. In addition, the method provided by the embodiment of the application has strong scene migration.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1a is a schematic diagram of a virtual character action driving method according to an embodiment of the present application;
fig. 1b is a schematic flowchart of a data processing method according to an embodiment of the present application;
fig. 2 is a schematic flowchart of an action driving method according to another embodiment of the present application;
fig. 3 is a schematic flowchart of a human-computer interaction method according to another embodiment of the present application;
fig. 4 is a block diagram of a data processing apparatus according to an embodiment of the present application;
fig. 5 is a block diagram of a motion driving device according to another embodiment of the present application;
FIG. 6 is a block diagram of a human-computer interaction device according to another embodiment of the present disclosure;
fig. 7 is a block diagram of an electronic device according to another embodiment of the present application.
Detailed Description
In the prior art, combining skeleton-point driving with audio information is mainly used to synthesize a virtual talking face, for example, driving facial expressions for a virtual anchor. A large amount of video data of human faces while speaking is collected to train a machine learning model, and the trained model is then used to drive the facial skeleton actions for arbitrary speech.
The inventors found, in the course of researching the technical solution provided by the embodiments of the present application, that the existing approach needs to rely on a large number of manually labeled samples to train the model, which is costly; with such a learning-based approach, the quality of the generated action skeleton point sequence is not guaranteed, so the audio and the action are not coordinated; and its ability to transfer across scenes is poor, for example, it cannot transfer between two different song styles: a model trained on lyrical songs and the actions matched with them can only generate suitable actions for lyrical songs, but cannot generate suitable actions for rock songs.
In order to solve the above technical problem, an embodiment of the present application provides a new data processing method, which generates an action sequence for an unknown audio sequence by searching an existing similar audio segment and generating action state information corresponding to each audio frame according to an existing action sequence matching the similar audio segment.
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Further, in some flows described in the specification, claims, and above-described figures of the present application, a number of operations are included that occur in a particular order, which operations may be performed out of order or in parallel as they occur herein. The sequence numbers of the operations, e.g., 101, 102, etc., are used merely to distinguish between the various operations, and do not represent any order of execution per se. Additionally, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel. It should be noted that, the descriptions of "first", "second", etc. in this document are used for distinguishing different messages, devices, modules, etc., and do not represent a sequential order, nor limit the types of "first" and "second" to be different.
Fig. 1b shows a schematic flow chart of a data processing method according to an embodiment of the present application. As shown in fig. 1b, the method comprises:
101. a first audio sequence segment is truncated from a first audio frame in a first audio sequence.
102. And searching at least one candidate audio sequence segment similar to the first audio sequence segment in the first data set according to the first audio sequence segment.
Wherein the first data set includes a plurality of reserve audio sequences.
103. And generating first action state information matched with the first audio frame according to at least one candidate action sequence segment respectively matched with the at least one candidate audio sequence segment.
In the foregoing 101, the first audio sequence includes a plurality of audio frames arranged in sequence. In a scenario with high requirements on motion consistency and coordination, the first audio frame may refer to any audio frame in the first audio sequence, that is, the data processing method provided in the embodiment of the present application may be adopted to generate the matched motion state information for each audio frame in the first audio sequence. In a scene with low requirements on motion coherence and harmony, the first audio frame may refer to any key audio frame in the first audio sequence, that is, the data processing method provided in the embodiment of the present application may be adopted to generate the matched motion state information for each key audio frame in the first audio sequence. The key audio frames in the first audio sequence may be specified in advance, which is not specifically limited in this application. For example: taking the audio frames sequenced into even numbers in the first audio sequence as key audio frames; alternatively, audio frames in the first audio sequence ordered as multiples of 4 are taken as key audio frames.
A first audio sequence segment is truncated from a first audio frame in a first audio sequence. The first audio sequence segment comprises at least one audio frame, and the at least one audio frame comprises a first audio frame.
Specifically, the first audio sequence segment includes N audio frames, where N is greater than 1. The specific value of N may be set according to actual needs, and this is not specifically limited in this embodiment of the application. For example: n is 20. Searching based on a plurality of audio frames of the first audio sequence located at the first audio frame may ensure reliability of the searched similar at least one candidate audio sequence segment.
In one example, the first audio sequence segment includes a first audio frame and N-1 audio frames of the first audio sequence adjacent to and preceding the first audio frame. When the number of audio frames in the first audio sequence that precede the first audio frame is less than N-1, the first sequence of audio frames includes the first audio frame and all audio frames in the first audio sequence that precede the first audio frame.
In another example, the first audio sequence segment includes: the first audio frame, n audio frames adjacent to and preceding the first audio frame in the first audio sequence, and N-n-1 audio frames adjacent to and following the first audio frame in the first audio sequence, where n is greater than or equal to 1. The specific value of n can also be set according to actual needs. When the number of audio frames in the first audio sequence before the first audio frame is less than n, the first audio sequence segment comprises the first audio frame, all audio frames in the first audio sequence before the first audio frame, and N-n-1 audio frames after the first audio frame. When the number of audio frames in the first audio sequence after the first audio frame is less than N-n-1, the first audio sequence segment includes the first audio frame, the n audio frames in the first audio sequence before the first audio frame, and all audio frames after the first audio frame.
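As an illustration of the truncation described above, the following minimal Python sketch cuts an N-frame window around a given frame index. The function name, the array representation of frames and the simple boundary clamping are assumptions made for illustration, not part of the embodiments.

```python
import numpy as np

def truncate_segment(audio_frames: np.ndarray, i: int, N: int = 20, n: int = 10) -> np.ndarray:
    """Cut a segment of up to N frames around frame i: the frame itself,
    up to n preceding frames, and up to N-n-1 following frames.
    Boundaries are simply clamped, which slightly simplifies the edge
    cases described in the text."""
    start = max(0, i - n)                       # keep at most n preceding frames
    end = min(len(audio_frames), i + (N - n))   # keep at most N-n-1 following frames
    return audio_frames[start:end]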
In 102, at least one candidate audio sequence segment similar to the first audio sequence segment may be searched for in the first data set established in advance.
In an implementation manner, in the above 102, "at least one candidate audio sequence segment similar to the first audio sequence segment is searched according to the first audio sequence segment", specifically:
1021. and performing feature extraction on the first audio sequence segment to obtain a first sequence segment feature.
1022. And searching at least one candidate audio sequence segment similar to the first sequence segment in the first data set according to the first sequence segment characteristics.
At 1021, audio features such as volume, pitch, and speed of sound are extracted from the first audio sequence segment. The specific extraction steps can be implemented by using the prior art, and are not described in detail herein. For example: the audio features are extracted by using FFT (fast Fourier transform), MFCC (Mel Frequency cepstral Coefficient) and other techniques.
Specifically, before feature extraction, the first audio sequence segment may be smoothed to remove excessive noise and the signal strength of the first audio sequence segment may be normalized.
Wherein, the first sequence segment characteristics can include: audio features of each audio frame in the first audio sequence segment.
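The following sketch shows one possible feature-extraction step using the librosa library (MFCCs plus an RMS volume proxy). The library choice, feature set and normalization details are assumptions for illustration, not requirements of the embodiments.

```python
import numpy as np
import librosa

def segment_features(samples: np.ndarray, sr: int, n_mfcc: int = 13) -> np.ndarray:
    """Extract per-frame audio features for a segment given its raw waveform.
    Signal strength is normalised before extraction; smoothing/denoising of
    the segment is omitted here for brevity."""
    samples = samples / (np.max(np.abs(samples)) + 1e-8)          # normalise signal strength
    mfcc = librosa.feature.mfcc(y=samples, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, n_frames)
    rms = librosa.feature.rms(y=samples)                          # (1, n_frames), volume proxy
    return np.vstack([mfcc, rms]).T                               # one feature vector per frame
```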
In an example, the above 1022 "finding at least one candidate audio sequence segment similar to the first sequence segment in the first data set according to the first sequence segment feature" can be implemented by:
s11, obtaining a first reserve audio sequence from the first data set.
Wherein the first reserved audio sequence comprises M audio frames. Wherein the first reserved audio sequence is any one of the reserved audio sequences in the first data set.
S12, determining the number N of audio frames in the first audio sequence segment.
S13, performing feature extraction on a reserved audio sequence segment between the j-th ordered audio frame and the (j + N-1) -th ordered audio frame in the first reserved audio sequence to obtain a second sequence segment feature.
Wherein j takes all integer values in the range [1, M-N+1].
And S14, calculating the similarity between the first sequence segment characteristic and the second sequence segment characteristic.
And S15, according to the similarity, determining at least one candidate audio sequence segment similar to the first audio sequence segment from all the reserve audio sequence segments determined from the reserve audio sequences in the plurality of reserve audio sequences.
In the above S13, the method for extracting the features of the reserve audio sequence segment may refer to the corresponding contents in the above embodiments, and is not described herein again.
In the above S15, in one implementation, all the reserve audio sequence segments may be sorted by similarity, with larger similarity ranked earlier, and a specified number of top-ranked reserve audio sequence segments are taken as the at least one candidate audio sequence segment. The specific number can be set according to actual needs and is not limited here. In this way, the number of candidate audio sequence segments can be kept from being too large or too small.
In another implementation, at least one of the reserve audio sequence segments with a similarity greater than a preset similarity value may be used as the at least one candidate audio sequence segment.
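The sliding-window search of S11-S15 together with the top-k selection can be sketched as follows. The cosine-similarity measure, the data layout of the first data set and the field names are illustrative assumptions.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    a, b = a.ravel(), b.ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def find_candidates(query_feats, reserve_sequences, top_k=5):
    """Slide an N-frame window over every reserve audio sequence and keep the
    top_k most similar reserve segments.

    query_feats: (N, d) features of the first audio sequence segment.
    reserve_sequences: list of dicts with per-frame 'audio_feats' (M, d) and
    the matching 'action_seq' (M, ...); field names are assumptions."""
    N = len(query_feats)
    scored = []
    for seq in reserve_sequences:
        feats = seq["audio_feats"]
        for j in range(0, len(feats) - N + 1):       # j runs over all windows (0-based here)
            sim = cosine_sim(query_feats, feats[j:j + N])
            scored.append((sim, seq, j))             # matching action segment: seq["action_seq"][j:j+N]
    scored.sort(key=lambda x: x[0], reverse=True)    # larger similarity ranks earlier
    return scored[:top_k]                            # the at least one candidate segment
```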
In 103, at least one candidate motion sequence segment respectively matching with the at least one candidate audio sequence segment is preset, and thus can be directly obtained. The at least one candidate action sequence segment comprises a first candidate action sequence segment; the at least one candidate audio sequence segment includes a first candidate audio sequence segment that matches the first candidate action sequence segment. The first candidate motion sequence segment includes a plurality of motion state information arranged in sequence. The audio frames in the first candidate audio sequence segment correspond one-to-one with the motion state information in the first candidate motion sequence segment. Wherein the first candidate audio sequence segment refers to any one of the at least one candidate audio sequence segment.
Specifically, the first data set further includes a plurality of reserve action sequences respectively matched with the plurality of reserve audio sequences. The reserve audio sequence in the first data set and the reserve action sequence matching the reserve audio sequence may be manually designed or acquired, which is not specifically limited in this embodiment of the application.
The correspondence between the reserve action sequence and the reserve audio sequence that match the reserve audio sequence may be established in advance in the first data set. The reserve action sequence comprises a plurality of action state information arranged in sequence. The motion state information in the reserve motion sequence that matches the reserve audio sequence corresponds one-to-one to the audio frames in the reserve audio sequence.
The method further comprises the following steps: from the first data set, at least one candidate action sequence segment respectively matching the at least one candidate audio sequence segment is obtained.
Specifically, a first reserve action sequence segment corresponding to the first candidate audio sequence segment is obtained from a second reserve action sequence matched with a second reserve audio sequence in which the first candidate audio sequence segment is located, and is used as the first candidate action sequence segment matched with the first candidate audio sequence segment. The first reserve motion sequence segment corresponding to the first candidate audio sequence segment is obtained from the second reserve motion sequence, that is, the motion state information corresponding to each audio frame in the first candidate audio sequence segment is obtained from the second reserve motion sequence. The first reserve action sequence segment is composed of action state information corresponding to each audio frame in the first candidate audio sequence segment in the second reserve action sequence according to the ordering of the audio frames in the first candidate audio sequence segment.
For example: candidate audio sequence segment a is located between the 1 st audio frame and the 20 th audio frame in the reserve audio sequence B, and candidate action sequence segment C matching candidate audio sequence segment a is located between the 1 st action state information and the 20 th action state information in the reserve action sequence D matching reserve audio sequence B. It is added that the candidate audio sequence segment a includes the 1 st audio frame and the 20 th audio frame in the reserved audio sequence B. The candidate motion sequence segment C includes the 1 st motion state information and the 20 th motion state information in the reserved motion sequence D.
The first action state information matching the first audio frame is generated based on the at least one candidate action sequence segment. The first action state information may include one or more of expression state information, limb state information, and mouth shape state information. The first action state information may be skeleton state information (that is, bone joint point state information), which includes the spatial coordinate information of each bone joint point. For example, the first action state information may be skeleton state information representing expression state information, skeleton state information representing limb state information, or skeleton state information representing mouth shape state information.
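For concreteness, skeleton (action) state information might be represented as in the following sketch; the class, joint layout and field names are illustrative assumptions.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class ActionState:
    """One frame of action state information: spatial coordinates of each
    bone joint point, stored as a (num_joints, 3) tensor."""
    joints: np.ndarray  # shape (num_joints, 3), e.g. mouth, face or limb joints

    def as_vector(self) -> np.ndarray:
        return self.joints.reshape(-1)  # flattened form used for feature extraction
```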
It should be added that the above data processing method can be applied to driving the actions of sound-emitting objects (e.g., driving virtual characters or robots); the specific implementation will be described in the following embodiments.
By adopting the technical scheme provided by the embodiment of the application, the matched action sequence can be automatically generated for the unknown audio sequence. Compared with the prior art, the technical scheme provided by the embodiment of the application effectively avoids complex training sample labeling work and reduces the action generation cost; and for each audio frame in the audio sequence to be matched, at least one candidate audio sequence segment similar to the audio sequence segment in which each audio frame is located is searched in the first data set; the action state information matched with each audio frame in the audio sequence to be matched is generated through at least one candidate action sequence segment matched with at least one candidate audio sequence segment, so that certain controllability can be realized on the quality of generated actions, and the cooperativity of the actions and the audio is further improved.
It should be added that the method provided by the embodiment of the present application is not limited to facial movements, but is also applicable to limb movements and the like, and the applicability is strong.
In addition, the method provided by the embodiments of the present application transfers better across scenes. Taking songs as an example, songs can be categorized into a lyrical class and a rock class. The expressions or body actions corresponding to lyrical songs are, as a whole, of a completely different style from those corresponding to rock songs: the lyrical style corresponds overall to a slower tempo and lower volume, while the rock style corresponds overall to a faster tempo and higher volume. However, a lyrical song also contains a climax section, and that climax section has a faster tempo and higher volume. With the method provided by the embodiments of the present application, through audio matching, the actions corresponding to the climax sections of lyrical songs can be combined to generate actions in a rock style, and the generated action style fits rock songs well. Therefore, the method provided by the embodiments of the present application can generate actions in a rock style based on the existing actions corresponding to the lyrical style; its scene-transfer capability is therefore strong.
For example: as shown in FIG. 1a, rock song A is input, while the database (which includes the first data set) contains only lyrical songs and their corresponding body action sequences. Suppose a user selects rock song A on a terminal's song selection interface; then, by using the data processing method provided by the embodiments of the present application together with the first data set, a limb action sequence B matched with rock song A can be generated. The terminal interface can thus simulate the virtual character singing rock song A while synchronously driving the virtual character's limb actions according to limb action sequence B, so that the audio and the actions are coordinated with each other.
In practical applications, the ordering (position) of the first audio frame within the first audio sequence segment may be determined first. Since each candidate audio sequence segment is similar to the first audio sequence segment as a whole, the third action state information located at that ordering position in the candidate action sequence segment matched with the candidate audio sequence segment can be considered the best match for the first audio frame; using this third action state information as the basis for determining the first action state information matched with the first audio frame therefore has high reliability.
Specifically, in 103, "generating the first motion state information matched with the first audio frame according to at least one candidate motion sequence segment respectively matched with the at least one candidate audio sequence segment" may specifically be:
1031. obtaining an ordering of the first audio frame in the first audio sequence segment.
1032. And synthesizing the third action state information of the sequencing position in each candidate action sequence segment to determine first action state information.
In 1031, when the first audio sequence segment includes the first audio frame and the N-1 audio frames adjacent to and before the first audio frame in the first audio sequence, the ordering of the first audio frame in the first audio sequence segment is the Nth. When the first audio sequence segment includes the first audio frame, the n audio frames adjacent to and before the first audio frame in the first audio sequence, and the N-n-1 audio frames adjacent to and after the first audio frame in the first audio sequence, the ordering of the first audio frame in the first audio sequence segment is the (n+1)th. When the first audio sequence segment includes the first audio frame and the N-1 audio frames adjacent to and after the first audio frame in the first audio sequence, the ordering of the first audio frame in the first audio sequence segment is the 1st.
In an implementation manner, in the above 1032, "synthesizing the third action state information at the sorting in each candidate action sequence segment to determine the first action state information", specifically: an average motion state information of the third motion state information at the ordering in each of the at least one candidate motion sequence segments may be calculated as the first motion state information.
The motion state information in the present application may specifically be in a tensor form, that is, the third motion state information is a third motion state tensor; the third motion state tensors at the sorting position in each candidate motion sequence segment in at least one candidate motion sequence segment can be added to obtain a total motion state tensor; the total motion state tensor is divided by the number of at least one candidate motion sequence segment to obtain an average motion state tensor, i.e. average motion state information.
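A minimal sketch of this averaging, assuming each candidate action sequence segment is an indexable sequence of joint-coordinate tensors:

```python
import numpy as np

def average_state(candidate_action_segments, position):
    """Average the action state tensor found at `position` (the ordering of
    the first audio frame) across all candidate action sequence segments;
    the mean tensor is used as the first action state information."""
    third_states = [seg[position] for seg in candidate_action_segments]  # one tensor per candidate
    return np.mean(np.stack(third_states), axis=0)
```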
In another implementation, the aforementioned 1032 "determining the first action state information by synthesizing the third action state information at the sorting position in each candidate action sequence segment" may specifically be implemented by:
and S21, acquiring second action state information matched with a second audio frame in the first audio sequence.
Wherein the second audio frame precedes the first audio frame in the first audio sequence.
S22, estimating a first state difference between the first motion state information and the second motion state information to be determined according to a second state difference between the third motion state information and the second motion state information at the sorting position in each candidate motion sequence segment and a weight corresponding to each candidate motion sequence segment.
And S23, determining the first action state information according to the second action state information and the first state difference.
In the above S21, in the scene with high requirements for motion consistency and coordination, the second audio frame may be the previous audio frame of the first audio frame; in a scenario with low requirements on motion coherence and harmony, the second audio frame may be the first key audio frame before the first audio frame.
When the second audio frame is the initial (first-ordered) audio frame of the first audio sequence, initial action state information can be obtained as the second action state information matched with the second audio frame. The initial action state information may be preset, or determined according to the action state of the sound-emitting object before the first audio sequence is emitted.
When the second audio frame is not the initial audio frame of the first audio sequence, the second action state information matched with the second audio frame may also be generated in the manner provided by the embodiments of the present application.
In the above S22, in one example, the motion state information is in the form of tensor; that is, the third motion state information is a third motion state tensor, and the second motion state information is a second motion state tensor; a difference action state tensor obtained by directly subtracting the second action state tensor from the third action state tensor can be used as the second state difference between the third action state information and the second action state information.
In another example, feature extraction may be performed on the third action state information to obtain a first feature; performing feature extraction on the second action state information to obtain a second feature; subtracting the second characteristic from the first characteristic to obtain a difference characteristic as a second state difference between the third action state information and the second action state information. All features in this application may specifically also be in the form of tensors.
Feature extraction of the action state information mainly extracts feature values of the bone points in three-dimensional space; in a specific implementation, an existing dimensionality-reduction method can be used, for example PCA (Principal Component Analysis) or t-SNE (t-distributed Stochastic Neighbor Embedding).
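A possible dimensionality-reduction step using PCA from scikit-learn is sketched below; fitting on the reserve action states and the number of components are assumptions, since the text only names PCA and t-SNE as examples.

```python
import numpy as np
from sklearn.decomposition import PCA

def fit_state_pca(reserve_states: np.ndarray, n_components: int = 16) -> PCA:
    """Fit PCA on flattened reserve action states of shape (num_frames, num_joints*3)."""
    return PCA(n_components=n_components).fit(reserve_states)

def state_to_feature(pca: PCA, state: np.ndarray) -> np.ndarray:
    """Map one (num_joints, 3) action state tensor to a low-dimensional feature vector."""
    return pca.transform(state.reshape(1, -1))[0]
```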
In practical application, the weights corresponding to different candidate action sequence segments are different. And obtaining a first state difference through weighted averaging according to a second state difference between the third action state information and the second action state information at the sorting position in each candidate action sequence segment and the corresponding weight of each candidate action sequence segment.
For example: the at least one candidate action sequence segment comprises candidate action sequence segments A and B; the weights corresponding to candidate action sequence segments A and B are λ and β respectively; the second state difference between the third action state information and the second action state information in candidate action sequence segment A is difference feature a; the second state difference between the third action state information and the second action state information in candidate action sequence segment B is difference feature b; the first state difference v is then the weighted average v = (λa + βb)/(λ + β).
The weight corresponding to the first candidate action sequence segment may be calculated in one or more of the following manners, wherein the first candidate action sequence segment may be any one of the at least one candidate action sequence segment.
The first method is as follows: and determining the weight corresponding to the first candidate action sequence segment according to the similarity between the first candidate audio sequence segment matched with the first candidate action sequence segment and the first audio sequence segment. The greater the similarity, the greater the weight.
Specifically, the similarity may be directly used as the weight corresponding to the first candidate action sequence segment.
For a method for calculating the similarity between the first candidate audio sequence segment and the first audio sequence segment, reference may be made to corresponding contents in the prior art, and details are not repeated here. For example: the distance between the two respectively corresponding sequence segment features can be calculated, and the similarity is determined according to the distance.
The second method comprises the following steps: acquiring at least one fourth action state information which is adjacent to the third action state information and is positioned in front of the third action state information in the first candidate action sequence segment and at least one fifth action state information which is respectively matched with at least one third audio frame in the first audio sequence segment; wherein the at least one third audio frame in the first audio sequence segment is adjacent to and precedes the first audio frame; and determining the weight corresponding to the first candidate action sequence segment according to the first similarity between the at least one fourth motion state information and the at least one fifth motion state information.
Wherein, the number of the at least one fourth action state information may be one or more. In one example, a number threshold t may be set in advance, where t is greater than 1, and the specific value of t may be set according to actual needs, and t is smaller than N, for example: t is 5. When the number of action state information preceding the third action state information in the first candidate action sequence segment is less than or equal to t, the at least one fourth action state information includes all action state information preceding the third action state information in the first candidate action sequence segment. When the number of the action state information before the third action state information in the first candidate action sequence segment is greater than t, the at least one fourth action state information comprises t action state information which is adjacent to the third action state information and before the third action state information in the first candidate action sequence segment.
Wherein the number of the at least one third audio frame is the same as the number of the at least one fourth action state information.
Wherein the greater the first similarity, the greater the weight. Specifically, the first similarity may be directly used as the weight corresponding to the first candidate action sequence segment. The method for calculating the first similarity can be found in the prior art, and is not described in detail here. For example: a similarity between the sequence segment characteristic corresponding to the at least one fourth motion state information and the sequence segment characteristic corresponding to the at least one fifth motion state information may be calculated.
The third method comprises the following steps: the at least one candidate audio sequence segment includes a first candidate audio sequence segment that matches the first candidate action sequence segment. Obtain at least one fifth audio frame in the first candidate audio sequence segment that is adjacent to and precedes the fourth audio frame, the fourth audio frame being the audio frame located at the aforementioned ordering position in the first candidate audio sequence segment; obtain the at least one third audio frame in the first audio sequence segment; calculate a second similarity between the at least one fifth audio frame and the at least one third audio frame; and determine the weight corresponding to the first candidate action sequence segment according to the second similarity.
Wherein the number of the at least one fifth audio frame may be one or more. In one example, a number threshold t may be set in advance, where t is greater than 1, and the specific value of t may be set according to actual needs, and t is smaller than N, for example: t is 5. When the number of audio frames of the first candidate audio sequence segment that precede the fourth audio frame is less than or equal to t, the at least one fifth audio frame includes all audio frames of the first candidate audio sequence segment that precede the fourth audio frame. When the number of audio frames preceding the fourth audio frame in the first candidate audio sequence segment is greater than t, the at least one fifth audio frame includes t audio frames that are adjacent to and preceding the fourth audio frame in the first candidate audio sequence segment.
The number of the at least one third audio frame is the same as the number of the at least one fifth audio frame. Wherein the greater the second similarity, the greater the weight. Specifically, the second similarity may be directly used as the weight corresponding to the first candidate motion sequence segment.
The method for calculating the second similarity can be found in the prior art, and is not described in detail here. For example: a similarity between the sequence segment feature corresponding to the at least one fifth audio frame and the sequence segment feature corresponding to the at least one third audio frame may be calculated.
In the first mode, the overall similarity of the segments is compared, whereas in the second and third modes the similarity of the partial segments preceding the ordering position is compared; compared with the overall similarity, this comparison is finer-grained and better reflects how reliable it is to use the third action state information as a basis.
In the fourth mode, the first similarity in the second mode and the second similarity in the third mode may be integrated to determine the weight corresponding to the first candidate motion sequence segment.
Specifically, the weight corresponding to the first candidate action sequence segment may be determined according to the product of the first similarity and the second similarity. Specifically, the product may be used as the weight corresponding to the first candidate action sequence segment.
In the fourth mode, the two dimensions of audio similarity and motion similarity are comprehensively considered, so that the cooperativity of the generated motion and the audio can be improved.
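A sketch of the fourth mode's weight computation follows; the use of cosine similarity is an assumption, as the text leaves the exact similarity measure open.

```python
import numpy as np

def _cos(a, b):
    a, b = np.ravel(a), np.ravel(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def candidate_weight(cand_prev_actions, query_prev_actions,
                     cand_prev_audio, query_prev_audio):
    """Mode four: weight = first similarity (features of the preceding action
    states) times second similarity (features of the preceding audio frames)."""
    return _cos(cand_prev_actions, query_prev_actions) * _cos(cand_prev_audio, query_prev_audio)
```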
In S23, when the first state difference is a difference action state tensor, the difference action state tensor is superimposed on the second action state tensor to obtain the first action state information.
When the first state difference is a difference feature, the step "determining the first action state information according to the second action state information and the first state difference" in S23 may specifically be implemented as follows: superimpose the first state difference on the second feature to obtain a third feature; and perform feature restoration on the third feature to obtain the first action state information. The specific implementation of feature restoration can be found in the prior art and is not described in detail here.
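Putting S21-S23 together in feature space, a minimal sketch might look like this; `pca` is assumed to be a fitted scikit-learn PCA model (for example, from the earlier dimensionality-reduction sketch).

```python
import numpy as np

def fuse_first_state(pca, second_state, candidates, position, weights):
    """For each candidate action sequence segment, take the third action state
    at `position`, compute its feature difference from the second action state,
    weight-average those differences (S22), superimpose the result on the second
    state's feature, and restore it with the inverse transform (S23)."""
    second_feat = pca.transform(second_state.reshape(1, -1))[0]
    diffs = []
    for seg in candidates:
        third_feat = pca.transform(seg[position].reshape(1, -1))[0]
        diffs.append(third_feat - second_feat)                  # second state difference
    w = np.asarray(weights, dtype=float)
    first_diff = np.average(np.stack(diffs), axis=0, weights=w)  # first state difference
    first_feat = second_feat + first_diff                        # superimpose on the second feature
    restored = pca.inverse_transform(first_feat.reshape(1, -1))[0]
    return restored.reshape(second_state.shape)                  # first action state information
```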
This weighted-fusion approach ensures that the generated actions conform to the action expression of the audio and that excessive or unreasonable actions do not occur.
Further, the method may further include:
104. determining at least one unmatched audio frame of the first audio sequence for which no action state matching is currently performed.
105. Determining the first audio frame from the at least one unmatched audio frame.
In practical application, the audio frames subjected to action state matching can be marked. The unmarked audio frame is an unmatched audio frame.
In 105, the audio frame that is ranked the first in the first audio sequence among the at least one unmatched audio frame may be determined as the first audio frame. Or determining the first audio frame as the top-ranked key audio frame in the first audio sequence in the at least one unmatched audio frame.
In this way, it is ensured that action matching is performed for each audio frame or each key audio frame in the first audio sequence.
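The frame-by-frame matching loop of steps 104-105 can be sketched as follows; `generate_state` stands in for steps 101-103, and the key-frame rule (a fixed stride) is an illustrative assumption.

```python
def generate_action_sequence(audio_frames, generate_state, key_frame_stride=1):
    """Mark each (key) audio frame once it has been matched, always pick the
    earliest unmatched one as the next first audio frame, and generate its
    action state. A stride of 1 matches every audio frame."""
    matched = [False] * len(audio_frames)
    action_sequence = [None] * len(audio_frames)   # non-key frames stay None when stride > 1
    while True:
        unmatched = [i for i in range(0, len(audio_frames), key_frame_stride) if not matched[i]]
        if not unmatched:
            break
        i = unmatched[0]                           # earliest unmatched (key) frame
        action_sequence[i] = generate_state(i)     # steps 101-103 for that frame
        matched[i] = True
    return action_sequence
```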
In a practical application scenario, the data processing method may be applied to an intelligent terminal, and the intelligent terminal locally matches corresponding motion state information for each audio frame in a first audio sequence, so as to match a corresponding motion sequence for the first audio sequence. Of course, the data processing method described above may also be applied to the server, and the intelligent terminal requests the server to match corresponding action state information for each audio frame in the first audio sequence, so as to match a corresponding action sequence for the first audio sequence.
At present, driving two-dimensional or three-dimensional animation based on skeleton points is the mainstream animation rendering technology. The outer layer of the skeleton is covered with a layer of animation skin, which is the visible appearance of the animation. Through skeleton-point driving and rendering technology, the animation skin can make various actions, such as limb actions and mouth shape actions. The action sequence here is specifically a skeleton point action sequence.
In the scene of simulating the singing of the animation, the first audio sequence can be a song, and the action sequence matched with the first audio sequence is a mouth skeleton action sequence, so that the mouth shape of the animation can be synchronously driven according to the mouth skeleton action sequence while the song is played, and the singing of the animation can be simulated.
In practical application, a plurality of recommended songs can be provided for the user to select. After the user selects the songs, various animation skins can be provided for the user aiming at the selected songs for the user to select; according to the data processing method, obtaining a bone action sequence corresponding to the selected song; and driving the action of selecting the animation skin by the user according to the action sequence of the skeletal points through a rendering technology.
For example: when a user makes an expression package, in order to enable the animation in the expression package to simulate the expression or action when a certain sentence is spoken, the user can match a corresponding skeleton action sequence for a voice sequence of the sentence by using the data processing method; the user may also select one of a variety of animation skins from which to provide. Driving the expression or the action of the animation skin selected by the user according to the skeleton action sequence on a recording interface; and manufacturing a corresponding expression package according to the video or the plurality of pictures recorded on the recording interface.
If the animation skin is a charging item, after the user selects the animation skin, the user needs to complete payment to use the selected animation skin. It should be added that recommending an animation skin for a fee is equivalent to advertising.
Certainly, in order to meet the personalized requirements of the user, an animation skinning recommendation switch button can be set for the user, and when the switch button is in an on state, after the user selects a song, a plurality of animation skinning can be automatically provided for the user aiming at the selected song; when the toggle button is in the off state, a default animation skin is provided for the user after the user selects a song.
The plurality of recommended songs may be ranked according to popularity or user preference, and the plurality of animation skins may be ranked according to popularity or price.
It should be noted that the data processing method provided by the embodiment of the present application can also be applied to a virtual reality device, and is used for driving the motion of the animation in the virtual reality device according to the audio. Specifically, the format of the obtained bone action sequence matched with the first audio sequence can be converted into a format recognizable by the virtual reality device, so that the virtual reality device synchronously drives the motion of the animation according to the bone action sequence in the recognizable format when playing the first audio sequence.
After completing action sequence matching for the first audio frame sequence, when in use, the first audio sequence can be played on the same equipment, and the action of the animation is driven according to the action sequence matched with the first audio sequence; of course, the playing operation of the first audio sequence and the step of driving the motion of the animation according to the motion sequence matched with the first audio sequence can also be executed on the two devices synchronously.
In addition, matching can be performed in a video library according to the existing first audio, so that a first video matched with the first audio is obtained. Specifically, the audio feature of the first audio may be extracted, the video feature of each candidate video in the video library may be extracted, the similarity between the audio feature of the first audio and the video feature of each candidate video in the video library may be calculated, and the candidate video with the largest similarity may be used as the first video.
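A minimal sketch of this audio-to-video matching, assuming pre-extracted feature vectors and cosine similarity (the text only says "similarity"):

```python
import numpy as np

def match_video(audio_feat: np.ndarray, video_feats: list) -> int:
    """Return the index of the candidate video whose feature vector is most
    similar to the first audio's feature vector."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    sims = [cos(audio_feat, v) for v in video_feats]
    return int(np.argmax(sims))
```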
In another example, an embodiment of the present application further provides a data processing method. The method comprises the following steps:
3301. and determining the picture to be displayed.
3302. And acquiring a first audio sequence associated with the picture to be displayed.
3303. And acquiring, according to the first audio sequence, a first action sequence of the animation in the picture that matches the first audio sequence.
In 3301, the animation in the picture to be displayed may be one or more.
In 3302, the association relationship between the pictures and the audio sequences may be established in advance, so that the first audio sequence associated with the picture to be displayed may be obtained subsequently.
In an example, in 3303, the first audio sequence may be subjected to feature extraction to obtain audio features; and searching the action sequence data set for a first action sequence matched with the action sequence data set according to the audio features.
In another example, the first action sequence includes first action state information matched, in order, with each audio frame in the first audio sequence; the first audio sequence includes a first audio frame. In 3303, a first audio sequence segment may specifically be intercepted from the first audio frame in the first audio sequence; according to the first audio sequence segment, at least one candidate audio sequence segment similar to the first audio sequence segment is searched for in the first data set, wherein the first data set comprises a plurality of reserve audio sequences; and first action state information matched with the first audio frame is generated according to at least one candidate action sequence segment respectively matched with the at least one candidate audio sequence segment. The specific implementation and beneficial effects of the steps in this embodiment can refer to the corresponding contents in the above embodiments and are not described here again.
In this embodiment, the animation in the picture can be matched with a corresponding action sequence, and when the associated audio is played, the animation in the picture can be driven synchronously based on that action sequence, which helps make the picture display more engaging.
In another example, an embodiment of the present application further provides a data processing method. The method comprises the following steps:
4301. Acquire a video to be matched.
4302. Perform feature extraction on the video to be matched to obtain video features.
4303. Search an audio data set according to the video features to obtain matched audio that matches the video features.
4304. Add the matched audio to the video to be matched to obtain an audio-video file.
The video to be matched may be, for example, a mime or other silent drama. To improve the viewing effect of such silent dramas, background music can be added to them in the manner described above.
In this embodiment, adding audio matched with the video improves the viewing experience of the video.
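One possible realization of 4303 and 4304 is sketched below, assuming pre-extracted feature vectors and using the ffmpeg command-line tool to mux the matched audio into the silent video; the feature extraction itself and the layout of the audio data set are not specified by this application, so the function names and the dictionary structure are hypothetical.

```python
import subprocess
import numpy as np

def pick_matched_audio(video_feature: np.ndarray, audio_features: dict) -> str:
    # audio_features maps an audio file path to its feature vector.
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    return max(audio_features, key=lambda path: cos(video_feature, audio_features[path]))

def add_audio_to_video(video_path: str, audio_path: str, out_path: str) -> None:
    # Copy the video stream, take the audio from the matched file, and stop at the
    # shorter of the two inputs, producing the combined audio-video file.
    subprocess.run(["ffmpeg", "-y", "-i", video_path, "-i", audio_path,
                    "-map", "0:v:0", "-map", "1:a:0", "-c:v", "copy",
                    "-shortest", out_path], check=True)
```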
The above data processing method can be applied to driving the motion of a sound-emitting object, for example expression movements, mouth-shape movements, or limb movements. Fig. 2 shows a flowchart of an action driving method provided in an embodiment of the present application. The method includes the following steps:
201. Truncate a first audio sequence segment from a first audio frame in a first audio sequence.
202. Search, according to the first audio sequence segment, for at least one candidate audio sequence segment similar to the first audio sequence segment.
203. Generate first action state information matched with the first audio frame according to at least one candidate action sequence segment respectively matched with the at least one candidate audio sequence segment.
204. When the sound-emitting object emits the first audio frame in the first audio sequence, drive the action of the sound-emitting object according to the first action state information.
The specific implementation of the steps 201 to 203 can refer to the corresponding content in the above embodiments, and is not described herein again.
Currently, most motion driving is skeleton-based, so in step 204 the first action state information may include bone state information (i.e., bone joint point information). The corresponding bone motion of the sound-emitting object is driven according to the bone state information in the first action state information; that is, the corresponding bone joint points of the sound-emitting object are driven according to the bone joint point information in the first action state information.
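A schematic sketch of step 204 under the skeleton-driving assumption is given below: each matched action state is treated as a mapping from bone joint names to rotations, and the rendering engine and audio output are represented by caller-supplied callbacks, since the application does not tie the method to any particular engine; all names here are hypothetical.

```python
import time
from typing import Callable, Dict, Sequence, Tuple

def drive_skeleton(audio_frames: Sequence,
                   action_states: Sequence[Dict[str, Tuple[float, float, float]]],
                   play_frame: Callable[[object], None],
                   apply_joint: Callable[[str, Tuple[float, float, float]], None],
                   frame_interval: float = 1.0 / 30) -> None:
    # Emit each audio frame and drive the matching bone joint points so that the
    # motion of the sound-emitting object stays in step with the audio.
    for frame, state in zip(audio_frames, action_states):
        play_frame(frame)                      # the sound-emitting object emits this frame
        for joint_name, rotation in state.items():
            apply_joint(joint_name, rotation)  # drive the corresponding bone joint point
        time.sleep(frame_interval)             # simple pacing; a real engine would sync to its own clock
```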
The sound-emitting object may be a robot or a virtual character. The virtual character may be an animated character in an animated film, a virtual character in virtual reality or augmented reality, or a virtual anchor.
The first audio sequence is the speech to be uttered by the sound-emitting object.
By adopting the technical scheme provided by the embodiment of the application, the matched action sequence can be automatically generated for the unknown audio sequence. Compared with the prior art, the technical scheme provided by the embodiment of the application effectively avoids complex training sample labeling work and reduces the action generation cost; and for each audio frame in the audio sequence to be matched, at least one candidate audio sequence segment similar to the audio sequence segment in which each audio frame is located is searched in the first data set; the action state information matched with each audio frame in the audio sequence to be matched is generated through at least one candidate action sequence segment matched with at least one candidate audio sequence segment, so that certain controllability can be realized on the quality of generated actions, and the cooperativity of the actions and the audio is further improved.
Here, it should be noted that: the content of each step in the method provided by the embodiment of the present application, which is not described in detail in the foregoing embodiment, may refer to the corresponding content in the foregoing embodiment, and is not described herein again. In addition, the method provided in the embodiment of the present application may further include, in addition to the above steps, other parts or all of the steps in the above embodiments, and specific reference may be made to corresponding contents in the above embodiments, which is not described herein again.
The data processing method provided by the above embodiments can also be applied to the field of human-computer interaction, for example to smart speakers, robots, and the like. Fig. 3 shows a flowchart of a human-computer interaction method provided by an embodiment of the present application. As shown in fig. 3, the method includes:
301. Receive input information of a user.
302. Generate, according to the input information, a first audio sequence to be fed back by a feedback object.
303. Truncate a first audio sequence segment from a first audio frame in the first audio sequence.
304. Search, according to the first audio sequence segment, for at least one candidate audio sequence segment similar to the first audio sequence segment.
305. Generate first action state information matched with the first audio frame according to at least one candidate action sequence segment respectively matched with the at least one candidate audio sequence segment.
306. When the feedback object sends out the first audio frame in the first audio sequence, drive the feedback action of the feedback object according to the first action state information.
In 301, the input information of the user may be voice information, text information, or the like input by the user.
Taking the interaction between the user and the virtual character on the smart speaker as an example, after the smart speaker is turned on, the user can speak into the smart speaker, that is, input voice information.
In 302, the feedback object may be a virtual character or a robot. Semantic recognition can be performed on the input information, and the first audio sequence to be fed back by the feedback object is generated according to the semantic recognition result. For the specific implementation of semantic recognition, refer to the prior art; it is not detailed here.
The specific implementation of the above 303, 304, and 305 can refer to the corresponding content in the above embodiments, and is not described herein again.
In 306, when the feedback object sends out the first audio frame in the first audio sequence, the feedback action of the feedback object is driven according to the first action state information.
Currently, most motion driving is skeleton-based, so in step 306 the first action state information may include bone state information (i.e., bone joint point information). The corresponding bone motion of the feedback object is driven according to the bone state information in the first action state information; that is, the corresponding bone joint points of the feedback object are driven according to the bone joint point information in the first action state information.
By adopting the technical scheme provided by the embodiment of the application, the matched action sequence can be automatically generated for the unknown audio sequence. Compared with the prior art, the technical scheme provided by the embodiment of the application effectively avoids complex training sample labeling work and reduces the action generation cost; and for each audio frame in the audio sequence to be matched, at least one candidate audio sequence segment similar to the audio sequence segment in which each audio frame is located is searched in the first data set; the action state information matched with each audio frame in the audio sequence to be matched is generated through at least one candidate action sequence segment matched with at least one candidate audio sequence segment, so that certain controllability can be realized on the quality of generated actions, and the cooperativity of the actions and the audio is further improved.
Here, it should be noted that: the content of each step in the method provided by the embodiment of the present application, which is not described in detail in the foregoing embodiment, may refer to the corresponding content in the foregoing embodiment, and is not described herein again. In addition, the method provided in the embodiment of the present application may further include, in addition to the above steps, other parts or all of the steps in the above embodiments, and specific reference may be made to corresponding contents in the above embodiments, which is not described herein again.
In summary, the embodiments of the present application prepare in advance a first data set containing a plurality of mutually matched pairs of audio sequences and action sequences. The first data set is not used for model training; instead, the corresponding action state information is generated by searching for similar segments in the first data set. Because the action state information is generated from this prepared first data set, both the quality of the generated actions and their coordination with the audio are controllable to a certain extent. By searching the first data set, the embodiments of the application can generate, for any unknown audio, an action sequence of the same quality as the first data set.
In addition, the technical solution provided by the embodiments of the application transfers well to new scenarios. Based on an existing first data set, a reasonable action sequence can be generated without performing motion capture in a real environment, which greatly reduces labor cost.
Fig. 4 shows a block diagram of a data processing apparatus according to another embodiment of the present application. As shown in fig. 4, the apparatus includes: a first truncation module 401, a first search module 402, and a first generation module 403. Wherein,
a first truncation module 401, configured to truncate a first audio sequence segment from a first audio frame in a first audio sequence.
A first searching module 402, configured to search, according to the first audio sequence segment, in a first data set to obtain at least one candidate audio sequence segment similar to the first audio sequence segment; wherein the first data set includes a plurality of reserve audio sequences.
A first generating module 403, configured to generate first motion state information matching the first audio frame according to at least one candidate motion sequence segment respectively matching the at least one candidate audio sequence segment.
By adopting the technical scheme provided by the embodiment of the application, the matched action sequence can be automatically generated for the unknown audio sequence. Compared with the prior art, the technical scheme provided by the embodiment of the application effectively avoids complex training sample labeling work and reduces the action generation cost; and for each audio frame in the audio sequence to be matched, at least one candidate audio sequence segment similar to the audio sequence segment in which each audio frame is located is searched in the first data set; the action state information matched with each audio frame in the audio sequence to be matched is generated through at least one candidate action sequence segment matched with at least one candidate audio sequence segment, so that certain controllability can be realized on the quality of generated actions, and the cooperativity of the actions and the audio is further improved.
Further, the first generating module 403 is specifically configured to:
obtaining the ordering of the first audio frame in the first audio sequence segment;
and synthesizing the third action state information at the sorting position in each candidate action sequence segment to determine the first action state information.
Further, the first generating module 403 is specifically configured to:
acquiring second action state information matched with a second audio frame in the first audio sequence; wherein the second audio frame precedes the first audio frame in the first audio sequence;
estimating a first state difference between the first action state information and the second action state information to be determined according to a second state difference between third action state information and the second action state information at the sorting position in each candidate action sequence segment and a weight corresponding to each candidate action sequence segment;
and determining the first action state information according to the second action state information and the first state difference.
Further, the at least one candidate action sequence segment includes a first candidate action sequence segment; the above apparatus further includes:
a first obtaining module to: acquiring at least one fourth action state information which is adjacent to the third action state information and is positioned in front of the third action state information in the first candidate action sequence segment and at least one fifth action state information which is respectively matched with at least one third audio frame in the first audio sequence segment; wherein the at least one third audio frame in the first audio sequence segment is adjacent to and precedes the first audio frame;
a first determining module, configured to determine, according to a first similarity between the at least one fourth motion state information and the at least one fifth motion state information, a weight corresponding to the first candidate action sequence segment.
Further, the at least one candidate audio sequence segment includes a first candidate audio sequence segment matching the first candidate action sequence segment;
the above apparatus further includes:
a second obtaining module, configured to obtain at least one fifth audio frame in the first candidate audio sequence segment that precedes and is adjacent to a fourth audio frame at the ordering, and further configured to obtain the at least one third audio frame in the first audio sequence segment;
a first calculating module, configured to calculate a second similarity between the at least one fifth audio frame and the at least one third audio frame;
the first determining module is specifically configured to integrate the first similarity and the second similarity to determine the weight corresponding to the first candidate action sequence segment.
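The weighting step can be sketched as follows; representing "integrating the first similarity and the second similarity" as a convex combination followed by softmax normalization is an assumption of this sketch, not something fixed by the application, and the parameter alpha is illustrative.

```python
import numpy as np

def candidate_weights(first_similarities: np.ndarray,
                      second_similarities: np.ndarray,
                      alpha: float = 0.5) -> np.ndarray:
    # first_similarities[k]: similarity between the preceding action states of
    # candidate k and the action states already matched for the preceding frames.
    # second_similarities[k]: similarity between the preceding audio frames of
    # candidate k and the corresponding frames of the first audio sequence segment.
    combined = alpha * first_similarities + (1.0 - alpha) * second_similarities
    exp = np.exp(combined - combined.max())   # softmax so the weights sum to 1
    return exp / exp.sum()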
Further, the above apparatus may further include:
a first feature extraction module, configured to perform feature extraction on the third action state information to obtain a first feature, and further configured to perform feature extraction on the second action state information to obtain a second feature;
and a second calculation module, configured to subtract the second feature from the first feature to obtain a difference feature as the second state difference between the third action state information and the second action state information.
Further, the first generating module 403 is specifically configured to:
superimposing the first state difference on the second feature to obtain a third feature;
and performing feature restoration on the third feature to obtain the first action state information.
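Putting the pieces together, the synthesis performed by the first generating module can be sketched as below. Feature extraction and restoration are kept abstract (an identity function in the simplest case, which is an assumption of this sketch), and the weighted combination of state differences follows the description above.

```python
import numpy as np

def synthesize_first_state(second_state_feature: np.ndarray,
                           third_state_features: np.ndarray,
                           weights: np.ndarray,
                           restore=lambda f: f) -> np.ndarray:
    # second_state_feature: feature (second feature) of the already-matched second action state.
    # third_state_features: (K, d) features of the third action state at the frame's
    # ordering in each of the K candidate action sequence segments.
    # restore: hypothetical feature-restoration step (identity by default).
    second_state_differences = third_state_features - second_state_feature   # per-candidate second state differences
    first_state_difference = (weights[:, None] * second_state_differences).sum(axis=0)
    third_feature = second_state_feature + first_state_difference            # superimpose on the second feature
    return restore(third_feature)                                            # first action state information
```

With the identity restoration, this reduces to a weighted blend of the candidate states anchored at the previously matched state.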
Further, the second audio frame is a previous audio frame of the first audio frame in the first audio sequence.
Further, the above apparatus further includes: a second determining module, configured to determine at least one unmatched audio frame in the first audio sequence that has not yet been matched with an action state, and further configured to determine the first audio frame from the at least one unmatched audio frame.
Further, the first searching module 402 is specifically configured to:
performing feature extraction on the first audio sequence segment to obtain a first sequence segment feature;
and searching at least one candidate audio sequence segment similar to the first sequence segment in the first data set according to the first sequence segment characteristics.
Further, the first data set further comprises a plurality of reserve action sequences respectively matched with the plurality of reserve audio sequences;
the above apparatus further includes:
a third obtaining module, configured to obtain, from the first data set, at least one candidate action sequence segment that matches the at least one candidate audio sequence segment respectively.
Further, the first action state information includes one or more of expression state information, limb state information, and mouth shape state information.
Here, it should be noted that: the data processing apparatus provided in the foregoing embodiments may implement the technical solutions described in the foregoing method embodiments, and the specific implementation principle of each module may refer to the corresponding content in the foregoing method embodiments, which is not described herein again.
Fig. 5 is a block diagram showing a structure of a motion driving device according to still another embodiment of the present application. As shown in fig. 5, the apparatus includes: a second truncation module 501, a second search module 502, a second generation module 503, and a first driving module 504. Wherein,
a second truncation module 501, configured to truncate a first audio sequence segment from a first audio frame in the first audio sequence;
a second searching module 502, configured to search, according to the first audio sequence segment, in the first data set to obtain at least one candidate audio sequence segment similar to the first audio sequence segment; wherein the first data set comprises a plurality of reserve audio sequences;
a second generating module 503, configured to generate first motion state information matching the first audio frame according to at least one candidate motion sequence segment respectively matching the at least one candidate audio sequence segment;
a first driving module 504, configured to drive, when the sound-generating object generates the first audio frame in the first audio sequence, an action of the sound-generating object according to the first action state information.
By adopting the technical scheme provided by the embodiment of the application, the matched action sequence can be automatically generated for the unknown audio sequence. Compared with the prior art, the technical scheme provided by the embodiment of the application effectively avoids complex training sample labeling work and reduces the action generation cost; and for each audio frame in the audio sequence to be matched, at least one candidate audio sequence segment similar to the audio sequence segment in which each audio frame is located is searched in the first data set; the action state information matched with each audio frame in the audio sequence to be matched is generated through at least one candidate action sequence segment matched with at least one candidate audio sequence segment, so that certain controllability can be realized on the quality of generated actions, and the cooperativity of the actions and the audio is further improved.
Further, the first driving module 504 is specifically configured to:
and driving the corresponding skeleton motion of the sound-producing object according to the skeleton state information in the first action state information.
Further, the sound-producing object is a robot or a virtual character.
Here, it should be noted that: the motion driving apparatus provided in the above embodiments may implement the technical solutions described in the above method embodiments, and the specific implementation principle of each module may refer to the corresponding content in the above method embodiments, and is not described herein again.
Fig. 6 shows a block diagram of a human-computer interaction device according to another embodiment of the present application. As shown in fig. 6, the apparatus includes: a first receiving module 601, a third generating module 602, a third intercepting module 603, a third searching module 604, a fourth generating module 605 and a second driving module 606. Wherein,
a first receiving module 601, configured to receive input information of a user;
a third generating module 602, configured to generate a first audio sequence to be fed back by a feedback object according to the input information;
a third intercepting module 603, configured to intercept a first audio sequence segment from a first audio frame in the first audio sequence;
a third searching module 604, configured to search, according to the first audio sequence segment, in the first data set to obtain at least one candidate audio sequence segment similar to the first audio sequence segment; wherein the first data set comprises a plurality of reserve audio sequences;
a fourth generating module 605, configured to generate first motion state information matching the first audio frame according to at least one candidate motion sequence segment respectively matching the at least one candidate audio sequence segment;
a second driving module 606, configured to drive a feedback action of the feedback object according to the first action state information when the feedback object sends out the first audio frame in the first audio sequence.
By adopting the technical scheme provided by the embodiment of the application, the matched action sequence can be automatically generated for the unknown audio sequence. Compared with the prior art, the technical scheme provided by the embodiment of the application effectively avoids complex training sample labeling work and reduces the action generation cost; and for each audio frame in the audio sequence to be matched, at least one candidate audio sequence segment similar to the audio sequence segment in which each audio frame is located is searched in the first data set; the action state information matched with each audio frame in the audio sequence to be matched is generated through at least one candidate action sequence segment matched with at least one candidate audio sequence segment, so that certain controllability can be realized on the quality of generated actions, and the cooperativity of the actions and the audio is further improved.
Further, the first receiving module 601 is specifically configured to: and receiving voice information input by a user.
Here, it should be noted that: the human-computer interaction device provided in the above embodiments may implement the technical solutions described in the above method embodiments, and the specific implementation principle of each module may refer to the corresponding content in the above method embodiments, and will not be described herein again.
Fig. 7 shows a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 7, the electronic device includes a memory 1101 and a processor 1102. The memory 1101 may be configured to store other various data to support operations on the electronic device. Examples of such data include instructions for any application or method operating on the electronic device. The memory 1101 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The memory 1101 is used for storing programs;
the processor 1102 is coupled to the memory 1101, and configured to execute the program stored in the memory 1101, so as to implement the data processing method, the action driving method, or the human-computer interaction method provided by the foregoing method embodiments.
Further, as shown in fig. 7, the electronic device further includes: communication components 1103, display 1104, power components 1105, audio components 1106, and the like. Only some of the components are schematically shown in fig. 7, and the electronic device is not meant to include only the components shown in fig. 7.
Accordingly, the present application further provides a computer-readable storage medium storing a computer program, where the computer program can implement the steps or functions of the data processing method, the action driving method, and the human-computer interaction method provided by the above method embodiments when executed by a computer.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (23)

1. A data processing method, comprising:
intercepting a first audio sequence segment from a first audio frame in a first audio sequence;
according to the first audio sequence segment, searching at least one candidate audio sequence segment similar to the first audio sequence segment in the first data set; wherein the first data set comprises a plurality of reserve audio sequences;
and generating first action state information matched with the first audio frame according to at least one candidate action sequence segment respectively matched with the at least one candidate audio sequence segment.
2. The method of claim 1, wherein generating first motion state information matching the first audio frame according to at least one candidate motion sequence segment respectively matching the at least one candidate audio sequence segment comprises:
obtaining the ordering of the first audio frame in the first audio sequence segment;
and synthesizing the third action state information at the sorting position in each candidate action sequence segment to determine the first action state information.
3. The method of claim 2, wherein determining the first action state information by synthesizing third action state information at the ordering in each of the candidate action sequence segments comprises:
acquiring second action state information matched with a second audio frame in the first audio sequence; wherein the second audio frame precedes the first audio frame in the first audio sequence;
estimating a first state difference between the first action state information and the second action state information to be determined according to a second state difference between third action state information and the second action state information at the sorting position in each candidate action sequence segment and a weight corresponding to each candidate action sequence segment;
and determining the first action state information according to the second action state information and the first state difference.
4. The method of claim 3, wherein the at least one candidate action sequence segment comprises a first candidate action sequence segment;
the method further comprises the following steps:
acquiring at least one fourth action state information which is adjacent to the third action state information and is positioned in front of the third action state information in the first candidate action sequence segment and at least one fifth action state information which is respectively matched with at least one third audio frame in the first audio sequence segment; wherein the at least one third audio frame in the first audio sequence segment is adjacent to and precedes the first audio frame;
and determining the weight corresponding to the first candidate action sequence segment according to the first similarity between the at least one fourth motion state information and the at least one fifth motion state information.
5. The method of claim 4, wherein the at least one candidate audio sequence segment comprises a first candidate audio sequence segment that matches the first candidate action sequence segment;
the method further comprises the following steps:
obtaining at least one fifth audio frame in the first candidate audio sequence segment that precedes and is adjacent to a fourth audio frame at the ordering;
obtaining the at least one third audio frame in the first audio sequence segment;
calculating a second similarity between the at least one fifth audio frame and the at least one third audio frame;
correspondingly, determining the weight corresponding to the first candidate action sequence segment according to the first similarity between the at least one fourth motion state information and the at least one fifth motion state information includes:
and integrating the first similarity and the second similarity to determine the weight corresponding to the first candidate action sequence segment.
6. The method of any of claims 3 to 5, further comprising:
performing feature extraction on the third action state information to obtain a first feature;
performing feature extraction on the second action state information to obtain a second feature;
subtracting the second characteristic from the first characteristic to obtain a difference characteristic as a second state difference between the third action state information and the second action state information.
7. The method of claim 6, wherein determining the first action state information based on the second action state information and the first state difference comprises:
superimposing the first state difference on the second feature to obtain a third feature;
and performing feature restoration on the third feature to obtain the first action state information.
8. The method of any of claims 3-5, wherein the second audio frame is an audio frame that is previous to the first audio frame in the first audio sequence.
9. The method of any one of claims 1 to 5, further comprising:
determining at least one unmatched audio frame of the first audio sequence which is not matched with the action state currently;
determining the first audio frame from the at least one unmatched audio frame.
10. The method according to any of claims 1 to 5, wherein searching in the first data set for at least one candidate audio sequence segment similar to the first audio sequence segment based on the first audio sequence segment comprises:
performing feature extraction on the first audio sequence segment to obtain a first sequence segment feature;
and searching at least one candidate audio sequence segment similar to the first sequence segment in the first data set according to the first sequence segment characteristics.
11. The method according to any one of claims 1 to 5, wherein the first data set further comprises a plurality of reserve action sequences respectively matched to the plurality of reserve audio sequences;
the method further comprises the following steps:
from the first data set, at least one candidate action sequence segment respectively matching the at least one candidate audio sequence segment is obtained.
12. The method according to any one of claims 1 to 5, wherein the first motion state information comprises one or more of expression state information, limb state information, mouth shape state information.
13. A motion driving method, comprising:
intercepting a first audio sequence segment from a first audio frame in a first audio sequence;
according to the first audio sequence segment, searching at least one candidate audio sequence segment similar to the first audio sequence segment in the first data set; wherein the first data set comprises a plurality of reserve audio sequences;
generating first action state information matched with the first audio frame according to at least one candidate action sequence segment respectively matched with the at least one candidate audio sequence segment;
and when the sound-emitting object emits the first audio frame in the first audio sequence, driving the action of the sound-emitting object according to the first action state information.
14. The method of claim 13, wherein driving the action of the sound generating object according to the first action state information comprises:
and driving the corresponding skeleton motion of the sound-producing object according to the skeleton state information in the first action state information.
15. The method of claim 13, wherein the sound generating object is a robot or a virtual character.
16. A human-computer interaction method, comprising:
receiving input information of a user;
generating a first audio sequence to be fed back by a feedback object according to the input information;
intercepting a first audio sequence segment from a first audio frame in the first audio sequence;
according to the first audio sequence segment, searching at least one candidate audio sequence segment similar to the first audio sequence segment in the first data set; wherein the first data set comprises a plurality of reserve audio sequences;
generating first action state information matched with the first audio frame according to at least one candidate action sequence segment respectively matched with the at least one candidate audio sequence segment;
and when the feedback object sends out the first audio frame in the first audio sequence, driving the feedback action of the feedback object according to the first action state information.
17. The method of claim 16, wherein receiving input from a user comprises:
and receiving voice information input by a user.
18. An electronic device, comprising: a memory and a processor, wherein,
the memory is used for storing programs;
the processor, coupled with the memory, to execute the program stored in the memory to:
intercepting a first audio sequence segment from a first audio frame in a first audio sequence;
according to the first audio sequence segment, searching at least one candidate audio sequence segment similar to the first audio sequence segment in the first data set; wherein the first data set comprises a plurality of reserve audio sequences;
and generating first action state information matched with the first audio frame according to at least one candidate action sequence segment respectively matched with the at least one candidate audio sequence segment.
19. An electronic device, comprising: a memory and a processor, wherein,
the memory is used for storing programs;
the processor, coupled with the memory, to execute the program stored in the memory to:
intercepting a first audio sequence segment from a first audio frame in a first audio sequence;
according to the first audio sequence segment, searching at least one candidate audio sequence segment similar to the first audio sequence segment in the first data set; wherein the first data set comprises a plurality of reserve audio sequences;
generating first action state information matched with the first audio frame according to at least one candidate action sequence segment respectively matched with the at least one candidate audio sequence segment;
and when the sound-emitting object emits the first audio frame in the first audio sequence, driving the action of the sound-emitting object according to the first action state information.
20. An electronic device, comprising: a memory and a processor, wherein,
the memory is used for storing programs;
the processor, coupled with the memory, to execute the program stored in the memory to:
receiving input information of a user;
generating a first audio sequence to be fed back by a feedback object according to the input information;
intercepting a first audio sequence segment from a first audio frame in the first audio sequence;
according to the first audio sequence segment, searching at least one candidate audio sequence segment similar to the first audio sequence segment in the first data set; wherein the first data set comprises a plurality of reserve audio sequences;
generating first action state information matched with the first audio frame according to at least one candidate action sequence segment respectively matched with the at least one candidate audio sequence segment;
and when the feedback object sends out the first audio frame in the first audio sequence, driving the feedback action of the feedback object according to the first action state information.
21. A data processing method, comprising:
determining a picture to be displayed;
acquiring a first audio sequence associated with the picture to be displayed;
and acquiring a first action sequence matched with the first audio sequence of the animation in the picture according to the first audio sequence.
22. The method of claim 21, wherein the first action sequence comprises first action state information matched with each audio frame in the first audio sequence in sequence; the first audio sequence comprises a first audio frame;
according to the first audio sequence, acquiring a first action sequence matched with the first audio sequence of the animation in the picture, wherein the first action sequence comprises the following steps:
truncating a first audio sequence segment from the first audio frame in the first audio sequence;
according to the first audio sequence segment, searching at least one candidate audio sequence segment similar to the first audio sequence segment in the first data set; wherein the first data set comprises a plurality of reserve audio sequences;
and generating first action state information matched with the first audio frame according to at least one candidate action sequence segment respectively matched with the at least one candidate audio sequence segment.
23. A data processing method, comprising:
acquiring a video to be matched;
extracting the characteristics of the video to be matched to obtain video characteristics;
searching and obtaining matched audio matched with the video features from an audio data set according to the video features;
and adding the matched audio to the video to be matched to obtain an audio and video file.
CN201911045674.2A 2019-10-30 2019-10-30 Method and equipment for data processing, action driving and man-machine interaction Active CN112750184B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911045674.2A CN112750184B (en) 2019-10-30 2019-10-30 Method and equipment for data processing, action driving and man-machine interaction

Publications (2)

Publication Number Publication Date
CN112750184A true CN112750184A (en) 2021-05-04
CN112750184B CN112750184B (en) 2023-11-10

Family

ID=75640592

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911045674.2A Active CN112750184B (en) 2019-10-30 2019-10-30 Method and equipment for data processing, action driving and man-machine interaction

Country Status (1)

Country Link
CN (1) CN112750184B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101615302A (en) * 2009-07-30 2009-12-30 浙江大学 The dance movement generation method that music data drives based on machine learning
CN106875955A (en) * 2015-12-10 2017-06-20 掌赢信息科技(上海)有限公司 The preparation method and electronic equipment of a kind of sound animation
WO2018049979A1 (en) * 2016-09-14 2018-03-22 厦门幻世网络科技有限公司 Animation synthesis method and device
CN107918663A (en) * 2017-11-22 2018-04-17 腾讯科技(深圳)有限公司 audio file search method and device
WO2019100757A1 (en) * 2017-11-23 2019-05-31 乐蜜有限公司 Video generation method and device, and electronic apparatus
CN109064532A (en) * 2018-06-11 2018-12-21 上海咔咖文化传播有限公司 The automatic shape of the mouth as one speaks generation method of cartoon role and device
CN109862422A (en) * 2019-02-28 2019-06-07 腾讯科技(深圳)有限公司 Method for processing video frequency, device, computer readable storage medium and computer equipment
CN110136698A (en) * 2019-04-11 2019-08-16 北京百度网讯科技有限公司 For determining the method, apparatus, equipment and storage medium of nozzle type

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
QIN JIN et al.: "Video Description Generation using Audio and Visual Cues", ACM, pages 239-242 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114708367A (en) * 2022-03-28 2022-07-05 长沙千博信息技术有限公司 WebGL-based sign language digital human driving and real-time rendering system
CN115861492A (en) * 2022-12-15 2023-03-28 北京百度网讯科技有限公司 Audio-based action generation method, device, equipment and medium
CN115861492B (en) * 2022-12-15 2023-09-29 北京百度网讯科技有限公司 Audio-based action generation method, device, equipment and medium

Also Published As

Publication number Publication date
CN112750184B (en) 2023-11-10

Similar Documents

Publication Publication Date Title
CN109377539B (en) Method and apparatus for generating animation
CN108806656B (en) Automatic generation of songs
CN108806655B (en) Automatic generation of songs
JP6786751B2 (en) Voice connection synthesis processing methods and equipment, computer equipment and computer programs
US9911218B2 (en) Systems and methods for speech animation using visemes with phonetic boundary context
CN111935537A (en) Music video generation method and device, electronic equipment and storage medium
US9997153B2 (en) Information processing method and information processing device
US8972265B1 (en) Multiple voices in audio content
US11968433B2 (en) Systems and methods for generating synthetic videos based on audio contents
US20230215068A1 (en) Method for outputting blend shape value, storage medium, and electronic device
CN114144790A (en) Personalized speech-to-video with three-dimensional skeletal regularization and representative body gestures
CN114419205B (en) Driving method of virtual digital person and training method of pose acquisition model
CN111145777A (en) Virtual image display method and device, electronic equipment and storage medium
CN113923462A (en) Video generation method, live broadcast processing method, video generation device, live broadcast processing device and readable medium
CN112309365A (en) Training method and device of speech synthesis model, storage medium and electronic equipment
CN112750184B (en) Method and equipment for data processing, action driving and man-machine interaction
CN114999441A (en) Avatar generation method, apparatus, device, storage medium, and program product
Wang et al. HMM trajectory-guided sample selection for photo-realistic talking head
CN114356084A (en) Image processing method and system and electronic equipment
Xie et al. A statistical parametric approach to video-realistic text-driven talking avatar
CN116958342A (en) Method for generating actions of virtual image, method and device for constructing action library
Furukawa et al. Voice animator: Automatic lip-synching in limited animation by audio
JP3755503B2 (en) Animation production system
CN115529500A (en) Method and device for generating dynamic image
Kolivand et al. Realistic lip syncing for virtual character using common viseme set

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant