CN114741561A - Action generating method, device, electronic equipment and storage medium - Google Patents

Action generating method, device, electronic equipment and storage medium

Info

Publication number
CN114741561A
CN114741561A (application CN202210463597.8A)
Authority
CN
China
Prior art keywords
motion
action
feature
predicted
characteristic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210463597.8A
Other languages
Chinese (zh)
Inventor
李思尧
余伟江
顾天培
林纯泽
王权
钱晨
吕健勤
刘子纬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanyang Technological University
Sensetime International Pte Ltd
Original Assignee
Nanyang Technological University
Sensetime International Pte Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanyang Technological University, Sensetime International Pte Ltd filed Critical Nanyang Technological University
Publication of CN114741561A publication Critical patent/CN114741561A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7834Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using audio features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/907Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Library & Information Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure relates to an action generation method and apparatus, an electronic device, and a storage medium. The method includes the following steps: acquiring a temporally associated predicted action feature sequence; determining, from a preset action feature library, quantized action features matched with each predicted action feature in the predicted action feature sequence; determining a spatial association relationship of each predicted action feature in the predicted action feature sequence; and combining the quantized action features matched with the predicted action features according to the spatial association relationship of the predicted action features and the temporal association relationship of the predicted action features to obtain a target action feature sequence. The method and the device can take into account both the temporal association relationship and the spatial association relationship of each predicted action feature, and can therefore improve the standardization of the generated target action feature sequence and improve its beat-level consistency with the audio rhythm.

Description

Action generating method, device, electronic equipment and storage medium
Cross Reference to Related Applications
This application claims priority to Singapore patent application 10202202011P, filed with the Intellectual Property Office of Singapore on 28 February 2022, the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the field of computer information technology, and relates to, but is not limited to, a method and an apparatus for generating an action, an electronic device, and a storage medium.
Background
Currently, in some scenes, such as virtual scenes like games or animations, there is a need to simulate actions, such as dancing or artistic gymnastics, that follow the music currently being played. However, in the prior art, the motion generated for a given piece of music is neither consistent with the music rhythm nor smooth, so it is difficult to produce satisfactory motion.
Disclosure of Invention
The disclosure provides an action generation method, an action generation device, an electronic device and a storage medium.
According to a first aspect of embodiments of the present disclosure, there is provided an action generation method, including: acquiring a temporally associated predicted action feature sequence, wherein the predicted action feature sequence is generated at least from audio features; determining, from a preset action feature library, quantized action features matched with each predicted action feature in the predicted action feature sequence; determining a spatial association relationship of each predicted action feature in the predicted action feature sequence; and combining the quantized action features matched with the predicted action features according to the spatial association relationship of the predicted action features and the temporal association relationship of the predicted action features to obtain a target action feature sequence.
According to a second aspect of the embodiments of the present disclosure, there is provided an action generating apparatus including: the acquisition module is configured to acquire a predicted action characteristic sequence related in time sequence; wherein the sequence of predicted motion features is generated from at least audio features; the first determination module is configured to determine quantitative action characteristics matched with each predicted action characteristic in the predicted action characteristic sequence from a preset action characteristic library; a second determination module configured to determine a spatial association relationship between each of the predicted motion features in the sequence of predicted motion features; and the combination module is configured to combine the quantized motion characteristics matched with the predicted motion characteristics according to the association relation of the predicted motion characteristics in space and the association relation of the predicted motion characteristics in time sequence to obtain a target motion characteristic sequence.
According to a third aspect of embodiments of the present disclosure, there is provided an electronic device comprising a processor and a memory for storing a computer program operable on the processor; wherein the processor is configured to run the computer program to perform any one of the above-mentioned action generating methods.
According to a fourth aspect of the embodiments of the present disclosure, there is provided a computer storage medium having a computer program stored thereon, the computer program, when executed by a processor, implementing any one of the above-described action generation methods.
The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:
in the embodiment of the disclosure, quantized action features matched with each temporally associated predicted action feature can be determined from a preset action feature library, and the quantized action features matched with the predicted action features can be combined according to the spatial association relationship of the predicted action features and their temporal association relationship to obtain a target action feature sequence.
In the process of generating the target action feature sequence, both the temporal association relationship and the spatial association relationship of each predicted action feature can be taken into account, so that the standardization of the generated target action feature sequence can be improved, and its beat-level consistency with the audio rhythm can be improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a flowchart illustrating an action generation method according to an embodiment of the present disclosure.
Fig. 2 is a schematic diagram illustrating an action generation method according to an embodiment of the present disclosure.
Fig. 3 is a schematic diagram illustrating a feature processing method according to an embodiment of the disclosure.
Fig. 4 is a schematic diagram illustrating an action prediction method according to an embodiment of the disclosure.
Fig. 5a is a first schematic diagram illustrating a method for determining a degree of association according to an embodiment of the present disclosure.
Fig. 5b is a schematic diagram of a second method for determining a degree of association according to the embodiment of the present disclosure.
Fig. 5c is a third schematic diagram illustrating a method for determining a degree of association according to an embodiment of the present disclosure.
Fig. 6 is a schematic diagram illustrating a method for determining a reward according to an embodiment of the disclosure.
Fig. 7 is a schematic diagram illustrating a target action according to an embodiment of the disclosure.
Fig. 8 is a schematic structural diagram of an action generation apparatus according to an embodiment of the present disclosure.
Fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The embodiment of the present disclosure provides an action generation method, which may be applied to an electronic device. The electronic device may include a terminal device, for example a mobile terminal, a fixed terminal, or a vehicle-mounted terminal. The mobile terminal may include a mobile phone, a tablet computer, a notebook computer, or a wearable device, and may further include smart home equipment such as a smart speaker. The fixed terminal may include a desktop computer, a smart television, or the like. The vehicle-mounted terminal may include front-end equipment of a vehicle monitoring and management system, which may also be referred to as a Telematics Control Unit (TCU). The functions realized by the method may also be implemented by a computer system consisting of a terminal and/or a server. Here, the terminal may be a thin client, a thick client, a hand-held or laptop device, a microprocessor-based system, a set-top box, a programmable consumer electronics device, a network personal computer, or a small computer system, and the server may be a server computer system, a small computer system, a mainframe computer system, or a distributed cloud computing environment including any of the above systems.
Fig. 1 is a flowchart illustrating an action generation method according to an exemplary embodiment, as shown in fig. 1, which mainly includes the following steps:
in step 101, acquiring a time-series associated predicted action characteristic sequence; wherein the sequence of predicted motion features is generated from at least audio features;
in step 102, determining quantized motion characteristics matched with each predicted motion characteristic in the predicted motion characteristic sequence from a preset motion characteristic library;
in step 103, determining a spatial association relationship of each predicted motion feature in the sequence of predicted motion features;
in step 104, the quantized motion features matched with the predicted motion features are combined according to the spatial association relationship and the time-series association relationship of the predicted motion features, so as to obtain a target motion feature sequence.
In the embodiment of the present disclosure, an action feature may be understood as data characterizing an action attribute, and may include the three-dimensional coordinates of the respective joint points of a motion-generating object (e.g., a virtual character), a motion vector representing the motion, or the like. One action feature corresponds to one action, and an action feature can be represented as a vector or an index value. In some embodiments, the electronic device may set corresponding index values for different actions in advance, for example: the index value corresponding to a hand-raising action is 1, the index value corresponding to a leg-kicking action is 2, the index value corresponding to a squatting action is 3, and so on; the disclosure is not limited thereto. Different action features may represent different actions, for example: one action feature may represent standing upright, another may represent straddling, etc.
A sequence of action features may be understood as a collection of a plurality of action features, for example: the motion characteristic sequence comprises three motion characteristics, wherein the first motion characteristic is to lift a hand, the second motion characteristic is to bend a waist, and the third motion characteristic is to lift a foot. Chronological association is to be understood as that the individual action features are associated on the basis of a chronological order, and the individual action features in the action feature sequence may have a chronological association. For example: actions characterized by a first action characteristic occur before, actions characterized by a second action characteristic occur after, actions characterized by a first action characteristic occur, actions characterized by a third action characteristic occur after, actions characterized by a second action characteristic occur, and so on.
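As a purely illustrative sketch (the data structure, joint count, and index values below are assumptions for demonstration, not part of the disclosure), an action feature of the kind described above could be represented as follows:

```python
# Illustrative sketch only: an action feature as raw 3-D joint coordinates
# plus an optional index value; a sequence is simply an ordered list of them.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ActionFeature:
    # (x, y, z) coordinates for each joint of the motion-generating object
    joint_coords: List[Tuple[float, float, float]]
    # optional index value, e.g. 1 = raise hand, 2 = kick, 3 = squat
    index: int = -1

# A temporally associated action feature sequence: earlier features occur first.
sequence: List[ActionFeature] = [
    ActionFeature(joint_coords=[(0.0, 1.5, 0.0), (0.2, 1.2, 0.1)], index=1),  # raise hand
    ActionFeature(joint_coords=[(0.0, 1.0, 0.0), (0.3, 0.4, 0.2)], index=3),  # squat
]
```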
In the embodiment of the disclosure, the electronic device may generate the predicted action feature sequence at least according to audio features. Audio features may be understood as data characterizing audio attributes, and may include one or more of the sampling frequency, bit rate, number of channels, frame rate, and the like. The predicted action feature sequence may be understood as an action feature sequence that corresponds to, and is generated based on, information such as the audio features. The electronic device can generate the predicted action feature sequence through an action prediction model, which can be understood as a trained neural network into which audio features are input to obtain the corresponding predicted action feature sequence. For example: the electronic device can generate a predicted action feature sequence for a virtual teacher according to the audio features of knowledge-point content, which can enrich the teaching content and stimulate students' interest in learning. The electronic device can also generate a predicted action feature sequence matched with music information according to the audio features of the music information and control a virtual anchor to display the dance actions corresponding to the predicted action feature sequence, reducing the labor cost of maintaining the virtual anchor.
In a possible embodiment, the electronic device may also generate a predicted motion feature sequence according to the audio feature and the initial motion feature, where the initial motion feature may be understood as a preset motion feature, may include one motion feature, may include multiple motion features, and the like. The electronic device may further generate a predicted motion feature sequence according to data such as audio features and scene features, where the scene features may be understood as a usage scene of the predicted motion feature sequence, for example: teaching scenes, customer service scenes or navigation scenes and the like. The present disclosure is not limited to the manner in which the electronic device obtains the temporally related predicted motion characteristic sequence.
In the embodiment of the disclosure, after the electronic device obtains the predicted action feature sequence, the quantized action features matched with each predicted action feature in the predicted action feature sequence may be determined from a preset action feature library. The preset action feature library may be understood as a set including a plurality of quantitative action features, and the electronic device may summarize and quantize the collected plurality of standard and reasonable actions into a limited, sparse and indexable quantitative action feature in advance, and store the quantitative action feature in the preset action feature library.
In some embodiments, the electronic device may learn the motion feature library in an unsupervised manner, so as to obtain the trained preset motion feature library. Each quantized motion feature in the preset motion feature library may correspond to a motion, for example: the preset action characteristic library comprises 512 quantified action characteristics represented by different actions. Quantifying the action features may be understood as high-dimensional features with sparse constraints. The electronic device may determine a matching relationship between the predicted motion feature and the quantized motion feature by determining a similarity between the predicted motion feature and the quantized motion feature. For example: the action characteristic library has 3 quantized action characteristics, the electronic device determines that the similarity between the first predicted action characteristic and the 3 quantized action characteristics is 0.5 (corresponding to the first quantized action characteristic), 0.9 (corresponding to the second quantized action characteristic) and 0.4 (corresponding to the third quantized action characteristic) through a preset similarity calculation strategy, and the similarity between the second predicted action characteristic and the 3 quantized action characteristics is 0.3 (corresponding to the first quantized action characteristic), 0.6 (corresponding to the second quantized action characteristic) and 0.85 (corresponding to the third quantized action characteristic). Taking the matching between the predicted motion feature with the greatest similarity and the quantized motion feature as an example, the electronic device may determine that the first predicted motion feature matches the second quantized motion feature, that the second predicted motion feature matches the third quantized motion feature, and so on.
In one possible embodiment, the electronic device may determine a quantized motion feature matching the predicted motion feature from a preset motion feature library. The corresponding relationship between the quantity of the predicted action features and the quantity of the matched quantized action features can comprise the following steps: one predicted motion feature corresponds to one quantized motion feature, or one predicted motion feature corresponds to at least two quantized motion features, etc. The time-series correlation between the predicted motion characteristics may be the same as or different from the time-series correlation between the quantized motion characteristics.
In the embodiment of the present disclosure, after the quantized motion features matched with the predicted motion features are determined, the spatial association relationship of each predicted motion feature in the predicted motion feature sequence may be determined. The spatial relationship may be understood as a relationship between positions, a relationship between moving distances, a relationship between moving directions, and the like, for example: the spatial association relationship between the first predicted motion characteristic and the second predicted motion characteristic may be a mirror relationship or the like. The electronic device may determine a spatial association of the predicted motion features by quantizing the motion features that match the predicted motion features.
For example: the electronic equipment analyzes a first quantized motion characteristic, determines a generation object of motion represented by the first quantized motion characteristic, determines three-dimensional coordinates of each joint point of a body part of the object as { (1,2,3),. }, analyzes a second quantized motion characteristic, determines a generation object of motion represented by the second quantized motion characteristic, determines three-dimensional coordinates of each joint point of the body part of the object as { (4,5,6),. }, and can determine an association relation of each predicted motion characteristic on space through a relation between the three-dimensional coordinates of each joint point. Here, the relationship between the three-dimensional coordinates of the respective joint points may be understood as a relationship between the same joint points in different quantized motion features, for example: the three-dimensional coordinates of the elbow joint represented by the first quantized motion features are (1,2,3), the three-dimensional coordinates of the elbow joint represented by the second quantized motion features are (4,6,7), and the relationship between the three-dimensional coordinates of the joint points may be the relationship between the three-dimensional coordinates of the elbow joint represented by different quantized motion features.
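A small sketch of this comparison, using invented joint coordinates: the same joints are compared across two quantized motion features to obtain per-joint movement distances and directions, one possible way to express the spatial relationship described above.

```python
# Hedged sketch (not the patent's exact computation): compare the same joints
# across two quantized motion features. Coordinates below are invented examples.
import numpy as np

first_feature = np.array([[1.0, 2.0, 3.0], [0.5, 1.0, 0.2]])    # J x 3, e.g. elbow, wrist
second_feature = np.array([[4.0, 6.0, 7.0], [0.7, 1.4, 0.1]])   # same joints, next feature

displacement = second_feature - first_feature        # per-joint movement vector
distance = np.linalg.norm(displacement, axis=1)       # per-joint movement distance
direction = displacement / (distance[:, None] + 1e-8) # unit movement direction
print(distance)
print(direction)
```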
In the embodiment of the present disclosure, after determining the spatial association relationship of each predicted action feature, the electronic device may combine the quantized action features matched with each predicted action feature according to the spatial association relationship of each predicted action feature and the time-series association relationship of each predicted action feature, so as to obtain a target action feature sequence. The target motion characteristic sequence may be understood as an optimal motion characteristic sequence to be generated by the object, and the size of the target motion characteristic sequence may be the same as or different from that of the predicted motion characteristic sequence, and the disclosure is not limited thereto. For example: the electronic device determines a spatial relationship of each predicted motion feature (e.g., a relationship between three-dimensional coordinates of joint points of motion generation objects in the predicted motion feature), and a temporal relationship of each predicted motion feature (e.g., a relationship between generation timings of each predicted motion feature, etc.), wherein the generation timing of the predicted motion feature may be a timing at which the predicted motion feature is obtained.
The electronic device may combine the quantized motion features that match the respective predicted motion features. For example: the electronic device determines that the generation time of the first predicted motion characteristic is a first time, determines that the generation time of the second predicted motion characteristic is a second time, and determines that the first time and the second time are adjacent in time sequence. The electronic device determines a first set of three-dimensional coordinates of a joint point of the motion-generating object in the first quantized motion feature and a second set of three-dimensional coordinates of a joint point of the motion-generating object in the second quantized motion feature, respectively, by determining the first quantized motion feature matching the first predicted motion feature and the second quantized motion feature matching the second predicted motion feature. The electronic equipment determines information such as a moving distance and a moving direction of the same joint point in the first group of three-dimensional coordinates and the second group of three-dimensional coordinates on the space, and further determines the action of the action generation object from the first moment to the second moment. The electronic device may control the motion-generating object to transition smoothly from the motion at the first time to the motion at the second time, for example: if the motion generated at the first time (e.g., time 5 seconds) is raising the hand to a height of 1.1m and the motion generated at the second time (e.g., time 7 seconds) is raising the hand to a height of 1.2m, then it may be determined that the motion generated at time 6 seconds may be raising the hand to a height of 1.15m, etc.
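The hand-raising example above amounts to linear interpolation between two generated moments; a minimal sketch follows, with the times and heights taken from the text and a hypothetical helper function.

```python
# Sketch of the smoothing idea: linearly interpolate a keypoint coordinate
# between two generated moments. The helper function is an assumption.
def interpolate_keypoint(t, t0, v0, t1, v1):
    """Linear interpolation of a coordinate between times t0 and t1."""
    ratio = (t - t0) / (t1 - t0)
    return v0 + ratio * (v1 - v0)

# hand raised to 1.1 m at 5 s and 1.2 m at 7 s -> about 1.15 m at 6 s
hand_height_6s = interpolate_keypoint(6.0, 5.0, 1.10, 7.0, 1.20)
print(hand_height_6s)  # 1.15
```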
In one possible embodiment, a pre-trained neural network may be used to determine the target motion feature sequence based on the audio features and the initial motion features. Fig. 2 is a schematic diagram of an action generation method according to an exemplary embodiment. As shown in fig. 2, audio features (Music features) m and initial motion features (Starting pos codes) p0 may be input. The initial motion feature p0 may consist of at least two parts, for example the initial motion feature of the upper-body motion and the initial motion feature of the lower-body motion, which may be represented by index numbers; for example, an initial motion feature of (14,13) may represent an upper-body motion and a lower-body motion. Then, a predicted motion feature sequence (pose code sequence) is generated by using a target motion prediction model (Actor-Critic Motion GPT) 201; the predicted motion feature sequence may include p0 = (14,13), p1 = (27,68), ..., pk = (0,7), etc., where k is a positive integer. Quantized motion features (Quantized features) that match the respective predicted motion features in the predicted motion feature sequence may then be determined from the preset motion feature library 202 based on the predicted motion feature sequence. The preset motion feature library may be referred to as a Choreographic Memory Codebook. The preset motion feature library may include N sets of quantized motion features 0 to (N-1), each of which may represent an upper-body motion and a lower-body motion, respectively. Finally, a target motion feature sequence, which may be a generated dance (Generated dance) or the like, may be generated from the quantized motion features using convolutional neural network decoders (CNN Decoders) 203.
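As an illustration of this pipeline, the following sketch wires together stand-ins for the three stages shown in Fig. 2: a pose-code predictor, the codebook lookup, and a convolutional decoder. All module choices, layer sizes, and feature dimensions are assumptions for demonstration; the patent's Actor-Critic Motion GPT, Choreographic Memory Codebook, and CNN decoders are trained models with their own architectures.

```python
# Schematic sketch of the Fig. 2 pipeline under assumed shapes; every module
# below is a stand-in, not the patent's implementation.
import torch
import torch.nn as nn

class MotionPipeline(nn.Module):
    def __init__(self, codebook_size=512, code_dim=128, music_dim=438, joints=24):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, code_dim)   # preset motion feature library
        self.predictor = nn.GRU(music_dim + 2, codebook_size,
                                batch_first=True)                # stand-in for the Motion GPT
        self.decoder = nn.Sequential(                            # stand-in for a CNN decoder
            nn.Conv1d(code_dim, 256, 3, padding=1), nn.ReLU(),
            nn.Conv1d(256, joints * 3, 3, padding=1))

    def forward(self, music_feat, start_codes):
        # 1) predict a pose-code sequence from music features + starting pose codes
        x = torch.cat([music_feat, start_codes], dim=-1)         # (B, T, music_dim + 2)
        logits, _ = self.predictor(x)                            # (B, T, codebook_size)
        codes = logits.argmax(dim=-1)                            # predicted pose codes p0..pk
        # 2) look up the matching quantized features in the codebook
        quantized = self.codebook(codes)                         # (B, T, code_dim)
        # 3) decode quantized features into joint positions (the generated dance)
        return self.decoder(quantized.transpose(1, 2)).transpose(1, 2)  # (B, T, J*3)

model = MotionPipeline()
music = torch.randn(1, 240, 438)                                 # 240 frames of music features
start = torch.tensor([[[14.0, 13.0]]]).repeat(1, 240, 1)         # starting pose codes per frame
dance = model(music, start)                                      # (1, 240, 72)
```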
In the embodiment of the disclosure, quantized action features matched with each temporally associated predicted action feature can be determined from a preset action feature library, and the quantized action features matched with the predicted action features can be combined according to the spatial association relationship of the predicted action features and their temporal association relationship to obtain a target action feature sequence. In the process of generating the target action feature sequence, both the temporal association relationship and the spatial association relationship of each predicted action feature can be taken into account, so that the standardization of the generated target action feature sequence can be improved, and its beat-level consistency with the audio rhythm can be improved.
In some embodiments, the determining, from a preset motion feature library, quantized motion features that match respective predicted motion features in the sequence of predicted motion features includes:
respectively carrying out dimensionality reduction processing on each predicted action characteristic to obtain coding characteristics; the dimension of the coding feature is the same as the dimension of each quantitative action feature in the preset action feature library;
and according to the distance between the coding feature and each quantized motion feature in the preset motion feature library, determining the quantized motion feature matched with each predicted motion feature from the preset motion feature library.
In the embodiment of the present disclosure, the electronic device may perform dimension reduction processing on each predicted action feature to obtain a coding feature, where a dimension of the coding feature is the same as a dimension of each quantized action feature in the preset action feature library. In this way, the dimension of the coding feature can be the same as the dimension of the quantization action feature, so that the processing such as comparison, calculation and the like can be performed, and meanwhile, the dimension of the feature is reduced, so that the calculation amount of the subsequent processing can be reduced, the calculation efficiency is improved, and the like. For example: the dimension of the predicted motion feature sequence may be (T × (J × 3)), where T may represent the number of predicted motion features (i.e., the sequence length of the predicted motion feature sequence), J may represent the number of respective joint points of the subject body part, and 3 may represent the three-dimensional coordinates of the respective joint points. The dimension of the quantized motion feature sequence may be (T '× C), where T' may represent the number of quantized motion features (i.e., the sequence length of the quantized motion feature sequence), and C may represent the dimension of the quantized motion features.
In the embodiment of the disclosure, after the electronic device obtains the coding features, the quantized motion features matched with the predicted motion features may be determined from the preset motion feature library according to the distance between the coding features and each quantized motion feature in the preset motion feature library. The distance between the coding feature and the quantization action feature can be understood as the similarity or the degree of association between the features, and the like, and can be realized by determining the vector distance between different features. For example: the electronic device may determine the similarity between different features by determining a euclidean distance, a manhattan distance, a chebyshev distance, a mahalanobis distance, and other vector distances between the coded features and the quantized motion features. For example: the electronic equipment determines that the distances between the first coding feature and each quantized motion feature in the preset motion feature library are respectively 1,2,3 and the like, and the distances between the second coding feature and each quantized motion feature in the preset motion feature library are respectively 3, 4,5 and the like. Taking the minimum distance as an example of a rule for matching between different features, it may be determined that the first coding feature matches with a first quantized motion feature in the preset motion feature library, and the second coding feature matches with a third quantized motion feature in the preset motion feature library.
In the embodiment of the disclosure, the dimension reduction processing is performed on each predicted action feature to obtain the coding feature with the same dimension as that of the quantized action feature, and then the quantized action feature matched with each predicted action feature can be simply and accurately determined from the preset action feature library according to the distance between the coding feature and each quantized action feature in the preset action feature library, so that the calculation efficiency is improved, and the predicted action feature is converted into the pre-trained quantized action feature with sparse constraint.
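A minimal sketch of these two steps, assuming a toy linear encoder and a random codebook in place of the trained pose encoder and preset motion feature library:

```python
# Sketch: reduce each predicted pose to a C-dimensional encoding, then match it
# to the nearest quantized feature by Euclidean distance. Shapes are invented.
import torch

T, J, C, N = 8, 24, 128, 512
poses = torch.randn(T, J * 3)            # predicted motion features, T x (J*3)
encoder = torch.nn.Linear(J * 3, C)      # stand-in for the pose encoder
codebook = torch.randn(N, C)             # stand-in preset motion feature library

encodings = encoder(poses)               # T' x C coding features (here T' == T)
dists = torch.cdist(encodings, codebook) # T' x N pairwise Euclidean distances
matched_idx = dists.argmin(dim=1)        # index of the closest quantized feature
quantized = codebook[matched_idx]        # T' x C matched quantized motion features
```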
In some embodiments, the determining, from the preset motion feature library, quantized motion features matching the respective predicted motion features according to the distance between the encoding feature and each quantized motion feature in the preset motion feature library includes:
determining the distance between the coding feature to be processed and each quantized motion feature in the preset motion feature library;
sequencing the distances corresponding to the coding features to be processed to obtain a sequencing result;
according to the sequencing result, determining quantization action characteristics corresponding to the coding characteristics to be processed;
and determining the quantization action characteristics matched with the prediction action characteristics according to the quantization action characteristics corresponding to the coding characteristics to be processed.
In the embodiment of the disclosure, the electronic device may determine a distance between the coding feature to be processed and each quantized motion feature in the preset motion feature library. For example: the coded feature sequence comprises 3 coded features, the preset action feature library comprises 10 quantized action features, and the electronic device can determine distances between the first coded feature and the 10 quantized action features respectively, such as 3, 5, 2, ..., and the like, determine distances between the second coded feature and the 10 quantized action features respectively, such as 7, 4, 6, ..., and the like, and determine distances between the third coded feature and the 10 quantized action features respectively, such as 1, 4, 3, ..., and the like.
After determining the distance between the coding feature to be processed and each quantized motion feature in the preset motion feature library, the electronic device may sort the distances corresponding to the coding feature to be processed, and determine the quantized motion feature corresponding to the coding feature to be processed according to the sorting result. For example: by ordering the distances of the same coding feature according to a rule from small to large, it can be determined that the first coding feature corresponds to (2, 3, 5, ...), the second coding feature corresponds to (4, 6, 7, ...), the third coding feature corresponds to (1, 3, 4, ...), and the like. Taking the minimum distance as an example of the matching condition, it may be determined that the first coding feature corresponds to a third quantized motion feature in the preset motion feature library, the second coding feature corresponds to a second quantized motion feature in the preset motion feature library, and the third coding feature corresponds to the first quantized motion feature in the preset motion feature library.
After determining the quantized motion features corresponding to the to-be-processed coding features, the electronic device may determine the quantized motion features matched with the predicted motion features according to the quantized motion features corresponding to the to-be-processed coding features. For example: the electronic device determines that the first coding feature corresponds to a third quantization action feature, the second coding feature corresponds to a second quantization action feature, the third coding feature corresponds to the first quantization action feature, and so on. The electronic device may determine a quantized motion feature matched with the predicted motion feature according to a corresponding relationship between the predicted motion feature and the coding feature, where the corresponding relationship between the predicted motion feature and the coding feature may be that one predicted motion feature corresponds to one coding feature, or that a plurality of predicted motion features correspond to one coding feature. For example: the first predicted motion characteristic corresponds to a first coding characteristic and the second predicted motion characteristic corresponds to a second coding characteristic, and then the first predicted motion characteristic corresponds to a third quantized motion characteristic and the second predicted motion characteristic corresponds to a second quantized motion characteristic, and so on.
In the embodiment of the disclosure, the distances corresponding to the coding features to be processed are sorted by determining the distances between the coding features to be processed and the quantization action features in the preset action feature library, and then the quantization action features corresponding to the coding features to be processed can be determined according to the sorting result, so that each predicted action feature can be simply and accurately converted into the quantization action feature according to the quantization action feature corresponding to each coding feature to be processed.
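A small sketch of the sorting-based variant described above, with arbitrary feature counts; the distance matrix and the argsort ranking are one possible realization, not the patent's exact procedure:

```python
# Sketch: compute all coding-feature / codebook distances, sort each row from
# small to large, and read off the best match. Feature counts are arbitrary.
import numpy as np

codings = np.random.rand(3, 16)    # 3 coding features to process
codebook = np.random.rand(10, 16)  # 10 quantized action features

# pairwise Euclidean distances, shape (3, 10)
dists = np.linalg.norm(codings[:, None, :] - codebook[None, :, :], axis=-1)
ranking = np.argsort(dists, axis=1)   # sorted codebook indices per coding feature
best = ranking[:, 0]                  # smallest distance == matching quantized feature
print(best)
```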
In some embodiments, the determining the spatial relationship between each of the predicted motion features in the sequence of predicted motion features includes:
determining relative position information between each key point and a preset standard point of a target role and displacement information of the target role according to each quantitative action characteristic; wherein the predicted action feature sequence is generated according to the audio features and the initial action features of the target role;
and determining the spatial association relationship of each predicted action characteristic according to the relative position information and the displacement information.
In the embodiment of the disclosure, the electronic device determines, according to each quantized motion characteristic, relative position information between each key point of the target role and a preset standard point and displacement information of the target role. The electronic device may generate a predicted motion feature sequence according to an audio feature and an initial motion feature of the target character, where the audio feature may include at least one rhythm, the initial motion feature may include at least one beat (also referred to as a rhythm), and one audio rhythm may correspond to the at least one motion beat. The electronic equipment can set initial action characteristics in a user-defined mode, and one action characteristic is randomly selected from the initial action characteristic set to serve as the initial action characteristic under the condition that a predicted action characteristic sequence needs to be generated according to the initial action characteristics. The target role can be understood as an object for generating actions, and can include virtual images such as a virtual teacher, and the electronic device can control the target role to perform action display according to the target action characteristic sequence. The electronic device may use joint points of respective joints of the body part of the target character as key points for characterizing the motion attributes, such as: shoulder, hip, knee, etc. The electronic device may use movement information between preset standard points as displacement information of the target role, where the preset standard points may be understood as any one of the key points, for example: the key points corresponding to the hip joint can be used as preset standard points.
In one possible embodiment, the electronic device may obtain the relative position information between each key point and the preset standard point by inputting the quantized motion characteristics into the first decoder. The relative position information may be understood as three-dimensional coordinate information between the key points. For example: the first key point is located in the direction of 30 degrees of the preset standard point, and the distance is 10 centimeters; the second key point is located in the 60-degree direction of the preset standard point, and the distance is 12 cm and the like. The electronic device may obtain displacement information of the target role by inputting the quantized motion features into the second decoder, where the displacement information may be understood as a movement attribute between motions represented by different quantized motion features, and the like. For example: the motion characterized by the second quantified motion characteristic appears to be moved 20 centimeters to the left, relative to the motion characterized by the first quantified motion characteristic. That is, the spatial association relationship of each predicted action characteristic can be expressed as relative position information between each key point of the target role and the preset standard point, displacement information of the target role, and the like.
In one possible embodiment, a neural network trained in advance (i.e., a target feature processing model) may be used to determine, from a preset motion feature library, the quantized motion features matching the respective predicted motion features in the predicted motion feature sequence, and to determine the spatial association relationship of the respective predicted motion features in the predicted motion feature sequence. Fig. 3 is a schematic diagram of a feature processing method according to an exemplary embodiment. As shown in fig. 3, a predicted motion feature sequence may be input, which may include an upper body motion and a lower body motion. A predicted motion feature in the predicted motion feature sequence may be represented by P, and the dimension of the predicted motion feature sequence may be T × (J × 3), where T may represent the length of the sequence, J may represent the number of joint points of the body part of the subject generating the motion, and 3 may represent the three-dimensional coordinates of each joint point. Then, dimension reduction processing is performed on each predicted motion feature by a pose encoder (Pose Encoder) E 301 to obtain an encoding feature sequence (Encoding features); an encoding feature may be represented by e, the dimension of the encoding feature sequence may be T' × C, where T' may represent the length of the sequence and C may represent the dimension of the feature. Quantized motion features (Quantized features) that match the respective predicted motion features, which may be denoted by e_q, may then be determined from a preset motion feature library (Choreographic Memory Codebook) 302 based on the encoding feature sequence; the quantized motion feature sequence may have dimension T' × C. Then, according to the quantized motion feature sequence, a pose decoder (Pose Decoder) Dp 303 and a velocity decoder (Global Velocity Decoder) Dv 304 may be used to determine, respectively, the relative position information between each key point of the target character and the preset standard point (which may be denoted by \hat{p}) and the displacement information of the target character (which may be denoted by \hat{v}); the vector feature of the relative position information may be T × (J × 3), and the vector feature of the displacement information may be T × 3, etc.
In the embodiment of the disclosure, the relative position information between each key point of the target role and the preset standard point and the displacement information of the target role are determined according to each quantized action characteristic, wherein the predicted action characteristic sequence is generated according to the audio characteristic and the initial action characteristic of the target role, so that the spatial association relationship of each predicted action characteristic can be rapidly and accurately determined according to the relative position information and the displacement information.
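The following sketch shows, under assumed shapes and stand-in layers, how a pose decoder and a velocity decoder of the kind described for Fig. 3 could map a quantized feature sequence to keypoint positions relative to the preset standard point (e.g., the hip) and to per-frame displacement. Both modules are illustrative stand-ins, not the trained Dp and Dv.

```python
# Sketch of the two decoders: pose decoder -> keypoints relative to the root,
# velocity decoder -> root displacement per frame. All layer sizes are assumed.
import torch
import torch.nn as nn

C, J = 128, 24
pose_decoder = nn.Sequential(nn.Conv1d(C, 256, 3, padding=1), nn.ReLU(),
                             nn.Conv1d(256, J * 3, 3, padding=1))   # stand-in for Dp
velocity_decoder = nn.Sequential(nn.Conv1d(C, 64, 3, padding=1), nn.ReLU(),
                                 nn.Conv1d(64, 3, 3, padding=1))    # stand-in for Dv

quantized = torch.randn(1, C, 10)           # B x C x T' quantized feature sequence
relative_pose = pose_decoder(quantized)     # B x (J*3) x T : keypoints relative to the hip
root_velocity = velocity_decoder(quantized) # B x 3 x T     : per-frame displacement
```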
In some embodiments, the method further comprises:
acquiring a sample action characteristic sequence;
updating model parameters of an initial feature processing model and quantitative action features in an initial action feature library of the initial feature processing model according to each sample action feature in the sample action feature sequence and a preset incidence relation of each sample action feature on the space to obtain an updating result;
obtaining a target feature processing model with the preset action feature library based on the updating result;
the determining, from a preset motion feature library, quantized motion features that match with each predicted motion feature in the predicted motion feature sequence includes:
and determining quantitative action characteristics matched with each predicted action characteristic in the predicted action characteristic sequence from the preset action characteristic library of the target characteristic processing model.
In the embodiment of the present disclosure, the sample motion feature sequence may be understood as a motion feature sequence that is manually marked, and may be used to train an initial feature processing model. Model parameters in the initial feature processing model can be randomly set, and in the training process, a loss value can be calculated according to a predefined loss function, and the model parameters are updated. The electronic device may crawl the sample motion feature sequence from the internet, or select the sample motion feature sequence from a preset data set (e.g., a dance motion data set), etc. The preset association relationship of each sample action feature on the space can be understood as the association relationship between each sample action feature marked manually. The initial motion feature library can be understood as an untrained motion feature library, can be a part of an initial feature processing model, can also refer to two different parts, and can jointly realize the processing of determining the spatial association relationship of each predicted motion feature according to the predicted motion feature sequence. Each action feature in the initial action feature library can be set in a user-defined mode, and each action feature is updated in the process of updating the model parameters of the initial feature processing model.
The electronic device can update the model parameters of the initial feature processing model and the quantitative action features in the initial action feature library of the initial feature processing model according to each sample action feature in the sample action feature sequence and the preset incidence relation of each sample action feature in space, and then obtains the target feature processing model with the preset action feature library based on the update result. The target feature processing model can be understood as a trained feature processing model, and the preset action feature library can be understood as a trained action feature library. After the electronic device obtains the target feature processing model with the preset action feature library, in the actual use process of the model, the quantitative action features matched with each predicted action feature in the predicted action feature sequence can be determined from the preset action feature library of the target feature processing model, and the incidence relation and the like of each predicted action feature on the space can be determined by using the target feature processing model.
In the embodiment of the disclosure, by obtaining the sample action feature sequence, the model parameters of the initial feature processing model and the quantitative action features in the initial action feature library of the initial feature processing model can be updated according to the preset association relationship between each sample action feature in the sample action feature sequence and each sample action feature in the space, so that the initial feature processing model can be quickly and accurately trained into the target feature processing model, and the efficiency and the accuracy of model training are improved.
In some embodiments, the updating, according to the preset association relationship between each sample action feature in the sample action feature sequence and each sample action feature in space, the model parameter of the initial feature processing model and the quantized action feature in the initial action feature library of the initial feature processing model to obtain an updated result includes:
obtaining a prediction incidence relation of each sample action characteristic on the space by using the initial characteristic processing model;
determining a first loss value according to the predicted incidence relation in the space and the preset incidence relation in the space;
performing dimension reduction processing on each sample action characteristic to obtain a sample coding characteristic;
determining a second loss value according to the sample coding characteristics;
obtaining a first target loss value according to the first loss value and the second loss value;
and updating the model parameters of the initial characteristic processing model and the quantitative action characteristics in the initial action characteristic library of the initial characteristic processing model by using the first target loss value to obtain the updating result.
In the embodiment of the disclosure, in the training process of the initial feature processing model, the electronic device may obtain the prediction association relationship of each sample action feature in the space by using the initial feature processing model. The predicted association relationship may be understood as an association relationship obtained according to a preset model parameter. The electronic device may determine the first loss value according to the predicted association and the preset association, for example: the electronic device may determine the first loss value according to a difference between the predicted association and a preset association through a calculation strategy of a preset first loss function. The electronic equipment can perform dimension reduction processing on each sample action characteristic to obtain a sample coding characteristic, and the sample coding characteristic can be understood as a coding characteristic obtained according to a preset model parameter. The electronic device may process the sample coding features according to a calculation strategy of a preset second loss function, and determine a second loss value, for example: the electronic device may determine a norm value corresponding to the sample encoding characteristic, and use the obtained norm value as a second loss function, or the like.
After the electronic device determines the first loss value and the second loss value, a first target loss value may be obtained according to the first loss value and the second loss value. The first target loss value may be understood as a loss value determined during one training of the initial feature processing model. For example: the electronic device determines a first target loss value and the like based on a weighted sum process of the first loss value and the second loss value, or based on a ratio between the first loss value and the second loss value and the like. After the electronic device determines the first target loss value, the model parameters of the initial feature processing model and the motion features in the initial motion feature library of the initial feature processing model may be updated by using the first target loss value. For example: adjust model parameters from t1 to t2, adjust motion characteristics from e1 to e2, and so on.
In the embodiment of the disclosure, different loss values can be obtained according to different calculation strategies, so that a target loss value can be accurately determined, and the efficiency and accuracy of model training and the like are improved.
In some embodiments, the determining a second loss value according to the sample coding feature comprises:
according to the sample coding features, determining quantized motion features matched with the sample motion features from the initial motion feature library;
and determining the second loss value according to the sample coding characteristics and the quantization action characteristics matched with the sample action characteristics.
In the embodiment of the disclosure, the electronic device may determine, from the initial motion characteristic library, a quantized motion characteristic matching the sample motion characteristic according to the sample encoding characteristic. The electronic device may determine a distance between the sample encoding feature and a quantized motion feature in the initial motion feature library to determine a quantized motion feature that matches the sample motion feature. For example: the first sample coding feature is matched with a first quantized motion feature in the initial motion feature library, the second sample coding feature is matched with a second quantized motion feature in the initial motion feature library, and so on.
After determining the quantized motion features matched with the sample motion features from the initial motion feature library, the electronic device may determine the second loss value according to the sample coding features and the quantized motion features matched with the sample motion features. For example: the electronic device may determine a distance between the sample coding feature and the quantized motion feature as a second loss value. The present disclosure does not limit the manner in which the second loss value is determined according to the sample coding features and the quantization motion features matching the sample motion features.
In the embodiment of the disclosure, the second loss value is determined through the sample coding features and the quantization action features matched with the sample action features, and the model parameters can be updated quickly and accurately.
In some embodiments, the determining the second loss value according to the sample coding feature and a quantization action feature matched with the sample action feature comprises:
determining a first sub-loss value according to a difference value between the sample coding feature and the sample quantization action feature;
determining a second sub-loss value according to the inverse number of the difference between the sample coding characteristic and the sample quantization action characteristic;
and determining the second loss value according to the first sub-loss value, the second sub-loss value and a preset weight coefficient.
In an embodiment of the present disclosure, the electronic device may determine a first sub-loss value according to the difference between the sample coding feature and the sample quantized motion feature, and determine a second sub-loss value according to the inverse (i.e., the negative) of that difference. For example: if the difference between the sample coding feature and the sample quantized motion feature is 2, the inverse of the difference is -2, and so on. The electronic device then determines the second loss value according to the first sub-loss value, the second sub-loss value and preset weight coefficients. For example: the weighted first sub-loss value and second sub-loss value are processed according to a preset calculation strategy to obtain the second loss value, where the preset weight coefficient corresponding to the first sub-loss value is 0.3 and the preset weight coefficient corresponding to the second sub-loss value is 0.7.
In one possible embodiment, the electronic device determines, from the preset motion feature library, a calculation formula of the quantized motion features matching the respective predicted motion features in the predicted motion feature sequence as follows:
$$e_{q,j} = \arg\min_{z_j \in \mathcal{Z}} \left\| e_j - z_j \right\| \qquad (1)$$

In formula (1), $e_j$ may represent a predicted motion feature, $z_j$ may represent a quantized motion feature in the preset motion feature library, and $e_{q,j}$ may represent the quantized motion feature matched with the predicted motion feature. $\mathcal{Z}$ may represent the preset motion feature library, $j$ may represent the index of the current predicted motion feature, and $\arg\min\|\cdot\|$ may represent selecting the quantized motion feature at the minimum distance from the predicted motion feature.
In a possible embodiment, the calculation formula for the electronic device to determine the first loss value according to the predicted association relationship and the preset association relationship may be as follows:
$$L_{rec} = \left\| \hat{p} - p \right\|_1 + \alpha_1 \left\| \hat{p}' - p' \right\|_1 + \alpha_2 \left\| \hat{p}'' - p'' \right\|_1 \qquad (2)$$

In formula (2), $L_{rec}$ may represent the first loss value, $\hat{p}$ may represent the predicted association relationship, $p$ may represent the preset association relationship, $\hat{p}'$ and $p'$ may represent the corresponding first derivatives, $\hat{p}''$ and $p''$ may represent the corresponding second derivatives, $\|\cdot\|_1$ may represent the distance between features, and $\alpha_1$ and $\alpha_2$ may represent custom parameters.
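Under the assumption that the association relationships are pose sequences indexed by time, formula (2) could be sketched as below, with finite differences standing in for the first and second derivatives.

```python
import torch

def reconstruction_loss(pred: torch.Tensor, target: torch.Tensor,
                        alpha1: float = 1.0, alpha2: float = 1.0) -> torch.Tensor:
    """pred, target: (T, J) predicted and preset association (pose) sequences."""
    def l1(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        return (a - b).abs().mean()
    vel_p, vel_t = pred[1:] - pred[:-1], target[1:] - target[:-1]   # first derivatives
    acc_p, acc_t = vel_p[1:] - vel_p[:-1], vel_t[1:] - vel_t[:-1]   # second derivatives
    return l1(pred, target) + alpha1 * l1(vel_p, vel_t) + alpha2 * l1(acc_p, acc_t)
```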
In one possible embodiment, the calculation formula for the electronic device to determine the first target loss value may be as follows:
$$L_{VQ} = L_{rec} + \left\| \mathrm{sg}[e] - e_q \right\|^{2} + \beta \left\| e - \mathrm{sg}[e_q] \right\|^{2} \qquad (3)$$

In formula (3), $L_{VQ}$ may represent the first target loss value, $L_{rec}$ may represent the first loss value, $e$ may represent the sample coding feature, and $e_q$ may represent the sample quantized motion feature. $\mathrm{sg}[\cdot]$ may represent a stop gradient, which can be understood as an operation whose operand is used as-is in the forward pass of training while no gradient is propagated through it in the backward pass. $\|\cdot\|$ may represent the distance between features; penalizing the distance between the sample coding feature and the sample quantized motion feature pulls the sample coding feature toward the sample quantized motion feature and, at the same time, pulls the sample quantized motion feature toward the sample coding feature, which improves training efficiency, and the like. $\beta$ may represent a custom parameter. In one possible embodiment, the first term of $L_{VQ}$, i.e. the first loss value (which may also be referred to as the reconstruction loss), may be used to train the encoder and the decoder in the neural network; the second term of $L_{VQ}$ (which may also be referred to as the codebook loss) may be used to train the initial action feature library in the neural network; and the third term of $L_{VQ}$ (which may also be referred to as the commitment loss) may be used to train the encoder in the neural network and prevents the output of the encoder from repeatedly jumping between the quantized motion features of the respective samples. The target feature processing model may also be referred to as a choreography memory model (Choreographic Memory), and may be understood as a neural network trained based on the Vector Quantized Variational AutoEncoder (VQ-VAE) framework.
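A sketch of formula (3), using `detach()` as the stop-gradient operator sg[·]; the mean-squared reduction and the default β are assumptions.

```python
import torch

def vq_loss(rec_loss: torch.Tensor, e: torch.Tensor, e_q: torch.Tensor,
            beta: float = 0.25) -> torch.Tensor:
    """e: sample coding features; e_q: matched sample quantized motion features."""
    codebook_loss = (e.detach() - e_q).pow(2).mean()    # updates the codebook only
    commitment_loss = (e - e_q.detach()).pow(2).mean()  # updates the encoder only
    return rec_loss + codebook_loss + beta * commitment_loss
```

In typical VQ-VAE implementations the features passed on to the decoder are additionally rewritten as `e + (e_q - e).detach()` so that gradients flow straight through the quantization step to the encoder.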
In the embodiment of the present disclosure, the second sub-loss value is determined according to the information of the difference between the sample coding characteristic and the sample quantization action characteristic, so that the similarity between the sample coding characteristic and the quantization action characteristic can be improved, and a standard action in the preset action characteristic library better conforms to an actual specification.
In some embodiments, the obtaining the temporally related sequence of predicted motion features includes:
determining a predicted action characteristic aligned with each sub-audio characteristic in time sequence according to each sub-audio characteristic in the audio characteristics and the initial action characteristic of the target role; wherein each of the sub-audio features is time-sequentially associated;
and obtaining the predicted action characteristic sequence according to each predicted action characteristic.
In the embodiment of the present disclosure, the electronic device may segment the audio feature to obtain at least two sub-audio features. For example: the electronic device may segment the audio features according to a preset tone level, a preset timbre range, a preset time interval, or the like, so as to obtain each sub-audio feature associated in time sequence. For example: the first sub-audio feature is a first frame of audio, the second sub-audio feature is a second frame of audio, and so on. The electronic device may produce a first predicted motion feature based on the first sub-audio feature and the initial motion feature, and then generate a second predicted motion feature based on the second sub-audio feature and the first predicted motion feature, and so on, such that each generated predicted motion feature has an association relationship in time sequence. In one possible embodiment, the electronic device may also produce a first predicted motion feature based on the first sub-audio feature and the initial motion feature, and then generate a second predicted motion feature based on the first sub-audio feature, the second sub-audio feature, the initial motion feature, and the first predicted motion feature, and so on.
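The autoregressive dependence described above can be sketched as a simple loop; `step_fn` stands in for whatever network maps the current sub-audio feature and the previous action feature to the next predicted action feature.

```python
import torch

def predict_action_sequence(sub_audio_feats, initial_action_feat, step_fn):
    """sub_audio_feats: iterable of per-frame audio feature tensors, in temporal order.
    Each predicted action feature depends on the current sub-audio feature and
    the previously generated action feature (or the initial action feature)."""
    predicted = []
    prev = initial_action_feat
    for audio_feat in sub_audio_feats:
        prev = step_fn(audio_feat, prev)
        predicted.append(prev)
    return torch.stack(predicted)   # the temporally related predicted action feature sequence
```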
After the predicted action features aligned with the sub-audio features in time sequence are determined, all the predicted action features corresponding to all the sub-audio features are obtained, and at least two predicted action features can be combined according to the association relation of all the sub-audio features in time sequence and the alignment relation between the predicted action features and the sub-audio features to obtain a predicted action feature sequence. For example: and the predicted action characteristic of the first column in the predicted action characteristic sequence is the predicted action characteristic corresponding to the first sub-audio characteristic, the predicted action characteristic of the second column in the predicted action characteristic sequence is the predicted action characteristic corresponding to the second sub-audio characteristic, and the like.
In the embodiment of the disclosure, the predicted action characteristic aligned with each sub-audio characteristic in time sequence is determined through each sub-audio characteristic and the initial action characteristic, so that a predicted action characteristic sequence can be obtained according to each predicted action characteristic, and the predicted action characteristic sequence can be accurately and quickly obtained.
In some embodiments, said determining, from each sub-audio feature of said audio features and the initial motion feature of the target character, a predicted motion feature that is time-aligned with each said sub-audio feature comprises:
splitting the initial action characteristic according to the position information of the target role to obtain at least two sub-initial action characteristics;
determining a degree of association between each of the sub-audio features and each of the sub-initial motion features; wherein the association degree at least characterizes the association degree of the sub-audio features and the sub-initial action features in content and time sequence;
and determining the predicted action characteristic according to the association degree between each sub-audio characteristic and each sub-initial action characteristic.
In the embodiment of the disclosure, the electronic device may split the initial action feature according to the position information of the target role to obtain at least two sub-initial action features. The position information may be understood as information about the body part with which the target character performs an action. For example: if the target role is a virtual character, the position information may be the whole body, i.e., the initial action feature represents a whole-body action. The whole-body action can be divided into an upper body action and a lower body action, or a face action and a limb action, and the like. The electronic device may split the initial motion feature to obtain at least two sub-initial motion features, for example: the first sub-initial motion feature represents the upper body motion, the second sub-initial motion feature represents the lower body motion, and the like. In this way, more detailed whole-body movements can be generated accurately in subsequent steps, enabling a greater variety of movements to be combined.
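Splitting by body part might be as simple as slicing the joint dimension; the joint index ranges below are purely hypothetical.

```python
import torch

UPPER_BODY_JOINTS = list(range(0, 12))   # hypothetical joint indices
LOWER_BODY_JOINTS = list(range(12, 24))  # hypothetical joint indices

def split_initial_motion(initial_motion: torch.Tensor):
    """initial_motion: (T, J, C) whole-body motion features.
    Returns the upper-body and lower-body sub-initial motion features."""
    upper = initial_motion[:, UPPER_BODY_JOINTS, :]
    lower = initial_motion[:, LOWER_BODY_JOINTS, :]
    return upper, lower
```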
After obtaining at least two sub-initial motion characteristics, the electronic device may determine a degree of association between each sub-audio characteristic and each sub-initial motion characteristic, where the degree of association at least includes a degree of association between the sub-audio characteristic and each sub-initial motion characteristic in content and time sequence. In the process of generating the predicted action characteristics, according to the association relationship between each sub-audio characteristic and the sub-audio characteristic, the association relationship between each sub-audio characteristic and each sub-initial action characteristic, and the association relationship between each sub-initial action characteristic and the sub-initial action characteristic, each generated predicted action characteristic is smoother, and the consistency between the generated predicted action characteristic and the audio rhythm is improved. For example: there are 10 sub-audio features, and the sub-initial action features include: 3 initial action characteristics of the upper part of the body and 3 initial action characteristics of the lower part of the body, etc., wherein one initial action characteristic of the upper part of the body and one initial action characteristic of the lower part of the body are combined into a complete group of initial action characteristics.
In a possible embodiment, the electronic device may determine the association degree of the first sub-audio feature according to the association degree of the first sub-audio feature with the first sub-audio feature, the association relationship between the first sub-audio feature and the first upper body initial action feature, and the association relationship between the first sub-audio feature and the first lower body initial action feature. The association degree may include feature information of the sub-audio feature and the sub-initial motion feature themselves, and information of a position relationship between the sub-audio feature and the sub-initial motion feature. The electronic device may determine the association degree of the second sub audio feature according to the association degree of the second sub audio feature with the second sub audio feature, the association degree of the second sub audio feature with the first sub audio feature, the association relationship of the second sub audio feature with the second upper body initial motion feature, the association relationship of the second sub audio feature with the first upper body initial motion feature, the association relationship of the second sub audio feature with the second lower body initial motion feature, and the association relationship of the second sub audio feature with the first lower body initial motion feature. The determination manner of the degree of association of the second upper body initial motion feature, the second lower body initial motion feature, and the like is the same as the determination manner of the degree of association of the second sub audio feature.
After the electronic device determines the association degree between each sub-audio feature and each sub-initial motion feature, the predicted motion feature may be determined according to the association degree between each sub-audio feature and each sub-initial motion feature. For example: the electronic device may determine the degree of association according to a preset natural language processing layer (Transformer), so as to determine the predicted motion characteristic.
In one possible embodiment, a pre-trained neural network (i.e., a target motion prediction model) may be used to determine a sequence of predicted motion features based on the audio features and the initial motion features. Fig. 4 is a schematic diagram illustrating a motion prediction method according to an exemplary embodiment. As shown in fig. 4, the audio features and the initial motion features may be subjected to a splicing process by a feature encoding process (Feature embedding), so as to obtain a spliced feature sequence. The initial motion features may include an upper body motion and a lower body motion (Upper/lower half-body sequence). For example: the first initial upper body motion may be denoted $p_0^{u}$, the second initial upper body motion $p_1^{u}$, the first initial lower body motion $p_0^{l}$, the second initial lower body motion $p_1^{l}$, and so on. In the spliced feature sequence, the portion representing the audio features may be denoted by $m$, the portion representing the features characterizing the upper body motion by $u$, and the portion representing the features characterizing the lower body motion by $l$, and the dimension of the spliced features may be $3 \times T'$. A position feature (Positional embedding), which may represent the position information of each item in the spliced features, is then combined with the spliced features, and the combination result is input into the state network $f_s$ 401 to obtain a state feature sequence ($S$), where the second part of the state feature sequence may represent the state features corresponding to the upper body motion and the third part may represent the state features corresponding to the lower body motion.
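A rough sketch of the splicing step in Fig. 4, assuming the music features are continuous vectors and the upper/lower body motions are codebook indices; dimensions and module names are assumptions.

```python
import torch
import torch.nn as nn

class SplicedInput(nn.Module):
    """Builds the spliced sequence [music; upper body; lower body] plus a positional feature."""
    def __init__(self, d_model: int, horizon: int, n_codes: int, d_music: int):
        super().__init__()
        self.music_embed = nn.Linear(d_music, d_model)
        self.upper_embed = nn.Embedding(n_codes, d_model)
        self.lower_embed = nn.Embedding(n_codes, d_model)
        self.pos_embed = nn.Parameter(torch.zeros(3 * horizon, d_model))

    def forward(self, music, upper_codes, lower_codes):
        # music: (T', d_music); upper_codes, lower_codes: (T',) codebook indices
        tokens = torch.cat([self.music_embed(music),
                            self.upper_embed(upper_codes),
                            self.lower_embed(lower_codes)], dim=0)   # (3 * T', d_model)
        return tokens + self.pos_embed[: tokens.size(0)]             # fed into the state network
```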
In one possible embodiment, a decision network (policy mapping network) $f_a$ 402 may then be utilized to obtain a candidate action feature sequence and a probability value (Action probability) corresponding to each candidate action feature, where the dimension of the probability-value feature sequence ($a$) may be $(3 \times T') \times N$; the candidate action features corresponding to the upper body motion may be represented as $a^{u}$, the candidate action features corresponding to the lower body motion as $a^{l}$, and so on. A predicted sample motion feature may then be determined from the plurality of candidate motion feature sequences using a selector (Top-1 selection); e.g., the first selected upper body action may be denoted $a_0^{u}$, the second upper body action $a_1^{u}$, the first lower body action $a_0^{l}$, the second lower body action $a_1^{l}$, and so on.
After the predicted sample motion feature sequence is determined, quantized motion feature sequences (Quantized features) matching the respective predicted motion features in the predicted motion feature sequence may be determined from a preset motion feature library (Choreographic Memory Codebook) 404, and a target motion feature sequence (Generated dance) may be generated from the quantized motion feature sequences using a convolutional neural network decoder (CNN Decoders) 405, and the like.
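The top-1 selection and the codebook lookup could be sketched as follows; the CNN decoders are left abstract and the tensor shapes are assumptions.

```python
import torch

def select_and_quantize(action_logits_u: torch.Tensor, action_logits_l: torch.Tensor,
                        codebook_u: torch.Tensor, codebook_l: torch.Tensor):
    """action_logits_*: (T', N) scores over the N candidate action features per step.
    codebook_*: (N, D) quantized motion features of the Choreographic Memory."""
    idx_u = action_logits_u.argmax(dim=-1)   # top-1 selection for the upper body
    idx_l = action_logits_l.argmax(dim=-1)   # top-1 selection for the lower body
    quantized_u = codebook_u[idx_u]          # (T', D) quantized upper-body features
    quantized_l = codebook_l[idx_l]          # (T', D) quantized lower-body features
    return quantized_u, quantized_l          # passed to the CNN decoders to render the dance
```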
In one possible embodiment, after the state feature sequence is obtained, a value evaluation network ($f_v$) may be utilized to determine critic values (Critic values) corresponding to each state feature, where the dimension of the feature sequence characterized by the critic values may be $(3 \times T') \times 1$; $V_u$ may represent the critic value corresponding to the features characterizing the upper body motion, and $V_l$ the critic value corresponding to the features characterizing the lower body motion. The critic value ($V$) of the overall action can be obtained by combining $V_u$ and $V_l$. A consistency reward process (Beat Align/Half-body Consistency Reward) 406 may be performed based on the target action feature sequence and the music tempo (Music beats) to determine a reward value (Reward) $R$, and a time difference error (TD-error) $\varepsilon$ may be determined based on the reward value and the critic value. A cross-entropy loss ($L_{CE}$) can be determined from the probability values, an actor-critic loss ($L_{AC}$) from the probability values and the time difference error, and a value loss ($L_V$) from the time difference error; finally, the initialized neural network is trained according to the cross-entropy loss, the actor-critic loss and the value loss to obtain the trained neural network (i.e., the target motion prediction model). The target motion prediction model may also be referred to as a dance motion generation model (GPT), and may be understood as a neural network trained based on a reinforcement-learning GPT framework.
In one possible embodiment, the state network includes multiple natural language processing layers (Transformer layers), and the natural language processing layers in the state network may include: a layer normalization layer (Layer Normalization), a cross conditional causal attention layer (Cross Conditional Causal Attention), a fully connected layer (Linear), an activation layer (GELU), a dropout layer (Dropout), and the like. The decision network comprises a plurality of natural language processing layers and a linear layer (Linear) or a regression layer (Softmax), and the like, and the value evaluation network comprises a plurality of natural language processing layers, a linear layer, and the like.
Figs. 5a, 5b and 5c illustrate different ways of determining the degree of association according to an exemplary embodiment. Fig. 5a may represent a full attention mode (Full attention), in which each feature determines a degree of association with all features in the sequence. Fig. 5b may represent a causal attention mode (Causal attention), in which the first feature in the feature sequence determines a degree of association only with the first feature, the second feature determines degrees of association with the first and second features, the third feature determines degrees of association with the first, second and third features, and so on. Fig. 5c may represent a cross-conditional causal attention mode (Cross-conditional causal attention), in which, for the music features in the feature sequence, the first music feature is associated to a certain extent with the first music feature, the first upper body motion feature and the first lower body motion feature, and the second music feature is associated to a certain extent with the first and second music features, the first and second upper body motion features, and the first and second lower body motion features; the degrees of association for the other types of features in the feature sequence, such as the upper body motion features and the lower body motion features, are determined in the same manner. In this embodiment, the cross-conditional causal attention mode can be adopted to determine the degree of association between features, so that both the upper body action and the lower body action are taken into consideration when determining the association relationship, which avoids inconsistency between the upper body and lower body actions.
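One way to realize the three attention patterns of Figs. 5a–5c is through boolean masks. The sketch below builds a cross-conditional causal mask for a sequence laid out as [music; upper body; lower body], each of length T'; the layout is an assumption consistent with the spliced features described above.

```python
import torch

def cross_conditional_causal_mask(t: int) -> torch.Tensor:
    """Returns a (3t, 3t) boolean mask where True means attention is allowed:
    every token may attend to positions up to its own time step in all three
    streams (music, upper body, lower body), but never to future time steps."""
    causal = torch.ones(t, t).tril().bool()   # per-stream causal block
    return causal.repeat(3, 3)                # same causal rule across the three streams

mask = cross_conditional_causal_mask(4)       # 4 time steps -> a (12, 12) mask
```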
In the embodiment of the disclosure, the initial action features are split according to the part information of the target role to obtain at least two sub-initial action features, and then the association degree between each sub-audio feature and each sub-initial action feature can be determined, so that the predicted action features can be determined according to the association degree between each sub-audio feature and each sub-initial action feature, and the accuracy of generating the predicted action features can be improved.
In some embodiments, the method further comprises:
acquiring sample audio features and sample initial action features of sample roles;
splitting the initial sample action characteristics according to the part information of the sample roles to obtain at least two initial sub-sample action characteristics;
determining a prediction sample action characteristic sequence according to each subsample audio characteristic in the sample audio characteristics and each subsample initial action characteristic;
training an initial motion prediction model based on each predicted sample motion characteristic and each preset sample motion characteristic in the predicted sample motion characteristic sequence to obtain a target motion prediction model;
the obtaining of the predicted action characteristic sequence related in time sequence comprises:
and acquiring a predicted action characteristic sequence related in time sequence by using the target action prediction model.
In the embodiment of the disclosure, the sample audio features may be understood as manually labeled audio information, the motion represented by the sample initial motion features is generated by a sample role, and the sample initial motion features may be understood as a sequence of manually labeled motion features, and may be used to train an initial motion prediction model. The model parameters in the initial motion prediction model may be randomly set, and during the training process, the model parameters may be updated according to the loss values determined by the predefined loss function. The electronic device can crawl the sample audio features from the internet, as well as sample initial motion features corresponding to the sample audio features. The electronic equipment can split the initial action characteristics of the sample according to the position information of the sample role to obtain at least two initial action characteristics of the sub-sample. For example: splitting the initial action characteristics of the sample into: a sub-sample initial motion feature representing an upper body motion, a sub-sample initial motion feature representing a lower body motion, and the like.
In one possible embodiment, the electronic device may determine a sequence of predicted sample motion features from each subsample audio feature and each subsample initial motion feature using an initial motion prediction model. And the electronic equipment determines a loss value corresponding to a preset loss function according to the motion characteristics of each prediction sample and the motion characteristics of each preset sample, and updates the model parameters of the initial motion prediction model according to the loss value to obtain the trained target motion prediction model. After the electronic device obtains the trained target motion prediction model, a prediction motion characteristic sequence related in time sequence can be obtained according to the audio characteristic and the initial motion characteristic by using the target motion prediction model.
In the embodiment of the disclosure, the sample audio characteristics and the sample initial action characteristics of the sample roles are obtained, then the sample initial action characteristics can be split to obtain at least two sub-sample initial action characteristics, and then the predicted sample action characteristic sequence is determined, so that the initial action prediction model can be trained based on each predicted sample action characteristic and each preset sample action characteristic to obtain the target action prediction model, and the accuracy of model training can be improved.
In some embodiments, the training an initial motion prediction model based on each motion feature of the prediction samples in the motion feature sequence and each preset sample motion feature to obtain a target motion prediction model includes:
obtaining candidate action characteristics and probability values corresponding to the candidate action characteristics according to the audio characteristics of the sub samples and the initial action characteristics of the sub samples; wherein the probability value is used to determine the predicted sample action feature from the candidate action features;
determining a third loss value according to the number of the audio features of the sub-samples, the action features of the preset samples and the probability value;
determining a fourth loss value according to the number of the audio features of the subsample, the motion features of the prediction sample and the probability value;
obtaining a second target loss value according to the third loss value and the fourth loss value;
and updating the model parameters of the initial motion prediction model by using the second target loss value to obtain the target motion prediction model.
In the embodiment of the disclosure, in the process of training the target motion prediction model, the electronic device may obtain candidate motion features and probability values corresponding to the candidate motion features according to the audio features of the subsamples and the initial motion features of the subsamples, where the probability values are used to determine the motion features of the prediction samples from the candidate motion features. The candidate motion feature may be understood as a plurality of motion features that may be selected subsequently according to the current audio feature of the subsample and the initial motion feature of the subsample, each candidate motion feature may be used as a predicted sample motion feature, and in order to make the audio rhythm consistent with the motion beat, a candidate motion feature with the maximum probability of making the audio rhythm consistent with the motion beat needs to be determined from the plurality of candidate motion features. For example: when the current prediction sample motion feature is generated, the probability value corresponding to the first candidate motion feature is 0.8, the probability value corresponding to the second candidate motion feature is 0.3, and the probability value corresponding to the third candidate motion feature is 0.7, then the electronic device may use the first candidate motion feature as the current prediction sample motion feature.
After the electronic device determines the probability value corresponding to the candidate motion feature, a third loss value may be determined according to the number of the sub-sample audio features, the preset sample motion feature, and the probability value. The third loss value may be understood as a part of the loss value of the initial motion prediction model calculated according to the determination strategy of the preset third loss function, and may be understood as a difference between the predicted sample motion characteristic of the initial motion prediction model and the preset sample motion characteristic. The electronic device can determine a fourth loss value based on the number of sub-sample audio features, the predicted sample action features, and the probability value. The fourth loss value may be understood as a part of the loss value of the initial motion prediction model calculated according to a determination strategy of a preset fourth loss function, and may be understood as a reward in a case where the initial motion prediction model selects a predicted sample motion characteristic.
And the electronic equipment determines that a second target loss value can be obtained according to the third loss value and the fourth loss value, and updates the model parameters of the initial motion prediction model by using the second target loss value to obtain the target motion prediction model. For example: the electronic device may obtain the second target loss value by performing weighted addition processing on the third loss value and the fourth loss value.
In the embodiment of the disclosure, the candidate action features and the probability values corresponding to the candidate action features are obtained according to the audio features of the subsamples and the initial action features of the subsamples, then, the third loss value is determined according to the number of the audio features of the subsamples, the preset action features of the samples and the probability values, and the fourth loss value is determined according to the number of the audio features of the subsamples, the action features of the prediction samples and the probability values, so that the second target loss value is obtained, model parameters of an initial action prediction model are updated, the target loss value can be accurately determined, and the efficiency and the accuracy of model training are improved.
In some embodiments, the determining a third loss value according to the number of sub-sample audio features, the preset sample action features, and the probability value comprises:
determining a difference value between the vector characteristic of the probability value and the preset sample action characteristic;
and determining the third loss value according to the difference value between the vector characteristic of the probability value and the preset sample action characteristic and the number of the audio characteristics of the subsamples.
In this embodiment, the electronic device may determine a difference between the vector feature of the probability value and the preset sample motion feature, and then determine the third loss value according to the difference between the vector feature of the probability value and the preset sample motion feature and the number of the sub-sample audio features. For example: in the process of sequentially determining the motion characteristics of the prediction sample by the electronic equipment, the difference values between the vector characteristics of the probability value and the preset sample motion characteristics are respectively 2, 1, 3, 4 and 2. Then, when the predicted sample motion characteristic corresponding to the 3 rd sub-sample audio characteristic is determined, the first difference value, the second difference value, and the third difference value may be accumulated to obtain a third loss value when the predicted sample motion characteristic corresponding to the current sub-sample audio characteristic is determined.
In the embodiment of the disclosure, by determining the difference between the vector feature of the probability value and the preset sample action feature, the third loss value can be determined quickly and accurately according to the difference between the vector feature of the probability value and the preset sample action feature and the number of the audio features of the sub-samples, so that the efficiency and accuracy of model training are improved.
In some embodiments, the determining a fourth loss value as a function of the number of subsample audio features, the predicted sample action features and the probability value comprises:
determining a difference between the predicted sample action feature and a vector feature of the probability value;
determining the orientation characteristics of the sample roles according to the initial action characteristics of each subsample;
determining the fourth loss value according to the number of the sub-sample audio features, the difference value between the predicted sample action feature and the vector feature of the probability value, and the orientation feature.
In the embodiment of the disclosure, the electronic device may determine a difference between the motion feature of the prediction sample and the vector feature of the probability value, determine the orientation feature of the sample role according to the initial motion feature of each sub-sample, and understand the difference between the motion feature of the prediction sample and the vector feature of the probability value as a distance difference between different features. Orientation features may be understood as direction information of actions generated by a sample character when the sample character generates the actions, for example: the orientation characteristic of the whole body motion may be characterized as being oriented to the right front, the orientation characteristic of the upper body motion may be characterized as being oriented to the left, the orientation characteristic of the lower body motion may be characterized as being oriented to the right, etc.
After the electronic device determines the difference between the motion features of the prediction sample and the vector features of the probability value and the orientation features of the sample roles, a fourth loss value can be determined according to the number of the audio features of the sub-samples, the difference between the motion features of the prediction sample and the vector features of the probability value and the orientation features. For example: in the process of sequentially determining the motion features of the prediction sample, the difference values between the vector features of the probability values and the motion features of the prediction sample are respectively 3, 5,6, 2, and 4, and the orientation features can be respectively expressed as: the upper body movement is consistent with the lower body movement, and the upper body movement is inconsistent with the lower body movement. Then, when the motion feature of the prediction sample corresponding to the 3 rd sub-sample audio feature is determined, the difference and the orientation feature may be fused (for example, multiplied), and then the first fusion result, the second fusion result, and the third fusion result are accumulated to obtain a fourth loss value when the motion feature of the prediction sample corresponding to the current sub-sample audio feature is determined.
In the embodiment of the disclosure, by determining the difference between the motion characteristics of the prediction sample and the vector characteristics of the probability value and the orientation characteristics of the sample role, the fourth loss value can be determined quickly and accurately according to the number of the audio characteristics of the sub-samples, the difference between the motion characteristics of the prediction sample and the vector characteristics of the probability value and the orientation characteristics, and the efficiency, the accuracy and the like of model training are improved.
In a possible embodiment, the calculation formula for the electronic device to obtain the candidate action features and the probability values corresponding to the candidate action features according to the audio features of the subsample and the initial action features of the subsample may be as follows:
$$\hat{a}_t^{u},\; \hat{a}_t^{l} \;=\; \arg\max_k P\big(z_k^{u},\, z_k^{l} \,\big|\, m_{1:t},\, p_0^{u},\, p_0^{l}\big) \qquad (4)$$

In formula (4), $\hat{a}_t^{u}$ may represent the probability value corresponding to the candidate motion features determined for the upper body motion, $\hat{a}_t^{l}$ may represent the probability value corresponding to the candidate motion features determined for the lower body motion, and $P(\cdot)$ may represent a probability distribution. $z_k^{u}$ and $z_k^{l}$ may represent the quantized motion features corresponding to the upper body motion and the lower body motion respectively, $k$ may index the plurality of determined candidate motion features, and $m_{1:t}$ may represent the audio features. $p_0^{u}$ may represent the initial motion feature corresponding to the upper body motion, $p_0^{l}$ may represent the initial motion feature corresponding to the lower body motion, and $\arg\max_k$ may represent selecting the predicted action feature whose probability value is maximum.
In a possible embodiment, the electronic device may determine the third loss value according to the number of the audio features of the sub-samples, the preset sample action features, and the probability value, as follows:
$$L_{CE} = \frac{1}{2T'} \sum_{t=0}^{T'-1} \sum_{h \in \{u,\,l\}} \mathrm{CrossEntropy}\big(\hat{a}_t^{h},\, a_t^{h}\big) \qquad (5)$$

In formula (5), $L_{CE}$ may represent the third loss value, which may also be referred to as the cross-entropy loss value; $T'$ may represent the number of sub-sample audio features; $\hat{a}_t^{h}$ may represent a probability value; $h$ may represent the upper body motion $u$ or the lower body motion $l$; $t$ may represent the index of the current step; $a_t^{h}$ may represent a preset sample motion feature; and $\mathrm{CrossEntropy}(\cdot)$ may represent obtaining the distance between different features.
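Formula (5) might translate to code along the following lines; the averaging over both half bodies and over the T' steps follows the reconstruction above and is an assumption.

```python
import torch
import torch.nn.functional as F

def third_loss(logits_u: torch.Tensor, logits_l: torch.Tensor,
               target_u: torch.Tensor, target_l: torch.Tensor) -> torch.Tensor:
    """logits_*: (T', N) predicted scores; target_*: (T',) preset sample action indices."""
    loss_u = F.cross_entropy(logits_u, target_u)   # averaged over the T' steps
    loss_l = F.cross_entropy(logits_l, target_l)
    return 0.5 * (loss_u + loss_l)                 # average over the upper and lower body
```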
In one possible embodiment, the calculation formula for the electronic device to determine the fourth loss value according to the number of the audio features of the sub-samples, the motion features of the predicted samples and the probability value may be as follows:
$$v = v^{u} + v^{l} = f_v(s)_{T':2T'-1} + f_v(s)_{2T':3T'-1} \qquad (6)$$

$$\varepsilon_{0:T'-2} = r_{0:T'-2} + \mathrm{sg}\big[v_{1:T'-1}\big] - v_{0:T'-2} \qquad (7)$$

$$L_{AC} = \frac{1}{2(T'-1)} \sum_{t=0}^{T'-2} \sum_{h \in \{u,\,l\}} \varepsilon_t \cdot \mathrm{CrossEntropy}\big(\hat{a}_t^{h},\, a_t^{h}\big) \qquad (8)$$

In formulas (6), (7) and (8), $v$ may represent the critic values corresponding to the predicted sample motion features, where $v^{u}$ and $v^{l}$ may represent the critic values corresponding to the predicted sample motion features characterized by the upper body motion and the lower body motion respectively, and $f_v(s)$ may represent the value evaluation network. $\varepsilon$ may represent the time difference error, $r$ may represent the reward value, and $\mathrm{sg}[\cdot]$ may represent a stop gradient, which can be understood as an operation whose operand is used in the forward pass while no gradient is propagated through it in the backward pass. $L_{AC}$ may represent the fourth loss value, which may also be referred to as the actor-critic loss; $T'$ may represent the number of sub-sample audio features; $\hat{a}_t^{h}$ may represent a probability value; $h$ may represent the upper body motion $u$ or the lower body motion $l$; $t$ may represent the index of the current step; $a_t^{h}$ may represent a predicted sample motion feature; and $\mathrm{CrossEntropy}(\cdot)$ may represent obtaining the distance between different features. Formulas (9), (10) and (11) may define the reward value $r$ in terms of a first reward $R_b(\cdot)$, which may also be referred to as the beat-align reward, and a second reward $R_c(\cdot)$, which may also be referred to as the compositional consistency reward; in these formulas, $\inf\{\cdot\}$ may represent an infimum, $\bar{R}_c$ may represent a reference reward corresponding to the second reward, and $\bar{n}^{u}$ and $\bar{n}^{l}$ may represent the orientation features of the upper body motion and the lower body motion relative to the xz plane, respectively.
In one possible embodiment, the value loss calculation may be as follows:
$$L_V = \frac{1}{T'-1} \sum_{t=0}^{T'-2} \left\| \varepsilon_t \right\|_2^{2} \qquad (12)$$

In formula (12), $L_V$ may represent the value loss, $T'$ may represent the number of sub-sample audio features, $\varepsilon$ may represent the time difference error, and $\|\cdot\|_2$ may represent the L2 norm.
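The following sketch strings formulas (7), (8) and (12) together for a single half body, using `detach()` for sg[·]; the reductions match the reconstructions above and are assumptions.

```python
import torch
import torch.nn.functional as F

def actor_critic_losses(values: torch.Tensor, rewards: torch.Tensor,
                        logits: torch.Tensor, actions: torch.Tensor):
    """values:  (T',)     critic values v
    rewards: (T'-1,)   per-step reward values r
    logits:  (T'-1, N) action scores for the trained steps
    actions: (T'-1,)   indices of the predicted sample motion features."""
    # Formula (7): time difference error with a stop gradient on the bootstrapped value.
    td_error = rewards + values[1:].detach() - values[:-1]
    # Formula (8): cross entropy to the taken action, weighted by the time difference error.
    ce = F.cross_entropy(logits, actions, reduction="none")
    actor_critic_loss = (td_error.detach() * ce).mean()
    # Formula (12): squared time difference error as the value loss.
    value_loss = td_error.pow(2).mean()
    return actor_critic_loss, value_loss
```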
In one possible embodiment, Fig. 6 is a schematic diagram illustrating a reward determination method according to an exemplary embodiment. Area (a) may represent the first reward: $d$ may represent the duration corresponding to a sub-sample audio feature, the vertical dotted lines may represent music beats (Music beats), the curve may represent the motion change trend, and the solid points may represent dance beats (Dance beats). In the case where both a music beat and a dance beat exist within the duration corresponding to the sub-sample audio feature, the first reward is 1; in the case where a music beat exists within the duration but no dance beat exists, or no music beat exists within the duration, the first reward is -1. Area (b) may represent the second reward: the orientation feature $n^{u}(\cdot)$ of the upper body motion and the orientation feature $n^{l}(\cdot)$ of the lower body motion may be determined from the predicted sample motion features corresponding to the upper body motion and the lower body motion respectively, and the corresponding orientation features with respect to the xz plane, $\bar{n}^{u}$ and $\bar{n}^{l}$, may then be obtained. In the case where the included angle between the orientation features $\bar{n}^{u}$ and $\bar{n}^{l}$ indicates that they do not belong to the same side, the second reward is 1; in the case where they belong to the same side, the second reward is the included angle between the different orientation features.
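The first reward of Fig. 6(a) reduces to a window test; the representation of beats as time stamps and the window convention are assumptions.

```python
def beat_align_reward(music_beats, dance_beats, t_start: float, d: float) -> int:
    """Returns 1 if the window [t_start, t_start + d) contains both a music beat and
    a dance beat, and -1 otherwise, following the description of area (a) in Fig. 6."""
    has_music = any(t_start <= b < t_start + d for b in music_beats)
    has_dance = any(t_start <= b < t_start + d for b in dance_beats)
    return 1 if (has_music and has_dance) else -1
```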
In one possible embodiment, Fig. 7 is a schematic diagram illustrating a target action according to an exemplary embodiment. $p_0$ and $p_1$ may represent predicted motion features. In the case where the decoded target motion sequence contains consecutive repetitions of the same code of a predicted motion feature (e.g., $p_0$ followed by $p_0$ in the first row, or $p_1$ followed by $p_1$ in the second row), the generated target motion remains static; in the case where consecutive codes of the predicted motion features differ (e.g., $p_0$ followed by $p_1$ in the first row), the generated target motions transition smoothly.
In some embodiments, the action generation method in the embodiments of the present disclosure may be used for generating a dance action, for example, a target dance action may be generated according to target music and an initial action of a target character, and the generating of the target dance action may include the following steps:
acquiring audio characteristics of target music and initial action characteristics of a target role;
generating a prediction action characteristic sequence related in time sequence according to the audio characteristic of the target music and the initial action characteristic of the target role; wherein the sequence of predicted action features includes predicted dance action features;
determining quantized motion characteristics matched with the predicted dance motion characteristics from a preset motion characteristic library;
determining the incidence relation of each predicted dance motion characteristic in the predicted motion characteristic sequence on the space;
according to the incidence relation of each predicted dance action characteristic in the space and the incidence relation of each predicted dance action characteristic in the time sequence, combining the quantized action characteristics matched with each predicted dance action characteristic to obtain a target action characteristic sequence corresponding to the target role;
and controlling the target role to execute the target dance motion corresponding to the target motion characteristic sequence.
In an embodiment of the present disclosure, the target music may take various forms, and the initial motion features of the target role may be determined by preset default motion features or according to a current instruction of the user. The electronic device can take the audio features of the target music and the initial action features of the target role as input, obtain a target action feature sequence corresponding to the target role from them, and then control the target role to execute the target dance action corresponding to the target action feature sequence. The target dance action can be understood as a dance action matched to the form of the target music, and the target dance may comprise dances such as breaking (Breaking), ballet jazz (Ballet jazz), house dance (House dance), or hip hop (Middle hip-hop). Different forms of target music may correspond to different types of target dance actions; for example, if target music in the form of folk music and initial action features are input, the target role can be controlled to execute dance actions of a folk dance type, and the like.
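Put together, the dance-generation steps listed above might be wired up as follows; every callable is a placeholder injected by the caller, not an API of the embodiment.

```python
def generate_dance(target_music, target_character, *, extract_audio_features,
                   get_initial_motion, predict_motion, quantize, combine, drive):
    """Each injected callable stands for one of the steps described above."""
    audio_feats = extract_audio_features(target_music)        # audio features of the target music
    init_feats = get_initial_motion(target_character)         # default or user-specified initial action
    predicted_seq = predict_motion(audio_feats, init_feats)   # temporally related predicted dance features
    quantized_seq = quantize(predicted_seq)                   # matches against the preset feature library
    target_seq = combine(predicted_seq, quantized_seq)        # uses spatial and temporal relations
    return drive(target_character, target_seq)                # execute the target dance motion
```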
On the basis of the motion generation method provided by the foregoing embodiment, an embodiment of the present disclosure also provides a motion generation apparatus.
Fig. 8 is a schematic structural diagram of an action generating device according to an embodiment of the present disclosure, and as shown in fig. 8, the action generating device may include:
an obtaining module 801 configured to obtain a sequence of predicted motion characteristics related in time sequence; wherein the sequence of predicted motion features is generated from at least audio features;
a first determining module 802 configured to determine quantized motion features matching respective predicted motion features in the sequence of predicted motion features from a preset motion feature library; a second determining module 803, configured to determine a spatial association relationship between each of the predicted motion features in the sequence of predicted motion features; the combining module 804 is configured to combine the quantized motion features matched with the predicted motion features according to the association relationship of the predicted motion features in space and the association relationship of the predicted motion features in time sequence to obtain a target motion feature sequence.
In some embodiments, the first determining module 802 is configured to: respectively carrying out dimensionality reduction processing on each predicted action characteristic to obtain coding characteristics; the dimension of the coding feature is the same as the dimension of each quantitative action feature in the preset action feature library; and determining the quantized motion characteristics matched with the predicted motion characteristics from the preset motion characteristic library according to the distance between the coding characteristics and each quantized motion characteristic in the preset motion characteristic library.
In some embodiments, the first determining module 802 is configured to: determining the distance between the coding feature to be processed and each quantized motion feature in the preset motion feature library; sequencing the distances corresponding to the coding features to be processed to obtain a sequencing result; according to the sequencing result, determining quantization action characteristics corresponding to the coding characteristics to be processed; and determining the quantization action characteristics matched with the prediction action characteristics according to the quantization action characteristics corresponding to the coding characteristics to be processed.
In some embodiments, the second determining module 803 is configured to: determining relative position information between each key point and a preset standard point of a target role and displacement information of the target role according to each quantitative action characteristic; wherein the predicted action feature sequence is generated according to the audio features and the initial action features of the target role; and determining the spatial association relationship of each predicted action characteristic according to the relative position information and the displacement information.
In some embodiments, the apparatus 800 further comprises: a third determining module configured to obtain a sample action feature sequence; the updating module is configured to update the model parameters of the initial feature processing model and the quantitative action features in the initial action feature library of the initial feature processing model according to the preset association relationship of each sample action feature in the sample action feature sequence and each sample action feature on the space, so as to obtain an updating result; a fourth determining module configured to obtain a target feature processing model with the preset action feature library based on the update result; the first determining module 802 configured to: and determining quantitative action characteristics matched with each predicted action characteristic in the predicted action characteristic sequence from the preset action characteristic library of the target characteristic processing model.
In some embodiments, the update module is configured to: obtaining a prediction incidence relation of each sample action characteristic on the space by using the initial characteristic processing model; determining a first loss value according to the predicted incidence relation in the space and the preset incidence relation in the space; performing dimension reduction processing on each sample action characteristic to obtain a sample coding characteristic; determining a second loss value according to the sample coding characteristics; obtaining a first target loss value according to the first loss value and the second loss value; and updating the model parameters of the initial characteristic processing model and the quantitative action characteristics in the initial action characteristic library of the initial characteristic processing model by using the first target loss value to obtain the updating result.
In some embodiments, the update module is configured to: according to the sample coding features, determining quantized motion features matched with the sample motion features from the initial motion feature library; and determining the second loss value according to the sample coding characteristics and the quantization action characteristics matched with the sample action characteristics.
In some embodiments, the update module is configured to: determining a first sub-loss value according to a difference value between the sample coding characteristic and the sample quantization action characteristic; determining a second sub-loss value according to the inverse of the difference between the sample coding characteristic and the sample quantization action characteristic; and determining the second loss value according to the first sub-loss value, the second sub-loss value and a preset weight coefficient.
In some embodiments, the obtaining module 801 is configured to: determining predicted action features aligned in time sequence with each sub-audio feature according to each sub-audio feature in the audio features and the initial action feature of the target role; wherein each of the sub-audio features is time-sequentially associated; and obtaining the predicted action characteristic sequence according to each predicted action characteristic.
In some embodiments, the obtaining module 801 is configured to: splitting the initial action characteristic according to the position information of the target role to obtain at least two sub-initial action characteristics; determining a degree of association between each of the sub-audio features and each of the sub-initial motion features; wherein the association degree at least characterizes the association degree of the sub-audio features and the sub-initial action features in content and time sequence; and determining the predicted action characteristic according to the association degree between each sub-audio characteristic and each sub-initial action characteristic.
In some embodiments, the apparatus 800 further comprises: the fifth determining module is configured to obtain sample audio characteristics and sample initial action characteristics of the sample roles; the splitting module is configured to split the initial action characteristics of the sample according to the position information of the role of the sample to obtain at least two initial action characteristics of the sub-sample; a sixth determining module configured to determine a sequence of predicted sample motion features according to each subsample audio feature of the sample audio features and each of the subsample initial motion features; a seventh determining module, configured to train the initial motion prediction model based on each predicted sample motion feature in the predicted sample motion feature sequence and each preset sample motion feature to obtain a target motion prediction model; the obtaining module 801 is configured to: and acquiring a predicted action characteristic sequence related in time sequence by using the target action prediction model.
In some embodiments, the seventh determining module is configured to: obtaining candidate action characteristics and probability values corresponding to the candidate action characteristics according to the audio characteristics of the sub samples and the initial action characteristics of the sub samples; wherein the probability value is used to determine the predicted sample motion feature from the candidate motion features; determining a third loss value according to the number of the audio features of the sub-samples, the action features of the preset samples and the probability value; determining a fourth loss value according to the number of the audio features of the subsample, the motion features of the prediction sample and the probability value; obtaining a second target loss value according to the third loss value and the fourth loss value; and updating the model parameters of the initial motion prediction model by using the second target loss value to obtain the target motion prediction model.
In some embodiments, the seventh determining module is configured to: determining a difference value between the vector characteristic of the probability value and the preset sample action characteristic; and determining the third loss value according to the difference value between the vector characteristic of the probability value and the preset sample action characteristic and the number of the audio characteristics of the subsamples.
In some embodiments, the seventh determining module is configured to: determining a difference between the predicted sample action feature and a vector feature of the probability value; determining the orientation characteristics of the sample roles according to the initial action characteristics of each subsample; determining the fourth loss value according to the number of the sub-sample audio features, the difference value between the predicted sample action feature and the vector feature of the probability value, and the orientation feature.
In some embodiments, the apparatus 800 further comprises: an eighth determining module configured to acquire audio features of target music and initial action features of a target role; a generating module configured to generate a temporally associated predicted action feature sequence according to the audio features of the target music and the initial action features of the target role, where the predicted action feature sequence includes predicted dance action features; a ninth determining module configured to determine, from a preset action feature library, quantized action features matching the predicted dance action features; a tenth determining module configured to determine the spatial association relationship of each predicted dance action feature in the predicted action feature sequence; an eleventh determining module configured to combine the quantized action features matched with the predicted dance action features according to the spatial association relationship and the temporal association relationship of the predicted dance action features to obtain a target action feature sequence corresponding to the target role; and a control module configured to control the target role to perform the target dance action corresponding to the target action feature sequence.
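As an illustration of the quantization and combination steps performed by the ninth to eleventh determining modules, the following sketch matches each predicted dance action feature to its nearest quantized feature in a preset feature library and hands the result to a decoder; the nearest-neighbour rule, the codebook size, and the identity decoder stand-in are assumptions, not the design required by this embodiment.

```python
import numpy as np

def nearest_quantized(pred_features, codebook):
    """For each predicted dance action feature, select the closest
    quantized action feature from the preset action feature library
    (smallest Euclidean distance)."""
    dists = np.linalg.norm(pred_features[:, None, :] - codebook[None, :, :],
                           axis=-1)                 # (T, C) distances
    idx = dists.argmin(axis=-1)
    return codebook[idx], idx

def combine_to_target_sequence(pred_features, codebook, decode_fn):
    """Hypothetical flow: quantize every predicted dance action feature,
    then let decode_fn (an assumed stand-in for the spatial/temporal
    combination step) produce the target action feature sequence."""
    quantized, _ = nearest_quantized(pred_features, codebook)
    return decode_fn(quantized)

# Toy usage (shapes, the codebook, and the identity decoder are illustrative).
pred = np.random.randn(8, 64)          # temporally associated predictions
codebook = np.random.randn(512, 64)    # preset action feature library
target_seq = combine_to_target_sequence(pred, codebook, decode_fn=lambda q: q)
print(target_seq.shape)                # (8, 64)
```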
The obtaining module 801, the first determining module 802, the second determining module 803, and the combining module 804 described above may be implemented based on a processor of an electronic device.
In addition, the functional modules in this embodiment may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional module.
Based on such understanding, the technical solution of this embodiment, in essence or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the method of this embodiment. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Specifically, the computer program instructions corresponding to the action generation method in this embodiment may be stored on a storage medium such as an optical disk, a hard disk, or a USB flash drive. When the computer program instructions corresponding to the action generation method in the storage medium are read and executed by an electronic device, any one of the action generation methods of the foregoing embodiments is implemented.
Based on the same technical concept as the foregoing embodiments, fig. 9 shows an electronic device 90 provided by an embodiment of the present invention, which may include: a memory 91, a processor 92, and a computer program stored on the memory 91 and executable on the processor 92; wherein the memory 91 is configured to store computer programs and data, and the processor 92 is configured to execute the computer program stored in the memory to implement any one of the action generation methods of the foregoing embodiments.
In practical applications, the memory 91 may be a volatile memory such as a Random Access Memory (RAM); a non-volatile memory such as a ROM, a flash memory, a Hard Disk Drive (HDD), or a Solid-State Drive (SSD); or a combination of the above types of memories, and it provides instructions and data to the processor 92.
The processor 92 may be at least one of an Application-Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field-Programmable Gate Array (FPGA), a Central Processing Unit (CPU), a controller, a microcontroller, or a microprocessor.
In some embodiments, functions of, or modules included in, the apparatus provided in the embodiments of the present disclosure may be used to execute the method described in the above method embodiments; for specific implementation, reference may be made to the description of those method embodiments, which is not repeated here for brevity.
The foregoing description of the various embodiments is intended to highlight the differences between the embodiments; the same or similar parts may be referred to each other and are not repeated here for brevity.
The methods disclosed in the method embodiments provided by the present disclosure may be combined arbitrarily without conflict to obtain new method embodiments.
Features disclosed in the various product embodiments provided by the disclosure may be combined in any combination to yield new product embodiments without conflict.
The features disclosed in the various method or apparatus embodiments provided by the present disclosure may be combined arbitrarily, without conflict, to arrive at new method embodiments or apparatus embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and certainly also by hardware, but in many cases the former is the better implementation. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disk) and includes instructions for causing a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
While the present invention has been described with reference to the embodiments shown in the drawings, the invention is not limited to those embodiments, which are illustrative rather than restrictive. Those skilled in the art may make various changes and modifications without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (17)

1. A method of motion generation, the method comprising:
acquiring a prediction action characteristic sequence related in time sequence; wherein the sequence of predicted motion features is generated from at least audio features;
determining quantized motion characteristics matched with each predicted motion characteristic in the predicted motion characteristic sequence from a preset motion characteristic library;
determining the incidence relation of each predicted action characteristic in the predicted action characteristic sequence on the space;
and combining the quantized motion characteristics matched with the predicted motion characteristics according to the spatial association relationship of the predicted motion characteristics and the time sequence association relationship of the predicted motion characteristics to obtain a target motion characteristic sequence.
2. The method of claim 1, wherein determining quantized motion features from a library of predetermined motion features that match respective predicted motion features in the sequence of predicted motion features comprises:
respectively carrying out dimensionality reduction processing on each predicted action characteristic to obtain coding characteristics; the dimension of the coding feature is the same as the dimension of each quantitative action feature in the preset action feature library;
and determining the quantized motion characteristics matched with the predicted motion characteristics from the preset motion characteristic library according to the distance between the coding characteristics and each quantized motion characteristic in the preset motion characteristic library.
3. The method according to claim 2, wherein the determining, from the preset motion feature library, quantized motion features matching respective predicted motion features according to a distance between the coded feature and each quantized motion feature in the preset motion feature library comprises:
determining the distance between the coding feature to be processed and each quantized motion feature in the preset motion feature library;
sequencing all the distances corresponding to the coding features to be processed to obtain a sequencing result;
according to the sequencing result, determining quantization action characteristics corresponding to the coding characteristics to be processed;
and determining the quantization action characteristics matched with the prediction action characteristics according to the quantization action characteristics corresponding to the coding characteristics to be processed.
4. The method according to any one of claims 1 to 3, wherein the determining the spatial relationship between each of the predicted motion features in the sequence of predicted motion features comprises:
determining relative position information between each key point and a preset standard point of a target role and displacement information of the target role according to each quantitative action characteristic; wherein the predicted action feature sequence is generated according to the audio features and the initial action features of the target role;
and determining the spatial association relationship of each predicted action characteristic according to the relative position information and the displacement information.
5. The method according to any one of claims 1 to 4, further comprising:
acquiring a sample action characteristic sequence;
updating model parameters of an initial feature processing model and quantitative motion features in an initial motion feature library of the initial feature processing model according to the sample motion features in the sample motion feature sequence and a preset incidence relation of the sample motion features on the space to obtain an updating result;
obtaining a target feature processing model with the preset action feature library based on the updating result;
the determining, from a preset motion feature library, a quantized motion feature that matches each predicted motion feature in the sequence of predicted motion features includes:
and determining quantitative action characteristics matched with each predicted action characteristic in the predicted action characteristic sequence from the preset action characteristic library of the target characteristic processing model.
6. The method according to claim 5, wherein the updating model parameters of an initial feature processing model and quantized motion features in an initial motion feature library of the initial feature processing model according to the preset association relationship of each sample motion feature and each sample motion feature in the sample motion feature sequence on the space to obtain an updated result comprises:
obtaining a prediction incidence relation of each sample action characteristic on the space by using the initial characteristic processing model;
determining a first loss value according to the predicted incidence relation in the space and the preset incidence relation in the space;
performing dimension reduction processing on each sample action characteristic to obtain a sample coding characteristic;
determining a second loss value according to the sample coding characteristics;
obtaining a first target loss value according to the first loss value and the second loss value;
and updating the model parameters of the initial feature processing model and the quantitative action features in the initial action feature library of the initial feature processing model by using the first target loss value to obtain the updating result.
7. The method of claim 6, wherein determining a second loss value based on the sample coding features comprises:
according to the sample coding features, determining quantized motion features matched with the sample motion features from the initial motion feature library;
and determining the second loss value according to the sample coding characteristics and the quantization action characteristics matched with the sample action characteristics.
8. The method of claim 7, wherein determining the second loss value based on the sample coding feature and a quantization action feature that matches the sample action feature comprises:
determining a first sub-loss value according to a difference value between the sample coding characteristic and the sample quantization action characteristic;
determining a second sub-loss value according to the inverse of the difference between the sample coding characteristic and the sample quantization action characteristic;
and determining the second loss value according to the first sub-loss value, the second sub-loss value and a preset weight coefficient.
9. The method according to any one of claims 1 to 8, wherein the obtaining of the temporally related sequence of predicted motion features comprises:
determining a predicted action characteristic aligned with each sub-audio characteristic in time sequence according to each sub-audio characteristic in the audio characteristics and the initial action characteristic of the target role; wherein each of the sub-audio features is time-sequentially associated;
and obtaining the predicted action characteristic sequence according to each predicted action characteristic.
10. The method of claim 9, wherein determining a predicted motion characteristic that is time-aligned with each of the sub-audio features based on each of the sub-audio features and an initial motion characteristic of a target character comprises:
splitting the initial action characteristic according to the position information of the target role to obtain at least two sub-initial action characteristics;
determining a degree of association between each of the sub-audio features and each of the sub-initial motion features; wherein the association degree at least characterizes the association degree of the sub-audio features and the sub-initial action features in content and time sequence;
and determining the predicted action characteristic according to the association degree between each sub-audio characteristic and each sub-initial action characteristic.
11. The method of any one of claims 1 to 10, further comprising:
acquiring sample audio features and sample initial action features of sample roles;
splitting the initial action characteristics of the sample according to the position information of the sample role to obtain at least two initial action characteristics of the subsample;
determining a prediction sample action characteristic sequence according to each subsample audio characteristic in the sample audio characteristics and each subsample initial action characteristic;
training an initial motion prediction model based on each predicted sample motion characteristic and each preset sample motion characteristic in the predicted sample motion characteristic sequence to obtain a target motion prediction model;
the acquiring of the time-series associated predicted action characteristic sequence comprises:
and acquiring a predicted action characteristic sequence related in time sequence by using the target action prediction model.
12. The method according to claim 11, wherein the training an initial motion prediction model based on each predicted sample motion feature and each preset sample motion feature in the predicted sample motion feature sequence to obtain a target motion prediction model comprises:
obtaining candidate action characteristics and probability values corresponding to the candidate action characteristics according to the audio characteristics of the sub samples and the initial action characteristics of the sub samples; wherein the probability value is used to determine the predicted sample motion feature from the candidate motion features;
determining a third loss value according to the number of the audio features of the sub-samples, the action features of the preset samples and the probability value;
determining a fourth loss value according to the number of the sub-sample audio features, the predicted sample action features and the probability value;
obtaining a second target loss value according to the third loss value and the fourth loss value;
and updating the model parameters of the initial motion prediction model by using the second target loss value to obtain the target motion prediction model.
13. The method of claim 12, wherein determining a third loss value based on the number of sub-sample audio features, the preset sample action features, and the probability value comprises:
determining a difference value between the vector characteristic of the probability value and the preset sample action characteristic;
determining the third loss value according to the difference value between the vector feature of the probability value and the preset sample action feature and the number of the audio features of the sub-samples;
and/or,
determining a fourth loss value according to the number of sub-sample audio features, the predicted sample motion features, and the probability value, including:
determining a difference between the predicted sample action feature and a vector feature of the probability value;
determining the orientation characteristics of the sample roles according to the initial action characteristics of each subsample;
determining the fourth loss value according to the number of the sub-sample audio features, the difference value between the predicted sample action feature and the vector feature of the probability value, and the orientation feature.
14. The method according to any one of claims 1 to 13, further comprising:
acquiring audio characteristics of target music and initial action characteristics of a target role;
generating a prediction action characteristic sequence related in time sequence according to the audio characteristic of the target music and the initial action characteristic of the target role; wherein the sequence of predicted action features includes predicted dance action features;
determining quantized motion characteristics matched with the predicted dance motion characteristics from a preset motion characteristic library;
determining the incidence relation of each predicted dance motion characteristic in the predicted motion characteristic sequence on the space;
combining quantized motion characteristics matched with the predicted dance motion characteristics according to the association relation of the predicted dance motion characteristics in space and the association relation of the predicted dance motion characteristics in time sequence to obtain a target motion characteristic sequence corresponding to the target role;
and controlling the target role to execute the target dance motion corresponding to the target motion characteristic sequence.
15. An action generating apparatus, characterized in that the apparatus comprises:
the acquisition module is configured to acquire a prediction action characteristic sequence related in time sequence; wherein the sequence of predicted motion features is generated from at least audio features;
the first determination module is configured to determine quantitative action characteristics matched with each predicted action characteristic in the predicted action characteristic sequence from a preset action characteristic library;
a second determination module configured to determine a spatial association relationship between each of the predicted motion features in the sequence of predicted motion features;
and the combination module is configured to combine the quantized motion characteristics matched with the predicted motion characteristics according to the association relation of the predicted motion characteristics in space and the association relation of the predicted motion characteristics in time sequence to obtain a target motion characteristic sequence.
16. An electronic device, comprising a processor and a memory for storing a computer program operable on the processor; wherein
the processor is configured to run the computer program to perform the action generation method of any of claims 1 to 14.
17. A computer storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the action generation method of any one of claims 1 to 14.
CN202210463597.8A 2022-02-28 2022-04-28 Action generating method, device, electronic equipment and storage medium Pending CN114741561A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
SG10202202011P 2022-02-28
SG10202202011P 2022-02-28

Publications (1)

Publication Number Publication Date
CN114741561A 2022-07-12

Family

ID=82286341

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210463597.8A Pending CN114741561A (en) 2022-02-28 2022-04-28 Action generating method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114741561A (en)

Patent Citations (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200314508A1 (en) * 2019-03-25 2020-10-01 Rovi Guides, Inc. Systems and methods for creating customized content
WO2021067988A1 (en) * 2019-09-30 2021-04-08 Snap Inc. Automated dance animation
CN110781820A (en) * 2019-10-25 2020-02-11 网易(杭州)网络有限公司 Game character action generating method, game character action generating device, computer device and storage medium
CN110955786A (en) * 2019-11-29 2020-04-03 网易(杭州)网络有限公司 Dance action data generation method and device
CN110992449A (en) * 2019-11-29 2020-04-10 网易(杭州)网络有限公司 Dance action synthesis method, device, equipment and storage medium
CN111080752A (en) * 2019-12-13 2020-04-28 北京达佳互联信息技术有限公司 Action sequence generation method and device based on audio and electronic equipment
CN111432233A (en) * 2020-03-20 2020-07-17 北京字节跳动网络技术有限公司 Method, apparatus, device and medium for generating video
CN111459450A (en) * 2020-03-31 2020-07-28 北京市商汤科技开发有限公司 Interactive object driving method, device, equipment and storage medium
US20210343058A1 (en) * 2020-04-29 2021-11-04 Htc Corporation Method for generating action according to audio signal and electronic device
KR102192210B1 (en) * 2020-06-23 2020-12-16 인하대학교 산학협력단 Method and Apparatus for Generation of LSTM-based Dance Motion
CN111968202A (en) * 2020-08-21 2020-11-20 北京中科深智科技有限公司 Real-time dance action generation method and system based on music rhythm
CN111986700A (en) * 2020-08-28 2020-11-24 广州繁星互娱信息科技有限公司 Method, device, equipment and storage medium for triggering non-contact operation
CN112330779A (en) * 2020-11-04 2021-02-05 北京慧夜科技有限公司 Method and system for generating dance animation of character model
CN112365568A (en) * 2020-11-06 2021-02-12 广州小鹏汽车科技有限公司 Audio processing method and device, electronic equipment and storage medium
CN112990283A (en) * 2021-03-03 2021-06-18 网易(杭州)网络有限公司 Image generation method and device and electronic equipment
CN113763532A (en) * 2021-04-19 2021-12-07 腾讯科技(深圳)有限公司 Human-computer interaction method, device, equipment and medium based on three-dimensional virtual object
CN113750523A (en) * 2021-04-19 2021-12-07 腾讯科技(深圳)有限公司 Motion generation method, device, equipment and storage medium for three-dimensional virtual object
CN113781609A (en) * 2021-08-26 2021-12-10 河南科技学院 Dance action real-time generation system based on music rhythm
CN113901189A (en) * 2021-10-18 2022-01-07 深圳追一科技有限公司 Digital human interaction method and device, electronic equipment and storage medium

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
CHEN KANG et al.: "ChoreoMaster: Choreography-Oriented Music-Driven Dance Synthesis", ACM Transactions on Graphics, vol. 40, no. 4, 1 August 2021 (2021-08-01), pages 1-13 *
LI BUYU et al.: "DanceNet3D: Music Based Dance Generation with Parametric Motion Transformer", arXiv, 18 March 2021 (2021-03-18), page 9 *
LI RUILONG et al.: "AI Choreographer: Music Conditioned 3D Dance Generation with AIST++", 2021 IEEE/CVF International Conference on Computer Vision (ICCV 2021), 1 January 2021 (2021-01-01), pages 13381-13392 *
JIANG Lai et al.: "A Survey of Audio-Driven Cross-Modal Visual Generation Algorithms", Journal of Graphics (图学学报), vol. 43, no. 02, 4 January 2022 (2022-01-04), pages 181-188 *
QI Yu: "Audio-Driven Dance Motion Generation", China Master's Theses Full-text Database, Information Science and Technology (中国优秀硕士学位论文全文数据库 信息科技辑), no. 01, 15 January 2022 (2022-01-15), pages 136-457 *

Similar Documents

Publication Publication Date Title
Huang et al. Dance revolution: Long-term dance generation with music via curriculum learning
Punnakkal et al. BABEL: Bodies, action and behavior with english labels
Fabius et al. Variational recurrent auto-encoders
EP3803846B1 (en) Autonomous generation of melody
Aristidou et al. Rhythm is a dancer: Music-driven motion synthesis with global structure
US20170124400A1 (en) Automatic video summarization
Ahn et al. Generative autoregressive networks for 3d dancing move synthesis from music
CN110503074A (en) Information labeling method, apparatus, equipment and the storage medium of video frame
Hsu et al. Example-based control of human motion
WO2020088491A1 (en) Method, system, and device for classifying motion behavior mode
CN111581519A (en) Item recommendation method and system based on user intention in session
CN115599984B (en) Retrieval method
CN116392812A (en) Action generating method and virtual character animation generating method
CN107506479B (en) A kind of object recommendation method and apparatus
Narasimhan et al. Strumming to the beat: Audio-conditioned contrastive video textures
CN111104964B (en) Method, equipment and computer storage medium for matching music with action
Lu et al. Co-speech gesture synthesis using discrete gesture token learning
CN113039561A (en) Aligning sequences by generating encoded representations of data items
Han et al. AMD: Autoregressive Motion Diffusion
CN111711868B (en) Dance generation method, system and device based on audio-visual multi-mode
Chen et al. Cross-domain recommendation with behavioral importance perception
CN114117086A (en) Method and device for manufacturing multimedia works and computer readable storage medium
CN114741561A (en) Action generating method, device, electronic equipment and storage medium
CN115294228B (en) Multi-figure human body posture generation method and device based on modal guidance
Liu et al. Motion improvisation: 3d human motion synthesis with a transformer

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination