CN116883523A - Image generation method, image generation model training method and related devices - Google Patents
- Publication number
- CN116883523A (application number CN202210295206.6A)
- Authority
- CN
- China
- Prior art keywords: sample, image, target, initial, determining
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06T11/00—2D [Two Dimensional] image generation
- G06N3/02—Neural networks; G06N3/08—Learning methods (G06N3/00—Computing arrangements based on biological models)
- G06T7/10—Segmentation; Edge detection (G06T7/00—Image analysis)
- G06T2207/20081—Training; Learning (G06T2207/20—Special algorithmic details)
- G06T2207/20084—Artificial neural networks [ANN]
Abstract
The embodiment of the application discloses an image generation method, an image generation model training method and a related device, which can be applied to the fields of artificial intelligence, computer vision, image processing and the like. The image generation method disclosed by the embodiment of the application comprises the following steps: determining a state key point image when a reference object is in a target behavior state; determining an initial image of a target object, and determining an initial segmentation feature corresponding to the initial image; and determining a target image when the target object is in the target behavior state based on the state key point image, the initial image and the initial segmentation feature. By adopting the embodiment of the application, the target image of the target object in the target behavior state can be determined based on the state key point image of any reference object in the target behavior state, and the applicability is high.
Description
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to an image generation method, an image generation model training method, and a related apparatus.
Background
With the continuous development of artificial intelligence technology, the behavior state of a reference object is often migrated to a target object in the field of image generation to generate a new image, that is, the target object in the obtained new image has the same behavior state as the reference object.
In the prior art, image generation is usually realized either by relying on 3D modeling technology or by training an image generation model on a large amount of sample data. For the former, the generation effect often depends on modeling accuracy, and it is difficult for existing modeling methods to obtain high-quality images; for the latter, the generation effect cannot be guaranteed when the amount of data is limited.
Disclosure of Invention
The embodiment of the application provides an image generation method, an image generation model training method and a related device, which can determine a target image when a target object is in a target behavior state based on a state key point image when any reference object is in the target behavior state, and have high applicability.
In one aspect, an embodiment of the present application provides an image generating method, including:
determining a state key point image when the reference object is in a target behavior state;
determining an initial image of a target object and a corresponding initial segmentation map thereof;
and determining a target image when the target object is in the target behavior state based on the state key point image, the initial image and the initial segmentation map.
In another aspect, an embodiment of the present application provides an image generation model training method, including:
Determining a training sample set comprising sample images of at least one sample object in different behavior states;
determining a first sample image and a second sample image from the training sample set, and determining an initial sample segmentation map corresponding to the first sample image and a state key sample image corresponding to the second sample image, wherein the first sample image and the second sample image correspond to the same sample object;
inputting the first sample image, the initial sample segmentation map and the state key sample image into an initial model to obtain a predicted image when a first sample object corresponding to the first sample image is in a first behavior state corresponding to the second sample image;
and determining total training loss based on the second sample image and the predicted image, performing iterative training on the initial model based on the training sample set and the total training loss, stopping training until the total training loss meets the training ending condition, and determining the model at the time of stopping training as an image generation model.
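By way of a non-limiting illustration, the training procedure summarized above can be sketched as follows in Python; the iterator of training tuples, the Adam optimizer, the L1 reconstruction term standing in for the total training loss, and the fixed loss threshold are assumptions made for illustration only and are not prescribed by this embodiment.

```python
# Hedged sketch of the training loop: sample a pair, predict, compute a placeholder
# total training loss, update, and stop once the loss meets the end condition.
import torch

def train(initial_model, training_pairs, loss_threshold=0.01, lr=1e-4, max_steps=100_000):
    # training_pairs is assumed to yield (first_image, initial_sample_seg_map,
    # state_key_sample_image, second_image) tuples built from the training sample set.
    optimizer = torch.optim.Adam(initial_model.parameters(), lr=lr)
    for step in range(max_steps):
        first_img, seg_map, keypoint_img, second_img = next(training_pairs)
        predicted = initial_model(first_img, seg_map, keypoint_img)
        total_loss = torch.nn.functional.l1_loss(predicted, second_img)  # placeholder loss
        optimizer.zero_grad()
        total_loss.backward()
        optimizer.step()
        if total_loss.item() < loss_threshold:   # training end condition (assumed form)
            break
    return initial_model                          # taken as the image generation model
```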
In another aspect, an embodiment of the present application provides an image generating apparatus, including:
The image determining module is used for determining a state key point image when the reference object is in the target behavior state;
the image processing module is used for determining an initial image of the target object and a corresponding initial segmentation map thereof;
and the image generation module is used for determining a target image when the target object is in the target behavior state based on the state key point image, the initial image and the initial segmentation map.
In another aspect, an embodiment of the present application provides an image generation model training apparatus, including:
a sample determination module for determining a training sample set comprising sample images of at least one sample object in different behavior states;
the sample processing module is used for determining a first sample image and a second sample image from the training sample set, determining an initial sample segmentation map corresponding to the first sample image and a state key sample image corresponding to the second sample image, wherein the first sample image and the second sample image correspond to the same sample object;
the image prediction module is used for inputting the first sample image, the initial sample segmentation map and the state key sample image into an initial model to obtain a predicted image when a first sample object corresponding to the first sample image is in a first behavior state corresponding to the second sample image;
And the model training module is used for determining total training loss based on the second sample image and the predicted image, carrying out iterative training on the initial model based on the training sample set and the total training loss until the total training loss accords with the training ending condition, stopping training, and determining the model at the time of stopping training as an image generation model.
In another aspect, an embodiment of the present application provides an electronic device, including a processor and a memory, where the processor and the memory are connected to each other;
the memory is used for storing a computer program;
the processor is configured to execute the image generation method and/or the image generation model training method provided by the embodiment of the application when the computer program is invoked.
In another aspect, embodiments of the present application provide a computer readable storage medium storing a computer program that is executed by a processor to implement the image generation method and/or the image generation model training method provided by the embodiments of the present application.
In another aspect, an embodiment of the present application provides a computer program product, where the computer program product includes a computer program, where the computer program implements the image generation method and/or the image generation model training method provided by the embodiment of the present application when the computer program is executed by a processor.
In the embodiment of the application, the target image when the target object is in the target behavior state corresponding to the reference object can be determined based on the state key point image when any reference object is in the target behavior state. The training method of the image generation model in the embodiment of the application can train based on simple sample images, can quickly train to obtain the image generation model with higher accuracy under the condition of not needing a large number of training samples, and further generates the target image when the target object is in the target behavior state corresponding to the reference object through the image generation model.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic view of a scenario of an image generating method according to an embodiment of the present application;
fig. 2 is a schematic flow chart of an image generating method according to an embodiment of the present application;
FIG. 3 is a schematic view of a scenario of target segmentation feature determination according to an embodiment of the present application;
FIG. 4 is a schematic view of a scenario for determining fusion features according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a scene of generating video clips according to an embodiment of the present application;
FIG. 6 is a flow chart of an image generation model training method according to an embodiment of the present application;
FIG. 7 is a schematic diagram of an initial model provided by an embodiment of the present application;
fig. 8 is a schematic structural view of an image generating apparatus according to an embodiment of the present application;
FIG. 9 is a schematic structural diagram of an image generation model training apparatus according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Referring to fig. 1, fig. 1 is a schematic view of a scenario of an image generating method according to an embodiment of the present application. As shown in fig. 1, if the behavior state of the target object needs to be converted into the target behavior state of the reference object, a state key point image when the reference object is in the target behavior state may be determined, and an initial segmentation feature corresponding to the initial image of the target object may be determined.
For example, if there is an expression image of a reference object and an expression image of a target object, the state key point image corresponding to the reference object under its current expression may be determined, and the initial segmentation map corresponding to the expression image of the target object may be determined.
Further, the target image when the target object is in the target behavior state can be determined based on the state key point image when the reference object is in the target behavior state, the initial image of the target object and the initial segmentation map corresponding to the initial image of the target object.
For example, a new expression image of the target object may be determined based on the expression image of the target object, the initial segmentation map corresponding to the expression image, and the state key point image corresponding to the reference object under its current expression. The new expression image is an image of the target object whose expression is consistent with the expression in the expression image of the reference object.
Referring to fig. 2, fig. 2 is a flowchart of an image generating method according to an embodiment of the present application. As shown in fig. 2, the image generating method provided by the embodiment of the application may include the following steps:
step S21, determining a state key point image when the reference object is in the target behavior state.
In some possible embodiments, the target behavior state in which the reference object is located and the state keypoint image when the reference object is in the target behavior state may be determined based on the object image of the reference object.
The target behavior state of the reference object is used to describe a specific posture of the reference object, such as sitting posture, standing posture, facial expression or limb movement, and the like, which is not limited herein.
For example, if the object image of the reference object is a face image of the reference object, the target behavior state of the reference object may be used to describe the facial expression of the reference object.
For another example, if the object image of the reference object is a human body image of the reference object, the target behavior state of the reference object may be used to describe the specific limb motion of the reference object.
The state key points when the reference object is in the target behavior state can be used for representing the specific gesture of the reference object, and the state key points can be obtained by carrying out key point identification on the object image of the reference object.
For example, the object image of the reference object is a face image of the reference object, the target behavior state of the reference object can be used for describing the facial expression of the reference object, the state key points when the reference object is in the target behavior state are face key points, and the facial expression of the reference object at the moment can be determined through the distribution positions of the face key points.
For another example, the object image of the reference object is a human body image of the reference object, the target behavior state of the reference object can be used for describing the specific limb action of the reference object, and the state key points when the reference object is in the target behavior state are human body key points, so that the limb action of the reference object can be determined according to the distribution positions of the human body key points.
The state key point image when the reference object is in the target behavior state can be determined through a pre-trained key point recognition model, and the model structure is not limited herein.
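As a rough illustration (not part of this embodiment), once a pre-trained key point recognition model has produced keypoint coordinates, a state key point image can be formed by rasterizing those coordinates onto a blank canvas; the sketch below assumes the keypoints are already available as pixel coordinates, and the 68-point face layout is only an example.

```python
# Minimal sketch: turn detected keypoints into a single-channel "state key point image".
import torch

def keypoints_to_state_image(keypoints, height, width, radius=2):
    """keypoints: tensor of shape [N, 2] holding (x, y) pixel coordinates."""
    canvas = torch.zeros(1, height, width)          # one channel, blank background
    for x, y in keypoints.round().long().tolist():
        x0, x1 = max(x - radius, 0), min(x + radius + 1, width)
        y0, y1 = max(y - radius, 0), min(y + radius + 1, height)
        canvas[0, y0:y1, x0:x1] = 1.0               # draw a small blob per keypoint
    return canvas

# Example: 68 hypothetical face keypoints on a 256x256 reference image
face_kpts = torch.rand(68, 2) * 255
state_keypoint_image = keypoints_to_state_image(face_kpts, 256, 256)
```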
The reference objects in the embodiments of the present application include, but are not limited to, characters, animals, or avatars, and the like, which are not limited herein.
Step S22, determining an initial image of the target object and a corresponding initial segmentation map thereof.
In some possible embodiments, the initial image of the target object and the object image of the reference object are images with similar types, for example, the initial image of the target object and the object image of the reference object are both face images, or are both human images, so that the target image of the target object in the same behavior state of the reference object can be obtained.
The reference objects in the embodiments of the present application include, but are not limited to, characters, animals, or avatars, and the like, which are not limited herein.
For the target object, after determining the initial image of the target object, an initial segmentation map corresponding to the target object may be determined based on the initial image. The initial segmentation map corresponding to the target object is used for representing the distribution state of each limb part (such as limbs) or each organ (such as face organ) of the target object.
When determining the initial segmentation map of the target object, the initial segmentation map can be determined based on a pre-trained segmentation feature extraction model, and the model structure is not limited herein.
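For illustration only, the sketch below shows the shape of such an initial segmentation map using a general-purpose torchvision DeepLabV3 model as a stand-in; the segmentation feature extraction model of this embodiment (for example a dedicated face or body parsing network) is not specified here and would replace the placeholder.

```python
# Placeholder sketch: obtain a per-pixel label map for the target object's initial image.
import torch
from torchvision.models.segmentation import deeplabv3_resnet50

model = deeplabv3_resnet50(weights="DEFAULT").eval()   # stand-in for the parsing model

initial_image = torch.rand(1, 3, 256, 256)             # placeholder target-object image
with torch.no_grad():
    logits = model(initial_image)["out"]               # [1, num_classes, 256, 256]
initial_segmentation_map = logits.argmax(dim=1)        # per-pixel class labels
```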
Step S23, determining a target image when the target object is in the target behavior state based on the state key point image, the initial image and the initial segmentation map.
In some possible embodiments, after obtaining the state key point image when the reference object is in the target behavior state and the initial segmentation map corresponding to the target object, the corresponding target segmentation feature when the target object is in the target behavior state may be determined based on the state key point image and the initial segmentation map. And determining initial object features of the target object based on the initial image of the target object, thereby determining a target image of the target object when the target object is in the target behavior state based on the initial object features and the target segmentation features.
The target segmentation features corresponding to the target object in the target behavior state are used for representing the distribution state and the morphology of each limb part or each organ of the target object in the target behavior state.
For example, where the target behavioral state is used to describe a facial expression of the reference object, then the corresponding target segmentation features of the target object when in the target behavioral state may be used to characterize the distribution state and morphology of the facial organs of the target object when the facial expression was made.
Similarly, when the target behavior state is used to describe a jump behavior of the reference object, then the corresponding target segmentation feature of the target object when the target object is in the target behavior state may be used to characterize the distribution state and morphology of each limb portion of the target object when the jump behavior is made.
In some possible embodiments, when determining the target segmentation feature corresponding to the target object in the target behavior state, convolution processing may be performed on the state key point image when the reference object is in the target behavior state, so as to obtain a first weight and a first offset used for adjusting the initial segmentation feature corresponding to the target object.
The state key point image can be subjected to convolution processing through a first initial convolution network, so that a first initial convolution characteristic is obtained. And further processing the first initial convolution characteristic through different convolution networks (such as a first convolution network and a second convolution network) respectively to obtain a first weight and a first offset corresponding to the initial segmentation characteristic.
The network structures and related parameters of the first initial convolution network, the first convolution network and the second convolution network, which are adopted in determining the target segmentation feature, are not limited.
Further, the initial segmentation feature corresponding to the target object can be determined based on the initial segmentation map, and the initial segmentation feature can be processed based on its corresponding first weight and first offset to obtain the target segmentation feature corresponding to the target object in the target behavior state.
Namely multiplying the first weight by the initial segmentation feature, and adding the result and the first offset to obtain a corresponding target segmentation feature when the target object is in the target behavior state.
For the initial segmentation feature, the initial segmentation feature may be normalized to obtain a normalized initial segmentation feature, and then the first weight is multiplied by the normalized initial segmentation feature, and the result is added to the first offset to obtain a target segmentation feature corresponding to the target object in the target behavior state.
That is, specific gesture information can be provided based on the state key point image corresponding to the reference object, the initial segmentation feature can be processed based on the gesture information to adjust the gesture of the target object in the initial image, and finally the corresponding target segmentation feature when the target object is in the target behavior state is obtained.
Referring to fig. 3, fig. 3 is a schematic view of a scene for determining a target segmentation feature according to an embodiment of the present application. As shown in fig. 3, the initial segmentation feature corresponding to the target object is normalized to obtain a normalized initial segmentation feature, the state key point image is input into a first initial convolution network to obtain a first initial convolution feature, and the first initial convolution feature is processed through the first convolution network and a second convolution network to obtain a first weight and a first offset. And then multiplying the normalized initial segmentation feature by a first weight, and adding the multiplication result and the first offset to finally obtain the corresponding target segmentation feature when the target object is in the target behavior state.
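A minimal sketch of the modulation shown in FIG. 3, written in PyTorch under the assumption of a SPADE-style layout; the layer sizes, the use of instance normalization, and the module name are illustrative choices, not details taken from this embodiment.

```python
# Sketch: the state key point image is convolved into a shared feature, from which two
# separate convolutions predict the first weight and the first offset; these rescale
# and shift the normalized initial segmentation feature.
import torch
import torch.nn as nn

class KeypointConditionedModulation(nn.Module):
    def __init__(self, feat_channels, kpt_channels=1, hidden=64):
        super().__init__()
        self.norm = nn.InstanceNorm2d(feat_channels, affine=False)        # normalization step
        self.shared = nn.Sequential(                                      # "first initial convolution network"
            nn.Conv2d(kpt_channels, hidden, 3, padding=1), nn.ReLU())
        self.to_weight = nn.Conv2d(hidden, feat_channels, 3, padding=1)   # first weight
        self.to_offset = nn.Conv2d(hidden, feat_channels, 3, padding=1)   # first offset

    def forward(self, seg_feature, keypoint_image):
        h = self.shared(keypoint_image)
        weight, offset = self.to_weight(h), self.to_offset(h)
        return self.norm(seg_feature) * weight + offset                   # normalized * weight + offset

seg_feat = torch.rand(1, 128, 64, 64)      # initial segmentation feature (illustrative size)
kpt_img = torch.rand(1, 1, 64, 64)         # state key point image resized to the feature grid
target_seg_feat = KeypointConditionedModulation(128)(seg_feat, kpt_img)
```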
In some possible embodiments, after determining the initial object feature of the target object, the fusion feature may be obtained based on the initial object feature of the target object and the corresponding target segmentation feature when the target object is in the target behavior state. The initial object characteristics of the target object are fused with the distribution state and the morphology of each limb part or each organ when the target object is in the target behavior state, so that the fusion characteristics can simultaneously represent the initial object characteristics of the target object and the distribution state and the morphology of each limb part or each organ when the target object is in the target behavior.
The initial object features of the target object may be obtained based on a pre-trained feature extraction network, and the specific network structure is not limited herein.
When determining the fusion feature corresponding to the target object, convolution processing can be performed on the initial object feature of the target object to obtain a second weight and a second offset used for adjusting the target segmentation feature corresponding to the target object.
The initial object features can be subjected to convolution processing through a second initial convolution network, so that second initial convolution features are obtained. The second initial convolution features are then processed through different convolution networks (such as a third convolution network and a fourth convolution network) respectively, so as to obtain the second weight and the second offset corresponding to the target segmentation feature.
The network structures and related parameters of the second initial convolutional network, the third convolutional network and the fourth convolutional network adopted in determining the fusion characteristic are not limited.
Further, the target segmentation feature may be processed based on the target segmentation feature and the corresponding second weight and second offset thereof, to obtain a fusion feature corresponding to the target object in the target behavior state.
Namely multiplying the second weight by the target segmentation feature, and adding the result and the second offset to obtain a corresponding fusion feature when the target object is in the target behavior state.
For the target segmentation feature, the target segmentation feature may be normalized to obtain a normalized target segmentation feature, and then the second weight is multiplied by the normalized target segmentation feature, and the result is added with the second offset to obtain a fusion feature corresponding to the target object in the target behavior state.
That is, based on the above manner, the corresponding target segmentation feature when the target object is in the target behavior state can be fused with the initial object feature thereof, so as to obtain the fusion feature which can simultaneously represent the initial object feature of the target object and the distribution state and the morphology of each limb part or each organ when the target object is in the target behavior.
Referring to fig. 4, fig. 4 is a schematic view of a scenario for determining fusion features according to an embodiment of the present application. As shown in fig. 4, the target segmentation feature corresponding to the target object is normalized to obtain a normalized target segmentation feature, the initial object feature of the target object is input into a second initial convolution network to obtain a second initial convolution feature, and the second initial convolution feature is processed through a third convolution network and a fourth convolution network to obtain a second weight and a second offset. And then multiplying the normalized target segmentation feature by a second weight, and adding the multiplication result and the second offset to finally obtain a corresponding fusion feature when the target object is in the target behavior state.
When determining the fusion feature, the target segmentation feature determined based on the initial segmentation feature and the state key point image can be processed based on the deconvolution network to obtain a corresponding target segmentation map when the target object is in the target behavior state. And then, carrying out feature extraction on the target segmentation map when determining the fusion feature, and determining the fusion feature based on the extracted feature and the initial object feature of the target object.
Optionally, in order to fully fuse the initial object features of the target object with the distribution state and the morphology of each limb part or each organ when the target object is in the target behavior state, the fusion feature determined based on the second weight and the second offset may be determined as a first fusion feature, the first fusion feature may then be fused with the initial object features of the target object in the above manner to obtain a second fusion feature, and so on; after feature fusion has been performed a preset number of times in this way, the finally obtained result is determined as the fusion feature corresponding to the target object in the target behavior state.
For example, if two fusion operations need to be performed based on the initial object features of the target object, after the target segmentation feature is processed based on the second weight and the second offset obtained in the above implementation to obtain a first fusion feature, the first fusion feature may be normalized again to obtain a normalized first fusion feature, the normalized first fusion feature is then processed based on the second weight and the second offset to obtain a second fusion feature, and the second fusion feature is determined as the fusion feature corresponding to the target object in the target behavior state.
Similarly, if three times of fusion are needed based on the initial object features of the target object, the second fusion features can be standardized again after the second fusion features are obtained, the standardized second fusion features are obtained, and further the standardized second fusion features are processed based on the second weight and the second offset, so that the fusion features corresponding to the target object in the target behavior state are obtained.
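The repeated fusion described above can be sketched as follows, under the assumption that the object-feature-conditioned branch has the same weight/offset structure as the keypoint-conditioned one; the two fusion rounds, the channel sizes and the normalization choice are illustrative assumptions rather than details of this embodiment.

```python
# Sketch: each fusion round re-normalizes the current feature and applies the second
# weight and second offset derived from the initial object feature.
import torch
import torch.nn as nn

class ObjectConditionedFusion(nn.Module):
    def __init__(self, feat_channels, obj_channels, hidden=64):
        super().__init__()
        self.norm = nn.InstanceNorm2d(feat_channels, affine=False)
        self.shared = nn.Sequential(                                      # "second initial convolution network"
            nn.Conv2d(obj_channels, hidden, 3, padding=1), nn.ReLU())
        self.to_weight = nn.Conv2d(hidden, feat_channels, 3, padding=1)   # second weight
        self.to_offset = nn.Conv2d(hidden, feat_channels, 3, padding=1)   # second offset

    def forward(self, feature, object_feature):
        h = self.shared(object_feature)
        return self.norm(feature) * self.to_weight(h) + self.to_offset(h)

fusion = ObjectConditionedFusion(feat_channels=128, obj_channels=256)
target_seg_feat = torch.rand(1, 128, 64, 64)     # target segmentation feature
obj_feat = torch.rand(1, 256, 64, 64)            # initial object feature (spatial form assumed)

x = target_seg_feat
for _ in range(2):                               # two rounds: first and second fusion features
    x = fusion(x, obj_feat)
fusion_feature = x
```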
Further, the target image corresponding to the target object in the target behavior state may be determined, based on the corresponding fusion feature, through a pre-trained image generation network.
For the pre-trained image generation network, the image generation network can perform feature modification on the standard feature map corresponding to the image generation network based on the fusion feature of the target object, for example, the feature modification process can be realized through a style migration algorithm, and the target image corresponding to the target object in the target behavior state is generated based on the modified target feature map.
The style migration algorithm may be the AdaIN algorithm, or may be another algorithm, which is not limited herein.
When the target feature map is determined, a first mean value and a first variance corresponding to the standard feature map can be determined, a second mean value and a second variance corresponding to the fusion feature when the target object is in the target behavior state can be determined, and the standard feature map is processed based on the first mean value, the first variance, the second mean value and the second variance, so that the target feature map is obtained.
As an example, the target feature map x' may be obtained by: x' = σ(y) · (x − μ(x)) / σ(x) + μ(y), wherein μ(x) and σ(x) are respectively the first mean and the first variance corresponding to the standard feature map x, and μ(y) and σ(y) are respectively the second mean and the second variance corresponding to the fusion feature y.
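A small sketch of the AdaIN-style rewriting expressed by the formula above; channel-wise statistics are computed with the standard deviation where the text speaks of variance, which is the usual AdaIN convention and an assumption made here for illustration.

```python
# Sketch: re-normalize the standard feature map x with its own statistics, then rescale
# and shift it with the statistics of the fusion feature y.
import torch

def adain(x, y, eps=1e-5):
    # x: standard feature map [N, C, H, W]; y: fusion feature broadcastable to [N, C, H', W']
    mu_x = x.mean(dim=(2, 3), keepdim=True)
    sigma_x = x.std(dim=(2, 3), keepdim=True) + eps
    mu_y = y.mean(dim=(2, 3), keepdim=True)
    sigma_y = y.std(dim=(2, 3), keepdim=True) + eps
    return sigma_y * (x - mu_x) / sigma_x + mu_y

x = torch.rand(1, 512, 16, 16)     # standard feature map of the image generation network
y = torch.rand(1, 512, 16, 16)     # fusion feature of the target object
target_feature_map = adain(x, y)
```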
According to the image generation method provided by the embodiment of the application, the target object can be given the same behavior state as the reference object while the information related to the target object is kept. On this basis, a state key point image sequence may be acquired, which includes a plurality of state key point images, each corresponding to one behavior state. The state key point image sequence is used for representing the static behavior states of a continuous behavior state at different moments, and the reference objects corresponding to the state key point images may be the same or different.
Further, for each state key point image in the state key point image sequence, the target image when the target object is in the behavior state corresponding to that state key point image can be determined based on the initial image of the target object and the state key point image. Then, based on the arrangement order of the state key point images in the state key point image sequence, the corresponding target images are arranged to obtain a target image sequence, so as to generate a target video based on the target image sequence.
The object in the target video is the target object, and the continuous behavior state of the target object in the target video is the same as the continuous behavior state corresponding to the state key point image sequence. That is, given a state key point image sequence and an initial image of a target object, a target video in which the target object presents the continuous behavior state corresponding to the state key point image sequence can be generated.
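The frame-by-frame generation described above can be sketched as follows; generate_target_image stands for the full image generation pipeline of this embodiment and is assumed to be available as a callable, so only the per-frame iteration and the ordering of the resulting target image sequence are shown.

```python
# Sketch: generate one target image per state key point image, keeping the sequence order,
# and stack the results into a target image sequence (the frames of the target video).
import torch

def generate_target_video(initial_image, state_keypoint_sequence, generate_target_image):
    target_frames = []
    for state_keypoint_image in state_keypoint_sequence:       # keep the original order
        frame = generate_target_image(initial_image, state_keypoint_image)
        target_frames.append(frame)
    return torch.stack(target_frames)                          # [T, C, H, W] target image sequence
```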
Based on the above implementation, given an initial image of a target object, a target video of the target object in a target scene may be generated based on state key point image sequences under different scene requirements.
For example, if a news anchor is taken as a reference object, a segment of news broadcast video corresponding to the news anchor may be determined, the state key points corresponding to the news anchor in each video frame may be determined, and the state key point image corresponding to each video frame may be determined, so as to obtain a state key point image sequence corresponding to the news anchor.
When the target object is a virtual character, a target video in which the virtual character broadcasts news with the same continuous behavior state as the news anchor may be generated based on an initial image of the virtual character and the state key point image sequence corresponding to the news anchor.
For another example, if a movie character is taken as a reference object, a movie clip corresponding to the character may be determined, the state key points corresponding to the character in each video frame may be determined, and the state key point image corresponding to each video frame may be determined, so as to obtain a state key point image sequence corresponding to the character.
For another example, referring to fig. 5, fig. 5 is a schematic view of a scene of generating a video clip according to an embodiment of the present application. In the case where the target object is an animated character, given an image sequence of the reference object in different behavior states, a state key point image sequence may be determined based on the image sequence. Then, based on the initial image of the target object and the state key point image sequence, the target image when the target object is in each behavior state corresponding to the image sequence is generated according to the method provided by the embodiment of the application, and a target video in which the target object moves through the behavior states corresponding to the image sequence is further generated based on the target images.
The image generation method provided by the embodiment of the application can be realized based on a server or a terminal. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud computing service.
The terminal (may also be referred to as a user terminal or a user device) may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart voice interaction device (e.g., a smart speaker), a wearable electronic device (e.g., a smart watch), a vehicle-mounted terminal, a smart home appliance (e.g., a smart television), an AR/VR device, and the like.
The image generation method provided by the embodiment of the application can also be realized based on an image generation model, namely, the state key point image of the reference object in the target behavior state is determined, the initial image of the target object and its initial segmentation map are determined, and then the state key point image, the initial image of the target object and the initial segmentation map are input into the image generation model to obtain the target image when the target object is in the target behavior state.
Referring to fig. 6, fig. 6 is a flow chart of an image generation model training method according to an embodiment of the present application.
As shown in fig. 6, the image generation model training method provided by the embodiment of the application may include the following steps:
step S61, determining a training sample set.
In some possible embodiments, the training sample set includes sample images of at least one sample object in different behavior states.
Wherein, in determining the training samples, a plurality of video clips may be determined. For each video clip, objects that appear consecutively in the video clip are taken as one sample object, and video frames that include the sample object are taken as sample images.
Alternatively, at least one sample object may be first determined, and an image of the sample corresponding to a different behavior state may be determined as a sample image.
Alternatively, continuous images of each pedestrian while walking may be acquired based on an intelligent transportation system (Intelligent Traffic System, ITS) or an intelligent vehicle-road coordination system (Intelligent Vehicle Infrastructure Cooperative Systems, IVICS), each pedestrian is taken as one sample object, and the continuous images of that pedestrian while walking are determined as sample images.
The intelligent traffic system, also called the intelligent transportation system (Intelligent Transportation System), effectively and comprehensively applies advanced science and technology (information technology, computer technology, data communication technology, sensor technology, electronic control technology, automatic control theory, operations research, artificial intelligence and the like) to transportation, service control and vehicle manufacturing, and strengthens the connection among vehicles, roads and users. The intelligent vehicle-road cooperative system, referred to as the vehicle-road cooperative system for short, is one development direction of the intelligent transportation system (ITS). The vehicle-road cooperative system adopts advanced wireless communication, new-generation internet and other technologies, comprehensively implements dynamic real-time vehicle-vehicle and vehicle-road information interaction, and develops vehicle active safety control and road cooperative management on the basis of full-time-and-space dynamic traffic information acquisition and fusion, thereby fully realizing effective cooperation among people, vehicles and roads.
In the embodiment of the present application, the training sample set may be stored in a server, a database, a cloud storage space or a blockchain, which may specifically be determined based on the actual application scenario requirements and is not limited herein. The database may be considered, in short, as an electronic filing cabinet, that is, a place where electronic files are stored, and may be used in the present application to store training sample sets. Blockchains are novel application modes of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database, which is a string of data blocks that are generated in association using cryptographic methods. In the present application, each data block in the blockchain may store a training sample set. Cloud storage is a new concept which extends and develops from the concept of cloud computing, and refers to combining a large number of storage devices (also called storage nodes) of different types in a network, through application software or application interfaces, by means of functions such as cluster application, grid technology and distributed storage file systems, so that they work cooperatively to store training sample sets together.
Step S62, determining a first sample image and a second sample image from the training sample set, and determining an initial sample segmentation map corresponding to the first sample image and a state key sample image corresponding to the second sample image.
After determining the training sample set, for each training pass, two sample images of the same sample object in different behavior states may be selected from the training sample set and determined as the first sample image and the second sample image, respectively. For convenience of description, hereinafter, the sample object corresponding to the first sample image or the second sample image is referred to as the first sample object, and the behavior state of the first sample object in the second sample image is referred to as the first behavior state.
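A simple sketch of this pair sampling, assuming the training sample set has been indexed by sample object; the dictionary layout and the use of the standard random module are assumptions made for illustration.

```python
# Sketch: pick one sample object and two of its frames in different behavior states.
import random

def sample_training_pair(training_set):
    # training_set is assumed to map sample_object_id -> list of that object's sample images
    object_id = random.choice(list(training_set))
    first, second = random.sample(training_set[object_id], 2)   # two distinct behavior states
    return first, second                                        # first sample image, second sample image
```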
Further, an initial sample segmentation map corresponding to the first sample image and a state key sample image corresponding to the second sample image can be determined, and then the first behavior state is migrated to the first sample image through the initial sample segmentation map corresponding to the first sample image and the state key sample image corresponding to the second sample image, so that a predicted image of the first sample object in the first behavior state is obtained.
The first behavior state is used to describe a specific posture of the first sample object in the second sample image, such as sitting posture, standing posture, facial expression, or limb movement, and the like, which is not limited herein.
For example, the second sample image is a face image of the first sample object, and the first behavior state may be used to describe a facial expression of the first sample object in the second sample image.
Each state key point in the state key point sample image corresponding to the second sample image can be used for representing the specific gesture of the first sample object in the second sample image, and the state key point of the first sample object in the second sample image can be obtained by carrying out key point identification on the second sample image.
The state key point sample image corresponding to the first sample object in the first behavior state can be determined through a pre-trained key point recognition model, and the model structure is not limited herein.
Wherein the initial sample segmentation map corresponding to the first sample object is used to characterize the distribution state of each limb portion (e.g., limb) or each organ (e.g., facial organ) of the first sample object when the first sample object is in the first behavior state.
When determining the initial sample segmentation map corresponding to the first sample object in the first behavior state, the initial sample segmentation map can be determined based on a pre-trained segmentation feature extraction model, and the model structure is not limited herein.
Step S63, inputting the first sample image, the initial sample segmentation map and the state key sample image into an initial model to obtain a predicted image when the first sample object corresponding to the first sample image is in the first behavior state corresponding to the second sample image.
In some possible embodiments, after determining the initial sample segmentation map corresponding to the first sample image and the state key sample image corresponding to the second sample image, the first sample image, the initial sample segmentation map, and the state key sample image may be input into an initial model, and a predicted image of the first sample object in the first behavior state may be generated by the initial model.
The initial model determines the predicted image of the first sample object in the first behavior state as follows:
determining initial sample segmentation features corresponding to the initial sample segmentation map, and determining target sample segmentation features corresponding to the first sample object in the first behavior state based on the state key sample image and the initial sample segmentation features;
determining sample object features corresponding to the first sample object based on the first sample image;
a predicted image is determined for the first sample object in the first behavioral state based on the target sample segmentation feature and the sample object feature.
The initial model comprises a segmentation map deformation network, a feature extraction network, a style mapping network and an image generation network.
Specifically, the initial model determines the target sample segmentation features through a segmentation map morphing network.
After the initial sample segmentation map and the state key sample image are input into an initial model, the initial sample segmentation map and the state key sample image are firstly input into a segmentation map deformation network, and the following operations are executed through the segmentation map deformation network:
and extracting features of the initial sample segmentation map corresponding to the first sample image to obtain initial sample segmentation features corresponding to the first sample image.
Further, convolution processing is performed on the state key point sample image when the first sample object is in the first behavior state, so as to obtain a third weight and a third offset used for adjusting the initial sample segmentation feature corresponding to the first sample image.
And carrying out convolution processing on the state key point sample image through a third initial convolution network to obtain a third initial convolution characteristic. And further processing the third initial convolution characteristic through different convolution networks (such as a fifth convolution network and a sixth convolution network) respectively to obtain a third weight and a third offset corresponding to the initial sample segmentation characteristic.
The network structures and related parameters of the third initial convolution network, the fifth convolution network and the sixth convolution network, which are adopted in determining the target sample segmentation feature, are not limited.
Further, the initial sample segmentation feature may be processed based on the initial sample segmentation feature and its corresponding third weight and third offset to obtain a corresponding target sample segmentation feature when the first sample object is in the first behavior state.
Namely multiplying the initial sample segmentation feature by a third weight, and adding the result and the third offset to obtain a corresponding target sample segmentation feature when the first object is in the first behavior state.
For the initial sample segmentation feature, the initial sample segmentation feature may be normalized to obtain a normalized initial sample segmentation feature, and then a third weight may be multiplied by the normalized initial sample segmentation feature, and the result may be added to a third offset to obtain a target sample segmentation feature corresponding to the first object in the first behavior state.
That is, specific gesture information can be provided based on the state key point image corresponding to the second sample image, and the initial sample segmentation feature can be processed based on the gesture information to adjust the gesture of the first object in the first sample image, so as to finally obtain the corresponding target sample segmentation feature when the first object is in the first behavior state.
The implementation manner of determining the target sample segmentation feature by the initial model based on the segmentation map deformation network is merely an example, and specifically, refer to the implementation manner of determining the target segmentation feature in step S23 in fig. 2, which is not described herein.
The segmentation map deformation network can be realized based on SPADE networks. For example, if the initial sample segmentation feature and the state key point sample image need to be fused multiple times, the segmentation map deformation network can be realized by a plurality of cascaded SPADE networks, and the front end and the back end of the network comprise a plurality of convolution layers for processing the features.
Specifically, the initial model determines sample object features of the first sample object through a feature extraction network. After the first sample image is input into the initial model, the feature extraction network is used for carrying out feature extraction on the first sample image, so that the sample object features of the first sample object are obtained.
The feature extraction network may be implemented based on a CosFace network model with a ResNet-50 backbone, or based on an ArcFace network model; the specific choice may be determined based on actual application scenario requirements and is not limited herein.
Specifically, the initial model determines a predicted image of the first sample object in the first behavior state through the style mapping network and the image generation network.
The initial model obtains a sample fusion feature when the first sample object is in a first behavior state by inputting the target sample segmentation feature and the sample object feature of the first object into a style mapping network.
When determining the sample fusion feature corresponding to the first object, convolution processing can be performed on the sample object feature of the first object to obtain a fourth weight and a fourth offset corresponding to the adjustment of the target sample segmentation feature corresponding to the first object.
The sample object feature can be subjected to convolution processing through a fourth initial convolution network, so that a fourth initial convolution feature is obtained. And further processing the fourth initial convolution characteristic through different convolution networks (such as a seventh convolution network and an eighth convolution network) respectively to obtain a fourth weight and a fourth offset corresponding to the target sample segmentation characteristic.
The network structures and related parameters of the fourth initial convolutional network, the seventh convolutional network and the eighth convolutional network, which are adopted in determining the sample fusion characteristics, are not limited.
Further, the target sample segmentation feature may be processed based on the target sample segmentation feature and the fourth weight and the fourth offset corresponding thereto, to obtain a sample fusion feature corresponding to the first object in the first behavior state.
Namely multiplying the fourth weight by the target sample segmentation feature, and adding the result and the fourth offset to obtain a corresponding sample fusion feature when the first object is in the first behavior state.
For the target sample segmentation feature, the target sample segmentation feature may be normalized to obtain a normalized target sample segmentation feature, further multiplying the normalized target sample segmentation feature by a fourth weight, and adding the result to the fourth offset to obtain a sample fusion feature corresponding to the first object in the first behavior state.
That is, based on the above manner, the target sample segmentation feature corresponding to the first object in the first behavior state may be fused with the sample object feature thereof, so as to obtain the sample object feature capable of simultaneously characterizing the first object and the sample fusion feature of the distribution state and the morphology of each limb portion or each organ when the first object is in the first behavior.
After the target sample segmentation feature is obtained, the segmentation graph deformation network can perform deconvolution processing on the target sample segmentation feature to obtain a target sample segmentation feature graph. In this case, the style mapping network may perform feature extraction on the target sample segmentation feature map, thereby determining sample fusion features based on the extracted features and sample object features of the first object.
When determining the sample fusion feature, the style mapping network may perform multiple fusion on the sample object feature and the target sample segmentation feature, and the specific manner may refer to the implementation manner of multiple fusion on the initial object feature and the target segmentation feature in step S23 in fig. 2, which is not described herein.
The style mapping network can be implemented based on a SPADE network. For example, if the target sample segmentation feature and the sample object information need to be fused multiple times, the style mapping network may be implemented by a plurality of cascaded SPADE networks, with several convolution layers at the front and back ends of the network for feature processing.
By normalizing the initial sample segmentation feature and the target sample segmentation feature, the convergence speed of the initial model under gradient descent or stochastic gradient descent can be increased, thereby improving model training efficiency.
The initial model obtains a corresponding prediction image when the first sample object is in a first behavior state by inputting the sample fusion characteristic into an image generation network.
The image generation network is a pre-trained network model. It can modify a standard feature map corresponding to the image generation network based on the sample fusion feature of the first object; for example, this feature modification can be realized through a style migration algorithm. A predicted image corresponding to the first object in the first behavior state is then generated based on the modified target sample feature map.
The style migration algorithm includes, but is not limited to, the AdaIN algorithm, and the pre-trained network model includes, but is not limited to, a StyleGAN network model.
When the initial model generates the predicted image through the image generation network, it may first determine a first mean and a first variance of the standard feature map corresponding to the image generation network, determine a third mean and a third variance corresponding to the sample fusion feature when the first object is in the first behavior state, and then process the standard feature map based on the first mean, the first variance, the third mean and the third variance to obtain the target sample feature map.
For example, when the style mapping network is implemented based on a SPADE network, the image generation network is implemented based on a StyleGAN network, and the feature dimension of the sample fusion feature is 1x18x512, the 18 hidden-layer features of the StyleGAN network may be modified based on the sample fusion feature so as to modify the standard feature map corresponding to the image generation network, and the predicted image when the first object is in the first behavior state is then generated based on the resulting target sample feature map.
As an example, the target sample feature map x″ may be obtained by:
x″ = σ(y′) × ( (x − μ(x)) / σ(x) ) + μ(y′)
wherein x is the standard feature map, μ(x) and σ(x) are respectively the first mean and the first variance corresponding to the standard feature map, and μ(y′) and σ(y′) are respectively the third mean and the third variance corresponding to the sample fusion feature y′.
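As a non-limiting illustration, the above AdaIN-style modification can be sketched as follows. The tensor layout (B, C, H, W), the per-channel statistics, and the requirement that the fusion feature has more than one element per channel are assumptions of the sketch; the application itself does not fix these details.

```python
import torch

def adain_modify(standard_feat: torch.Tensor, fusion_feat: torch.Tensor,
                 eps: float = 1e-5) -> torch.Tensor:
    """Modify the standard feature map x with the statistics of the sample
    fusion feature y', following the formula above (a sketch)."""
    b, c = standard_feat.shape[:2]
    x = standard_feat.reshape(b, c, -1)
    y = fusion_feat.reshape(b, c, -1)          # assumes fusion_feat also starts with (B, C, ...)
    # first mean / first variance of the standard feature map (per channel)
    mu_x = x.mean(dim=2, keepdim=True)
    sigma_x = x.std(dim=2, keepdim=True) + eps
    # third mean / third variance of the sample fusion feature (per channel)
    mu_y = y.mean(dim=2, keepdim=True)
    sigma_y = y.std(dim=2, keepdim=True) + eps
    # x'' = sigma(y') * (x - mu(x)) / sigma(x) + mu(y')
    target = sigma_y * (x - mu_x) / sigma_x + mu_y
    return target.reshape_as(standard_feat)
```

In a StyleGAN-based setting such as the 1x18x512 example above, this kind of modification would be applied layer by layer to the hidden features rather than to a single feature map; the single-tensor form here is only for clarity.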
The image generation model training method according to the embodiment of the present application is further described below with reference to fig. 7. Fig. 7 is a schematic structural diagram of an initial model according to an embodiment of the present application. As shown in fig. 7, after the first sample image and the second sample image are determined, an initial sample segmentation map corresponding to the first sample image and a state key point image corresponding to the second sample image may be determined.
After the initial sample segmentation map and the state key sample image are input into the initial model, they are passed to the segmentation map deformation network, which performs feature extraction on the initial sample segmentation map corresponding to the first sample image to obtain the initial sample segmentation feature corresponding to the first sample image. The state key sample image and the initial sample segmentation feature are then fused to obtain the target sample segmentation map corresponding to the first sample object in the first behavior state.
At the same time, the initial model determines the sample object feature of the first sample object through the feature extraction network: after the first sample image is input into the initial model, the feature extraction network performs feature extraction on the first sample image to obtain the sample object feature of the first sample object.
Further, the initial model performs feature extraction on the target sample segmentation map to obtain target sample segmentation features, and the target sample segmentation features and sample object features of the first object are input into a style mapping network to obtain sample fusion features when the first sample object is in a first behavior state.
Further, based on the style migration algorithm AdaIN, the initial model modifies the standard feature map corresponding to the image generation network to obtain a target sample feature map, and the image generation network then processes the target sample feature map to obtain the predicted image when the first sample object is in the first behavior state.
The convolutional network involved in the embodiment of the present application may also be replaced by a residual block (residual block) in the residual network, which is not described herein.
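For orientation, the overall forward pass shown in fig. 7 can be sketched as a composition of the four sub-networks. The class and argument names are assumptions, and the concrete architectures of the sub-networks are not specified by this sketch.

```python
import torch.nn as nn

class InitialModelForward(nn.Module):
    """High-level sketch of the forward pass pictured in fig. 7.

    All four sub-networks are passed in as stand-in modules; only the data
    flow between them is shown here.
    """
    def __init__(self, seg_deform: nn.Module, feat_extract: nn.Module,
                 style_map: nn.Module, image_gen: nn.Module):
        super().__init__()
        self.seg_deform = seg_deform      # segmentation map deformation network
        self.feat_extract = feat_extract  # feature extraction network
        self.style_map = style_map        # style mapping network
        self.image_gen = image_gen        # pre-trained image generation network

    def forward(self, first_sample_image, initial_sample_seg_map, state_key_sample_image):
        # 1. deform the initial sample segmentation map using the state key sample image
        target_seg_feat = self.seg_deform(initial_sample_seg_map, state_key_sample_image)
        # 2. extract the sample object feature from the first sample image
        sample_obj_feat = self.feat_extract(first_sample_image)
        # 3. fuse the two into the sample fusion feature
        fusion_feat = self.style_map(target_seg_feat, sample_obj_feat)
        # 4. generate the predicted image via the pre-trained image generation network
        return self.image_gen(fusion_feat)
```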
And S64, determining total training loss based on the second sample image and the predicted image, performing iterative training on the initial model based on the training sample set and the total training loss, stopping training until the total training loss meets the training ending condition, and determining the model at the time of stopping training as an image generation model.
Specifically, after determining the total training loss corresponding to training the initial model, iterative training may be performed on the initial model based on the total training loss and the training sample set until training is stopped when the training ending condition is met.
That is, before each round of training starts, a new first sample image and a new second sample image are determined from the training sample set, and the initial model is then trained based on the first sample image and the second sample image in the manner described above.
The training ending condition may be that the total training loss converges, that the total training loss is smaller than a preset threshold for a preset number of consecutive iterations, or that the difference between two adjacent total training losses is smaller than a preset threshold for a preset number of consecutive iterations; it may be determined based on the requirements of the actual application scenario and is not limited here.
In each round of training, training is stopped once the total training loss meets the training ending condition, and the initial model at that point is determined as the final image generation model. If the total training loss does not meet the training ending condition during training, the model parameters of the initial model can be adjusted, the adjusted model is trained again in the manner described above and the total training loss is determined again, until the total training loss meets the training ending condition and training stops.
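A minimal training-loop sketch is given below, as a non-limiting illustration. The loader, the total-loss callable and the loss-threshold stopping rule are assumptions; the text above allows several alternative training-end conditions.

```python
def train_image_generation_model(model, optimizer, sample_loader, total_loss_fn,
                                 loss_threshold=1e-3, max_steps=100_000):
    """Iteratively train the initial model until a training-end condition holds.

    total_loss_fn(predicted, second_sample) is assumed to return the total
    training loss described above; the threshold-based stop is only one of
    the possible end conditions.
    """
    step = 0
    while step < max_steps:
        for first_img, init_seg_map, key_sample_img, second_img in sample_loader:
            predicted = model(first_img, init_seg_map, key_sample_img)
            total_loss = total_loss_fn(predicted, second_img)
            optimizer.zero_grad()
            total_loss.backward()
            optimizer.step()
            step += 1
            # stop when the training-end condition is met
            if total_loss.item() < loss_threshold or step >= max_steps:
                return model
    return model
```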
The image generation model training method provided by the embodiment of the application can be realized through related technologies in the artificial intelligence field, the cloud computing field, the image processing field and the like. For example, model training can be performed based on machine learning technology in the artificial intelligence field and computer vision technology in the image processing field to obtain the image generation model.
Machine learning is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It studies how a computer can simulate or implement human learning behavior to acquire new knowledge or skills, for example training an image generation model based on machine learning techniques. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied throughout the various fields of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning and inductive learning.
Computer vision technology generally includes image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, autonomous driving, intelligent transportation and other technologies, as well as common biometric recognition technologies such as face recognition and fingerprint recognition.
In some possible implementations, the object reconstruction loss may be determined based on the second sample image and the predicted image, and the object reconstruction loss is determined as the total training loss. The object reconstruction loss characterizes the difference between the second sample image and the predicted image.
Optionally, the total training loss may also be determined based on at least one first training loss and the object reconstruction loss. For example, a weight corresponding to each first training loss and a weight corresponding to the object reconstruction loss may be determined, and the weighted sum of each first training loss and the object reconstruction loss is determined as the total training loss.
Wherein the first training loss comprises at least one of:
a segmentation feature loss determined based on the segmentation feature corresponding to the second sample image and the target sample segmentation feature;
an object feature loss determined based on the object feature corresponding to the predicted image and the sample object feature corresponding to the first sample image;
an image quality loss determined based on the predicted image and the second sample image.
The segmentation feature loss characterizes feature differences between actual segmentation features corresponding to the second sample image and target sample segmentation features, the object feature loss characterizes feature differences between object features corresponding to the predicted image and sample object features corresponding to the first sample image, and the image quality loss characterizes image quality differences between the predicted image and the second sample image.
The object reconstruction loss and the segmentation feature loss can be calculated based on a minimum absolute error loss function (L1 loss) or a minimum squared error loss function (L2 loss), which is not limited here.
The object feature loss is determined based on the feature similarity between the object feature corresponding to the predicted image and the sample object feature corresponding to the first sample image, for example based on the cosine similarity between the two, which is not limited here.
The image quality loss may be determined based on Learned Perceptual Image Patch Similarity (LPIPS) or another perceptual loss function, which is not limited here.
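Putting the above together, a non-limiting sketch of the total training loss is shown below. The weighting scheme, argument names and the external LPIPS callable are assumptions made for the sketch.

```python
import torch.nn.functional as F

def total_training_loss(predicted, second_img,
                        second_seg_feat=None, target_sample_seg_feat=None,
                        pred_obj_feat=None, sample_obj_feat=None,
                        lpips_fn=None, weights=None):
    """Weighted combination of the object reconstruction loss and the optional
    first training losses described above (a sketch)."""
    w = weights or {"recon": 1.0, "seg": 1.0, "obj": 1.0, "quality": 1.0}
    # object reconstruction loss: L1 between the predicted image and the second sample image
    loss = w["recon"] * F.l1_loss(predicted, second_img)
    if second_seg_feat is not None and target_sample_seg_feat is not None:
        # segmentation feature loss: L1 between the two segmentation features
        loss = loss + w["seg"] * F.l1_loss(second_seg_feat, target_sample_seg_feat)
    if pred_obj_feat is not None and sample_obj_feat is not None:
        # object feature loss derived from cosine similarity of the object features
        cos = F.cosine_similarity(pred_obj_feat, sample_obj_feat, dim=-1).mean()
        loss = loss + w["obj"] * (1.0 - cos)
    if lpips_fn is not None:
        # image quality loss via a perceptual metric such as LPIPS
        loss = loss + w["quality"] * lpips_fn(predicted, second_img).mean()
    return loss
```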
Because the image generation network in the initial model is a pre-trained network, an image generation model with high accuracy can be trained quickly without requiring a large number of training samples, and the image generation model can then generate the target image of a target object in the target behavior state corresponding to a reference object. In addition, the embodiment of the application fully fuses the initial sample segmentation feature with the state key point sample image during training, and at the same time fully fuses the target sample segmentation feature with the sample object information, thereby further improving the accuracy of the image generation model.
Referring to fig. 8, fig. 8 is a schematic structural diagram of an image generating apparatus according to an embodiment of the present application. The image generation device provided by the embodiment of the application comprises:
an image determining module 81, configured to determine a state key point image when the reference object is in the target behavior state;
an image processing module 82, configured to determine an initial image of the target object and a corresponding initial segmentation map thereof;
the image generating module 83 is configured to determine a target image when the target object is in the target behavior state based on the state key point image, the initial image, and the initial segmentation map.
In some possible embodiments, the image generating module 83 is configured to:
determining initial segmentation features corresponding to the initial segmentation map, and determining target segmentation features corresponding to the target object in the target behavior state based on the state key point image and the initial segmentation features;
determining initial object characteristics corresponding to the target object based on the initial image;
and determining a target image when the target object is in the target behavior state based on the target segmentation feature and the initial object feature.
In some possible embodiments, the image generating module 83 is configured to:
Convolving the state key point image to obtain a first weight and a first offset corresponding to the initial segmentation feature;
and determining a target segmentation feature corresponding to the target object in the target behavior state based on the initial segmentation feature, the first weight and the first offset.
In some possible embodiments, the image generating module 83 is configured to:
convolving the initial object feature to obtain a second weight and a second offset corresponding to the target segmentation feature;
determining a fusion feature based on the target segmentation feature, the second weight, and the second offset;
and determining a corresponding target image when the target object is in the target behavior state based on the fusion characteristic.
In some possible embodiments, the image generating module 83 is configured to:
determining a first mean value and a first variance corresponding to the standard feature map, and a second mean value and a second variance corresponding to the fusion feature;
processing the standard feature map based on the first mean, the first variance, the second mean and the second variance to obtain a target feature map;
And determining a corresponding target image when the target object is in the target behavior state based on the target feature map.
In a specific implementation, the image generating apparatus may execute the implementation provided by each step in fig. 2 through each built-in functional module, and specifically, the implementation provided by each step may be referred to, which is not described herein again.
Referring to fig. 9, fig. 9 is a schematic structural diagram of an image generation model training apparatus according to an embodiment of the present application. The image generation model training device provided by the embodiment of the application comprises:
a sample determination module 91 for determining a training sample set comprising sample images of at least one sample object in different behavior states;
a sample processing module 92, configured to determine a first sample image and a second sample image from the training sample set, and determine an initial sample segmentation map corresponding to the first sample image and a state key sample image corresponding to the second sample image, where the first sample image and the second sample image correspond to the same sample object;
an image prediction module 93, configured to input the first sample image, the initial sample segmentation map, and the state key sample image into an initial model, and obtain a predicted image when a first sample object corresponding to the first sample image is in a first behavior state corresponding to the second sample image;
The model training module 94 is configured to determine a total training loss based on the second sample image and the predicted image, iteratively train the initial model based on the training sample set and the total training loss until the total training loss meets a training end condition, and determine a model at the time of stopping training as an image generation model.
In some possible embodiments, the image prediction module 93 is configured to:
determining initial sample segmentation features corresponding to the initial sample segmentation map, and determining target sample segmentation features corresponding to the first sample object in the first behavior state based on the state key sample image and the initial sample segmentation features;
determining a sample object feature corresponding to the first sample object based on the first sample image;
and determining a predicted image when the first sample object is in the first behavior state based on the target sample segmentation feature and the sample object feature.
In some possible embodiments, the image prediction module 93 is configured to:
carrying out convolution processing on the state key point sample image to obtain a third weight and a third offset corresponding to the initial sample segmentation feature;
And determining a target sample segmentation feature corresponding to the first sample object in the first behavior state based on the initial sample segmentation feature, the third weight and the third offset.
In some possible embodiments, the image prediction module 93 is configured to:
convolving the sample object feature to obtain a fourth weight and a fourth offset corresponding to the target sample segmentation feature;
determining a sample fusion feature based on the target sample segmentation feature, the fourth weight, and the fourth offset;
and determining a prediction image corresponding to the first sample object in the first behavior state based on the sample fusion characteristic.
In some possible embodiments, the image prediction module 93 is configured to:
determining a first mean value and a first variance corresponding to the standard feature map, and a third mean value and a third variance corresponding to the sample fusion feature;
processing the standard feature map based on the first mean, the first variance, the third mean and the third variance to obtain a target sample feature map;
and determining a prediction image corresponding to the first sample object in the first behavior state based on the target sample characteristic diagram.
In some possible embodiments, the model training module 94 is configured to:
determining an object reconstruction loss based on the second sample image and the predicted image;
determining a total training loss based on a first training loss and the object reconstruction loss, the first training loss comprising at least one of:
determining a segmentation feature loss based on the segmentation feature corresponding to the second sample image and the target sample segmentation feature;
determining object feature loss based on the object feature corresponding to the predicted image and the sample object feature corresponding to the first sample image;
and determining an image quality loss based on the predicted image and the second sample image.
In a specific implementation, the image generation model training device may execute, through each functional module built in the image generation model training device, an implementation manner provided by each step in fig. 6, and specifically, the implementation manner provided by each step may be referred to, which is not described herein again.
Referring to fig. 10, fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 10, the electronic device 1000 in the present embodiment may include: a processor 1001, a network interface 1004 and a memory 1005; in addition, the electronic device 1000 may further include: an object interface 1003 and at least one communication bus 1002. The communication bus 1002 is used to enable communication between these components. The object interface 1003 may include a display (Display) and a keyboard (Keyboard), and optionally may further include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (NVM), such as at least one disk memory. The memory 1005 may also optionally be at least one storage device located remotely from the processor 1001. As shown in fig. 10, the memory 1005, which is a computer-readable storage medium, may include an operating system, a network communication module, an object interface module, and a device control application.
In the electronic device 1000 shown in fig. 10, the network interface 1004 may provide a network communication function, and the object interface 1003 mainly provides an interface for the user. When used for image generation, the processor 1001 may be used to invoke the device control application stored in the memory 1005 to implement:
determining a state key point image when the reference object is in a target behavior state;
determining an initial image of a target object and a corresponding initial segmentation map thereof;
and determining a target image when the target object is in the target behavior state based on the state key point image, the initial image and the initial segmentation map.
In some possible embodiments, the processor 1001 is configured to:
determining initial segmentation features corresponding to the initial segmentation map, and determining target segmentation features corresponding to the target object in the target behavior state based on the state key point image and the initial segmentation features;
determining initial object characteristics corresponding to the target object based on the initial image;
and determining a target image when the target object is in the target behavior state based on the target segmentation feature and the initial object feature.
In some possible embodiments, the processor 1001 is configured to:
convolving the state key point image to obtain a first weight and a first offset corresponding to the initial segmentation feature;
and determining a target segmentation feature corresponding to the target object in the target behavior state based on the initial segmentation feature, the first weight and the first offset.
In some possible embodiments, the processor 1001 is configured to:
convolving the initial object feature to obtain a second weight and a second offset corresponding to the target segmentation feature;
determining a fusion feature based on the target segmentation feature, the second weight, and the second offset;
and determining a corresponding target image when the target object is in the target behavior state based on the fusion characteristic.
In some possible embodiments, the processor 1001 is configured to:
determining a first mean value and a first variance corresponding to the standard feature map, and a second mean value and a second variance corresponding to the fusion feature;
processing the standard feature map based on the first mean, the first variance, the second mean and the second variance to obtain a target feature map;
And determining a corresponding target image when the target object is in the target behavior state based on the target feature map.
Optionally, the processor 1001, when used for training of the image generation model, may be configured to invoke a device control application stored in the memory 1005 to implement:
determining a training sample set comprising sample images of at least one sample object in different behavior states;
determining a first sample image and a second sample image from the training sample set, and determining an initial sample segmentation map corresponding to the first sample image and a state key sample image corresponding to the second sample image, wherein the first sample image and the second sample image correspond to the same sample object;
inputting the first sample image, the initial sample segmentation map and the state key sample image into an initial model to obtain a predicted image when a first sample object corresponding to the first sample image is in a first behavior state corresponding to the second sample image;
and determining total training loss based on the second sample image and the predicted image, performing iterative training on the initial model based on the training sample set and the total training loss until the total training loss meets the training ending condition, and determining the model at the time of stopping training as an image generation model.
In some possible embodiments, the processor 1001 is configured to:
determining initial sample segmentation features corresponding to the initial sample segmentation map, and determining target sample segmentation features corresponding to the first sample object in the first behavior state based on the state key sample image and the initial sample segmentation features;
determining a sample object feature corresponding to the first sample object based on the first sample image;
and determining a predicted image when the first sample object is in the first behavior state based on the target sample segmentation feature and the sample object feature.
In some possible embodiments, the processor 1001 is configured to:
carrying out convolution processing on the state key point sample image to obtain a third weight and a third offset corresponding to the initial sample segmentation feature;
and determining a target sample segmentation feature corresponding to the first sample object in the first behavior state based on the initial sample segmentation feature, the third weight and the third offset.
In some possible embodiments, the processor 1001 is configured to:
convolving the sample object feature to obtain a fourth weight and a fourth offset corresponding to the target sample segmentation feature;
Determining a sample fusion feature based on the target sample segmentation feature, the fourth weight, and the fourth offset;
and determining a prediction image corresponding to the first sample object in the first behavior state based on the sample fusion characteristic.
In some possible embodiments, the processor 1001 is configured to:
determining a first mean value and a first variance corresponding to the standard feature map, and a third mean value and a third variance corresponding to the sample fusion feature;
processing the standard feature map based on the first mean, the first variance, the third mean and the third variance to obtain a target sample feature map;
and determining a prediction image corresponding to the first sample object in the first behavior state based on the target sample characteristic diagram.
In some possible embodiments, the processor 1001 is configured to:
determining an object reconstruction loss based on the second sample image and the predicted image;
determining a total training loss based on a first training loss and the object reconstruction loss, the first training loss comprising at least one of:
determining a segmentation feature loss based on the segmentation feature corresponding to the second sample image and the target sample segmentation feature;
Determining object feature loss based on the object feature corresponding to the predicted image and the sample object feature corresponding to the first sample image;
and determining an image quality loss based on the predicted image and the second sample image.
It should be appreciated that in some possible embodiments, the processor 1001 may be a central processing unit (CPU), or another general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The memory may include read-only memory and random access memory, and provides instructions and data to the processor. A portion of the memory may also include non-volatile random access memory. For example, the memory may also store information of the device type.
In a specific implementation, the electronic device 1000 may execute, through each functional module built in the electronic device, an implementation manner provided by each step in fig. 2 and/or fig. 6, and specifically, the implementation manner provided by each step may be referred to, which is not described herein again.
The embodiment of the present application further provides a computer readable storage medium, where a computer program is stored and executed by a processor to implement the method provided by each step in fig. 2 and/or fig. 6, and specifically, the implementation manner provided by each step may be referred to, which is not described herein.
The computer readable storage medium may be the apparatus provided in any one of the foregoing embodiments or an internal storage unit of the electronic device, for example a hard disk or memory of the electronic device. The computer readable storage medium may also be an external storage device of the electronic device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card or a flash card provided on the electronic device. The computer readable storage medium may also include a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like. Further, the computer readable storage medium may include both an internal storage unit and an external storage device of the electronic device. The computer readable storage medium is used to store the computer program and other programs and data required by the electronic device, and may also be used to temporarily store data that has been output or is to be output.
Embodiments of the present application provide a computer program product comprising a computer program which, when executed by a processor, implements the method provided by the steps in fig. 2 and/or fig. 6.
The terms first, second and the like in the claims and in the description and drawings are used for distinguishing between different objects and not for describing a particular sequential order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or electronic device that comprises a list of steps or elements is not limited to the list of steps or elements but may, alternatively, include other steps or elements not listed or inherent to such process, method, article, or electronic device. Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments. The term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the elements and steps of the examples have been generally described in terms of function in the foregoing description to clearly illustrate the interchangeability of hardware and software. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The foregoing disclosure is illustrative of the present application and is not to be construed as limiting the scope of the application, which is defined by the appended claims.
Claims (16)
1. An image generation method, the method comprising:
determining a state key point image when the reference object is in a target behavior state;
determining an initial image of a target object and a corresponding initial segmentation map thereof;
and determining a target image when the target object is in the target behavior state based on the state key point image, the initial image and the initial segmentation map.
2. The method of claim 1, wherein the determining a target image of the target object in the target behavioral state based on the state keypoint image, the initial image, and the initial segmentation map comprises:
determining initial segmentation features corresponding to the initial segmentation map, and determining target segmentation features corresponding to the target object in the target behavior state based on the state key point image and the initial segmentation features;
determining initial object characteristics corresponding to the target object based on the initial image;
and determining a target image when the target object is in the target behavior state based on the target segmentation feature and the initial object feature.
3. The method of claim 2, wherein the determining, based on the state keypoint image and the initial segmentation feature, a corresponding target segmentation feature for the target object in the target behavioral state comprises:
performing convolution processing on the state key point image to obtain a first weight and a first offset corresponding to the initial segmentation feature;
and determining a corresponding target segmentation feature when the target object is in the target behavior state based on the initial segmentation feature, the first weight and the first offset.
4. The method of claim 2, wherein the determining a target image of the target object in the target behavioral state based on the target segmentation feature and the initial object feature comprises:
performing convolution processing on the initial object feature to obtain a second weight and a second offset corresponding to the target segmentation feature;
determining a fusion feature based on the target segmentation feature, the second weight, and the second offset;
and determining a corresponding target image when the target object is in the target behavior state based on the fusion characteristic.
5. The method of claim 4, wherein determining a corresponding target image of the target object in the target behavioral state based on the fusion feature comprises:
determining a first mean and a first variance corresponding to the standard feature map and a second mean and a second variance corresponding to the fusion feature;
processing the standard feature map based on the first mean, the first variance, the second mean and the second variance to obtain a target feature map;
and determining a corresponding target image when the target object is in the target behavior state based on the target feature map.
6. A method of training an image generation model, the method comprising:
determining a training sample set comprising sample images of at least one sample object in different behavior states;
determining a first sample image and a second sample image from the training sample set, and determining an initial sample segmentation map corresponding to the first sample image and a state key sample image corresponding to the second sample image, wherein the first sample image and the second sample image correspond to the same sample object;
inputting the first sample image, the initial sample segmentation map and the state key sample image into an initial model to obtain a predicted image when a first sample object corresponding to the first sample image is in a first behavior state corresponding to the second sample image;
and determining total training loss based on the second sample image and the predicted image, performing iterative training on the initial model based on the training sample set and the total training loss, stopping training until the total training loss meets the training ending condition, and determining the model at the time of stopping training as an image generation model.
7. The method of claim 6, wherein the predicted image is determined by the initial model by:
determining initial sample segmentation features corresponding to the initial sample segmentation map, and determining target sample segmentation features corresponding to the first sample object in the first behavior state based on the state key sample image and the initial sample segmentation features;
determining sample object features corresponding to the first sample object based on the first sample image;
a predicted image is determined when the first sample object is in the first behavioral state based on the target sample segmentation feature and the sample object feature.
8. The method of claim 7, wherein the determining a corresponding target sample segmentation feature when the first sample object is in the first behavioral state based on the state key sample image and the initial sample segmentation feature comprises:
performing convolution processing on the state key point sample image to obtain a third weight and a third offset corresponding to the initial sample segmentation feature;
based on the initial sample segmentation feature, the third weight, and the third offset, a target sample segmentation feature corresponding to the first sample object when in the first behavior state is determined.
9. The method of claim 7, wherein the determining a predicted image of the first sample object in the first behavior state based on the target sample segmentation feature and the sample object feature comprises:
carrying out convolution processing on the sample object characteristics to obtain a fourth weight and a fourth offset corresponding to the target sample segmentation characteristics;
determining a sample fusion feature based on the target sample segmentation feature, the fourth weight, and the fourth offset;
based on the sample fusion feature, a predicted image corresponding to the first sample object in the first behavior state is determined.
10. The method of claim 9, wherein the determining, based on the sample fusion feature, a corresponding predicted image when the first sample object is in the first behavior state comprises:
determining a first mean and a first variance corresponding to the standard feature map and a third mean and a third variance corresponding to the sample fusion feature;
processing the standard feature map based on the first mean, the first variance, the third mean and the third variance to obtain a target sample feature map;
And determining a corresponding predicted image when the first sample object is in the first behavior state based on the target sample characteristic diagram.
11. The method of claim 7, wherein the determining a total training loss based on the second sample image and the predicted image comprises:
determining an object reconstruction loss based on the second sample image and the prediction image;
determining a total training loss based on a first training loss and the object reconstruction loss, the first training loss comprising at least one of:
determining segmentation feature loss based on the segmentation feature corresponding to the second sample image and the target sample segmentation feature;
object feature loss determined based on object features corresponding to the predicted image and sample object features corresponding to the first sample image;
an image quality loss determined based on the predicted image and the second sample image.
12. An image generation apparatus, the apparatus comprising:
the image determining module is used for determining a state key point image when the reference object is in the target behavior state;
the image processing module is used for determining an initial image of the target object and a corresponding initial segmentation map thereof;
And the image generation module is used for determining a target image when the target object is in the target behavior state based on the state key point image, the initial image and the initial segmentation map.
13. An image generation model training apparatus, the apparatus comprising:
a sample determination module for determining a training sample set comprising sample images of at least one sample object in different behavior states;
the sample processing module is used for determining a first sample image and a second sample image from the training sample set, determining an initial sample segmentation map corresponding to the first sample image and a state key sample image corresponding to the second sample image, wherein the first sample image and the second sample image correspond to the same sample object;
the image prediction module is used for inputting the first sample image, the initial sample segmentation map and the state key sample image into an initial model to obtain a predicted image when a first sample object corresponding to the first sample image is in a first behavior state corresponding to the second sample image;
and the model training module is used for determining total training loss based on the second sample image and the predicted image, carrying out iterative training on the initial model based on the training sample set and the total training loss until the total training loss accords with the training ending condition, stopping training, and determining the model at the time of stopping training as an image generation model.
14. An electronic device comprising a processor and a memory, the processor and the memory being interconnected;
the memory is used for storing a computer program;
the processor is configured to perform the method of any of claims 1 to 11 when the computer program is invoked.
15. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program, which is executed by a processor to implement the method of any one of claims 1 to 11.
16. A computer program product, characterized in that the computer program product comprises a computer program which, when executed by a processor, implements the method of any one of claims 1 to 11.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210295206.6A CN116883523A (en) | 2022-03-23 | 2022-03-23 | Image generation method, image generation model training method and related devices |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116883523A true CN116883523A (en) | 2023-10-13 |
Family
ID=88260989
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210295206.6A Pending CN116883523A (en) | 2022-03-23 | 2022-03-23 | Image generation method, image generation model training method and related devices |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116883523A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |