CN111783497A - Method, device and computer-readable storage medium for determining characteristics of target in video - Google Patents
Method, device and computer-readable storage medium for determining characteristics of a target in a video
- Publication number
- CN111783497A (Application number: CN201910265480.7A)
- Authority
- CN
- China
- Prior art keywords
- target
- determining
- feature
- information
- processed
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/103—Static body considered as a whole, e.g. static pedestrian or occupant recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/06—Buying, selling or leasing transactions
- G06Q30/0601—Electronic shopping [e-shopping]
- G06Q30/0631—Item recommendations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/06—Buying, selling or leasing transactions
- G06Q30/0601—Electronic shopping [e-shopping]
- G06Q30/0641—Shopping interfaces
- G06Q30/0643—Graphical representation of items or shoppers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/50—Depth or shape recovery
- G06T7/55—Depth or shape recovery from multiple images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30196—Human being; Person
Abstract
The disclosure relates to a method and a device for determining features of a target in a video, and a computer-readable storage medium, and belongs to the technical field of computers. The method comprises the following steps: determining target features of a target to be processed in each frame image according to image features extracted from each frame image of the video; determining difference features between the target features; and fusing the target features according to the difference features to determine a comprehensive feature of the target to be processed. The technical solution of the disclosure can improve the accuracy of determining target features.
Description
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and an apparatus for determining characteristics of an object in a video, and a computer-readable storage medium.
Background
Target features such as the shape and posture of a target in an image can be recognized using image processing technology. For example, in the e-commerce field, the shape and posture of a user's body can be acquired using image processing technology, so that functions such as virtual fitting and clothing recommendation can be realized.
In the related art, the three-dimensional point cloud information of a user in an online scenario cannot be acquired directly; instead, image processing must be performed on a single picture or on the frame images of a video uploaded by the user in order to estimate the user's shape and posture.
Disclosure of Invention
The inventors of the present disclosure found the following problem in the above-described related art: each single picture or frame image is processed separately, so that a unified target feature cannot be determined based on the continuity of the frame images, and the accuracy of the determined target feature is poor.
In view of this, the present disclosure provides a technical solution for determining a feature of a target in a video, which can improve accuracy of determining the feature of the target.
According to some embodiments of the present disclosure, there is provided a method for determining a feature of an object in a video, including: determining target characteristics of a target to be processed in each frame image according to image characteristics extracted from each frame image of a video; determining difference features between the target features; and fusing the target characteristics according to the difference characteristics to determine the comprehensive characteristics of the target to be processed.
In some embodiments, the determining the target feature of the target to be processed in each frame image includes: extracting a background segmentation map of each frame image, a target segmentation map of the target to be processed and a thermodynamic map (Heatmap) of key points of the target to be processed as the image features; and determining the target feature of each frame image according to the target segmentation map, the background segmentation map and the thermodynamic map.
In some embodiments, the determining the composite characteristic of the object to be processed comprises: compensating each target characteristic according to the difference characteristic; and fusing the compensated target characteristics into the comprehensive target characteristics of the target.
In some embodiments, determining the difference feature between the target features comprises: the target feature comprises an attitude parameter and a projection parameter of the target to be processed, and the projection parameter comprises a rotation parameter of the target to be processed relative to a camera, a translation parameter of the target to be processed projected to an image plane and a zooming parameter; determining the difference features using a first machine learning model based on the thermodynamic diagram, the pose parameters, and the projection parameters of the respective frame images.
In some embodiments, the first machine learning model includes a stitching module to stitch the thermodynamic diagram, the pose parameters, and the projection parameters as first input information, a first convolution module including a plurality of convolution layers to determine first convolution information from the first input information, and a first connection module including a plurality of full connection layers to determine the difference features from the first convolution information.
In some embodiments, the first machine learning model is trained according to a loss function, the loss function is constructed according to a characteristic error parameter and a key point distance parameter, the characteristic error parameter is determined according to the compensated error between each target feature and the true value of each target feature, and the key point distance parameter is determined according to the position change of each key point in two adjacent frames of images.
In some embodiments, the target features include shape parameters, pose parameters, and projection parameters of the target to be processed, and the compensated target features include compensated pose parameters and projection parameters.
In some embodiments, the determining the composite characteristic of the object to be processed comprises: and taking the shape parameters, the compensated posture parameters and the projection parameters in the images of the frames as second input information, and utilizing a second machine learning model to fuse the shape parameters in the images of the frames into a unified shape parameter as the comprehensive characteristic.
In some embodiments, the determining the composite characteristic of the object to be processed comprises: performing K rounds of fusion processing by using the second machine learning model to fuse the features of the target in each frame image into one comprehensive feature, wherein K is a positive integer greater than 1, the number of fusion results of the (k+1)-th round of fusion processing is less than the number of fusion results obtained by the k-th round of fusion processing, and k is a positive integer less than or equal to K.
In some embodiments, the second machine learning model includes a second convolution module, a third convolution module, a transpose module, a second connection module, and a fusion module, the second convolution module includes a plurality of convolution layers for determining second convolution information according to the second input information, the transpose module transposes the second convolution information into a format required by the second connection module, the second connection module includes a plurality of fully-connected layers for determining feature information according to the second convolution information, the third convolution module is configured to determine third convolution information according to the second input information, the third convolution information and the feature information have the same dimension, and the fusion module is configured to fuse the third convolution information and the feature information into the fusion information.
In some embodiments, the target segmentation map and the background segmentation map are determined by a first decoder model according to coding information of the frame images, the coding information being determined by a first encoder model; the thermodynamic diagram is determined with a second decoder model from the encoded information; the target feature is determined by a second encoder.
In some embodiments, the feature determination method further comprises: and generating a three-dimensional model of the target to be processed according to the comprehensive characteristics of the target to be processed.
According to other embodiments of the present disclosure, there is provided an apparatus for determining characteristics of an object in a video, including: the characteristic determining unit is used for determining the target characteristic of the target to be processed in each frame image according to the image characteristic extracted from each frame image of the video; a difference determining unit for determining a difference feature between the target features; and the feature fusion unit is used for fusing the target features according to the difference features to determine the comprehensive features of the target to be processed.
According to still further embodiments of the present disclosure, there is provided a device for determining features of a target, including: a memory; and a processor coupled to the memory, the processor being configured to perform the method for determining features of a target in any of the above embodiments based on instructions stored in the memory.
According to still further embodiments of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the feature determination method of the object in any of the above embodiments.
In the above-described embodiment, a plurality of target features are fused into one unified integrated feature according to the difference features between the target features extracted from the respective frame images. Therefore, the accuracy of determining the target characteristics can be improved by mining the time continuity of each frame of image.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description, serve to explain the principles of the disclosure.
The present disclosure may be more clearly understood from the following detailed description, taken with reference to the accompanying drawings, in which:
fig. 1 illustrates a flow diagram of some embodiments of a method of feature determination of an object in a video of the present disclosure;
FIG. 2 illustrates a flow diagram for some embodiments of step 110 in FIG. 1;
FIG. 3 illustrates a schematic diagram of some embodiments of a method of feature determination of an object in a video of the present disclosure;
FIG. 4 illustrates a flow diagram of some embodiments of step 130 in FIG. 1;
FIG. 5 illustrates a schematic diagram of some embodiments of a first machine learning model of the present disclosure;
FIG. 6 illustrates a schematic diagram of some embodiments of a second machine learning model of the present disclosure;
FIG. 7 illustrates a block diagram of some embodiments of an apparatus for feature determination of objects in a video of the present disclosure;
FIG. 8 shows a block diagram of further embodiments of an apparatus for feature determination of objects in a video of the present disclosure;
fig. 9 illustrates a block diagram of still further embodiments of an apparatus for feature determination of an object in a video of the present disclosure.
Detailed Description
Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
Fig. 1 illustrates a flow diagram of some embodiments of a method of feature determination of an object in a video of the present disclosure.
As shown in fig. 1, the method includes: step 110, determining the target characteristics of each frame of image; step 120, determining the difference characteristics of the target characteristics; and step 130, determining the comprehensive characteristics.
In step 110, the target feature of the target to be processed in each frame image is determined according to the image features extracted from each frame image of the video. That is, a single frame of image processing is performed on the video to obtain the target feature of the target to be processed in each frame of image.
In some embodiments, the user uploads a video containing his or her image; the human body in each frame image of the video is taken as the target to be processed, and the target features of the human body are determined.
In some embodiments, a three-dimensional human body model, such as the SMPL (Skinned Multi-Person Linear) model with a skeletal skin, may be used to characterize the human body. The SMPL model represents the pose information θ in terms of joint angles, each joint having 3 degrees of freedom. For example, the pose information of a human model may be represented with 23 joints, i.e., 69 degrees of freedom. The SMPL model represents the shape information β with the first 10 coefficients of a principal component decomposition (i.e., the shape information has 10 degrees of freedom). For example, the target feature F of the human body may include θ and β.
In some embodiments, the target feature may further include the camera parameters of the device used to capture the user, i.e., the projection parameters of the human body. For example, a weak perspective projection camera model may be used to determine the projection parameters R, t and s, giving F = (β, θ, R, t, s). R is the rotation parameter of the human body relative to the camera, t is the translation parameter of the human body projected onto the image plane, and s is the scaling parameter of the human body projected onto the image plane.
For example, the projection of the three-dimensional human body model onto the two-dimensional image plane can be computed as s·Π(R·M(β, θ)) + t, where M(β, θ) denotes the mesh generated by the SMPL model and Π is an orthogonal projection matrix of size 2 × 3.
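To make the weak perspective projection concrete, the following minimal NumPy sketch (not part of the disclosure; the function and variable names are assumptions, and R is taken as a full 3 × 3 rotation matrix for clarity) computes s·Π(R·X) + t for a set of 3D points standing in for M(β, θ).

```python
import numpy as np

def weak_perspective_project(points_3d, R, s, t):
    """Project 3D points to the image plane as s * Pi(R @ X) + t.

    points_3d: (V, 3) array standing in for M(beta, theta).
    R: (3, 3) rotation of the body relative to the camera.
    s: scalar scaling parameter.
    t: (2,) translation on the image plane.
    """
    pi = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])    # 2 x 3 orthogonal projection matrix
    rotated = points_3d @ R.T            # apply R to every point -> (V, 3)
    projected = rotated @ pi.T           # drop the depth axis -> (V, 2)
    return s * projected + t             # image-plane coordinates

# Toy usage with random stand-ins for the SMPL output M(beta, theta)
joints_3d = np.random.randn(24, 3)
pts_2d = weak_perspective_project(joints_3d, np.eye(3), s=2.0, t=np.array([0.1, -0.05]))
print(pts_2d.shape)                      # (24, 2)
```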
In some embodiments, step 110 may be performed by the embodiment in fig. 2.
FIG. 2 illustrates a flow diagram for some embodiments of step 110 in FIG. 1.
As shown in fig. 2, step 110 includes: step 1110, determining a segmentation map and a thermodynamic map; and step 1120, determining a target feature.
In step 1110, the background segmentation map of each frame image, the target segmentation map of the target to be processed, and the thermodynamic maps of the key points of the target to be processed are extracted as image features. For example, the background segmentation map and the target segmentation map may be acquired by image processing methods such as edge detection and threshold segmentation.
In some embodiments, the key points may be key joints of a human body, and the gray value of each pixel point in the thermodynamic diagram represents the probability that each pixel point is a certain key joint. For example, 14 of the 23 joints of the human SMPL model may be determined as key joints by a machine learning method. In this case, 14 thermodynamic diagrams can be obtained through a machine learning method, wherein each thermodynamic diagram shows the probability that each pixel in the thermodynamic diagram is a certain key joint, and the higher the gray value is, the higher the probability is.
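As an illustration of how the thermodynamic diagrams (heatmaps) can be used downstream, the sketch below extracts one 2D coordinate per key joint by taking the argmax of each of the 14 maps; the array shapes and names are illustrative assumptions rather than the exact procedure prescribed by the disclosure.

```python
import numpy as np

def heatmaps_to_keypoints(heatmaps):
    """Convert per-joint heatmaps to (x, y) coordinates.

    heatmaps: (J, H, W) array; each channel holds the probability that a pixel
    is the corresponding key joint (here J = 14).
    Returns: (J, 2) array of (x, y) pixel coordinates of the maxima.
    """
    num_joints, height, width = heatmaps.shape
    flat_idx = heatmaps.reshape(num_joints, -1).argmax(axis=1)
    ys, xs = np.unravel_index(flat_idx, (height, width))
    return np.stack([xs, ys], axis=1).astype(np.float32)

heatmaps = np.random.rand(14, 64, 64)
print(heatmaps_to_keypoints(heatmaps).shape)   # (14, 2)
```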
In step 1120, the target feature of each frame image is determined according to the target segmentation map, the background segmentation map and the thermodynamic map.
In some embodiments, the target feature of each frame image may be determined by the embodiment in fig. 3.
Fig. 3 illustrates a schematic diagram of some embodiments of a method of feature determination of an object in a video of the present disclosure.
As shown in FIG. 3, frame images 1 to N may be input into a feature extraction model to determine the corresponding target features 1 to N, where N is a positive integer greater than 1 and n is a positive integer not greater than N. For example, the feature extraction model may be supervised with an MSE (Mean Square Error) loss function.
In some embodiments, the feature extraction model may include a first encoder, a first decoder, and a second decoder as two-dimensional processing modules. For example, the first encoder may be used to determine the encoding information of each frame image; the first decoder may be used to obtain the target segmentation map and the background segmentation map of each frame image from the encoding information; and the second decoder model may be used to determine the thermodynamic diagrams of each frame image from the encoding information.
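A minimal PyTorch sketch of a two-dimensional processing module of this kind, assuming a shared convolutional encoder with one decoder head for the two segmentation maps and one for the 14 heatmaps; all channel counts and layer choices are assumptions.

```python
import torch
import torch.nn as nn

class TwoDModule(nn.Module):
    """Shared first encoder with segmentation and heatmap decoder heads."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(                     # first encoder model
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.seg_decoder = nn.Sequential(                 # first decoder: 2 segmentation maps
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 2, 4, stride=2, padding=1),
        )
        self.heatmap_decoder = nn.Sequential(             # second decoder: 14 heatmaps
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 14, 4, stride=2, padding=1),
        )

    def forward(self, frame):                             # frame: (B, 3, H, W)
        code = self.encoder(frame)                        # encoding information
        return self.seg_decoder(code), self.heatmap_decoder(code)

segs, heatmaps = TwoDModule()(torch.randn(1, 3, 224, 224))
print(segs.shape, heatmaps.shape)                         # (1, 2, 224, 224) (1, 14, 224, 224)
```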
In some embodiments, the feature extraction model may further include a second encoder as a three-dimensional processing module. For example, the second encoder may be built on a VGG (Visual Geometry Group) network model. The target segmentation map, the background segmentation map and the thermodynamic maps can be combined into a tensor of batch size × height × width × 16 as the input of the VGG network model, where batch size is the batch size of the neural network, width and height are the width and height of each frame, and 16 is the total number of maps (14 thermodynamic maps plus the target segmentation map and the background segmentation map). The last fully connected layer of the VGG model can be set to an 85-dimensional output to match the dimensions of F (β: 10, θ: 69, R: 2, t: 2, s: 2).
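The three-dimensional processing module can be sketched in the same spirit: a small VGG-style encoder that takes the 16 stacked maps as input (channels-first layout, per PyTorch convention) and ends in an 85-dimensional fully connected layer. The intermediate widths below are illustrative assumptions, not the exact VGG configuration.

```python
import torch
import torch.nn as nn

class SecondEncoder(nn.Module):
    """VGG-style encoder: 16-channel maps -> 85-dim target feature F."""

    def __init__(self, in_channels=16, feature_dim=85):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(256, feature_dim)   # beta (10) + theta (69) + R, t, s

    def forward(self, x):                          # x: (B, 16, H, W)
        h = self.backbone(x).flatten(1)            # (B, 256)
        return self.head(h)                        # (B, 85)

maps = torch.randn(4, 16, 224, 224)                # batch of stacked segmentation + heatmaps
print(SecondEncoder()(maps).shape)                 # torch.Size([4, 85])
```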
After the target features of each frame image have been determined by single-frame processing of the video, the remaining steps in fig. 1 may be used to jointly process the frame images of the video and determine the comprehensive feature.
In step 120, a difference feature between the target features is determined. For example, a convolutional neural network (e.g., the first machine learning model in fig. 3) may be used to determine the difference features from the thermodynamic diagram, pose parameters, and projection parameters of each frame of image.
In some embodiments, the first machine learning model includes a stitching module to stitch the thermodynamic diagram, the pose parameter, and the projection parameter as first input information, a first convolution module including a plurality of convolution layers to determine first convolution information from the first input information, and a first connection module including a plurality of fully-connected layers to determine the difference feature from the first convolution information.
In step 130, the target features are fused according to the difference features, and the comprehensive features of the target to be processed are determined. Step 130 may be performed, for example, by the embodiment in fig. 4.
Fig. 4 illustrates a flow diagram of some embodiments of step 130 in fig. 1.
As shown in fig. 4, step 130 includes: step 1310, compensating the target characteristics of each frame of image; and step 1320, fusing the compensated target features.
In step 1310, each target feature is compensated for a difference feature. In some embodiments, the difference features may be determined and compensated for by the first machine learning model in fig. 5.
Fig. 5 illustrates a schematic diagram of some embodiments of a first machine learning model of the present disclosure.
As shown in fig. 5, the Concat layer of the first machine learning model concatenates the input thermodynamic diagrams, pose parameters, and projection parameters. For example, the pose parameters and projection parameters form an N × 75-dimensional tensor Params, and an N × 28-dimensional tensor Kps is generated from the thermodynamic diagrams (each of the 14 key points is represented by its horizontal and vertical coordinates, i.e., 2 dimensions). The input of the first machine learning model is the concatenation result Input of Params and Kps, a 1 × 1 × N × 103-dimensional tensor.
Input is passed through 4 convolutional layers Cov in sequence (with convolution kernels of 3 × 1 and 1 × 1); the convolution result is then passed through 3 fully connected layers (FC1024, FC75) in sequence to obtain an N × 75 tensor as the difference feature. Adding the N × 75-dimensional difference feature to the N × 75-dimensional input tensor Params yields the compensated target features (including pose parameters and projection parameters) Output of N × 75 dimensions, thereby compensating each target feature.
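A hedged PyTorch sketch of a residual module of this kind: Params and Kps are concatenated, passed through temporal convolutions and fully connected layers to produce an N × 75 difference feature, and the difference is added back to Params. The channel widths and the exact number and sizes of layers are assumptions where the text is ambiguous.

```python
import torch
import torch.nn as nn

class DifferenceModel(nn.Module):
    """Residual module: (Params, Kps) -> per-frame difference feature (N x 75)."""

    def __init__(self, in_dim=103, out_dim=75):
        super().__init__()
        # Temporal convolutions over the N frames (kernel sizes 3 and 1, as in the text).
        self.convs = nn.Sequential(
            nn.Conv1d(in_dim, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=1), nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=1), nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Linear(128, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, out_dim),
        )

    def forward(self, params, kps):                 # params: (N, 75), kps: (N, 28)
        x = torch.cat([params, kps], dim=1)         # (N, 103), the "Concat" step
        h = self.convs(x.t().unsqueeze(0))          # (1, 128, N)
        diff = self.fc(h.squeeze(0).t())            # (N, 75) difference feature
        return params + diff                        # residual compensation -> Output

params, kps = torch.randn(8, 75), torch.randn(8, 28)
print(DifferenceModel()(params, kps).shape)         # torch.Size([8, 75])
```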
In the above embodiment, the first machine learning model performs temporal fusion of the target features of the frame images (i.e., fusion through a residual module), so that the difference information between frame images can be learned. The differences can then be compensated using the continuity between frame images to determine unified target features, which improves accuracy.
In some embodiments, a loss function for training the first machine learning model may be constructed from a characteristic error parameter and a key point distance parameter. The characteristic error parameter is determined according to the error between each compensated target feature and its true value; the key point distance parameter is determined according to the position change of each key point between two adjacent frame images.
For example, the loss function may be defined as Lall = wp·Lp + wc·Lc.
Lp is the characteristic error parameter and wp is its weight. Lp measures the error between the compensated estimates of the shape information, pose information, rotation parameter, translation parameter and scaling parameter obtained by the method and the true values Fgt = {βgt, θgt, Rgt, tgt, sgt}, where βgt is the true value of the shape information of the target and θgt, Rgt, tgt, sgt are the true values of the pose information, rotation parameter, translation parameter and scaling parameter, respectively.
Lc is the key point distance parameter and wc is its weight. Lc combines a three-dimensional term, weighted by W3d and computed from the coordinates of the key points of the target corresponding to the n-th frame in three-dimensional space, and a two-dimensional term, weighted by W2d and computed from the coordinates of those key points projected onto the two-dimensional camera plane, so that the position change of each key point between adjacent frames is penalized.
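In LaTeX form, one plausible reconstruction of the loss, assuming squared L2 norms for both terms (the exact norms, summation ranges and inner weights are not recoverable from the original text and are therefore assumptions), is:

```latex
L_{all} = w_p L_p + w_c L_c, \qquad
L_p = \sum_{n=1}^{N} \bigl\| \hat{F}_n - F^{gt}_n \bigr\|_2^2, \qquad
L_c = \sum_{n=2}^{N} \Bigl( W_{3d} \bigl\| X^{3d}_n - X^{3d}_{n-1} \bigr\|_2^2
      + W_{2d} \bigl\| x^{2d}_n - x^{2d}_{n-1} \bigr\|_2^2 \Bigr)
```

Here \hat{F}_n and F^{gt}_n are the compensated and ground-truth target features of the n-th frame, and X^{3d}_n and x^{2d}_n are the three-dimensional and projected two-dimensional coordinates of the key points in the n-th frame.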
After feature compensation, the composite feature can be obtained through step 1320 in fig. 4.
In step 1320, the compensated target features are fused into a composite target feature of the target. The target features comprise shape parameters, attitude parameters and projection parameters of the target to be processed, and the compensated target features comprise the compensated attitude parameters and projection parameters.
In some embodiments, the shape parameters, the compensated pose parameters, and the projection parameters in each frame image may be used as the second input information, and the shape parameters in each frame image may be fused into a unified shape parameter as the integrated feature by using the second machine learning model in fig. 3.
In some embodiments, K rounds of fusion processing may be performed by using the second machine learning model to fuse the features of the target in each frame image into one comprehensive feature, where K is a positive integer greater than 1. The number of fusion results of the (k+1)-th round of fusion processing is less than the number of fusion results obtained by the k-th round, and k is a positive integer less than or equal to K.
For example, the second machine learning model may be configured by the embodiment in fig. 6.
Fig. 6 illustrates a schematic diagram of some embodiments of a second machine learning model of the present disclosure.
As shown in fig. 6, the input of the second machine learning model is the second input information of N × 85 dimensions. The second input information includes the N × 75-dimensional tensor Output produced by the first machine learning model and the N × 10-dimensional shape parameters of each frame image output by the feature extraction model in fig. 3.
The second input information is passed through the second convolution module to determine the second convolution information. The second convolution module may include a plurality of convolution layers Cov (e.g., with 1 × 1, 3 × 1 and 1 × 1 convolution kernels, respectively).
The second convolution information is transposed by the transposition module into the format required by the second connection module (e.g., fully connected layers whose output tensors have dimensions of 512 and 85, respectively). For example, the transposition module may change the size and shape of the second convolution information through a View function without changing the data in it.
In some embodiments, the second machine learning model fuses every M second input information into one fused result in each iteration process, where M is a positive integer greater than 1. For example, the transposing module shown in fig. 6 transposes the first dimension of the second convolution information from N to N/3, i.e., the transposing module outputs a tensor with dimensions (N/3) × (64 × 85).
Determining feature information from the tensor with dimensions (N/3) × (64 × 85) using a second connection module; determining third convolution information with a third convolution module (e.g., a Cov layer that may include a 3 x 1 convolution kernel) based on the second input information; and combining the characteristic information and the third convolution information to obtain a fusion result of the fusion processing. For example, the fusion result is N/3 tensors of 85 dimensions. That is, the second machine learning model fuses every 3 pieces of the second input information into 1 piece of the fusion result every time the fusion process is performed.
If the fusion result is a single tensor, the fusion result is used as the comprehensive feature. If the fusion result is still a plurality of multidimensional tensors, the fusion result is used as the second input information, and the second machine learning model continues the fusion through iterative processing until a single tensor is obtained.
The shape parameter obtained when the fusion result is a single tensor is a fusion result that fully considers all the information of the whole video. That is, the target features of each frame image of the video are fused into one comprehensive feature, so that the shape information of the target can be reflected more accurately.
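A hedged PyTorch sketch of this group-wise fusion: every group of M = 3 input vectors is fused into one 85-dimensional vector through two parallel branches that are added together, and the process repeats until a single vector remains. Layer widths, the padding of leftover frames, and the final extraction of the 10-dimensional shape parameter are assumptions.

```python
import torch
import torch.nn as nn

class GroupFusion(nn.Module):
    """Fuse every group of M consecutive 85-dim feature vectors into one."""

    def __init__(self, dim=85, group=3):
        super().__init__()
        self.group = group
        # Branch standing in for the "second convolution + fully connected" path.
        self.branch_a = nn.Sequential(
            nn.Linear(dim * group, 512), nn.ReLU(),
            nn.Linear(512, dim),
        )
        # Branch standing in for the "third convolution" path (same output dimension).
        self.branch_b = nn.Linear(dim * group, dim)

    def forward(self, feats):                        # feats: (N, 85), N divisible by group
        grouped = feats.reshape(-1, self.group * feats.shape[1])   # (N/3, 255), the View step
        return self.branch_a(grouped) + self.branch_b(grouped)     # fused: (N/3, 85)

def fuse_to_single(feats, model):
    """Repeat the group-wise fusion until one 85-dim vector remains."""
    while feats.shape[0] > 1:
        pad = (-feats.shape[0]) % model.group        # pad by repeating the last vector
        if pad:
            feats = torch.cat([feats, feats[-1:].repeat(pad, 1)], dim=0)
        feats = model(feats)
    return feats[0]                                  # comprehensive feature (85-dim)

model = GroupFusion()
video_feats = torch.randn(9, 85)                     # per-frame Output (75) + shape params (10)
fused = fuse_to_single(video_feats, model)
shape_param = fused[:10]                             # hypothetical: first 10 dims as unified beta
print(fused.shape, shape_param.shape)                # torch.Size([85]) torch.Size([10])
```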
In some embodiments, a three-dimensional model of the object to be processed is generated based on the integrated features of the object. For example, the comprehensive characteristics are the shape information of the fused human body, and a human body three-dimensional model can be generated according to the shape information for virtual fitting, clothing recommendation and other functions.
In some embodiments, the training of the machine learning models involved in the present method may be divided into several different phases.
In the first stage, the machine learning models that process two-dimensional information, such as the first encoder model, the first decoder model and the second decoder model, may be trained (training data for these models are plentiful).
In the second stage, a second encoder for processing three-dimensional information may be trained. For example, computer graphics methods may be employed to generate training data in the second stage. An SMPL model may first be given and then projected onto the camera plane by setting the relevant camera parameters to generate relevant training data.
In the third stage, the first machine learning model and the second machine learning model may be trained. For example, computer graphics methods may be used to generate the training data: continuous SMPL parameter data may be collected through motion capture or downloaded from open databases.
For example, based on these parameter data, a motion sequence may first be projected into target segmentation maps, background segmentation maps and thermodynamic maps by setting the relevant camera parameters; then, the corresponding processing results and true values can be obtained through a forward pass of the second encoder model; finally, the first machine learning model and the second machine learning model are trained with the resulting training data. In this way, a computer graphics method can generate a sufficient amount of training data and overcome the problem of insufficient training data.
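A toy sketch of how such synthetic supervision could be assembled, with the SMPL forward pass replaced by a fixed set of 3D key joints and all names hypothetical: smooth camera parameters are sampled per frame, projected 2D key points are generated with the weak perspective model, and noisy per-frame parameters imitate single-frame estimation error.

```python
import numpy as np

def make_training_sequence(joints_3d, num_frames=9, noise=0.02, seed=0):
    """Build (noisy per-frame params, 2D keypoints, ground truth) for one clip.

    joints_3d: (J, 3) key-joint positions of a given pose (placeholder for SMPL output).
    Returns params_noisy, kps_2d (N, J, 2), params_gt.
    """
    rng = np.random.RandomState(seed)
    pi = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])     # 2 x 3 projection
    kps_2d, params_gt = [], []
    for n in range(num_frames):
        angle = 0.05 * n                                   # slowly rotating camera
        R = np.array([[np.cos(angle), 0, np.sin(angle)],
                      [0, 1, 0],
                      [-np.sin(angle), 0, np.cos(angle)]])
        s, t = 2.0, np.array([0.01 * n, 0.0])              # smooth scale / translation
        kps_2d.append(s * (joints_3d @ R.T @ pi.T) + t)    # weak perspective projection
        params_gt.append(np.concatenate([[angle], t, [s]]))  # toy packing of camera params
    params_gt = np.stack(params_gt)
    params_noisy = params_gt + noise * rng.randn(*params_gt.shape)  # single-frame noise
    return params_noisy, np.stack(kps_2d), params_gt

noisy, kps, gt = make_training_sequence(np.random.randn(14, 3))
print(noisy.shape, kps.shape, gt.shape)
```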
In the above-described embodiment, a plurality of target features are fused into one unified integrated feature according to the difference features between the target features extracted from the respective frame images. Therefore, the accuracy of determining the target characteristics can be improved by mining the time continuity of each frame of image.
Fig. 7 illustrates a block diagram of some embodiments of an apparatus for feature determination of an object in a video of the present disclosure.
As shown in fig. 7, the feature determination device 7 includes a feature determination unit 71, a difference determination unit 72, and a feature fusion unit 73.
The feature determination unit 71 determines a target feature of an object to be processed in each frame image based on an image feature extracted from each frame image of the video. The difference determination unit 72 determines a difference feature between the target features. The feature fusion unit 73 fuses the features of the targets according to the difference features to determine the comprehensive features of the targets to be processed.
In some embodiments, the feature determination unit 71 extracts, as the image features, the background segmentation map of the frame image, the target segmentation map of the target to be processed, and the thermodynamic maps of the key points of the target to be processed. The feature determination unit 71 determines a target feature of each frame image based on the target segmentation map, the background segmentation map, and the thermodynamic map.
In some embodiments, the feature fusion unit 73 compensates each target feature according to the difference feature. The feature fusion unit 73 fuses the compensated target features into a comprehensive target feature of the target.
In some embodiments, the target feature comprises pose parameters and projection parameters of the target to be processed, and the projection parameters comprise rotation parameters of the target to be processed relative to the camera, translation parameters of the target to be processed projected to the image plane, and scaling parameters. The feature fusion unit 73 determines a difference feature using the first machine learning model based on the thermodynamic diagram, the pose parameter, and the projection parameter of each frame image.
For example, the first machine learning model includes a concatenation module, a first convolution module, and a first connection module. The splicing module is used for splicing the thermodynamic diagram, the attitude parameters and the projection parameters into first input information. The first convolution module includes a plurality of convolution layers for determining first convolution information from the first input information. The first connection module includes a plurality of fully connected layers for determining a difference characteristic from the first convolution information.
In some embodiments, a loss function may be constructed from the feature error parameters and the keypoint distance parameters for training the first machine learning model. And determining the characteristic error parameters according to the compensated error between each target characteristic and the true value of each target characteristic. And the key point distance parameter is determined according to the position change of each key point in the two adjacent frames of images.
In some embodiments, the target features include shape parameters, pose parameters, and projection parameters of the target to be processed, and the compensated target features include compensated pose parameters and projection parameters. The feature fusion unit 73 uses the shape parameters, the compensated pose parameters, and the projection parameters in each frame image as second input information, and fuses the shape parameters in each frame image into one unified shape parameter as a comprehensive feature using a second machine learning model.
For example, the feature fusion unit 73 fuses the features of the target in each frame image into one comprehensive feature through K rounds of fusion processing of the second machine learning model, where K is a positive integer greater than 1. The number of fusion results of the (k+1)-th round of fusion processing is less than the number of fusion results obtained by the k-th round, and k is a positive integer less than or equal to K.
For example, the second machine learning model includes a second convolution module, a third convolution module, a transpose module, a second connection module, and a fusion module. The second convolution module includes a plurality of convolution layers for determining second convolution information based on the second input information. And the transposition module transposes the second convolution information into a format required by the second connection module. The second connection module includes a plurality of fully connected layers for determining the characteristic information from the second convolution information. And the third convolution module is used for determining third convolution information according to the second input information, and the third convolution information and the feature information have the same dimensionality. And the fusion module is used for fusing the third convolution information and the characteristic information into fusion information.
In some embodiments, the target segmentation map and the background segmentation map are determined using a first decoder model based on encoding information of each frame image, the encoding information being determined using a first encoder model; a thermodynamic diagram is determined using the second decoder model based on the encoded information; the target feature is determined by the second encoder.
In some embodiments, a three-dimensional model of the object to be processed is generated based on the integrated features of the object to be processed.
In the above-described embodiment, a plurality of target features are fused into one unified integrated feature according to the difference features between the target features extracted from the respective frame images. Therefore, the accuracy of determining the target characteristics can be improved by mining the time continuity of each frame of image.
Fig. 8 illustrates a block diagram of further embodiments of an apparatus for feature determination of an object in a video of the present disclosure.
As shown in fig. 8, the feature determination device 8 of this embodiment includes: a memory 81 and a processor 82 coupled to the memory 81, the processor 82 being configured to perform a feature determination method in any of the embodiments of the present disclosure based on instructions stored in the memory 81.
The memory 81 may include, for example, a system memory, a fixed nonvolatile storage medium, and the like. The system memory stores, for example, an operating system, an application program, a Boot Loader (Boot Loader), a database, and other programs.
Fig. 9 illustrates a block diagram of still further embodiments of an apparatus for feature determination of an object in a video of the present disclosure.
As shown in fig. 9, the feature determination device 9 of this embodiment includes: a memory 910 and a processor 920 coupled to the memory 910, the processor 920 being configured to perform a feature determination method in any of the embodiments described above based on instructions stored in the memory 910.
The memory 910 may include, for example, system memory, fixed non-volatile storage media, and the like. The system memory stores, for example, an operating system, an application program, a Boot Loader (Boot Loader), and other programs.
The feature determination apparatus 9 may further include an input-output interface 930, a network interface 940, a storage interface 950, and the like. These interfaces 930, 940, 950 and the memory 910 and the processor 920 may be connected, for example, by a bus 960. The input/output interface 930 provides a connection interface for input/output devices such as a display, a mouse, a keyboard, and a touch screen. The network interface 940 provides a connection interface for various networking devices. The storage interface 950 provides a connection interface for external storage devices such as an SD card and a USB flash drive.
As will be appreciated by one skilled in the art, embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
So far, the feature determination method of the object in the video, the feature determination apparatus of the object in the video, and the computer-readable storage medium according to the present disclosure have been described in detail. Some details that are well known in the art have not been described in order to avoid obscuring the concepts of the present disclosure. It will be fully apparent to those skilled in the art from the foregoing description how to practice the presently disclosed embodiments.
The method and system of the present disclosure may be implemented in a number of ways. For example, the methods and systems of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order for the steps of the method is for illustration only, and the steps of the method of the present disclosure are not limited to the order specifically described above unless specifically stated otherwise. Further, in some embodiments, the present disclosure may also be embodied as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
Although some specific embodiments of the present disclosure have been described in detail by way of example, it should be understood by those skilled in the art that the foregoing examples are for purposes of illustration only and are not intended to limit the scope of the present disclosure. It will be appreciated by those skilled in the art that modifications may be made to the above embodiments without departing from the scope and spirit of the present disclosure. The scope of the present disclosure is defined by the appended claims.
Claims (14)
1. A method for determining characteristics of an object in a video comprises the following steps:
determining target characteristics of a target to be processed in each frame image according to image characteristics extracted from each frame image of a video;
determining difference features between the target features;
and fusing the target characteristics according to the difference characteristics to determine the comprehensive characteristics of the target to be processed.
2. The feature determination method according to claim 1, wherein the determining of the target feature of the target to be processed in each frame image comprises:
extracting a background segmentation graph of each frame of image, a target segmentation graph of the target to be processed and a thermodynamic diagram of key points of the target to be processed as the image features;
and determining the target feature of each frame image according to the target segmentation map, the background segmentation map and the thermodynamic map.
3. The feature determination method according to claim 2, wherein the determining of the integrated feature of the object to be processed comprises:
compensating each target characteristic according to the difference characteristic;
and fusing the compensated target characteristics into the comprehensive target characteristics of the target.
4. The feature determination method of claim 3, wherein determining a difference feature between the target features comprises:
the target feature comprises an attitude parameter and a projection parameter of the target to be processed, and the projection parameter comprises a rotation parameter of the target to be processed relative to a camera, a translation parameter of the target to be processed projected to an image plane and a zooming parameter;
determining the difference features using a first machine learning model based on the thermodynamic diagram, the pose parameters, and the projection parameters of the respective frame images.
5. The feature determination method according to claim 4,
the first machine learning model comprises a splicing module, a first convolution module and a first connection module, wherein the splicing module is used for splicing the thermodynamic diagram, the attitude parameters and the projection parameters into first input information, the first convolution module comprises a plurality of convolution layers and is used for determining first convolution information according to the first input information, and the first connection module comprises a plurality of full connection layers and is used for determining the difference characteristics according to the first convolution information.
6. The feature determination method according to claim 5,
the first machine learning model is trained according to a loss function, the loss function is constructed according to characteristic error parameters and key point distance parameters, the characteristic error parameters are determined according to the compensated error between each target characteristic and the true value of each target characteristic, and the key point distance parameters are determined according to the position change of each key point in two adjacent frames of images.
7. The feature determination method according to claim 3,
the target features comprise shape parameters, attitude parameters and projection parameters of the target to be processed, and the compensated target features comprise the compensated attitude parameters and projection parameters;
the determining the comprehensive characteristics of the target to be processed comprises the following steps:
and taking the shape parameters, the compensated posture parameters and the projection parameters in the images of the frames as second input information, and utilizing a second machine learning model to fuse the shape parameters in the images of the frames into a unified shape parameter as the comprehensive characteristic.
8. The feature determination method of claim 7, wherein the determining the composite feature of the object to be processed comprises:
and performing K rounds of fusion processing by using the second machine learning model, fusing the features of the target in each frame image into one comprehensive feature, wherein K is a positive integer greater than 1, the number of fusion results of the (k+1)-th round of fusion processing is less than the number of fusion results obtained by the k-th round of fusion processing, and k is a positive integer less than or equal to K.
9. The feature determination method according to claim 8,
the second machine learning model comprises a second convolution module, a third convolution module, a transposition module, a second connection module and a fusion module, wherein the second convolution module comprises a plurality of convolution layers and is used for determining second convolution information according to the second input information, the transposition module transposes the second convolution information into a format required by the second connection module, the second connection module comprises a plurality of full-connection layers and is used for determining characteristic information according to the second convolution information, the third convolution module is used for determining third convolution information according to the second input information, the third convolution information and the characteristic information have the same dimensionality, and the fusion module is used for fusing the third convolution information and the characteristic information into the fusion information.
10. The feature determination method according to any one of claims 2 to 9,
the target segmentation graph and the background segmentation graph are determined by utilizing a first decoder model according to the coding information of each frame of image, and the coding information is determined by utilizing a first coder model;
the thermodynamic diagram is determined with a second decoder model from the encoded information;
the target feature is determined by a second encoder.
11. The feature determination method according to any one of claims 1 to 9, further comprising:
and generating a three-dimensional model of the target to be processed according to the comprehensive characteristics of the target to be processed.
12. An apparatus for feature determination of an object in a video, comprising:
the characteristic determining unit is used for determining the target characteristic of the target to be processed in each frame image according to the image characteristic extracted from each frame image of the video;
a difference determining unit for determining a difference feature between the target features;
and the feature fusion unit is used for fusing the target features according to the difference features to determine the comprehensive features of the target to be processed.
13. An apparatus for feature determination of an object in a video, comprising:
a memory; and
a processor coupled to the memory, the processor being configured to perform the method for determining characteristics of an object in a video of any one of claims 1 to 11 based on instructions stored in the memory.
14. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method for determining characteristics of an object in a video according to any one of claims 1 to 11.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910265480.7A CN111783497B (en) | 2019-04-03 | 2019-04-03 | Method, apparatus and computer readable storage medium for determining characteristics of objects in video |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910265480.7A CN111783497B (en) | 2019-04-03 | 2019-04-03 | Method, apparatus and computer readable storage medium for determining characteristics of objects in video |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111783497A true CN111783497A (en) | 2020-10-16 |
CN111783497B CN111783497B (en) | 2024-08-20 |
Family
ID=72754779
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910265480.7A Active CN111783497B (en) | 2019-04-03 | 2019-04-03 | Method, apparatus and computer readable storage medium for determining characteristics of objects in video |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111783497B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112819849A (en) * | 2021-01-14 | 2021-05-18 | 电子科技大学 | Mark point-free visual motion capture method based on three eyes |
CN114677572A (en) * | 2022-04-08 | 2022-06-28 | 北京百度网讯科技有限公司 | Object description parameter generation method and deep learning model training method |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR101814040B1 (en) * | 2017-09-06 | 2018-01-02 | 한국기술교육대학교 산학협력단 | An integrated surveillance device using 3D depth information focus control |
CN107644423A (en) * | 2017-09-29 | 2018-01-30 | 北京奇虎科技有限公司 | Video data real-time processing method, device and computing device based on scene cut |
CN107657625A (en) * | 2017-09-11 | 2018-02-02 | 南京信息工程大学 | Merge the unsupervised methods of video segmentation that space-time multiple features represent |
CN108197623A (en) * | 2018-01-19 | 2018-06-22 | 百度在线网络技术(北京)有限公司 | For detecting the method and apparatus of target |
CN108765481A (en) * | 2018-05-25 | 2018-11-06 | 亮风台(上海)信息科技有限公司 | A kind of depth estimation method of monocular video, device, terminal and storage medium |
CN108898086A (en) * | 2018-06-20 | 2018-11-27 | 腾讯科技(深圳)有限公司 | Method of video image processing and device, computer-readable medium and electronic equipment |
CN109472248A (en) * | 2018-11-22 | 2019-03-15 | 广东工业大学 | A kind of pedestrian recognition methods, system and electronic equipment and storage medium again |
- 2019-04-03: CN201910265480.7A filed in China; granted as CN111783497B (status: Active)
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR101814040B1 (en) * | 2017-09-06 | 2018-01-02 | 한국기술교육대학교 산학협력단 | An integrated surveillance device using 3D depth information focus control |
CN107657625A (en) * | 2017-09-11 | 2018-02-02 | 南京信息工程大学 | Merge the unsupervised methods of video segmentation that space-time multiple features represent |
CN107644423A (en) * | 2017-09-29 | 2018-01-30 | 北京奇虎科技有限公司 | Video data real-time processing method, device and computing device based on scene cut |
CN108197623A (en) * | 2018-01-19 | 2018-06-22 | 百度在线网络技术(北京)有限公司 | For detecting the method and apparatus of target |
CN108765481A (en) * | 2018-05-25 | 2018-11-06 | 亮风台(上海)信息科技有限公司 | A kind of depth estimation method of monocular video, device, terminal and storage medium |
CN108898086A (en) * | 2018-06-20 | 2018-11-27 | 腾讯科技(深圳)有限公司 | Method of video image processing and device, computer-readable medium and electronic equipment |
CN109472248A (en) * | 2018-11-22 | 2019-03-15 | 广东工业大学 | A kind of pedestrian recognition methods, system and electronic equipment and storage medium again |
Non-Patent Citations (4)
Title |
---|
- ZHOU Yang; HE Yongjian; TANG Xianghong; LU Yu; JIANG Gangyi: "Stereoscopic video saliency detection fusing binocular multi-dimensional perception features", Journal of Image and Graphics, no. 03 *
- ZHANG Na; WEI Haiping; YU Hongfei: "A new moving object detection algorithm fusing feature points and motion compensation information", Computer Applications and Software, no. 11 *
- LI Yuanxiang; XU Peng; JING Zhongliang; WEI Xian: "Space target sequence image recognition technology", Journal of Harbin Institute of Technology, no. 11, pages 114-119 *
- CHEN Hongjun; ZHAO Liheng; LUO Fuqiang; LI Yao: "Multi-target separation simulation using local difference features of single-frame images", Computer Simulation, no. 06 *
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112819849A (en) * | 2021-01-14 | 2021-05-18 | 电子科技大学 | Mark point-free visual motion capture method based on three eyes |
CN112819849B (en) * | 2021-01-14 | 2021-12-03 | 电子科技大学 | Mark point-free visual motion capture method based on three eyes |
CN114677572A (en) * | 2022-04-08 | 2022-06-28 | 北京百度网讯科技有限公司 | Object description parameter generation method and deep learning model training method |
CN114677572B (en) * | 2022-04-08 | 2023-04-18 | 北京百度网讯科技有限公司 | Object description parameter generation method and deep learning model training method |
Also Published As
Publication number | Publication date |
---|---|
CN111783497B (en) | 2024-08-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111275518B (en) | Video virtual fitting method and device based on mixed optical flow | |
Kanazawa et al. | End-to-end recovery of human shape and pose | |
Zhang et al. | Object-occluded human shape and pose estimation from a single color image | |
Sharp et al. | Accurate, robust, and flexible real-time hand tracking | |
Richardson et al. | Learning detailed face reconstruction from a single image | |
US10334168B2 (en) | Threshold determination in a RANSAC algorithm | |
US20200272806A1 (en) | Real-Time Tracking of Facial Features in Unconstrained Video | |
CN112530019B (en) | Three-dimensional human body reconstruction method and device, computer equipment and storage medium | |
JP6624794B2 (en) | Image processing apparatus, image processing method, and program | |
CN110544301A (en) | Three-dimensional human body action reconstruction system, method and action training system | |
US11961266B2 (en) | Multiview neural human prediction using implicit differentiable renderer for facial expression, body pose shape and clothes performance capture | |
CN111783506B (en) | Method, apparatus and computer readable storage medium for determining target characteristics | |
CN115496863B (en) | Short video generation method and system for scene interaction of movie and television intelligent creation | |
EP3185212B1 (en) | Dynamic particle filter parameterization | |
CN111815768B (en) | Three-dimensional face reconstruction method and device | |
WO2022208440A1 (en) | Multiview neural human prediction using implicit differentiable renderer for facial expression, body pose shape and clothes performance capture | |
CN111462274A (en) | Human body image synthesis method and system based on SMP L model | |
CN115393519A (en) | Three-dimensional reconstruction method based on infrared and visible light fusion image | |
Li et al. | PoT-GAN: Pose transform GAN for person image synthesis | |
JP5503510B2 (en) | Posture estimation apparatus and posture estimation program | |
CN111783497B (en) | Method, apparatus and computer readable storage medium for determining characteristics of objects in video | |
Peng et al. | 3D hand mesh reconstruction from a monocular RGB image | |
Paterson et al. | 3D head tracking using non-linear optimization. | |
Manda et al. | Image stitching using ransac and bayesian refinement | |
Jian et al. | Realistic face animation generation from videos |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
- PB01 | Publication | ||
- SE01 | Entry into force of request for substantive examination | ||
- GR01 | Patent grant | ||