CN115994922B - Motion segmentation method, motion segmentation device, electronic equipment and storage medium - Google Patents

Motion segmentation method, motion segmentation device, electronic equipment and storage medium

Info

Publication number
CN115994922B
CN115994922B (application CN202310290423.0A)
Authority
CN
China
Prior art keywords
target
motion segmentation
feature
segmentation model
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310290423.0A
Other languages
Chinese (zh)
Other versions
CN115994922A (en)
Inventor
李俊
程靖航
李琦铭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Quanzhou Institute of Equipment Manufacturing
Original Assignee
Quanzhou Institute of Equipment Manufacturing
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Quanzhou Institute of Equipment Manufacturing filed Critical Quanzhou Institute of Equipment Manufacturing
Priority to CN202310290423.0A priority Critical patent/CN115994922B/en
Publication of CN115994922A publication Critical patent/CN115994922A/en
Application granted granted Critical
Publication of CN115994922B publication Critical patent/CN115994922B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00: Road transport of goods or passengers
    • Y02T10/10: Internal combustion engine [ICE] based vehicles
    • Y02T10/40: Engine management systems

Landscapes

  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of video data processing and provides a motion segmentation method, a motion segmentation device, electronic equipment and a storage medium. The motion segmentation method first obtains target key point information in each video frame of a video clip, and then inputs the target key point information into a motion segmentation model to obtain the target category of the target object to which the target key point information output by the motion segmentation model belongs. By introducing the motion segmentation model, hand-crafted heuristic approaches can be replaced, the misclassification rate is reduced, the classification result becomes more accurate, and the accuracy bottleneck of non-deep-learning motion segmentation can be effectively broken through. Moreover, by introducing multiple layers of attention feature extraction, the motion segmentation model enlarges the receptive field from the local receptive field of a traditional convolution structure to the entire video frame. By introducing global features, the feature learning dimension of the motion segmentation model is increased, further improving the accuracy of the classification result.

Description

Motion segmentation method, motion segmentation device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of video data processing technologies, and in particular, to a motion segmentation method, a motion segmentation device, an electronic device, and a storage medium.
Background
The rapid development of video compression algorithms and applications has produced massive amounts of video data. Video contains rich information; however, because video data is voluminous and does not directly represent abstract concepts the way text does, extracting and structuring video information is relatively complex.
Currently, video information is extracted mainly by classifying moving or stationary target objects in video, which requires video motion segmentation. Motion segmentation methods based on non-deep learning mostly adopt hand-crafted heuristics, so their misclassification rate is generally high and the bottleneck in classification accuracy is difficult to break through. Motion segmentation methods based on traditional convolutional neural network architectures can only learn the local information of a moving object and cannot effectively learn its global information, and therefore cannot guarantee the accuracy of the motion segmentation result.
Disclosure of Invention
The invention provides a motion segmentation method, a motion segmentation device, electronic equipment and a storage medium, which are used for solving the defects in the prior art.
The invention provides a motion segmentation method, which comprises the following steps:
acquiring target key point information in each video frame of a video clip;
Inputting the target key point information into a motion segmentation model to obtain a target category of a target object to which the target key point information output by the motion segmentation model belongs;
the motion segmentation model is obtained based on sample key point information of sample objects in each video frame sample of a video sample and class labels of the sample objects;
the motion segmentation model is used for sequentially extracting multi-layer attention features of the target key point information to obtain target point features; pooling the target point features to obtain a first global feature; performing feature fusion on the target point feature and the first global feature to obtain a second global feature; and decoding the second global feature to obtain the target category.
According to the motion segmentation method provided by the invention, the input of the target key point information to a motion segmentation model to obtain the target category of the target object to which the target key point information output by the motion segmentation model belongs comprises the following steps:
inputting the target key point information to a point embedding module of the motion segmentation model, and performing dimension conversion, linear conversion, normalization and nonlinear conversion on the target key point information by using the point embedding module to obtain initial point characteristics output by the point embedding module;
Inputting the initial point characteristics to an attention module of the motion segmentation model, carrying out multi-layer attention characteristic extraction on the initial point characteristics by utilizing the attention module, and obtaining the target point characteristics output by the attention module based on the attention characteristics extracted from the last layer and the initial point characteristics;
inputting the target point characteristics to a characteristic fusion module of the motion segmentation model, carrying out pooling operation on the target point characteristics by utilizing the characteristic fusion module to obtain first global characteristics, and carrying out characteristic fusion on the target point characteristics and the first global characteristics to obtain the second global characteristics output by the characteristic fusion module;
and inputting the second global features to a classification head module of the motion segmentation model, and decoding the second global features by using the classification head module to obtain the target category.
According to the motion segmentation method provided by the invention, the target point feature output by the attention module is obtained based on the attention feature extracted by the last layer and the initial point feature, and the motion segmentation method comprises the following steps:
determining an offset feature between the initial point feature and the last layer of extracted attention feature;
And performing dimension conversion, linear transformation, normalization and nonlinear transformation on the offset characteristic, and obtaining the target point characteristic based on the attention characteristic extracted by the last layer.
According to the motion segmentation method provided by the invention, the pooling operation is performed on the target point features by using the feature fusion module to obtain a first global feature, which comprises the following steps:
respectively carrying out maximum pooling operation and average pooling operation on the target point features by utilizing the feature fusion module to obtain a maximum pooling result and an average pooling result;
and fusing the maximum pooling result and the average pooling result to obtain the first global feature.
According to the motion segmentation method provided by the invention, the method for decoding the second global feature by using the classification head module to obtain the target category comprises the following steps:
performing dimension conversion, linear transformation, normalization and nonlinear transformation on the second global feature by using the classification head module to obtain an alternative feature;
and carrying out linear transformation on the alternative features to obtain the target category.
According to the motion segmentation method provided by the invention, the linear transformation is realized based on convolution operation.
According to the motion segmentation method provided by the invention, the motion segmentation model is obtained based on training of the following steps:
inputting the sample key point information into an initial segmentation model to obtain an output result of the initial segmentation model;
based on the output result and the class label, calculating model loss by adopting a BCEWithLogitsLoss function;
and carrying out iterative optimization on the model parameters of the initial segmentation model by taking the minimized model loss as an optimization target to obtain the motion segmentation model.
The invention also provides a motion segmentation device, comprising:
the acquisition module is used for acquiring target key point information in each video frame of the video clip;
the segmentation module is used for inputting the target key point information into a motion segmentation model to obtain a target category of a target object to which the target key point information output by the motion segmentation model belongs;
the motion segmentation model is obtained based on sample key point information of sample objects in each video frame sample of a video sample and class labels of the sample objects;
the motion segmentation model is used for sequentially extracting multi-layer attention features of the target key point information to obtain target point features; pooling the target point features to obtain a first global feature; performing feature fusion on the target point feature and the first global feature to obtain a second global feature; and decoding the second global feature to obtain the target category.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing a method of motion segmentation as described in any of the above when executing the program.
The invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method of motion segmentation as described in any of the above.
The invention also provides a computer program product comprising a computer program which, when executed by a processor, implements a method of motion segmentation as described in any one of the above.
Compared with the prior art, the invention has the following beneficial effects:
the invention provides a motion segmentation method, a motion segmentation device, electronic equipment and a storage medium. The method first obtains target key point information in each video frame of a video clip, and then inputs the target key point information into the motion segmentation model to obtain the target category of the target object to which the target key point information output by the motion segmentation model belongs. The motion segmentation model adopted by the method is trained based on sample key point information of sample objects in each video frame sample of a video sample and the class labels of the sample objects. The motion segmentation model sequentially performs multi-layer attention feature extraction on the target key point information to obtain target point features; pools the target point features to obtain a first global feature; fuses the target point features with the first global feature to obtain a second global feature; and decodes the second global feature to obtain the target category. By introducing the motion segmentation model, hand-crafted heuristic approaches can be replaced, the misclassification rate is reduced, the classification result becomes more accurate, and the accuracy bottleneck of non-deep-learning motion segmentation can be effectively broken through. Moreover, by introducing multiple layers of attention feature extraction, the motion segmentation model enlarges the receptive field from the local receptive field of a traditional convolution structure to the entire video frame. By introducing global features, the feature learning dimension of the motion segmentation model is increased, further improving the accuracy of the classification result.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious to those skilled in the art that other drawings can be obtained according to these drawings without inventive effort.
FIG. 1 is a schematic flow chart of a motion segmentation method according to the present invention;
FIG. 2 is a second flow chart of the motion segmentation method according to the present invention;
FIG. 3 is a schematic diagram of a structure of an attention module in a motion segmentation model used in the motion segmentation method according to the present invention;
FIG. 4 is a third flow chart of the motion segmentation method according to the present invention;
FIG. 5 is a schematic view of a motion segmentation apparatus according to the present invention;
fig. 6 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The non-deep-learning motion segmentation methods adopted in the prior art for motion segmentation of video generally have a high misclassification rate, so the bottleneck in classification accuracy is difficult to break through. Motion segmentation methods based on traditional convolutional neural network architectures can only learn the local information of a moving object and cannot effectively learn its global information, and therefore cannot ensure the completeness and accuracy of the motion segmentation result.
Therefore, the embodiment of the invention provides a motion segmentation method.
Fig. 1 is a flow chart of a motion segmentation method provided in an embodiment of the present invention, as shown in fig. 1, the motion segmentation method includes:
s1, acquiring target key point information in each video frame of a video clip;
s2, inputting the target key point information into a motion segmentation model to obtain a target category of a target object to which the target key point information output by the motion segmentation model belongs;
the motion segmentation model is obtained based on sample key point information of sample objects in each video frame sample of a video sample and class labels of the sample objects;
the motion segmentation model is used for sequentially extracting multi-layer attention features of the target key point information to obtain target point features; pooling the target point features to obtain a first global feature; performing feature fusion on the target point feature and the first global feature to obtain a second global feature; and decoding the second global feature to obtain the target category.
Specifically, in the motion segmentation method provided in the embodiment of the present invention, the execution body is a motion segmentation device, which may be configured in a computer; the computer may be a local computer or a cloud computer, and the local computer may be a desktop computer, a tablet, or the like, which is not specifically limited herein.
Step S1 is executed first, and target key point information in each video frame of the video clip is obtained. The length of the video clip can be set according to the need, and the number of video frames contained in the video clip can also be set according to the need, which is not particularly limited herein.
Target key point information can be extracted from each video frame, the target key point information can comprise position information and related information of target key points in the video frames, and the related information can comprise apparent information such as color information, neighborhood information and the like corresponding to the target key points. Here, the target keypoint information in each video frame of the video clip may constitute an information set.
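To make the data layout concrete, the per-frame information set described above could be assembled as an N×D tensor with one row per target key point. This is only an illustrative sketch: the exact components and their sizes (position, color, neighborhood descriptor) and the function name are assumptions, not fixed by this embodiment.

```python
import torch

def build_keypoint_tensor(positions, colors, neighborhoods):
    """Pack per-key-point information into an (N, D) tensor.

    positions:     (N, 2) x/y coordinates of the target key points
    colors:        (N, 3) color information at each key point (assumed RGB)
    neighborhoods: (N, C) neighborhood descriptor of each key point
    The composition of D = 2 + 3 + C is an assumption for illustration only.
    """
    return torch.cat([positions, colors, neighborhoods], dim=1)
```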
And then, executing step S2, namely inputting the target key point information into a motion segmentation model, namely inputting the information set into the motion segmentation model, and encoding and decoding the target key point information by utilizing the motion segmentation model so as to classify the target key points in each video frame, thereby obtaining the target category of the target object to which the target key point information belongs.
One or more target objects may be included in each video frame, and target keypoint information in each video frame may cover all target objects in that video frame. Further, the object categories of all the object objects in each video frame can be obtained by the motion segmentation model.
Here, the output result of the motion segmentation model may be in the form of a target key point with a target class in each video frame.
The motion segmentation model can be obtained by training an initial segmentation model by utilizing sample key point information of a sample object and a class label of the sample object in each video frame sample of a video sample.
When the initial segmentation model is trained to obtain a motion segmentation model, sample key point information can be input into the initial segmentation model to obtain an output result of the initial segmentation model, then a loss function is calculated through a class label and the output result, structural parameters of the initial segmentation model are updated, the loss function is recalculated, and the trained initial segmentation model, namely the motion segmentation model, is obtained when the loss function converges.
The motion segmentation model can encode target key point information, namely, the attention mechanism is utilized to sequentially extract multiple layers of attention features from the target key point information to obtain target point features. The target point feature may be characterized by a form of a feature map, which may be referred to as a target point feature map.
The multi-layer attention feature extraction can be realized by adopting multi-layer self-attention operation, the association degree of input data to output data can be automatically calculated through the self-attention operation, and the semantic relation between the target key points is learned, so that the target point feature is obtained. Through the semantic relation between the target key points, the method can help capture the global information characterization among all target objects, establish long-distance dependency relation, extract more sufficient second global features and solve the problem of global information access limitation of the traditional deep neural network. Here, the number of layers of the self-attention operation may be set as required, for example, may be set to be more than 2 layers, or may be set to other layers, and is not particularly limited herein.
Then, the motion segmentation model may perform a pooling operation on the target point features to obtain a first global feature. The first global feature may be characterized by a form of a feature map, which may be referred to as a first global feature map. The pooling operation may be a maximum pooling operation, an average pooling operation, or a maximum pooling operation and an average pooling operation. The first global features may include a plurality of first global features, each for characterizing preliminary global information of one target object.
Then, the motion segmentation model may perform feature fusion on the target point feature and the first global feature to obtain a second global feature. The second global features may include a plurality of second global features, each second global feature being used to characterize more complete global information of one target object. The second global feature may be characterized by a form of a feature map, which may be referred to as a second global feature map. The feature fusion may be performed by stitching the target point feature with the first global feature, for example, the target point feature map may be stitched directly after the first global feature map, or may be stitched by other manners, which is not limited herein specifically.
The coding process of the motion segmentation model on the target key point information is finished, and the obtained coding result is the second global feature.
And finally, the motion segmentation model can decode the second global feature to obtain the target category of the target object to which the target key point information belongs. The decoding process can be understood as mapping the feature space to the class space to obtain the target class of the target object in the class space.
The motion segmentation method provided by the embodiment of the invention first obtains target key point information in each video frame of a video clip, and then inputs the target key point information into the motion segmentation model to obtain the target category of the target object to which the target key point information output by the motion segmentation model belongs. The motion segmentation model adopted by the method is trained based on sample key point information of sample objects in each video frame sample of a video sample and the class labels of the sample objects. The motion segmentation model sequentially performs multi-layer attention feature extraction on the target key point information to obtain target point features; pools the target point features to obtain a first global feature; fuses the target point features with the first global feature to obtain a second global feature; and decodes the second global feature to obtain the target category. By introducing the motion segmentation model, hand-crafted heuristic approaches can be replaced, the misclassification rate is reduced, the classification result becomes more accurate, and the accuracy bottleneck of non-deep-learning motion segmentation can be effectively broken through. Moreover, by introducing multiple layers of attention feature extraction, the motion segmentation model enlarges the receptive field from the local receptive field of a traditional convolution structure to the entire video frame. By introducing global features, the feature learning dimension of the motion segmentation model is increased, further improving the accuracy of the classification result.
On the basis of the foregoing embodiment, in the motion segmentation method provided in the embodiment of the present invention, the inputting the target keypoint information into a motion segmentation model to obtain a target class of a target object to which the target keypoint information output by the motion segmentation model belongs includes:
inputting the target key point information to a point embedding module of the motion segmentation model, and performing dimension conversion, linear conversion, normalization and nonlinear conversion on the target key point information by using the point embedding module to obtain initial point characteristics output by the point embedding module;
inputting the initial point characteristics to an attention module of the motion segmentation model, carrying out multi-layer attention characteristic extraction on the initial point characteristics by utilizing the attention module, and obtaining the target point characteristics output by the attention module based on the attention characteristics extracted from the last layer and the initial point characteristics;
inputting the target point characteristics to a characteristic fusion module of the motion segmentation model, carrying out pooling operation on the target point characteristics by utilizing the characteristic fusion module to obtain first global characteristics, and carrying out characteristic fusion on the target point characteristics and the first global characteristics to obtain the second global characteristics output by the characteristic fusion module;
And inputting the second global features to a classification head module of the motion segmentation model, and decoding the second global features by using the classification head module to obtain the target category.
Specifically, the motion segmentation model adopted in the embodiment of the invention can comprise a point embedding module, an attention module, a feature fusion module and a classification head module which are connected in sequence.
When the motion segmentation model is utilized, target key point information can be input to a point embedding module, and the point embedding module is utilized to perform dimension conversion, linear (Linear) conversion, normalization (Batch-Norm) and nonlinear conversion on the target key point information to obtain initial point characteristics output by the point embedding module. The point embedding module may be a two-stage cascade feed forward neural network (LBR), which may be similar to word embedding in natural language processing. Each target keypoint may be an integral part of the respective target object, in addition to residing in space with its unique location information. The point embedding module can map the target key point information from a low-dimensional sparse feature space to a high-dimensional feature space, and then perform linear and nonlinear operation on the target key point information so that the target key point information is easier to fit to a target function.
The embedding dimension of the point embedding module may be 128 and the number of linear transformation layers may be 2; this configuration has achieved an excellent misclassification rate on the real autonomous-driving dataset KT3DMoSeg. In addition, the linear transformations are realized entirely by convolution operations, which effectively avoids the large number of parameters that fully-connected operations would introduce. The nonlinear transformation is implemented by an activation function (ReLU).
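A minimal PyTorch sketch of such a two-stage point embedding is shown below, assuming each stage is an LBR block (Linear, Batch-Norm, ReLU) whose linear step is realized as a kernel-size-1 convolution, as described above; class names and the tensor layout are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LBR(nn.Module):
    """Linear -> BatchNorm -> ReLU, with the linear transform realized as a 1x1 convolution
    to avoid the parameter count of a fully-connected layer."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, out_dim, kernel_size=1)
        self.bn = nn.BatchNorm1d(out_dim)
        self.relu = nn.ReLU()

    def forward(self, x):                      # x: (B, C_in, N)
        return self.relu(self.bn(self.conv(x)))

class PointEmbedding(nn.Module):
    """Two cascaded LBR stages mapping D-dimensional key point information
    to the 128-dimensional embedding space."""
    def __init__(self, in_dim, embed_dim=128):
        super().__init__()
        self.lbr1 = LBR(in_dim, embed_dim)
        self.lbr2 = LBR(embed_dim, embed_dim)

    def forward(self, x):                      # x: (B, D, N) target key point information
        return self.lbr2(self.lbr1(x))         # initial point feature, (B, 128, N)
```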
After that, the initial point features output by the point embedding module can be input to the attention module, which extracts attention features over multiple iterated layers; the attention feature extracted by the last layer is then combined with the initial point features to obtain the target point features output by the attention module.
The attention module feeds the initial point feature $F_{in} \in \mathbb{R}^{N \times d_e}$ into the corresponding linear transformations to obtain the self-attention query matrix Q, the key matrix K and the value matrix V, where N is the total number of target key points in each video frame and $d_e$ is the embedding dimension of the point embedding module. Q assists in searching for a target object, K represents the features, in feature space, of the target key points belonging to the target object, and V is used to retrieve all information about the target object.
The scores of all target key points related to the target object can be obtained from the product of Q and K; by these scores the features of all target key points are decomposed into a part with stronger relevance to the target object and a part with weaker relevance, and the attention module focuses on the content with stronger relevance. The detailed mathematical procedure is as follows:

$$Q = F_{in} W_q$$

$$K = F_{in} W_k$$

$$V = F_{in} W_v$$

where $W_q$, $W_k$ and $W_v$ are all shared learnable linear transformations, and $d_k$ is the dimension of the query and key vectors. In the embodiment of the invention, $d_k$ may not be equal to $d_e$; to increase computational efficiency, $d_k$ is set to $d_e/4$.
Thereafter, when the attention features of each layer are extracted, the attention weights of the points belonging to the target object are obtained through the matrix dot product of the query matrix Q and the key matrix K,

$$\bar{A} = Q K^{T},$$

which is then normalized (scale plus softmax, or softmax followed by $\ell_1$ normalization; see Fig. 3) into the attention map $A$. The attention feature $F_{sa}$ extracted by each layer is the weighted sum of the value matrix with the corresponding attention weights:

$$F_{sa} = A V$$

After the attention feature extracted by the last layer is obtained, an offset feature between the initial point feature and the last layer's attention feature may be determined, which can be expressed as $F_{in} - F_{sa}$. The offset feature may then undergo dimension conversion, linear transformation, normalization and nonlinear transformation, and be combined with the attention feature extracted by the last layer to obtain the target point feature. Here, the offset feature may be passed through a one-stage cascaded LBR to perform the dimension conversion, linear transformation, normalization and nonlinear transformation. The one-stage cascaded LBR is similar to point embedding: the coordinate points are embedded into another dimensional space using a shared neural network to reveal semantic affinity. Similarly, the linear transformation here also uses convolution operations.
To this end, the target point feature $F_{out}$ output by the attention module can be expressed as:

$$F_{out} = \mathrm{LBR}\left(F_{in} - F_{sa}\right) + F_{sa}$$
the attention module may output the target point characteristics of each target object in the video frames.
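The following sketch illustrates one way the attention module described above could be realized in PyTorch, reusing the LBR block from the point-embedding sketch. It is an illustrative sketch, not the patented implementation: a plain softmax is used as the score normalization (Fig. 3 allows scale+softmax or softmax plus ℓ1 normalization), the reduced dimension d_k = d_e/4 is assumed, and the target point feature follows F_out = LBR(F_in − F_sa) + F_sa as given above.

```python
import torch
import torch.nn as nn

class SelfAttentionLayer(nn.Module):
    """One self-attention layer: scores Q K^T are normalized into attention weights,
    and the attention feature F_sa is the weighted sum of the value matrix."""
    def __init__(self, d_e=128):
        super().__init__()
        d_k = d_e // 4                           # reduced query/key dimension for efficiency
        self.to_q = nn.Conv1d(d_e, d_k, 1)
        self.to_k = nn.Conv1d(d_e, d_k, 1)
        self.to_v = nn.Conv1d(d_e, d_e, 1)

    def forward(self, f):                        # f: (B, d_e, N)
        q = self.to_q(f).transpose(1, 2)         # (B, N, d_k)
        k = self.to_k(f)                         # (B, d_k, N)
        v = self.to_v(f)                         # (B, d_e, N)
        attn = torch.softmax(q @ k, dim=-1)      # (B, N, N) attention weights
        return v @ attn.transpose(1, 2)          # F_sa, (B, d_e, N)

class AttentionModule(nn.Module):
    """Multi-layer (here four) iterated self-attention; the offset between the initial
    point feature and the last layer's attention feature passes through one LBR stage
    and is combined with that attention feature to give the target point feature."""
    def __init__(self, d_e=128, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList([SelfAttentionLayer(d_e) for _ in range(num_layers)])
        self.lbr = LBR(d_e, d_e)                 # LBR block from the point-embedding sketch

    def forward(self, f_in):                     # f_in: initial point feature, (B, d_e, N)
        f_sa = f_in
        for layer in self.layers:                # iterate through the self-attention layers
            f_sa = layer(f_sa)
        return self.lbr(f_in - f_sa) + f_sa      # F_out = LBR(F_in - F_sa) + F_sa
```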
And then, the target point features output by the attention module can be input to a feature fusion module, and the feature fusion module is utilized to pool the target point features so as to obtain a first global feature.
The feature fusion module can respectively perform maximum pooling operation and average pooling operation on the target point features to obtain a maximum pooling result and an average pooling result. The main purpose of pooling is to integrate target point features and obtain better feature responses.
Then, the maximum pooling result and the average pooling result can be fused to obtain the first global feature.
And then, carrying out feature fusion on the target point features and the first global features to obtain second global features output by the feature fusion module.
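A sketch of the feature fusion module under these descriptions is given below. How the max- and average-pooling results are fused and then attached to every point feature is an assumption (concatenation), chosen so that 128-dimensional target point features yield the 384-dimensional input of the classification head mentioned later.

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Pool the target point features into a first global feature, then fuse the global
    feature back with the per-point features to obtain the second global feature."""
    def forward(self, f_out):                               # f_out: (B, C, N), C = 128
        g_max = f_out.max(dim=2).values                     # (B, C) max-pooling result
        g_avg = f_out.mean(dim=2)                           # (B, C) average-pooling result
        g1 = torch.cat([g_max, g_avg], dim=1)               # first global feature (assumed: concat)
        g1_rep = g1.unsqueeze(2).expand(-1, -1, f_out.size(2))  # copy to every point ("R" in Fig. 2)
        return torch.cat([f_out, g1_rep], dim=1)            # second global feature, (B, 3C, N)
```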
And finally, inputting the second global features into a classification head module of the motion segmentation model, and decoding the second global features by using the classification head module to obtain target categories.
The classification head module can be realized by adopting two-stage cascaded LBR and one-stage general linear transformation operation, namely, dimension conversion, linear transformation, normalization and nonlinear transformation are firstly carried out on the second global feature to obtain an alternative feature; and then carrying out linear transformation on the alternative features to obtain the target category.
The two-level LBR operates by embedding a second global feature of each target object into a low-dimensional space, representing the semantic affinity between multiple target objects in the space. Furthermore, the combination of linear and nonlinear operations allows the learned information about the target object to better fit the objective function. Then, category information of the point feature of each target object is embedded into the category space by linear transformation. Here, the linear transformation also uses convolution operations.
The number of linear transformation layers in the classification head module may be 4, and the results obtained on KT3DMoSeg prove that the classification head module achieves good gains.
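A hedged sketch of such a classification head is given below, reusing the LBR block from the earlier sketch. The stage widths follow the 384 and 128 dimensions quoted in the embodiment, while the exact arrangement of the linear layers (the text above mentions four linear transformation layers; this sketch uses three) is an assumption.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Two cascaded LBR stages followed by one general linear transform (a 1x1 convolution)
    that embeds the per-point features into the K-dimensional category space."""
    def __init__(self, in_dim=384, mid_dim=128, num_classes=2):
        super().__init__()
        self.lbr1 = LBR(in_dim, in_dim)                    # first-stage LBR, dimension 384
        self.lbr2 = LBR(in_dim, mid_dim)                   # second-stage LBR, dimension 128
        self.fc = nn.Conv1d(mid_dim, num_classes, 1)       # final linear transform to K classes

    def forward(self, g2):                                 # g2: second global feature, (B, 384, N)
        return self.fc(self.lbr2(self.lbr1(g2)))           # per-key-point class logits, (B, K, N)
```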
In the embodiment of the invention, the motion segmentation model can be trained in a highly parallelized manner, which ensures excellent performance while delivering a high motion segmentation speed, thereby improving motion segmentation efficiency.
On the basis of the above embodiment, the motion segmentation method provided in the embodiment of the present invention, the motion segmentation model is trained based on the following steps:
inputting the sample key point information into an initial segmentation model to obtain an output result of the initial segmentation model;
based on the output result and the class label, calculating model loss by adopting a BCEWithLogitsLoss function;
And carrying out iterative optimization on the model parameters of the initial segmentation model by taking the minimized model loss as an optimization target to obtain the motion segmentation model.
Specifically, in the embodiment of the present invention, in the process of training the initial segmentation model to obtain the motion segmentation model, the BCEWithLogitsLoss function may be used as the loss function (L) of the initial segmentation model; this loss function combines a Sigmoid layer and BCELoss in a single operation. It is numerically more stable than applying an ordinary Sigmoid followed by BCELoss, because fusing the two operations in one layer exploits the properties of the logarithm and exponential to improve numerical stability.
The BCEWithLogitsLoss function can be expressed as:

$$L = -\frac{1}{N}\sum_{n=1}^{N} w_{n}\left[y_{n}\log\sigma\left(x_{n}\right)+\left(1-y_{n}\right)\log\left(1-\sigma\left(x_{n}\right)\right)\right]$$

where $\sigma(\cdot)$ is the Sigmoid function, $w_{n}$ is the weight of the loss function, $x_{n}$ is the output result of the initial segmentation model, namely the class prediction for each sample key point, $y_{n}$ is the class label of the sample object to which each sample key point belongs, and $N$ is the number of sample key points of the sample objects in the batch.
In the embodiment of the invention, an Adam optimizer is adopted to minimize model loss.
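A minimal training-step sketch with BCEWithLogitsLoss and Adam, as described above, might look as follows; the data layout and label encoding are assumptions, and the learning rate follows the experiments reported later.

```python
import torch
import torch.nn as nn

def train_epoch(model, loader, lr=1e-3):
    """One pass over the sample key point data: compute the output of the initial
    segmentation model, measure the model loss with BCEWithLogitsLoss (Sigmoid and
    BCELoss fused for numerical stability), and iteratively optimize with Adam."""
    criterion = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for keypoints, labels in loader:            # labels: per-key-point class labels (float targets)
        logits = model(keypoints)               # output result of the initial segmentation model
        loss = criterion(logits, labels.float())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                        # step toward the minimized model loss
```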
In the embodiment of the invention, in order to verify whether the motion segmentation method can be applied to multi-class segmentation problems, a standard K-means algorithm is applied to the output result of the motion segmentation model during testing. During the test, the number of classes of the target objects to be segmented does not need to be specified in advance. If the number of categories of the target objects to be estimated is K, the K-means residual is defined as:

$$r_{K} = \sum_{k=1}^{K}\sum_{x_{i}\in C_{k}}\left\| x_{i}-\mu_{k}\right\|^{2}$$

where $\mu_{k}$ is the center of the $k$-th cluster $C_{k}$, used to characterize the class of that cluster, and the number of samples in each cluster is denoted $N_{k}$. It has been demonstrated that a reasonable estimate of K results in a low $r_{K}$, while increasing K further does not significantly decrease $r_{K}$. From this it can be seen that the motion segmentation method described above can be applied to multi-class segmentation problems.
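As an illustration of this test-time procedure, the residual r_K can be computed over the model's per-point output features for a range of K and inspected for an elbow; the sketch below uses scikit-learn's KMeans, whose inertia_ attribute is exactly the within-cluster residual (the feature shape and the range of K are assumptions).

```python
from sklearn.cluster import KMeans

def kmeans_residuals(point_features, k_max=6):
    """Return {K: r_K} for K = 1..k_max, where r_K is the within-cluster
    sum of squared distances to the cluster centers."""
    residuals = {}
    for k in range(1, k_max + 1):
        km = KMeans(n_clusters=k, n_init=10).fit(point_features)   # point_features: (N, C) array
        residuals[k] = km.inertia_                                  # r_K
    return residuals
```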
As shown in fig. 2, on the basis of the above embodiment, the motion segmentation method provided in the embodiment of the present invention includes:
For the target key point information X (N×D) in each video frame of the input video clip, where D is the dimension of the target key point information in each video frame, the point embedding module of the motion segmentation model produces the initial point feature $F_{in}$. The point embedding module is composed of two cascaded LBR stages, and the dimension of each stage is the embedding dimension of the point embedding module, namely 128.
The initial point feature $F_{in}$ passes through the attention module of the motion segmentation model to obtain the target point feature $F_{out}$. The attention module may be composed of four self-attention layers and one cascaded LBR stage; the dimensions of the four self-attention layers and of the LBR stage are all 128.
The target point feature $F_{out}$ is pooled by the feature fusion module of the motion segmentation model to obtain the first global feature, which is combined with the target point feature $F_{out}$ to obtain the second global feature. R in Fig. 2 denotes a copy operation.
The second global feature passes through the two cascaded LBR stages and the one-stage general linear transformation of the classification head of the motion segmentation model to give the output target class Z (N×K). The dimension of the first cascaded LBR stage is 384, the dimension of the second cascaded LBR stage is 128, and K is the number of target categories.
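Putting the pieces together, an end-to-end sketch with the Fig. 2 dimensions, reusing the module classes sketched earlier, could look as follows; it is illustrative only, not the patented code.

```python
import torch
import torch.nn as nn

class MotionSegmentationModel(nn.Module):
    """X (B, D, N) -> point embedding (128) -> attention module -> feature fusion (384)
    -> classification head -> per-key-point class logits Z (B, K, N)."""
    def __init__(self, in_dim, num_classes, embed_dim=128):
        super().__init__()
        self.embed = PointEmbedding(in_dim, embed_dim)
        self.attention = AttentionModule(embed_dim)
        self.fusion = FeatureFusion()
        self.head = ClassificationHead(3 * embed_dim, embed_dim, num_classes)

    def forward(self, x):
        f_in = self.embed(x)            # initial point feature
        f_out = self.attention(f_in)    # target point feature
        g2 = self.fusion(f_out)         # second global feature
        return self.head(g2)            # target class scores
```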
Fig. 3 is a schematic structural diagram of an attention module in a motion segmentation model used in the motion segmentation method according to an embodiment of the present invention.
As shown in Fig. 3, the input of the attention module is the initial point feature $F_{in}$. The query matrix Q and the value matrix V are obtained after convolution, and the key matrix K is obtained after a further transposition. The scores of all target key points related to the target object are obtained through the matrix multiplication Q × K, and according to the magnitude of these scores the features of all target key points are decomposed into a part with stronger correlation to the target object (SL) and a part with weaker correlation (SS). The SL branch is selected to obtain the attention map, which is then matrix-multiplied with the value matrix V to obtain the attention feature $F_{sa}$ extracted by each layer. The procedure then iterates through the second, third and fourth self-attention layers; matrix subtraction between the initial point feature and the attention feature obtained by the fourth self-attention layer yields the offset feature, which passes through the one-stage cascaded LBR and is then matrix-added with the last layer's attention feature to obtain the target point feature $F_{out}$.
In Fig. 3, T denotes transposition, SS denotes scale + softmax, SL denotes softmax + $\ell_1$ norm, "×" denotes matrix multiplication, "+" denotes matrix addition, and "−" denotes matrix subtraction; the switch symbols between SS/SL and the attention map, and between "−" and the LBR, both denote a selection between the two alternatives.
In summary, as shown in Fig. 4, in the process of obtaining the target category, the motion segmentation model first performs multi-layer attention feature extraction on the input target key point information, computes the weight information of the key points, and focuses on the complex relationships among the target key points belonging to a target object to obtain the target point features; the multi-layer structure of the motion segmentation model establishes long-distance dependencies and greatly improves the model's ability to describe the target object. Second, a pooling operation is applied to the target point features to obtain the first global feature, and the target point features are fused with the first global feature to obtain the second global feature, so that the segmentation accuracy for the target object is improved by fully fusing low-level and high-level key point features. Then the second global feature is decoded by the classification head module: it is mapped to a low-dimensional classification space that expresses the semantic affinity between target key points, and the feature points are decoded into target categories. Finally, the target key points carrying the target category of their target object are output.
The motion segmentation model in the embodiment of the invention is implemented with PyTorch and runs on an A100 PCIE-40GB GPU. The motion segmentation model does not use any pre-trained model; instead, training starts from random model weights. In addition, for all the motion segmentation tasks mentioned in the embodiment of the invention, training is performed for 300 epochs with the learning rate kept at 1×10^{-3} throughout. The Adam optimization algorithm is adopted to optimize the network. The mean misclassification rate is used as the evaluation index on the KT3DMoSeg and AdelaideRMF datasets.
Table 1 shows the experimental results of the motion segmentation model in the embodiment of the invention on the KT3DMoSeg dataset and compares its misclassification rate with existing mainstream motion segmentation algorithms. Whether under the normal setting or the enhanced setting, the motion segmentation model in the embodiment of the invention achieves better performance than the existing methods on the 22 KT3DMoSeg sequences. Specifically, it achieves a mean misclassification rate of 7.23% under the normal setting and 5.22% under the enhanced setting. This result fully demonstrates the advantage of the attention module of the motion segmentation model in learning the complex relationships of multiple target objects.
Table 1. Performance comparison between the motion segmentation model in the embodiment of the present invention and mainstream motion segmentation algorithms on KT3DMoSeg
The performances under the normal and enhanced settings are separated by "/" in Table 1.
Table 2 shows the experimental results of the motion segmentation model in the embodiment of the invention on the AdelaideRMF dataset. Compared with existing state-of-the-art methods, the motion segmentation model achieves a mean misclassification rate of 4.76%, showing an excellent segmentation effect on moving objects that clearly surpasses the existing methods; this is likewise attributed to the attention module in the motion segmentation model, which captures the multi-scale information of the moving objects and fully learns a more comprehensive feature representation.
Table 2. Performance comparison between the motion segmentation model in the embodiment of the present invention and existing state-of-the-art methods on AdelaideRMF
As shown in fig. 5, on the basis of the above embodiment, there is provided a motion segmentation apparatus according to an embodiment of the present invention, including:
an obtaining module 51, configured to obtain target key point information in each video frame of the video clip;
the segmentation module 52 is configured to input the target key point information to a motion segmentation model, and obtain a target class of a target object to which the target key point information output by the motion segmentation model belongs;
The motion segmentation model is obtained based on sample key point information of sample objects in each video frame sample of a video sample and class labels of the sample objects;
the motion segmentation model is used for sequentially extracting multi-layer attention features of the target key point information to obtain target point features; pooling the target point features to obtain a first global feature; performing feature fusion on the target point feature and the first global feature to obtain a second global feature; and decoding the second global feature to obtain the target category.
On the basis of the foregoing embodiments, the motion segmentation apparatus provided in the embodiments of the present invention, the segmentation module is specifically configured to:
inputting the target key point information to a point embedding module of the motion segmentation model, and performing dimension conversion, linear conversion, normalization and nonlinear conversion on the target key point information by using the point embedding module to obtain initial point characteristics output by the point embedding module;
inputting the initial point characteristics to an attention module of the motion segmentation model, carrying out multi-layer attention characteristic extraction on the initial point characteristics by utilizing the attention module, and obtaining the target point characteristics output by the attention module based on the attention characteristics extracted from the last layer and the initial point characteristics;
Inputting the target point characteristics to a characteristic fusion module of the motion segmentation model, carrying out pooling operation on the target point characteristics by utilizing the characteristic fusion module to obtain first global characteristics, and carrying out characteristic fusion on the target point characteristics and the first global characteristics to obtain the second global characteristics output by the characteristic fusion module;
and inputting the second global features to a classification head module of the motion segmentation model, and decoding the second global features by using the classification head module to obtain the target category.
On the basis of the foregoing embodiments, the motion segmentation apparatus provided in the embodiments of the present invention, the segmentation module is specifically configured to:
determining an offset feature between the initial point feature and the last layer of extracted attention feature;
and performing dimension conversion, linear transformation, normalization and nonlinear transformation on the offset characteristic, and obtaining the target point characteristic based on the attention characteristic extracted by the last layer.
On the basis of the foregoing embodiments, the motion segmentation apparatus provided in the embodiments of the present invention, the segmentation module is specifically configured to:
respectively carrying out maximum pooling operation and average pooling operation on the target point features by utilizing the feature fusion module to obtain a maximum pooling result and an average pooling result;
And fusing the maximum pooling result and the average pooling result to obtain the first global feature.
On the basis of the foregoing embodiments, the motion segmentation apparatus provided in the embodiments of the present invention, the segmentation module is specifically configured to:
performing dimension conversion, linear transformation, normalization and nonlinear transformation on the second global feature by using the classification head module to obtain an alternative feature;
and carrying out linear transformation on the alternative features to obtain the target category.
On the basis of the above embodiments, the motion segmentation device provided in the embodiments of the present invention, the linear transformation is implemented based on convolution operation.
On the basis of the above embodiment, the motion segmentation apparatus provided in the embodiment of the present invention further includes a training module, configured to:
inputting the sample key point information into an initial segmentation model to obtain an output result of the initial segmentation model;
based on the output result and the class label, calculating model loss by adopting a BCEWithLogitsLoss function;
and carrying out iterative optimization on the model parameters of the initial segmentation model by taking the minimized model loss as an optimization target to obtain the motion segmentation model.
Specifically, the functions of each module in the motion segmentation device provided in the embodiment of the present invention are in one-to-one correspondence with the operation flow of each step in the above method embodiment, and the achieved effects are consistent.
Fig. 6 illustrates a physical schematic diagram of an electronic device, as shown in fig. 6, which may include: processor (Processor) 610, communication interface (Communications Interface) 620, memory (Memory) 630, and communication bus 640, wherein Processor 610, communication interface 620, and Memory 630 communicate with each other via communication bus 640. The processor 610 may invoke logic instructions in the memory 630 to perform the motion segmentation method provided in the embodiments described above, the method comprising: acquiring target key point information in each video frame of a video clip; inputting the target key point information into a motion segmentation model to obtain a target category of a target object to which the target key point information output by the motion segmentation model belongs; the motion segmentation model is obtained based on sample key point information of sample objects in each video frame sample of a video sample and class labels of the sample objects; the motion segmentation model is used for sequentially extracting multi-layer attention features of the target key point information to obtain target point features; pooling the target point features to obtain a first global feature; performing feature fusion on the target point feature and the first global feature to obtain a second global feature; and decoding the second global feature to obtain the target category.
Further, the logic instructions in the memory 630 may be implemented in the form of software functional units and stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product comprising a computer program, the computer program being storable on a non-transitory computer readable storage medium, the computer program, when executed by a processor, being capable of performing the motion segmentation method provided in the above embodiments, the method comprising: acquiring target key point information in each video frame of a video clip; inputting the target key point information into a motion segmentation model to obtain a target category of a target object to which the target key point information output by the motion segmentation model belongs; the motion segmentation model is obtained based on sample key point information of sample objects in each video frame sample of a video sample and class labels of the sample objects; the motion segmentation model is used for sequentially extracting multi-layer attention features of the target key point information to obtain target point features; pooling the target point features to obtain a first global feature; performing feature fusion on the target point feature and the first global feature to obtain a second global feature; and decoding the second global feature to obtain the target category.
In yet another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, is implemented to perform the motion segmentation method provided in the above embodiments, the method comprising: acquiring target key point information in each video frame of a video clip; inputting the target key point information into a motion segmentation model to obtain a target category of a target object to which the target key point information output by the motion segmentation model belongs; the motion segmentation model is obtained based on sample key point information of sample objects in each video frame sample of a video sample and class labels of the sample objects; the motion segmentation model is used for sequentially extracting multi-layer attention features of the target key point information to obtain target point features; pooling the target point features to obtain a first global feature; performing feature fusion on the target point feature and the first global feature to obtain a second global feature; and decoding the second global feature to obtain the target category.
The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (9)

1. A method of motion segmentation, comprising:
acquiring target key point information in each video frame of a video clip;
inputting the target key point information into a motion segmentation model to obtain a target category, output by the motion segmentation model, of a target object to which the target key point information belongs;
the motion segmentation model is obtained based on sample key point information of sample objects in each video frame sample of a video sample and class labels of the sample objects;
the motion segmentation model is used for sequentially performing multi-layer attention feature extraction on the target key point information to obtain target point features; pooling the target point features to obtain a first global feature; performing feature fusion on the target point features and the first global feature to obtain a second global feature; and decoding the second global feature to obtain the target category;
wherein inputting the target key point information into the motion segmentation model to obtain the target category, output by the motion segmentation model, of the target object to which the target key point information belongs comprises:
inputting the target key point information into a point embedding module of the motion segmentation model, and performing dimension conversion, linear transformation, normalization and nonlinear transformation on the target key point information by using the point embedding module to obtain initial point features output by the point embedding module;
inputting the initial point features into an attention module of the motion segmentation model, performing multi-layer attention feature extraction on the initial point features by using the attention module, and obtaining the target point features output by the attention module based on the attention features extracted by the last layer and the initial point features;
inputting the target point features into a feature fusion module of the motion segmentation model, performing a pooling operation on the target point features by using the feature fusion module to obtain the first global feature, and performing feature fusion on the target point features and the first global feature to obtain the second global feature output by the feature fusion module; and
inputting the second global feature into a classification head module of the motion segmentation model, and decoding the second global feature by using the classification head module to obtain the target category.
2. The motion segmentation method according to claim 1, wherein obtaining the target point features output by the attention module based on the attention features extracted by the last layer and the initial point features comprises:
determining an offset feature between the initial point features and the attention features extracted by the last layer; and
performing dimension conversion, linear transformation, normalization and nonlinear transformation on the offset feature, and obtaining the target point features based on the attention features extracted by the last layer (an illustrative sketch of this construction follows the claims).
3. The motion segmentation method according to claim 1, wherein performing the pooling operation on the target point features by using the feature fusion module to obtain the first global feature comprises:
performing a maximum pooling operation and an average pooling operation on the target point features, respectively, by using the feature fusion module to obtain a maximum pooling result and an average pooling result;
and fusing the maximum pooling result and the average pooling result to obtain the first global feature.
4. The motion segmentation method according to claim 1, wherein decoding the second global feature by using the classification head module to obtain the target category comprises:
performing dimension conversion, linear transformation, normalization and nonlinear transformation on the second global feature by using the classification head module to obtain a candidate feature; and
performing a linear transformation on the candidate feature to obtain the target category.
5. The motion segmentation method according to any one of claims 1-4, characterized in that the linear transformation is implemented based on a convolution operation.
6. The method of motion segmentation according to any one of claims 1-4, wherein the motion segmentation model is trained based on:
inputting the sample key point information into an initial segmentation model to obtain an output result of the initial segmentation model;
calculating a model loss based on the output result and the class labels by using a BCEWithLogitsLoss function; and
performing iterative optimization on the model parameters of the initial segmentation model with minimization of the model loss as the optimization objective, to obtain the motion segmentation model (an illustrative training sketch follows the claims).
7. A motion segmentation apparatus, comprising:
the acquisition module is used for acquiring target key point information in each video frame of the video clip;
the segmentation module is used for inputting the target key point information into a motion segmentation model to obtain a target category, output by the motion segmentation model, of a target object to which the target key point information belongs;
the motion segmentation model is obtained based on sample key point information of sample objects in each video frame sample of a video sample and class labels of the sample objects;
the motion segmentation model is used for sequentially performing multi-layer attention feature extraction on the target key point information to obtain target point features; pooling the target point features to obtain a first global feature; performing feature fusion on the target point features and the first global feature to obtain a second global feature; and decoding the second global feature to obtain the target category;
the segmentation module is specifically configured to:
inputting the target key point information into a point embedding module of the motion segmentation model, and performing dimension conversion, linear transformation, normalization and nonlinear transformation on the target key point information by using the point embedding module to obtain initial point features output by the point embedding module;
inputting the initial point features into an attention module of the motion segmentation model, performing multi-layer attention feature extraction on the initial point features by using the attention module, and obtaining the target point features output by the attention module based on the attention features extracted by the last layer and the initial point features;
inputting the target point features into a feature fusion module of the motion segmentation model, performing a pooling operation on the target point features by using the feature fusion module to obtain the first global feature, and performing feature fusion on the target point features and the first global feature to obtain the second global feature output by the feature fusion module; and
inputting the second global feature into a classification head module of the motion segmentation model, and decoding the second global feature by using the classification head module to obtain the target category.
8. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the motion segmentation method according to any one of claims 1-6.
9. A non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the motion segmentation method according to any one of claims 1-6.
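The offset-based construction recited in claim 2 admits, for example, the following hypothetical realization, written in the spirit of offset attention. The class name OffsetOutput and the choice of a 1x1 convolution (standing in for the linear transformation, cf. claim 5), batch normalization and ReLU are assumptions introduced only for illustration.

from torch import nn

class OffsetOutput(nn.Module):
    # Hypothetical realization of claim 2: the attention module's output is built
    # from the offset between the initial point features and the attention
    # features extracted by the last layer.
    def __init__(self, dim: int):
        super().__init__()
        # linear transformation realized as a 1x1 convolution, followed by
        # normalization and a nonlinear transformation
        self.lbr = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size=1),
            nn.BatchNorm1d(dim),
            nn.ReLU(),
        )

    def forward(self, initial_feat, last_attn_feat):    # both: (B, N, dim)
        offset = initial_feat - last_attn_feat           # offset feature
        # dimension conversion to (B, dim, N) for Conv1d/BatchNorm1d, then back
        offset = self.lbr(offset.transpose(1, 2)).transpose(1, 2)
        return offset + last_attn_feat                   # target point features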
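The training procedure of claim 6 can likewise be sketched as a short, hypothetical PyTorch loop. The MotionSegmentationModel sketch given earlier, the synthetic data, the Adam optimizer and the epoch count are assumptions; only the use of BCEWithLogitsLoss and iterative minimization of the model loss follow the claim.

import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in data: 64 clips of 17 key points with (x, y, confidence)
# coordinates, and multi-hot class labels over 10 classes. Purely illustrative.
points = torch.randn(64, 17, 3)
labels = (torch.rand(64, 10) > 0.5).float()
train_loader = DataLoader(TensorDataset(points, labels), batch_size=8, shuffle=True)

model = MotionSegmentationModel(num_classes=10)       # stands in for the initial segmentation model
criterion = nn.BCEWithLogitsLoss()                    # loss function named in claim 6
optimizer = optim.Adam(model.parameters(), lr=1e-3)   # optimizer choice is an assumption

for epoch in range(10):                               # iterative optimization
    for batch_points, batch_labels in train_loader:   # sample key points + class labels
        logits = model(batch_points)                  # output result of the model
        loss = criterion(logits, batch_labels)        # model loss from output and labels
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                              # step toward minimizing the model loss
# the optimized parameters constitute the trained motion segmentation model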
CN202310290423.0A 2023-03-23 2023-03-23 Motion segmentation method, motion segmentation device, electronic equipment and storage medium Active CN115994922B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310290423.0A CN115994922B (en) 2023-03-23 2023-03-23 Motion segmentation method, motion segmentation device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310290423.0A CN115994922B (en) 2023-03-23 2023-03-23 Motion segmentation method, motion segmentation device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN115994922A (en) 2023-04-21
CN115994922B (en) 2023-06-02

Family

ID=85995362

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310290423.0A Active CN115994922B (en) 2023-03-23 2023-03-23 Motion segmentation method, motion segmentation device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115994922B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022041596A1 (en) * 2020-08-31 2022-03-03 同济人工智能研究院(苏州)有限公司 Visual slam method applicable to indoor dynamic environment
CN114399708A (en) * 2021-12-30 2022-04-26 复旦大学 Video motion migration deep learning system and method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109492608B (en) * 2018-11-27 2019-11-05 腾讯科技(深圳)有限公司 Image partition method, device, computer equipment and storage medium
CN111079658B (en) * 2019-12-19 2023-10-31 北京海国华创云科技有限公司 Multi-target continuous behavior analysis method, system and device based on video
CN111325190B (en) * 2020-04-01 2023-06-30 京东方科技集团股份有限公司 Expression recognition method and device, computer equipment and readable storage medium

Also Published As

Publication number Publication date
CN115994922A (en) 2023-04-21

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant