CN112307883B - Training method, training device, electronic equipment and computer readable storage medium - Google Patents


Info

Publication number
CN112307883B
CN112307883B (application CN202010763380.XA, published as CN112307883A)
Authority
CN
China
Prior art keywords
query
image block
codes
key value
video
Prior art date
Legal status
Active
Application number
CN202010763380.XA
Other languages
Chinese (zh)
Other versions
CN112307883A (en)
Inventor
潘滢炜
姚霆
梅涛
Current Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Jingdong Shangke Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN202010763380.XA priority Critical patent/CN112307883B/en
Publication of CN112307883A publication Critical patent/CN112307883A/en
Application granted granted Critical
Publication of CN112307883B publication Critical patent/CN112307883B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands

Abstract

The present disclosure relates to a training method, apparatus, electronic device, and computer-readable storage medium, and relates to the field of computer technology. The method of the present disclosure comprises: selecting a plurality of frames of images of each sample video, respectively extracting image blocks from the plurality of frames of images, and taking one image block in the extracted image blocks as a query image block; inputting each image block into a visual feature extraction model to obtain codes corresponding to each image block, wherein the codes corresponding to the query image blocks are used as query codes; and determining a first contrast loss function according to the similarity between the query codes of each sample video and codes corresponding to other image blocks in the same sample video and the similarity between the query codes of each sample video and codes corresponding to image blocks in different sample videos, and adjusting parameters of the visual feature extraction model according to the loss function of the visual feature extraction model, wherein the loss function of the visual feature extraction model comprises the first contrast loss function.

Description

Training method, training device, electronic equipment and computer readable storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a training method, a training apparatus, an electronic device, and a computer readable storage medium.
Background
In recent years, artificial intelligence technology has developed rapidly. Computer vision is an important branch of the field of artificial intelligence and has already achieved certain results. Computer vision includes the computer's understanding and processing of images, videos, and the like; among these, video understanding and processing is more complex.
Extracting visual features of a video is a very critical part of video understanding, and the accuracy of visual feature extraction directly affects the accuracy of video understanding and of the results of downstream tasks (e.g., action recognition, object tracking). The visual features may be extracted using deep learning methods. Deep learning includes supervised learning, unsupervised learning, etc. Currently, supervised learning has made significant progress and is dominant in visual feature learning for video.
Disclosure of Invention
The inventors found that: the success of supervised learning largely depends on a large number of specialized labels needed to train the deep neural network, and the labeling process is complex and cumbersome. In addition, supervised learning is performed for very specific tasks, and the obtained visual feature extraction model is difficult to apply to other tasks, so there is a generalization problem.
One technical problem to be solved by the present disclosure is: a new training method for an unsupervised visual feature extraction model is provided.
According to some embodiments of the present disclosure, there is provided a training method comprising: selecting a plurality of frames of images of each sample video, respectively extracting image blocks from the plurality of frames of images, and taking one image block in the extracted image blocks as a query image block; inputting each image block into a visual feature extraction model to obtain codes corresponding to each image block, wherein the codes corresponding to the query image blocks are used as query codes; determining a first contrast loss function according to the similarity between the query codes of each sample video and codes corresponding to other image blocks in the same sample video and the similarity between the query codes of each sample video and codes corresponding to image blocks in different sample videos, wherein the higher the similarity between the query codes and the codes corresponding to other image blocks in the same sample video is, and the lower the similarity between the query codes and the codes corresponding to image blocks in different sample videos is, the smaller the value of the first contrast loss function is; and adjusting parameters of the visual feature extraction model according to a loss function of the visual feature extraction model, and training the visual feature extraction model, wherein the loss function of the visual feature extraction model comprises the first contrast loss function.
In some embodiments, the frame in which the query image block is located is taken as an anchor frame, the extracted image blocks further include another image block, different from the query image block, extracted from the anchor frame as a first key value image block, and the method further includes: determining a second contrast loss function according to the similarity between the query codes of each sample video and the codes corresponding to the first key value image blocks and the similarity between the query codes and the codes corresponding to the image blocks extracted from other frames in the same sample video, wherein the higher the similarity between the query codes and the codes corresponding to the first key value image blocks is, and the lower the similarity between the query codes and the codes corresponding to the image blocks extracted from other frames in the same sample video is, the smaller the value of the second contrast loss function is; wherein the loss function of the visual feature extraction model further comprises the second contrast loss function.
In some embodiments, the frame in which the query image block is located is used as an anchor frame, where the anchor frame is the first frame or the last frame of the multi-frame images arranged in time sequence, and the method further includes: for each sample video, combining the query code and the codes corresponding to the image blocks extracted from other frames in the same sample video into a sequence code according to a preset sequence; inputting the sequence code into a classification model to obtain the prediction time order, in the sample video, of the query image block and the image blocks extracted from other frames in the same sample video; determining a third loss function according to the prediction time order corresponding to each sample video and the real time order, in the sample video, of the query image block and the image blocks extracted from other frames in the same sample video; wherein the loss function of the visual feature extraction model further comprises the third loss function.
In some embodiments, the visual feature extraction model includes a query encoder for obtaining a query code and a key encoder for obtaining codes corresponding to image blocks other than the query image block; adjusting parameters of the visual feature extraction model according to a loss function of the visual feature extraction model includes: in each iteration, the parameters of the current iteration of the query encoder are adjusted according to the loss function of the visual characteristic extraction model, and the parameters of the current iteration of the key encoder are adjusted according to the parameters of the last iteration of the query encoder and the parameters of the last iteration of the key encoder.
In some embodiments, the frame where the query image block is located is used as an anchor frame, the extracted image blocks further include another image block, different from the query image block, extracted from the anchor frame as a first key value image block, and one image block is extracted from each of two other frames of the same sample video as a second key value image block and a third key value image block; determining the first contrast loss function according to the similarity between the query codes of each sample video and codes corresponding to other image blocks in the same sample video and the similarity between the query codes of each sample video and codes corresponding to image blocks in different sample videos comprises: for each sample video, determining an inter-frame loss function corresponding to the sample video according to the similarity of the query code to the first key value code corresponding to the first key value image block, the second key value code corresponding to the second key value image block, and the third key value code corresponding to the third key value image block, respectively, and the similarity of the query code to each negative key value code, wherein the negative key value codes comprise the first key value codes, the second key value codes, and the third key value codes corresponding to other sample videos; and determining the first contrast loss function according to the inter-frame loss functions corresponding to each sample video.
In some embodiments, the image blocks extracted from other frames in the same sample video include one image block extracted from each of two other frames in the same sample video, as a second key value image block and a third key value image block corresponding to the sample video; determining the second contrast loss function according to the similarity between the query codes of the sample videos and the codes corresponding to the first key value image blocks and the similarity between the query codes and the codes corresponding to the image blocks extracted from other frames in the same sample video comprises: for each sample video, determining an intra-frame loss function corresponding to the sample video according to the similarity of the query code to the first key value code corresponding to the first key value image block and the similarity of the query code to the second key value code corresponding to the second key value image block and the third key value code corresponding to the third key value image block, respectively; and determining the second contrast loss function according to the intra-frame loss functions corresponding to each sample video.
In some embodiments, the extracted image blocks further include another image block, different from the query image block, extracted from the anchor frame as a first key value image block, and one image block is extracted from each of two other frames of the same sample video as a second key value image block and a third key value image block; combining the query code and the codes corresponding to the image blocks extracted from the other frames of the same sample video into a sequence code according to a preset sequence includes: generating the sequence code according to the order of the query code, the second key value code corresponding to the second key value image block, and the third key value code corresponding to the third key value image block; inputting the sequence code into the classification model to obtain the prediction time order, in the sample video, of the query image block and the image blocks extracted from other frames in the same sample video comprises: inputting the sequence code into the classification model to obtain, as the prediction time order, a result of whether the query image block is before or after the second key value image block and the third key value image block; determining the third loss function according to the prediction time order corresponding to each sample video and the real time order, in the sample video, of the query image block and the image blocks extracted from other frames in the same sample video comprises: determining a cross entropy loss function corresponding to each sample video according to the prediction time order and the real time order of the query image block, the second key value image block, and the third key value image block in the sample video, and determining the third loss function according to the cross entropy loss functions corresponding to each sample video.
In some embodiments, the method further comprises: determining similarity between the query code and the first, second and third key value codes according to dot products of the query code and the first, second and third key value codes respectively; and determining the similarity of the query code and each negative key value code according to the dot product of the query code and each negative key value code.
In some embodiments, the corresponding interframe loss function for each sample video is determined using the following formula:
$L_{inter} = \sum_{i=1}^{3} -\log \frac{\exp(s_q \cdot s_k^i / \tau)}{\exp(s_q \cdot s_k^i / \tau) + \sum_{j=1}^{K} \exp(s_q \cdot s_n^j / \tau)}$

wherein s_q is the query code, 1 ≤ i ≤ 3 and i is a positive integer, s_k^1 is the first key value code, s_k^2 is the second key value code, s_k^3 is the third key value code, 1 ≤ j ≤ K and j is a positive integer, K is the total number of negative key value codes, s_n^j is the j-th negative key value code, and τ is a hyper-parameter.
In some embodiments, the intra-loss function for each sample video is determined using the following formula:
$L_{intra} = -\log \frac{\exp(s_q \cdot s_k^1 / \tau)}{\exp(s_q \cdot s_k^1 / \tau) + \exp(s_q \cdot s_k^2 / \tau) + \exp(s_q \cdot s_k^3 / \tau)}$

wherein s_q is the query code, s_k^1 is the first key value code, s_k^2 is the second key value code, s_k^3 is the third key value code, and τ is a hyper-parameter.
In some embodiments, the cross entropy loss function for each sample video is determined using the following formula:
$L_{order} = -\big[\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \,\big], \quad \hat{y} = g([s_q; s_k^2; s_k^3])$

wherein s_q is the query code, s_k^2 is the second key value code, s_k^3 is the third key value code, g(·) is the classification model, [·;·;·] denotes concatenation into the sequence code, and y ∈ {0,1} represents whether, in the real time order of the sample video, the query code s_q is before or after the second key value code s_k^2 and the third key value code s_k^3.
In some embodiments, the loss function of the visual feature extraction model is a weighted result of the first contrast loss function, the second contrast loss function, and the third loss function.
According to other embodiments of the present disclosure, there is provided an action recognition method including: extracting a first preset number of frames from the video to be identified; determining the codes of the images of each frame by utilizing the visual characteristic extraction model obtained by the training method of any embodiment; and inputting codes of the images of each frame into an action classification model to obtain the action type in the video to be identified.
According to still further embodiments of the present disclosure, there is provided a behavior recognition method including: extracting a second preset number of frames from the video to be identified; determining the codes of the images of each frame by utilizing the visual characteristic extraction model obtained by the training method of any embodiment; and inputting codes of the images of each frame into a behavior classification model to obtain the behavior type in the video to be identified.
According to still further embodiments of the present disclosure, there is provided an object tracking method including: determining the codes of all frame images of the video to be identified by utilizing the visual feature extraction model obtained by the training method of any embodiment, wherein the first frame image of the video to be identified is marked with position information of the object; and inputting the codes of the frame images into an object tracking model to obtain the position information of the object in each frame image.
According to still further embodiments of the present disclosure, there is provided a feature extraction method of a video, including: extracting a third preset number of frames from the video; and determining the codes of the images of each frame by using the visual characteristic extraction model obtained by the training method of any embodiment.
According to still further embodiments of the present disclosure, there is provided a training device comprising: the extraction module is configured to select multiple frames of images of each sample video, respectively extract image blocks from the multiple frames of images, and take one image block in the extracted image blocks as a query image block; the coding module is configured to input each image block into the visual characteristic extraction model to obtain codes corresponding to each image block, wherein the codes corresponding to the query image blocks are used as query codes; the loss function determining module is configured to determine a first contrast loss function according to the similarity between the query codes of each sample video and codes corresponding to other image blocks in the same sample video and the similarity between the query codes of each sample video and codes corresponding to image blocks in different sample videos, wherein the higher the similarity between the query codes and the codes corresponding to other image blocks in the same sample video is, the lower the similarity between the query codes and the codes corresponding to image blocks in the different sample videos is, and the value of the first contrast function is smaller; and a parameter adjustment module configured to adjust parameters of the visual feature extraction model according to a loss function of the visual feature extraction model, the visual feature extraction model being trained, wherein the loss function of the visual feature extraction model comprises a first contrast loss function.
According to still further embodiments of the present disclosure, there is provided an action recognition apparatus including: the extraction module is configured to extract a first preset number of frames from the video to be identified; the coding module is configured to determine the coding of each frame of image by using the visual characteristic extraction model obtained by the training method of any embodiment; and the motion classification module is configured to input the codes of the images of each frame into the motion classification model to obtain the motion type in the video to be identified.
According to still further embodiments of the present disclosure, there is provided a behavior recognition apparatus including: the extraction module is configured to extract a second preset number of frames from the video to be identified; the coding module is configured to determine the coding of each frame of image by using the visual characteristic extraction model obtained by the training method of any embodiment; the behavior classification module is configured to input codes of the images of each frame into the behavior classification model to obtain behavior types in the video to be identified.
According to still further embodiments of the present disclosure, there is provided an object tracking apparatus including: the coding module is configured to determine the coding of each frame image of the video to be identified by utilizing the visual characteristic extraction model obtained by the training method in any embodiment, wherein the first frame image of the video to be identified is marked with the position information of the target; the object tracking module is configured to encode each frame of image, input an object tracking model and obtain the position information of the target in each frame of image.
According to still further embodiments of the present disclosure, there is provided a feature extraction apparatus of a video, including: an extraction module configured to extract a third preset number of frames from the video; and the coding module is configured to determine the coding of each frame of image by using the visual characteristic extraction model obtained by the training method of any embodiment.
According to still further embodiments of the present disclosure, there is provided an electronic device including: a processor; and a memory coupled to the processor for storing instructions that, when executed by the processor, cause the processor to perform the training method of any of the foregoing embodiments, the action recognition method of any of the foregoing embodiments, the behavior recognition method of any of the foregoing embodiments, the object tracking method of any of the foregoing embodiments, or the feature extraction method of the video of any of the foregoing embodiments.
According to still further embodiments of the present disclosure, there is provided a non-transitory computer-readable storage medium having stored thereon a computer program, wherein the program, when executed by a processor, implements the training method of any of the foregoing embodiments, or the action recognition method of any of the foregoing embodiments, or the behavior recognition method of any of the foregoing embodiments, or the object tracking method of any of the foregoing embodiments, or the feature extraction method of the video of any of the foregoing embodiments.
According to the method, no labeling is needed for each sample video: image blocks are extracted from multiple frames of images, each image block is encoded by the visual feature extraction model, the code corresponding to one of the image blocks is used as the query code, and a first contrast loss function is determined from the similarity between the query code and the codes corresponding to other image blocks in the same sample video and the similarity between the query code and the codes corresponding to image blocks in different sample videos; the parameters of the visual feature extraction model are then adjusted according to the first contrast loss function, and the visual feature extraction model is trained. The method omits the labeling process, improves training efficiency, and performs unsupervised training by fully utilizing only the inherent structure and correlation of the data, so that the visual feature extraction model can have good generalization capability. According to the method of the present disclosure, based on the space-time consistency of the video, the loss function is constructed from the relevance of multiple frames of images in the same sample video and the independence of images in different videos to train the visual feature extraction model, so that the visual feature extraction model can well learn the features of the video, and the trained visual feature extraction model can more accurately extract the features of the video.
Other features of the present disclosure and its advantages will become apparent from the following detailed description of exemplary embodiments of the disclosure, which proceeds with reference to the accompanying drawings.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present disclosure, and other drawings may be obtained according to these drawings without inventive effort to a person of ordinary skill in the art.
Fig. 1 illustrates a flow diagram of a training method of some embodiments of the present disclosure.
Fig. 2 shows a flow diagram of a training method of other embodiments of the present disclosure.
Fig. 3 illustrates a flow diagram of a method of action recognition of some embodiments of the present disclosure.
Fig. 4 illustrates a flow diagram of a behavior recognition method of some embodiments of the present disclosure.
Fig. 5 illustrates a flow diagram of an object tracking method of some embodiments of the present disclosure.
Fig. 6 illustrates a structural schematic of a training device of some embodiments of the present disclosure.
Fig. 7 illustrates a schematic structural diagram of an action recognition device of some embodiments of the present disclosure.
Fig. 8 illustrates a schematic structural diagram of a behavior recognition apparatus of some embodiments of the present disclosure.
Fig. 9 illustrates a schematic structural diagram of an object tracking device of some embodiments of the present disclosure.
Fig. 10 illustrates a structural schematic diagram of an electronic device of some embodiments of the present disclosure.
Fig. 11 shows a schematic structural diagram of an electronic device of other embodiments of the present disclosure.
Detailed Description
The following description of the technical solutions in the embodiments of the present disclosure will be made clearly and completely with reference to the accompanying drawings in the embodiments of the present disclosure, and it is apparent that the described embodiments are only some embodiments of the present disclosure, not all embodiments. The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses. Based on the embodiments in this disclosure, all other embodiments that a person of ordinary skill in the art would obtain without making any inventive effort are within the scope of protection of this disclosure.
The present disclosure presents an unsupervised training method for extracting a visual feature extraction model of video features, described below in connection with fig. 1-2.
FIG. 1 is a flow chart of some embodiments of the training method of the present disclosure. As shown in fig. 1, the method of this embodiment includes: steps S102 to S108.
In step S102, for each sample video, a plurality of frame images of the sample video are selected, image blocks are extracted from the plurality of frame images, and one of the extracted image blocks is used as a query image block.
A large number of sample videos form a training sample set. Multiple frames of images, i.e., more than two frames of images, may be randomly selected for each sample video. The image blocks are extracted by data enhancement (Data Augmentation) of each frame of image. One of the extracted image blocks is used as a Query image block, serving as the contrast reference in the subsequent contrast losses. The frame of image in which the query image block is located may be used as an anchor frame. For each frame of image other than the anchor frame, only one image block may be extracted, which is sufficient for training, although a plurality of image blocks may also be extracted. An additional image block may be extracted from the anchor frame. The image blocks other than the query image block may be used as Key value image blocks.
In some embodiments, three frames of images (s_1, s_2, s_3) may be extracted for each sample video v: a query image block x_q and another image block different from the query image block, serving as a first key value image block x_1, are extracted from the anchor frame, and one image block is extracted from each of the two other frames of the same sample video, serving as a second key value image block x_2 and a third key value image block x_3.
Each image block is extracted by a random data enhancement method, namely, each image block is randomly cropped at a random position and scale and subjected to random color dithering, random gray scale, random blurring, random mirroring, and the like. If a plurality of image blocks are extracted from one frame of image, the image blocks are extracted with different enhancement modes. Different enhancement modes refer to different random parameters adopted during enhancement, such as different cropping positions and sizes in random cropping, or different jitter amplitudes in random color dithering.
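As an illustration of the random enhancement described above, the following is a minimal sketch of such a pipeline, assuming a PyTorch/torchvision implementation; the crop size, jitter strengths, blur kernel, and probabilities are assumed values, not parameters specified by this disclosure.

```python
# Sketch of the random image-block extraction described above (assumed parameter values).
from torchvision import transforms

patch_augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),   # random crop at a random position and proportion
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),  # random color dithering
    transforms.RandomGrayscale(p=0.2),                      # random gray scale
    transforms.RandomApply([transforms.GaussianBlur(23, sigma=(0.1, 2.0))], p=0.5),  # random blurring
    transforms.RandomHorizontalFlip(p=0.5),                 # random mirror image processing
    transforms.ToTensor(),
])

# Applying the pipeline twice to the anchor frame yields the query image block x_q and the
# first key value image block x_1 with different random parameters; applying it once to each
# of the two other frames yields x_2 and x_3.
```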
In step S104, each image block is input into the visual feature extraction model, and the codes corresponding to each image block are obtained.
The visual feature extraction model may include a query encoder and a key value encoder. For each sample video, the query image block x_q is input into the query encoder, and the key value image blocks (e.g., x_1, x_2, x_3) are input into the key value encoder. The query encoder is used for obtaining the code corresponding to the query image block as the query code s_q, and the key value encoder is used for obtaining the codes corresponding to the image blocks other than the query image block, i.e., the key value codes of the key value image blocks (e.g., s_k^1, s_k^2, s_k^3).
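The disclosure does not fix a particular encoder architecture; as a sketch only, the two encoders could be built as follows, assuming a ResNet-50 backbone with a linear projection to the code dimension.

```python
# Sketch of the query / key value encoder pair described above (backbone choice is an assumption).
import copy
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

class Encoder(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        backbone = resnet50()
        backbone.fc = nn.Linear(backbone.fc.in_features, dim)  # project to the code dimension
        self.net = backbone

    def forward(self, x):
        return F.normalize(self.net(x), dim=1)  # L2-normalized code

f_q = Encoder()              # query encoder: query code s_q = f_q(x_q)
f_k = copy.deepcopy(f_q)     # key value encoder: key value codes s_k^i = f_k(x_i)
for p in f_k.parameters():
    p.requires_grad = False  # updated by momentum (described later), not by gradients
```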
In step S106, a first contrast loss function is determined according to the similarity between the query codes of each sample video and the codes corresponding to other image blocks in the same sample video, and the similarity between the query codes of each sample video and the codes corresponding to image blocks in different sample videos.
The higher the similarity between the query code and the codes corresponding to other image blocks in the same sample video is, and the lower the similarity between the query code and the codes corresponding to image blocks in different sample videos is, the smaller the value of the first contrast loss function is.
Based on the space-time coherent characteristic of the video, an inter-frame instance discrimination task is set, which checks the matching of the query code and the key value codes at the video level. From a spatio-temporal perspective, the query code s_q is similar to all key value codes in the same video (e.g., s_k^1, s_k^2, s_k^3) and different from the key value codes sampled from other videos (e.g., expressed as s_n^j). The determination method of the first contrast loss function is designed based on this inter-frame instance discrimination task.
In some embodiments, a query code, a first key value code, a second key value code, and a third key value code are obtained for each sample video. For each sample video, the inter-frame loss function corresponding to the sample video is determined according to the similarity of the query code to the first key value code corresponding to the first key value image block, the second key value code corresponding to the second key value image block, and the third key value code corresponding to the third key value image block, respectively, and the similarity of the query code to each negative key value code, wherein the negative key value codes include the first key value codes, the second key value codes, and the third key value codes corresponding to other sample videos; the first contrast loss function is then determined according to the inter-frame loss functions corresponding to the respective sample videos.
For each sample video, a query image block can be extracted from one frame, so as to further obtain a query code, and a key value image block can be extracted from another frame, so as to obtain a key value code. And for each sample video, determining an interframe loss function corresponding to the sample video according to the similarity of the query code and the key value code and the similarity of the query code and each negative key value code. Each negative key code includes key codes corresponding to other sample videos. The number of frames and the number of image blocks extracted from the sample video can be set according to actual requirements, and the loss function can be built by referring to the construction principle of the first contrast loss function in the above embodiment for the inter-frame instance discrimination task.
In some embodiments, the similarity between two encodings may be measured by dot product, not limited to the illustrated example. For example, determining similarity of the query code to the first, second, and third key value codes based on dot products of the query code and the first, second, and third key value codes, respectively; and determining the similarity of the query code and each negative key value code according to the dot product of the query code and each negative key value code.
For example, the query code corresponding to the anchor frame is s_q and its key value code is s_k^1, and the two key value codes from the other frames in the same video are s_k^2 and s_k^3. In the inter-frame instance discrimination task, the goal is to determine whether two image blocks are from the same video. All key value codes in the same video (s_k^1, s_k^2, s_k^3) may be taken as positive key value codes, and the image blocks sampled from other videos are taken as negative samples, corresponding to negative key value codes s_n^j. If the training process divides the sample videos into a plurality of batches (Batch), each batch containing a preset number of sample videos, and trains through a plurality of batch iterations, the image blocks sampled from other videos in adjacent batches may be used as negative samples corresponding to the negative key value codes, not limited to the illustrated examples.
The query code s_q needs to match each of the multiple positive key value codes. Based on this task, the inter-frame loss function corresponding to each sample video can be defined as the sum of the contrast losses of all query code and positive key value code pairs (s_q, s_k^i), for example, using the following formula:

$L_{inter} = \sum_{i=1}^{3} -\log \frac{\exp(s_q \cdot s_k^i / \tau)}{\exp(s_q \cdot s_k^i / \tau) + \sum_{j=1}^{K} \exp(s_q \cdot s_n^j / \tau)}$

wherein s_q is the query code, 1 ≤ i ≤ 3 and i is a positive integer, s_k^1 is the first key value code, s_k^2 is the second key value code, s_k^3 is the third key value code, 1 ≤ j ≤ K and j is a positive integer, K is the total number of negative key value codes, s_n^j is the j-th negative key value code, and τ is a hyper-parameter. The first contrast loss function may be determined by weighting or summing the inter-frame loss functions corresponding to each sample video. By minimizing the value of the first contrast loss function, the visual feature extraction model learns to distinguish all positive key value codes s_k^1, s_k^2, s_k^3 in the same video, which match the query code s_q, from all negative key value codes s_n^j.
The inter-frame loss function for each sample video can also be defined as a weighted result of the contrast losses of the query code and positive key value code pairs (s_q, s_k^i), not limited to the illustrated example.
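A minimal sketch of this inter-frame loss, assuming the codes are L2-normalized tensors of shape (N, C) and the negative key value codes are kept in a (K, C) bank, is:

```python
# Sketch of the inter-frame loss: sum of contrast losses over the three positive key value codes.
import torch
import torch.nn.functional as F

def inter_frame_loss(s_q, s_k1, s_k2, s_k3, negatives, tau=0.07):
    losses = []
    for s_k in (s_k1, s_k2, s_k3):                              # positive key value codes
        pos = torch.einsum('nc,nc->n', s_q, s_k) / tau           # dot-product similarity with the positive
        neg = torch.einsum('nc,kc->nk', s_q, negatives) / tau    # similarities with the K negative codes
        logits = torch.cat([pos.unsqueeze(1), neg], dim=1)
        labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
        losses.append(F.cross_entropy(logits, labels))           # -log softmax of the positive pair
    return sum(losses)
```

Cross entropy over logits whose first column is the positive pair reproduces the -log ratio in the formula above; the temperature value 0.07 is an assumption.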
In step S108, parameters of the visual feature extraction model are adjusted according to the loss function of the visual feature extraction model, and the visual feature extraction model is trained.
The loss function of the visual feature extraction model includes a first contrast loss function. In some embodiments, the visual feature extraction model includes a query encoder and a key encoder. The query encoder and the key encoder may employ different parameter adjustment strategies. For example, in each iteration, the parameters of the current iteration of the query encoder are adjusted according to the loss function of the visual feature extraction model, and the parameters of the current iteration of the key encoder are adjusted according to the parameters of the last iteration of the query encoder and the parameters of the last iteration of the key encoder.
Further, the parameters (weights) of the query encoder may be adjusted and updated with SGD (stochastic gradient descent) by minimizing the value of the loss function of the visual feature extraction model. The key value encoder may be adjusted and updated by a momentum update (Momentum Update) strategy conditioned on the parameters of the query encoder. The momentum update strategy reduces the loss of feature consistency between different key value codes that drastic changes of the key value encoder would cause, while still keeping the key value encoder continuously updated. The parameters of the key value encoder may be updated according to the following formula:
$\theta_k^t = \alpha\,\theta_k^{t-1} + (1 - \alpha)\,\theta_q^{t-1}$

wherein t is the number of iterations, θ_k^t is the parameter of the key value encoder f_k at the t-th iteration, θ_k^{t-1} is the parameter of the key value encoder at the (t-1)-th iteration, θ_q^{t-1} is the parameter of the query encoder f_q at the (t-1)-th iteration, and α is the momentum coefficient.
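A sketch of this momentum update, assuming f_q and f_k are the encoder modules sketched earlier; the momentum value 0.999 is an assumption.

```python
# Sketch of the momentum update of the key value encoder parameters.
import torch

@torch.no_grad()
def momentum_update(f_q, f_k, alpha=0.999):
    for p_q, p_k in zip(f_q.parameters(), f_k.parameters()):
        # theta_k^t = alpha * theta_k^(t-1) + (1 - alpha) * theta_q^(t-1)
        p_k.data.mul_(alpha).add_(p_q.data, alpha=1.0 - alpha)
```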
The inter-frame instance discrimination task aims to learn the compatibility of query image blocks and key value image blocks at the video level. In this task, the trained visual feature extraction model can not only distinguish the query image block of a frame in one video from image blocks in other videos (as negative or unmatched samples), but also identify image blocks from other frames of the same video as positive or matched samples. Such a design goes beyond traditional still-image supervision and acquires more positive sample image blocks from the same video. Through contrast learning, it provides a new way to learn how objects evolve over time (e.g., new views/poses of an object). The method makes good use of the spatio-temporal structure of the video, thereby enhancing unsupervised visual feature learning for video understanding.
According to the method of the above embodiments, no labeling is needed for each sample video: image blocks are extracted from multiple frames of images, each image block is encoded by the visual feature extraction model, the code corresponding to one of the image blocks is used as the query code, the first contrast loss function is determined from the similarity between the query code and the codes corresponding to other image blocks in the same sample video and the similarity between the query code and the codes corresponding to image blocks in different sample videos, and the parameters of the visual feature extraction model are then adjusted according to the first contrast loss function to train the visual feature extraction model. The method of these embodiments omits the labeling process, improves training efficiency, and performs unsupervised training by fully utilizing only the inherent structure and correlation of the data, so that the visual feature extraction model can have good generalization capability. According to the space-time consistency of the video, the loss function is constructed from the relevance of multiple frames of images in the same sample video and the independence of images in different videos to train the visual feature extraction model, so that the visual feature extraction model can well learn the features of the video, and the trained visual feature extraction model can accurately extract the features of the video.
In addition to space-time consistency, the video also has characteristics such as cross-frame variation and a fixed order of frames. In order to further improve the learning accuracy of the visual feature extraction model, the present disclosure further provides improvements of the foregoing training method, described below in conjunction with fig. 2.
FIG. 2 is a flow chart of further embodiments of the training method of the present disclosure. As shown in fig. 2, the method of this embodiment includes: steps S202 to S220.
In step S202, for each sample video, a plurality of frame images of the sample video are selected, image blocks are extracted from the plurality of frame images, and one of the extracted image blocks is used as a query image block.
In step S204, each image block is input into the visual feature extraction model, and the codes corresponding to each image block are obtained.
In step S206, a first contrast loss function is determined according to the similarity between the query codes of each sample video and the codes corresponding to other image blocks in the same sample video, and the similarity between the query codes of each sample video and the codes corresponding to image blocks in different sample videos.
In step S208, a second contrast loss function is determined according to the similarity between the query codes of the respective sample videos and the codes corresponding to the first key image blocks, and the similarity between the query codes and the codes corresponding to the image blocks extracted from other frames in the same sample video.
The higher the similarity between the query codes and the codes corresponding to the first key-value image blocks, the lower the similarity between the query codes and the codes corresponding to the image blocks extracted by other frames in the same sample video, and the smaller the value of the second contrast loss function.
Based on the cross-frame variation characteristic of the video, an intra-frame instance discrimination task is designed, which determines from a spatial perspective whether two image blocks are derived from the same frame. The query code s_q is similar to the key value code corresponding to the same frame (e.g., s_k^1) and does not match the key value codes corresponding to other frames (e.g., s_k^2, s_k^3).
In some embodiments, in the case of obtaining a query code, a first key value code, a second key value code, and a third key value code for each sample video, determining, for each sample video, an intra-frame loss function corresponding to the sample video from a similarity of the query code to the first key value code corresponding to the first key value image block and a similarity of the query code to the second key value code corresponding to the second key value image block and the third key value code corresponding to the third key value image block, respectively; and determining a second contrast loss function according to the intra-frame loss function corresponding to each sample video.
For each sample video, a query image block and another image block, serving as the first key value image block, are extracted from one frame, so as to obtain the query code and the first key value code, and a key value image block is extracted from another frame as the second key value image block, so as to obtain the second key value code. For each sample video, the intra-frame loss function corresponding to the sample video is determined according to the similarity of the query code to the first key value code and the similarity of the query code to the second key value code. The intra-frame instance discrimination task needs at least one additional image block, extracted from the frame where the query image block is located, for comparison, and at least one image block extracted from at least one other frame of the same video. Beyond that, the number of frames extracted from the same video, the number of image blocks extracted from the same frame other than the query image block, and the number of image blocks extracted from other frames are not limited. For the intra-frame instance discrimination task, the loss function is constructed by referring to the construction principle of the second contrast loss function in the above embodiment.
In some embodiments, the similarity between two codes may be measured by a dot product, not limited to the illustrated example. For example, among the codes corresponding to the four image blocks sampled from one video (the query code s_q and the first key value code s_k^1 corresponding to the same frame, and the two key value codes s_k^2 and s_k^3 corresponding to the other two frames), s_k^1 is taken as the positive key value code and s_k^2, s_k^3 are taken as negative key value codes. Since the inter-frame instance discrimination task has already utilized key value codes derived from other videos, for simplicity the key value codes of other videos applied in that task are excluded from this contrast learning. Specifically, the intra-frame loss function corresponding to each sample video may be determined using the following formula:
$L_{intra} = -\log \frac{\exp(s_q \cdot s_k^1 / \tau)}{\exp(s_q \cdot s_k^1 / \tau) + \exp(s_q \cdot s_k^2 / \tau) + \exp(s_q \cdot s_k^3 / \tau)}$

wherein s_q is the query code, s_k^1 is the first key value code, s_k^2 is the second key value code, s_k^3 is the third key value code, and τ is a hyper-parameter. The second contrast loss function is determined by weighting or summing the intra-frame loss functions corresponding to each sample video. The second contrast loss function is designed such that the query code s_q remains similar to the positive key value code s_k^1 extracted from the same frame and different from the negative key value codes s_k^2, s_k^3 from other frames, so that a temporally distinctive visual representation is obtained.
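A sketch of this intra-frame loss, under the same shape assumptions as the inter-frame sketch above (the first key value code is the positive, the second and third are the negatives):

```python
# Sketch of the intra-frame loss for a batch of sample videos.
import torch
import torch.nn.functional as F

def intra_frame_loss(s_q, s_k1, s_k2, s_k3, tau=0.07):
    pos = torch.einsum('nc,nc->n', s_q, s_k1) / tau    # same-frame positive key value code
    neg2 = torch.einsum('nc,nc->n', s_q, s_k2) / tau   # other-frame negative key value codes
    neg3 = torch.einsum('nc,nc->n', s_q, s_k3) / tau
    logits = torch.stack([pos, neg2, neg3], dim=1)
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)             # -log softmax of the same-frame pair
```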
In the inter-frame instance discrimination task, all image blocks sampled at the video level are grouped together into a generic class without exploiting the inherent spatial variation between frames within the same video. In order to alleviate this problem, the above-described intra-frame instance discrimination task is proposed to distinguish image blocks of the same frame from image blocks of other frames in the video, and to clearly display a change from a spatial perspective. In this way, unsupervised feature learning is further guided by spatial supervision between frames, and it is desirable that the learned visual representation be differentiated between frames in the video.
In step S210, for each sample video, the query codes and the codes corresponding to the image blocks extracted from other frames in the same sample video are combined into a sequence code according to a preset sequence.
The frame where the query image block is located may be used as an anchor frame, and in order to more easily determine the order of each image block, the first frame or the last frame, which is arranged in time sequence, in the multi-frame image extracted from the video may be selected as an anchor frame. In some embodiments, the sequence code is generated in accordance with a query code, a second key code corresponding to a second key image block, and an order of a third key code corresponding to a third key image block. The query code, the second key value code, and the third key value code may be concatenated, although the order may be reversed, and is not limited to the illustrated example.
Based on the order among the frames of the video, a temporal order verification task is designed to learn the inherent order structure of the video by predicting the correct temporal order of a sequence of image blocks. Specifically, the sequence code is formed from the query code s_q and the two key value codes s_k^2 and s_k^3. The first key value code is no longer used here, because the query code and the first key value code belong to the same frame and their order cannot be distinguished.
In step S212, the sequence code is input into the classification model, so as to obtain the prediction time sequence of the query image block and the image blocks extracted from other frames in the same sample video.
In some embodiments, the sequence encoding is input into a classification model to obtain results of the query image block before or after the second key-value image block and the third key-value image block as a prediction time order. There are two cases of the output of the classification model, one is that the query image block precedes the second key value image block and the third key value image block, and the other is that the query image block follows the second key value image block and the third key value image block.
In step S214, a third loss function is determined according to the predicted time sequence corresponding to each sample video and the real time sequence of the query image block and the image blocks extracted from other frames in the same sample video in the sample video.
In some embodiments, the cross entropy loss function corresponding to each sample video is determined based on the predicted temporal order and the real temporal order of the query image block, the second key-value image block, and the third key-value image block in the sample video, and the third loss function is determined based on the cross entropy loss function corresponding to each sample video.
A temporal order verification task is designed from the perspective of the order between video frames, aimed at verifying whether a series of image blocks are in the correct temporal order. The underlying rationale is to encourage the visual feature extraction model to infer the temporal order of image blocks, thereby utilizing the sequential structure of the video for unsupervised feature learning.
For example, three frames are randomly sampled from an unlabeled video, and the first or last frame in time order is used as the anchor frame. The query code s_q and the two key value codes s_k^2 and s_k^3 are concatenated into an overall sequence representation, i.e., the sequence code, and input into a classifier g(·), which predicts whether the query code precedes or follows the key value codes. The cross entropy loss function corresponding to each sample video may be determined using the following formula:
s q in order to query the code(s), Coding for the second key->For the third key value encoding, y ε {0,1} represents querying s in real time order in the sample video q Is encoded in the second key value and the third key value +.>Before or after. The third loss function may be determined by weighting or summing the cross entropy loss functions corresponding to each sample video. The visual feature extraction model may be made to distinguish between the order of the different frames by minimizing the third loss function value.
The steps S206, S208, S210 to S214 may be executed in parallel, and S208, S210 to S214 are optional steps.
In step S216, parameters of the visual feature extraction model are adjusted according to the first contrast loss function and the second contrast loss function, and the visual feature extraction model is trained.
For example, the loss function of the visual feature extraction model is a weighted result of the first contrast loss function and the second contrast loss function.
In step S218, parameters of the visual feature extraction model are adjusted according to the first contrast loss function and the third loss function, and the visual feature extraction model is trained.
For example, the loss function of the visual feature extraction model is a weighted result of the first contrast loss function and the third loss function.
In step S220, parameters of the visual feature extraction model are adjusted according to the first contrast loss function, the second contrast loss function, and the third loss function, and the visual feature extraction model is trained.
For example, the loss function of the visual feature extraction model is a weighted result of the first contrast loss function, the second contrast loss function, and the third loss function, and may be determined using the following formula:

$L = \lambda_1 L_{inter} + \lambda_2 L_{intra} + \lambda_3 L_{order}$

wherein λ_1, λ_2, and λ_3 are weighting coefficients.
How the parameters of the visual feature extraction model are updated is described in the foregoing embodiments, and is not described in detail herein. The inter-frame instance discrimination task, the intra-frame instance discrimination task and the time sequence verification task can be combined to train the visual feature extraction model, and under the condition that the three tasks are implemented, the accuracy of the visual feature extraction model is highest, and the effect is best, because various characteristics of space-time consistency, inter-frame variability and inter-frame sequence of the video are comprehensively utilized, the visual feature extraction model comprehensively learns the features of the video. And the training process utilizes the inherent characteristics of the video, and the visual characteristic extraction model has good generalization capability.
According to the embodiment, the sampling methods in the inter-frame instance discriminating task, the intra-frame instance discriminating task and the time sequence verifying task can be different, and if the tasks need to be combined and applied, the sampling modes of the different tasks need to be unified, for example, in the embodiment, three frames are sampled for each video, a first frame or a last frame is used as an anchor frame, the anchor frame extracts a query image block and a first key value image block, and the other two frames respectively extract a second key value image block and a third key value image block. However, the sampling method is not limited to the example, and may be any method as long as the determination policy of each loss function is satisfied.
The trained visual feature extraction model may be used to extract features of a video. In some embodiments, a third preset number of frames are extracted from the video, and the codes of each frame of image are determined by using the visual feature extraction model obtained by the training method of any of the foregoing embodiments.
Optionally, the method may further include determining a characteristic of the video based on the encoding of each frame of image. For example, the average value of each frame image encoding may be taken as a feature of the video, or each frame image may be directly encoded as a feature of the video, not limited to the illustrated example.
In the above embodiments, the inter-frame instance discrimination task, the intra-frame instance discrimination task, and the time sequence verification task are designed, and the visual feature extraction model is trained based on at least one of the characteristics of space-time continuity, inter-frame variability, and inter-frame order, so that the visual feature extraction model can learn the most characteristic features in the video. For example, according to the inter-frame instance discrimination task, image blocks from different frames of the same video are made similar while image blocks from different videos are dissimilar, so that the visual feature extraction model can learn the main features of the subject (target) in each video. For example, for a video of a person riding and other videos (a person walking or skating, etc.), through training the visual feature extraction model can distinguish the content of the different videos, thereby extracting the features that best express each video.
As another example, the intra-frame instance discrimination task requires image blocks in the same frame to be similar and image blocks from different frames to be dissimilar, so the visual feature extraction model can learn the detail change features of the subject (target) in each frame; these detail features further improve the accuracy of the extracted features on top of the inter-frame instance discrimination task. As a further example, the time sequence verification task requires the order of the frames to be identified correctly, so the visual feature extraction model can learn how the features of the subject (target) change across frames, which further enriches the learned features on top of the other two tasks and makes the extracted features more accurate. By applying all three tasks, the visual feature extraction model can accurately learn what the whole video expresses, regardless of the video content. On the basis of this accurate understanding of the video content, combining the visual feature extraction model with any downstream task (e.g., action recognition, behavior recognition, object tracking, etc.) gives very good performance.
Some embodiments of how the visual feature extraction model trained according to the previous embodiments is applied are described below in connection with fig. 3-5.
Fig. 3 is a flow chart of some embodiments of the method of action recognition (Action Recognition) of the present disclosure. As shown in fig. 3, the method of this embodiment includes: steps S302 to S306.
In step S302, a first preset number of frames are extracted from the video to be identified.
For example, 30 or 50 frames may be extracted from the video to be identified, and an image block may be extracted from each frame in a fixed manner, for example by adjusting each frame to a preset size and cropping an image block of preset length and width from the center.
In step S304, the coding of each frame image is determined using the pre-trained visual feature extraction model.
The visual feature extraction model comprises a query encoder and a key value encoder. During training, the outputs of the two encoders need to be compared; when the model is used for recognition no comparison is required, so only the query encoder is used to encode each frame of image (or each image block).
In step S306, the codes of the images of each frame are input into an action classification model to obtain the action type in the video to be identified.
The codes of each frame of image may be averaged and then input into the action classification model. The action recognition model may be composed of the visual feature extraction model and the action classification model, and the action classification model may be a simple linear model; the structure is not limited to the illustrated example.
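A sketch of this inference pipeline, assuming the frames are already extracted and preprocessed and the action classification model is any callable linear classifier (both assumptions for illustration):

```python
import torch

@torch.no_grad()
def recognize_action(frames, query_encoder, action_classifier):
    """Encode each frame, average the codes, and classify the action.

    frames: (T, C, H, W) tensor of center-cropped frames of the video.
    """
    codes = query_encoder(frames)                  # (T, D) per-frame codes
    video_code = codes.mean(dim=0, keepdim=True)   # (1, D) averaged code
    logits = action_classifier(video_code)         # (1, num_action_types)
    return logits.argmax(dim=1).item()             # predicted action type
```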
Because the visual feature extraction model is trained by the method of the foregoing embodiments, it can extract the features of the video very accurately, which improves the accuracy of the final action recognition.
Fig. 4 is a flow chart of some embodiments of the behavior recognition (Activity Recognition) method of the present disclosure. As shown in fig. 4, the method of this embodiment includes: steps S402 to S406.
In step S402, a second preset number of frames are extracted from the video to be identified.
For example, 30 or 50 frames may be extracted from the video to be identified, and an image block may be extracted from each frame in a fixed manner, for example by adjusting each frame to a preset size and cropping an image block of preset length and width from the center.
In step S404, the coding of each frame image is determined using the pre-trained visual feature extraction model.
The visual feature extraction model comprises a query encoder and a key value encoder. During training, the outputs of the two encoders need to be compared; when the model is used for recognition no comparison is required, so only the query encoder is used to encode each frame of image (or each image block).
In step S406, the codes of the images of each frame are input into a behavior classification model to obtain the behavior type in the video to be identified.
The codes of each frame of image may be averaged and then input into the behavior classification model. The behavior recognition model may be composed of the visual feature extraction model and the behavior classification model, and the behavior classification model may be a simple linear model; the structure is not limited to the illustrated example.
Because the visual feature extraction model is trained by the method of the foregoing embodiments, it can extract the features of the video accurately, which improves the accuracy of the final behavior recognition.
Since the methods and models for action recognition and behavior recognition are similar, both are described below with the same application example.
Some embodiments of the visual feature extraction model are first described. The visual feature extraction model comprises a query encoder and a key value encoder, and the two encoders may adopt similar neural network structures, for example ResNet50 (residual network 50) followed by an MLP (multi-layer perceptron), with a global pooling layer added between the ResNet50 and the MLP. The MLP may only affect the training process and not participate in downstream tasks. During training, the discrimination network structures of the inter-frame instance discrimination task, the intra-frame instance discrimination task and the time sequence verification task of the foregoing embodiments are added after the MLP. When the visual feature extraction model is used as the feature extraction part of the action recognition model and the behavior recognition model, only the ResNet50+MLP structure may be applied. The visual feature extraction model may be pre-trained on a training set containing various types of sample videos, so that it learns the features of various types of videos; this training process does not require annotation.
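A sketch of such a query encoder in PyTorch is given below. The MLP widths (2048→2048→128) are assumptions borrowed from common contrastive-learning practice, since only the ResNet50 + global pooling + MLP structure is specified.

```python
import torch.nn as nn
import torchvision

class QueryEncoder(nn.Module):
    """ResNet50 backbone + global average pooling + MLP projection head."""

    def __init__(self, feat_dim=128):
        super().__init__()
        backbone = torchvision.models.resnet50()
        # keep everything up to and including the global average pooling layer
        self.backbone = nn.Sequential(*list(backbone.children())[:-1])
        self.mlp = nn.Sequential(
            nn.Linear(2048, 2048),
            nn.ReLU(inplace=True),
            nn.Linear(2048, feat_dim),
        )

    def forward(self, x):                 # x: (N, 3, H, W) image blocks
        h = self.backbone(x).flatten(1)   # (N, 2048) pooled backbone feature
        return self.mlp(h)                # (N, feat_dim) code used by the tasks
```

The key value encoder can adopt the same structure; during training the discrimination heads of the three tasks are attached after the MLP.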
Further, the classification parts of the action recognition model and the behavior recognition model, that is, the action classification model and the behavior classification model, may employ linear models, for example SVMs (support vector machines). The overall structure of the action recognition model and the behavior recognition model may then be ResNet50+MLP+SVM. The visual feature extraction model may be pre-trained according to the method of the foregoing embodiments and then combined with such linear models to obtain the action recognition model and the behavior recognition model.
The action classification model and the behavior classification model need to be trained with a training set so that the whole model can perform action recognition or behavior recognition. The action classification model may be trained with an action-class training set such as the Kinetics400 dataset, and the behavior classification model may be trained with a behavior-class training set such as the ActivityNet dataset; the datasets are not limited to the illustrated examples. In this process, the visual feature extraction model does not need to be trained again, and the training sets of the action classification model and the behavior classification model can be far smaller than the training set of the visual feature extraction model, which greatly reduces the amount of labeling and improves efficiency. Taking the action classification model as an example, during its training a preset number of frames may be extracted from each sample video; image blocks are extracted in a preset manner (the whole frame may also be used as an image block, depending on the training requirements of the specific action classification model) and input into the visual feature extraction model to obtain their codes; the codes of the image blocks are averaged and input into the action classification model to obtain a classification result; a loss function is determined from the classification result and the labeled action type; and the parameters of the action classification model are adjusted according to the loss function until a convergence condition is reached, completing the training. The specific loss function determination and parameter adjustment methods may be implemented with existing techniques and are not described here.
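A sketch of fitting such a linear classifier on frozen features, using a linear SVM as one possible choice; the frame extraction helper and dataset handling are assumptions for illustration.

```python
import numpy as np
import torch
from sklearn.svm import LinearSVC

def train_action_classifier(videos, labels, query_encoder, extract_frames):
    """Fit a linear SVM on features from the frozen visual feature extraction model.

    videos/labels: labelled sample clips and their action types;
    extract_frames: decodes a preset number of frames per clip as a (T, C, H, W) tensor.
    """
    feats = []
    for video in videos:
        frames = extract_frames(video)
        with torch.no_grad():
            codes = query_encoder(frames)              # (T, D) block codes
        feats.append(codes.mean(dim=0).cpu().numpy())  # averaged code per video
    clf = LinearSVC()
    clf.fit(np.stack(feats), labels)
    return clf
```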
The training of the action classification model and the behavior classification model is relatively simple. After being trained once, the visual feature extraction model can be combined with various downstream tasks without being retrained for each of them, which improves efficiency when there are multiple applications.
FIG. 5 is a flow chart of some embodiments of an Object Tracking (Object Tracking) method of the present disclosure. As shown in fig. 5, the method of this embodiment includes: steps S502 to S504.
In step S502, the coding of each frame image of the video to be identified is determined using the pre-trained visual feature extraction model.
In object tracking, the position information of the target, for example the bounding box of the target, is labeled in the first frame image. Each frame of image may be preprocessed before being input into the visual feature extraction model, for example by adjusting its spatial resolution to a preset resolution.
In step S504, the encoding of each frame image is input into the object tracking model, and the positional information of the object in each frame image is obtained.
Object tracking may be based on SiamFC (a target tracking algorithm based on a fully convolutional Siamese network). As in the foregoing embodiments, ResNet50+MLP may be used as the encoders of the visual feature extraction model, and the query encoder is used to determine the encoding of each frame of image, so as to fit the SiamFC algorithm and evaluate the effect of the visual feature extraction model more accurately. A 1x1 convolution is added after the query encoder of the visual feature extraction model, and during training the learning of the tracking features is completed by optimizing only the parameters of this 1x1 convolution. The structure of the query encoder followed by the 1x1 convolution can serve as the feature extraction part of the SiamFC algorithm. Meanwhile, the configuration of the ResNet50 can be modified to better suit the SiamFC algorithm: the stride-2 convolutions in {res4, res5} of the ResNet50 are changed to stride 1, and the dilation rates of the 3x3 convolutions in res4 and res5 are changed from 1 to 2 and 4, respectively. The query encoder and the 1x1 convolution may be used to transform the first frame image and the other frame images, and the transformed code of the first frame image and the transformed codes of the other frame images are then input into the object tracking part (i.e., the object tracking model) of the SiamFC algorithm. The specific SiamFC algorithm and the training method of its object tracking part can refer to the prior art and are not described here.
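A sketch of this backbone modification using torchvision, where `replace_stride_with_dilation=[False, True, True]` turns the stride-2 convolutions of res4/res5 (torchvision's layer3/layer4) into stride 1 and applies dilation rates 2 and 4. Keeping only the convolutional trunk of the ResNet50 (so that spatial feature maps remain available) and the 256-channel width of the 1x1 convolution are assumptions made for this illustration.

```python
import torch.nn as nn
import torchvision

def build_tracking_backbone():
    """Query-encoder backbone adapted for SiamFC-style tracking."""
    resnet = torchvision.models.resnet50(
        replace_stride_with_dilation=[False, True, True])
    # keep the convolutional trunk, drop global pooling and the fc layer
    trunk = nn.Sequential(*list(resnet.children())[:-2])
    head = nn.Conv2d(2048, 256, kernel_size=1)  # 1x1 convolution added after the query encoder

    for p in trunk.parameters():
        p.requires_grad = False  # only the 1x1 convolution is optimized for tracking
    return nn.Sequential(trunk, head)
```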
Because the visual feature extraction model is trained by the method of the foregoing embodiments, it can extract the features of the video very accurately, which improves the accuracy of the final object tracking.
The inventors carried out comparison experiments between the visual feature extraction model trained by the training method of the present disclosure and visual feature extraction models trained by various existing training methods, and the former achieves higher accuracy in various downstream task scenarios. Therefore, the training method of the present disclosure reduces the tedious labeling process while improving the accuracy of the model.
The present disclosure also provides a training device, described below in connection with fig. 6.
Fig. 6 is a block diagram of some embodiments of the training device of the present disclosure. As shown in fig. 6, the apparatus 60 of this embodiment includes: the system comprises an extraction module 610, an encoding module 620, a loss function determination module 630 and a parameter adjustment module 640.
The extraction module 610 is configured to select, for each sample video, a plurality of frame images of the sample video, and extract image blocks from the plurality of frame images, respectively, and take one of the extracted image blocks as a query image block.
The encoding module 620 is configured to input each image block into the visual feature extraction model to obtain a code corresponding to each image block, where the code corresponding to the query image block is used as the query code.
The loss function determining module 630 is configured to determine a first contrast loss function according to a similarity between the query encoding of each sample video and the encoding corresponding to the other image blocks in the same sample video, and a similarity between the query encoding of each sample video and the encoding corresponding to the image blocks in different sample videos, wherein the higher the similarity between the query encoding and the encoding corresponding to the other image blocks in the same sample video, the lower the similarity between the query encoding and the encoding corresponding to the image blocks in the different sample videos, and the smaller the value of the first contrast loss function.
In some embodiments, the frame in which the query image block is located is used as an anchor frame, and the extracted image block further includes another image block different from the query image block extracted from the anchor frame, which is used as a first key value image block, and one image block is respectively extracted from two other frames of the same sample video, which is used as a second key value image block and a third key value image block. The loss function determining module 630 is configured to determine, for each sample video, an inter-frame loss function corresponding to the sample video according to a similarity of a query code to a first key code corresponding to the first key-value image block, a second key-value code corresponding to the second key-value image block, and a third key-value code corresponding to the third key-value image block, respectively, and a similarity of the query code to each negative key-value code, respectively, wherein each negative key-value code includes the first key-value code, the second key-value code, and the third key-value code corresponding to other sample videos; and determining a first contrast loss function according to the interframe loss function corresponding to each sample video.
In some embodiments, the similarity of the query code to the first, second, and third key value codes is determined from the dot products of the query code with the first, second, and third key value codes, respectively, and the similarity of the query code to each negative key value code is determined from the dot product of the query code with that negative key value code.
In some embodiments, the corresponding interframe loss function for each sample video is determined using the following formula:
$$\mathcal{L}_{inter} = -\sum_{i=1}^{3} \log \frac{\exp(s_q \cdot k_i / \tau)}{\exp(s_q \cdot k_i / \tau) + \sum_{j=1}^{K} \exp(s_q \cdot k_j^{-} / \tau)}$$

wherein $s_q$ is the query code; $1 \le i \le 3$ and $i$ is a positive integer; $k_1$, $k_2$ and $k_3$ are the first key value code, the second key value code and the third key value code, respectively; $1 \le j \le K$, $j$ is a positive integer, and $K$ is the total number of negative key value codes; $k_j^{-}$ is the j-th negative key value code; and $\tau$ is a hyper-parameter.
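A sketch of this inter-frame loss, assuming the codes are compared by dot product as described above; the temperature value of 0.07 and the averaging over the three positive terms are assumptions.

```python
import torch
import torch.nn.functional as F

def inter_frame_loss(q, pos_keys, neg_keys, tau=0.07):
    """Inter-frame instance discrimination loss for one sample video.

    q:        (D,)   query code s_q
    pos_keys: (3, D) first/second/third key value codes of the same video
    neg_keys: (K, D) negative key value codes from other sample videos
    """
    l_pos = pos_keys @ q / tau   # (3,) similarities to the positive keys
    l_neg = neg_keys @ q / tau   # (K,) similarities to the negative keys
    loss = q.new_zeros(())
    for i in range(3):
        logits = torch.cat([l_pos[i:i + 1], l_neg]).unsqueeze(0)  # positive term first
        loss = loss + F.cross_entropy(logits, torch.zeros(1, dtype=torch.long))
    return loss / 3
```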
In some embodiments, the frame in which the query image block is located is used as an anchor frame, and another image block different from the query image block extracted from the anchor frame is further included in the extracted image block as the first key value image block. The loss function determining module 630 is further configured to determine a second contrast loss function according to a similarity between the query codes of the respective sample videos and codes corresponding to the first key-value image blocks, and a similarity between the query codes and codes corresponding to the image blocks extracted by other frames in the same sample video, wherein the higher the similarity between the query codes and the codes corresponding to the first key-value image blocks, the lower the similarity between the query codes and the codes corresponding to the image blocks extracted by other frames in the same sample video, and the smaller the value of the second contrast loss function. The loss function of the visual feature extraction model further includes a second contrast loss function.
In some embodiments, the image blocks extracted from other frames in the same sample video include one image block extracted from each of two other frames in the same sample video, as the second key value image block and the third key value image block corresponding to the sample video. The loss function determining module 630 is configured to determine, for each sample video, an intra-frame loss function corresponding to the sample video according to the similarity of the query code to the first key value code corresponding to the first key value image block and the similarities of the query code to the second key value code corresponding to the second key value image block and the third key value code corresponding to the third key value image block, respectively; and to determine a second contrast loss function according to the intra-frame loss function corresponding to each sample video.
In some embodiments, the intra-frame loss function for each sample video is determined using the following formula:

$$\mathcal{L}_{intra} = -\log \frac{\exp(s_q \cdot k_1 / \tau)}{\exp(s_q \cdot k_1 / \tau) + \exp(s_q \cdot k_2 / \tau) + \exp(s_q \cdot k_3 / \tau)}$$

wherein $s_q$ is the query code, $k_1$ is the first key value code, $k_2$ is the second key value code, $k_3$ is the third key value code, and $\tau$ is a hyper-parameter.
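A sketch of this intra-frame loss under the same dot-product similarity assumption (the temperature value is an assumption):

```python
import torch
import torch.nn.functional as F

def intra_frame_loss(q, k1, k2, k3, tau=0.07):
    """Intra-frame instance discrimination loss for one sample video.

    q:  (D,) query code from the anchor frame
    k1: (D,) first key value code (same frame as the query -> positive)
    k2, k3: (D,) key value codes from the two other frames (negatives)
    """
    logits = torch.stack([q @ k1, q @ k2, q @ k3]).unsqueeze(0) / tau  # positive first
    target = torch.zeros(1, dtype=torch.long)  # index 0 marks the positive term
    return F.cross_entropy(logits, target)
```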
In some embodiments, the anchor frame is a chronologically first frame or a last frame in the multi-frame image. The loss function determining module 630 is further configured to combine, for each sample video, the query codes and codes corresponding to image blocks extracted from other frames in the same sample video into a sequence code according to a preset order; inputting the sequence codes into a classification model to obtain the prediction time sequence of the query image block and the image blocks extracted from other frames in the same sample video in the sample video; determining a third loss function according to the prediction time sequence corresponding to each sample video and the real time sequence of the query image block and the image blocks extracted from other frames in the same sample video in the sample video; the loss function of the visual feature extraction model further includes a third contrast loss function.
In some embodiments, the extracted image block further includes another image block extracted from the anchor frame and different from the query image block, as a first key image block, and one image block is extracted from two other frames of the same sample video, respectively, as a second key image block and a third key image block. The loss function determining module 630 is configured to generate a sequence code according to the query code, the second key value code corresponding to the second key value image block, and the order of the third key value code corresponding to the third key value image block; inputting the sequence codes into a classification model to obtain the results of the query image blocks before or after the second key value image block and the third key value image block as a prediction time sequence; and determining a cross entropy loss function corresponding to each sample video according to the prediction time sequence and the real time sequence of the query image block, the second key value image block and the third key value image block in the sample video, and determining a third loss function according to the cross entropy loss function corresponding to each sample video.
In some embodiments, the cross entropy loss function for each sample video is determined using the following formula:
$$\mathcal{L}_{ce} = -\left[ y \log p + (1 - y) \log(1 - p) \right]$$

wherein $s_q$ is the query code, $k_2$ is the second key value code, $k_3$ is the third key value code, $p$ is the probability output by the classification model for the sequence code formed by $s_q$, $k_2$ and $k_3$, and $y \in \{0, 1\}$ indicates whether, in the real time order of the sample video, the query code $s_q$ is before or after the second key value code $k_2$ and the third key value code $k_3$.
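A sketch of the classification model and the cross entropy loss of the time sequence verification task; the hidden width of the classifier is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OrderClassifier(nn.Module):
    """Predicts whether the query frame comes before or after the other two frames."""

    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 * dim, dim),
            nn.ReLU(inplace=True),
            nn.Linear(dim, 1),
        )

    def forward(self, q, k2, k3):
        seq_code = torch.cat([q, k2, k3], dim=-1)  # sequence code in the preset order
        return self.net(seq_code)                  # single logit for "before/after"

def order_loss(logit, y):
    # y: float tensor with the same shape as logit, holding 0. or 1. (real time order)
    return F.binary_cross_entropy_with_logits(logit, y)
```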
In some embodiments, the loss function of the visual feature extraction model is a weighted result of the first contrast loss function, the second contrast loss function, and the third loss function.
The parameter adjustment module 640 is configured to adjust parameters of the visual feature extraction model according to a loss function of the visual feature extraction model, wherein the loss function of the visual feature extraction model comprises a first contrast loss function.
In some embodiments, the visual feature extraction model includes a query encoder for obtaining a query encoding and a key encoder for obtaining encodings corresponding to image blocks other than the query image block. The parameter adjustment module 640 is configured to adjust parameters of the current iteration of the query encoder according to the loss function of the visual feature extraction model in each iteration, and adjust parameters of the current iteration of the key encoder according to parameters of the last iteration of the query encoder and parameters of the last iteration of the key encoder.
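A MoCo-style momentum update is one way to realize this adjustment of the key value encoder; the momentum coefficient m = 0.999 is an assumption.

```python
import torch

@torch.no_grad()
def momentum_update(query_encoder, key_encoder, m=0.999):
    """Move the key value encoder parameters toward the query encoder parameters."""
    for q_param, k_param in zip(query_encoder.parameters(),
                                key_encoder.parameters()):
        k_param.data.mul_(m).add_(q_param.data, alpha=1 - m)
```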
The present disclosure also provides an action recognition device, described below in connection with fig. 7.
Fig. 7 is a block diagram of some embodiments of the motion recognition device of the present disclosure. As shown in fig. 7, the apparatus 70 of this embodiment includes: the system comprises an extraction module 710, an encoding module 720 and an action classification module 730.
The extraction module 710 is configured to extract a first preset number of frames from the video to be identified.
The encoding module 720 is configured to determine the encoding of each frame of images using the visual feature extraction model obtained by the training method of any of the previous embodiments.
The motion classification module 730 is configured to input the codes of the images of each frame into a motion classification model to obtain the motion type in the video to be identified.
The present disclosure also provides a behavior recognition apparatus, described below in connection with fig. 8.
Fig. 8 is a block diagram of some embodiments of a behavior recognition apparatus of the present disclosure. As shown in fig. 8, the apparatus 80 of this embodiment includes: the system comprises an extraction module 810, an encoding module 820 and a behavior classification module 830.
The extraction module 810 is configured to extract the video to be identified by a second preset number of frames.
The encoding module 820 is configured to determine the encoding of each frame of images using the visual feature extraction model obtained by the training method of any of the previous embodiments.
The behavior classification module 830 is configured to input the codes of the images of each frame into a behavior classification model to obtain the behavior type in the video to be identified.
The present disclosure also provides an object tracking device, described below in connection with fig. 9.
Fig. 9 is a block diagram of some embodiments of an object tracking device of the present disclosure. As shown in fig. 9, the apparatus 90 of this embodiment includes: encoding module 910, object tracking module 920.
The encoding module 910 is configured to determine an encoding of each frame image of the video to be identified by using the visual feature extraction model obtained by the training method of any of the foregoing embodiments, where the first frame image of the video to be identified is labeled with the location information of the target.
The object tracking module 920 is configured to input the encoding of each frame of image into the object tracking model to obtain the position information of the target in each frame of image.
The present disclosure also provides a feature extraction apparatus of a video, including: an extraction module configured to extract a third preset number of frames from the video; and the coding module is configured to determine the coding of each frame of image by using the visual characteristic extraction model obtained by the training method of any embodiment. Optionally, the apparatus may further comprise a feature determining module configured to determine a feature of the video from the encoding of each frame of image.
The electronic devices in embodiments of the present disclosure may each be implemented by various computing devices or computer systems, described below in conjunction with fig. 10 and 11.
Fig. 10 is a block diagram of some embodiments of the disclosed electronic device. As shown in fig. 10, the electronic apparatus 100 of this embodiment includes: a memory 1010 and a processor 1020 coupled to the memory 1010, the processor 1020 being configured to perform the training method, the action recognition method, the behavior recognition method, the object tracking method, the feature extraction method of the video in any of the embodiments of the present disclosure based on instructions stored in the memory 1010.
The memory 1010 may include, for example, system memory, fixed nonvolatile storage media, and the like. The system memory stores, for example, an operating system, application programs, boot Loader (Boot Loader), database, and other programs.
Fig. 11 is a block diagram of other embodiments of the electronic device of the present disclosure. As shown in fig. 11, the electronic device 110 of this embodiment includes: memory 1110 and processor 1120 are similar to memory 1010 and processor 1020, respectively. Input/output interfaces 1130, network interfaces 1140, storage interfaces 1150, and the like may also be included. These interfaces 1130, 1140, 1150 and the memory 1110 and the processor 1120 may be connected by, for example, a bus 1160. The input/output interface 1130 provides a connection interface for input/output devices such as a display, a mouse, a keyboard, a touch screen, and the like. The network interface 1140 provides a connection interface for various networking devices, such as may be connected to a database server or cloud storage server. The storage interface 1150 provides a connection interface for external storage devices such as SD cards, U discs, and the like.
It will be appreciated by those skilled in the art that embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each flowchart and/or block of the flowchart illustrations and/or block diagrams, and combinations of flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing description covers merely preferred embodiments of the present disclosure and is not intended to limit the disclosure; any modification, equivalent replacement, improvement or the like made within the spirit and principles of the present disclosure shall fall within the protection scope of the present disclosure.

Claims (23)

1. A training method, comprising:
selecting a plurality of frames of images of each sample video, respectively extracting image blocks from the plurality of frames of images, and taking one image block in the extracted image blocks as a query image block;
Inputting each image block into a visual feature extraction model to obtain codes corresponding to each image block, wherein the codes corresponding to the query image blocks are used as query codes;
determining a first contrast loss function according to the similarity between the query codes of each sample video and codes corresponding to other image blocks in the same sample video and the similarity between the query codes of each sample video and codes corresponding to image blocks in different sample videos, wherein the higher the similarity between the query codes and codes corresponding to other image blocks in the same sample video is, the lower the similarity between the query codes and codes corresponding to image blocks in different sample videos is, and the smaller the value of the first contrast loss function is;
and adjusting parameters of the visual feature extraction model according to a loss function of the visual feature extraction model, and training the visual feature extraction model, wherein the loss function of the visual feature extraction model comprises the first contrast loss function.
2. The training method according to claim 1, wherein a frame in which the query image block is located is taken as an anchor frame, and another image block different from the query image block extracted from the anchor frame is further included as a first key image block, and the method further includes:
Determining a second contrast loss function according to the similarity between the query codes of each sample video and the codes corresponding to the first key value image blocks and the similarity between the query codes and the codes corresponding to the image blocks extracted by other frames in the same sample video, wherein the higher the similarity between the query codes and the codes corresponding to the first key value image blocks is, the lower the similarity between the query codes and the codes corresponding to the image blocks extracted by other frames in the same sample video is, and the smaller the value of the second contrast loss function is;
wherein the loss function of the visual feature extraction model further comprises a second contrast loss function.
3. The training method according to claim 1 or 2, wherein the frame in which the query image block is located is used as an anchor frame, the anchor frame being a first frame or a last frame in chronological order in the multi-frame image, the method further comprising:
for each sample video, combining the query codes and codes corresponding to image blocks extracted from other frames in the same sample video into a sequence code according to a preset sequence;
inputting the sequence codes into a classification model to obtain the prediction time sequence of the query image block and the image blocks extracted from other frames in the same sample video in the sample video;
Determining a third loss function according to the prediction time sequence corresponding to each sample video and the real time sequence of the query image block and the image blocks extracted from other frames in the same sample video in the sample video;
wherein the loss function of the visual feature extraction model further comprises a third contrast loss function.
4. The training method of claim 1, wherein the visual feature extraction model includes a query encoder for obtaining the query code and a key encoder for obtaining codes corresponding to image blocks other than the query image block;
the adjusting parameters of the visual feature extraction model according to the loss function of the visual feature extraction model comprises:
in each iteration, the parameter of the current iteration of the query encoder is adjusted according to the loss function of the visual feature extraction model, and the parameter of the current iteration of the key encoder is adjusted according to the parameter of the last iteration of the query encoder and the parameter of the last iteration of the key encoder.
5. The training method according to claim 1, wherein the frame where the query image block is located is taken as an anchor frame, the extracted image blocks further include another image block different from the query image block extracted from the anchor frame as a first key value image block, and one image block is extracted from each of two other frames of the same sample video as a second key value image block and a third key value image block, respectively;
Determining a first contrast loss function according to the similarity between the query codes of each sample video and codes corresponding to other image blocks in the same sample video and the similarity between the query codes of each sample video and codes corresponding to image blocks in different sample videos, wherein the determining the first contrast loss function comprises:
for each sample video, determining an interframe loss function corresponding to the sample video according to the similarity of the query codes corresponding to the first key value code, the second key value code corresponding to the second key value image block and the third key value code corresponding to the third key value image block respectively and the similarity of the query codes corresponding to each negative key value code respectively, wherein each negative key value code comprises the first key value code, the second key value code and the third key value code corresponding to other sample videos;
and determining a first contrast loss function according to the interframe loss function corresponding to each sample video.
6. The training method according to claim 2, wherein the image blocks extracted from other frames in the same sample video include extracting one image block from two other frames in the same sample video respectively as a second key value image block and a third key value image block corresponding to the sample video;
Determining a second contrast loss function according to the similarity between the query codes of the sample videos and the codes corresponding to the first key-value image blocks and the similarity between the query codes and the codes corresponding to the image blocks extracted from other frames in the same sample video comprises:
for each sample video, determining an intra-frame loss function corresponding to the sample video according to the similarity of the query code and the first key value code corresponding to the first key value image block and the similarity of the query code and the second key value code corresponding to the second key value image block and the third key value code corresponding to the third key value image block respectively;
and determining the second contrast loss function according to the intra-frame loss function corresponding to each sample video.
7. A training method according to claim 3, wherein the extracted image blocks further include another image block different from the query image block extracted from the anchor frame as a first key value image block, and one image block is extracted from two other frames of the same sample video as a second key value image block and a third key value image block, respectively;
the combining the query code and the codes corresponding to the image blocks extracted from other frames in the same sample video into a sequence code according to a preset sequence comprises the following steps:
Generating a sequence code according to the query code, the second key value code corresponding to the second key value image block, and the third key value code corresponding to the third key value image block;
inputting the sequence codes into a classification model, and obtaining the prediction time sequence of the query image block and the image blocks extracted from other frames in the same sample video in the sample video comprises the following steps:
inputting the sequence codes into a classification model to obtain the result of the query image block before or after the second key value image block and the third key value image block as the prediction time sequence;
determining a third loss function according to the prediction time sequence corresponding to each sample video and the real time sequence of the query image block and the image blocks extracted from other frames in the same sample video in the sample video comprises:
and determining a cross entropy loss function corresponding to each sample video according to the prediction time sequence and the real time sequence of the query image block, the second key value image block and the third key value image block in the sample video, and determining a third loss function according to the cross entropy loss function corresponding to each sample video.
8. The training method of claim 5 or 6, further comprising:
determining similarity of the query code to the first, second and third key value codes according to dot products of the query code to the first, second and third key value codes, respectively;
and determining the similarity between the query code and each negative key value code according to the dot product of the query code and each negative key value code.
9. The training method of claim 8, wherein the interframe loss function for each sample video is determined using the following formula:
$$\mathcal{L}_{inter} = -\sum_{i=1}^{3} \log \frac{\exp(s_q \cdot k_i / \tau)}{\exp(s_q \cdot k_i / \tau) + \sum_{j=1}^{K} \exp(s_q \cdot k_j^{-} / \tau)}$$

wherein $s_q$ is the query code; $1 \le i \le 3$ and $i$ is a positive integer; $k_1$, $k_2$ and $k_3$ are the first key value code, the second key value code and the third key value code, respectively; $1 \le j \le K$, $j$ is a positive integer, and $K$ is the total number of negative key value codes; $k_j^{-}$ is the j-th negative key value code; and $\tau$ is a hyper-parameter.
10. The training method of claim 8, wherein the intra-frame loss function for each sample video is determined using the following formula:
$$\mathcal{L}_{intra} = -\log \frac{\exp(s_q \cdot k_1 / \tau)}{\exp(s_q \cdot k_1 / \tau) + \exp(s_q \cdot k_2 / \tau) + \exp(s_q \cdot k_3 / \tau)}$$

wherein $s_q$ is the query code, $k_1$ is the first key value code, $k_2$ is the second key value code, $k_3$ is the third key value code, and $\tau$ is a hyper-parameter.
11. The training method of claim 7, wherein the cross entropy loss function for each sample video is determined using the following formula:
$$\mathcal{L}_{ce} = -\left[ y \log p + (1 - y) \log(1 - p) \right]$$

wherein $s_q$ is the query code, $k_2$ is the second key value code, $k_3$ is the third key value code, $p$ is the probability output by the classification model for the sequence code formed by $s_q$, $k_2$ and $k_3$, and $y \in \{0, 1\}$ indicates whether, in the real time order of the sample video, the query code $s_q$ is before or after the second key value code $k_2$ and the third key value code $k_3$.
12. The training method according to claim 3, wherein,
the loss function of the visual feature extraction model is a weighted result of the first, second, and third contrast loss functions.
13. A method of action recognition, comprising:
extracting a first preset number of frames from the video to be identified;
determining the coding of each frame of images by using a visual feature extraction model obtained by the training method according to any one of claims 1 to 12;
and inputting codes of the images of each frame into an action classification model to obtain the action type in the video to be identified.
14. A behavior recognition method, comprising:
extracting a second preset number of frames from the video to be identified;
determining the coding of each frame of images by using a visual feature extraction model obtained by the training method according to any one of claims 1 to 12;
And inputting codes of the images of each frame into a behavior classification model to obtain the behavior type in the video to be identified.
15. An object tracking method, comprising:
determining the codes of all frame images of the video to be identified by using the visual characteristic extraction model obtained by the training method according to any one of claims 1-12, wherein the position information of the target is labeled in the first frame image of the video to be identified;
and inputting the codes of the frame images into an object tracking model to obtain the position information of the object in each frame image.
16. A method for feature extraction of video, comprising:
extracting a third preset number of frames from the video;
determining the coding of each frame of images by using a visual feature extraction model obtained by the training method according to any one of claims 1-12.
17. A training device, comprising:
the extraction module is configured to select multiple frames of images of each sample video, respectively extract image blocks from the multiple frames of images, and take one image block in the extracted image blocks as a query image block;
the coding module is configured to input each image block into the visual characteristic extraction model to obtain codes corresponding to each image block, wherein the codes corresponding to the query image blocks are used as query codes;
A loss function determining module configured to determine a first contrast loss function according to the similarity between the query code of each sample video and codes corresponding to other image blocks in the same sample video, and the similarity between the query code of each sample video and codes corresponding to image blocks in different sample videos, wherein the higher the similarity between the query code and codes corresponding to other image blocks in the same sample video, the lower the similarity between the query code and codes corresponding to image blocks in different sample videos, and the smaller the value of the first contrast loss function;
and a parameter adjustment module configured to adjust parameters of the visual feature extraction model according to a loss function of the visual feature extraction model, and train the visual feature extraction model, wherein the loss function of the visual feature extraction model comprises the first contrast loss function.
18. An action recognition device, comprising:
the extraction module is configured to extract a first preset number of frames from the video to be identified;
an encoding module configured to determine an encoding of each frame of images using the visual feature extraction model obtained by the training method of any one of claims 1-12;
And the motion classification module is configured to input the codes of the images of each frame into a motion classification model to obtain the motion type in the video to be identified.
19. A behavior recognition apparatus comprising:
the extraction module is configured to extract a second preset number of frames from the video to be identified;
an encoding module configured to determine an encoding of each frame of images using the visual feature extraction model obtained by the training method of any one of claims 1-12;
and the behavior classification module is configured to input codes of the images of each frame into the behavior classification model to obtain the behavior type in the video to be identified.
20. An object tracking device, comprising:
the coding module is configured to determine the coding of each frame image of the video to be identified by using the visual characteristic extraction model obtained by the training method according to any one of claims 1-12, wherein the first frame image of the video to be identified is marked with the position information of the target;
and the object tracking module is configured to input the encoding of each frame of image into an object tracking model to obtain the position information of the target in each frame of image.
21. A feature extraction apparatus of a video, comprising:
an extraction module configured to extract a third preset number of frames from the video;
An encoding module configured to determine an encoding of each frame of images using the visual feature extraction model obtained by the training method of any one of claims 1-12.
22. An electronic device, comprising:
a processor; and
a memory coupled to the processor for storing instructions that, when executed by the processor, cause the processor to perform the training method of any one of claims 1-12, or the action recognition method of claim 13, or the behavior recognition method of claim 14, or the object tracking method of claim 15, or the feature extraction method of the video of claim 16.
23. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the program when executed by a processor implements the training method of any one of claims 1-12, or the action recognition method of claim 13, or the behavior recognition method of claim 14, or the object tracking method of claim 15, or the feature extraction method of the video of claim 16.
CN202010763380.XA 2020-07-31 2020-07-31 Training method, training device, electronic equipment and computer readable storage medium Active CN112307883B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010763380.XA CN112307883B (en) 2020-07-31 2020-07-31 Training method, training device, electronic equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN112307883A CN112307883A (en) 2021-02-02
CN112307883B true CN112307883B (en) 2023-11-07

Family

ID=74483267

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113239855B (en) * 2021-05-27 2023-04-18 抖音视界有限公司 Video detection method and device, electronic equipment and storage medium
CN113673201A (en) * 2021-07-15 2021-11-19 北京三快在线科技有限公司 Text representation vector generation method and device, storage medium and electronic equipment
CN114283350A (en) * 2021-09-17 2022-04-05 腾讯科技(深圳)有限公司 Visual model training and video processing method, device, equipment and storage medium
CN114020950B (en) * 2021-11-03 2023-04-28 北京百度网讯科技有限公司 Training method, device, equipment and storage medium for image retrieval model

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2109047A1 (en) * 2008-04-07 2009-10-14 Global Digital Technologies SA Video characterization, identification and search system
WO2010011344A1 (en) * 2008-07-23 2010-01-28 Ltu Technologies S.A.S. Frame based video matching
CN104166685A (en) * 2014-07-24 2014-11-26 北京捷成世纪科技股份有限公司 Video clip detecting method and device
JP2016014990A (en) * 2014-07-01 2016-01-28 学校法人早稲田大学 Moving image search method, moving image search device, and program thereof
CN111026915A (en) * 2019-11-25 2020-04-17 Oppo广东移动通信有限公司 Video classification method, video classification device, storage medium and electronic equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4453768B2 (en) * 2008-04-15 2010-04-21 ソニー株式会社 Information processing apparatus and method, and program

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Momentum Contrast for Unsupervised Visual Representation Learning;He 等;arXiv;全文 *
Representation learning with contrastive predictive coding;Oord 等;arXiv;全文 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant