CN108235116B - Feature propagation method and apparatus, electronic device, and medium

Info

Publication number
CN108235116B
CN108235116B (application CN201711455916.6A)
Authority
CN
China
Prior art keywords
current frame
frame
low
feature
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711455916.6A
Other languages
Chinese (zh)
Other versions
CN108235116A (en)
Inventor
石建萍 (Shi Jianping)
李玉乐 (Li Yule)
林达华 (Lin Dahua)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co Ltd filed Critical Beijing Sensetime Technology Development Co Ltd
Priority to CN201711455916.6A priority Critical patent/CN108235116B/en
Publication of CN108235116A publication Critical patent/CN108235116A/en
Application granted granted Critical
Publication of CN108235116B publication Critical patent/CN108235116B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H04N21/44008: Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • G06V20/40: Scenes; Scene-specific elements in video content
    • G06V20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • H04N21/23418: Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics
    • H04N21/433: Content storage operation, e.g. storage operation in response to a pause request, caching operations
    • H04N21/845: Structuring of content, e.g. decomposing content into time segments

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the invention disclose a feature propagation method and apparatus, an electronic device, and a medium. The method includes: determining whether a current frame is a key frame; and, in response to the current frame being a non-key frame in a video, obtaining high-level features of the current frame from the high-level features of the adjacent preceding key frame, according to the low-level features of that key frame and the low-level features of the current frame. In the neural network, the first network layer from which the low-level features of the preceding key frame are extracted is shallower than the second network layer from which its high-level features are extracted. By exploiting the consistency between video frames and the similarity of semantic labels across adjacent frames, the embodiments propagate semantic features from the adjacent preceding key frame to the current frame, which reduces redundant computation time and improves the accuracy of semantic segmentation.

Description

Feature propagation method and apparatus, electronic device, and medium
Technical Field
The present invention relates to computer vision technology, and in particular, to a feature propagation method and apparatus, an electronic device, a program, and a medium.
Background
Video semantic segmentation is an important problem in computer vision and video semantic understanding. Video semantic segmentation models have important applications in many fields, such as autonomous driving, video surveillance, and video object analysis.
Currently, while semantic segmentation of images has been studied extensively, semantic segmentation of videos has received comparatively little attention. Video semantic segmentation requires high real-time performance while still guaranteeing sufficient accuracy.
Disclosure of Invention
Embodiments of the present invention provide a technical solution for feature propagation in videos.
According to an aspect of the embodiments of the present invention, there is provided a feature propagation method, including:
determining whether a current frame is a key frame;
in response to the current frame being a non-key frame in a video, obtaining high-level features of the current frame from the high-level features of a preceding key frame adjacent to the current frame, according to the low-level features of that key frame and the low-level features of the current frame; where, in the neural network, the first network layer from which the low-level features of the preceding key frame are extracted is shallower than the second network layer from which the high-level features of the preceding key frame are extracted.
Optionally, in any one of the above method embodiments of the present invention, obtaining the high-level features of the current frame from the high-level features of the preceding key frame, according to the low-level features of the preceding key frame adjacent to the current frame and the low-level features of the current frame, includes:
obtaining conversion weights for transforming the low-level features of the preceding key frame into the low-level features of the current frame, according to those two sets of low-level features; and
converting the high-level features of the preceding key frame into the high-level features of the current frame according to the high-level features of the preceding key frame and the conversion weights.
Optionally, in any of the above method embodiments of the present invention, in response to the current frame being a non-key frame in a video, the method further includes:
performing semantic segmentation on the current frame based on at least the high-level features of the current frame to obtain a semantic label of the current frame.
Optionally, in any one of the above method embodiments of the present invention, performing semantic segmentation on the current frame based on at least the high-level features of the current frame includes:
performing semantic segmentation on the current frame based on the low-level features and the high-level features of the current frame to obtain a semantic label of the current frame.
Optionally, in any one of the above method embodiments of the present invention, performing semantic segmentation on the current frame based on the low-level features and the high-level features of the current frame includes:
converting the low-level features of the current frame into features whose channel count matches that of the high-level features of the current frame;
concatenating or fusing the converted features with the high-level features of the current frame to obtain the current-frame features; and
performing semantic segmentation on the current frame based on the current-frame features.
Optionally, in any one of the method embodiments of the present invention, determining whether the current frame is a key frame includes:
determining whether the current frame is a key frame using a key frame scheduling policy.
Optionally, in any one of the method embodiments of the present invention, determining whether the current frame is a key frame using a key frame scheduling policy includes: determining whether the current frame is a key frame using a fixed-length scheduling method;
in response to the current frame being a non-key frame in a video, the method further includes: performing feature extraction on the current frame to obtain the low-level features of the current frame.
Optionally, in any of the method embodiments of the present invention, determining whether the current frame is a key frame using a key frame scheduling policy includes:
performing feature extraction on the current frame to obtain the low-level features of the current frame;
obtaining a scheduling probability value of the current frame being scheduled as a key frame, according to the low-level features of the preceding key frame and the low-level features of the current frame; and
determining whether the current frame is scheduled as a key frame according to its scheduling probability value.
Optionally, in any of the above method embodiments of the present invention, obtaining the scheduling probability value of the current frame being scheduled as a key frame, according to the low-level features of the preceding key frame and the low-level features of the current frame, includes:
concatenating the low-level features of the preceding key frame and the low-level features of the current frame to obtain concatenated features; and
obtaining, through a key frame scheduling network and based on the concatenated features, the scheduling probability value of whether the current frame should be scheduled as a key frame.
Optionally, in any of the above method embodiments of the present invention, the method further includes:
in response to the current frame being a key frame in a video, performing feature extraction on the current frame to obtain and cache the low-level features of the current frame; and
performing feature extraction on the low-level features of the current frame to obtain and cache the high-level features of the current frame.
Optionally, in any of the above method embodiments of the present invention, the method further includes:
in response to the current frame being a key frame in the video, performing semantic segmentation on the current frame based on its high-level features to obtain a semantic label of the current frame.
According to another aspect of the embodiments of the present invention, there is provided a feature propagation apparatus, including:
a judging module, configured to determine whether the current frame is a key frame; and
a feature propagation module, configured to, in response to the judging module determining that the current frame is a non-key frame in a video, obtain the high-level features of the current frame from the high-level features of a preceding key frame adjacent to the current frame, according to the low-level features of that key frame and the low-level features of the current frame; where, in the neural network, the first network layer from which the low-level features of the preceding key frame are extracted is shallower than the second network layer from which the high-level features of the preceding key frame are extracted.
Optionally, in any one of the apparatus embodiments of the present invention, the feature propagation module is specifically configured to:
obtain conversion weights for transforming the low-level features of the preceding key frame into the low-level features of the current frame, according to those two sets of low-level features; and
convert the high-level features of the preceding key frame into the high-level features of the current frame according to the high-level features of the preceding key frame and the conversion weights.
Optionally, in any one of the apparatus embodiments of the present invention, the apparatus further includes:
a semantic segmentation module, configured to, in response to the judging module determining that the current frame is a non-key frame in the video, perform semantic segmentation on the current frame based on at least its high-level features to obtain a semantic label of the current frame.
Optionally, in any one of the apparatus embodiments of the present invention, when performing semantic segmentation on the current frame based on at least its high-level features, the semantic segmentation module is specifically configured to: perform semantic segmentation on the current frame based on the low-level features and the high-level features of the current frame.
Optionally, in any one of the apparatus embodiments of the present invention, when performing semantic segmentation on the current frame based on the low-level features and the high-level features of the current frame, the semantic segmentation module is specifically configured to:
convert the low-level features of the current frame into features whose channel count matches that of the high-level features of the current frame;
concatenate or fuse the converted features with the high-level features of the current frame to obtain the current-frame features; and
perform semantic segmentation on the current frame based on the current-frame features.
Optionally, in any one of the apparatus embodiments of the present invention, the judging module is specifically configured to determine whether the current frame is a key frame using a key frame scheduling policy.
Optionally, in any one of the apparatus embodiments of the present invention, the judging module is specifically configured to determine whether the current frame is a key frame using a fixed-length scheduling method;
the apparatus further includes:
a first feature extraction module, configured to, in response to the judging module determining that the current frame is a non-key frame in the video, perform feature extraction on the current frame to obtain its low-level features.
Optionally, in any one of the apparatus embodiments of the present invention, the apparatus further includes:
a first feature extraction module, configured to perform feature extraction on the current frame to obtain its low-level features; and
an acquisition module, configured to obtain the scheduling probability value of the current frame being scheduled as a key frame, according to the low-level features of the adjacent preceding key frame and the low-level features of the current frame;
the judging module is specifically configured to determine whether the current frame is scheduled as a key frame according to its scheduling probability value.
Optionally, in any one of the apparatus embodiments of the present invention, the acquisition module includes:
a concatenation unit, configured to concatenate the low-level features of the preceding key frame and the low-level features of the current frame to obtain concatenated features; and
a key frame scheduling network, configured to obtain, based on the concatenated features, the scheduling probability value of whether the current frame should be scheduled as a key frame.
Optionally, in any one of the apparatus embodiments of the present invention, the first feature extraction module is further configured to, in response to the judging module determining that the current frame is a key frame in the video, perform feature extraction on the current frame to obtain and cache its low-level features;
the apparatus further includes:
a second feature extraction module, configured to perform feature extraction on the low-level features of the key frame to obtain and cache its high-level features.
Optionally, in any one of the apparatus embodiments of the present invention, the semantic segmentation module is further configured to, in response to the judging module determining that the current frame is a key frame in the video, perform semantic segmentation on the current frame based on its high-level features to obtain a semantic label of the current frame.
According to still another aspect of the embodiments of the present invention, there is provided an electronic device, including the feature propagation apparatus according to any of the above embodiments of the present invention.
According to another aspect of the embodiments of the present invention, there is provided another electronic device, including:
a processor and the feature propagation apparatus according to any of the above embodiments of the present invention;
where, when the processor runs the feature propagation apparatus, the units in that apparatus are run.
According to still another aspect of the embodiments of the present invention, there is provided still another electronic device including: a processor and a memory;
the memory is used for storing at least one executable instruction, and the executable instruction enables the processor to execute the operation of each step in the feature propagation method according to any one of the above embodiments of the invention.
According to yet another aspect of the embodiments of the present invention, there is provided a computer program, including computer readable code, characterized in that when the computer readable code runs on a device, a processor in the device executes instructions for implementing the steps in the feature propagation method according to any one of the above-mentioned embodiments of the present invention.
According to another aspect of the embodiments of the present invention, there is provided a computer-readable medium for storing computer-readable instructions, which when executed, implement the operations of the steps in the feature propagation method according to any one of the above-mentioned embodiments of the present invention.
Based on the feature propagation method and apparatus, the electronic device, the program, and the medium provided in the above embodiments of the present invention, when the current frame is a non-key frame in a video, the high-level features of the current frame are obtained from the high-level features of the adjacent preceding key frame, according to the low-level features of that key frame and the low-level features of the current frame, so that semantic segmentation can be performed on the non-key frame based on those high-level features. The embodiments exploit the consistency between video frames and the similarity of semantic labels across adjacent frames to propagate the high-level features used for video semantic segmentation from the adjacent preceding key frame to the current frame, so that these features need not be extracted frame by frame from the consecutive frames of the video; compared with frame-by-frame extraction, this reduces redundant computation time. In addition, the embodiments propagate the high-level features of the preceding key frame to the current frame for semantic segmentation, rather than directly propagating semantic labels; compared with propagating the key frame's semantic labels by optical flow, this improves the accuracy of semantic segmentation.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description, serve to explain the principles of the invention.
The invention will be more clearly understood from the following detailed description, taken with reference to the accompanying drawings, in which:
FIG. 1 is a flow chart of one embodiment of a feature propagation method of the present invention.
FIG. 2 is a flow chart of another embodiment of a feature propagation method of the present invention.
FIG. 3 is a flow chart of yet another embodiment of a feature propagation method of the present invention.
FIG. 4 is a schematic structural diagram of an embodiment of a feature propagation apparatus of the present invention.
FIG. 5 is a schematic structural diagram of another embodiment of a feature propagation apparatus of the present invention.
FIG. 6 is a schematic structural diagram of yet another embodiment of a feature propagation apparatus of the present invention.
FIG. 7 is a schematic structural diagram of an embodiment of an electronic device of the present invention.
Detailed Description
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless specifically stated otherwise.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
Embodiments of the invention are operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known computing systems, environments, and/or configurations that may be suitable for use with the computer system/server include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, networked personal computers, minicomputer systems, mainframe computer systems, distributed cloud computing environments that include any of the above, and the like.
The computer system/server may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc. that perform particular tasks or implement particular abstract data types. The computer system/server may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.
In the process of implementing the present invention, the inventors found through research that one existing video semantic segmentation method directly applies an image semantic segmentation model to a video; because consecutive video frames carry a great deal of redundant information that frame-by-frame processing does not exploit, the computational complexity of this approach is high. Another video semantic segmentation method uses optical flow to propagate features from key frames to non-key frames: a deep neural network computes the semantic labels of a key frame, a small network computes the optical flow between the key frame and the current frame (i.e., per-pixel displacement vectors), and the semantic labels are then propagated from the key frame to the current frame through the optical flow, that is, the key frame's semantic labels are warped according to the per-pixel displacement vectors to obtain the semantic labels of the current frame. However, because of camera shake and the blur caused by object motion in the video, the computed optical flow is inaccurate, which lowers the semantic segmentation accuracy.
FIG. 1 is a flow chart of one embodiment of a feature propagation method of the present invention. As shown in FIG. 1, the feature propagation method of this embodiment includes:
102, determining whether the current frame is a key frame.
For example, a key frame scheduling policy may be utilized to determine whether the current frame is a key frame.
104, in response to the current frame being a non-key frame in the video, obtaining the high-level features of the current frame from the high-level features of the preceding key frame, according to the low-level features of the preceding key frame adjacent to the current frame and the low-level features of the current frame.
In the neural network, the first network layer used to extract the low-level features of the preceding key frame and of the current frame is shallower than the second network layer that takes those low-level features as input and produces the high-level features.
In each embodiment of the present invention, the neural network includes two or more network layers with different network depths, and the network layers used for feature extraction may be called feature layers. After the neural network receives a frame, the first feature layer extracts features of the input frame and passes them to the second feature layer; from the second feature layer onward, each feature layer in turn extracts features from its input and passes the result to the next layer, until the features used for semantic segmentation are obtained. The network depth of the feature layers runs from shallow to deep following the order of feature extraction, and the feature layers can be divided by depth into low-level feature layers and high-level feature layers, i.e., the first network layer and the second network layer. The features finally output by the sequence of low-level feature layers are called low-level features, and the features finally output by the sequence of high-level feature layers are called high-level features. Compared with a shallower feature layer in the same neural network, a deeper feature layer has a larger receptive field and captures more spatial structure information, so its features give more accurate semantic segmentation; however, the deeper the network, the higher the computational difficulty and complexity. In practical applications, the feature layers of a neural network may be divided into low-level and high-level feature layers according to a preset criterion, such as computational cost, and the criterion may be adjusted to actual requirements. For example, for a neural network including 100 sequentially connected feature layers, the first 30 (or some other number of) layers, i.e., layers 1 to 30, may be taken as the low-level feature layers, and the remaining 70 layers, i.e., layers 31 to 100, as the high-level feature layers. As another example, a Pyramid Scene Parsing Network (PSPNet) may include four convolutional sub-networks (conv1 to conv4) and a classification layer, each sub-network in turn comprising a plurality of convolutional layers. Based on computational cost, the convolutional layers from conv1 to conv4_3 of the PSPNet may serve as the low-level feature layers, accounting for about 1/8 of the PSPNet's computation, and the layers from conv4_4 up to the final classification layer may serve as the high-level feature layers, accounting for about 7/8 of its computation; the classification layer performs semantic segmentation on the high-level features output by the high-level feature layers.
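For concreteness, such a low/high split can be sketched as follows. This is a minimal illustration assuming a PyTorch-style list of layers; the names (SplitBackbone, split_index) and the split point of 30 are hypothetical, not prescribed by the embodiments:

```python
import torch.nn as nn

class SplitBackbone(nn.Module):
    """Illustrative split of a feature-extraction network into a shallow
    low-level stage (first network layer group) and a deep high-level
    stage (second network layer group)."""

    def __init__(self, feature_layers, split_index=30):
        super().__init__()
        # Layers 1..split_index form the low-level feature layers.
        self.low_layers = nn.Sequential(*feature_layers[:split_index])
        # The remaining, deeper layers form the high-level feature layers.
        self.high_layers = nn.Sequential(*feature_layers[split_index:])

    def extract_low(self, frame):
        # Cheap stage (about 1/8 of the total in the PSPNet example).
        return self.low_layers(frame)

    def extract_high(self, low_features):
        # Expensive stage (about 7/8 of the total in the PSPNet example).
        return self.high_layers(low_features)
```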
Based on the feature propagation method provided by the above embodiment of the present invention, when the current frame is a non-key frame in the video, the high-level features of the current frame are obtained from the high-level features of the adjacent preceding key frame, according to the low-level features of that key frame and the low-level features of the current frame, so that semantic segmentation can be performed on the non-key frame based on those high-level features. The embodiments exploit the consistency between video frames and the similarity of semantic labels across adjacent frames to propagate the high-level features used for video semantic segmentation from the adjacent preceding key frame to the current frame, so that these features need not be extracted frame by frame from the consecutive frames of the video; compared with frame-by-frame extraction, this reduces redundant computation time. In addition, the embodiments propagate the high-level features of the preceding key frame to the current frame for semantic segmentation, rather than directly propagating semantic labels; compared with propagating the key frame's semantic labels by optical flow, this improves the accuracy of semantic segmentation.
In one implementation of the embodiments of the present invention, in operation 104, obtaining the high-level features of the current frame from the high-level features of the preceding key frame, according to the low-level features of the preceding key frame adjacent to the current frame and the low-level features of the current frame, may include:
obtaining conversion weights for transforming the low-level features of the preceding key frame into the low-level features of the current frame, according to those two sets of low-level features; and
converting the high-level features of the preceding key frame into the high-level features of the current frame according to the high-level features of the preceding key frame and the conversion weights; the resulting features, propagated from the preceding key frame, are also called propagated features.
In one optional example, the conversion weights for transforming the low-level features of the preceding key frame into the low-level features of the current frame may be obtained through a plurality of convolutional layers.
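As a concrete illustration of this step, the sketch below predicts per-pixel conversion weights from the concatenated low-level features with a small stack of convolutional layers and applies them to the cached high-level features of the key frame. The exact form of the transformation (here, per-pixel, per-channel gating) and all layer sizes are assumptions for illustration only:

```python
import torch
import torch.nn as nn

class FeaturePropagation(nn.Module):
    """Hypothetical propagation module: converts the preceding key frame's
    high-level features into the current frame's high-level features using
    weights predicted from the two frames' low-level features."""

    def __init__(self, low_channels, high_channels, hidden=64):
        super().__init__()
        # "A plurality of convolutional layers" computing the conversion weights.
        self.weight_predictor = nn.Sequential(
            nn.Conv2d(2 * low_channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, high_channels, kernel_size=3, padding=1),
            nn.Sigmoid(),  # per-pixel, per-channel weights in [0, 1]
        )

    def forward(self, key_low, cur_low, key_high):
        # Conversion weights from key-frame low-level to current-frame low-level.
        weights = self.weight_predictor(torch.cat([key_low, cur_low], dim=1))
        # Assumes the low- and high-level maps share spatial resolution (as in
        # dilated backbones); otherwise the weights would be resized to key_high.
        return weights * key_high  # propagated high-level features
```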
In another embodiment of the feature propagation method of the present invention, the method may further include: in response to the current frame being a non-key frame in the video, performing semantic segmentation on the current frame based on at least its high-level features to obtain a semantic label of the current frame.
In one embodiment, performing semantic segmentation on the current frame based on at least its high-level features may include: performing semantic segmentation on the current frame based on the low-level features and the high-level features of the current frame to obtain a semantic label of the current frame.
In practical applications, the number of channels of the second network layer, from which the high-level features are extracted, is usually greater than that of the first network layer, from which the low-level features are extracted. To fuse the low-level and high-level features of the current frame, in one example, performing semantic segmentation on the current frame based on its low-level and high-level features may include:
converting the low-level features of the current frame into features whose channel count matches that of the high-level features of the current frame;
concatenating or fusing the converted features with the high-level features of the current frame to obtain the current-frame features; and
performing semantic segmentation on the current frame based on the current-frame features.
In the embodiments of the present invention, the high-level features of the current frame, obtained from the high-level features of the preceding key frame, are fused with the current frame's own low-level features for semantic segmentation, so the features of non-key frames need not be computed with an expensive single-frame model; this reduces the amount of computation while preserving semantic segmentation accuracy.
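A minimal sketch of this fusion, assuming a 1x1 convolution for the channel alignment and concatenation as the fusion operator (both illustrative choices; element-wise fusion would also fit the description):

```python
import torch
import torch.nn as nn

class FusionSegHead(nn.Module):
    """Hypothetical segmentation head: aligns the low-level features'
    channel count with the high-level features, concatenates the two,
    and classifies each pixel."""

    def __init__(self, low_channels, high_channels, num_classes):
        super().__init__()
        # Convert low-level features to match the high-level channel count.
        self.align = nn.Conv2d(low_channels, high_channels, kernel_size=1)
        self.classifier = nn.Conv2d(2 * high_channels, num_classes, kernel_size=1)

    def forward(self, cur_low, cur_high):
        aligned = self.align(cur_low)
        fused = torch.cat([aligned, cur_high], dim=1)  # the "splicing" step
        return self.classifier(fused)  # per-pixel semantic logits
```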
In addition, in another embodiment of the feature propagation method of the present invention, the high-level features of the non-key frames following the preceding key frame may also be cached. When the current frame is a non-key frame, the features obtained by converting the current frame are concatenated or fused with the high-level features of the current frame, the high-level features of the preceding key frame, and the cached high-level features of the non-key frames between the preceding key frame and the current frame, to obtain the current-frame features, and semantic segmentation is performed on the current frame based on these features.
Based on this embodiment, all cached high-level features between the preceding key frame and the current frame can be propagated to the current frame and concatenated or fused for semantic segmentation, yielding a more robust segmentation result at a very low fusion cost.
In one implementation of the embodiments of the present invention, the key frame scheduling policy may be a fixed-length scheduling method, for example designating a key frame at a fixed interval (e.g., every 5 frames); that is, whether the current frame is a key frame can be determined using the fixed-length scheduling method.
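Such a fixed-length schedule reduces to a modulo test on the frame index; a trivial sketch (the interval of 5 is only an example value):

```python
def is_key_frame_fixed(frame_index: int, interval: int = 5) -> bool:
    """Treat every `interval`-th frame (starting from frame 0) as a key frame."""
    return frame_index % interval == 0
```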
FIG. 2 is a flow chart of another embodiment of a feature propagation method of the present invention. As shown in FIG. 2, the feature propagation method of this embodiment includes:
202, determining, using a fixed-length scheduling method, whether the current frame is a key frame.
If the current frame is a key frame, operation 212 is performed; otherwise, if the current frame is a non-key frame in the video, operation 204 is performed.
204, performing feature extraction on the current frame (also called the current non-key frame) to obtain its low-level features.
In one example of the embodiments of the present invention, feature extraction may be performed on the current frame through the low-level feature layers (i.e., the first network layer) of the neural network to obtain the low-level features of the current frame.
206, obtaining conversion weights for transforming the low-level features of the preceding key frame into the low-level features of the current frame, according to the low-level features of the preceding key frame adjacent to the current frame and the low-level features of the current frame.
The conversion weights may form a transformation matrix between the two feature maps, i.e., the low-level features of the preceding key frame and those of the current frame, containing per-pixel conversion elements between the two.
208, converting the high-level features of the preceding key frame into the high-level features of the current frame according to the high-level features of the preceding key frame and the conversion weights.
210, performing semantic segmentation on the current frame based on its low-level and high-level features to obtain a semantic label of the current frame.
The flow for a non-key frame ends with the semantic segmentation in operation 210; the subsequent operations of this embodiment are not executed for it.
212, performing feature extraction on the current frame (also called the current key frame) to obtain and cache its low-level features.
In one example, feature extraction may be performed on the current frame through the low-level feature layers (i.e., the first network layer) of the neural network.
214, performing feature extraction on the low-level features of the current frame to obtain and cache its high-level features.
In one example, feature extraction on the low-level features of the current frame may be performed through the high-level feature layers (i.e., the second network layer) of the neural network.
216, performing semantic segmentation on the current frame based on its high-level features to obtain a semantic label of the current frame.
In the embodiments of the present invention, key frames and non-key frames may share the low-level feature layers of a neural network for low-level feature extraction. The neural network may be a PSPNet, which may include four convolutional sub-networks (conv1 to conv4) and a classification layer, each sub-network in turn comprising a plurality of convolutional layers. The low-level feature layers of the network may include the convolutional layers from conv1 to conv4_3 of the PSPNet, accounting for about 1/8 of its computation; the high-level feature layers may include the convolutional layers from conv4_4 up to the layer before the final classification layer, accounting for about 7/8 of its computation, and are used to extract the high-level features of key frames. The classification layer identifies the category of at least one pixel in a key frame or non-key frame based on its high-level features, thereby realizing semantic segmentation of that frame.
In each embodiment of the present invention, a single-frame model with high computational cost, such as a PSPNet, can be invoked for key frames to perform semantic segmentation, yielding high-precision results. For non-key frames, the high-level features of the key frame can be adaptively propagated to the current frame to obtain its high-level features, making full use of the consistency between consecutive video frames and avoiding redundant computation; the current frame is then semantically segmented based on its low-level and high-level features to obtain its semantic label. This preserves the segmentation accuracy of key frames while avoiding frame-by-frame segmentation of non-key frames with the expensive single-frame model, reducing computational complexity and time and saving computing resources; a sketch of this per-frame flow is given below.
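The sketch ties together the illustrative modules from the earlier examples (SplitBackbone, FeaturePropagation, FusionSegHead, and is_key_frame_fixed are the hypothetical names introduced above). For simplicity it reuses the fusion head on key frames as well, whereas the FIG. 2 flow segments key frames from their high-level features alone:

```python
def segment_video(frames, backbone, propagation, head):
    """Sketch of the FIG. 2 flow: run the full network on key frames and
    propagate cached key-frame features on non-key frames."""
    key_low = key_high = None
    labels = []
    for i, frame in enumerate(frames):
        cur_low = backbone.extract_low(frame)           # shared low-level stage
        if is_key_frame_fixed(i):
            cur_high = backbone.extract_high(cur_low)   # expensive high-level stage
            key_low, key_high = cur_low, cur_high       # cache for non-key frames
        else:
            # Propagate the cached key-frame high-level features to this frame.
            cur_high = propagation(key_low, cur_low, key_high)
        labels.append(head(cur_low, cur_high).argmax(dim=1))  # per-pixel labels
    return labels
```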
FIG. 3 is a flow chart of yet another embodiment of a feature propagation method of the present invention. As shown in FIG. 3, the feature propagation method of this embodiment includes:
302, performing feature extraction on the current frame to obtain its low-level features.
In one example of the embodiments of the present invention, feature extraction may be performed on the current frame through the low-level feature layers (i.e., the first network layer) of the neural network to obtain the low-level features of the current frame.
304, obtaining a scheduling probability value of the current frame being scheduled as a key frame, according to the low-level features of the preceding key frame adjacent to the current frame and the low-level features of the current frame.
In one example, the low-level features of the preceding key frame and those of the current frame may be concatenated, and the resulting concatenated features input into a key frame scheduling network, which, based on them, outputs the scheduling probability value of whether the current frame should be scheduled as a key frame.
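A sketch of such a key frame scheduling network, assuming a couple of strided convolutions followed by global pooling and a sigmoid output; every architectural detail here is an assumption, since the embodiments only fix the input (concatenated low-level features) and the output (a scheduling probability):

```python
import torch
import torch.nn as nn

class KeyFrameScheduler(nn.Module):
    """Hypothetical scheduling network: maps the concatenated low-level
    features of the preceding key frame and the current frame to the
    probability that the current frame should become a new key frame."""

    def __init__(self, low_channels, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * low_channels, hidden, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(hidden, 1),
            nn.Sigmoid(),
        )

    def forward(self, key_low, cur_low):
        # Scheduling probability in [0, 1]; thresholding it (e.g., at an
        # assumed 0.5) yields the key/non-key decision of operation 306.
        return self.net(torch.cat([key_low, cur_low], dim=1))
```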
306, determining, according to the scheduling probability value of the current frame, whether the current frame is scheduled as a key frame.
If the current frame is a key frame, operation 314 is performed; otherwise, if the current frame is a non-key frame in the video, operation 308 is performed.
308, obtaining conversion weights for transforming the low-level features of the preceding key frame into the low-level features of the current frame (also called the current non-key frame), according to the low-level features of the preceding key frame adjacent to the current frame and the low-level features of the current frame.
310, converting the high-level features of the preceding key frame into the high-level features of the current frame according to the high-level features of the preceding key frame and the conversion weights.
312, performing semantic segmentation on the current frame based on its low-level and high-level features to obtain a semantic label of the current frame.
Thereafter, the subsequent operations of this embodiment are not executed for this frame.
314, performing feature extraction on the current frame (also called the current key frame) to obtain and cache its low-level features.
In one example, feature extraction may be performed on the current frame through the low-level feature layers (i.e., the first network layer) of the neural network.
316, performing feature extraction on the low-level features of the current frame to obtain and cache its high-level features.
In one example, feature extraction on the low-level features of the current frame may be performed through the high-level feature layers (i.e., the second network layer) of the neural network.
318, performing semantic segmentation on the current frame based on its high-level features to obtain a semantic label of the current frame.
The embodiments of the present invention can be used in autonomous driving scenarios, video surveillance scenarios, and Internet entertainment products involving portrait segmentation, for example:
1. in autonomous driving scenarios, targets in the video, such as people and vehicles, can be segmented quickly;
2. in video surveillance scenarios, people can be segmented quickly;
3. in Internet entertainment products such as portrait segmentation, people can be segmented quickly from video frames.
Those of ordinary skill in the art will understand that all or part of the steps for implementing the above method embodiments may be completed by hardware related to program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments; the storage medium includes various media that can store program code, such as ROM, RAM, magnetic disks, or optical disks.
FIG. 4 is a schematic structural diagram of an embodiment of the feature propagation apparatus of the present invention. The feature propagation apparatus of each embodiment of the present invention can be used to implement the feature propagation methods of the embodiments described above. As shown in FIG. 4, the feature propagation apparatus of one embodiment includes a judging module and a feature propagation module. Wherein:
the judging module is configured to determine whether the current frame is a key frame;
the feature propagation module is configured to, in response to the judging module determining that the current frame is a non-key frame in the video, obtain the high-level features of the current frame from the high-level features of the preceding key frame, according to the low-level features of the preceding key frame adjacent to the current frame and the low-level features of the current frame.
In the neural network, the first network layer from which the low-level features of the preceding key frame are extracted is shallower than the second network layer from which the high-level features of the preceding key frame are extracted.
Based on the feature propagation apparatus provided in the above embodiment of the present invention, when the current frame is a non-key frame in the video, the high-level features of the current frame are obtained from the high-level features of the adjacent preceding key frame, according to the low-level features of that key frame and the low-level features of the current frame, so that semantic segmentation can be performed on the non-key frame based on those high-level features. The embodiments exploit the consistency between video frames and the similarity of semantic labels across adjacent frames to propagate the high-level features used for video semantic segmentation from the adjacent preceding key frame to the current frame, so that these features need not be extracted frame by frame from the consecutive frames of the video; compared with frame-by-frame extraction, this reduces redundant computation time. In addition, the embodiments propagate the high-level features of the preceding key frame to the current frame for semantic segmentation, rather than directly propagating semantic labels; compared with propagating the key frame's semantic labels by optical flow, this improves the accuracy of semantic segmentation.
In one embodiment, the feature propagation module is specifically configured to: obtain conversion weights for transforming the low-level features of the preceding key frame into the low-level features of the current frame, according to those two sets of low-level features; and convert the high-level features of the preceding key frame into the high-level features of the current frame according to the high-level features of the preceding key frame and the conversion weights.
FIG. 5 is a schematic structural diagram of another embodiment of the feature propagation apparatus of the present invention. As shown in FIG. 5, compared with the embodiment shown in FIG. 4, the feature propagation apparatus of this embodiment further includes: a semantic segmentation module, configured to, in response to the judging module determining that the current frame is a non-key frame in the video, perform semantic segmentation on the current frame based on at least its high-level features to obtain a semantic label of the current frame.
In one embodiment, when performing semantic segmentation on the current frame based on at least its high-level features, the semantic segmentation module is specifically configured to perform semantic segmentation on the current frame based on the low-level features and the high-level features of the current frame.
In one optional example, when performing semantic segmentation on the current frame based on its low-level and high-level features, the semantic segmentation module is specifically configured to: convert the low-level features of the current frame into features whose channel count matches that of the high-level features of the current frame; concatenate or fuse the converted features with the high-level features of the current frame to obtain the current-frame features; and perform semantic segmentation on the current frame based on the current-frame features.
In one implementation of the above feature propagation apparatus embodiments of the present invention, the judging module is specifically configured to determine whether the current frame is a key frame using a key frame scheduling policy.
In one optional example, the judging module is specifically configured to determine whether the current frame is a key frame using a fixed-length scheduling method. Accordingly, referring again to FIG. 5, the feature propagation apparatus of this further embodiment may also include: a first feature extraction module, configured to, in response to the judging module determining that the current frame is a non-key frame in the video, perform feature extraction on the current frame to obtain its low-level features.
Alternatively, referring to FIG. 6, the feature propagation apparatus of still another embodiment may further include a first feature extraction module and an acquisition module. Wherein: the first feature extraction module is configured to perform feature extraction on the current frame to obtain its low-level features; the acquisition module is configured to obtain the scheduling probability value of the current frame being scheduled as a key frame, according to the low-level features of the adjacent preceding key frame and the low-level features of the current frame. Accordingly, in this embodiment, the judging module is specifically configured to determine whether the current frame is scheduled as a key frame according to its scheduling probability value.
In one embodiment, the acquisition module may include: a concatenation unit, configured to concatenate the low-level features of the preceding key frame and those of the current frame to obtain concatenated features; and a key frame scheduling network, configured to obtain, based on the concatenated features, the scheduling probability value of whether the current frame should be scheduled as a key frame.
For example, in the feature propagation apparatus of each of the above embodiments, the first feature extraction module may be further configured to, in response to the judging module determining that the current frame is a key frame in the video, perform feature extraction on the current frame to obtain and cache its low-level features. Referring again to FIG. 5 or FIG. 6, a further embodiment of the feature propagation apparatus may also include: a second feature extraction module, configured to perform feature extraction on the low-level features of the key frame to obtain and cache its high-level features.
Optionally, in the feature propagation apparatus of each of the above embodiments, the semantic segmentation module may be further configured to, in response to the judging module determining that the current frame is a key frame in the video, perform semantic segmentation on the current frame based on its high-level features to obtain a semantic label of the current frame.
In addition, an embodiment of the present invention further provides an electronic device, including the feature propagation apparatus according to any of the above embodiments of the present invention.
In addition, another embodiment of the present invention provides an electronic device, including:
a memory for storing executable instructions; and
one or more processors in communication with the memory, configured to execute the executable instructions so as to perform the operations of the feature propagation method of any of the above embodiments of the present invention.
In addition, an embodiment of the present invention further provides yet another electronic device, including:
a processor and the feature propagation apparatus of any of the above embodiments of the present invention;
where, when the processor runs the feature propagation apparatus, the units of that apparatus are run.
Fig. 7 is a schematic structural diagram of an embodiment of an electronic device according to the present invention, suitable for implementing a terminal device or a server of an embodiment of the present application. As shown in fig. 7, the electronic device includes one or more processors, a communication section, and the like, for example: one or more central processing units (CPUs) and/or one or more graphics processing units (GPUs), which may perform various appropriate actions and processes according to executable instructions stored in a read-only memory (ROM) or loaded from a storage section into a random access memory (RAM). The communication section may include, but is not limited to, a network card, and the network card may include, but is not limited to, an IB (InfiniBand) network card. The processor may communicate with the read-only memory and/or the random access memory to execute the executable instructions, connect with the communication section through a bus, and communicate with other target devices through the communication section, so as to complete operations corresponding to any method provided by the embodiments of the present application, for example: determining whether the current frame is a key frame; and in response to the current frame being a non-key frame in a video, acquiring high-level features of the current frame according to low-level features of a previous key frame adjacent to the current frame, low-level features of the current frame, and high-level features of the previous key frame; wherein, in the neural network, the network depth of the first network layer from which the low-level features of the previous key frame are extracted is shallower than the network depth of the second network layer from which the high-level features of the previous key frame are extracted.
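To make the propagation step concrete, the following PyTorch sketch shows one plausible realization: a small network predicts, from the spliced low-level features, a set of per-position conversion weights, which are then applied to the cached high-level features of the previous key frame to approximate the high-level features of the current frame. The kernel size, the softmax normalization, and the resizing step are illustrative assumptions rather than the patented implementation:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FeaturePropagation(nn.Module):
        # Hypothetical sketch: per-position conversion weights are predicted from
        # the low-level features of the previous key frame and the current frame,
        # and applied to the key frame's cached high-level features.
        def __init__(self, low_channels: int, k: int = 3):
            super().__init__()
            self.k = k
            self.weight_net = nn.Conv2d(2 * low_channels, k * k, kernel_size=3, padding=1)

        def forward(self, key_low, cur_low, key_high):
            # Conversion weights from the spliced low-level features.
            w = F.softmax(self.weight_net(torch.cat([key_low, cur_low], dim=1)), dim=1)
            # Resize the cached high-level features to the low-level resolution
            # (a simplification made for this sketch).
            high = F.interpolate(key_high, size=key_low.shape[-2:],
                                 mode="bilinear", align_corners=False)
            # Gather k*k neighborhoods and take the weighted sum at each position.
            patches = F.unfold(high, kernel_size=self.k, padding=self.k // 2)
            b, c, h, wd = high.shape
            patches = patches.view(b, c, self.k * self.k, h, wd)
            return (patches * w.unsqueeze(1)).sum(dim=2)  # propagated high-level features

Because only the shallow low-level extraction and this lightweight propagation run on non-key frames, the deep high-level extraction can be skipped for most frames of the video.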
In addition, the RAM may also store various programs and data necessary for the operation of the apparatus. The CPU, the ROM, and the RAM are connected to one another via a bus; where a RAM is present, the ROM is an optional module. The RAM stores executable instructions, or writes executable instructions into the ROM at runtime, and the executable instructions cause the processor to execute operations corresponding to any one of the methods of the present invention. An input/output (I/O) interface is also connected to the bus. The communication section may be integrated, or may be provided as a plurality of sub-modules (e.g., a plurality of IB network cards) respectively connected to the bus.
The following components are connected to the I/O interface: an input section including a keyboard, a mouse, and the like; an output section including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), and a speaker; a storage section including a hard disk and the like; and a communication section including a network interface card such as a LAN card or a modem. The communication section performs communication processing via a network such as the Internet. A drive is also connected to the I/O interface as needed. A removable medium, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive as needed, so that a computer program read from it can be installed into the storage section as needed.
It should be noted that the architecture shown in fig. 7 is only an optional implementation; in practice, the number and types of the components in fig. 7 may be selected, deleted, added, or replaced according to actual needs. Different functional components may also be provided separately or in integrated form; for example, the GPU and the CPU may be provided separately, or the GPU may be integrated on the CPU, and the communication section may be provided separately, or may be integrated on the CPU or the GPU. These alternative embodiments all fall within the scope of the present disclosure.
In addition, an embodiment of the present invention further provides a computer storage medium for storing computer-readable instructions which, when executed, implement the operations of the feature propagation method according to any one of the above embodiments of the present invention.
In addition, an embodiment of the present invention further provides a computer program including computer-readable instructions which, when run in a device, cause a processor in the device to execute executable instructions for implementing the steps of the feature propagation method according to any one of the above embodiments of the present invention.
In an alternative embodiment, the computer program is embodied as a software product, such as a Software Development Kit (SDK).
In one or more alternative embodiments, an embodiment of the present invention further provides a computer program product for storing computer-readable instructions which, when executed, cause a computer to execute the feature propagation method described in any one of the above possible implementations.
The computer program product may be embodied in hardware, software, or a combination thereof. In one alternative example, the computer program product is embodied as a computer storage medium; in another alternative example, it is embodied as a software product, such as an SDK.
In the present specification, the embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts among the embodiments, the embodiments may be referred to one another. Since the apparatus embodiments basically correspond to the method embodiments, their description is relatively brief; for relevant details, reference may be made to the corresponding parts of the description of the method embodiments.
The method and apparatus of the present invention may be implemented in many ways, for example, in software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order of the steps of the method is for illustration only; the steps of the method of the present invention are not limited to that order unless otherwise specifically stated. Furthermore, in some embodiments, the present invention may also be embodied as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the method according to the present invention. Thus, the present invention also covers a recording medium storing a program for executing the method according to the present invention.
The description of the present invention has been presented for purposes of illustration, and is not intended to be exhaustive or to limit the invention to the forms disclosed. Many modifications and variations will be apparent to practitioners skilled in the art. The embodiments were chosen and described in order to best explain the principles of the invention and its practical application, and to enable others of ordinary skill in the art to understand the invention in its various embodiments, with the various modifications suited to the particular use contemplated.

Claims (26)

1. A feature propagation method, comprising:
determining whether a current frame is a key frame; and
in response to the current frame being a non-key frame in a video, acquiring high-level features of the current frame according to low-level features of a previous key frame adjacent to the current frame, low-level features of the current frame, and high-level features of the previous key frame; wherein, in a neural network, the network depth of a first network layer from which the low-level features of the previous key frame are extracted is shallower than the network depth of a second network layer from which the high-level features of the previous key frame are extracted.
2. The method according to claim 1, wherein the acquiring the high-level features of the current frame according to the low-level features of the previous key frame adjacent to the current frame, the low-level features of the current frame, and the high-level features of the previous key frame comprises:
acquiring a conversion weight for converting the low-level features of the previous key frame into the low-level features of the current frame, according to the low-level features of the adjacent previous key frame and the low-level features of the current frame; and
converting the high-level features of the previous key frame into the high-level features of the current frame according to the high-level features of the previous key frame and the conversion weight.
3. The method according to claim 2, wherein, in response to the current frame being a non-key frame in the video, the method further comprises:
performing semantic segmentation on the current frame based on at least the high-level features of the current frame to obtain a semantic label of the current frame.
4. The method according to claim 3, wherein the performing semantic segmentation on the current frame based on at least the high-level features of the current frame comprises:
performing semantic segmentation on the current frame based on the low-level features and the high-level features of the current frame to obtain the semantic label of the current frame.
5. The method according to claim 4, wherein the performing semantic segmentation on the current frame based on the low-level features and the high-level features of the current frame comprises:
converting the low-level features of the current frame to obtain features whose number of channels is consistent with that of the high-level features of the current frame;
splicing or fusing the converted features with the high-level features of the current frame to obtain current frame features; and
performing semantic segmentation on the current frame based on the current frame features.
6. The method according to any one of claims 1-5, wherein the determining whether the current frame is a key frame comprises:
determining whether the current frame is a key frame by using a key frame scheduling policy.
7. The method according to claim 6, wherein the determining whether the current frame is a key frame by using a key frame scheduling policy comprises: determining whether the current frame is a key frame by using a fixed-length scheduling method;
and, in response to the current frame being a non-key frame in the video, the method further comprises: performing feature extraction on the current frame to obtain the low-level features of the current frame.
8. The method according to claim 6, wherein the determining whether the current frame is a key frame by using a key frame scheduling policy comprises:
performing feature extraction on the current frame to obtain the low-level features of the current frame;
acquiring a scheduling probability value of the current frame being scheduled as a key frame, according to the low-level features of the previous key frame and the low-level features of the current frame; and
determining whether the current frame is scheduled as a key frame according to the scheduling probability value of the current frame.
9. The method according to claim 8, wherein the acquiring a scheduling probability value of the current frame being scheduled as a key frame, according to the low-level features of the previous key frame and the low-level features of the current frame, comprises:
splicing the low-level features of the previous key frame and the low-level features of the current frame to obtain spliced features; and
acquiring, by a key frame scheduling network and based on the spliced features, a scheduling probability value of whether the current frame is scheduled as a key frame.
10. The method according to claim 6, further comprising:
in response to the current frame being a key frame in the video, performing feature extraction on the current frame to obtain and cache low-level features of the current frame; and
performing feature extraction on the low-level features of the current frame to obtain and cache high-level features of the current frame.
11. The method according to any one of claims 1-5, further comprising:
in response to the current frame being a key frame in the video, performing semantic segmentation on the current frame based on the high-level features of the current frame to obtain a semantic label of the current frame.
12. A feature propagation apparatus, comprising:
a determining module, configured to determine whether a current frame is a key frame; and
a feature propagation module, configured to, in response to the determination result of the determining module indicating that the current frame is a non-key frame in a video, acquire high-level features of the current frame according to low-level features of a previous key frame adjacent to the current frame, low-level features of the current frame, and high-level features of the previous key frame; wherein, in a neural network, the network depth of a first network layer from which the low-level features of the previous key frame are extracted is shallower than the network depth of a second network layer from which the high-level features of the previous key frame are extracted.
13. The apparatus according to claim 12, wherein the feature propagation module is specifically configured to:
acquire a conversion weight for converting the low-level features of the previous key frame into the low-level features of the current frame, according to the low-level features of the previous key frame and the low-level features of the current frame; and
convert the high-level features of the previous key frame into the high-level features of the current frame according to the high-level features of the previous key frame and the conversion weight.
14. The apparatus according to claim 13, further comprising:
a semantic segmentation module, configured to, in response to the determination result of the determining module indicating that the current frame is a non-key frame in the video, perform semantic segmentation on the current frame based on at least the high-level features of the current frame to obtain a semantic label of the current frame.
15. The apparatus according to claim 14, wherein, when performing semantic segmentation on the current frame based on at least the high-level features of the current frame, the semantic segmentation module is specifically configured to: perform semantic segmentation on the current frame based on the low-level features and the high-level features of the current frame.
16. The apparatus according to claim 15, wherein, when performing semantic segmentation on the current frame based on the low-level features and the high-level features of the current frame, the semantic segmentation module is specifically configured to:
convert the low-level features of the current frame to obtain features whose number of channels is consistent with that of the high-level features of the current frame;
splice or fuse the converted features with the high-level features of the current frame to obtain current frame features; and
perform semantic segmentation on the current frame based on the current frame features.
17. The apparatus according to any one of claims 12-16, wherein the determining module is specifically configured to determine whether the current frame is a key frame by using a key frame scheduling policy.
18. The apparatus according to claim 17, wherein the determining module is specifically configured to determine whether the current frame is a key frame by using a fixed-length scheduling method;
the apparatus further comprising:
a first feature extraction module, configured to, in response to the determination result of the determining module indicating that the current frame is a non-key frame in the video, perform feature extraction on the current frame to obtain the low-level features of the current frame.
19. The apparatus according to claim 17, further comprising:
a first feature extraction module, configured to perform feature extraction on the current frame to obtain the low-level features of the current frame; and
an obtaining module, configured to obtain a scheduling probability value of the current frame being scheduled as a key frame, according to the low-level features of the adjacent previous key frame and the low-level features of the current frame;
wherein the determining module is specifically configured to determine whether the current frame is scheduled as a key frame according to the scheduling probability value of the current frame.
20. The apparatus according to claim 19, wherein the obtaining module comprises:
a splicing unit, configured to splice the low-level features of the previous key frame and the low-level features of the current frame to obtain spliced features; and
a key frame scheduling network, configured to obtain, based on the spliced features, a scheduling probability value of whether the current frame is scheduled as a key frame.
21. The apparatus according to claim 18, wherein the first feature extraction module is further configured to, in response to the determination result of the determining module indicating that the current frame is a key frame in the video, perform feature extraction on the current frame to obtain and cache low-level features of the current frame;
the apparatus further comprising:
a second feature extraction module, configured to perform feature extraction on the low-level features of the key frame to obtain and cache high-level features of the key frame.
22. The apparatus according to any one of claims 14-16, wherein the semantic segmentation module is further configured to, in response to the determination result of the determining module indicating that the current frame is a key frame in the video, perform semantic segmentation on the current frame based on the high-level features of the current frame to obtain a semantic label of the current frame.
23. An electronic device, comprising the feature propagation apparatus according to any one of claims 12-22.
24. An electronic device, comprising:
a processor and the feature propagation apparatus according to any one of claims 12-22;
wherein, when the processor runs the feature propagation apparatus, the units of the feature propagation apparatus according to any one of claims 12-22 are run.
25. An electronic device, comprising: a processor and a memory;
wherein the memory is configured to store at least one executable instruction that causes the processor to perform the operations of the steps of the feature propagation method according to any one of claims 1-11.
26. A computer-readable medium storing computer-readable instructions that, when executed, perform the operations of the steps of the feature propagation method according to any one of claims 1-11.
CN201711455916.6A 2017-12-27 2017-12-27 Feature propagation method and apparatus, electronic device, and medium Active CN108235116B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711455916.6A CN108235116B (en) 2017-12-27 2017-12-27 Feature propagation method and apparatus, electronic device, and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711455916.6A CN108235116B (en) 2017-12-27 2017-12-27 Feature propagation method and apparatus, electronic device, and medium

Publications (2)

Publication Number Publication Date
CN108235116A CN108235116A (en) 2018-06-29
CN108235116B true CN108235116B (en) 2020-06-16

Family

ID=62649228

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711455916.6A Active CN108235116B (en) 2017-12-27 2017-12-27 Feature propagation method and apparatus, electronic device, and medium

Country Status (1)

Country Link
CN (1) CN108235116B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109151615B (en) * 2018-11-02 2022-01-25 湖南双菱电子科技有限公司 Video processing method, computer device, and computer storage medium
CN111383245B (en) * 2018-12-29 2023-09-22 北京地平线机器人技术研发有限公司 Video detection method, video detection device and electronic equipment
CN109919044A * 2019-02-18 2019-06-21 清华大学 Video semantic segmentation method and device for performing feature propagation based on prediction
CN110060264B (en) * 2019-04-30 2021-03-23 北京市商汤科技开发有限公司 Neural network training method, video frame processing method, device and system
CN112465826B (en) * 2019-09-06 2023-05-16 上海高德威智能交通系统有限公司 Video semantic segmentation method and device
CN110738108A (en) * 2019-09-09 2020-01-31 北京地平线信息技术有限公司 Target object detection method, target object detection device, storage medium and electronic equipment
CN110929605A (en) * 2019-11-11 2020-03-27 中国建设银行股份有限公司 Video key frame storage method, device, equipment and storage medium
CN111062395B (en) * 2019-11-27 2020-12-18 北京理工大学 Real-time video semantic segmentation method
CN111654724B (en) * 2020-06-08 2021-04-06 上海纽菲斯信息科技有限公司 Low-bit-rate coding transmission method of video conference system
CN112016513B (en) * 2020-09-08 2024-01-30 北京达佳互联信息技术有限公司 Video semantic segmentation method, model training method, related device and electronic equipment


Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10062412B2 (en) * 2015-06-05 2018-08-28 Apple Inc. Hierarchical segmentation and quality measurement for video editing
US9818032B2 (en) * 2015-10-28 2017-11-14 Intel Corporation Automatic video summarization
CN105677735B (en) * 2015-12-30 2020-04-21 腾讯科技(深圳)有限公司 Video searching method and device
US10303984B2 (en) * 2016-05-17 2019-05-28 Intel Corporation Visual search and retrieval using semantic information

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101650728A * 2009-08-26 2010-02-17 北京邮电大学 Video high-level feature retrieval system and implementation thereof
CN103065300A * 2012-12-24 2013-04-24 安科智慧城市技术(中国)有限公司 Method and device for video labeling
CN106156747A * 2016-07-21 2016-11-23 四川师范大学 Method for extracting semantic objects from surveillance video based on behavior features
CN106934352A * 2017-02-28 2017-07-07 华南理工大学 Video description method based on a bidirectional fractal network and LSTM

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Deep Feature Flow for Video Recognition; Xizhou Zhu, Yuwen Xiong, Jifeng Dai, Lu Yuan, Yichen Wei; IEEE Conference on Computer Vision and Pattern Recognition; 2017-11-09; page 3, column 1, paragraph 7 through page 6, column 1, paragraph 6, and figures 1 and 2 *

Also Published As

Publication number Publication date
CN108235116A (en) 2018-06-29

Similar Documents

Publication Publication Date Title
CN108235116B (en) Feature propagation method and apparatus, electronic device, and medium
JP6837158B2 (en) Video identification and training methods, equipment, electronic devices and media
CN110458918B (en) Method and device for outputting information
CN108229341B (en) Classification method and device, electronic equipment and computer storage medium
CN108830235B (en) Method and apparatus for generating information
WO2018166438A1 (en) Image processing method and device and electronic device
WO2020000879A1 (en) Image recognition method and apparatus
CN111476871B (en) Method and device for generating video
US11164004B2 (en) Keyframe scheduling method and apparatus, electronic device, program and medium
US11811853B2 (en) Systems and methods for content delivery acceleration of virtual reality and augmented reality web pages
CN112862877B (en) Method and apparatus for training an image processing network and image processing
CN108337505B (en) Information acquisition method and device
CN112215171B (en) Target detection method, device, equipment and computer readable storage medium
CN113159091B (en) Data processing method, device, electronic equipment and storage medium
CN111681177B (en) Video processing method and device, computer readable storage medium and electronic equipment
CN109377508B (en) Image processing method and device
CN112784765B (en) Method, apparatus, device and storage medium for recognizing motion
US20210064919A1 (en) Method and apparatus for processing image
WO2020093724A1 (en) Method and device for generating information
CN114511041B (en) Model training method, image processing method, device, equipment and storage medium
WO2020205003A1 (en) Techniques to capture and edit dynamic depth images
CN115205150A (en) Image deblurring method, device, equipment, medium and computer program product
CN113033677A (en) Video classification method and device, electronic equipment and storage medium
CN109816023B (en) Method and device for generating picture label model
CN110852250B (en) Vehicle weight removing method and device based on maximum area method and storage medium

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant