WO2022062344A1 - Method, system, and device for detecting salient target in compressed video, and storage medium - Google Patents

Method, system, and device for detecting salient target in compressed video, and storage medium

Info

Publication number
WO2022062344A1
Authority
WO
WIPO (PCT)
Prior art keywords
frame
data
feature
network
compressed video
Prior art date
Application number
PCT/CN2021/082752
Other languages
French (fr)
Chinese (zh)
Inventor
邹文艺
章勇
曹李军
Original Assignee
苏州科达科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏州科达科技股份有限公司 filed Critical 苏州科达科技股份有限公司
Publication of WO2022062344A1 publication Critical patent/WO2022062344A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Definitions

  • the present invention relates to the technical field of video processing, and in particular, to a method, system, device and storage medium for salient target detection of compressed video.
  • Video saliency detection is mainly divided into two categories: visual attention detection, which estimates the trajectory of the human gaze point when observing an image and has been widely studied in neurology, and salient target detection, which segments the most important or visually prominent objects from background noise.
  • In the prior art, however, there is no salient target detection method for compressed video that can take both detection speed and detection effect into account.
  • The purpose of the present invention is to provide a method, system, device and storage medium for salient target detection in compressed video, which can improve the detection speed of salient targets in compressed video while ensuring the detection effect.
  • An embodiment of the present invention provides a method for detecting a salient object in a compressed video, where the compressed video includes multiple frames of data, and the multiple frames of data include I-frame data and at least one P-frame data, and the method includes the following steps:
  • the feature extraction network includes a convolutional neural network
  • a saliency recognition network is used to obtain the saliency target area of each frame.
  • the I-frame data includes I-frame image data obtained by decoding the I-frame code stream of the compressed video
  • the P-frame data includes the motion information and residual information in the P-frame code stream of the compressed video.
  • the feature extraction network further includes a first residual network connected in series with the convolutional neural network.
  • a saliency recognition network is used to obtain the saliency target area of each frame, including the following steps:
  • a saliency recognition network is used to obtain the saliency target area of each frame.
  • a saliency recognition network is used to obtain the saliency target area of each frame, including the following steps:
  • a saliency recognition network is used to obtain the saliency target area of each frame.
  • the atrous spatial pyramid pooling network includes five modules connected in parallel: a global average pooling layer, a 1×1 convolutional layer and three 3×3 atrous convolutional layers. The outputs of the five modules are combined to obtain the third feature of each frame of data.
  • a saliency recognition network is used to obtain the saliency target area of each frame, including the following steps:
  • a saliency region is extracted from the binarized map.
  • the saliency recognition network includes first to fifth deconvolution layers and an activation function layer. The third feature of each frame of data is input to the first deconvolution layer, and the first feature of each frame of data is input to the second deconvolution layer; the outputs of the first and second deconvolution layers are combined and passed sequentially through the third, fourth and fifth deconvolution layers connected in series, and the output of the fifth deconvolution layer passes through the activation function layer to produce the probability map of each frame of data.
  • By introducing a long short-term memory network, features need to be fully extracted only for the I frame, while the features of a P frame can be extracted by using the features of the previous frame, the P-frame data and the long short-term memory network; salient object detection can then be performed on the extracted features, which greatly improves the detection speed of salient objects in compressed video.
  • the embodiment of the present invention also provides a salient target detection system for compressed video, which is applied to the salient target detection method for compressed video, and the system includes:
  • a first feature extraction module configured to input the I frame data into a feature extraction network to extract the first feature of the I frame data, and the feature extraction network includes a convolutional neural network;
  • the second feature extraction module is configured to, for each P-frame data, input the first feature of the frame data at the previous moment and the P-frame data into the long short-term memory network, and extract the first feature of the P-frame data;
  • the saliency detection module is used for obtaining the saliency target area of each frame by adopting the saliency identification network according to the first feature of the data of each frame.
  • By introducing a long short-term memory network, only the features of the I frame need to be fully extracted, while the features of a P frame can be extracted by using the features of the previous frame, the P-frame data and the long short-term memory network; salient object detection can then be performed on the extracted features, which greatly improves the detection speed of salient objects in compressed video.
  • the embodiment of the present invention also provides a salient object detection device for compressed video, including:
  • the processor is configured to execute the steps of the salient object detection method for compressed video by executing the executable instructions.
  • When executing the executable instructions, the processor performs the method for detecting salient objects in compressed video, so that the beneficial effects of the salient object detection method for compressed video provided by the present invention can be obtained.
  • Embodiments of the present invention further provide a computer-readable storage medium for storing a program, and when the program is executed, the steps of the method for detecting a salient object in a compressed video are implemented.
  • When executed, the stored program implements the steps of the method for detecting salient objects in compressed video, so that the beneficial effects of the above-mentioned salient object detection method for compressed video can be obtained.
  • FIG. 1 is a flowchart of a method for detecting a salient object in a compressed video according to an embodiment of the present invention
  • FIG. 2 is a structural diagram of the salient target detection network for compressed video according to a specific example of the present invention;
  • FIG. 3 is a structural diagram of a long-short-term memory network according to an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of a salient object detection system for compressed video according to an embodiment of the present invention.
  • FIG. 5 is a schematic structural diagram of a salient object detection device for compressed video according to an embodiment of the present invention.
  • FIG. 6 is a schematic structural diagram of a computer storage medium according to an embodiment of the present invention.
  • Example embodiments will now be described more fully with reference to the accompanying drawings.
  • Example embodiments can be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.
  • the same reference numerals in the drawings denote the same or similar structures, and thus their repeated descriptions will be omitted.
  • the present invention provides a salient object detection method for compressed video, where the compressed video includes multiple frames of data, and the multiple frames of data include I-frame data and at least one P-frame data.
  • Video is generally regarded as a sequence of independent images and can be stored and transmitted in compressed form. A codec divides the video into I frames and P/B frames: an I frame is a complete image frame, while a P/B frame only retains the changes of the image relative to a referenced frame.
  • The P-frame data at time t+k only records the motion information m_{t+k} of objects and the residual information r_{t+k}; consecutive frames are therefore highly correlated, and the changes between frames are also recorded in the video code stream.
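  • The relationship between a P frame and its reference frame can be sketched as block-based motion compensation followed by adding back the residual; the block size and data layout below are illustrative assumptions, not the actual codec format:

```python
import numpy as np

def reconstruct_p_frame(ref, motion_vectors, residual, block=4):
    # Sketch of block-based motion compensation: each block of the P frame
    # is predicted by copying a block from the reference frame displaced by
    # its motion vector, and the residual is then added back.
    h, w = ref.shape
    pred = np.zeros_like(ref)
    for by in range(0, h, block):
        for bx in range(0, w, block):
            dy, dx = motion_vectors[by // block, bx // block]
            sy = int(np.clip(by + dy, 0, h - block))
            sx = int(np.clip(bx + dx, 0, w - block))
            pred[by:by + block, bx:bx + block] = ref[sy:sy + block, sx:sx + block]
    return pred + residual
```

  • In the method described below, this reconstruction never has to be performed for P frames: the motion vectors and residuals are consumed directly as network inputs.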
  • the salient target detection method for the compressed video includes the following steps:
  • S100: Input the I-frame data into a feature extraction network to extract the first feature of the I-frame data, where the feature extraction network includes a convolutional neural network that can extract complete features from the complete image frame of the I frame; here the first feature takes the form of a feature map;
  • S200: For each P-frame data, input the first feature of the frame data at the previous moment and the P-frame data into a long short-term memory network, and extract the first feature of the P-frame data, where the P-frame data includes the motion vector and residual data of the P frame relative to the frame at the previous moment;
  • S300: According to the first feature of each frame of data, a saliency recognition network is used to obtain the saliency target area of each frame.
  • In the above steps, feature extraction for I-frame data in step S100 is performed by the convolutional neural network, while feature extraction for P-frame data in step S200 introduces a long short-term memory network so that the features of the previous frame can be reused; step S300 then performs salient target detection on the extracted features. Therefore, complete features need to be extracted only for the I frame, whereas the features of a P frame can be extracted quickly from the long short-term memory network and the P-frame data in the video stream.
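  • The control flow of steps S100 to S300 can be sketched as a per-frame loop; `extract_i`, `lstm_step` and `recognize` below are hypothetical stand-ins for the feature extraction network, the long short-term memory network and the saliency recognition network:

```python
def detect_salient_targets(frames, extract_i, lstm_step, recognize):
    # frames: list of dicts; an I frame carries decoded image data, while a
    # P frame carries only motion and residual information from the stream.
    regions = []
    feature = None
    for f in frames:
        if f["type"] == "I":
            feature = extract_i(f["image"])                           # S100
        else:
            feature = lstm_step(feature, f["motion"], f["residual"])  # S200
        regions.append(recognize(feature))                            # S300
    return regions
```

  • Only the I frame triggers the full extraction path; every P frame updates the running feature state from the code-stream data, which is where the speed-up comes from.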
  • the I-frame data includes I-frame image data obtained by decoding the I-frame code stream of the compressed video
  • the P-frame data includes the motion information and residual information in the P-frame code stream of the compressed video; therefore, the features of P frames can be extracted quickly from the motion information and residual information, which effectively improves the feature extraction speed for compressed video and greatly improves the detection speed of salient objects in compressed video.
  • the motion information may include motion vectors
  • the residual information may include residual coefficients.
  • the I-frame data retains complete information
  • the I-frame data is decoded to obtain complete image information
  • feature extraction and processing are performed in step S100.
  • the saliency target detection is carried out through step S300.
  • For continuous P-frame data, a motion-assisted long short-term memory network Nm_lstm is used to extract features, and salient target detection is then performed on the extracted features.
  • For each P frame, the feature Residual_1 extracted from the image data of the previous I frame, or the features c_{t+k-1} and h_{t+k-1} extracted by the long short-term memory network (LSTM, Long Short-Term Memory) from the P frame at the previous time t+k-1, together with the motion information and residual information in the video stream, are used as input to extract the features of the P frame, after which salient target detection is performed on the extracted features.
  • the convolutional neural network is a head convolutional neural network HeadConv
  • the feature extraction network further includes a first residual network Residual_1t connected in series with the convolutional neural network HeadConv
  • the output features of the first residual network are input into the long short-term memory network Nm_lstm for the P-frame data at time t+1;
  • the features output by the long short-term memory network Nm_lstm for the P-frame data at time t+1 are in turn input into the Nm_lstm for the P-frame data at time t+2, and so on.
  • Residual networks are characterized by being easy to optimize and capable of increasing accuracy by adding considerable depth.
  • the internal residual blocks use skip connections to alleviate the vanishing-gradient problem caused by increasing depth in deep neural networks.
  • the first feature Residual_1t is extracted from the I-frame image data and input into the motion-assisted long short-term memory network to obtain the first features [c_{t+1}, ..., c_{t+n}] of the subsequent frames; the specific process is shown in the figure.
  • the first residual network Residual_1t is further connected in series with a second residual network Residual_2t, a third residual network Residual_3t and a fourth residual network Residual_4t; increasing the depth of the residual networks further improves the accuracy of feature extraction.
  • the first feature of the I frame output by the first residual network Residual_1t is also input to the second residual network Residual_2t, the third residual network Residual_3t and the fourth residual network Residual_4t to obtain the second feature of the I frame.
  • the feature extraction part adopts ResNet-101 as the backbone network, including the convolutional neural network HeadConv and four residual networks (Residual_i, i ∈ {1, 2, 3, 4}).
  • the output of the long-short-term memory network Nm-lstm of each P frame is also input to the second residual network Residual_2t, the third residual network Residual_3t and the fourth residual network Residual_4t to obtain the second feature of the P frame.
  • For P frames, the feature extraction part includes a motion-assisted long short-term memory network and the same three residual networks as used for I frames.
  • the convolutional neural network HeadConv adopts a convolution kernel with a size of 7 ⁇ 7, a stride of 2, and a channel of 64.
  • the four residual networks Residual_1t to Residual_4t respectively include 3, 4, 23 and 3 residual learning blocks based on the "bottleneck block", with 256, 512, 1024 and 2048 output channels, respectively.
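  • The stage configuration above matches the standard ResNet-101 layout. As a quick check, the spatial size after the 7×7, stride-2 head convolution follows the usual output-size formula (a padding of 3 is assumed here, as in common ResNet implementations; the patent does not state it):

```python
def conv_output_size(size, kernel=7, stride=2, padding=3):
    # Convolution output size: floor((n + 2p - k) / s) + 1.
    return (size + 2 * padding - kernel) // stride + 1

# Blocks per stage and output channels of the four residual networks,
# as listed above (ResNet-101: 3 + 4 + 23 + 3 = 33 bottleneck blocks).
RESNET101_STAGES = [(3, 256), (4, 512), (23, 1024), (3, 2048)]
```

  • For a hypothetical 224×224 input, HeadConv alone would halve the spatial resolution to 112×112 before the residual stages.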
  • a saliency recognition network is used to obtain the saliency target area of each frame, including the following steps:
  • a saliency recognition network is used to obtain the saliency target area of each frame.
  • the fourth residual network Residual_4t is further connected in series with an atrous spatial pyramid pooling network.
  • the atrous spatial pyramid pooling (ASPP) network can further enlarge the receptive field of feature extraction and further improve the feature extraction effect.
  • a saliency recognition network is used to obtain the saliency target area of each frame, including the following steps:
  • a saliency recognition network is used to obtain the saliency target area of each frame.
  • the outputs of the five modules are combined by concatenation to obtain the third feature of each frame of data, and a 1×1 convolutional layer is then used to reduce the number of channels to the desired value.
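  • A single atrous branch of the ASPP module can be illustrated with a minimal single-channel dilated convolution (zero padding chosen so the output keeps the input resolution; a sketch under these assumptions, not the network's actual implementation):

```python
import numpy as np

def atrous_conv2d(x, kernel, rate):
    # Dilated convolution: kernel taps are spaced `rate` pixels apart,
    # enlarging the receptive field without adding parameters.
    kh, kw = kernel.shape
    ph, pw = (kh - 1) * rate // 2, (kw - 1) * rate // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            for a in range(kh):
                for b in range(kw):
                    out[i, j] += kernel[a, b] * xp[i + a * rate, j + b * rate]
    return out
```

  • Increasing `rate` spreads the kernel taps apart, which is exactly what the three 3×3 atrous branches exploit to sample context at several scales before concatenation.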
  • the saliency recognition network includes first to fifth deconvolution layers conv-1 to conv-5 and an activation function layer Sigmoid. The third feature of each frame of data output by the atrous spatial pyramid pooling network ASPP is input to the first deconvolution layer conv-1, and the first feature of each frame of data output by the first residual network Residual_1t or the long short-term memory network Nm_lstm is input to the second deconvolution layer conv-2. The outputs of conv-1 and conv-2 are combined by concatenation and passed sequentially through the third deconvolution layer conv-3, the fourth deconvolution layer conv-4 and the fifth deconvolution layer conv-5 connected in series; the resulting output is a feature map with the same resolution as the input I-frame image, and the output of conv-5 passes through the activation function layer Sigmoid to produce the probability map of each frame of data.
  • Because the convolutional network and the residual networks make the resolution of the feature map smaller than that of the input frame image, the resolution of the feature map is restored to that of the input image through the five deconvolution layers.
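  • The resolution restored by each deconvolution layer follows the standard transposed-convolution size formula; the kernel/stride/padding values below are illustrative assumptions (the patent does not specify them), chosen so that each layer exactly doubles the spatial size:

```python
def deconv_output_size(size, kernel=4, stride=2, padding=1):
    # Transposed-convolution output size: (n - 1) * s - 2p + k.
    # With k=4, s=2, p=1 each layer exactly doubles the spatial size.
    return (size - 1) * stride - 2 * padding + kernel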
  • a saliency region can be extracted according to the probability map.
  • a saliency recognition network is used to obtain the saliency target area of each frame, including the following steps:
  • a saliency region is extracted from the binarized map.
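  • The two steps above can be sketched as thresholding the probability map and taking the bounding box of the remaining salient pixels (the threshold of 0.5 is an illustrative choice; the patent does not fix a value):

```python
import numpy as np

def extract_salient_region(prob, threshold=0.5):
    # Binarize the probability map, then return the bounding box
    # (y0, x0, y1, x1) of the salient pixels, or None if there are none.
    mask = prob >= threshold
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    return int(ys.min()), int(xs.min()), int(ys.max()), int(xs.max())
```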
  • FIG. 3 is a structural diagram of the long short-term memory network in this embodiment.
  • the long short-term memory network is configured to obtain the first feature of the current frame by using the motion information and the first feature of the adjacent frame; the specific formula is as follows:
  • c_{t+k-1} and h_{t+k-1} are the outputs of the motion-assisted long short-term memory network at time t+k-1
  • c_t and h_t are given by Residual_1t, with k ∈ [1, n]
  • n is the number of P frames within a GOP.
  • the correction operation W performs bilinear interpolation at each position of the feature map, mapping position p+Δp of frame t+k-1 to position p of frame t+k.
  • the specific formula is as follows (reconstructed from the symbol definitions below, the original figure not being reproduced here):

    c_{t+k-1→t+k}(p) = Σ_q G(q, p + Δp) · c_{t+k-1}(q)

  • Δp is obtained from m_{t+k}
  • q ranges over the spatial positions of the feature map c_{t+k-1}
  • G(·) denotes the bilinear interpolation kernel
  • the hidden-layer feature h_{t+k-1→t+k} is obtained in the same way as c_{t+k-1→t+k}, and h_{t+k-1→t+k} and c_{t+k-1→t+k} serve as the inputs of the long short-term memory network from the previous frame to the current frame.
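  • The correction operation W — sampling the previous frame's feature map at positions displaced by the motion-derived offset Δp, with bilinear weights playing the role of the kernel G(·) — can be sketched for a single-channel map as follows (out-of-bounds samples are treated as zero; an illustration, not the patent's exact implementation):

```python
import numpy as np

def warp_feature(feat, offset):
    # offset[y, x] = (dy, dx) gives the displacement for target position
    # p = (y, x); the output samples feat at p + offset with bilinear
    # interpolation over the four surrounding integer positions.
    h, w = feat.shape
    out = np.zeros((h, w), dtype=float)
    for y in range(h):
        for x in range(w):
            sy = y + offset[y, x, 0]
            sx = x + offset[y, x, 1]
            y0, x0 = int(np.floor(sy)), int(np.floor(sx))
            wy, wx = sy - y0, sx - x0
            for dy, dx, wgt in ((0, 0, (1 - wy) * (1 - wx)),
                                (0, 1, (1 - wy) * wx),
                                (1, 0, wy * (1 - wx)),
                                (1, 1, wy * wx)):
                yy, xx = y0 + dy, x0 + dx
                if wgt > 0 and 0 <= yy < h and 0 <= xx < w:
                    out[y, x] += wgt * feat[yy, xx]
    return out
```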
  • W_g, W_i and W_c are learned weights, and σ(·) denotes the sigmoid function, which maps variables to values between 0 and 1.
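  • With the warped states and the weights named above, the memory update can be given a schematic shape. The exact gate wiring of the patent's cell is not reproduced in its text, so the following is only an LSTM-flavoured sketch (vector features; hypothetical gating built from W_g, W_i and W_c):

```python
import numpy as np

def sigmoid(z):
    # sigma(): maps variables to values between 0 and 1.
    return 1.0 / (1.0 + np.exp(-z))

def motion_assisted_lstm_step(c_warp, h_warp, x, W_g, W_i, W_c):
    # Schematic gate computation with the three named weights. c_warp and
    # h_warp are the warped cell/hidden states from the previous frame,
    # x the current P-frame input; all are length-d vectors, and each
    # weight matrix has shape (d, 2d).  This wiring is an assumption.
    z = np.concatenate([h_warp, x])
    i = sigmoid(W_i @ z)                # input gate
    g = np.tanh(W_g @ z)                # candidate memory
    c = c_warp + i * g                  # assumed cell update
    h = sigmoid(W_c @ z) * np.tanh(c)   # output gate -> hidden state
    return c, h
```

  • Applied once per P frame, such a cell keeps only one full feature extraction per GOP, with every subsequent frame handled by this lightweight update.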
  • the present invention can quickly extract the features of P frames through the long short-term memory network and the motion information and residual information in the video code stream, thereby effectively improving the feature extraction speed for compressed video.
  • an embodiment of the present invention further provides a salient target detection system for compressed video, which is applied to the salient target detection method for compressed video, and the system includes:
  • a first feature extraction module M100 configured to input the I frame data into a feature extraction network, to extract the first feature of the I frame data, and the feature extraction network includes a convolutional neural network;
  • the second feature extraction module M200 is configured to, for each P-frame data, input the first feature of the frame data at the previous moment and the P-frame data into a long short-term memory network, and extract the first feature of the P-frame data.
  • the saliency detection module M300 is used for obtaining the saliency target area of each frame by adopting a saliency identification network according to the first feature of the data of each frame.
  • When the first feature extraction module M100 extracts features from the I-frame data, the convolutional neural network is used for extraction; when the second feature extraction module M200 extracts features from the P-frame data, a long short-term memory network is introduced, so that the features of the previous frame, the P-frame data and the long short-term memory network can be used for feature extraction, and the saliency detection module M300 can then perform salient target detection on the extracted features.
  • An embodiment of the present invention further provides a salient object detection device for compressed video, including a processor and a memory in which executable instructions of the processor are stored, wherein the processor is configured to perform the steps of the salient target detection method for compressed video by executing the executable instructions.
  • Aspects of the present invention may be implemented as a system, method or program product. Therefore, various aspects of the present invention can be embodied in the following forms: a complete hardware implementation, a complete software implementation (including firmware, microcode, etc.), or a combination of hardware and software, which may be collectively referred to herein as a "circuit", "module" or "system".
  • the electronic device 600 according to this embodiment of the present invention is described below with reference to FIG. 5 .
  • the electronic device 600 shown in FIG. 5 is only an example, and should not impose any limitation on the function and scope of use of the embodiments of the present invention.
  • electronic device 600 takes the form of a general-purpose computing device.
  • Components of the electronic device 600 may include, but are not limited to, at least one processing unit 610, at least one storage unit 620, a bus 630 connecting different system components (including the storage unit 620 and the processing unit 610), a display unit 640, and the like.
  • the storage unit stores program codes, and the program codes can be executed by the processing unit 610, so that the processing unit 610 executes the steps according to the various exemplary embodiments of the present invention described in the method section above in this specification.
  • the processing unit 610 may perform the steps shown in FIG. 1 .
  • the storage unit 620 may include a readable medium in the form of a volatile storage unit, such as a random access storage unit (RAM) 6201 and/or a cache storage unit 6202 , and may further include a read only storage unit (ROM) 6203 .
  • RAM random access storage unit
  • ROM read only storage unit
  • the storage unit 620 may also include a program/utility 6204 having a set (at least one) of program modules 6205 including, but not limited to, an operating system, one or more application programs, other program modules, and programs Data, each or some combination of these examples may include an implementation of a network environment.
  • the bus 630 may represent one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, a graphics acceleration port, a processing unit, or a local bus using any of a variety of bus structures.
  • the electronic device 600 may also communicate with one or more external devices 700 (e.g., keyboards, pointing devices, Bluetooth devices, etc.), with one or more devices that enable a user to interact with the electronic device 600, and/or with any device (e.g., router, modem, etc.) that enables the electronic device 600 to communicate with one or more other computing devices. Such communication may occur through the input/output (I/O) interface 650. Also, the electronic device 600 may communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 660. The network adapter 660 may communicate with other modules of the electronic device 600 through the bus 630. It should be appreciated that, although not shown, other hardware and/or software modules may be used in conjunction with the electronic device 600, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives and data backup storage systems.
  • When executing the executable instructions, the processor performs the method for detecting salient objects in compressed video, so that the beneficial effects of the salient object detection method for compressed video provided by the present invention can be obtained.
  • Embodiments of the present invention further provide a computer-readable storage medium for storing a program, and when the program is executed, the steps of the method for detecting a salient object in a compressed video are implemented.
  • Aspects of the present invention can also be implemented in the form of a program product comprising program code; when the program product runs on a terminal device, the program code causes the terminal device to execute the steps according to the various exemplary embodiments of the present invention described in the method section above in this specification.
  • A program product 800 for implementing the above method according to an embodiment of the present invention is described; it can adopt a portable compact disk read-only memory (CD-ROM), include program codes, and run on a terminal device such as a personal computer.
  • CD-ROM compact disk read only memory
  • the program product of the present invention is not limited thereto, and in this document, a readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • the program product may employ any combination of one or more readable media.
  • the readable medium may be a readable signal medium or a readable storage medium.
  • the readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or a combination of any of the above. More specific examples (non-exhaustive list) of readable storage media include: electrical connections with one or more wires, portable disks, hard disks, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM or flash memory), optical fiber, portable compact disk read only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
  • the readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying readable program code therein. Such a propagated data signal may take a variety of forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the foregoing.
  • A readable signal medium may also be any readable medium other than a readable storage medium that can transmit, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
  • Program code embodied on a readable storage medium may be transmitted using any suitable medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
  • Program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages—such as Java, C++, etc., as well as conventional procedural Programming Language - such as the "C" language or similar programming language.
  • the program code may execute entirely on the user computing device, partly on the user device, as a stand-alone software package, partly on the user computing device and partly on a remote computing device, or entirely on the remote computing device or cluster execute on.
  • the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, through the Internet using an Internet service provider).
  • when executed, the stored program implements the steps of the method for detecting salient objects in compressed video, so that the beneficial effects of the above-mentioned method can be obtained.


Abstract

Provided are a method, system, and device for detecting a salient target in a compressed video, and a storage medium. The compressed video comprises multi-frame data, and the multi-frame data comprises I-frame data and at least one piece of P-frame data. The method comprises: inputting the I-frame data into a feature extraction network and extracting a first feature of the I-frame data, the feature extraction network comprising a convolutional neural network; for each piece of P-frame data, inputting the first feature of the frame data at the previous moment together with the P-frame data into a long short-term memory network and extracting a first feature of the P-frame data; and obtaining, according to the first feature of each piece of frame data, the salient target region in each frame using a saliency recognition network. With the present invention, by introducing a long short-term memory network, only the features of an I-frame need to be fully extracted, while the features of a P-frame can be extracted using the features of the previous frame, the P-frame data, and the long short-term memory network, thus increasing the speed of detecting a salient target in a compressed video.

Description

Salient object detection method, system, device and storage medium for compressed video
Technical Field
The present invention relates to the technical field of video processing, and in particular to a method, system, device and storage medium for salient object detection in compressed video.
Background Art
Video saliency detection falls into two main categories. The first, visual attention detection, estimates the trajectory of human gaze points while observing an image and has been widely studied in neuroscience. The second, salient object detection, segments the most important or visually prominent objects out of the background. For this second category, the prior art offers no salient object detection method for compressed video that achieves both high detection speed and good detection quality.
Summary of the Invention
In view of the problems in the prior art, the purpose of the present invention is to provide a method, system, device and storage medium for salient object detection in compressed video, improving detection speed while preserving detection quality.
An embodiment of the present invention provides a salient object detection method for compressed video, where the compressed video includes multiple frames of data, the multiple frames including I-frame data and at least one piece of P-frame data. The method includes the following steps:
inputting the I-frame data into a feature extraction network to extract a first feature of the I-frame data, the feature extraction network including a convolutional neural network;
for each piece of P-frame data, inputting the first feature of the frame data at the previous moment together with the P-frame data into a long short-term memory network to extract a first feature of the P-frame data;
obtaining the salient object region of each frame with a saliency recognition network, according to the first feature of each frame of data.
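The three steps above can be sketched as a per-GOP dispatch loop. The feature extractor, LSTM step and saliency head below are hypothetical toy stand-ins (simple arithmetic on lists), not the networks of the embodiment; only the control flow — full extraction for the I-frame, cheap updates for P-frames — reflects the method:

```python
# Toy stand-ins for the three networks of the method (not the embodiment's layers).
def extract_i_features(pixels):              # stands in for HeadConv + residual nets
    return [p * 0.5 for p in pixels]

def lstm_step(prev_feat, motion, residual):  # stands in for the motion-aided LSTM
    return [f + m + r for f, m, r in zip(prev_feat, motion, residual)]

def saliency_head(feat):                     # stands in for the saliency network
    return [1 if f > 0.5 else 0 for f in feat]

def detect_gop(frames):
    feats, prev = [], None
    for frame in frames:
        if frame["type"] == "I":
            prev = extract_i_features(frame["pixels"])   # full extraction, I-frame only
        else:
            # P-frame: reuse previous features plus motion and residual data
            prev = lstm_step(prev, frame["motion"], frame["residual"])
        feats.append(prev)
    return [saliency_head(f) for f in feats]

gop = [
    {"type": "I", "pixels": [0.2, 1.8]},
    {"type": "P", "motion": [0.0, 0.1], "residual": [0.5, -0.2]},
]
masks = detect_gop(gop)
print(masks)  # [[0, 1], [1, 1]]
```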
Optionally, the I-frame data includes I-frame image data obtained by decoding the I-frame code stream of the compressed video, and the P-frame data includes motion information and residual information from the P-frame code stream of the compressed video.
Optionally, the feature extraction network further includes a first residual network connected in series with the convolutional neural network.
Optionally, obtaining the salient object region of each frame with a saliency recognition network according to the first feature of each frame of data includes the following steps:
inputting the first feature of each frame of data into a second residual network, a third residual network and a fourth residual network connected in series, to obtain a second feature of each frame of data;
obtaining the salient object region of each frame with the saliency recognition network, according to the second feature of each frame of data.
Optionally, obtaining the salient object region of each frame with the saliency recognition network according to the second feature of each frame of data includes the following steps:
inputting the second feature of each frame of data into an atrous spatial pyramid pooling network to obtain a third feature of each frame of data;
obtaining the salient object region of each frame with the saliency recognition network, according to the third feature of each frame of data.
Optionally, the atrous spatial pyramid pooling network includes five modules connected in parallel: a global average pooling layer, a 1x1 convolutional layer and three 3x3 atrous convolutional layers; the outputs of the five modules are concatenated to obtain the third feature of each frame of data.
Optionally, obtaining the salient object region of each frame with the saliency recognition network according to the third feature of each frame of data includes the following steps:
inputting the third feature of each frame of data into the saliency recognition network to obtain a probability map corresponding to each frame of data;
binarizing the probability map according to a probability threshold to obtain a binarized map;
extracting the salient region from the binarized map.
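A minimal sketch of this thresholding step follows. The threshold value 0.5 and the bounding-box form of the extracted region are illustrative assumptions; the patent fixes neither:

```python
def binarize(prob_map, threshold=0.5):
    # Probability map -> binarized map, per the probability threshold.
    return [[1 if p >= threshold else 0 for p in row] for row in prob_map]

def bounding_box(binary_map):
    # Read the salient region off the binarized map as a bounding box.
    ys = [y for y, row in enumerate(binary_map) for v in row if v]
    xs = [x for row in binary_map for x, v in enumerate(row) if v]
    if not xs:
        return None
    return (min(ys), min(xs), max(ys), max(xs))  # (top, left, bottom, right)

prob = [
    [0.1, 0.2, 0.1],
    [0.2, 0.9, 0.8],
    [0.1, 0.7, 0.6],
]
mask = binarize(prob)
print(mask)                # [[0, 0, 0], [0, 1, 1], [0, 1, 1]]
print(bounding_box(mask))  # (1, 1, 2, 2)
```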
Optionally, the saliency recognition network includes first to fifth deconvolution layers and an activation function layer. The third feature of each frame of data is input into the first deconvolution layer, and the first feature of each frame of data is input into the second deconvolution layer; the outputs of the first and second deconvolution layers are concatenated and passed through the third, fourth and fifth deconvolution layers connected in series, and the output of the fifth deconvolution layer passes through the activation function layer to output the probability map of each frame of data.
By the salient object detection method for compressed video of the present invention, which introduces a long short-term memory network, full features need to be extracted only for the I-frame; the features of a P-frame can be extracted from the features of the previous frame, the P-frame data and the long short-term memory network, and salient object detection is then performed on the extracted features. This greatly increases the speed of salient object detection in compressed video.
An embodiment of the present invention further provides a salient object detection system for compressed video, applied to the above salient object detection method for compressed video. The system includes:
a first feature extraction module, configured to input the I-frame data into a feature extraction network to extract the first feature of the I-frame data, the feature extraction network including a convolutional neural network;
a second feature extraction module, configured to, for each piece of P-frame data, input the first feature of the frame data at the previous moment together with the P-frame data into a long short-term memory network to extract the first feature of the P-frame data;
a saliency detection module, configured to obtain the salient object region of each frame with a saliency recognition network, according to the first feature of each frame of data.
By the salient object detection system for compressed video of the present invention, which introduces a long short-term memory network, full features need to be extracted only for the I-frame; the features of a P-frame can be extracted from the features of the previous frame, the P-frame data and the long short-term memory network, and salient object detection is then performed on the extracted features. This greatly increases the speed of salient object detection in compressed video.
An embodiment of the present invention further provides a salient object detection device for compressed video, including:
a processor;
a memory storing executable instructions of the processor;
wherein the processor is configured to execute, via the executable instructions, the steps of the salient object detection method for compressed video.
With the salient object detection device for compressed video provided by the present invention, the processor performs the salient object detection method for compressed video when executing the executable instructions, thereby obtaining the beneficial effects of the above-mentioned method.
An embodiment of the present invention further provides a computer-readable storage medium for storing a program, where the program, when executed, implements the steps of the salient object detection method for compressed video.
With the computer-readable storage medium provided by the present invention, the stored program implements the steps of the salient object detection method for compressed video when executed, thereby obtaining the beneficial effects of the above-mentioned method.
Brief Description of the Drawings
Other features, objects and advantages of the present invention will become more apparent upon reading the following detailed description of non-limiting embodiments with reference to the accompanying drawings.
Fig. 1 is a flowchart of a salient object detection method for compressed video according to an embodiment of the present invention;
Fig. 2 is a structural diagram of a salient object detection network for compressed video according to a specific example of the present invention;
Fig. 3 is a structural diagram of a long short-term memory network according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of a salient object detection system for compressed video according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a salient object detection device for compressed video according to an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of a computer storage medium according to an embodiment of the present invention.
Detailed Description of the Embodiments
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments, however, can be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The same reference numerals in the drawings denote the same or similar structures, so their repeated description will be omitted.
In one embodiment, the present invention provides a salient object detection method for compressed video, where the compressed video includes multiple frames of data, the multiple frames including I-frame data and at least one piece of P-frame data. A video is generally regarded as a sequence of independent images and can be stored and transmitted in compressed form. The codec divides the video into I-frames and P/B-frames: an I-frame is a complete image frame, while a P/B-frame only keeps the changes relative to a reference image. In compressed video using I-frames and P-frames, the P-frame data at time t+k records only the motion information m_{t+k} and residual information r_{t+k} of objects; consecutive frames are therefore highly correlated, and the changes between frames are recorded in the video code stream.
As shown in Fig. 1, the salient object detection method for compressed video includes the following steps:
S100: input the I-frame data into a feature extraction network to extract the first feature of the I-frame data. The feature extraction network includes a convolutional neural network, which extracts complete features from the complete image frame of the I-frame; here the first feature takes the form of a feature map;
S200: for each piece of P-frame data, input the first feature of the frame data at the previous moment together with the P-frame data into a long short-term memory network to extract the first feature of the P-frame data; here the P-frame data includes the motion vectors and residual data of the P-frame relative to the frame at the previous moment;
S300: obtain the salient object region of each frame with a saliency recognition network, according to the first feature of each frame of data.
In the salient object detection method for compressed video of the present invention, feature extraction for the I-frame data in step S100 is performed by a convolutional neural network, while feature extraction for the P-frame data in step S200 introduces a long short-term memory network, so that the features of the previous frame and the long short-term memory network can be used for extraction; step S300 then performs salient object detection on the extracted features. Therefore, complete features need to be extracted only for the I-frame, while the features of a P-frame can be extracted quickly through the long short-term memory network and the P-frame data in the video code stream.
In this embodiment, the I-frame data includes I-frame image data obtained by decoding the I-frame code stream of the compressed video, and the P-frame data includes motion information and residual information from the P-frame code stream of the compressed video; the features of a P-frame can thus be extracted quickly from the motion information and residual information, which effectively speeds up feature extraction from compressed video and thereby greatly speeds up salient object detection in compressed video. Specifically, the motion information may include motion vectors, and the residual information may include residual coefficients.
Specifically, in a video coding sequence, within a group of pictures (GOP, Group Of Pictures), the I-frame data retains complete information: it is decoded to obtain the complete image, which undergoes feature extraction in step S100 and salient object detection in step S300. For P-frames, in step S200 a motion-aided long short-term memory network (Nm_lstm) extracts features from the consecutive P-frame data, and salient object detection is then performed on the extracted features. For the P-frame data at time t+k, the long short-term memory network (LSTM, Long Short-Term Memory) takes as input the feature Residual_1 extracted from the preceding I-frame image data, or the features c_{t+k-1} and h_{t+k-1} extracted for the P-frame at the previous moment, together with the motion information and residual information in the video code stream, and extracts the features of the P-frame; salient object detection is then performed on the extracted features.
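As a side illustration of why the P-frame code stream suffices: a decoder can rebuild each P-frame position from the reference frame displaced by the motion vector, plus a residual. The 1-D, integer-motion toy below conveys only the idea; real codecs work block-wise with sub-pixel motion:

```python
def reconstruct_p_frame(reference, motion, residual):
    # out[p] = reference[p + motion[p]] + residual[p], with clamped indexing.
    n = len(reference)
    out = []
    for p in range(n):
        q = min(max(p + motion[p], 0), n - 1)  # clamp displaced position to the frame
        out.append(reference[q] + residual[p])
    return out

ref = [10, 20, 30, 40]   # decoded frame at time t+k-1
mv  = [1, 1, -1, 0]      # toy motion vectors m_{t+k}
res = [0, 2, -2, 1]      # toy residuals r_{t+k}
print(reconstruct_p_frame(ref, mv, res))  # [20, 32, 18, 41]
```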
As shown in Fig. 2, in this embodiment, the convolutional neural network is a head convolutional neural network HeadConv, and the feature extraction network further includes a first residual network Residual_1t connected in series with HeadConv. The output feature of the first residual network is fed into the long short-term memory network Nm-lstm of the P-frame data at time t+1, the feature output by that network is fed into the Nm-lstm of the P-frame data at time t+2, and so on. Extracting I-frame features with a convolutional neural network combined with a residual network yields better feature maps for the I-frame. Residual networks are easy to optimize and can improve accuracy by adding considerable depth; their internal residual blocks use skip connections, which alleviate the vanishing-gradient problem caused by increasing depth in deep neural networks.
In application, the first feature Residual_1t is extracted from the I-frame image data and fed into the motion-aided long short-term memory network to obtain the first features [c_{t+1}, ..., c_{t+n}] of the subsequent frames, as shown in the following formula:
(c_{t+k}, h_{t+k}) = Nm_lstm(c_{t+k-1}, h_{t+k-1}, m_{t+k}, r_{t+k}), k ∈ [1, n]     (1)
where [c_t, c_{t+1}, ..., c_{t+n}] denotes the set of features extracted within one GOP.
As shown in Fig. 2, in this embodiment, the first residual network Residual_1t is followed by a second residual network Residual_2t, a third residual network Residual_3t and a fourth residual network Residual_4t; increasing the depth of the residual networks further improves feature extraction accuracy. The first feature of the I-frame output by the first residual network Residual_1t is also fed through Residual_2t, Residual_3t and Residual_4t to obtain the second feature of the I-frame. For the I-frame, the feature extraction part uses ResNet-101 as the backbone network, comprising the convolutional neural network HeadConv and four residual networks (Residual_i, i ∈ {1, 2, 3, 4}). The output of the long short-term memory network Nm-lstm of each P-frame is likewise fed through Residual_2t, Residual_3t and Residual_4t to obtain the second feature of the P-frame. For P-frames, the feature extraction part comprises one motion-aided long short-term memory network and the same three residual networks as for the I-frame.
In this embodiment, the convolutional neural network HeadConv uses a convolution kernel of size 7x7 with stride 2 and 64 channels; the four residual networks Residual_1t to Residual_4t contain 3, 4, 23 and 3 "bottleneck-block" residual learning units respectively, with 256, 512, 1024 and 2048 output channels respectively.
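The backbone configuration above can be sanity-checked: the "101" in ResNet-101 counts the weighted layers, i.e. the 3 + 4 + 23 + 3 bottleneck blocks of three convolutions each, plus the head convolution and the fully connected classification layer of the original ResNet (which a detection backbone such as this one replaces with its own head):

```python
blocks = [3, 4, 23, 3]      # bottleneck blocks per residual stage, as stated above
convs_per_bottleneck = 3    # each bottleneck block: 1x1 -> 3x3 -> 1x1 convolutions
# + 1 for HeadConv, + 1 for the original ResNet's fully connected layer
weighted_layers = sum(blocks) * convs_per_bottleneck + 1 + 1
print(weighted_layers)  # 101

# Bottleneck expansion factor 4 over the stages' base widths 64/128/256/512
out_channels = [c * 4 for c in (64, 128, 256, 512)]
print(out_channels)  # [256, 512, 1024, 2048]
```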
Step S300, obtaining the salient object region of each frame with a saliency recognition network according to the first feature of each frame of data, includes the following steps:
inputting the first feature of each frame of data into the second residual network Residual_2t, the third residual network Residual_3t and the fourth residual network Residual_4t connected in series, to obtain the second feature of each frame of data;
obtaining the salient object region of each frame with the saliency recognition network, according to the second feature of each frame of data.
Further, as shown in Fig. 2, in this embodiment the fourth residual network Residual_4t is followed in series by an atrous spatial pyramid pooling network (ASPP, Atrous Spatial Pyramid Pooling), which further enlarges the receptive field of feature extraction and further improves the feature extraction effect.
Obtaining the salient object region of each frame with the saliency recognition network according to the second feature of each frame of data includes the following steps:
inputting the second feature of each frame of data into the atrous spatial pyramid pooling network ASPP to obtain the third feature of each frame of data;
obtaining the salient object region of each frame with the saliency recognition network, according to the third feature of each frame of data.
As shown in Fig. 2, in this embodiment, the atrous spatial pyramid pooling network includes five modules connected in parallel: a global average pooling layer GAP, a 1x1 convolutional layer, and three 3x3 atrous convolutional layers with dilation rates rates = {6, 12, 18}. The outputs of the five modules are concatenated (concat) to obtain the third feature of each frame of data, and a 1x1 convolutional layer then reduces the number of channels to the required value.
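The dilation rates {6, 12, 18} above can be read as effective kernel sizes: a k x k convolution with dilation d covers d*(k-1)+1 positions per side, which is how the parallel ASPP branches enlarge the receptive field without adding parameters. A quick check:

```python
def effective_kernel(k, dilation):
    # Effective spatial extent (per side) of a k x k convolution with the given dilation.
    return dilation * (k - 1) + 1

for rate in (6, 12, 18):
    print(rate, effective_kernel(3, rate))
# 6 13
# 12 25
# 18 37
```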
As shown in Fig. 2, in this embodiment, the saliency recognition network includes first to fifth deconvolution layers conv-1 to conv-5 and an activation function layer Sigmoid. The third feature of each frame of data output by the ASPP network is input into the first deconvolution layer conv-1, and the first feature of each frame of data output by the first residual network Residual_1t or by the long short-term memory network Nm-lstm is input into the second deconvolution layer conv-2. The outputs of conv-1 and conv-2 are concatenated and passed through the third, fourth and fifth deconvolution layers conv-3, conv-4 and conv-5 connected in series, producing a feature map with the same resolution as the input I-frame image. The output of the fifth deconvolution layer conv-5 passes through the activation function layer Sigmoid to output the probability map of each frame of data. Because the convolutional and residual networks used during feature extraction make the resolution of the feature map smaller than that of the input frame image, the five deconvolution layers restore the feature map to the resolution of the input frame image.
After the probability map of each frame of data is obtained, the salient region can be extracted from the probability map. Specifically, in this embodiment, obtaining the salient object region of each frame with the saliency recognition network according to the third feature of each frame of data includes the following steps:
inputting the third feature of each frame of data into the saliency recognition network to obtain the probability map corresponding to each frame of data, i.e. the probability map of each frame of data output by the activation function layer Sigmoid;
binarizing the probability map according to a probability threshold to obtain a binarized map;
extracting the salient region from the binarized map.
Fig. 3 is a structural diagram of the long short-term memory network in this embodiment. The long short-term memory network is configured to obtain the first feature of the current frame from the motion information and the first feature of the adjacent frame, with the following formulas:
c_{t+k-1→t+k} = W(c_{t+k-1}, m_{t+k})
h_{t+k-1→t+k} = W(h_{t+k-1}, m_{t+k})     (2)
where c_{t+k-1} and h_{t+k-1} are the outputs of the motion-aided long short-term memory network at time t+k-1, c_t and h_t are Residual_1t, k ∈ [1, n], and n is the number of P-frames in one GOP. The warping operation W performs bilinear interpolation at each position of the feature map, mapping position p+Δp of frame t+k-1 to position p of frame t+k, with the following formula:
Δp = m_{t+k}(p)
c_{t+k-1→t+k}(p) = Σ_q G(q, p+Δp) c_{t+k-1}(q)     (3)
where Δp is obtained from m_{t+k}, q denotes a spatial position in the feature map c_{t+k-1}, and G(·) is the bilinear interpolation kernel, with the following formula:
G(q, p+Δp) = max(0, 1 - ||q - (p+Δp)||)     (4)
The hidden-layer feature h_{t+k-1→t+k} is processed in the same way as c_{t+k-1→t+k}; h_{t+k-1→t+k} and c_{t+k-1→t+k} serve as the inputs of the long short-term memory network from the previous frame to the current frame.
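The warping operation W of equations (2)-(4) can be sketched as follows. One assumption here: the 2-D kernel G is taken to factorize into per-axis terms max(0, 1 - |·|), the usual bilinear form:

```python
def warp(feat, motion):
    # c_{t+k-1->t+k}(p) = sum_q G(q, p + dp) * c_{t+k-1}(q), dp = motion at p.
    h, w = len(feat), len(feat[0])
    out = [[0.0] * w for _ in range(h)]
    for py in range(h):
        for px in range(w):
            dy, dx = motion[py][px]
            sy, sx = py + dy, px + dx           # sampling point p + dp
            for qy in range(h):
                for qx in range(w):
                    # Assumed separable bilinear kernel G(q, p + dp)
                    g = max(0.0, 1 - abs(qy - sy)) * max(0.0, 1 - abs(qx - sx))
                    out[py][px] += g * feat[qy][qx]
    return out

feat = [[0.0, 1.0],
        [2.0, 3.0]]
motion = [[(0.0, 0.5)] * 2 for _ in range(2)]   # shift half a pixel along x
print(warp(feat, motion))  # [[0.5, 0.5], [2.5, 1.5]]
```

With zero motion the kernel collapses to the identity, so the warped map equals the input.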
The long short-term memory network is defined by the following formulas:
g_{t+k} = σ(W_g(h_{t+k-1→t+k}, r_{t+k}))
i_{t+k} = σ(W_i(h_{t+k-1→t+k}, r_{t+k}))
c̃_{t+k} = tanh(W_c(h_{t+k-1→t+k}, r_{t+k}))
c_{t+k} = (g_{t+k} ⊙ c_{t+k-1→t+k}) ⊕ (i_{t+k} ⊙ c̃_{t+k})     (5)
where ⊕ and ⊙ denote pixel-wise addition and multiplication respectively, W_g, W_i and W_c are learned weights, and σ(·) denotes the sigmoid function, which maps its input into (0, 1).
o t+k=σ(W o(h t+k-1→t+k,r t+k)) o t+k =σ(W o (h t+k-1→t+k ,r t+k ))
Figure PCTCN2021082752-appb-000006
Figure PCTCN2021082752-appb-000006
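As a minimal illustration of the gate structure above, the following sketch performs one update at a single spatial position, with scalar weights standing in for the learned convolutions W_g, W_i, W_c, W_o. The function name, the weight dictionary, and the additive combination of h and r are simplifying assumptions for illustration only:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(h_warp, c_warp, r, w):
    """One motion-assisted LSTM update at a single spatial position.

    h_warp, c_warp: warped states h_{t+k-1->t+k} and c_{t+k-1->t+k}
    r:              residual input r_{t+k}
    w:              dict of scalar weights standing in for the learned
                    convolutions (hypothetical simplification)
    Returns (h_{t+k}, c_{t+k}).
    """
    z = h_warp + r                               # stand-in for combining (h, r)
    g = sigmoid(w["g"] * z)                      # forget gate g_{t+k}
    i = sigmoid(w["i"] * z)                      # input gate i_{t+k}
    c = g * c_warp + i * math.tanh(w["c"] * z)   # pixel-wise cell update
    o = sigmoid(w["o"] * z)                      # output gate o_{t+k}
    h = o * math.tanh(c)                         # hidden state h_{t+k}
    return h, c
```

Because the gates depend only on the warped hidden state and the residual input, the update is far cheaper than a full convolutional feature extraction, which is what makes P-frame processing fast.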
Therefore, by means of the long short-term memory network together with the motion information and residual information already present in the video code stream, the present invention can quickly extract the features of P frames, effectively increasing the feature extraction speed for compressed video.
As shown in Fig. 4, an embodiment of the present invention further provides a salient object detection system for compressed video, applying the salient object detection method for compressed video described above. The system includes:
a first feature extraction module M100, configured to input the I-frame data into a feature extraction network and extract the first feature of the I-frame data, the feature extraction network including a convolutional neural network;
a second feature extraction module M200, configured to, for each item of P-frame data, input the first feature of the corresponding frame data at the previous moment together with the P-frame data into a long short-term memory network and extract the first feature of the P-frame data;
a saliency detection module M300, configured to obtain the salient target region of each frame from the first feature of each frame's data using a saliency recognition network.
With the salient object detection system for compressed video of the present invention, the first feature extraction module M100 extracts features from the I-frame data with a convolutional neural network, while the second feature extraction module M200 introduces a long short-term memory network for the P-frame data, extracting features from the previous frame's features, the P-frame data and the LSTM network; the saliency detection module M300 then performs salient object detection on the extracted features. Complete features therefore need to be extracted only for I frames, whereas P-frame features can be obtained quickly from the LSTM network and the P-frame data in the video code stream, which effectively increases the feature extraction speed for compressed video and thus greatly increases the speed of salient object detection in compressed video.
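The division of labour between the three modules can be sketched as a simple dispatch loop over one GOP. The callables and tuple format below are hypothetical stand-ins for the patent's concrete networks, intended only to show the control flow:

```python
def detect_salient(frames, cnn, lstm, saliency_net):
    """Route each frame of a GOP to the appropriate feature extractor.

    frames:       list of ("I", image) or ("P", pframe_data) tuples,
                  where pframe_data carries motion/residual information
    cnn:          callable, full feature extraction for I frames (M100)
    lstm:         callable(prev_feat, pframe_data), cheap P-frame
                  feature update (M200)
    saliency_net: callable(feat) -> salient target region (M300)
    """
    results, feat = [], None
    for kind, data in frames:
        if kind == "I":
            feat = cnn(data)          # complete extraction, I frame only
        else:
            feat = lstm(feat, data)   # incremental update from code-stream data
        results.append(saliency_net(feat))
    return results
```

The loop makes the cost structure explicit: the expensive `cnn` call runs once per GOP, and every P frame reuses the running feature state.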
An embodiment of the present invention further provides a salient object detection device for compressed video, including a processor and a memory storing executable instructions of the processor, wherein the processor is configured to perform, by executing the executable instructions, the steps of the salient object detection method for compressed video described above.
Those skilled in the art will appreciate that various aspects of the present invention may be implemented as a system, a method or a program product. Accordingly, aspects of the present invention may take the form of an entirely hardware implementation, an entirely software implementation (including firmware, microcode, etc.), or an implementation combining hardware and software aspects, which may be collectively referred to herein as a "circuit", "module" or "system".
An electronic device 600 according to this embodiment of the present invention is described below with reference to Fig. 5. The electronic device 600 shown in Fig. 5 is merely an example and should not impose any limitation on the functions or scope of use of the embodiments of the present invention.
As shown in Fig. 5, the electronic device 600 takes the form of a general-purpose computing device. The components of the electronic device 600 may include, but are not limited to, at least one processing unit 610, at least one storage unit 620, a bus 630 connecting different system components (including the storage unit 620 and the processing unit 610), and a display unit 640.
The storage unit stores program code that can be executed by the processing unit 610, so that the processing unit 610 performs the steps according to various exemplary embodiments of the present invention described above in the method section of this specification. For example, the processing unit 610 may perform the steps shown in Fig. 1.
The storage unit 620 may include a readable medium in the form of a volatile storage unit, such as a random access memory (RAM) unit 6201 and/or a cache storage unit 6202, and may further include a read-only memory (ROM) unit 6203.
The storage unit 620 may also include a program/utility 6204 having a set of (at least one) program modules 6205, such program modules 6205 including, but not limited to, an operating system, one or more application programs, other program modules, and program data; each or some combination of these examples may include an implementation of a network environment.
The bus 630 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, a graphics acceleration port, a processing unit, or a local bus using any of a variety of bus structures.
The electronic device 600 may also communicate with one or more external devices 700 (e.g., a keyboard, a pointing device, a Bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 600, and/or with any device (e.g., a router, a modem, etc.) that enables the electronic device 600 to communicate with one or more other computing devices. Such communication may take place through an input/output (I/O) interface 650. In addition, the electronic device 600 may communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN) and/or a public network such as the Internet) through a network adapter 660. The network adapter 660 may communicate with the other modules of the electronic device 600 through the bus 630. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the electronic device 600, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
With the salient object detection device for compressed video provided by the present invention, the processor performs the salient object detection method for compressed video when executing the executable instructions, thereby obtaining the beneficial effects of the above salient object detection method for compressed video.
An embodiment of the present invention further provides a computer-readable storage medium for storing a program which, when executed, implements the steps of the salient object detection method for compressed video. In some possible implementations, aspects of the present invention may also be implemented in the form of a program product including program code which, when the program product runs on a terminal device, causes the terminal device to perform the steps according to various exemplary embodiments of the present invention described above in the method section of this specification.
Referring to Fig. 6, a program product 800 for implementing the above method according to an embodiment of the present invention is described; it may take the form of a portable compact disc read-only memory (CD-ROM) including program code, and may run on a terminal device such as a personal computer. However, the program product of the present invention is not limited thereto; in this document, a readable storage medium may be any tangible medium that contains or stores a program which can be used by or in combination with an instruction execution system, apparatus or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
A readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying readable program code. Such a propagated data signal may take a variety of forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A readable signal medium may also be any readable medium other than a readable storage medium that can send, propagate or transmit a program for use by or in combination with an instruction execution system, apparatus or device. Program code contained on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical cable, RF, etc., or any suitable combination of the above.
Program code for carrying out the operations of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on a remote computing device or cluster. In cases involving a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, through the Internet using an Internet service provider).
With the computer-readable storage medium provided by the present invention, the stored program, when executed, implements the steps of the salient object detection method for compressed video, thereby obtaining the beneficial effects of the above salient object detection method for compressed video.
The above is a further detailed description of the present invention in combination with specific preferred embodiments, and the specific implementation of the present invention shall not be deemed to be limited to these descriptions. For those of ordinary skill in the art to which the present invention belongs, a number of simple deductions or substitutions may be made without departing from the concept of the present invention, all of which shall be deemed to fall within the protection scope of the present invention.

Claims (11)

  1. A salient object detection method for compressed video, wherein the compressed video comprises multiple frames of data, the multiple frames of data comprising I-frame data and at least one item of P-frame data, the method comprising the following steps:
    inputting the I-frame data into a feature extraction network to extract a first feature of the I-frame data, the feature extraction network comprising a convolutional neural network;
    for each item of P-frame data, inputting the first feature of the corresponding frame data at the previous moment and the P-frame data into a long short-term memory network to extract a first feature of the P-frame data;
    obtaining a salient target region of each frame from the first feature of each frame's data using a saliency recognition network.
  2. The salient object detection method for compressed video according to claim 1, wherein the I-frame data comprises I-frame image data obtained by decoding an I-frame code stream of the compressed video, and the P-frame data comprises motion information and residual information in a P-frame code stream of the compressed video.
  3. The salient object detection method for compressed video according to claim 1, wherein the feature extraction network further comprises a first residual network connected in series with the convolutional neural network.
  4. The salient object detection method for compressed video according to claim 1, wherein obtaining the salient target region of each frame from the first feature of each frame's data using a saliency recognition network comprises the following steps:
    inputting the first feature of each frame's data into a second residual network, a third residual network and a fourth residual network connected in series, to obtain a second feature of each frame's data;
    obtaining the salient target region of each frame from the second feature of each frame's data using the saliency recognition network.
  5. The salient object detection method for compressed video according to claim 4, wherein obtaining the salient target region of each frame from the second feature of each frame's data using the saliency recognition network comprises the following steps:
    inputting the second feature of each frame's data into an atrous spatial pyramid pooling network to obtain a third feature of each frame's data;
    obtaining the salient target region of each frame from the third feature of each frame's data using the saliency recognition network.
  6. The salient object detection method for compressed video according to claim 5, wherein the atrous spatial pyramid pooling network comprises five modules connected in parallel, the five modules comprising a global average pooling layer, one 1x1 convolutional layer and three 3x3 atrous convolutional layers, and the outputs of the five modules are merged to obtain the third feature of each frame's data.
  7. The salient object detection method for compressed video according to claim 5, wherein obtaining the salient target region of each frame from the third feature of each frame's data using the saliency recognition network comprises the following steps:
    inputting the third feature of each frame's data into the saliency recognition network to obtain a probability map corresponding to each frame's data;
    binarizing the probability map according to a probability threshold to obtain a binarized map;
    extracting the salient region from the binarized map.
  8. The salient object detection method for compressed video according to claim 7, wherein the saliency recognition network comprises first to fifth deconvolution layers and an activation function layer; the third feature of each frame's data is input into the first deconvolution layer, the first feature of each frame's data is input into the second deconvolution layer, the outputs of the first deconvolution layer and the second deconvolution layer are merged and input into the third, fourth and fifth deconvolution layers connected in series, and the output of the fifth deconvolution layer passes through the activation function layer to output the probability map of each frame's data.
  9. A salient object detection system for compressed video, applying the salient object detection method for compressed video according to any one of claims 1 to 8, the system comprising:
    a first feature extraction module, configured to input the I-frame data into a feature extraction network and extract the first feature of the I-frame data, the feature extraction network comprising a convolutional neural network;
    a second feature extraction module, configured to, for each item of P-frame data, input the first feature of the corresponding frame data at the previous moment and the P-frame data into a long short-term memory network and extract the first feature of the P-frame data;
    a saliency detection module, configured to obtain the salient target region of each frame from the first feature of each frame's data using a saliency recognition network.
  10. A salient object detection device for compressed video, comprising:
    a processor;
    a memory storing executable instructions of the processor;
    wherein the processor is configured to perform, by executing the executable instructions, the steps of the salient object detection method for compressed video according to any one of claims 1 to 8.
  11. A computer-readable storage medium for storing a program, wherein, when the program is executed, the steps of the salient object detection method for compressed video according to any one of claims 1 to 8 are implemented.
PCT/CN2021/082752 2020-09-24 2021-03-24 Method, system, and device for detecting salient target in compressed video, and storage medium WO2022062344A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011016604.7 2020-09-24
CN202011016604.7A CN111931732B (en) 2020-09-24 2020-09-24 Method, system, device and storage medium for detecting salient object of compressed video

Publications (1)

Publication Number Publication Date
WO2022062344A1 true WO2022062344A1 (en) 2022-03-31

Family

ID=73334166

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/082752 WO2022062344A1 (en) 2020-09-24 2021-03-24 Method, system, and device for detecting salient target in compressed video, and storage medium

Country Status (2)

Country Link
CN (1) CN111931732B (en)
WO (1) WO2022062344A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115529457A (en) * 2022-09-05 2022-12-27 清华大学 Video compression method and device based on deep learning
CN115953727A (en) * 2023-03-15 2023-04-11 浙江天行健水务有限公司 Floc settling rate detection method and system, electronic equipment and medium

Families Citing this family (2)

Publication number Priority date Publication date Assignee Title
CN111931732B (en) * 2020-09-24 2022-07-15 苏州科达科技股份有限公司 Method, system, device and storage medium for detecting salient object of compressed video
CN116052047B (en) * 2023-01-29 2023-10-03 荣耀终端有限公司 Moving object detection method and related equipment thereof

Citations (5)

Publication number Priority date Publication date Assignee Title
CN108495129A (en) * 2018-03-22 2018-09-04 北京航空航天大学 The complexity optimized method and device of block partition encoding based on deep learning method
CN110163196A (en) * 2018-04-28 2019-08-23 中山大学 Notable feature detection method and device
CN111026915A (en) * 2019-11-25 2020-04-17 Oppo广东移动通信有限公司 Video classification method, video classification device, storage medium and electronic equipment
US20200143457A1 (en) * 2017-11-20 2020-05-07 A9.Com, Inc. Compressed content object and action detection
CN111931732A (en) * 2020-09-24 2020-11-13 苏州科达科技股份有限公司 Method, system, device and storage medium for detecting salient object of compressed video

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
JP3769788B2 (en) * 1995-09-29 2006-04-26 ソニー株式会社 Image signal transmission apparatus and method
CN108241854B (en) * 2018-01-02 2021-11-09 天津大学 Depth video saliency detection method based on motion and memory information
CN109376611B (en) * 2018-09-27 2022-05-20 方玉明 Video significance detection method based on 3D convolutional neural network
CN111461043B (en) * 2020-04-07 2023-04-18 河北工业大学 Video significance detection method based on deep network

Non-Patent Citations (1)

Title
YU SHENG; CHENG YUN; XIE LI; LUO ZHIMING; HUANG MIN; LI SHAOZI: "A novel recurrent hybrid network for feature fusion in action recognition", JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, vol. 49, 1 November 2017 (2017-11-01), US , pages 192 - 203, XP085260382, ISSN: 1047-3203, DOI: 10.1016/j.jvcir.2017.09.007 *

Cited By (4)

Publication number Priority date Publication date Assignee Title
CN115529457A (en) * 2022-09-05 2022-12-27 清华大学 Video compression method and device based on deep learning
CN115529457B (en) * 2022-09-05 2024-05-14 清华大学 Video compression method and device based on deep learning
CN115953727A (en) * 2023-03-15 2023-04-11 浙江天行健水务有限公司 Floc settling rate detection method and system, electronic equipment and medium
CN115953727B (en) * 2023-03-15 2023-06-09 浙江天行健水务有限公司 Method, system, electronic equipment and medium for detecting floc sedimentation rate

Also Published As

Publication number Publication date
CN111931732B (en) 2022-07-15
CN111931732A (en) 2020-11-13


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21870733

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21870733

Country of ref document: EP

Kind code of ref document: A1