WO2022062344A1 - Method, system, and device for detecting salient target in compressed video, and storage medium - Google Patents

Method, system, and device for detecting salient target in compressed video, and storage medium

Info

Publication number
WO2022062344A1
Authority
WO
WIPO (PCT)
Prior art keywords
frame
data
feature
network
compressed video
Prior art date
Application number
PCT/CN2021/082752
Other languages
French (fr)
Chinese (zh)
Inventor
邹文艺
章勇
曹李军
Original Assignee
苏州科达科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏州科达科技股份有限公司 filed Critical 苏州科达科技股份有限公司
Publication of WO2022062344A1 publication Critical patent/WO2022062344A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Definitions

  • the present invention relates to the technical field of video processing, and in particular, to a method, system, device and storage medium for salient target detection of compressed video.
  • Video saliency detection is mainly divided into two categories: visual attention detection, which estimates the trajectory of the human gaze point when observing an image and has been widely studied in neurology, and salient target detection, which segments the most important or visually prominent objects from background noise.
  • In the prior art, however, there is no salient target detection method for compressed video that can take both detection speed and detection effect into account.
  • The purpose of the present invention is to provide a method, system, device and storage medium for salient target detection in compressed video, which can improve the detection speed of salient targets in compressed video while ensuring the detection effect.
  • An embodiment of the present invention provides a method for detecting a salient object in a compressed video, where the compressed video includes multiple frames of data, and the multiple frames of data include I-frame data and at least one P-frame data, and the method includes the following steps:
  • the feature extraction network includes a convolutional neural network
  • a saliency recognition network is used to obtain the saliency target area of each frame.
  • the I-frame data includes I-frame image data obtained by decoding the I-frame code stream of the compressed video
  • the P-frame data includes the motion information and residual information in the P-frame code stream of the compressed video.
  • the feature extraction network further includes a first residual network connected in series with the convolutional neural network.
  • a saliency recognition network is used to obtain the saliency target area of each frame, including the following steps:
  • a saliency recognition network is used to obtain the saliency target area of each frame.
  • a saliency recognition network is used to obtain the saliency target area of each frame, including the following steps:
  • a saliency recognition network is used to obtain the saliency target area of each frame.
  • the atrous spatial pyramid pooling network includes five modules connected in parallel: a global average pooling layer, a 1×1 convolutional layer and three 3×3 atrous convolutional layers. The outputs of the five modules are combined to obtain the third feature of each frame of data.
  • a saliency recognition network is used to obtain the saliency target area of each frame, including the following steps:
  • a saliency region is extracted from the binarized map.
  • the saliency recognition network includes first to fifth deconvolution layers and an activation function layer. The third feature of each frame of data is input to the first deconvolution layer, and the first feature of each frame of data is input to the second deconvolution layer; the outputs of the first and second deconvolution layers are combined and passed sequentially through the third, fourth and fifth deconvolution layers connected in series, and the output of the fifth deconvolution layer passes through the activation function layer to produce the probability map of each frame of data.
  • By introducing a long short-term memory network, features need to be fully extracted only for the I frame, while the features of a P frame can be extracted by using the features of the previous frame, the P-frame data and the long short-term memory network; salient object detection can then be performed on the extracted features, which greatly improves the detection speed of salient objects in compressed video.
  • the embodiment of the present invention also provides a salient target detection system for compressed video, which is applied to the salient target detection method for compressed video, and the system includes:
  • a first feature extraction module configured to input the I frame data into a feature extraction network to extract the first feature of the I frame data, and the feature extraction network includes a convolutional neural network;
  • the second feature extraction module is configured to, for each P-frame data, input the first feature of the frame data at the previous moment and the P-frame data into the long short-term memory network, and extract the first feature of the P-frame data;
  • the saliency detection module is used for obtaining the saliency target area of each frame by adopting the saliency identification network according to the first feature of the data of each frame.
  • By introducing a long short-term memory network, only the features of the I frame need to be fully extracted, while the features of a P frame can be extracted by using the features of the previous frame, the P-frame data and the long short-term memory network; salient object detection can then be performed on the extracted features, which greatly improves the detection speed of salient objects in compressed video.
  • the embodiment of the present invention also provides a salient object detection device for compressed video, including:
  • the processor is configured to execute the steps of the salient object detection method for compressed video by executing the executable instructions.
  • When executing the executable instructions, the processor performs the method for detecting salient objects in compressed video, so that the beneficial effects of the salient object detection method for compressed video provided by the present invention can be obtained.
  • Embodiments of the present invention further provide a computer-readable storage medium for storing a program, and when the program is executed, the steps of the method for detecting a salient object in a compressed video are implemented.
  • When executed, the stored program implements the steps of the method for detecting salient objects in compressed video, so that the beneficial effects of the above-mentioned salient object detection method for compressed video can be obtained.
  • FIG. 1 is a flowchart of a method for detecting a salient object in a compressed video according to an embodiment of the present invention
  • FIG. 2 is a structural diagram of the salient target detection network for compressed video according to a specific example of the present invention;
  • FIG. 3 is a structural diagram of a long-short-term memory network according to an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of a salient object detection system for compressed video according to an embodiment of the present invention.
  • FIG. 5 is a schematic structural diagram of a salient object detection device for compressed video according to an embodiment of the present invention.
  • FIG. 6 is a schematic structural diagram of a computer storage medium according to an embodiment of the present invention.
  • Example embodiments will now be described more fully with reference to the accompanying drawings.
  • Example embodiments can be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.
  • the same reference numerals in the drawings denote the same or similar structures, and thus their repeated descriptions will be omitted.
  • the present invention provides a salient object detection method for compressed video, where the compressed video includes multiple frames of data, and the multiple frames of data include I-frame data and at least one P-frame data.
  • Video is generally regarded as a sequence of independent images and can be stored and transmitted in compressed form. A codec divides the video into I frames and P/B frames: an I frame is a complete image frame, while a P/B frame only retains the changes of the image relative to a referenced frame.
  • The P-frame data at time t+k only records the motion information m_{t+k} of objects and the residual information r_{t+k}; consecutive frames are therefore highly correlated, and the changes between frames are also recorded in the video code stream.
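  • The relationship between a P frame and its reference frame can be sketched as block-based motion compensation followed by adding back the residual; the block size and data layout below are illustrative assumptions, not the actual codec format:

```python
import numpy as np

def reconstruct_p_frame(ref, motion_vectors, residual, block=4):
    # Sketch of block-based motion compensation: each block of the P frame
    # is predicted by copying a block from the reference frame displaced by
    # its motion vector, and the residual is then added back.
    h, w = ref.shape
    pred = np.zeros_like(ref)
    for by in range(0, h, block):
        for bx in range(0, w, block):
            dy, dx = motion_vectors[by // block, bx // block]
            sy = int(np.clip(by + dy, 0, h - block))
            sx = int(np.clip(bx + dx, 0, w - block))
            pred[by:by + block, bx:bx + block] = ref[sy:sy + block, sx:sx + block]
    return pred + residual
```

  • In the method described below, this reconstruction never has to be performed for P frames: the motion vectors and residuals are consumed directly as network inputs.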
  • the salient target detection method for the compressed video includes the following steps:
  • S100: Input the I-frame data into a feature extraction network to extract the first feature of the I-frame data, where the feature extraction network includes a convolutional neural network that can extract complete features from the complete image frame of the I frame; here the first feature takes the form of a feature map;
  • S200: For each P-frame data, input the first feature of the frame data at the previous moment and the P-frame data into a long short-term memory network, and extract the first feature of the P-frame data, where the P-frame data includes the motion vector and residual data of the P frame relative to the frame at the previous moment;
  • S300: According to the first feature of each frame of data, a saliency recognition network is used to obtain the saliency target area of each frame.
  • In the above steps, feature extraction for I-frame data in step S100 is performed by the convolutional neural network, while feature extraction for P-frame data in step S200 introduces a long short-term memory network so that the features of the previous frame can be reused; step S300 then performs salient target detection on the extracted features. Therefore, complete features need to be extracted only for the I frame, whereas the features of a P frame can be extracted quickly from the long short-term memory network and the P-frame data in the video stream.
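  • The control flow of steps S100 to S300 can be sketched as a per-frame loop; `extract_i`, `lstm_step` and `recognize` below are hypothetical stand-ins for the feature extraction network, the long short-term memory network and the saliency recognition network:

```python
def detect_salient_targets(frames, extract_i, lstm_step, recognize):
    # frames: list of dicts; an I frame carries decoded image data, while a
    # P frame carries only motion and residual information from the stream.
    regions = []
    feature = None
    for f in frames:
        if f["type"] == "I":
            feature = extract_i(f["image"])                           # S100
        else:
            feature = lstm_step(feature, f["motion"], f["residual"])  # S200
        regions.append(recognize(feature))                            # S300
    return regions
```

  • Only the I frame triggers the full extraction path; every P frame updates the running feature state from the code-stream data, which is where the speed-up comes from.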
  • the I-frame data includes I-frame image data obtained by decoding the I-frame code stream of the compressed video
  • the P-frame data includes the motion information and residual information in the P-frame code stream of the compressed video; therefore, the features of P frames can be extracted quickly from the motion information and residual information, which effectively improves the feature extraction speed for compressed video and greatly improves the detection speed of salient objects in compressed video.
  • the motion information may include motion vectors
  • the residual information may include residual coefficients.
  • the I-frame data retains complete information
  • the I-frame data is decoded to obtain complete image information
  • feature extraction and processing are performed in step S100.
  • the saliency target detection is carried out through step S300.
  • For continuous P-frame data, a motion-assisted long short-term memory network Nm_lstm is used to extract features, and salient target detection is then performed on the extracted features.
  • For each P frame, the feature Residual_1 extracted from the image data of the previous I frame, or the features c_{t+k-1} and h_{t+k-1} extracted by the long short-term memory network (LSTM, Long Short-Term Memory) from the P frame at the previous time t+k-1, together with the motion information and residual information in the video stream, are used as input to extract the features of the P frame, after which salient target detection is performed on the extracted features.
  • the convolutional neural network is a head convolutional neural network HeadConv
  • the feature extraction network further includes a first residual network Residual_1t connected in series with the convolutional neural network HeadConv
  • the output features of the first residual network are input into the long short-term memory network Nm_lstm for the P-frame data at time t+1;
  • the features output by the long short-term memory network Nm_lstm for the P-frame data at time t+1 are in turn input into the Nm_lstm for the P-frame data at time t+2, and so on.
  • Residual networks are characterized by being easy to optimize and capable of increasing accuracy by adding considerable depth.
  • the internal residual blocks use skip connections to alleviate the vanishing-gradient problem caused by increasing depth in deep neural networks.
  • the first feature Residual_1t is extracted from the I-frame image data and input into the motion-assisted long short-term memory network to obtain the first features [c_{t+1}, ..., c_{t+n}] of the subsequent frames; the specific process is shown in the figure.
  • the first residual network Residual_1t is further connected in series with a second residual network Residual_2t, a third residual network Residual_3t and a fourth residual network Residual_4t; increasing the depth of the residual networks further improves the accuracy of feature extraction.
  • the first feature of the I frame output by the first residual network Residual_1t is also input to the second residual network Residual_2t, the third residual network Residual_3t and the fourth residual network Residual_4t to obtain the second feature of the I frame.
  • the feature extraction part adopts ResNet-101 as the backbone network, including the convolutional neural network HeadConv and four residual networks (Residual_i, i ∈ {1, 2, 3, 4}).
  • the output of the long-short-term memory network Nm-lstm of each P frame is also input to the second residual network Residual_2t, the third residual network Residual_3t and the fourth residual network Residual_4t to obtain the second feature of the P frame.
  • For P frames, the feature extraction part includes a motion-assisted long short-term memory network and the same three residual networks as used for I frames.
  • the convolutional neural network HeadConv adopts a convolution kernel with a size of 7 ⁇ 7, a stride of 2, and a channel of 64.
  • the four residual networks Residual_1t to Residual_4t respectively include 3, 4, 23 and 3 residual learning blocks based on the "bottleneck block", with 256, 512, 1024 and 2048 output channels, respectively.
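  • The stage configuration above matches the standard ResNet-101 layout. As a quick check, the spatial size after the 7×7, stride-2 head convolution follows the usual output-size formula (a padding of 3 is assumed here, as in common ResNet implementations; the patent does not state it):

```python
def conv_output_size(size, kernel=7, stride=2, padding=3):
    # Convolution output size: floor((n + 2p - k) / s) + 1.
    return (size + 2 * padding - kernel) // stride + 1

# Blocks per stage and output channels of the four residual networks,
# as listed above (ResNet-101: 3 + 4 + 23 + 3 = 33 bottleneck blocks).
RESNET101_STAGES = [(3, 256), (4, 512), (23, 1024), (3, 2048)]
```

  • For a hypothetical 224×224 input, HeadConv alone would halve the spatial resolution to 112×112 before the residual stages.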
  • a saliency recognition network is used to obtain the saliency target area of each frame, including the following steps:
  • a saliency recognition network is used to obtain the saliency target area of each frame.
  • the fourth residual network Residual_4t is further connected in series with an atrous spatial pyramid pooling network.
  • the atrous spatial pyramid pooling (ASPP) network can further enlarge the receptive field of feature extraction and further improve the feature extraction effect.
  • a saliency recognition network is used to obtain the saliency target area of each frame, including the following steps:
  • a saliency recognition network is used to obtain the saliency target area of each frame.
  • the outputs of the five modules are combined by concatenation to obtain the third feature of each frame of data, and a 1×1 convolutional layer is then used to reduce the number of channels to the desired value.
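  • A single atrous branch of the ASPP module can be illustrated with a minimal single-channel dilated convolution (zero padding chosen so the output keeps the input resolution; a sketch under these assumptions, not the network's actual implementation):

```python
import numpy as np

def atrous_conv2d(x, kernel, rate):
    # Dilated convolution: kernel taps are spaced `rate` pixels apart,
    # enlarging the receptive field without adding parameters.
    kh, kw = kernel.shape
    ph, pw = (kh - 1) * rate // 2, (kw - 1) * rate // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            for a in range(kh):
                for b in range(kw):
                    out[i, j] += kernel[a, b] * xp[i + a * rate, j + b * rate]
    return out
```

  • Increasing `rate` spreads the kernel taps apart, which is exactly what the three 3×3 atrous branches exploit to sample context at several scales before concatenation.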
  • the saliency recognition network includes first to fifth deconvolution layers conv-1 to conv-5 and an activation function layer Sigmoid. The third feature of each frame of data output by the atrous spatial pyramid pooling network ASPP is input to the first deconvolution layer conv-1, and the first feature of each frame of data output by the first residual network Residual_1t or the long short-term memory network Nm_lstm is input to the second deconvolution layer conv-2. The outputs of conv-1 and conv-2 are combined by concatenation and passed sequentially through the third deconvolution layer conv-3, the fourth deconvolution layer conv-4 and the fifth deconvolution layer conv-5 connected in series; the resulting output is a feature map with the same resolution as the input I-frame image, and the output of conv-5 passes through the activation function layer Sigmoid to produce the probability map of each frame of data.
  • Because the convolutional network and the residual networks make the resolution of the feature map smaller than that of the input frame image, the resolution of the feature map is restored to that of the input image through the five deconvolution layers.
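  • The resolution restored by each deconvolution layer follows the standard transposed-convolution size formula; the kernel/stride/padding values below are illustrative assumptions (the patent does not specify them), chosen so that each layer exactly doubles the spatial size:

```python
def deconv_output_size(size, kernel=4, stride=2, padding=1):
    # Transposed-convolution output size: (n - 1) * s - 2p + k.
    # With k=4, s=2, p=1 each layer exactly doubles the spatial size.
    return (size - 1) * stride - 2 * padding + kernel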
  • a saliency region can be extracted according to the probability map.
  • a saliency recognition network is used to obtain the saliency target area of each frame, including the following steps:
  • a saliency region is extracted from the binarized map.
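  • The two steps above can be sketched as thresholding the probability map and taking the bounding box of the remaining salient pixels (the threshold of 0.5 is an illustrative choice; the patent does not fix a value):

```python
import numpy as np

def extract_salient_region(prob, threshold=0.5):
    # Binarize the probability map, then return the bounding box
    # (y0, x0, y1, x1) of the salient pixels, or None if there are none.
    mask = prob >= threshold
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    return int(ys.min()), int(xs.min()), int(ys.max()), int(xs.max())
```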
  • FIG. 3 is a structural diagram of the long short-term memory network in this embodiment.
  • the long short-term memory network is configured to obtain the first feature of the current frame by using the motion information and the first feature of the adjacent frame; the specific formula is as follows:
  • c_{t+k-1} and h_{t+k-1} are the outputs of the motion-assisted long short-term memory network at time t+k-1
  • c_t and h_t are given by Residual_1t, with k ∈ [1, n]
  • n is the number of P frames within a GOP.
  • the correction operation W performs bilinear interpolation at each position of the feature map, mapping position p+Δp of frame t+k-1 to position p of frame t+k.
  • the specific formula is as follows (reconstructed from the symbol definitions below, the original figure not being reproduced here):

    c_{t+k-1→t+k}(p) = Σ_q G(q, p + Δp) · c_{t+k-1}(q)

  • Δp is obtained from m_{t+k}
  • q ranges over the spatial positions of the feature map c_{t+k-1}
  • G(·) denotes the bilinear interpolation kernel
  • the hidden-layer feature h_{t+k-1→t+k} is obtained in the same way as c_{t+k-1→t+k}, and h_{t+k-1→t+k} and c_{t+k-1→t+k} serve as the inputs of the long short-term memory network from the previous frame to the current frame.
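  • The correction operation W — sampling the previous frame's feature map at positions displaced by the motion-derived offset Δp, with bilinear weights playing the role of the kernel G(·) — can be sketched for a single-channel map as follows (out-of-bounds samples are treated as zero; an illustration, not the patent's exact implementation):

```python
import numpy as np

def warp_feature(feat, offset):
    # offset[y, x] = (dy, dx) gives the displacement for target position
    # p = (y, x); the output samples feat at p + offset with bilinear
    # interpolation over the four surrounding integer positions.
    h, w = feat.shape
    out = np.zeros((h, w), dtype=float)
    for y in range(h):
        for x in range(w):
            sy = y + offset[y, x, 0]
            sx = x + offset[y, x, 1]
            y0, x0 = int(np.floor(sy)), int(np.floor(sx))
            wy, wx = sy - y0, sx - x0
            for dy, dx, wgt in ((0, 0, (1 - wy) * (1 - wx)),
                                (0, 1, (1 - wy) * wx),
                                (1, 0, wy * (1 - wx)),
                                (1, 1, wy * wx)):
                yy, xx = y0 + dy, x0 + dx
                if wgt > 0 and 0 <= yy < h and 0 <= xx < w:
                    out[y, x] += wgt * feat[yy, xx]
    return out
```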
  • W_g, W_i and W_c are learned weights, and σ(·) denotes the sigmoid function, which maps variables to values between 0 and 1.
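  • With the warped states and the weights named above, the memory update can be given a schematic shape. The exact gate wiring of the patent's cell is not reproduced in its text, so the following is only an LSTM-flavoured sketch (vector features; hypothetical gating built from W_g, W_i and W_c):

```python
import numpy as np

def sigmoid(z):
    # sigma(): maps variables to values between 0 and 1.
    return 1.0 / (1.0 + np.exp(-z))

def motion_assisted_lstm_step(c_warp, h_warp, x, W_g, W_i, W_c):
    # Schematic gate computation with the three named weights. c_warp and
    # h_warp are the warped cell/hidden states from the previous frame,
    # x the current P-frame input; all are length-d vectors, and each
    # weight matrix has shape (d, 2d).  This wiring is an assumption.
    z = np.concatenate([h_warp, x])
    i = sigmoid(W_i @ z)                # input gate
    g = np.tanh(W_g @ z)                # candidate memory
    c = c_warp + i * g                  # assumed cell update
    h = sigmoid(W_c @ z) * np.tanh(c)   # output gate -> hidden state
    return c, h
```

  • Applied once per P frame, such a cell keeps only one full feature extraction per GOP, with every subsequent frame handled by this lightweight update.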
  • the present invention can quickly extract the features of P frames through the long short-term memory network and the motion information and residual information in the video code stream, thereby effectively improving the feature extraction speed for compressed video.
  • an embodiment of the present invention further provides a salient target detection system for compressed video, which is applied to the salient target detection method for compressed video, and the system includes:
  • a first feature extraction module M100 configured to input the I frame data into a feature extraction network, to extract the first feature of the I frame data, and the feature extraction network includes a convolutional neural network;
  • the second feature extraction module M200 is configured to, for each P-frame data, input the first feature of the frame data at the previous moment and the P-frame data into a long short-term memory network, and extract the first feature of the P-frame data.
  • the saliency detection module M300 is used for obtaining the saliency target area of each frame by adopting a saliency identification network according to the first feature of the data of each frame.
  • When the first feature extraction module M100 extracts features from the I-frame data, the convolutional neural network is used for extraction; when the second feature extraction module M200 extracts features from the P-frame data, a long short-term memory network is introduced, so that the features of the previous frame, the P-frame data and the long short-term memory network can be used for feature extraction, and the saliency detection module M300 can then perform salient target detection on the extracted features.
  • An embodiment of the present invention further provides a salient object detection device for compressed video, including a processor and a memory in which executable instructions of the processor are stored, wherein the processor is configured to perform the steps of the salient target detection method for compressed video by executing the executable instructions.
  • Aspects of the present invention may be implemented as a system, method or program product. Therefore, various aspects of the present invention can be embodied in the following forms: a complete hardware implementation, a complete software implementation (including firmware, microcode, etc.), or a combination of hardware and software, which may be collectively referred to herein as a "circuit", "module" or "system".
  • the electronic device 600 according to this embodiment of the present invention is described below with reference to FIG. 5 .
  • the electronic device 600 shown in FIG. 5 is only an example, and should not impose any limitation on the function and scope of use of the embodiments of the present invention.
  • electronic device 600 takes the form of a general-purpose computing device.
  • Components of the electronic device 600 may include, but are not limited to, at least one processing unit 610, at least one storage unit 620, a bus 630 connecting different system components (including the storage unit 620 and the processing unit 610), a display unit 640, and the like.
  • the storage unit stores program codes, and the program codes can be executed by the processing unit 610, so that the processing unit 610 executes the steps according to the various exemplary embodiments of the present invention described in the method section above in this specification.
  • the processing unit 610 may perform the steps shown in FIG. 1 .
  • the storage unit 620 may include a readable medium in the form of a volatile storage unit, such as a random access storage unit (RAM) 6201 and/or a cache storage unit 6202 , and may further include a read only storage unit (ROM) 6203 .
  • RAM random access storage unit
  • ROM read only storage unit
  • the storage unit 620 may also include a program/utility 6204 having a set (at least one) of program modules 6205 including, but not limited to, an operating system, one or more application programs, other program modules, and programs Data, each or some combination of these examples may include an implementation of a network environment.
  • the bus 630 may represent one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, a graphics acceleration port, a processing unit, or a local bus using any of a variety of bus structures.
  • the electronic device 600 may also communicate with one or more external devices 700 (e.g., keyboards, pointing devices, Bluetooth devices, etc.), with one or more devices that enable a user to interact with the electronic device 600, and/or with any device (e.g., router, modem, etc.) that enables the electronic device 600 to communicate with one or more other computing devices. Such communication may occur through the input/output (I/O) interface 650. Also, the electronic device 600 may communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 660. The network adapter 660 may communicate with other modules of the electronic device 600 through the bus 630. It should be appreciated that, although not shown, other hardware and/or software modules may be used in conjunction with the electronic device 600, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives and data backup storage systems.
  • When executing the executable instructions, the processor performs the method for detecting salient objects in compressed video, so that the beneficial effects of the salient object detection method for compressed video provided by the present invention can be obtained.
  • Embodiments of the present invention further provide a computer-readable storage medium for storing a program, and when the program is executed, the steps of the method for detecting a salient object in a compressed video are implemented.
  • Aspects of the present invention can also be implemented in the form of a program product comprising program code; when the program product runs on a terminal device, the program code causes the terminal device to execute the steps according to the various exemplary embodiments of the present invention described in the method section above in this specification.
  • A program product 800 for implementing the above method according to an embodiment of the present invention is described; it can adopt a portable compact disk read-only memory (CD-ROM), include program codes, and run on a terminal device such as a personal computer.
  • CD-ROM compact disk read only memory
  • the program product of the present invention is not limited thereto, and in this document, a readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • the program product may employ any combination of one or more readable media.
  • the readable medium may be a readable signal medium or a readable storage medium.
  • the readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or a combination of any of the above. More specific examples (non-exhaustive list) of readable storage media include: electrical connections with one or more wires, portable disks, hard disks, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM or flash memory), optical fiber, portable compact disk read only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
  • the readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying readable program code therein. Such a propagated data signal may take a variety of forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the foregoing.
  • A readable signal medium may also be any readable medium other than a readable storage medium that can transmit, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
  • Program code embodied on a readable storage medium may be transmitted using any suitable medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
  • Program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages—such as Java, C++, etc., as well as conventional procedural Programming Language - such as the "C" language or similar programming language.
  • the program code may execute entirely on the user computing device, partly on the user device, as a stand-alone software package, partly on the user computing device and partly on a remote computing device, or entirely on the remote computing device or cluster execute on.
  • the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, through the Internet using an Internet service provider).
  • when executed, the stored program implements the steps of the method for detecting salient objects in compressed video, so that the beneficial effects of the above-mentioned method can be obtained.


Abstract

Provided are a method, system, and device for detecting a salient target in a compressed video, and a storage medium. The compressed video comprises multi-frame data, and the multi-frame data comprises I-frame data and at least one piece of P-frame data. The method comprises: inputting the I-frame data into a feature extraction network and extracting a first feature of the I-frame data, the feature extraction network comprising a convolutional neural network; for each piece of P-frame data, inputting the first feature of the frame data at the previous moment together with the P-frame data into a long short-term memory network and extracting a first feature of the P-frame data; and obtaining, according to the first feature of each piece of frame data, the salient target region in each frame using a saliency recognition network. With the present invention, by introducing a long short-term memory network, only the features of an I-frame need to be fully extracted, while the features of a P-frame can be extracted using the features of the previous frame, the P-frame data, and the long short-term memory network, thus increasing the speed of detecting a salient target in a compressed video.

Description

Salient object detection method, system, device and storage medium for compressed video
Technical Field
The present invention relates to the technical field of video processing, and in particular to a method, system, device and storage medium for salient object detection in compressed video.
Background Art
Video saliency detection falls into two main categories. The first, visual attention detection, estimates the trajectory of human gaze points while observing an image and has been widely studied in neuroscience. The second, salient object detection, segments the most important or visually prominent objects out of the background. For this second category, the prior art offers no salient object detection method for compressed video that achieves both high detection speed and good detection quality.
Summary of the Invention
In view of the problems in the prior art, the purpose of the present invention is to provide a method, system, device and storage medium for salient object detection in compressed video, improving detection speed while preserving detection quality.
An embodiment of the present invention provides a salient object detection method for compressed video, where the compressed video includes multiple frames of data, the multiple frames including I-frame data and at least one piece of P-frame data. The method includes the following steps:
inputting the I-frame data into a feature extraction network to extract a first feature of the I-frame data, the feature extraction network including a convolutional neural network;
for each piece of P-frame data, inputting the first feature of the frame data at the previous moment together with the P-frame data into a long short-term memory network to extract a first feature of the P-frame data;
obtaining the salient object region of each frame with a saliency recognition network, according to the first feature of each frame of data.
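The three steps above can be sketched as a per-GOP dispatch loop. The feature extractor, LSTM step and saliency head below are hypothetical toy stand-ins (simple arithmetic on lists), not the networks of the embodiment; only the control flow — full extraction for the I-frame, cheap updates for P-frames — reflects the method:

```python
# Toy stand-ins for the three networks of the method (not the embodiment's layers).
def extract_i_features(pixels):              # stands in for HeadConv + residual nets
    return [p * 0.5 for p in pixels]

def lstm_step(prev_feat, motion, residual):  # stands in for the motion-aided LSTM
    return [f + m + r for f, m, r in zip(prev_feat, motion, residual)]

def saliency_head(feat):                     # stands in for the saliency network
    return [1 if f > 0.5 else 0 for f in feat]

def detect_gop(frames):
    feats, prev = [], None
    for frame in frames:
        if frame["type"] == "I":
            prev = extract_i_features(frame["pixels"])   # full extraction, I-frame only
        else:
            # P-frame: reuse previous features plus motion and residual data
            prev = lstm_step(prev, frame["motion"], frame["residual"])
        feats.append(prev)
    return [saliency_head(f) for f in feats]

gop = [
    {"type": "I", "pixels": [0.2, 1.8]},
    {"type": "P", "motion": [0.0, 0.1], "residual": [0.5, -0.2]},
]
masks = detect_gop(gop)
print(masks)  # [[0, 1], [1, 1]]
```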
Optionally, the I-frame data includes I-frame image data obtained by decoding the I-frame code stream of the compressed video, and the P-frame data includes motion information and residual information from the P-frame code stream of the compressed video.
Optionally, the feature extraction network further includes a first residual network connected in series with the convolutional neural network.
Optionally, obtaining the salient object region of each frame with a saliency recognition network according to the first feature of each frame of data includes the following steps:
inputting the first feature of each frame of data into a second residual network, a third residual network and a fourth residual network connected in series, to obtain a second feature of each frame of data;
obtaining the salient object region of each frame with the saliency recognition network, according to the second feature of each frame of data.
Optionally, obtaining the salient object region of each frame with the saliency recognition network according to the second feature of each frame of data includes the following steps:
inputting the second feature of each frame of data into an atrous spatial pyramid pooling network to obtain a third feature of each frame of data;
obtaining the salient object region of each frame with the saliency recognition network, according to the third feature of each frame of data.
Optionally, the atrous spatial pyramid pooling network includes five modules connected in parallel: a global average pooling layer, a 1x1 convolutional layer and three 3x3 atrous convolutional layers; the outputs of the five modules are concatenated to obtain the third feature of each frame of data.
Optionally, obtaining the salient object region of each frame with the saliency recognition network according to the third feature of each frame of data includes the following steps:
inputting the third feature of each frame of data into the saliency recognition network to obtain a probability map corresponding to each frame of data;
binarizing the probability map according to a probability threshold to obtain a binarized map;
extracting the salient region from the binarized map.
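A minimal sketch of this thresholding step follows. The threshold value 0.5 and the bounding-box form of the extracted region are illustrative assumptions; the patent fixes neither:

```python
def binarize(prob_map, threshold=0.5):
    # Probability map -> binarized map, per the probability threshold.
    return [[1 if p >= threshold else 0 for p in row] for row in prob_map]

def bounding_box(binary_map):
    # Read the salient region off the binarized map as a bounding box.
    ys = [y for y, row in enumerate(binary_map) for v in row if v]
    xs = [x for row in binary_map for x, v in enumerate(row) if v]
    if not xs:
        return None
    return (min(ys), min(xs), max(ys), max(xs))  # (top, left, bottom, right)

prob = [
    [0.1, 0.2, 0.1],
    [0.2, 0.9, 0.8],
    [0.1, 0.7, 0.6],
]
mask = binarize(prob)
print(mask)                # [[0, 0, 0], [0, 1, 1], [0, 1, 1]]
print(bounding_box(mask))  # (1, 1, 2, 2)
```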
Optionally, the saliency recognition network includes first to fifth deconvolution layers and an activation function layer. The third feature of each frame of data is input into the first deconvolution layer, and the first feature of each frame of data is input into the second deconvolution layer; the outputs of the first and second deconvolution layers are concatenated and passed through the third, fourth and fifth deconvolution layers connected in series, and the output of the fifth deconvolution layer passes through the activation function layer to output the probability map of each frame of data.
By the salient object detection method for compressed video of the present invention, which introduces a long short-term memory network, full features need to be extracted only for the I-frame; the features of a P-frame can be extracted from the features of the previous frame, the P-frame data and the long short-term memory network, and salient object detection is then performed on the extracted features. This greatly increases the speed of salient object detection in compressed video.
An embodiment of the present invention further provides a salient object detection system for compressed video, applied to the above salient object detection method for compressed video. The system includes:
a first feature extraction module, configured to input the I-frame data into a feature extraction network to extract the first feature of the I-frame data, the feature extraction network including a convolutional neural network;
a second feature extraction module, configured to, for each piece of P-frame data, input the first feature of the frame data at the previous moment together with the P-frame data into a long short-term memory network to extract the first feature of the P-frame data;
a saliency detection module, configured to obtain the salient object region of each frame with a saliency recognition network, according to the first feature of each frame of data.
By the salient object detection system for compressed video of the present invention, which introduces a long short-term memory network, full features need to be extracted only for the I-frame; the features of a P-frame can be extracted from the features of the previous frame, the P-frame data and the long short-term memory network, and salient object detection is then performed on the extracted features. This greatly increases the speed of salient object detection in compressed video.
An embodiment of the present invention further provides a salient object detection device for compressed video, including:
a processor;
a memory storing executable instructions of the processor;
wherein the processor is configured to execute, via the executable instructions, the steps of the salient object detection method for compressed video.
With the salient object detection device for compressed video provided by the present invention, the processor performs the salient object detection method for compressed video when executing the executable instructions, thereby obtaining the beneficial effects of the above-mentioned method.
An embodiment of the present invention further provides a computer-readable storage medium for storing a program, where the program, when executed, implements the steps of the salient object detection method for compressed video.
With the computer-readable storage medium provided by the present invention, the stored program implements the steps of the salient object detection method for compressed video when executed, thereby obtaining the beneficial effects of the above-mentioned method.
Brief Description of the Drawings
Other features, objects and advantages of the present invention will become more apparent upon reading the following detailed description of non-limiting embodiments with reference to the accompanying drawings.
Fig. 1 is a flowchart of a salient object detection method for compressed video according to an embodiment of the present invention;
Fig. 2 is a structural diagram of a salient object detection network for compressed video according to a specific example of the present invention;
Fig. 3 is a structural diagram of a long short-term memory network according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of a salient object detection system for compressed video according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a salient object detection device for compressed video according to an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of a computer storage medium according to an embodiment of the present invention.
Detailed Description of the Embodiments
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments, however, can be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The same reference numerals in the drawings denote the same or similar structures, so their repeated description will be omitted.
In one embodiment, the present invention provides a salient object detection method for compressed video, where the compressed video includes multiple frames of data, the multiple frames including I-frame data and at least one piece of P-frame data. A video is generally regarded as a sequence of independent images and can be stored and transmitted in compressed form. The codec divides the video into I-frames and P/B-frames: an I-frame is a complete image frame, while a P/B-frame only keeps the changes relative to a reference image. In compressed video using I-frames and P-frames, the P-frame data at time t+k records only the motion information m_{t+k} and residual information r_{t+k} of objects; consecutive frames are therefore highly correlated, and the changes between frames are recorded in the video code stream.
As shown in Fig. 1, the salient object detection method for compressed video includes the following steps:
S100: input the I-frame data into a feature extraction network to extract the first feature of the I-frame data. The feature extraction network includes a convolutional neural network, which extracts complete features from the complete image frame of the I-frame; here the first feature takes the form of a feature map;
S200: for each piece of P-frame data, input the first feature of the frame data at the previous moment together with the P-frame data into a long short-term memory network to extract the first feature of the P-frame data; here the P-frame data includes the motion vectors and residual data of the P-frame relative to the frame at the previous moment;
S300: obtain the salient object region of each frame with a saliency recognition network, according to the first feature of each frame of data.
In the salient object detection method for compressed video of the present invention, feature extraction for the I-frame data in step S100 is performed by a convolutional neural network, while feature extraction for the P-frame data in step S200 introduces a long short-term memory network, so that the features of the previous frame and the long short-term memory network can be used for extraction; step S300 then performs salient object detection on the extracted features. Therefore, complete features need to be extracted only for the I-frame, while the features of a P-frame can be extracted quickly through the long short-term memory network and the P-frame data in the video code stream.
In this embodiment, the I-frame data includes I-frame image data obtained by decoding the I-frame code stream of the compressed video, and the P-frame data includes motion information and residual information from the P-frame code stream of the compressed video; the features of a P-frame can thus be extracted quickly from the motion information and residual information, which effectively speeds up feature extraction from compressed video and thereby greatly speeds up salient object detection in compressed video. Specifically, the motion information may include motion vectors, and the residual information may include residual coefficients.
Specifically, in a video coding sequence, within a group of pictures (GOP, Group Of Pictures), the I-frame data retains complete information: it is decoded to obtain the complete image, which undergoes feature extraction in step S100 and salient object detection in step S300. For P-frames, in step S200 a motion-aided long short-term memory network (Nm_lstm) extracts features from the consecutive P-frame data, and salient object detection is then performed on the extracted features. For the P-frame data at time t+k, the long short-term memory network (LSTM, Long Short-Term Memory) takes as input the feature Residual_1 extracted from the preceding I-frame image data, or the features c_{t+k-1} and h_{t+k-1} extracted for the P-frame at the previous moment, together with the motion information and residual information in the video code stream, and extracts the features of the P-frame; salient object detection is then performed on the extracted features.
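As a side illustration of why the P-frame code stream suffices: a decoder can rebuild each P-frame position from the reference frame displaced by the motion vector, plus a residual. The 1-D, integer-motion toy below conveys only the idea; real codecs work block-wise with sub-pixel motion:

```python
def reconstruct_p_frame(reference, motion, residual):
    # out[p] = reference[p + motion[p]] + residual[p], with clamped indexing.
    n = len(reference)
    out = []
    for p in range(n):
        q = min(max(p + motion[p], 0), n - 1)  # clamp displaced position to the frame
        out.append(reference[q] + residual[p])
    return out

ref = [10, 20, 30, 40]   # decoded frame at time t+k-1
mv  = [1, 1, -1, 0]      # toy motion vectors m_{t+k}
res = [0, 2, -2, 1]      # toy residuals r_{t+k}
print(reconstruct_p_frame(ref, mv, res))  # [20, 32, 18, 41]
```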
As shown in Fig. 2, in this embodiment, the convolutional neural network is a head convolutional neural network HeadConv, and the feature extraction network further includes a first residual network Residual_1t connected in series with HeadConv. The output feature of the first residual network is fed into the long short-term memory network Nm-lstm of the P-frame data at time t+1, the feature output by that network is fed into the Nm-lstm of the P-frame data at time t+2, and so on. Extracting I-frame features with a convolutional neural network combined with a residual network yields better feature maps for the I-frame. Residual networks are easy to optimize and can improve accuracy by adding considerable depth; their internal residual blocks use skip connections, which alleviate the vanishing-gradient problem caused by increasing depth in deep neural networks.
In application, the first feature Residual_1t is extracted from the I-frame image data and fed into the motion-aided long short-term memory network to obtain the first features [c_{t+1}, ..., c_{t+n}] of the subsequent frames, as shown in the following formula:
(c_{t+k}, h_{t+k}) = Nm_lstm(c_{t+k-1}, h_{t+k-1}, m_{t+k}, r_{t+k}), k ∈ [1, n]     (1)
where [c_t, c_{t+1}, ..., c_{t+n}] denotes the set of features extracted within one GOP.
As shown in Fig. 2, in this embodiment, the first residual network Residual_1t is followed by a second residual network Residual_2t, a third residual network Residual_3t and a fourth residual network Residual_4t; increasing the depth of the residual networks further improves feature extraction accuracy. The first feature of the I-frame output by the first residual network Residual_1t is also fed through Residual_2t, Residual_3t and Residual_4t to obtain the second feature of the I-frame. For the I-frame, the feature extraction part uses ResNet-101 as the backbone network, comprising the convolutional neural network HeadConv and four residual networks (Residual_i, i ∈ {1, 2, 3, 4}). The output of the long short-term memory network Nm-lstm of each P-frame is likewise fed through Residual_2t, Residual_3t and Residual_4t to obtain the second feature of the P-frame. For P-frames, the feature extraction part comprises one motion-aided long short-term memory network and the same three residual networks as for the I-frame.
In this embodiment, the convolutional neural network HeadConv uses a convolution kernel of size 7x7 with stride 2 and 64 channels; the four residual networks Residual_1t to Residual_4t contain 3, 4, 23 and 3 "bottleneck-block" residual learning units respectively, with 256, 512, 1024 and 2048 output channels respectively.
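The backbone configuration above can be sanity-checked: the "101" in ResNet-101 counts the weighted layers, i.e. the 3 + 4 + 23 + 3 bottleneck blocks of three convolutions each, plus the head convolution and the fully connected classification layer of the original ResNet (which a detection backbone such as this one replaces with its own head):

```python
blocks = [3, 4, 23, 3]      # bottleneck blocks per residual stage, as stated above
convs_per_bottleneck = 3    # each bottleneck block: 1x1 -> 3x3 -> 1x1 convolutions
# + 1 for HeadConv, + 1 for the original ResNet's fully connected layer
weighted_layers = sum(blocks) * convs_per_bottleneck + 1 + 1
print(weighted_layers)  # 101

# Bottleneck expansion factor 4 over the stages' base widths 64/128/256/512
out_channels = [c * 4 for c in (64, 128, 256, 512)]
print(out_channels)  # [256, 512, 1024, 2048]
```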
Step S300, obtaining the salient object region of each frame with a saliency recognition network according to the first feature of each frame of data, includes the following steps:
inputting the first feature of each frame of data into the second residual network Residual_2t, the third residual network Residual_3t and the fourth residual network Residual_4t connected in series, to obtain the second feature of each frame of data;
obtaining the salient object region of each frame with the saliency recognition network, according to the second feature of each frame of data.
Further, as shown in Fig. 2, in this embodiment the fourth residual network Residual_4t is followed in series by an atrous spatial pyramid pooling network (ASPP, Atrous Spatial Pyramid Pooling), which further enlarges the receptive field of feature extraction and further improves the feature extraction effect.
Obtaining the salient object region of each frame with the saliency recognition network according to the second feature of each frame of data includes the following steps:
inputting the second feature of each frame of data into the atrous spatial pyramid pooling network ASPP to obtain the third feature of each frame of data;
obtaining the salient object region of each frame with the saliency recognition network, according to the third feature of each frame of data.
As shown in Fig. 2, in this embodiment, the atrous spatial pyramid pooling network includes five modules connected in parallel: a global average pooling layer GAP, a 1x1 convolutional layer, and three 3x3 atrous convolutional layers with dilation rates rates = {6, 12, 18}. The outputs of the five modules are concatenated (concat) to obtain the third feature of each frame of data, and a 1x1 convolutional layer then reduces the number of channels to the required value.
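The dilation rates {6, 12, 18} above can be read as effective kernel sizes: a k x k convolution with dilation d covers d*(k-1)+1 positions per side, which is how the parallel ASPP branches enlarge the receptive field without adding parameters. A quick check:

```python
def effective_kernel(k, dilation):
    # Effective spatial extent (per side) of a k x k convolution with the given dilation.
    return dilation * (k - 1) + 1

for rate in (6, 12, 18):
    print(rate, effective_kernel(3, rate))
# 6 13
# 12 25
# 18 37
```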
As shown in Fig. 2, in this embodiment, the saliency recognition network includes first to fifth deconvolution layers conv-1 to conv-5 and an activation function layer Sigmoid. The third feature of each frame of data output by the ASPP network is input into the first deconvolution layer conv-1, and the first feature of each frame of data output by the first residual network Residual_1t or by the long short-term memory network Nm-lstm is input into the second deconvolution layer conv-2. The outputs of conv-1 and conv-2 are concatenated and passed through the third, fourth and fifth deconvolution layers conv-3, conv-4 and conv-5 connected in series, producing a feature map with the same resolution as the input I-frame image. The output of the fifth deconvolution layer conv-5 passes through the activation function layer Sigmoid to output the probability map of each frame of data. Because the convolutional and residual networks used during feature extraction make the resolution of the feature map smaller than that of the input frame image, the five deconvolution layers restore the feature map to the resolution of the input frame image.
After the probability map of each frame of data is obtained, the salient region can be extracted from the probability map. Specifically, in this embodiment, obtaining the salient object region of each frame with the saliency recognition network according to the third feature of each frame of data includes the following steps:
inputting the third feature of each frame of data into the saliency recognition network to obtain the probability map corresponding to each frame of data, i.e. the probability map of each frame of data output by the activation function layer Sigmoid;
binarizing the probability map according to a probability threshold to obtain a binarized map;
extracting the salient region from the binarized map.
Fig. 3 is a structural diagram of the long short-term memory network in this embodiment. The long short-term memory network is configured to obtain the first feature of the current frame from the motion information and the first feature of the adjacent frame, with the following formulas:
c_{t+k-1→t+k} = W(c_{t+k-1}, m_{t+k})
h_{t+k-1→t+k} = W(h_{t+k-1}, m_{t+k})     (2)
where c_{t+k-1} and h_{t+k-1} are the outputs of the motion-aided long short-term memory network at time t+k-1, c_t and h_t are Residual_1t, k ∈ [1, n], and n is the number of P-frames in one GOP. The warping operation W performs bilinear interpolation at each position of the feature map, mapping position p+Δp of frame t+k-1 to position p of frame t+k, with the following formula:
Δp = m_{t+k}(p)
c_{t+k-1→t+k}(p) = Σ_q G(q, p+Δp) c_{t+k-1}(q)     (3)
where Δp is obtained from m_{t+k}, q denotes a spatial position in the feature map c_{t+k-1}, and G(·) is the bilinear interpolation kernel, with the following formula:
G(q, p+Δp) = max(0, 1 - ||q - (p+Δp)||)     (4)
The hidden-layer feature h_{t+k-1→t+k} is processed in the same way as c_{t+k-1→t+k}; h_{t+k-1→t+k} and c_{t+k-1→t+k} serve as the inputs of the long short-term memory network from the previous frame to the current frame.
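The warping operation W of equations (2)-(4) can be sketched as follows. One assumption here: the 2-D kernel G is taken to factorize into per-axis terms max(0, 1 - |·|), the usual bilinear form:

```python
def warp(feat, motion):
    # c_{t+k-1->t+k}(p) = sum_q G(q, p + dp) * c_{t+k-1}(q), dp = motion at p.
    h, w = len(feat), len(feat[0])
    out = [[0.0] * w for _ in range(h)]
    for py in range(h):
        for px in range(w):
            dy, dx = motion[py][px]
            sy, sx = py + dy, px + dx           # sampling point p + dp
            for qy in range(h):
                for qx in range(w):
                    # Assumed separable bilinear kernel G(q, p + dp)
                    g = max(0.0, 1 - abs(qy - sy)) * max(0.0, 1 - abs(qx - sx))
                    out[py][px] += g * feat[qy][qx]
    return out

feat = [[0.0, 1.0],
        [2.0, 3.0]]
motion = [[(0.0, 0.5)] * 2 for _ in range(2)]   # shift half a pixel along x
print(warp(feat, motion))  # [[0.5, 0.5], [2.5, 1.5]]
```

With zero motion the kernel collapses to the identity, so the warped map equals the input.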
The long short-term memory network is defined by the following formulas:
g_{t+k} = σ(W_g(h_{t+k-1→t+k}, r_{t+k}))
i_{t+k} = σ(W_i(h_{t+k-1→t+k}, r_{t+k}))
c̃_{t+k} = tanh(W_c(h_{t+k-1→t+k}, r_{t+k}))
c_{t+k} = (g_{t+k} ⊙ c_{t+k-1→t+k}) ⊕ (i_{t+k} ⊙ c̃_{t+k})     (5)
where ⊕ and ⊙ denote pixel-wise addition and multiplication respectively, W_g, W_i and W_c are learned weights, and σ(·) denotes the sigmoid function, which maps its input into (0, 1).
o t+k=σ(W o(h t+k-1→t+k,r t+k)) o t+k =σ(W o (h t+k-1→t+k ,r t+k ))
Figure PCTCN2021082752-appb-000006
Figure PCTCN2021082752-appb-000006
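As a minimal illustration of the gate structure above, the following sketch performs one update at a single spatial position, with scalar weights standing in for the learned convolutions W_g, W_i, W_c, W_o. The function name, the weight dictionary, and the additive combination of h and r are simplifying assumptions for illustration only:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(h_warp, c_warp, r, w):
    """One motion-assisted LSTM update at a single spatial position.

    h_warp, c_warp: warped states h_{t+k-1->t+k} and c_{t+k-1->t+k}
    r:              residual input r_{t+k}
    w:              dict of scalar weights standing in for the learned
                    convolutions (hypothetical simplification)
    Returns (h_{t+k}, c_{t+k}).
    """
    z = h_warp + r                               # stand-in for combining (h, r)
    g = sigmoid(w["g"] * z)                      # forget gate g_{t+k}
    i = sigmoid(w["i"] * z)                      # input gate i_{t+k}
    c = g * c_warp + i * math.tanh(w["c"] * z)   # pixel-wise cell update
    o = sigmoid(w["o"] * z)                      # output gate o_{t+k}
    h = o * math.tanh(c)                         # hidden state h_{t+k}
    return h, c
```

Because the gates depend only on the warped hidden state and the residual input, the update is far cheaper than a full convolutional feature extraction, which is what makes P-frame processing fast.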
Therefore, by means of the long short-term memory network together with the motion information and residual information already present in the video code stream, the present invention can quickly extract the features of P frames, effectively increasing the feature extraction speed for compressed video.
As shown in Fig. 4, an embodiment of the present invention further provides a salient object detection system for compressed video, applying the salient object detection method for compressed video described above. The system includes:
a first feature extraction module M100, configured to input the I-frame data into a feature extraction network and extract the first feature of the I-frame data, the feature extraction network including a convolutional neural network;
a second feature extraction module M200, configured to, for each item of P-frame data, input the first feature of the corresponding frame data at the previous moment together with the P-frame data into a long short-term memory network and extract the first feature of the P-frame data;
a saliency detection module M300, configured to obtain the salient target region of each frame from the first feature of each frame's data using a saliency recognition network.
With the salient object detection system for compressed video of the present invention, the first feature extraction module M100 extracts features from the I-frame data with a convolutional neural network, while the second feature extraction module M200 introduces a long short-term memory network for the P-frame data, extracting features from the previous frame's features, the P-frame data and the LSTM network; the saliency detection module M300 then performs salient object detection on the extracted features. Complete features therefore need to be extracted only for I frames, whereas P-frame features can be obtained quickly from the LSTM network and the P-frame data in the video code stream, which effectively increases the feature extraction speed for compressed video and thus greatly increases the speed of salient object detection in compressed video.
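The division of labour between the three modules can be sketched as a simple dispatch loop over one GOP. The callables and tuple format below are hypothetical stand-ins for the patent's concrete networks, intended only to show the control flow:

```python
def detect_salient(frames, cnn, lstm, saliency_net):
    """Route each frame of a GOP to the appropriate feature extractor.

    frames:       list of ("I", image) or ("P", pframe_data) tuples,
                  where pframe_data carries motion/residual information
    cnn:          callable, full feature extraction for I frames (M100)
    lstm:         callable(prev_feat, pframe_data), cheap P-frame
                  feature update (M200)
    saliency_net: callable(feat) -> salient target region (M300)
    """
    results, feat = [], None
    for kind, data in frames:
        if kind == "I":
            feat = cnn(data)          # complete extraction, I frame only
        else:
            feat = lstm(feat, data)   # incremental update from code-stream data
        results.append(saliency_net(feat))
    return results
```

The loop makes the cost structure explicit: the expensive `cnn` call runs once per GOP, and every P frame reuses the running feature state.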
An embodiment of the present invention further provides a salient object detection device for compressed video, including a processor and a memory storing executable instructions of the processor, wherein the processor is configured to perform, by executing the executable instructions, the steps of the salient object detection method for compressed video described above.
Those skilled in the art will appreciate that various aspects of the present invention may be implemented as a system, a method or a program product. Accordingly, aspects of the present invention may take the form of an entirely hardware implementation, an entirely software implementation (including firmware, microcode, etc.), or an implementation combining hardware and software aspects, which may be collectively referred to herein as a "circuit", "module" or "system".
An electronic device 600 according to this embodiment of the present invention is described below with reference to Fig. 5. The electronic device 600 shown in Fig. 5 is merely an example and should not impose any limitation on the functions or scope of use of the embodiments of the present invention.
As shown in Fig. 5, the electronic device 600 takes the form of a general-purpose computing device. The components of the electronic device 600 may include, but are not limited to, at least one processing unit 610, at least one storage unit 620, a bus 630 connecting different system components (including the storage unit 620 and the processing unit 610), and a display unit 640.
The storage unit stores program code that can be executed by the processing unit 610, so that the processing unit 610 performs the steps according to various exemplary embodiments of the present invention described above in the method section of this specification. For example, the processing unit 610 may perform the steps shown in Fig. 1.
The storage unit 620 may include a readable medium in the form of a volatile storage unit, such as a random access memory (RAM) unit 6201 and/or a cache storage unit 6202, and may further include a read-only memory (ROM) unit 6203.
The storage unit 620 may also include a program/utility 6204 having a set of (at least one) program modules 6205, such program modules 6205 including, but not limited to, an operating system, one or more application programs, other program modules, and program data; each or some combination of these examples may include an implementation of a network environment.
The bus 630 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, a graphics acceleration port, a processing unit, or a local bus using any of a variety of bus structures.
The electronic device 600 may also communicate with one or more external devices 700 (e.g., a keyboard, a pointing device, a Bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 600, and/or with any device (e.g., a router, a modem, etc.) that enables the electronic device 600 to communicate with one or more other computing devices. Such communication may take place through an input/output (I/O) interface 650. In addition, the electronic device 600 may communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN) and/or a public network such as the Internet) through a network adapter 660. The network adapter 660 may communicate with the other modules of the electronic device 600 through the bus 630. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the electronic device 600, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
With the salient object detection device for compressed video provided by the present invention, the processor performs the salient object detection method for compressed video when executing the executable instructions, thereby obtaining the beneficial effects of the above salient object detection method for compressed video.
An embodiment of the present invention further provides a computer-readable storage medium for storing a program which, when executed, implements the steps of the salient object detection method for compressed video. In some possible implementations, aspects of the present invention may also be implemented in the form of a program product including program code which, when the program product runs on a terminal device, causes the terminal device to perform the steps according to various exemplary embodiments of the present invention described above in the method section of this specification.
Referring to Fig. 6, a program product 800 for implementing the above method according to an embodiment of the present invention is described; it may take the form of a portable compact disc read-only memory (CD-ROM) including program code, and may run on a terminal device such as a personal computer. However, the program product of the present invention is not limited thereto; in this document, a readable storage medium may be any tangible medium that contains or stores a program which can be used by or in combination with an instruction execution system, apparatus or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
A readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying readable program code. Such a propagated data signal may take a variety of forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A readable signal medium may also be any readable medium other than a readable storage medium that can send, propagate or transmit a program for use by or in combination with an instruction execution system, apparatus or device. Program code contained on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical cable, RF, etc., or any suitable combination of the above.
Program code for carrying out the operations of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on a remote computing device or cluster. In cases involving a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, through the Internet using an Internet service provider).
With the computer-readable storage medium provided by the present invention, the stored program, when executed, implements the steps of the salient object detection method for compressed video, thereby obtaining the beneficial effects of the above salient object detection method for compressed video.
The above is a further detailed description of the present invention in combination with specific preferred embodiments, and the specific implementation of the present invention shall not be deemed to be limited to these descriptions. For those of ordinary skill in the art to which the present invention belongs, a number of simple deductions or substitutions may be made without departing from the concept of the present invention, all of which shall be deemed to fall within the protection scope of the present invention.

Claims (11)

  1. A salient object detection method for compressed video, wherein the compressed video comprises multiple frames of data, the multiple frames of data comprising I-frame data and at least one item of P-frame data, the method comprising the following steps:
    inputting the I-frame data into a feature extraction network to extract a first feature of the I-frame data, the feature extraction network comprising a convolutional neural network;
    for each item of P-frame data, inputting the first feature of the corresponding frame data at the previous moment and the P-frame data into a long short-term memory network to extract a first feature of the P-frame data;
    obtaining a salient target region of each frame from the first feature of each frame's data using a saliency recognition network.
  2. The salient object detection method for compressed video according to claim 1, wherein the I-frame data comprises I-frame image data obtained by decoding an I-frame code stream of the compressed video, and the P-frame data comprises motion information and residual information in a P-frame code stream of the compressed video.
  3. The salient object detection method for compressed video according to claim 1, wherein the feature extraction network further comprises a first residual network connected in series with the convolutional neural network.
  4. The salient object detection method for compressed video according to claim 1, wherein obtaining the salient target region of each frame from the first feature of each frame's data using a saliency recognition network comprises the following steps:
    inputting the first feature of each frame's data into a second residual network, a third residual network and a fourth residual network connected in series, to obtain a second feature of each frame's data;
    obtaining the salient target region of each frame from the second feature of each frame's data using the saliency recognition network.
  5. The salient object detection method for compressed video according to claim 4, wherein obtaining the salient target region of each frame from the second feature of each frame's data using the saliency recognition network comprises the following steps:
    inputting the second feature of each frame's data into an atrous spatial pyramid pooling network to obtain a third feature of each frame's data;
    obtaining the salient target region of each frame from the third feature of each frame's data using the saliency recognition network.
  6. The salient object detection method for compressed video according to claim 5, wherein the atrous spatial pyramid pooling network comprises five modules connected in parallel, the five modules comprising a global average pooling layer, one 1x1 convolutional layer and three 3x3 atrous convolutional layers, and the outputs of the five modules are merged to obtain the third feature of each frame's data.
  7. The salient object detection method for compressed video according to claim 5, wherein obtaining the salient target region of each frame from the third feature of each frame's data using the saliency recognition network comprises the following steps:
    inputting the third feature of each frame's data into the saliency recognition network to obtain a probability map corresponding to each frame's data;
    binarizing the probability map according to a probability threshold to obtain a binarized map;
    extracting the salient region from the binarized map.
  8. The salient object detection method for compressed video according to claim 7, wherein the saliency recognition network comprises first to fifth deconvolution layers and an activation function layer; the third feature of each frame's data is input into the first deconvolution layer, the first feature of each frame's data is input into the second deconvolution layer, the outputs of the first deconvolution layer and the second deconvolution layer are merged and input into the third, fourth and fifth deconvolution layers connected in series, and the output of the fifth deconvolution layer passes through the activation function layer to output the probability map of each frame's data.
  9. A salient object detection system for compressed video, applying the salient object detection method for compressed video according to any one of claims 1 to 8, the system comprising:
    a first feature extraction module, configured to input the I-frame data into a feature extraction network and extract the first feature of the I-frame data, the feature extraction network comprising a convolutional neural network;
    a second feature extraction module, configured to, for each item of P-frame data, input the first feature of the corresponding frame data at the previous moment and the P-frame data into a long short-term memory network and extract the first feature of the P-frame data;
    a saliency detection module, configured to obtain the salient target region of each frame from the first feature of each frame's data using a saliency recognition network.
  10. A salient object detection device for compressed video, comprising:
    a processor;
    a memory storing executable instructions of the processor;
    wherein the processor is configured to perform, by executing the executable instructions, the steps of the salient object detection method for compressed video according to any one of claims 1 to 8.
  11. A computer-readable storage medium for storing a program, wherein, when the program is executed, the steps of the salient object detection method for compressed video according to any one of claims 1 to 8 are implemented.
PCT/CN2021/082752 2020-09-24 2021-03-24 Method, system, and device for detecting salient target in compressed video, and storage medium WO2022062344A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011016604.7 2020-09-24
CN202011016604.7A CN111931732B (en) 2020-09-24 2020-09-24 Method, system, device and storage medium for detecting salient object of compressed video

Publications (1)

Publication Number Publication Date
WO2022062344A1 true WO2022062344A1 (en) 2022-03-31

Family

ID=73334166

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/082752 WO2022062344A1 (en) 2020-09-24 2021-03-24 Method, system, and device for detecting salient target in compressed video, and storage medium

Country Status (2)

Country Link
CN (1) CN111931732B (en)
WO (1) WO2022062344A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115529457A (en) * 2022-09-05 2022-12-27 清华大学 Video compression method and device based on deep learning
CN115953727A (en) * 2023-03-15 2023-04-11 浙江天行健水务有限公司 Floc settling rate detection method and system, electronic equipment and medium

Families Citing this family (2)

Publication number Priority date Publication date Assignee Title
CN111931732B (en) * 2020-09-24 2022-07-15 苏州科达科技股份有限公司 Method, system, device and storage medium for detecting salient object of compressed video
CN116052047B (en) * 2023-01-29 2023-10-03 荣耀终端有限公司 Moving object detection method and related equipment thereof

Citations (5)

Publication number Priority date Publication date Assignee Title
CN108495129A (en) * 2018-03-22 2018-09-04 北京航空航天大学 The complexity optimized method and device of block partition encoding based on deep learning method
CN110163196A (en) * 2018-04-28 2019-08-23 中山大学 Notable feature detection method and device
CN111026915A (en) * 2019-11-25 2020-04-17 Oppo广东移动通信有限公司 Video classification method, video classification device, storage medium and electronic equipment
US20200143457A1 (en) * 2017-11-20 2020-05-07 A9.Com, Inc. Compressed content object and action detection
CN111931732A (en) * 2020-09-24 2020-11-13 苏州科达科技股份有限公司 Method, system, device and storage medium for detecting salient object of compressed video

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
JP3769788B2 (en) * 1995-09-29 2006-04-26 ソニー株式会社 Image signal transmission apparatus and method
CN108241854B (en) * 2018-01-02 2021-11-09 天津大学 Depth video saliency detection method based on motion and memory information
CN109376611B (en) * 2018-09-27 2022-05-20 方玉明 Video significance detection method based on 3D convolutional neural network
CN111461043B (en) * 2020-04-07 2023-04-18 河北工业大学 Video significance detection method based on deep network

Non-Patent Citations (1)

Title
YU SHENG; CHENG YUN; XIE LI; LUO ZHIMING; HUANG MIN; LI SHAOZI: "A novel recurrent hybrid network for feature fusion in action recognition", JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, vol. 49, 1 November 2017 (2017-11-01), US , pages 192 - 203, XP085260382, ISSN: 1047-3203, DOI: 10.1016/j.jvcir.2017.09.007 *

Cited By (4)

Publication number Priority date Publication date Assignee Title
CN115529457A (en) * 2022-09-05 2022-12-27 清华大学 Video compression method and device based on deep learning
CN115529457B (en) * 2022-09-05 2024-05-14 清华大学 Video compression method and device based on deep learning
CN115953727A (en) * 2023-03-15 2023-04-11 浙江天行健水务有限公司 Floc settling rate detection method and system, electronic equipment and medium
CN115953727B (en) * 2023-03-15 2023-06-09 浙江天行健水务有限公司 Method, system, electronic equipment and medium for detecting floc sedimentation rate

Also Published As

Publication number Publication date
CN111931732B (en) 2022-07-15
CN111931732A (en) 2020-11-13


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21870733

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21870733

Country of ref document: EP

Kind code of ref document: A1