WO2022062344A1 - Method, system and device for salient target detection in compressed video, and storage medium - Google Patents

Method, system and device for salient target detection in compressed video, and storage medium

Info

Publication number
WO2022062344A1
WO2022062344A1 (PCT/CN2021/082752)
Authority
WO
WIPO (PCT)
Prior art keywords
frame
data
feature
network
compressed video
Prior art date
Application number
PCT/CN2021/082752
Other languages
English (en)
Chinese (zh)
Inventor
邹文艺
章勇
曹李军
Original Assignee
苏州科达科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏州科达科技股份有限公司
Publication of WO2022062344A1

Classifications

    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G06V2201/07 Target detection
    • G06N3/045 Combinations of networks
    • G06N3/048 Activation functions
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N3/08 Learning methods

Definitions

  • the present invention relates to the technical field of video processing, and in particular, to a method, system, device and storage medium for salient target detection of compressed video.
  • Video saliency detection is mainly divided into two categories: visual attention detection, which estimates the trajectory of the human gaze point while observing an image and has been widely studied in neuroscience; and salient target detection, which segments the most important or visually prominent objects from the background noise.
  • In the prior art, there is no salient target detection method for compressed video that achieves both high detection speed and a good detection effect.
  • The purpose of the present invention is to provide a method, system, device and storage medium for salient target detection in compressed video, which improve the detection speed of salient targets in compressed video while ensuring the detection effect.
  • An embodiment of the present invention provides a method for detecting a salient object in a compressed video, where the compressed video includes multiple frames of data, and the multiple frames of data include I-frame data and at least one P-frame data, and the method includes the following steps:
  • the feature extraction network includes a convolutional neural network
  • a saliency recognition network is used to obtain the saliency target area of each frame.
  • the I-frame data includes I-frame image data obtained by decoding the I-frame code stream of the compressed video;
  • the P-frame data includes motion information and residual information from the P-frame code stream of the compressed video.
  • the feature extraction network further includes a first residual network connected in series with the convolutional neural network.
  • a saliency recognition network is used to obtain the saliency target area of each frame, including the following steps:
  • the atrous spatial pyramid pooling network includes five modules connected in parallel: a global average pooling layer, a 1×1 convolutional layer and three 3×3 atrous convolutional layers; the outputs of the five modules are combined to obtain the third feature of each frame of data.
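The five-branch structure above can be illustrated at shape level with a small sketch. This is an illustrative stand-in, not the patented implementation: `aspp_combine` and `branch_channels` are invented names, and toy scalings stand in for the learned 1×1 and atrous convolutions.

```python
import numpy as np

def aspp_combine(feature, branch_channels=8):
    """Shape-level sketch of the five parallel ASPP branches: a global
    average pooling layer, a 1x1 convolution, and three 3x3 atrous
    convolutions, each modeled here by a toy per-branch output, with
    the five outputs concatenated along the channel axis."""
    c, h, w = feature.shape
    # Branch 1: global average pooling, broadcast back to (H, W).
    gap = feature.mean(axis=(1, 2), keepdims=True) * np.ones((c, h, w))
    gap = gap[:branch_channels]
    # Branches 2-5: stand-ins for the 1x1 conv and the three atrous
    # 3x3 convs; each keeps spatial size and emits branch_channels maps.
    convs = [feature[:branch_channels] * s for s in (1.0, 0.5, 0.25, 0.125)]
    # The outputs of the five modules are combined (concatenated) to
    # form the "third feature" of the frame.
    return np.concatenate([gap] + convs, axis=0)

feat = np.random.rand(16, 8, 8)
third = aspp_combine(feat)
assert third.shape == (5 * 8, 8, 8)
```

Because all branches preserve the spatial size, the concatenation is well defined regardless of the input resolution; only the channel count grows.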
  • a saliency recognition network is used to obtain the saliency target area of each frame, including the following steps:
  • a saliency region is extracted from the binarized map.
  • the saliency recognition network includes first to fifth deconvolution layers and an activation function layer; the third feature of each frame of data is input to the first deconvolution layer, and the first feature of each frame of data is input to the second deconvolution layer; the outputs of the first and second deconvolution layers are combined and then passed sequentially through the third, fourth and fifth deconvolution layers connected in series; the output of the fifth deconvolution layer passes through the activation function layer to produce the probability map of each frame of data.
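The decoder wiring just described can be sketched as follows, with nearest-neighbour upsampling standing in for each learned deconvolution (transposed convolution) layer. All names here are invented for illustration; only the wiring mirrors the description.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def upsample2x(x):
    # Stand-in for one deconvolution (transposed convolution) layer:
    # nearest-neighbour upsampling that doubles the spatial resolution.
    return x.repeat(2, axis=0).repeat(2, axis=1)

def decode_probability_map(third_feature, first_feature):
    """Sketch of the decoder wiring: the third feature goes through one
    upsampling step (deconv-1), the first feature through another
    (deconv-2), the two are summed as a stand-in for the 'combine'
    step, three further upsampling steps (deconv-3..5) restore the
    resolution, and a sigmoid yields the probability map."""
    a = upsample2x(third_feature)   # deconv-1
    b = upsample2x(first_feature)   # deconv-2
    merged = a + b                  # combination of the two branches
    for _ in range(3):              # deconv-3, deconv-4, deconv-5
        merged = upsample2x(merged)
    return sigmoid(merged)          # probability map, values in (0, 1)

p = decode_probability_map(np.zeros((4, 4)), np.zeros((4, 4)))
assert p.shape == (64, 64) and np.allclose(p, 0.5)
```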
  • a long short-term memory (LSTM) network is introduced, so that features only need to be extracted for the I frame, while the features of a P frame can be extracted from the features of the previous frame, the P frame data and the LSTM network.
  • salient object detection can be performed on the extracted features, which greatly improves the detection speed of salient objects in compressed video.
  • the embodiment of the present invention also provides a salient target detection system for compressed video, which is applied to the salient target detection method for compressed video, and the system includes:
  • a first feature extraction module configured to input the I frame data into a feature extraction network to extract the first feature of the I frame data, and the feature extraction network includes a convolutional neural network;
  • the second feature extraction module is configured to, for each P frame data, input the first feature of the corresponding frame data at the previous moment and the P frame data into the long short-term memory network, and extract the first feature of the P frame data.
  • the saliency detection module is used for obtaining the saliency target area of each frame by adopting the saliency identification network according to the first feature of the data of each frame.
  • a long short-term memory (LSTM) network is introduced, so that only the features of the I frame need to be extracted, while the features of a P frame can be extracted from the features of the previous frame, the P frame data and the LSTM network.
  • salient object detection can be performed on the extracted features, which greatly improves the detection speed of salient objects in compressed video.
  • the embodiment of the present invention also provides a salient object detection device for compressed video, including:
  • the processor is configured to execute the steps of the salient object detection method for compressed video by executing the executable instructions.
  • when the processor executes the executable instructions, it performs the salient target detection method for compressed video, and thus obtains the beneficial effects of the salient target detection method provided by the present invention.
  • Embodiments of the present invention further provide a computer-readable storage medium for storing a program, and when the program is executed, the steps of the method for detecting a salient object in a compressed video are implemented.
  • when the stored program is executed, it implements the steps of the method for detecting salient targets in compressed video, and thus obtains the beneficial effects of the above-mentioned method.
  • FIG. 1 is a flowchart of a method for detecting a salient object in a compressed video according to an embodiment of the present invention
  • Fig. 2 is a structural diagram of the salient target detection network for compressed video according to a specific example of the present invention;
  • FIG. 3 is a structural diagram of a long-short-term memory network according to an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of a salient object detection system for compressed video according to an embodiment of the present invention.
  • FIG. 5 is a schematic structural diagram of a salient object detection device for compressed video according to an embodiment of the present invention.
  • FIG. 6 is a schematic structural diagram of a computer storage medium according to an embodiment of the present invention.
  • Example embodiments will now be described more fully with reference to the accompanying drawings.
  • Example embodiments can be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.
  • the same reference numerals in the drawings denote the same or similar structures, and thus their repeated descriptions will be omitted.
  • the present invention provides a salient object detection method for compressed video, where the compressed video includes multiple frames of data, and the multiple frames of data include I-frame data and at least one P-frame data.
  • Video is generally regarded as a sequence of independent images and can be stored and transmitted in compressed form. A codec divides video into I frames and P/B frames: an I frame is a complete image frame, while P/B frames only retain the image changes relative to a reference frame.
  • The P frame data at time t+k records only the motion information m_{t+k} of the object and the residual information r_{t+k}; consecutive frames are therefore highly correlated, and the changes between frames over time are also recorded in the video stream.
  • the salient target detection method for the compressed video includes the following steps:
  • S100: Input the I-frame data into a feature extraction network to extract the first feature of the I-frame data, where the feature extraction network includes a convolutional neural network that extracts complete features from the complete image frame of the I frame; the first feature takes the form of a feature map;
  • S200: For each P frame data, input the first feature of the corresponding frame data at the previous moment and the P frame data into a long short-term memory network, and extract the first feature of the P frame data, where the P frame data includes the motion vector and residual data of the P frame relative to the previous frame;
  • a saliency identification network is used to obtain the saliency target area of each frame.
  • When step S100 performs feature extraction on I frame data, extraction goes through the convolutional neural network; when step S200 performs feature extraction on P frame data, the long short-term memory network is introduced so that the features of the previous frame can be reused; step S300 then performs salient target detection on the extracted features. Therefore, complete features only need to be extracted for the I frame, while the features of each P frame can be extracted quickly from the LSTM network and the P-frame data already present in the video stream.
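The S100/S200/S300 flow above can be sketched as a dispatch loop. The extractor and LSTM bodies below are toy stand-ins invented for this example; only the control flow mirrors the description (full extraction for I frames, a cheap recurrent update for P frames, saliency mapping for every frame).

```python
import numpy as np

def extract_full_features(image):
    """Stand-in for the convolutional feature extractor applied to a
    decoded I-frame (HeadConv plus residual networks in the patent)."""
    return image * 0.5  # toy transform

def lstm_step(prev_feature, motion, residual):
    """Stand-in for the motion-assisted LSTM step applied to a P-frame:
    reuses the previous frame's feature plus motion/residual data."""
    return 0.9 * prev_feature + 0.1 * (motion + residual)

def detect_salient(frames):
    """Sketch of the S100/S200/S300 flow: only I-frames pay for full
    feature extraction; each P-frame updates the previous feature via
    the LSTM using motion and residual data from the code stream."""
    maps = []
    feature = None
    for frame in frames:
        if frame["type"] == "I":                       # S100
            feature = extract_full_features(frame["image"])
        else:                                          # S200
            feature = lstm_step(feature, frame["motion"], frame["residual"])
        maps.append(1.0 / (1.0 + np.exp(-feature)))    # S300: saliency map
    return maps

gop = [{"type": "I", "image": np.ones((4, 4))},
       {"type": "P", "motion": np.zeros((4, 4)), "residual": np.zeros((4, 4))}]
out = detect_salient(gop)
assert len(out) == 2 and out[0].shape == (4, 4)
```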
  • the I-frame data includes I-frame image data obtained by decoding the I-frame code stream of the compressed video
  • the P-frame data includes motion information and residual information from the P-frame code stream of the compressed video; the features of P frames can therefore be extracted quickly from the motion and residual information, which effectively improves the feature extraction speed for compressed video and greatly improves the detection speed of salient targets.
  • the motion information may include motion vectors
  • the residual information may include residual coefficients.
  • the I-frame data retains complete information
  • the I-frame data is decoded to obtain complete image information
  • feature extraction and processing are performed in step S100.
  • the saliency target detection is carried out through step S300.
  • a motion-assisted long short-term memory network Nm_lstm is used to extract features from the consecutive P frame data, and salient target detection is then performed on the extracted features.
  • the long short-term memory network (LSTM) takes as input the feature Residual_1 extracted from the image data of the preceding I frame, or the features c_{t+k-1} and h_{t+k-1} extracted from the P frame at the previous moment, together with the motion information and residual information in the video stream, to extract the features of the P frame; salient target detection is then performed on the extracted features.
  • the convolutional neural network is a head convolutional neural network HeadConv
  • the feature extraction network further includes a first residual network Residual_1t connected in series with the convolutional neural network HeadConv
  • the output features of the first residual network are input into the long short-term memory network Nm-lstm of the P-frame data at time t+1;
  • the features output by the Nm-lstm network at time t+1 are input into the Nm-lstm network of the P-frame data at time t+2, and so on.
  • Residual networks are characterized by being easy to optimize and capable of increasing accuracy by adding considerable depth.
  • the internal residual block uses skip connections to alleviate the gradient disappearance problem caused by increasing depth in deep neural networks.
  • the first feature Residual_1t is extracted from the I frame image data and input to the motion-assisted long short-term memory network to obtain the first features [c_{t+1}, ..., c_{t+n}] of the subsequent frames, as follows:
  • the feature extraction network further includes, after the first residual network Residual_1t, a second residual network Residual_2t, a third residual network Residual_3t and a fourth residual network Residual_4t; increasing the depth of the residual network further improves the accuracy of feature extraction.
  • the first feature of the I frame output by the first residual network Residual_1t is also input to the second residual network Residual_2t, the third residual network Residual_3t and the fourth residual network Residual_4t to obtain the second feature of the I frame.
  • the feature extraction part adopts ResNet-101 as the backbone network, including the convolutional neural network HeadConv and four residual networks (Residual_i, i ∈ {1, 2, 3, 4}).
  • the output of the long-short-term memory network Nm-lstm of each P frame is also input to the second residual network Residual_2t, the third residual network Residual_3t and the fourth residual network Residual_4t to obtain the second feature of the P frame.
  • the feature extraction part includes a motion-assisted long short-term memory network and the same three residual networks as I-frames.
  • the convolutional neural network HeadConv adopts a convolution kernel of size 7×7, with a stride of 2 and 64 output channels.
  • the four residual networks Residual_1t to Residual_4t contain 3, 4, 23 and 3 residual learning blocks based on the "bottleneck block", with 256, 512, 1024 and 2048 output channels, respectively.
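As a sanity check, the stated block counts and channel widths match the standard ResNet-101 arithmetic: 33 bottleneck blocks of 3 convolutions each, plus HeadConv and the original classifier layer, give 101 layers, and each stage's output width is 4× its base width (the 4× expansion of the bottleneck block). The base widths below are standard ResNet values, not taken from the patent text.

```python
# Block counts and output channels as stated in the description.
stages = {
    "Residual_1": {"blocks": 3,  "out_channels": 256},
    "Residual_2": {"blocks": 4,  "out_channels": 512},
    "Residual_3": {"blocks": 23, "out_channels": 1024},
    "Residual_4": {"blocks": 3,  "out_channels": 2048},
}

total_blocks = sum(s["blocks"] for s in stages.values())
# Each bottleneck block holds 3 conv layers; HeadConv (7x7, stride 2,
# 64 channels) adds 1, and the original classifier adds 1 => 101.
conv_layers = total_blocks * 3 + 1 + 1
assert total_blocks == 33 and conv_layers == 101

# Bottleneck expansion: each stage's output is 4x its base width
# (base widths 64/128/256/512 are the standard ResNet values).
bases = [64, 128, 256, 512]
assert [4 * b for b in bases] == [s["out_channels"] for s in stages.values()]
```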
  • a saliency recognition network is used to obtain the saliency target area of each frame, including the following steps:
  • the fourth residual network Residual_4t is further connected in series with an atrous spatial pyramid pooling network.
  • the atrous spatial pyramid pooling (ASPP) network further enlarges the receptive field of feature extraction and further improves the feature extraction effect.
  • a saliency recognition network is used to obtain the saliency target area of each frame, including the following steps:
  • the outputs of the five modules are concatenated to obtain the third feature of each frame of data, and a 1×1 convolutional layer then reduces the number of channels to the desired value.
  • the saliency recognition network includes first to fifth deconvolution layers conv-1 to conv-5 and an activation function layer Sigmoid; the third feature of each frame of data output by the ASPP network is input to the first deconvolution layer conv-1, and the first feature of each frame of data output by the first residual network Residual_1t or the LSTM network Nm-lstm is input to the second deconvolution layer conv-2; the outputs of conv-1 and conv-2 are concatenated and passed sequentially through the third, fourth and fifth deconvolution layers conv-3, conv-4 and conv-5, yielding a feature map with the same resolution as the input I-frame image; the output of conv-5 passes through the Sigmoid activation function layer to produce the probability map of each frame of data.
  • because the convolutional network and residual networks reduce the resolution of the feature map below that of the input frame image, the five deconvolution layers restore the feature map to the resolution of the input image.
  • a saliency region can be extracted according to the probability map.
  • a saliency recognition network is used to obtain the saliency target area of each frame, including the following steps:
  • a saliency region is extracted from the binarized map.
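The binarization-and-region step above can be sketched as follows. The threshold value and the bounding-box form of the output are illustrative assumptions; the patent text does not fix them.

```python
import numpy as np

def extract_salient_region(prob_map, threshold=0.5):
    """Binarize the probability map at an assumed threshold, then
    report the bounding box of the foreground pixels as the salient
    region (an illustrative choice of region representation)."""
    binary = (prob_map >= threshold).astype(np.uint8)  # binarized map
    ys, xs = np.nonzero(binary)
    if ys.size == 0:
        return binary, None                            # no salient target
    # Salient region as (top, left, bottom, right) of foreground pixels.
    return binary, (ys.min(), xs.min(), ys.max(), xs.max())

prob = np.zeros((6, 6))
prob[2:4, 1:5] = 0.9
_, box = extract_salient_region(prob)
assert box == (2, 1, 3, 4)
```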
  • FIG. 3 it is a structural diagram of the long and short-term memory network in this embodiment.
  • the long-short-term memory network is configured to obtain the first feature of the current frame by using the motion information and the first feature of the adjacent frame, and the specific formula is as follows:
  • c_{t+k-1} and h_{t+k-1} are the outputs of the motion-assisted long short-term memory network at time t+k-1;
  • c_t and h_t are initialized from Residual_1t, with k ∈ [1, n];
  • n is the number of P frames within a GOP.
  • the warping operation W performs bilinear interpolation at each position of the feature map, mapping the position p+Δp of frame t+k-1 to the position p of frame t+k.
  • the specific formula is as follows:
  • Δp is obtained from m_{t+k};
  • q represents the spatial position information of the feature map c_{t+k-1};
  • G(·) represents the bilinear interpolation kernel.
  • the hidden-layer feature h_{t+k-1→t+k} is computed in the same way as c_{t+k-1→t+k}; h_{t+k-1→t+k} and c_{t+k-1→t+k} serve as the inputs of the LSTM network from the previous frame to the current frame.
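A minimal sketch of the warping operation W, assuming the standard bilinear kernel G that factors into two 1-D max(0, 1-|·|) terms; the explicit loops trade speed for clarity and are not how a real implementation would vectorize it.

```python
import numpy as np

def warp_feature(feat, flow):
    """For each position p of the current frame, sample the previous
    frame's feature map at p + delta_p by bilinear interpolation.
    flow[0] holds the row displacement, flow[1] the column one."""
    h, w = feat.shape
    out = np.zeros_like(feat)
    for y in range(h):
        for x in range(w):
            sy, sx = y + flow[0, y, x], x + flow[1, y, x]
            y0, x0 = int(np.floor(sy)), int(np.floor(sx))
            for qy in (y0, y0 + 1):       # the four neighbours q
                for qx in (x0, x0 + 1):
                    if 0 <= qy < h and 0 <= qx < w:
                        # Separable bilinear kernel G(q, p + delta_p).
                        g = max(0, 1 - abs(sy - qy)) * max(0, 1 - abs(sx - qx))
                        out[y, x] += g * feat[qy, qx]
    return out

f = np.arange(16.0).reshape(4, 4)
# A uniform displacement of (1, 0): each position samples one row down.
shifted = warp_feature(f, np.stack([np.ones((4, 4)), np.zeros((4, 4))]))
assert np.allclose(shifted[:3], f[1:])
```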
  • W_g, W_i and W_c are learned weights; σ(·) denotes the sigmoid function, which maps variables to values between 0 and 1.
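A toy gated update consistent with these definitions might look as follows. The exact patented equations are not reproduced in this text, so the wiring below (gates computed from the concatenated P-frame input and warped hidden state, applied to the warped cell state) and the weight shapes are assumptions for illustration only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_gates(x, h_warp, c_warp, Wg, Wi, Wc):
    """Illustrative gated update: Wg, Wi, Wc are learned weights and
    sigmoid maps gate activations into (0, 1), as in the description;
    the precise wiring is an assumption of this sketch."""
    z = np.concatenate([x, h_warp])
    g = sigmoid(Wg @ z)              # forget-style gate
    i = sigmoid(Wi @ z)              # input gate
    c_tilde = np.tanh(Wc @ z)        # candidate cell state
    c = g * c_warp + i * c_tilde     # new cell state
    h = np.tanh(c)                   # new hidden state
    return c, h

rng = np.random.default_rng(0)
d = 4
Wg, Wi, Wc = (rng.standard_normal((d, 2 * d)) for _ in range(3))
c, h = lstm_gates(np.zeros(d), np.zeros(d), np.ones(d), Wg, Wi, Wc)
assert c.shape == (d,) and np.allclose(c, 0.5)
```

With zero inputs every gate sits at sigmoid(0) = 0.5, so the warped cell state is simply halved, which makes the gate arithmetic easy to verify by hand.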
  • the present invention can quickly extract the features of P frames through the long-short-term memory network and the motion information and residual information in the video code stream, thereby effectively improving the feature extraction speed of compressed video.
  • an embodiment of the present invention further provides a salient target detection system for compressed video, which is applied to the salient target detection method for compressed video, and the system includes:
  • a first feature extraction module M100 configured to input the I frame data into a feature extraction network, to extract the first feature of the I frame data, and the feature extraction network includes a convolutional neural network;
  • the second feature extraction module M200 is configured to, for each P frame data, input the first feature of the corresponding frame data at the previous moment and the P frame data into a long short-term memory network, and extract the first feature of the P frame data.
  • the saliency detection module M300 is used for obtaining the saliency target area of each frame by adopting a saliency identification network according to the first feature of the data of each frame.
  • in the first feature extraction module M100, the convolutional neural network is used for extraction;
  • in the second feature extraction module M200, a long short-term memory network is introduced, so that the features of the previous frame, the P frame data and the LSTM network can be used for feature extraction;
  • the saliency detection module M300 then performs salient target detection on the extracted features.
  • An embodiment of the present invention further provides a salient target detection device for compressed video, including a processor and a memory storing executable instructions of the processor, wherein the processor is configured to perform, by executing the executable instructions, the steps of the salient target detection method for compressed video.
  • aspects of the present invention may be implemented as a system, method or program product. Therefore, various aspects of the present invention can be embodied in the following forms: an entirely hardware implementation, an entirely software implementation (including firmware, microcode, etc.), or a combination of hardware and software aspects, which may be collectively referred to herein as a "circuit", "module" or "system".
  • the electronic device 600 according to this embodiment of the present invention is described below with reference to FIG. 5 .
  • the electronic device 600 shown in FIG. 5 is only an example, and should not impose any limitation on the function and scope of use of the embodiments of the present invention.
  • electronic device 600 takes the form of a general-purpose computing device.
  • Components of the electronic device 600 may include, but are not limited to, at least one processing unit 610, at least one storage unit 620, a bus 630 connecting different system components (including the storage unit 620 and the processing unit 610), a display unit 640, and the like.
  • the storage unit stores program codes, and the program codes can be executed by the processing unit 610, so that the processing unit 610 executes the steps of the various exemplary embodiments according to the present invention described in the method section above of this specification.
  • the processing unit 610 may perform the steps shown in FIG. 1 .
  • the storage unit 620 may include a readable medium in the form of a volatile storage unit, such as a random access storage unit (RAM) 6201 and/or a cache storage unit 6202 , and may further include a read only storage unit (ROM) 6203 .
  • RAM random access storage unit
  • ROM read only storage unit
  • the storage unit 620 may also include a program/utility 6204 having a set (at least one) of program modules 6205 including, but not limited to, an operating system, one or more application programs, other program modules, and programs Data, each or some combination of these examples may include an implementation of a network environment.
  • the bus 630 may represent one or more of several types of bus structures, including a memory-unit bus or memory-unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus structures.
  • the electronic device 600 may also communicate with one or more external devices 700 (eg, keyboards, pointing devices, Bluetooth devices, etc.), with one or more devices that enable a user to interact with the electronic device 600, and/or with Any device (eg, router, modem, etc.) that enables the electronic device 600 to communicate with one or more other computing devices. Such communication may occur through input/output (I/O) interface 650 . Also, the electronic device 600 may communicate with one or more networks (eg, a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 660 . Network adapter 660 may communicate with other modules of electronic device 600 through bus 630 . It should be appreciated that, although not shown, other hardware and/or software modules may be used in conjunction with electronic device 600, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives and data backup storage systems.
  • when the processor executes the executable instructions, it performs the salient target detection method for compressed video, and thus obtains the beneficial effects of the salient target detection method provided by the present invention.
  • Embodiments of the present invention further provide a computer-readable storage medium for storing a program, and when the program is executed, the steps of the method for detecting a salient object in a compressed video are implemented.
  • aspects of the present invention can also be implemented in the form of a program product comprising program code; when the program product runs on a terminal device, the program code causes the terminal device to execute the steps according to the various exemplary embodiments of the present invention described in the method section above of this specification.
  • a program product 800 for implementing the above method according to an embodiment of the present invention is described, which can adopt a portable compact disk read only memory (CD-ROM) and include program codes, and can be used in a terminal device, For example running on a personal computer.
  • CD-ROM compact disk read only memory
  • the program product of the present invention is not limited thereto, and in this document, a readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • the program product may employ any combination of one or more readable media.
  • the readable medium may be a readable signal medium or a readable storage medium.
  • the readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or a combination of any of the above. More specific examples (non-exhaustive list) of readable storage media include: electrical connections with one or more wires, portable disks, hard disks, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM or flash memory), optical fiber, portable compact disk read only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
  • a readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying readable program code therein. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • a readable signal medium can also be any readable medium other than a readable storage medium that can transmit, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
  • Program code embodied on a readable storage medium may be transmitted using any suitable medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
  • Program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java or C++, as well as conventional procedural programming languages such as the "C" language or similar.
  • the program code may execute entirely on the user's computing device, partly on the user's device as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on a remote computing device or server cluster.
  • the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (e.g., connected through the Internet using an Internet service provider).
  • when executed, the stored program implements the steps of the method for detecting salient objects in compressed video, and thus attains the beneficial effects of the above-mentioned method for detecting salient objects in compressed video.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

Disclosed are a method, system, and device for detecting a salient target in a compressed video, and a storage medium. The compressed video comprises multi-frame data, and the multi-frame data comprises I-frame data and at least one piece of P-frame data. The method comprises the steps of: inputting the I-frame data into a feature-extraction network and extracting a first feature of the I-frame data, the feature-extraction network comprising a convolutional neural network; for each piece of P-frame data, inputting the first feature corresponding to the frame data at the previous moment together with the P-frame data into a long short-term memory network and extracting a first feature of the P-frame data; and, according to the first feature of each piece of frame data, using a saliency-recognition network to obtain the salient target region in each frame. With the present invention, by introducing a long short-term memory network, only the features of an I-frame need to be extracted, while the features of a P-frame can be extracted using the features of the previous frame, the P-frame data, and the long short-term memory network, which increases the speed of detecting a salient target in a compressed video.
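The data flow the abstract describes can be sketched as follows. This is a minimal toy illustration, not the patented networks: `cnn_feature`, `lstm_step`, and `saliency` are hypothetical stand-ins (a mean filter, an exponential blend, and a threshold) for the feature-extraction CNN, the long short-term memory network, and the saliency-recognition network. What it does show is the claimed structure: the expensive feature extractor runs only on the I-frame, while each P-frame's feature is derived from the previous frame's feature plus the P-frame data.

```python
import numpy as np

rng = np.random.default_rng(0)

def cnn_feature(i_frame: np.ndarray) -> np.ndarray:
    """Stand-in for the feature-extraction CNN: a 3x3 mean filter."""
    h, w = i_frame.shape
    padded = np.pad(i_frame, 1, mode="edge")
    out = np.empty_like(i_frame)
    for y in range(h):
        for x in range(w):
            out[y, x] = padded[y:y + 3, x:x + 3].mean()
    return out

def lstm_step(prev_feature: np.ndarray, p_frame: np.ndarray,
              forget: float = 0.7) -> np.ndarray:
    """Stand-in for the LSTM update: blend the previous frame's feature
    with the current P-frame data instead of re-running the CNN."""
    return forget * prev_feature + (1.0 - forget) * p_frame

def saliency(feature: np.ndarray, thresh: float = 0.5) -> np.ndarray:
    """Stand-in for the saliency-recognition network: a threshold mask."""
    return feature > thresh

# One GOP: an I-frame followed by two P-frames (random stand-in data).
i_frame = rng.random((8, 8))
p_frames = [rng.random((8, 8)) for _ in range(2)]

feat = cnn_feature(i_frame)      # CNN runs only once, on the I-frame
masks = [saliency(feat)]
for p in p_frames:               # each P-frame reuses the previous feature
    feat = lstm_step(feat, p)
    masks.append(saliency(feat))

print(len(masks), masks[0].shape)
```

Under this structure the per-P-frame cost is a single recurrent update rather than a full CNN forward pass, which is the source of the claimed speed-up.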
PCT/CN2021/082752 2020-09-24 2021-03-24 Method, system, and device for detecting a salient target in a compressed video, and storage medium WO2022062344A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011016604.7 2020-09-24
CN202011016604.7A CN111931732B (zh) 2020-09-24 2020-09-24 Salient object detection method, system, device, and storage medium for compressed video

Publications (1)

Publication Number Publication Date
WO2022062344A1 true WO2022062344A1 (fr) 2022-03-31

Family

ID=73334166

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/082752 WO2022062344A1 (fr) 2020-09-24 2021-03-24 Method, system, and device for detecting a salient target in a compressed video, and storage medium

Country Status (2)

Country Link
CN (1) CN111931732B (fr)
WO (1) WO2022062344A1 (fr)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115529457A (zh) * 2022-09-05 2022-12-27 清华大学 Video compression method and device based on deep learning
CN115953727A (zh) * 2023-03-15 2023-04-11 浙江天行健水务有限公司 Floc settling velocity detection method, system, electronic device, and medium

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111931732B (zh) * 2020-09-24 2022-07-15 苏州科达科技股份有限公司 Salient object detection method, system, device, and storage medium for compressed video
CN116052047B (zh) * 2023-01-29 2023-10-03 荣耀终端有限公司 Moving object detection method and related device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108495129A (zh) * 2018-03-22 2018-09-04 北京航空航天大学 Block partition coding complexity optimization method and device based on a deep learning approach
CN110163196A (zh) * 2018-04-28 2019-08-23 中山大学 Salient feature detection method and device
CN111026915A (zh) * 2019-11-25 2020-04-17 Oppo广东移动通信有限公司 Video classification method, video classification device, storage medium, and electronic device
US20200143457A1 (en) * 2017-11-20 2020-05-07 A9.Com, Inc. Compressed content object and action detection
CN111931732A (zh) * 2020-09-24 2020-11-13 苏州科达科技股份有限公司 Salient object detection method, system, device, and storage medium for compressed video

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3769788B2 (ja) * 1995-09-29 2006-04-26 ソニー株式会社 Image signal transmission apparatus and method
CN108241854B (zh) * 2018-01-02 2021-11-09 天津大学 Deep video saliency detection method based on motion and memory information
CN109376611B (zh) * 2018-09-27 2022-05-20 方玉明 Video saliency detection method based on a 3D convolutional neural network
CN111461043B (zh) * 2020-04-07 2023-04-18 河北工业大学 Video saliency detection method based on a deep network

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200143457A1 (en) * 2017-11-20 2020-05-07 A9.Com, Inc. Compressed content object and action detection
CN108495129A (zh) * 2018-03-22 2018-09-04 北京航空航天大学 Block partition coding complexity optimization method and device based on a deep learning approach
CN110163196A (zh) * 2018-04-28 2019-08-23 中山大学 Salient feature detection method and device
CN111026915A (zh) * 2019-11-25 2020-04-17 Oppo广东移动通信有限公司 Video classification method, video classification device, storage medium, and electronic device
CN111931732A (zh) * 2020-09-24 2020-11-13 苏州科达科技股份有限公司 Salient object detection method, system, device, and storage medium for compressed video

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YU SHENG; CHENG YUN; XIE LI; LUO ZHIMING; HUANG MIN; LI SHAOZI: "A novel recurrent hybrid network for feature fusion in action recognition", JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, vol. 49, 1 November 2017 (2017-11-01), US , pages 192 - 203, XP085260382, ISSN: 1047-3203, DOI: 10.1016/j.jvcir.2017.09.007 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115529457A (zh) * 2022-09-05 2022-12-27 清华大学 Video compression method and device based on deep learning
CN115529457B (zh) * 2022-09-05 2024-05-14 清华大学 Video compression method and device based on deep learning
CN115953727A (zh) * 2023-03-15 2023-04-11 浙江天行健水务有限公司 Floc settling velocity detection method, system, electronic device, and medium
CN115953727B (zh) * 2023-03-15 2023-06-09 浙江天行健水务有限公司 Floc settling velocity detection method, system, electronic device, and medium

Also Published As

Publication number Publication date
CN111931732B (zh) 2022-07-15
CN111931732A (zh) 2020-11-13

Similar Documents

Publication Publication Date Title
WO2022062344A1 (fr) Procédé, système, et dispositif de détection d'une cible proéminente dans une vidéo compressée, et support de stockage
US11200424B2 (en) Space-time memory network for locating target object in video content
JP7265034B2 (ja) 人体検出用の方法及び装置
CN108399381B (zh) 行人再识别方法、装置、电子设备和存储介质
WO2018192570A1 (fr) Procédé et système de détection de mouvement dans le domaine temporel, dispositif électronique et support de stockage informatique
WO2022105125A1 (fr) Procédé et appareil de segmentation d'image, dispositif informatique et support de stockage
CN111523447B (zh) 车辆跟踪方法、装置、电子设备及存储介质
CN112488073A (zh) 目标检测方法、系统、设备及存储介质
CN108230354B (zh) 目标跟踪、网络训练方法、装置、电子设备和存储介质
US9514363B2 (en) Eye gaze driven spatio-temporal action localization
CN112861575A (zh) 一种行人结构化方法、装置、设备和存储介质
WO2019020062A1 (fr) Procédé et appareil de segmentation d'objet vidéo, dispositif électronique, support de stockage et programme
CN110427899B (zh) 基于人脸分割的视频预测方法及装置、介质、电子设备
CN113869138A (zh) 多尺度目标检测方法、装置及计算机可读存储介质
US20200111214A1 (en) Multi-level convolutional lstm model for the segmentation of mr images
CN111444807B (zh) 目标检测方法、装置、电子设备和计算机可读介质
WO2023035531A1 (fr) Procédé de reconstruction à super-résolution pour image de texte et dispositif associé
WO2022152104A1 (fr) Procédé et dispositif d'apprentissage de modèle de reconnaissance d'action, ainsi que procédé et dispositif de reconnaissance d'action
Zhang et al. Attention-guided image compression by deep reconstruction of compressive sensed saliency skeleton
CN111832393A (zh) 一种基于深度学习的视频目标检测方法与装置
WO2022218012A1 (fr) Procédé et appareil d'extraction de caractéristiques, dispositif, support de stockage et produit programme
GB2579262A (en) Space-time memory network for locating target object in video content
CN114429566A (zh) 一种图像语义理解方法、装置、设备及存储介质
CN111368593B (zh) 一种马赛克处理方法、装置、电子设备及存储介质
CN108460335B (zh) 视频细粒度识别方法、装置、计算机设备及存储介质

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 21870733

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 21870733

Country of ref document: EP

Kind code of ref document: A1