WO2020238560A1 - Video target tracking method and apparatus, computer device, and storage medium - Google Patents

Video target tracking method and apparatus, computer device, and storage medium

Info

Publication number
WO2020238560A1
WO2020238560A1 (PCT/CN2020/088286, CN2020088286W)
Authority
WO
WIPO (PCT)
Prior art keywords
target
image
frame
image frame
detection
Prior art date
Application number
PCT/CN2020/088286
Other languages
English (en)
French (fr)
Inventor
崔振
揭泽群
魏力
许春燕
张桐
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Priority to EP20812620.1A (EP3979200A4)
Priority to JP2021537733A (JP7236545B2)
Publication of WO2020238560A1
Priority to US17/461,978 (US20210398294A1)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/215Motion-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/174Segmentation; Edge detection involving the use of two or more images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/194Segmentation; Edge detection involving foreground-background segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T7/248Analysis of motion using feature-based methods, e.g. the tracking of corners or segments involving reference images or patches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20021Dividing image into blocks, subimages or windows
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20076Probabilistic image processing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20092Interactive image processing based on input by user
    • G06T2207/20104Interactive definition of region of interest [ROI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20112Image segmentation details
    • G06T2207/20156Automatic seed setting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20112Image segmentation details
    • G06T2207/20164Salient point detection; Corner detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30196Human being; Person
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30232Surveillance
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Definitions

  • The embodiments of this application relate to the field of image recognition technology, and in particular to a video target tracking method and apparatus, a computer device, and a storage medium.
  • Video target tracking technology refers to tracking the target object of interest in the video, and identifying the target object from each image frame of the video.
  • In the related art, a video target tracking method based on semi-supervised learning is provided. First, an image segmentation model is trained on a set of training samples. Then, the parameters of the image segmentation model are adjusted using the first image frame of the video to be detected, so that the model is adapted to extracting the target object in that video; the position of the target object in the first image frame can be marked manually. Finally, the adjusted image segmentation model is used to identify the target object in the subsequent image frames of the video to be detected.
  • When the apparent difference between the first image frame and the subsequent image frames of the video to be detected is large, the adjusted image segmentation model cannot accurately identify the target object in the subsequent image frames. In most cases, as the appearance information changes, the model's predictions become very inaccurate.
  • According to various embodiments of this application, a video target tracking method and apparatus, a computer device, and a storage medium are provided.
  • A video target tracking method, executed by a computer device, the method including:
  • acquiring a local detection map corresponding to a target image frame in a video to be detected, the local detection map being generated based on appearance information of a target object that needs to be tracked by an image segmentation model in the video to be detected;
  • acquiring a relative motion saliency map corresponding to the target image frame, the relative motion saliency map being generated based on motion information of the target object;
  • determining, according to the local detection map and the relative motion saliency map, constraint information corresponding to the target image frame, the constraint information including absolute positive sample pixels, absolute negative sample pixels, and uncertain sample pixels in the target image frame;
  • adjusting parameters of the image segmentation model through the constraint information to obtain an adjusted image segmentation model; and
  • extracting the target object in the target image frame through the adjusted image segmentation model.
  • a video target tracking device comprising:
  • the detection image acquisition module is configured to obtain a partial detection image corresponding to the target image frame in the video to be detected, the partial detection image being generated based on the apparent information of the target object in the video to be detected that needs to be tracked by the image segmentation model;
  • a motion map acquisition module configured to acquire a relative motion saliency map corresponding to the target image frame, where the relative motion saliency map is generated based on the motion information of the target object;
  • a constraint information acquisition module, configured to determine constraint information corresponding to the target image frame according to the local detection map and the relative motion saliency map, the constraint information including absolute positive sample pixels, absolute negative sample pixels, and uncertain sample pixels in the target image frame;
  • a model adjustment module, configured to adjust the parameters of the image segmentation model through the constraint information to obtain an adjusted image segmentation model; and
  • the target segmentation module is used to extract the target object in the target image frame through the adjusted image segmentation model.
  • A computer device includes a processor and a memory, the memory storing at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement the video target tracking method described above.
  • A computer program product, which, when executed, is used to perform the above video target tracking method.
  • Figure 1a exemplarily shows a schematic diagram of an application environment for video target tracking
  • Figure 1b exemplarily shows a schematic diagram of video target tracking
  • FIG. 2 is a flowchart of a video target tracking method provided by an embodiment of the present application
  • Figure 3 exemplarily shows a schematic diagram of the overall process of the technical solution of the present application
  • FIG. 4 exemplarily shows a schematic diagram of the parameter adjustment process of the target detection model
  • Figure 5 exemplarily shows an architecture diagram of an image segmentation model
  • Figure 6 exemplarily shows a schematic diagram of samples extracted by the traditional method and the method of the present application
  • Fig. 7 is a block diagram of a video target tracking device provided by an embodiment of the present application.
  • FIG. 8 is a block diagram of a video target tracking device provided by another embodiment of the present application.
  • Fig. 9 is a structural block diagram of a computer device provided by an embodiment of the present application.
  • the video target tracking method provided in this application can be applied to the application environment as shown in FIG. 1a.
  • the computer device 102 and the video capture device 104 communicate through a network, as shown in FIG. 1a.
  • The computer device 102 can obtain the video to be detected from the video capture device 104 and obtain the local detection map corresponding to the target image frame in the video to be detected, where the local detection map is generated based on the appearance information of the target object that needs to be tracked by an image segmentation model in the video to be detected, and the image segmentation model is a neural network model used to segment and extract the target object from the image frames of the video to be detected; obtain the relative motion saliency map corresponding to the target image frame, where the relative motion saliency map is generated based on the motion information of the target object; determine, according to the local detection map and the relative motion saliency map, the constraint information corresponding to the target image frame, where the constraint information includes the absolute positive sample pixels, absolute negative sample pixels, and uncertain sample pixels in the target image frame; adjust the parameters of the image segmentation model through the constraint information to obtain an adjusted image segmentation model; and extract the target object in the target image frame through the adjusted image segmentation model.
  • the computer device 102 may be implemented by an independent server or a server cluster composed of multiple servers.
  • the video capture device 104 may include a surveillance camera or a terminal with a camera.
  • Video target tracking technology can be used in many different application scenarios. For example, in security scenarios, suspects in surveillance videos can be tracked and identified. For another example, in an application scenario of video analysis processing, image frames containing a specific character in a movie or TV series can be extracted, so as to integrate a video segment of the specific character.
  • Figure 1b exemplarily shows a schematic diagram of video target tracking.
  • Figure 1b contains multiple image frames of the video, labeled 11, 12, 13, and 14. To track the person and the vehicle in each image frame of the video, an image segmentation model can be trained; each image frame is input into the image segmentation model, and the image segmentation model segments and extracts the person and the vehicle from it.
  • For example, the person and the vehicle can each be labeled with a mask, so that they are marked in the image frame.
  • In the method provided by the embodiments of this application, the execution subject of each step is a computer device.
  • The computer device can be any electronic device with computing, processing, and storage capabilities.
  • For example, the computer device can be a PC (Personal Computer) or a server, or a terminal device such as a mobile phone, a tablet computer, a multimedia player, a wearable device, or a smart TV, or other devices such as a drone or a vehicle-mounted terminal, which is not limited in the embodiments of this application.
  • FIG. 2 shows a flowchart of a video target tracking method provided by an embodiment of the present application.
  • The method can include the following steps (201 to 205):
  • Step 201 Obtain a partial detection image corresponding to a target image frame in the video to be detected.
  • When the target object in the video to be detected needs to be tracked, an image frame can be given in which the mask of the target object is marked, and the image segmentation model is subsequently used to segment and extract the target object from the other image frames of the video to be detected.
  • the target object may be a person or an object, which is not limited in the embodiment of the present application.
  • the mask of the target object is marked in the first image frame of the video to be detected, and then the target object is segmented and extracted from subsequent image frames of the video to be detected by using an image segmentation model.
  • marking the mask of the target object in the given image frame (such as the first image frame) can be done manually.
  • The target image frame can be any image frame in the video to be detected in which the target object has not been marked, that is, an image frame from which the target object needs to be extracted by the image segmentation model.
  • the local detection map is generated based on the apparent information of the target object to be tracked.
  • apparent information refers to information that can be distinguished visually, such as color, shape, texture and other information.
  • the target image frame is processed by the target detection model to obtain the local detection map corresponding to the target image frame.
  • the target detection model may be a model obtained by training a convolutional neural network.
  • the size of the local detection map is the same as the size of the target image frame. For example, if the size of the target image frame is 800*600 pixels, the size of the local inspection map is also 800*600 pixels.
  • The value of a target pixel in the local detection map reflects the probability that the pixel at the same position in the target image frame belongs to the target object, and this probability is determined based on the appearance information of the target pixel.
  • the target object in the video to be detected is tracked and identified through the image segmentation model.
  • the image segmentation model is a neural network model used to segment and extract the target object from the image frame of the video to be detected.
  • the image segmentation model may be a deep learning model constructed based on a convolutional neural network.
  • In order to ensure the segmentation accuracy of the image segmentation model during target tracking, it is necessary to perform online adaptive training on the image segmentation model to adjust the parameters of the model (such as the weights of the neural network), and then use the adjusted image segmentation model to segment the target object.
  • this step may include the following sub-steps:
  • the training samples are used to train the target detection model to adjust and optimize the parameters of the target detection model.
  • the training sample includes a labeled image frame and a detection target frame corresponding to the labeled image frame.
  • Annotated image frame refers to an image frame that has annotated the mask of the target object.
  • the annotated image frame may include the image frame in which the mask of the target object is manually annotated as described above, or may include the image frame in which the mask of the target object is annotated by the image segmentation model.
  • a training sample includes a labeled image frame and a detection target frame corresponding to this labeled image frame. Therefore, multiple training samples can be selected from a labeled image frame.
  • the detection target frame refers to an image area where the proportion of the target object is greater than a preset threshold.
  • Suppose a box is added to a labeled image frame; part of the image area inside this box may belong to the target object and part may not. The proportion of pixels belonging to the target object inside the box is calculated, and if this proportion is greater than the preset threshold, the box is determined to be a detection target frame; otherwise, it is not.
  • the preset threshold may be set in advance according to actual requirements. For example, the preset threshold is 0.5.
  • the frame described above may be rectangular or other shapes, which is not limited in the embodiment of the present application.
  • In an exemplary embodiment, the training samples are selected in the following way: boxes are randomly scattered in the labeled image frame, and the proportion of the target object in each box is calculated; if the proportion of the target object in a box is greater than the preset threshold, the box is determined to be a detection target frame corresponding to the labeled image frame, and the labeled image frame and the detection target frame are selected as a training sample.
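  • The box-sampling procedure above can be illustrated with a short sketch. The following Python/NumPy snippet is a minimal illustration rather than the implementation used here; the function name, the number of trials, and the box-size range are assumptions.

```python
import numpy as np

def sample_detection_boxes(mask, num_trials=200, threshold=0.5, rng=None):
    """Randomly place boxes in a labeled frame and keep those whose
    target-pixel proportion exceeds `threshold`.

    mask: (H, W) binary array, 1 where the pixel belongs to the target object.
    Returns a list of (x0, y0, x1, y1) detection target boxes.
    """
    rng = rng or np.random.default_rng()
    h, w = mask.shape
    boxes = []
    for _ in range(num_trials):
        # Random box size and position (the size range is an illustrative choice).
        bw = rng.integers(w // 8, w // 2)
        bh = rng.integers(h // 8, h // 2)
        x0 = rng.integers(0, w - bw)
        y0 = rng.integers(0, h - bh)
        region = mask[y0:y0 + bh, x0:x0 + bw]
        # Proportion of pixels inside the box that belong to the target object.
        if region.mean() > threshold:
            boxes.append((x0, y0, x0 + bw, y0 + bh))
    return boxes
```

  • Each (labeled image frame, kept box) pair then forms one training sample used to fine-tune the target detection model.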
  • the Faster-RCNN network is selected as the framework of the target detection model.
  • the parameters (such as network weights) of the target detection model are fine-tuned through the training samples selected above to obtain the adjusted target detection model.
  • Exemplarily, in the process of adjusting the parameters of the target detection model with the training samples, the batch size can be 1 and fine-tuning can run for 600 rounds; the box size, aspect ratio, and so on can all be adjusted during training, so as to finally obtain a higher-precision target detection model.
  • the target image frame is processed through the adjusted target detection model to obtain a partial detection map.
  • the mask of the target object in the first image frame of the video to be detected is manually annotated.
  • Starting from the 2nd image frame, the target object is segmented and extracted sequentially. If the local detection map corresponding to the i-th image frame in the video to be detected needs to be obtained, where i is an integer greater than 1, at least one training sample can be selected from the 1st image frame and the (i-1)-th image frame, the parameters of the current target detection model are adjusted with the training samples to obtain an adjusted target detection model, and the adjusted target detection model is then used to process the i-th image frame to obtain the local detection map corresponding to the i-th image frame.
  • Step 202 Obtain a relative motion saliency map corresponding to the target image frame.
  • the relative motion saliency map is generated based on the motion information of the target object.
  • the position of the target object in each image frame of the video to be detected may not be static, and it may move.
  • the motion information reflects the motion of the target object, that is, the position change in different image frames.
  • the relative motion saliency map is determined by detecting the optical flow between adjacent image frames, and the optical flow reflects the motion information of the target object.
  • optical flow refers to the movement of pixels in a video image over time.
  • the relative motion saliency map has the same size as the target image frame. For example, if the size of the target image frame is 800*600 pixels, the size of the relative motion saliency map is also 800*600 pixels.
  • the value of the target pixel in the relative motion saliency map reflects the probability that the target pixel at the same position in the target image frame belongs to the target object, and the probability is determined based on the motion information of the target pixel.
  • this step may include the following sub-steps:
  • the adjacent image frame refers to the image frame adjacent to the target image frame in the video to be detected.
  • the number of adjacent image frames may be one or multiple, which is not limited in the embodiment of the present application.
  • the adjacent image frame may include the previous image frame, the subsequent image frame, or the previous image frame and the subsequent image frame at the same time.
  • the previous image frame refers to the image frame located before the target image frame in the video to be detected
  • the subsequent image frame refers to the image frame located after the target image frame in the video to be detected.
  • the previous image frame is the previous image frame of the target image frame
  • the subsequent image frame is the next image frame of the target image frame.
  • For example, if the target image frame is the i-th image frame, the previous image frame is the (i-1)-th image frame and the subsequent image frame is the (i+1)-th image frame, where i is an integer greater than 1. When computing the optical flow corresponding to the target image frame, taking into account both the optical flow between the target image frame and its previous image frame and the optical flow between the target image frame and its subsequent image frame gives better results.
  • FlowNet2 is used as the basic model for calculating the optical flow between the target image frame and the adjacent image frame.
  • FlowNet2 is a model that uses CNN (Convolutional Neural Networks, convolutional neural network) to extract optical flow, which has the advantages of fast speed and high accuracy.
  • a relative motion saliency map corresponding to the target image frame is generated according to the optical flow.
  • the relative motion saliency map is generated as follows:
  • the background area in the local detection image refers to the remaining area outside the area where the target object detected in the local detection image is located.
  • According to the local detection map, the area where the target object is located and the background area can be determined.
  • the average value of the optical flow of each pixel in the background area is taken as the background optical flow.
  • the difference between the optical flow of each pixel and the background optical flow is calculated by RMS (Root Mean Square) to obtain the relative motion saliency map corresponding to the target image frame.
  • In addition, the second (L2) norm of the absolute optical flow can be added, with the two parts combined at a ratio of 1:1; that is, the value RMS_{m,n} of pixel (m, n) in the relative motion saliency map is computed as:

    RMS_{m,n} = ||O_{m,n} − Ω||_2 + ||O_{m,n}||_2

  • where O_{m,n} is the optical flow of the pixel (m, n) and Ω is the background optical flow.
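  • As a minimal sketch of the computation just described, the following Python/NumPy function derives a relative motion saliency map from a precomputed dense optical flow field (FlowNet2 in this document; any dense flow estimator could stand in) and a binary background mask obtained from the local detection map. The function name, the equal 1:1 weighting, and the final normalization are assumptions.

```python
import numpy as np

def relative_motion_saliency(flow, background_mask):
    """flow: (H, W, 2) dense optical flow of the target image frame.
    background_mask: (H, W) boolean array, True for background pixels
    (the area outside the detected target in the local detection map).
    Returns an (H, W) relative motion saliency map."""
    # Background optical flow: mean flow over the background pixels.
    omega = flow[background_mask].mean(axis=0)            # shape (2,)
    # L2 norm of the per-pixel difference to the background flow ...
    diff_term = np.linalg.norm(flow - omega, axis=-1)
    # ... plus the L2 norm of the absolute flow, combined at a 1:1 ratio.
    abs_term = np.linalg.norm(flow, axis=-1)
    rms = diff_term + abs_term
    # Normalize to [0, 1] so it can be thresholded like a probability (assumption).
    return rms / (rms.max() + 1e-8)
```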
  • Step 203 Determine the constraint information corresponding to the target image frame according to the local detection map and the relative motion saliency map.
  • the constraint information includes absolute positive sample pixels, absolute negative sample pixels and uncertain sample pixels in the target image frame.
  • the absolute positive sample pixels refer to the pixels in the target image frame that are determined to belong to the target object based on the above-mentioned appearance information and motion information.
  • absolutely negative sample pixels refer to pixels in the target image frame that are determined not to belong to the target object based on the above-mentioned appearance information and motion information.
  • Uncertain sample pixels refer to pixels in the target image frame that cannot be determined whether they belong to the target object based on the above-mentioned appearance information and motion information.
  • The constraint information may also be referred to as a constraint flow.
  • For a target pixel in the target image frame: if the value of the target pixel in the local detection map meets the first preset condition and its value in the relative motion saliency map meets the second preset condition, the target pixel is determined to be an absolute positive sample pixel; if its value in the local detection map does not meet the first preset condition and its value in the relative motion saliency map does not meet the second preset condition, the target pixel is determined to be an absolute negative sample pixel; if its value in the local detection map meets the first preset condition while its value in the relative motion saliency map does not meet the second preset condition, or its value in the local detection map does not meet the first preset condition while its value in the relative motion saliency map meets the second preset condition, the target pixel is determined to be an uncertain sample pixel.
  • Optionally, the first preset condition is that the value is greater than a first threshold, and the second preset condition is that the value is greater than a second threshold. Exemplarily, the first threshold is 0.7 and the second threshold is 0.5.
  • the first threshold and the second threshold may be preset according to actual conditions, and the foregoing is only an example.
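  • A minimal sketch of how the constraint information could be derived from the two maps, assuming both have been normalized to per-pixel probabilities; the thresholds follow the exemplary values above (0.7 and 0.5), while the function name and the 1/0/-1 label encoding are illustrative assumptions.

```python
import numpy as np

def build_constraint_info(local_det, motion_sal, t1=0.7, t2=0.5):
    """local_det, motion_sal: (H, W) maps with per-pixel values in [0, 1].
    Returns an (H, W) int array: 1 = absolute positive sample pixel,
    0 = absolute negative sample pixel, -1 = uncertain sample pixel."""
    cond1 = local_det > t1          # first preset condition (appearance)
    cond2 = motion_sal > t2         # second preset condition (motion)
    labels = np.full(local_det.shape, -1, dtype=np.int8)   # default: uncertain
    labels[cond1 & cond2] = 1       # both conditions met  -> absolute positive
    labels[~cond1 & ~cond2] = 0     # neither condition met -> absolute negative
    return labels
```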
  • Step 204 Adjust the parameters of the image segmentation model through the constraint information to obtain the adjusted image segmentation model.
  • the constraint information can be used to adaptively learn the image segmentation model, fine-tune its parameters, and improve its accuracy when segmenting and extracting the target object from the target image frame.
  • absolute positive sample pixels and absolute negative sample pixels are used to adjust the parameters of the image segmentation model to obtain an adjusted image segmentation model. That is, when adjusting the parameters of the image segmentation model, only absolute positive sample pixels and absolute negative sample pixels are used, and uncertain sample pixels are not considered.
  • The loss function of the image segmentation model can adopt a cross-entropy loss function, which can be expressed as:

    L(x, Y) = − Σ_{j∈Y+} log P(y_j = 1 | x) − Σ_{j∈Y−} log P(y_j = 0 | x)

  • where L represents the value of the loss function, x is the target image frame, Y is the pixel-level constraint information of the target image frame x, Y+ and Y− denote the absolute positive sample pixels and the absolute negative sample pixels, and P(·) is the prediction result of the image segmentation model for the target image frame x.
  • The difference from the traditional loss function is that this loss does not compute a loss for the uncertain sample pixels, so the ambiguous regions can be ignored and the confident regions can be learned better.
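  • The loss that skips uncertain sample pixels can be written as a masked cross-entropy. The PyTorch sketch below is an assumed rendering of the expression above, not the code used here; it reuses the 1/0/-1 label encoding from the previous sketch.

```python
import torch
import torch.nn.functional as F

def constrained_bce_loss(pred_logits, labels):
    """pred_logits: (N, 1, H, W) raw segmentation scores P(.) before the sigmoid.
    labels: (N, H, W) with 1 (absolute positive), 0 (absolute negative),
    -1 (uncertain, excluded from the loss)."""
    pred = pred_logits.squeeze(1)               # (N, H, W)
    mask = labels >= 0                          # keep only Y+ and Y- pixels
    target = labels.clamp(min=0).float()        # -1 -> 0, but masked out anyway
    loss = F.binary_cross_entropy_with_logits(pred, target, reduction='none')
    return (loss * mask.float()).sum() / mask.float().sum().clamp(min=1.0)
```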
  • Step 205 Extract the target object in the target image frame through the adjusted image segmentation model.
  • the target image frame is input to the adjusted image segmentation model, and the target object in the target image frame is extracted by segmentation.
  • The image segmentation model can be adaptively adjusted by training once for every image frame, or once every several image frames (such as every 5 image frames). Considering that the position of the target object changes little between adjacent image frames, performing adaptive adjustment training every several image frames can reduce the amount of computation and improve the processing efficiency of the entire video while losing as little model accuracy as possible.
  • each adaptive adjustment training can be trained for one round or multiple rounds (such as 3 rounds), which is not limited in the embodiment of the present application.
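  • The overall online adaptation loop described in steps 201 to 205, including the option of adapting only every few frames, can be summarized in a structural sketch. The function below is runnable on its own, but all the helper callables it expects (detector fine-tuning, local detection map, optical flow, saliency, constraint building, segmenter fine-tuning, prediction) are placeholders for the steps described in this document, not a real API.

```python
def track_video(frames, seg_model, det_model, first_frame_mask, helpers,
                adapt_every=5, adapt_rounds=3):
    """Online adaptation loop sketch following steps 201-205.

    `helpers` is a dict of callables supplied by the surrounding system:
    'finetune_detector', 'local_detection_map', 'optical_flow',
    'saliency', 'constraints', 'finetune_segmenter', 'predict'.
    """
    masks = [first_frame_mask]                    # frame 1 is annotated manually
    for idx in range(1, len(frames)):             # 0-based index of the current frame
        frame, prev_frame, prev_mask = frames[idx], frames[idx - 1], masks[-1]
        if (idx - 1) % adapt_every == 0:          # adapt once every N frames
            det_model = helpers['finetune_detector'](
                det_model, frames[0], first_frame_mask, prev_frame, prev_mask)
            local_det = helpers['local_detection_map'](det_model, frame)
            flow = helpers['optical_flow'](frame, prev_frame)   # e.g. FlowNet2
            motion_sal = helpers['saliency'](flow, local_det < 0.5)
            labels = helpers['constraints'](local_det, motion_sal)
            seg_model = helpers['finetune_segmenter'](
                seg_model, frame, labels, rounds=adapt_rounds)
        masks.append(helpers['predict'](seg_model, frame))      # step 205
    return masks
```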
  • In the method provided by the embodiments of this application, the parameters of the image segmentation model are adjusted through the constraint information. Since the constraint information integrates the appearance information and the motion information of the target object, on the one hand it can overcome the problem that the appearance of the target object differs greatly between different image frames of the video to be detected, and on the other hand it can reduce error propagation in the adaptive learning process. At the same time, through the complementarity of the two parts, more accurate guidance can be generated for each update of the model parameters, thereby better constraining the adjustment process of the model parameters.
  • FIG. 3 it exemplarily shows a schematic diagram of the overall flow of the technical solution of the present application.
  • The detection target frame corresponding to the target image frame is extracted by the target detection model, from which the local detection map is further obtained; the optical flow corresponding to the target image frame is extracted by the optical flow model, from which the relative motion saliency map corresponding to the target image frame is further calculated; and the local detection map and the relative motion saliency map are merged to obtain the constraint information.
  • the parameters of the image segmentation model are adjusted to obtain the adjusted image segmentation model.
  • the target object in the target image frame is extracted.
  • The image segmentation model can include components such as a feature extractor, an atrous (dilated) spatial convolution module, and a deconvolution and upsampling module.
  • FIG. 4 exemplarily shows a schematic diagram of the parameter adjustment process of the target detection model. Boxes are randomly placed in the labeled image frame, the proportion of the target object in each box is calculated, and the training samples of the target detection model are selected based on this proportion. The parameters of the target detection model are fine-tuned with the training samples to obtain the adjusted target detection model. After that, the target image frame is input into the adjusted target detection model to obtain the local detection map corresponding to the target image frame.
  • In summary, in the technical solution provided by the embodiments of this application, the parameters of the image segmentation model are adjusted through the constraint information. Since the constraint information is obtained by combining the appearance information and the motion information of the target object, on the one hand it can overcome the problem that the appearance of the target object differs greatly between different image frames of the video to be detected, and on the other hand it can reduce error propagation in the adaptive learning process. At the same time, through the complementarity of the two parts, more accurate guidance can be generated for each update of the model parameters, which can better constrain the adjustment process of the model parameters, so that the image segmentation model performs better after the parameter adjustment and the target object is finally extracted from the target image frame with higher accuracy.
  • the motion information can be characterized more accurately.
  • the pre-training process of the image segmentation model is as follows:
  • the initial image segmentation model can be an end-to-end trainable convolutional neural network.
  • the input is an image and the output is the mask of the target in the image.
  • Deeplab V3+ is selected as an end-to-end trainable convolutional neural network. After the network obtains the input three-channel picture information, it can return a prediction mask of the same size.
  • FIG. 5 exemplarily shows an architecture diagram of an image segmentation model.
  • The ResNet convolutional neural network is used as the basic feature extractor. An ASPP (Atrous Spatial Pyramid Pooling) module is added after the fifth layer of the ResNet model, and atrous convolutions of different scales are used to process the output features, which are fused with the features extracted by the third layer of the ResNet model; this better restores the segmentation prediction results at various scales. The features learned by the network are then returned to high resolution through deconvolution or upsampling, which can effectively improve the accuracy of the image segmentation model.
  • For each frame in the video, the network outputs a response map of the corresponding scale, which is the probability prediction result of the segmentation.
  • Optionally, the ResNet-101 network is selected as the basic network of the DeepLab V3+ feature extractor. After the basic convolutional neural network, the ASPP module is connected, the features extracted by the third-layer ResNet model are introduced at the same time, and a deconvolution process with two deconvolution up-sampling modules is added to obtain high-resolution segmentation result prediction maps.
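  • The ASPP-plus-decoder design described above can be sketched in PyTorch. This is a simplified, assumed re-implementation for illustration (channel counts, dilation rates, and upsampling factors are illustrative) rather than the exact architecture used here; in practice an off-the-shelf DeepLab V3+ implementation could also serve as the starting point.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleASPP(nn.Module):
    """Atrous Spatial Pyramid Pooling: parallel dilated convolutions at
    different rates, concatenated and projected back."""
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
            for r in rates
        ])
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x):
        feats = [F.relu(branch(x)) for branch in self.branches]
        return self.project(torch.cat(feats, dim=1))

class SimpleSegHead(nn.Module):
    """Fuses deep ASPP features with a shallower (third-stage) feature map and
    upsamples with two transposed convolutions to produce one-channel mask logits."""
    def __init__(self, deep_ch=2048, low_ch=512, mid_ch=256):
        super().__init__()
        self.aspp = SimpleASPP(deep_ch, mid_ch)
        self.low_proj = nn.Conv2d(low_ch, mid_ch, kernel_size=1)
        self.up1 = nn.ConvTranspose2d(mid_ch * 2, mid_ch, kernel_size=4, stride=2, padding=1)
        self.up2 = nn.ConvTranspose2d(mid_ch, 1, kernel_size=4, stride=2, padding=1)

    def forward(self, deep_feat, low_feat):
        x = self.aspp(deep_feat)
        x = F.interpolate(x, size=low_feat.shape[-2:], mode='bilinear', align_corners=False)
        x = torch.cat([x, self.low_proj(low_feat)], dim=1)   # fuse deep and shallow features
        return self.up2(F.relu(self.up1(x)))                 # upsampled mask logits
```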
  • Optionally, the image segmentation model is first trained preliminarily with a first sample set and then retrained with a second sample set, where the first sample set contains at least one labeled picture and the second sample set contains at least one labeled video.
  • Optionally, the Pascal VOC database is selected as the first sample set; the Pascal VOC database contains 2913 images with pixel-level segmentation annotations. By learning semantic segmentation of images, the image segmentation model can be better trained. The initial training can use a batch size of 4 for 8000 rounds.
  • the DAVIS16 database is selected as the second sample set to adapt the image segmentation model to the target segmentation task.
  • the DAVIS16 database has 50 pixel-level annotated videos with a total of 3455 frames, 30 of which are used for training and 20 are used for testing.
  • In addition, data expansion can be performed on the samples; for example, the original image can be expanded to multiple different scales, such as scaling its size by 0.8 times, 1.2 times, and 1.6 times, so that the image segmentation model adapts to images of different scales.
  • Optionally, the initial learning rate is selected to be 0.001 with 4 samples per training batch, the learning rate is dropped to 1/10 of its value every 2400 rounds, and training runs for a total of 6000 rounds, finally yielding the pre-trained image segmentation model.
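  • The pre-training schedule just described (initial learning rate 0.001, batch size 4, learning rate divided by 10 every 2400 rounds, 6000 rounds in total) could be set up roughly as follows in PyTorch; the dataset interface, the SGD/momentum choice, and the binary cross-entropy loss are assumptions.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader

def pretrain(model, dataset, rounds=6000, batch_size=4):
    """dataset yields (image, mask) pairs, e.g. Pascal VOC then DAVIS16,
    optionally expanded to scales such as 0.8x, 1.2x, and 1.6x."""
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
    # Drop the learning rate to 1/10 of its value every 2400 rounds.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2400, gamma=0.1)
    step, it = 0, iter(loader)
    while step < rounds:
        try:
            images, masks = next(it)
        except StopIteration:
            it = iter(loader)
            images, masks = next(it)
        optimizer.zero_grad()
        logits = model(images).squeeze(1)          # assumes (N, 1, H, W) model output
        loss = F.binary_cross_entropy_with_logits(logits, masks.float())
        loss.backward()
        optimizer.step()
        scheduler.step()
        step += 1
    return model
```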
  • The pre-training process of the above-mentioned image segmentation model can be executed in the computer device that executes the video target tracking method introduced above, or in a device other than that computer device; in the latter case, the other device provides the pre-trained image segmentation model to the computer device, and the computer device uses the pre-trained image segmentation model to execute the video target tracking method.
  • Regardless of whether the pre-training process of the image segmentation model is performed on the computer device or on other equipment, when the computer device performs video target tracking on the video to be detected, it needs to adaptively learn and adjust the parameters of the pre-trained image segmentation model using the video to be detected, so that the image segmentation model can output accurate segmentation results for each frame.
  • For each frame, an adaptive training process is performed on the image segmentation model to learn and adjust the model parameters.
  • In the traditional method, the adjustment is based on the prediction result of the previous frame: for example, an erosion algorithm is applied to the prediction result of the previous frame to generate absolute positive sample pixels, the pixels beyond a certain Euclidean distance from the absolute positive samples are set as absolute negative sample pixels, and the adjustment of the model parameters is guided by such constraints.
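  • For comparison, the traditional constraint generation described in the preceding paragraph (erode the previous frame's prediction to obtain positives, then mark everything beyond a certain Euclidean distance from the positives as negatives) could look roughly like the OpenCV sketch below; the kernel size and the distance threshold are illustrative assumptions.

```python
import cv2
import numpy as np

def traditional_constraints(prev_mask, erode_iter=3, dist_thresh=40):
    """prev_mask: (H, W) uint8 binary prediction of the previous frame (0/1).
    Returns labels with 1 = positive, 0 = negative, -1 = uncertain."""
    kernel = np.ones((5, 5), np.uint8)
    # Erode the previous prediction to get conservative positive pixels.
    positives = cv2.erode(prev_mask, kernel, iterations=erode_iter)
    # Euclidean distance of every pixel to the nearest positive pixel.
    dist = cv2.distanceTransform((positives == 0).astype(np.uint8), cv2.DIST_L2, 5)
    labels = np.full(prev_mask.shape, -1, dtype=np.int8)
    labels[positives == 1] = 1
    labels[dist > dist_thresh] = 0        # far from any positive -> negative
    return labels
```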
  • the adjusted image segmentation model is used to predict the segmentation result of the target image frame to be detected.
  • Therefore, the traditional method relies more heavily on the accuracy of the previous frame, is relatively coarse, and has difficulty obtaining detailed information.
  • In contrast, the method provided by the embodiments of the present application can better take into account the motion information and the appearance information so as to supervise the adaptive learning process, and in addition it can better preserve local details.
  • the absolute positive sample pixels and the absolute negative sample pixels marked in the adaptive learning process are more accurate and reliable, and the number of uncertain sample pixels is smaller.
  • FIG. 6 exemplarily shows a schematic diagram of the absolute positive sample pixels, absolute negative sample pixels, and uncertain sample pixels marked in the adaptive learning process using the method provided by the embodiments of the application.
  • The pixels in the area 61 are absolute positive sample pixels, the pixels in the black area 62 are absolute negative sample pixels, and the pixels in the gray area 63 are uncertain sample pixels. It can be seen from FIG. 6 that the proportion of uncertain sample pixels is very small and the marked regions have more accurate and reliable edges.
  • The constraint information obtained by the method provided in the embodiments of this application not only has a high correct rate for positive and negative samples, but also has a small proportion of uncertain samples, which demonstrates the effectiveness of the method provided by the embodiments of this application.
  • the results obtained by the method provided in the embodiments of the present application are more prominent.
  • the method provided in the embodiments of the present application can obtain very accurate results.
  • In summary, the method provided by the embodiments of the application can significantly improve the accuracy of video target segmentation. It better considers the fusion of the motion information and appearance information of the target object, effectively constrains the adaptive learning process of the model in special cases of video target segmentation such as occlusion, large appearance changes, and background clutter, and, by introducing an optimized loss function to constrain the learning process of the model, further improves the accuracy of target segmentation in the video.
  • Although the steps in the flowchart of FIG. 2 are displayed in sequence as indicated by the arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated herein, the execution order of these steps is not strictly limited, and they can be executed in other orders. Moreover, at least some of the steps in FIG. 2 may include multiple sub-steps or multiple stages; these sub-steps or stages are not necessarily executed at the same moment but can be executed at different moments, and their execution order is not necessarily sequential, as they may be performed in turn or alternately with other steps or with at least a part of the sub-steps or stages of other steps.
  • FIG. 7 shows a block diagram of a video target tracking device provided by an embodiment of the present application.
  • the device has the function of realizing the above method example, and the function can be realized by hardware, or by hardware executing corresponding software.
  • the device can be a computer device, or it can be set on a computer device.
  • the apparatus 700 may include: a detection image acquisition module 710, a motion image acquisition module 720, a constraint information acquisition module 730, a model adjustment module 740, and a target segmentation module 750.
  • The detection map acquisition module 710 is configured to obtain a local detection map corresponding to the target image frame in the video to be detected, where the local detection map is generated based on the appearance information of the target object that needs to be tracked by the image segmentation model in the video to be detected, and the image segmentation model is a neural network model used to segment and extract the target object from the image frames of the video to be detected.
  • the motion map acquisition module 720 is configured to acquire a relative motion saliency map corresponding to the target image frame, where the relative motion saliency map is generated based on the motion information of the target object.
  • the constraint information acquisition module 730 is configured to determine constraint information corresponding to the target image frame according to the local detection map and the relative motion saliency map, and the constraint information includes absolute positive sample pixels in the target image frame, absolutely negative sample pixels and uncertain sample pixels.
  • the model adjustment module 740 is configured to adjust the parameters of the image segmentation model through the constraint information to obtain an adjusted image segmentation model.
  • the target segmentation module 750 is configured to extract the target object in the target image frame through the adjusted image segmentation model.
  • In the technical solution provided by the embodiments of this application, the parameters of the image segmentation model are adjusted through the constraint information. Since the constraint information is obtained by combining the appearance information and the motion information of the target object, on the one hand it can overcome the problem that the appearance of the target object differs greatly between different image frames of the video to be detected, and on the other hand it can reduce error propagation in the adaptive learning process. At the same time, through the complementarity of the two parts, more accurate guidance can be generated for each update of the model parameters, which can better constrain the adjustment process of the model parameters, so that the image segmentation model performs better after the parameter adjustment and the target object is finally extracted from the target image frame with higher accuracy.
  • the detection image acquisition module 710 includes: a sample selection submodule 711, a model adjustment submodule 712 and a detection image acquisition submodule 713.
  • The sample selection submodule 711 is configured to select at least one training sample from the labeled image frames of the video to be detected, the training sample including a labeled image frame and a detection target frame corresponding to the labeled image frame, where the detection target frame refers to an image area in which the proportion of the target object is greater than a preset threshold.
  • the model adjustment sub-module 712 is configured to adjust the parameters of the target detection model through the training samples to obtain an adjusted target detection model.
  • the detection image acquisition sub-module 713 is configured to process the target image frame through the adjusted target detection model to obtain the partial detection image.
  • Optionally, the sample selection submodule 711 is configured to: randomly scatter boxes in the labeled image frame and calculate the proportion of the target object in each box; and if the proportion of the target object in a box is greater than the preset threshold, determine the box as the detection target frame corresponding to the labeled image frame and select the labeled image frame and the detection target frame as a training sample.
  • the motion picture acquisition module 720 includes: an optical flow calculation sub-module 721 and a motion picture acquisition sub-module 722.
  • the optical flow calculation sub-module 721 is configured to calculate the optical flow between the target image frame and adjacent image frames.
  • the motion map acquisition sub-module 722 is configured to generate the relative motion saliency map according to the optical flow.
  • Optionally, the motion map acquisition sub-module 722 is configured to: determine the background area in the local detection map, where the background area refers to the remaining area outside the area in which the target object is detected in the local detection map; take the average optical flow of the pixels in the background area as the background optical flow; and generate the relative motion saliency map according to the background optical flow and the optical flow corresponding to the target image frame.
  • Optionally, the constraint information acquisition module 730 is configured to:
  • for a target pixel in the target image frame, when the value of the target pixel in the local detection map meets the first preset condition and the value of the target pixel in the relative motion saliency map meets the second preset condition, determine that the target pixel is the absolute positive sample pixel;
  • when the value of the target pixel in the local detection map does not meet the first preset condition and the value of the target pixel in the relative motion saliency map does not meet the second preset condition, determine that the target pixel is the absolute negative sample pixel; and
  • when the value of the target pixel in the local detection map meets the first preset condition and the value of the target pixel in the relative motion saliency map does not meet the second preset condition, or when the value of the target pixel in the local detection map does not meet the first preset condition and the value of the target pixel in the relative motion saliency map meets the second preset condition, determine that the target pixel is the uncertain sample pixel.
  • Optionally, the model adjustment module 740 is configured to use the absolute positive sample pixels and the absolute negative sample pixels to retrain the image segmentation model to obtain the adjusted image segmentation model.
  • Optionally, the pre-training process of the image segmentation model is as follows: a first sample set is used to preliminarily train the image segmentation model, where the first sample set contains at least one labeled picture; and a second sample set is used to retrain the preliminarily trained image segmentation model to obtain the pre-trained image segmentation model, where the second sample set contains at least one labeled video.
  • It should be noted that, when the device provided in the above embodiment implements its functions, the division into the above functional modules is used only as an example for description. In practical applications, the above functions can be allocated to different functional modules as needed, that is, the internal structure of the device is divided into different functional modules to complete all or part of the functions described above.
  • the apparatus and method embodiments provided by the above-mentioned embodiments belong to the same concept, and the specific implementation process is detailed in the method embodiments, which will not be repeated here.
  • FIG. 9 shows a structural block diagram of a computer device 900 according to an embodiment of the present application.
  • the computer device 900 may be a mobile phone, a tablet computer, an e-book reading device, a wearable device, a smart TV, a multimedia playback device, a PC, a server, and the like.
  • the terminal 900 includes a processor 901 and a memory 902.
  • the processor 901 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so on.
  • the processor 901 can be implemented in at least one hardware form among DSP (Digital Signal Processing), FPGA (Field Programmable Gate Array), PLA (Programmable Logic Array, Programmable Logic Array) .
  • the processor 901 may also include a main processor and a coprocessor.
  • the main processor is a processor used to process data in the awake state, also called a CPU (Central Processing Unit, central processing unit); the coprocessor is A low-power processor used to process data in the standby state.
  • the processor 901 may be integrated with a GPU (Graphics Processing Unit, image processor), and the GPU is used for rendering and drawing content that needs to be displayed on the display screen.
  • the processor 901 may further include an AI (Artificial Intelligence) processor, and the AI processor is used to process calculation operations related to machine learning.
  • the memory 902 may include one or more computer-readable storage media, which may be non-transitory.
  • the memory 902 may also include high-speed random access memory and non-volatile memory, such as one or more magnetic disk storage devices and flash memory storage devices.
  • The non-transitory computer-readable storage medium in the memory 902 is used to store a computer program, and the computer program is to be executed by the processor 901 to implement the video target tracking method provided in the method embodiments of the present application.
  • the terminal 900 may optionally further include: a peripheral device interface 903 and at least one peripheral device.
  • the processor 901, the memory 902, and the peripheral device interface 903 may be connected by a bus or a signal line.
  • Each peripheral device can be connected to the peripheral device interface 903 through a bus, a signal line, or a circuit board.
  • the peripheral device may include: at least one of a radio frequency circuit 904, a display screen 905, a camera 906, an audio circuit 907, a positioning component 908, and a power supply 909.
  • Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
  • The structure shown in FIG. 9 does not constitute a limitation on the computer device 900, which may include more or fewer components than shown, combine certain components, or adopt a different component arrangement.
  • In an exemplary embodiment, a computer device is also provided, which includes a processor and a memory, and the memory stores at least one instruction, at least one program, a code set, or an instruction set.
  • the at least one instruction, at least one program, code set, or instruction set is configured to be executed by one or more processors to implement the video target tracking method described above.
  • In an exemplary embodiment, a computer-readable storage medium is also provided, which stores at least one instruction, at least one program, a code set, or an instruction set; the at least one instruction, the at least one program, the code set, or the instruction set, when executed by the processor of the computer device, implements the above video target tracking method.
  • the aforementioned computer-readable storage medium may be ROM, random access memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.
  • In an exemplary embodiment, a computer program product is also provided; when the computer program product is executed, it is used to implement the aforementioned video target tracking method.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

A video target tracking method and apparatus, a computer device, and a storage medium. The method includes: acquiring a local detection map corresponding to a target image frame in a video to be detected (201); acquiring a relative motion saliency map corresponding to the target image frame (202); determining, according to the local detection map and the relative motion saliency map, constraint information corresponding to the target image frame (203); adjusting parameters of an image segmentation model by means of the constraint information to obtain an adjusted image segmentation model (204); and extracting a target object in the target image frame by means of the adjusted image segmentation model (205).

Description

Video target tracking method and apparatus, computer device, and storage medium
This application claims priority to Chinese Patent Application No. 2019104473793, entitled "Video target tracking method, apparatus, device, and storage medium", filed with the Chinese Patent Office on May 27, 2019, the entire contents of which are incorporated herein by reference.
Technical Field
The embodiments of this application relate to the field of image recognition technology, and in particular to a video target tracking method and apparatus, a computer device, and a storage medium.
Background
Video target tracking technology refers to tracking a target object of interest in a video and identifying the target object in each image frame of the video.
In the related art, a video target tracking method based on semi-supervised learning is provided. First, an image segmentation model is trained on a set of training samples. Then, the parameters of the image segmentation model are adjusted using the first image frame of the video to be detected, so that the image segmentation model is adapted to extracting the target object in the video to be detected; the position of the target object in the first image frame can be marked manually. Finally, the adjusted image segmentation model is used to identify the target object in the subsequent image frames of the video to be detected.
When the apparent difference between the first image frame and the subsequent image frames of the video to be detected is large, the adjusted image segmentation model cannot accurately identify the target object in the subsequent image frames. In most cases, as the appearance information changes, the model's predictions become very inaccurate.
Summary
According to various embodiments of this application, a video target tracking method and apparatus, a computer device, and a storage medium are provided.
A video target tracking method, executed by a computer device, the method including:
acquiring a local detection map corresponding to a target image frame in a video to be detected, the local detection map being generated based on appearance information of a target object that needs to be tracked by an image segmentation model in the video to be detected;
acquiring a relative motion saliency map corresponding to the target image frame, the relative motion saliency map being generated based on motion information of the target object;
determining, according to the local detection map and the relative motion saliency map, constraint information corresponding to the target image frame, the constraint information including absolute positive sample pixels, absolute negative sample pixels, and uncertain sample pixels in the target image frame;
adjusting parameters of the image segmentation model by means of the constraint information to obtain an adjusted image segmentation model; and
extracting the target object in the target image frame by means of the adjusted image segmentation model.
A video target tracking apparatus, the apparatus including:
a detection map acquisition module, configured to acquire a local detection map corresponding to a target image frame in a video to be detected, the local detection map being generated based on appearance information of a target object that needs to be tracked by an image segmentation model in the video to be detected;
a motion map acquisition module, configured to acquire a relative motion saliency map corresponding to the target image frame, the relative motion saliency map being generated based on motion information of the target object;
a constraint information acquisition module, configured to determine, according to the local detection map and the relative motion saliency map, constraint information corresponding to the target image frame, the constraint information including absolute positive sample pixels, absolute negative sample pixels, and uncertain sample pixels in the target image frame;
a model adjustment module, configured to adjust parameters of the image segmentation model by means of the constraint information to obtain an adjusted image segmentation model; and
a target segmentation module, configured to extract the target object in the target image frame by means of the adjusted image segmentation model.
A computer device, including a processor and a memory, the memory storing at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set being loaded and executed by the processor to implement the above video target tracking method.
A computer-readable storage medium, storing at least one instruction, at least one program, a code set, or an instruction set, the at least one instruction, the at least one program, the code set, or the instruction set being loaded and executed by a processor to implement the above video target tracking method.
A computer program product, which, when executed, is used to perform the above video target tracking method.
The details of one or more embodiments of this application are set forth in the following drawings and description. Other features and advantages of this application will become apparent from the specification, the drawings, and the claims.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of this application more clearly, the following briefly introduces the drawings required for describing the embodiments. Apparently, the drawings in the following description show merely some embodiments of this application, and a person of ordinary skill in the art may derive other drawings from these drawings without creative efforts.
FIG. 1a exemplarily shows a schematic diagram of an application environment for video target tracking;
FIG. 1b exemplarily shows a schematic diagram of video target tracking;
FIG. 2 is a flowchart of a video target tracking method provided by an embodiment of this application;
FIG. 3 exemplarily shows a schematic diagram of the overall flow of the technical solution of this application;
FIG. 4 exemplarily shows a schematic diagram of the parameter adjustment process of the target detection model;
FIG. 5 exemplarily shows an architecture diagram of an image segmentation model;
FIG. 6 exemplarily shows a schematic diagram of samples extracted by the traditional method and the method of this application;
FIG. 7 is a block diagram of a video target tracking apparatus provided by an embodiment of this application;
FIG. 8 is a block diagram of a video target tracking apparatus provided by another embodiment of this application;
FIG. 9 is a structural block diagram of a computer device provided by an embodiment of this application.
Detailed Description
To make the objectives, technical solutions, and advantages of this application clearer, the following further describes implementations of this application in detail with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely intended to explain this application, and are not intended to limit this application.
The video target tracking method provided in this application can be applied to the application environment shown in FIG. 1a, in which a computer device 102 and a video capture device 104 communicate through a network.
The computer device 102 can acquire the video to be detected from the video capture device 104 and acquire a local detection map corresponding to a target image frame in the video to be detected, where the local detection map is generated based on appearance information of the target object that needs to be tracked by an image segmentation model in the video to be detected, and the image segmentation model is a neural network model used to segment and extract the target object from the image frames of the video to be detected; acquire a relative motion saliency map corresponding to the target image frame, where the relative motion saliency map is generated based on motion information of the target object; determine, according to the local detection map and the relative motion saliency map, constraint information corresponding to the target image frame, where the constraint information includes absolute positive sample pixels, absolute negative sample pixels, and uncertain sample pixels in the target image frame; adjust parameters of the image segmentation model by means of the constraint information to obtain an adjusted image segmentation model; and extract the target object in the target image frame by means of the adjusted image segmentation model.
The computer device 102 may be implemented by an independent server or a server cluster composed of multiple servers. The video capture device 104 may include a surveillance camera or a terminal with a camera.
Video target tracking technology can be used in many different application scenarios. For example, in security scenarios, a suspect in a surveillance video can be tracked and identified. For another example, in the application scenario of video analysis and processing, image frames containing a specific character in a movie or TV series can be extracted, so as to assemble a video segment of that character.
As shown in FIG. 1b, which exemplarily shows a schematic diagram of video target tracking, FIG. 1b contains multiple image frames of a video, labeled 11, 12, 13, and 14. To track the person and the vehicle in each image frame of the video, an image segmentation model can be trained; each image frame is input into the image segmentation model, and the image segmentation model segments and extracts the person and the vehicle from it. For example, the person and the vehicle can each be labeled with a mask, so that they are marked in the image frame.
In the method provided by the embodiments of this application, the execution subject of each step is a computer device. The computer device can be any electronic device with computing, processing, and storage capabilities. For example, the computer device can be a PC (Personal Computer) or a server, or a terminal device such as a mobile phone, a tablet computer, a multimedia player, a wearable device, or a smart TV, or other devices such as a drone or a vehicle-mounted terminal, which is not limited in the embodiments of this application.
For ease of description, in the following method embodiments, the execution subject of each step is described as a computer device, but this does not constitute a limitation.
请参考图2,其示出了本申请一个实施例提供的视频目标跟踪方法的流程图。该方法可以包括以下几个步骤(201~205):
步骤201,获取待检测视频中的目标图像帧对应的局部检测图。
当需要对待检测视频中的目标对象进行跟踪时,可以给定一个图像帧,在该图像帧中标注出目标对象的掩膜,后续通过图像分割模型从该待检测视频的 其它图像帧中分割提取出该目标对象。目标对象可以是人,也可以是物,本申请实施例对此不作限定。可选地,在待检测视频的第一个图像帧中标注出目标对象的掩膜,然后通过图像分割模型从该待检测视频的后续图像帧中分割提取出该目标对象。另外,在上述给定的图像帧(如第一个图像帧)中标注出目标对象的掩膜,可以由人工标注完成。
目标图像帧可以是待检测视频中的任意一个未标注出目标对象的图像帧,也即需要通过图像分割模型从中提取出目标对象的图像帧。
局部检测图是基于需要跟踪的目标对象的表观信息生成的。其中,表观信息是指从视觉上能够分辨的信息,如颜色、形状、纹理等信息。在示例性实施例中,通过目标检测模型对目标图像帧进行处理,得到该目标图像帧对应的局部检测图。目标检测模型可以是对卷积神经网络进行训练得到的模型。局部检测图的尺寸与目标图像帧的尺寸相同。例如,目标图像帧的尺寸为800*600像素,则局部检测图的尺寸也为800*600像素。可选地,局部检测图中目标像素的值,反映了目标图像帧中该相同位置处的目标像素属于目标对象的概率,且该概率是基于目标像素的表观信息确定的。
在本申请实施例中,通过图像分割模型对待检测视频中的目标对象进行跟踪识别。图像分割模型是用于从待检测视频的图像帧中分割提取出目标对象的神经网络模型,该图像分割模型可以是基于卷积神经网络构建的深度学习模型。在本申请实施例中,为了确保图像分割模型在目标对象跟踪时的分割准确度,需要对该图像分割模型进行在线自适应训练,对该模型的参数(如神经网络的权重)进行调整,再利用调整后的图像分割模型进行目标对象分割。
在示例性实施例中,本步骤可以包括如下几个子步骤:
1、从待检测视频的已标注图像帧中,选取至少一个训练样本;
训练样本用于对目标检测模型进行训练,以对该目标检测模型的参数进行调整优化。训练样本包括已标注图像帧以及该已标注图像帧对应的检测目标框。已标注图像帧是指已经标注出目标对象的掩膜的图像帧。已标注图像帧可以包括上文介绍的由人工标注目标对象的掩膜的图像帧,也可以包括由图像分割模型标注出目标对象的掩膜的图像帧。
对于任意一个已标注图像帧来说,其对应的检测目标框可以有多个。一个训练样本包括一个已标注图像帧,以及这个已标注图像帧对应的一个检测目标 框。因此,从一个已标注图像帧中,可以选取得到多个训练样本。检测目标框是指目标对象的占比大于预设阈值的图像区域。假设在某一个已标注图像帧中添加一个框,这个框内的图像区域中,可能有一部分属于目标对象,有一部分不属于目标对象,通过计算属于目标对象的部分在这个框中的像素占比,如果像素占比大于预设阈值,则将这个框确定为检测目标框;否则,这个框就不被确定为检测目标框。该预设阈值可以根据实际需求预先进行设定,示例性地,该预设阈值为0.5。另外,上文所述的框可以是矩形,也可以是其它形状,本申请实施例对此不作限定。
在示例性实施例中,通过如下方式选取训练样本:在已标注图像帧中随机撒框,计算目标对象在框中的占比,若目标对象在框中的占比大于预设阈值,则将该框确定为已标注图像帧对应的检测目标框,并将该已标注图像帧和该检测目标框选取为训练样本(该选取过程的示意代码见本步骤末尾)。
2、通过训练样本对目标检测模型的参数进行调整,得到调整后的目标检测模型;
可选地,选用Faster-RCNN网络作为目标检测模型的框架。通过上文选取的训练样本对该目标检测模型的参数(如网络权重)进行微调,得到调整后的目标检测模型。示例性地,在通过训练样本对目标检测模型的参数进行调整的过程中,批尺寸(batch size)可以为1,微调600轮,且框的尺寸、长宽比等在训练过程中均可以调整,以最终训练出一个较高精度的目标检测模型。
3、通过调整后的目标检测模型对目标图像帧进行处理,得到局部检测图。
将目标图像帧输入至调整后的目标检测模型,即可得到该目标图像帧对应的局部检测图。
在示例性实施例中,待检测视频中的第1个图像帧中的目标对象的掩膜,由人工标注得到,从第2个图像帧开始,依次往后分割提取出目标对象。如果需要获取待检测视频中的第i个图像帧对应的局部检测图,i为大于1的整数,则可以从第1个图像帧和第i-1个图像帧中选取至少一个训练样本,通过该训练样本对当前的目标检测模型的参数进行调整,得到调整后的目标检测模型,然后再通过该调整后的目标检测模型对第i个图像帧进行处理,得到该第i个图像帧对应的局部检测图。
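作为对上述“在已标注图像帧中随机撒框、按目标像素占比选取检测目标框”这一选样过程的示意性说明,下面给出一段Python代码草图。其中掩膜以0/1数组表示、占比阈值取0.5,撒框次数、框尺寸范围等均为示例性假设,并非本申请的限定实现:

```python
import numpy as np

def sample_detection_boxes(mask, num_boxes=200, ratio_thresh=0.5, rng=None):
    """在已标注图像帧上随机撒框,保留目标像素占比大于阈值的框作为检测目标框。

    mask: 形状为(H, W)的0/1数组,1表示该像素属于目标对象;
    返回: [(x1, y1, x2, y2), ...]形式的检测目标框列表。
    """
    rng = rng or np.random.default_rng()
    h, w = mask.shape
    boxes = []
    for _ in range(num_boxes):
        # 随机生成框的宽高与左上角位置(尺寸范围仅为示例取值)
        bw = int(rng.integers(16, max(18, w // 2)))
        bh = int(rng.integers(16, max(18, h // 2)))
        x1 = int(rng.integers(0, max(1, w - bw)))
        y1 = int(rng.integers(0, max(1, h - bh)))
        region = mask[y1:y1 + bh, x1:x1 + bw]
        # 计算目标对象在框中的像素占比,大于阈值则保留为检测目标框
        if region.mean() > ratio_thresh:
            boxes.append((x1, y1, x1 + bw, y1 + bh))
    return boxes
```

按此方式得到的(已标注图像帧,检测目标框)样本对,即可用于对Faster-RCNN等目标检测模型的参数进行微调。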
步骤202,获取目标图像帧对应的相对运动显著图。
相对运动显著图是基于目标对象的运动信息生成的。目标对象在待检测视频的各个图像帧中的位置可能并不是静止不变的,其可能会发生运动。例如,当目标对象为人、动物、车辆等可移动对象时,其在待检测视频的各个图像帧中的位置会发生改变。运动信息反映了该目标对象的运动情况,也即在不同图像帧中的位置变化情况。在示例性实施例中,通过检测相邻图像帧之间的光流,确定出相对运动显著图,该光流即反映了目标对象的运动信息。在计算机视觉领域,光流是指视频图像中各点像素随时间的运动情况。光流具有丰富的运动信息,因而在运动估计、自动驾驶和行为识别方面都有广泛应用。相对运动显著图与目标图像帧的尺寸相同。例如,目标图像帧的尺寸为800*600像素,则相对运动显著图的尺寸也为800*600像素。可选地,相对运动显著图中目标像素的值,反映了目标图像帧中该相同位置处的目标像素属于目标对象的概率,且该概率是基于目标像素的运动信息确定的。
在示例性实施例中,本步骤可以包括如下几个子步骤:
1、计算目标图像帧与邻近图像帧之间的光流;
邻近图像帧是指待检测视频中与目标图像帧位置邻近的图像帧。邻近图像帧的数量可以是一个,也可以是多个,本申请实施例对此不作限定。邻近图像帧可以包括在先图像帧,也可以包括在后图像帧,还可以同时包括在先图像帧和在后图像帧。其中,在先图像帧是指待检测视频中位于目标图像帧之前的图像帧,在后图像帧是指待检测视频中位于目标图像帧之后的图像帧。可选地,在先图像帧为目标图像帧的前一个图像帧,在后图像帧为目标图像帧的后一个图像帧。例如,目标图像帧为第i个图像帧,则在先图像帧为第i-1个图像帧,在后图像帧为第i+1个图像帧,i为大于1的整数。如果在计算目标图像帧对应的光流时,综合考虑目标图像帧与其前一个图像帧之间的光流以及后一个图像帧之间的光流,效果更佳。
可选地,使用FlowNet2作为计算目标图像帧与邻近图像帧之间的光流的基础模型。FlowNet2是利用CNN(Convolutional Neural Networks,卷积神经网络)提取光流的模型,具有速度快、精度高等优势。
2、根据光流生成相对运动显著图。
在通过上述步骤得到目标图像帧对应的光流之后,根据该光流生成目标图像帧对应的相对运动显著图。
在示例性实施例中,通过如下方式生成相对运动显著图:
2.1、根据局部检测图中的背景区域的光流,确定背景光流;
其中,局部检测图中的背景区域,是指局部检测图中检测出的目标对象所在区域之外的剩余区域。根据目标检测模型输出的目标图像帧对应的局部检测图,可以确定出目标对象所在区域和背景区域。可选地,将背景区域中各像素的光流的平均值,作为背景光流。
2.2、根据背景光流以及目标图像帧对应的光流,生成目标图像帧对应的相对运动显著图。
在示例性实施例中,通过RMS(Root Mean Square,均方根)计算各像素的光流与背景光流之间的差异,得到目标图像帧对应的相对运动显著图。可选地,为了使均方根值更稳定,可以增加绝对光流的二范数,并采用两部分的比值为1:1,即采用如下公式计算相对运动显著图中像素(m,n)的值$RMS_{m,n}$:

$$RMS_{m,n}=\frac{1}{2}\left\|O_{m,n}-\psi\right\|_2+\frac{1}{2}\left\|O_{m,n}\right\|_2$$

其中,$O_{m,n}$为像素(m,n)的光流,$\psi$为背景光流。
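下面给出根据上式计算相对运动显著图的一个最小化Python代码草图,假设逐像素光流以形状为(H, W, 2)的数组给出、背景区域以布尔数组标记且非空;末尾的归一化处理以及变量命名均为示例性假设:

```python
import numpy as np

def relative_motion_saliency(flow, background, eps=1e-6):
    """flow: (H, W, 2)的光流场;background: (H, W)的布尔数组,True表示背景区域。"""
    # 背景光流ψ取背景区域内各像素光流的平均值
    psi = flow[background].mean(axis=0)
    # 两部分按1:1融合:与背景光流的差异 + 绝对光流的二范数
    diff = np.linalg.norm(flow - psi, axis=-1)
    mag = np.linalg.norm(flow, axis=-1)
    saliency = 0.5 * diff + 0.5 * mag
    # 可选:归一化到[0, 1],便于后续按阈值与局部检测图融合
    return saliency / (saliency.max() + eps)
```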
步骤203,根据局部检测图和相对运动显著图,确定目标图像帧对应的约束信息。
约束信息包括目标图像帧中的绝对正样本像素、绝对负样本像素和不确定样本像素。其中,绝对正样本像素是指目标图像帧中,基于上述表观信息和运动信息,确定出来的属于目标对象的像素。绝对负样本像素是指目标图像帧中,基于上述表观信息和运动信息,确定出来的不属于目标对象的像素。不确定样本像素是指目标图像帧中,基于上述表观信息和运动信息,还无法确定其是否属于目标对象的像素。在本申请实施例中,约束信息也可以称为约束流。
可选地,对于目标图像帧中的目标像素,若目标像素在局部检测图中的值符合第一预设条件,且目标像素在相对运动显著图中的值符合第二预设条件,则确定目标像素为绝对正样本像素;若目标像素在局部检测图中的值不符合第一预设条件,且目标像素在相对运动显著图中的值不符合第二预设条件,则确定目标像素为绝对负样本像素;若目标像素在局部检测图中的值符合第一预设条件,且目标像素在相对运动显著图中的值不符合第二预设条件,或目标像素在局部检测图中的值不符合第一预设条件,且目标像素在相对运动显著图中的值符合第二预设条件,则确定目标像素为不确定样本像素。其中,第一预设条 件和第二预设条件可以根据实际情况预先设定。
在一个示例中,第一预设条件为大于第一阈值,第二预设条件为大于第二阈值。示例性地,第一阈值为0.7,第二阈值为0.5。该第一阈值和第二阈值可以根据实际情况预先设定,上述仅是示例。
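将局部检测图与相对运动显著图按上述条件融合为三类约束信息的过程,可以用如下示意代码表达。阈值0.7与0.5取自上文示例;返回值中1、0、-1分别表示绝对正样本像素、绝对负样本像素和不确定样本像素,这一编码方式仅为示例约定:

```python
import numpy as np

def build_constraint_map(det_map, motion_map, t1=0.7, t2=0.5):
    """根据局部检测图det_map与相对运动显著图motion_map,逐像素生成约束信息。"""
    pos_det = det_map > t1      # 表观信息支持该像素属于目标对象
    pos_mot = motion_map > t2   # 运动信息支持该像素属于目标对象
    constraint = np.full(det_map.shape, -1, dtype=np.int8)  # 默认记为不确定样本像素
    constraint[pos_det & pos_mot] = 1     # 两者一致支持:绝对正样本像素
    constraint[~pos_det & ~pos_mot] = 0   # 两者一致不支持:绝对负样本像素
    return constraint
```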
步骤204,通过约束信息对图像分割模型的参数进行调整,得到调整后的图像分割模型。
在得到目标图像帧对应的约束信息之后,可以利用该约束信息对图像分割模型进行自适应学习,对其参数进行微调,提升其从目标图像帧中分割提取目标对象时的准确度。
在示例性实施例中,为了进一步提高图像分割模型的准确度,采用绝对正样本像素和绝对负样本像素,对图像分割模型的参数进行调整,得到调整后的图像分割模型。也即,在对图像分割模型的参数进行调整时,仅采用绝对正样本像素和绝对负样本像素,而不考虑不确定样本像素。
可选地,图像分割模型的损失函数可以采用交叉熵损失函数,其表达式为:
$$L=-\sum_{j\in Y_{+}}\log P\left(y_{j}=1\mid x\right)-\sum_{j\in Y_{-}}\log P\left(y_{j}=0\mid x\right)$$

其中,L表示损失函数的值,x为目标图像帧,Y是目标图像帧x的像素级别的约束信息,$Y_{+}$和$Y_{-}$分别为绝对正样本像素和绝对负样本像素,$P(\cdot)$是图像分割模型对目标图像帧x的预测结果。该损失函数的表达式与传统的损失函数的表达式的不同在于,该损失函数的表达式不计算不确定样本像素的损失,这样可以忽略不置信的区域,更好地学习置信区域。
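该损失函数只在绝对正、负样本像素上累计交叉熵,忽略不确定样本像素。下面用PyTorch给出一个与上式含义一致的示意实现,其中pred为模型输出的前景概率图,constraint为前文约定的1/0/-1三值约束图,张量形状与命名均为示例性假设:

```python
import torch

def constrained_bce_loss(pred, constraint, eps=1e-7):
    """pred: (H, W)的前景概率;constraint: 同尺寸张量,1/0/-1分别表示正/负/不确定样本像素。"""
    pred = pred.clamp(eps, 1.0 - eps)
    pos = constraint == 1
    neg = constraint == 0
    loss_pos = -torch.log(pred[pos]).sum()        # 绝对正样本像素的交叉熵
    loss_neg = -torch.log(1.0 - pred[neg]).sum()  # 绝对负样本像素的交叉熵
    num = (pos.sum() + neg.sum()).clamp(min=1)
    return (loss_pos + loss_neg) / num            # 不确定样本像素不参与损失
```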
步骤205,通过调整后的图像分割模型,提取目标图像帧中的目标对象。
在得到调整后的图像分割模型之后,将该目标图像帧输入至调整后的图像分割模型,分割提取出该目标图像帧中的目标对象。
需要说明的一点是,图像分割模型可以每一个图像帧进行一次自适应调整训练,也可以每隔若干个图像帧(如5个图像帧)进行一次自适应调整训练。考虑到邻近图像帧中目标对象的位置变化较小,因此图像分割模型每隔若干个图像帧进行一次自适应调整训练,可以在尽可能保证模型精度不受损失的前提下,减少计算量,提升对整个视频的处理效率。另外,每一次自适应调整训练,可以训练一轮,也可以训练多轮(如3轮),本申请实施例对此不作限定。
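“每隔若干个图像帧进行一次自适应调整训练”的处理流程,可以概括为如下示意性的在线处理循环。其中adapt与segment分别代表前文所述的参数微调步骤与分割步骤,间隔K=5与训练轮数3均为示例取值:

```python
K = 5  # 每隔5个图像帧进行一次自适应调整训练(示例取值)

def track_video(frames, model, adapt, segment):
    """frames为待检测视频的图像帧序列,返回各帧的目标对象掩膜。"""
    masks = []
    for i, frame in enumerate(frames):
        if i % K == 0:
            # 基于当前帧的约束信息,对图像分割模型的参数进行一次(多轮)微调
            model = adapt(model, frame, rounds=3)
        masks.append(segment(model, frame))
    return masks
```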
在本申请实施例中,通过约束信息对图像分割模型的参数进行调整,由于 约束信息综合了目标对象的表观信息和运动信息,一方面可以克服待检测视频中目标对象在不同图像帧中表观差异大的问题,另一方面可以减少自适应学习过程中的误差传播,同时通过这两部分的互补,可以生成更为准确的每一次模型参数更新的指导,从而更好地约束模型参数的调整过程。
结合参考图3,其示例性示出了本申请技术方案的整体流程的示意图。以对待检测视频中的目标图像帧进行分割为例,通过目标检测模型提取目标图像帧对应的检测目标框,并进一步得到局部检测图,通过光流模型提取目标图像帧对应的光流,并进一步计算出目标图像帧对应的相对运动显著图,融合局部检测图和相对运动显著图,得到约束信息。通过该约束信息和损失函数,对图像分割模型的参数进行调整,得到调整后的图像分割模型。最后通过该调整后的图像分割模型,提取目标图像帧中的目标对象。图像分割模型可以包括特征提取器、空间空洞卷积模块、反卷积上采样模块等组成部分。有关图像分割模型的具体结构可参见下文实施例中的介绍说明。
另外,如图4所示,其示例性示出了目标检测模型的参数调整过程的示意图。在已标注图像帧中随机选取框,计算目标对象在框中的占比,基于该占比选取目标检测模型的训练样本。通过训练样本对目标检测模型的参数进行微调,得到调整后的目标检测模型。之后,将目标图像帧输入至调整后的目标检测模型,得到该目标图像帧对应的局部检测图。
综上所述,本申请实施例提供的技术方案中,通过约束信息对图像分割模型的参数进行调整,由于约束信息是综合目标对象的表观信息和运动信息两方面因素得到的,一方面可以克服待检测视频中目标对象在不同图像帧中表观差异大的问题,另一方面可以减少自适应学习过程中的误差传播,同时通过这两部分的互补,可以生成更为准确的每一次模型参数更新的指导,从而更好地约束模型参数的调整过程,使得参数调整后的图像分割模型的性能更优,最终从目标图像帧中分割提取的目标对象的准确度更高。
另外,通过计算前后图像帧之间的光流,以此来体现目标对象在前后图像帧中的运动信息,能够更加准确地对运动信息进行表征。
另外,在通过约束信息对图像分割模型的参数进行调整时,仅考虑绝对正样本像素和绝对负样本像素的损失,排除掉不确定样本像素的损失,有助于进一步提升图像分割模型的准确度。
在示例性实施例中,图像分割模型的预训练过程如下:
1、构建初始的图像分割模型;
2、采用第一样本集对初始的图像分割模型进行初步训练,得到初步训练后的图像分割模型;
3、采用第二样本集对初步训练后的图像分割模型进行再训练,得到预训练完成的图像分割模型。
初始的图像分割模型可以是一个端到端的可训练卷积神经网络,其输入是一个图像,输出是该图像中目标的掩膜。在一个示例中,选用Deeplab V3+作为端到端的可训练卷积神经网络,在网络获得输入的三通道图片信息后,可以返回一个同等大小的预测掩膜图。如图5所示,其示例性示出了一种图像分割模型的架构图。其先使用ResNet卷积神经网络作为基础特征提取器,在第五层ResNet模型后增加ASPP(Atrous Spatial Pyramid Pooling,Atrous空间金字塔池化)模块,运用不同尺度的空洞卷积(Atrous Convolution)处理输出特征,融合第三层ResNet模型提取得到的特征,这样可以更好的恢复各个尺度上的分割预测结果,再通过反卷积或者上采样把网络学到的特征返回高分辨率,这可以有效提高图像分割模型的准确率。对应视频中的每一帧,网络会输出一张相应尺度的响应图,这个响应图就是分割的概率预测结果。随着ResNet网络深度的增加,相应的提取特征的能力也会增加,网络模型的参数同样会增多,训练时间也会增加。本申请实施例选用ResNet 101网络作为Deeplab V3+特征提取器的基础网络。在基础卷积神经网络后,接ASPP模块,同时引入第三层ResNet模型提取得到的特征,加入解卷积过程,和两个解卷积上采样模块,以得到高分辨率的分割结果预测图。
第一样本集中包含至少一个带标注的图片,第二样本集中包含至少一个带标注的视频。示例性地,选用Pascal VOC数据库作为第一样本集,Pascal VOC数据库拥有2913个像素级别标注的图片分割数据。通过学习图像的语义分割,可以更好的训练图像分割模型。初步训练可以使用批尺寸大小为4,训练8000轮。示例性地,选用DAVIS16数据库作为第二样本集,使图像分割模型适应于目标分割任务。DAVIS16数据库拥有50个像素级别标注的视频,一共3455帧,其中30个用来训练,20个用来测试。可选地,在训练图像分割模型的过程中, 可以对样本进行数据扩充,例如把原始图像扩充到多个不同的尺度上,如将原始图像的尺寸缩放0.8倍、1.2倍和1.6倍,从而使得图像分割模型能够适应不同尺度的图像。可选地,选取初始学习率为0.001,并设置每一批学习4个样本,并每隔2400轮下降为原学习率的1/10,一共训练6000轮,最终得到预训练完成的图像分割模型。
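上述预训练中的多尺度数据扩充与学习率衰减策略,可以用如下PyTorch示意代码表达。其中尺度0.8/1.2/1.6、初始学习率0.001、每2400轮衰减为原来的1/10等取自上文描述,优化器类型与momentum等其余超参数为示例性假设:

```python
import torch
import torch.nn.functional as F

SCALES = (0.8, 1.0, 1.2, 1.6)  # 将原始图像缩放到多个尺度以扩充训练样本

def multi_scale_augment(image):
    """image: (C, H, W)的图像张量,返回不同尺度下的图像列表。"""
    _, h, w = image.shape
    return [
        F.interpolate(image.unsqueeze(0), size=(int(h * s), int(w * s)),
                      mode="bilinear", align_corners=False).squeeze(0)
        for s in SCALES
    ]

def make_optimizer_and_scheduler(model):
    # 初始学习率0.001,每2400轮下降为原来的1/10,共训练6000轮
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2400, gamma=0.1)
    return optimizer, scheduler
```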
需要说明的一点是,上述图像分割模型的预训练过程,可以在执行上文介绍的视频目标跟踪方法的计算机设备中执行,也可以在该计算机设备之外的其它设备中执行,然后其它设备将预训练完成的图像分割模型提供给计算机设备,由该计算机设备利用该预训练完成的图像分割模型,执行上述视频目标跟踪方法。不论图像分割模型的预训练过程是在计算机设备还是在其它设备执行,计算机设备在对待检测视频进行视频目标跟踪时,均需要采用该待检测视频对预训练完成的图像分割模型的参数进行自适应学习和调整,使得该图像分割模型能够对每一帧输出准确的分割结果。
传统的在线自适应的视频目标跟踪方法,每一帧对图像分割模型进行一次自适应训练过程,学习调整模型参数,调整的依据是前一帧的预测结果。例如,通过对前一帧的预测结果使用侵蚀算法生成绝对正样本像素,再设置绝对正样本一定欧氏距离之外的像素作为绝对负样本像素,通过这样的约束条件来指导模型参数的调整,最后用调整后的图像分割模型预测待检测的目标图像帧的分割结果。
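作为对比,上述传统方法中“对前一帧的预测结果做侵蚀得到绝对正样本像素,再以一定欧氏距离之外的像素作为绝对负样本像素”的做法,大致可以写成如下草图,其中侵蚀迭代次数与距离阈值均为示例取值:

```python
import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt

def legacy_constraints(prev_mask, erode_iter=5, neg_dist=40):
    """prev_mask: 前一帧预测得到的0/1掩膜;返回1/0/-1编码的约束图。"""
    fg = prev_mask.astype(bool)
    # 对前一帧预测结果做侵蚀,收缩后的区域作为绝对正样本像素
    pos = binary_erosion(fg, iterations=erode_iter)
    # 距离前景区域超过neg_dist个像素的位置作为绝对负样本像素
    dist = distance_transform_edt(~fg)
    neg = dist > neg_dist
    constraint = np.full(prev_mask.shape, -1, dtype=np.int8)
    constraint[pos] = 1
    constraint[neg] = 0
    return constraint
```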
相比于本申请实施例提供的方法,传统方法会更加依赖之前帧的准确性,也更加粗略,很难获得细节信息,而本申请实施例提供的方法可以更好的考虑到运动信息和表观信息,从而监督自适应学习过程,此外还可以更好的保持局部细节。采用本申请实施例提供的方法,在自适应学习过程标注出的绝对正样本像素和绝对负样本像素更加准确可靠,且不确定样本像素的数量更少。如图6所示,其示例性示出了采用本申请实施例提供的方法,在自适应学习过程标注出的绝对正样本像素、绝对负样本像素和不确定样本像素的示意图,图6中白色区域61中的像素为绝对正样本像素,黑色区域62中的像素为绝对负样本像素,灰色区域63中的像素为不确定样本像素。从图6中可以看出,不确定样本像素的占比很少,且具有更为准确可靠的边缘。
经实验得出,采用本申请实施例提供的方法,约束信息的准确度可以如下表-1所示:
表-1(约束信息的正负样本正确率与不确定样本像素占比统计,具体数值在原文中以图表形式给出)
从上述表-1中可以看出,采用本申请实施例提供的方法得到的约束信息,不但正负样本的正确率高,而且不确定样本占比少,所以可以说明本申请实施例提供的方法的有效性。尤其是对于不适合掩膜传播的视频序列,也即所要跟踪的目标对象是运动的对象时,本申请实施例提供的方法所得到的结果更加突出。此外,对于目标对象的表观清晰,特征明显的分割问题,本申请实施例提供的方法可以得到非常准确的结果。
本申请实施例提供的方法能显著提高视频目标分割的精度,更好的考虑了目标对象的运动信息和表观信息的融合,对视频目标分割中的遮挡、外观变化大、背景杂乱等特殊情况,能够对模型的自适应学习过程进行有效的约束,且通过引入的优化后的损失函数约束模型的学习过程,实现视频中的目标分割准确率的提升。
应该理解的是,虽然图2的流程图中的各个步骤按照箭头的指示依次显示,但是这些步骤并不是必然按照箭头指示的顺序依次执行。除非本文中有明确的说明,这些步骤的执行并没有严格的顺序限制,这些步骤可以以其它的顺序执行。而且,图2中的至少一部分步骤可以包括多个子步骤或者多个阶段,这些子步骤或者阶段并不必然是在同一时刻执行完成,而是可以在不同的时刻执行,这些子步骤或者阶段的执行顺序也不必然是依次进行,而是可以与其它步骤或者其它步骤的子步骤或者阶段的至少一部分轮流或者交替地执行。
下述为本申请装置实施例,可以用于执行本申请方法实施例。对于本申请装置实施例中未披露的细节,请参照本申请方法实施例。
请参考图7,其示出了本申请一个实施例提供的视频目标跟踪装置的框图。 该装置具有实现上述方法示例的功能,所述功能可以由硬件实现,也可以由硬件执行相应的软件实现。该装置可以是计算机设备,也可以设置在计算机设备上。该装置700可以包括:检测图获取模块710、运动图获取模块720、约束信息获取模块730、模型调整模块740和目标分割模块750。
检测图获取模块710,用于获取待检测视频中的目标图像帧对应的局部检测图,所述局部检测图是基于所述待检测视频中需要通过图像分割模型跟踪的目标对象的表观信息生成的,所述图像分割模型是用于从所述待检测视频的图像帧中分割提取出所述目标对象的神经网络模型。
运动图获取模块720,用于获取所述目标图像帧对应的相对运动显著图,所述相对运动显著图是基于所述目标对象的运动信息生成的。
约束信息获取模块730,用于根据所述局部检测图和所述相对运动显著图,确定所述目标图像帧对应的约束信息,所述约束信息包括所述目标图像帧中的绝对正样本像素、绝对负样本像素和不确定样本像素。
模型调整模块740,用于通过所述约束信息对图像分割模型的参数进行调整,得到调整后的图像分割模型。
目标分割模块750,用于通过所述调整后的图像分割模型,提取所述目标图像帧中的所述目标对象。
综上所述,本申请实施例提供的技术方案中,通过约束信息对图像分割模型的参数进行调整,由于约束信息是综合目标对象的表观信息和运动信息两方面因素得到的,一方面可以克服待检测视频中目标对象在不同图像帧中表观差异大的问题,另一方面可以减少自适应学习过程中的误差传播,同时通过这两部分的互补,可以生成更为准确的每一次模型参数更新的指导,从而更好地约束模型参数的调整过程,使得参数调整后的图像分割模型的性能更优,最终从目标图像帧中分割提取的目标对象的准确度更高。
在示例性实施例中,如图8所示,所述检测图获取模块710,包括:样本选取子模块711、模型调整子模块712和检测图获取子模块713。
样本选取子模块711,用于从所述待检测视频的已标注图像帧中,选取至少一个训练样本,所述训练样本包括所述已标注图像帧以及所述已标注图像帧对应的检测目标框,所述检测目标框是指所述目标对象的占比大于预设阈值的图像区域。
模型调整子模块712,用于通过所述训练样本对目标检测模型的参数进行调整,得到调整后的目标检测模型。
检测图获取子模块713,用于通过所述调整后的目标检测模型对所述目标图像帧进行处理,得到所述局部检测图。
在示例性实施例中,所述样本选取子模块711,用于:
在所述已标注图像帧中随机撒框;
计算所述目标对象在所述框中的占比;
若所述目标对象在所述框中的占比大于所述预设阈值,则将所述框确定为所述已标注图像帧对应的检测目标框,并将所述已标注图像帧和所述检测目标框选取为所述训练样本。
在示例性实施例中,如图8所示,所述运动图获取模块720,包括:光流计算子模块721和运动图获取子模块722。
光流计算子模块721,用于计算所述目标图像帧与邻近图像帧之间的光流。
运动图获取子模块722,用于根据所述光流生成所述相对运动显著图。
在示例性实施例中,所述运动图获取子模块722,用于:
根据所述局部检测图中的背景区域的光流,确定背景光流;其中,所述局部检测图中的背景区域,是指所述局部检测图中检测出的所述目标对象所在区域之外的剩余区域;
根据所述背景光流以及所述目标图像帧对应的所述光流,生成所述相对运动显著图。
在示例性实施例中,所述约束信息获取模块730,用于:
对于所述目标图像帧中的目标像素,当所述目标像素在所述局部检测图中的值符合第一预设条件,且所述目标像素在所述相对运动显著图中的值符合第二预设条件时,确定所述目标像素为所述绝对正样本像素;
当所述目标像素在所述局部检测图中的值不符合所述第一预设条件,且所述目标像素在所述相对运动显著图中的值不符合所述第二预设条件时,确定所述目标像素为所述绝对负样本像素;
当所述目标像素在所述局部检测图中的值符合所述第一预设条件,且所述目标像素在所述相对运动显著图中的值不符合所述第二预设条件,或所述目标像素在所述局部检测图中的值不符合所述第一预设条件,且所述目标像素在所 述相对运动显著图中的值符合所述第二预设条件时,确定所述目标像素为所述不确定样本像素。
在示例性实施例中,所述模型调整模块740,用于采用所述绝对正样本像素和所述绝对负样本像素,对所述图像分割模型进行再训练,得到所述调整后的图像分割模型。
在示例性实施例中,所述图像分割模型的预训练过程如下:
构建初始的图像分割模型;
采用第一样本集对所述初始的图像分割模型进行初步训练,得到初步训练后的图像分割模型;其中,所述第一样本集中包含至少一个带标注的图片;
采用第二样本集对所述初步训练后的图像分割模型进行再训练,得到预训练完成的图像分割模型;其中,所述第二样本集中包含至少一个带标注的视频。
需要说明的是,上述实施例提供的装置,在实现其功能时,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将设备的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。另外,上述实施例提供的装置与方法实施例属于同一构思,其具体实现过程详见方法实施例,这里不再赘述。
请参考图9,其示出了本申请一个实施例提供的计算机设备900的结构框图。该计算机设备900可以是手机、平板电脑、电子书阅读设备、可穿戴设备、智能电视、多媒体播放设备、PC、服务器等。
通常,终端900包括有:处理器901和存储器902。
处理器901可以包括一个或多个处理核心,比如4核心处理器、8核心处理器等。处理器901可以采用DSP(Digital Signal Processing,数字信号处理)、FPGA(Field Programmable Gate Array,现场可编程门阵列)、PLA(Programmable Logic Array,可编程逻辑阵列)中的至少一种硬件形式来实现。处理器901也可以包括主处理器和协处理器,主处理器是用于对在唤醒状态下的数据进行处理的处理器,也称CPU(Central Processing Unit,中央处理器);协处理器是用于对在待机状态下的数据进行处理的低功耗处理器。在一些实施例中,处理器901可以集成有GPU(Graphics Processing Unit,图像处理器),GPU用于负责显示屏所需要显示的内容的渲染和绘制。在一些实施例中,处理器901还可以包括AI(Artificial Intelligence,人工智能)处理器,该AI处理器用于处理有关机器学习的计算操作。
存储器902可以包括一个或多个计算机可读存储介质,该计算机可读存储介质可以是非暂态的。存储器902还可包括高速随机存取存储器,以及非易失性存储器,比如一个或多个磁盘存储设备、闪存存储设备。在一些实施例中,存储器902中的非暂态的计算机可读存储介质用于存储计算机程序,该计算机程序用于被处理器901所执行以实现本申请中方法实施例提供的视频目标跟踪方法。
在一些实施例中,终端900还可选包括有:外围设备接口903和至少一个外围设备。处理器901、存储器902和外围设备接口903之间可以通过总线或信号线相连。各个外围设备可以通过总线、信号线或电路板与外围设备接口903相连。具体地,外围设备可以包括:射频电路904、显示屏905、摄像头906、音频电路907、定位组件908和电源909中的至少一种。
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,是可以通过计算机程序来指令相关的硬件来完成,所述的程序可存储于一非易失性计算机可读取存储介质中,该程序在执行时,可包括如上述各方法的实施例的流程。其中,本申请所提供的各实施例中所使用的对存储器、存储、数据库或其它介质的任何引用,均可包括非易失性和/或易失性存储器。非易失性存储器可包括只读存储器(ROM)、可编程ROM(PROM)、电可编程ROM(EPROM)、电可擦除可编程ROM(EEPROM)或闪存。易失性存储器可包括随机存取存储器(RAM)或者外部高速缓冲存储器。作为说明而非局限,RAM以多种形式可得,诸如静态RAM(SRAM)、动态RAM(DRAM)、同步DRAM(SDRAM)、双数据率SDRAM(DDRSDRAM)、增强型SDRAM(ESDRAM)、同步链路(Synchlink)DRAM(SLDRAM)、存储器总线(Rambus)直接RAM(RDRAM)、直接存储器总线动态RAM(DRDRAM)、以及存储器总线动态RAM(RDRAM)等。
本领域技术人员可以理解,图9中示出的结构并不构成对终端900的限定,可以包括比图示更多或更少的组件,或者组合某些组件,或者采用不同的组件布置。
在示例性实施例中,还提供了一种计算机设备,所述计算机设备包括处理器和存储器,所述存储器中存储有至少一条指令、至少一段程序、代码集或指令集。所述至少一条指令、至少一段程序、代码集或指令集经配置以由一个或者一个以上处理器执行,以实现上述视频目标跟踪方法。
在示例性实施例中,还提供了一种计算机可读存储介质,所述存储介质中存储有至少一条指令、至少一段程序、代码集或指令集,所述至少一条指令、所述至少一段程序、所述代码集或所述指令集在被计算机设备的处理器执行时实现上述视频目标跟踪方法。
可选地,上述计算机可读存储介质可以是ROM、随机存取存储器(RAM)、CD-ROM、磁带、软盘和光数据存储设备等。
在示例性实施例中,还提供了一种计算机程序产品,当该计算机程序产品被执行时,其用于实现上述视频目标跟踪方法。
应当理解的是,本文中描述的步骤编号,仅示例性示出了步骤间的一种可能的执行先后顺序,在一些其它实施例中,上述步骤也可以不按照编号顺序来执行,如两个不同编号的步骤同时执行,或者两个不同编号的步骤按照与图示相反的顺序执行,本申请实施例对此不作限定。
本领域普通技术人员可以理解实现上述实施例的全部或部分步骤可以通过硬件来完成,也可以通过程序来指令相关的硬件完成,所述的程序可以存储于一种计算机可读存储介质中,上述提到的存储介质可以是只读存储器,磁盘或光盘等。
以上实施例的各技术特征可以进行任意的组合,为使描述简洁,未对上述实施例中的各个技术特征所有可能的组合都进行描述,然而,只要这些技术特征的组合不存在矛盾,都应当认为是本说明书记载的范围。以上实施例仅表达了本申请的几种实施方式,其描述较为具体和详细,但并不能因此而理解为对申请专利范围的限制。应当指出的是,对于本领域的普通技术人员来说,在不脱离本申请构思的前提下,还可以做出若干变形和改进,这些都属于本申请的保护范围。因此,本申请专利的保护范围应以所附权利要求为准。

Claims (20)

  1. 一种视频目标跟踪方法,由计算机设备执行,其特征在于,所述方法包括:
    获取待检测视频中的目标图像帧对应的局部检测图,所述局部检测图是基于所述待检测视频中需要通过图像分割模型跟踪的目标对象的表观信息生成;
    获取所述目标图像帧对应的相对运动显著图,所述相对运动显著图是基于所述目标对象的运动信息生成的;
    根据所述局部检测图和所述相对运动显著图,确定所述目标图像帧对应的约束信息,所述约束信息包括所述目标图像帧中的绝对正样本像素、绝对负样本像素和不确定样本像素;
    通过所述约束信息对所述图像分割模型的参数进行调整,得到调整后的图像分割模型;
    通过所述调整后的图像分割模型,提取所述目标图像帧中的所述目标对象。
  2. 根据权利要求1所述的方法,其特征在于,所述获取待检测视频中的目标图像帧对应的局部检测图,包括:
    从所述待检测视频的已标注图像帧中,选取至少一个训练样本,所述训练样本包括所述已标注图像帧以及所述已标注图像帧对应的检测目标框,所述检测目标框是指所述目标对象在所述检测目标框的占比大于预设阈值的图像区域;
    通过所述训练样本对目标检测模型的参数进行调整,得到调整后的目标检测模型;
    通过所述调整后的目标检测模型对所述目标图像帧进行处理,得到所述局部检测图。
  3. 根据权利要求2所述的方法,其特征在于,所述从所述待检测视频的已标注图像帧中,选取至少一个训练样本,包括:
    在所述已标注图像帧中随机撒框;
    计算所述目标对象在随机撒的所述框中的占比;
    若所述目标对象在所述框中的占比大于所述预设阈值,则将所述框确定为所述已标注图像帧对应的检测目标框,并将所述已标注图像帧和所述检测目标框选取为所述训练样本。
  4. 根据权利要求1所述的方法,其特征在于,所述获取所述目标图像帧对应的相对运动显著图,包括:
    计算所述目标图像帧与邻近图像帧之间的光流;
    根据所述光流生成所述相对运动显著图。
  5. 根据权利要求4所述的方法,其特征在于,所述根据所述光流生成所述相对运动显著图,包括:
    根据所述局部检测图中的背景区域的光流,确定背景光流;其中,所述局部检测图中的背景区域,是指所述局部检测图中检测出的所述目标对象所在区域之外的剩余区域;
    根据所述背景光流以及所述目标图像帧对应的所述光流,生成所述相对运动显著图。
  6. 根据权利要求1至5任一项所述的方法,其特征在于,所述根据所述局部检测图和所述相对运动显著图,确定所述目标图像帧对应的约束信息,包括:
    对于所述目标图像帧中的目标像素,若所述目标像素在所述局部检测图中的值符合第一预设条件,且所述目标像素在所述相对运动显著图中的值符合第二预设条件,则确定所述目标像素为所述绝对正样本像素;
    若所述目标像素在所述局部检测图中的值不符合所述第一预设条件,且所述目标像素在所述相对运动显著图中的值不符合所述第二预设条件,则确定所述目标像素为所述绝对负样本像素;
    若所述目标像素在所述局部检测图中的值符合所述第一预设条件,且所述目标像素在所述相对运动显著图中的值不符合所述第二预设条件,则确定所述目标像素为所述不确定样本像素;或者,
    若所述目标像素在所述局部检测图中的值不符合所述第一预设条件,且所述目标像素在所述相对运动显著图中的值符合所述第二预设条件,则确定所述目标像素为所述不确定样本像素。
  7. 根据权利要求1至5任一项所述的方法,其特征在于,所述通过所述约束信息对图像分割模型的参数进行调整,得到调整后的图像分割模型,包括:
    采用所述绝对正样本像素和所述绝对负样本像素,对所述图像分割模型的参数进行调整,得到所述调整后的图像分割模型。
  8. 根据权利要求1至5任一项所述的方法,其特征在于,所述图像分割模型的预训练过程如下:
    构建初始的图像分割模型;
    采用第一样本集对所述初始的图像分割模型进行初步训练,得到初步训练后的图像分割模型;其中,所述第一样本集中包含至少一个带标注的图片;
    采用第二样本集对所述初步训练后的图像分割模型进行再训练,得到预训练完成的图像分割模型;其中,所述第二样本集中包含至少一个带标注的视频。
  9. 一种视频目标跟踪装置,其特征在于,所述装置包括:
    检测图获取模块,用于获取待检测视频中的目标图像帧对应的局部检测图,所述局部检测图是基于所述待检测视频中需要通过图像分割模型跟踪的目标对象的表观信息生成的,所述图像分割模型是用于从所述待检测视频的图像帧中分割提取出所述目标对象的神经网络模型;
    运动图获取模块,用于获取所述目标图像帧对应的相对运动显著图,所述相对运动显著图是基于所述目标对象的运动信息生成的;
    约束信息获取模块,用于根据所述局部检测图和所述相对运动显著图,确定所述目标图像帧对应的约束信息,所述约束信息包括所述目标图像帧中的绝对正样本像素、绝对负样本像素和不确定样本像素;
    模型调整模块,用于通过所述约束信息对所述图像分割模型的参数进行调整,得到调整后的图像分割模型;
    目标分割模块,用于通过所述调整后的图像分割模型,提取所述目标图像帧中的所述目标对象。
  10. 根据权利要求9所述的装置,其特征在于,所述检测图获取模块,包括:
    样本选取子模块,用于从所述待检测视频的已标注图像帧中,选取至少一个训练样本,所述训练样本包括所述已标注图像帧以及所述已标注图像帧对应的检测目标框,所述检测目标框是指所述目标对象的占比大于预设阈值的图像区域;
    模型调整子模块,用于通过所述训练样本对目标检测模型的参数进行调整,得到调整后的目标检测模型;
    检测图获取子模块,用于通过所述调整后的目标检测模型对所述目标图像帧进行处理,得到所述局部检测图。
  11. 根据权利要求9所述的装置,其特征在于,所述运动图获取模块,包括:
    光流计算子模块,用于计算所述目标图像帧与邻近图像帧之间的光流;
    运动图获取子模块,用于根据所述光流生成所述相对运动显著图。
  12. 根据权利要求9至11任一项所述的装置,其特征在于,所述约束信息获取模块,用于:
    对于所述目标图像帧中的目标像素,当所述目标像素在所述局部检测图中的值符合第一预设条件,且所述目标像素在所述相对运动显著图中的值符合第二预设条件时,确定所述目标像素为所述绝对正样本像素;
    当所述目标像素在所述局部检测图中的值不符合所述第一预设条件,且所述目标像素在所述相对运动显著图中的值不符合所述第二预设条件时,确定所述目标像素为所述绝对负样本像素;
    当所述目标像素在所述局部检测图中的值符合所述第一预设条件,且所述目标像素在所述相对运动显著图中的值不符合所述第二预设条件,或所述目标像素在所述局部检测图中的值不符合所述第一预设条件,且所述目标像素在所述相对运动显著图中的值符合所述第二预设条件时,确定所述目标像素为所述不确定样本像素。
  13. 根据权利要求9至11任一项所述的装置,其特征在于,所述模型调整模块,用于:
    采用所述绝对正样本像素和所述绝对负样本像素,对所述图像分割模型进行再训练,得到所述调整后的图像分割模型。
  14. 一种计算机设备,其特征在于,所述计算机设备包括处理器和存储器,所述存储器中存储有至少一条指令、至少一段程序、代码集或指令集,所述至少一条指令、所述至少一段程序、所述代码集或指令集由所述处理器加载并执行,使得所述处理器执行以下步骤:
    获取待检测视频中的目标图像帧对应的局部检测图,所述局部检测图是基于所述待检测视频中需要通过图像分割模型跟踪的目标对象的表观信息生成;
    获取所述目标图像帧对应的相对运动显著图,所述相对运动显著图是基于所述目标对象的运动信息生成的;
    根据所述局部检测图和所述相对运动显著图,确定所述目标图像帧对应的约束信息,所述约束信息包括所述目标图像帧中的绝对正样本像素、绝对负样本像素和不确定样本像素;
    通过所述约束信息对所述图像分割模型的参数进行调整,得到调整后的图像分割模型;
    通过所述调整后的图像分割模型,提取所述目标图像帧中的所述目标对象。
  15. 如权利要求14所述的计算机设备,其特征在于,所述至少一条指令、所述至少一段程序、所述代码集或指令集由所述处理器加载并执行获取待检测视频中的目标图像帧对应的局部检测图的步骤时,使得所述处理器具体执行以下步骤:
    从所述待检测视频的已标注图像帧中,选取至少一个训练样本,所述训练样本包括所述已标注图像帧以及所述已标注图像帧对应的检测目标框,所述检测目标框是指所述目标对象在所述检测目标框的占比大于预设阈值的图像区域;
    通过所述训练样本对目标检测模型的参数进行调整,得到调整后的目标检测模型;
    通过所述调整后的目标检测模型对所述目标图像帧进行处理,得到所述局部检测图。
  16. 如权利要求15所述的计算机设备,其特征在于,所述至少一条指令、所述至少一段程序、所述代码集或指令集由所述处理器加载并执行从所述待检测视频的已标注图像帧中,选取至少一个训练样本的步骤时,使得所述处理器具体执行以下步骤:
    在所述已标注图像帧中随机撒框;
    计算所述目标对象在随机撒的所述框中的占比;
    若所述目标对象在所述框中的占比大于所述预设阈值,则将所述框确定为所述已标注图像帧对应的检测目标框,并将所述已标注图像帧和所述检测目标框选取为所述训练样本。
  17. 如权利要求14所述的计算机设备,其特征在于,所述至少一条指令、所述至少一段程序、所述代码集或指令集由所述处理器加载并执行获取所述目标图像帧对应的相对运动显著图的步骤时,使得所述处理器具体执行以下步骤:
    计算所述目标图像帧与邻近图像帧之间的光流;
    根据所述光流生成所述相对运动显著图。
  18. 一种计算机可读存储介质,其特征在于,所述存储介质中存储有至少一条指令、至少一段程序、代码集或指令集,所述至少一条指令、所述至少一段程序、所述代码集或指令集由处理器加载并执行,使得所述处理器执行以下步骤:
    获取待检测视频中的目标图像帧对应的局部检测图,所述局部检测图是基于所述待检测视频中需要通过图像分割模型跟踪的目标对象的表观信息生成;
    获取所述目标图像帧对应的相对运动显著图,所述相对运动显著图是基于所述目标对象的运动信息生成的;
    根据所述局部检测图和所述相对运动显著图,确定所述目标图像帧对应的约束信息,所述约束信息包括所述目标图像帧中的绝对正样本像素、绝对负样本像素和不确定样本像素;
    通过所述约束信息对所述图像分割模型的参数进行调整,得到调整后的图像分割模型;
    通过所述调整后的图像分割模型,提取所述目标图像帧中的所述目标对象。
  19. 如权利要求18所述的存储介质,其特征在于,所述至少一条指令、所述至少一段程序、所述代码集或指令集由所述处理器加载并执行获取待检测视频中的目标图像帧对应的局部检测图的步骤时,使得所述处理器具体执行以下步骤:
    从所述待检测视频的已标注图像帧中,选取至少一个训练样本,所述训练样本包括所述已标注图像帧以及所述已标注图像帧对应的检测目标框,所述检测目标框是指所述目标对象在所述检测目标框的占比大于预设阈值的图像区域;
    通过所述训练样本对目标检测模型的参数进行调整,得到调整后的目标检测模型;
    通过所述调整后的目标检测模型对所述目标图像帧进行处理,得到所述局部检测图。
  20. 如权利要求19所述的存储介质,其特征在于,所述至少一条指令、所述至少一段程序、所述代码集或指令集由所述处理器加载并执行从所述待检测 视频的已标注图像帧中,选取至少一个训练样本的步骤时,使得所述处理器具体执行以下步骤:
    在所述已标注图像帧中随机撒框;
    计算所述目标对象在随机撒的所述框中的占比;
    若所述目标对象在所述框中的占比大于所述预设阈值,则将所述框确定为所述已标注图像帧对应的检测目标框,并将所述已标注图像帧和所述检测目标框选取为所述训练样本。
PCT/CN2020/088286 2019-05-27 2020-04-30 视频目标跟踪方法、装置、计算机设备及存储介质 WO2020238560A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP20812620.1A EP3979200A4 (en) 2019-05-27 2020-04-30 VIDEO TARGET TRACKING METHOD AND APPARATUS, COMPUTER DEVICE AND STORAGE MEDIA
JP2021537733A JP7236545B2 (ja) 2019-05-27 2020-04-30 ビデオターゲット追跡方法と装置、コンピュータ装置、プログラム
US17/461,978 US20210398294A1 (en) 2019-05-27 2021-08-30 Video target tracking method and apparatus, computer device, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910447379.3 2019-05-27
CN201910447379.3A CN110176027B (zh) 2019-05-27 2019-05-27 视频目标跟踪方法、装置、设备及存储介质

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/461,978 Continuation US20210398294A1 (en) 2019-05-27 2021-08-30 Video target tracking method and apparatus, computer device, and storage medium

Publications (1)

Publication Number Publication Date
WO2020238560A1 true WO2020238560A1 (zh) 2020-12-03

Family

ID=67696270

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/088286 WO2020238560A1 (zh) 2019-05-27 2020-04-30 视频目标跟踪方法、装置、计算机设备及存储介质

Country Status (5)

Country Link
US (1) US20210398294A1 (zh)
EP (1) EP3979200A4 (zh)
JP (1) JP7236545B2 (zh)
CN (1) CN110176027B (zh)
WO (1) WO2020238560A1 (zh)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112733802A (zh) * 2021-01-25 2021-04-30 腾讯科技(深圳)有限公司 图像的遮挡检测方法、装置、电子设备及存储介质
CN113361373A (zh) * 2021-06-02 2021-09-07 武汉理工大学 一种农业场景下的航拍图像实时语义分割方法
EP4138045A1 (en) * 2021-08-20 2023-02-22 INTEL Corporation Resource-efficient video coding and motion estimation
WO2023096685A1 (en) * 2021-11-24 2023-06-01 Microsoft Technology Licensing, Llc. Feature prediction for efficient video processing

Families Citing this family (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109086709B (zh) * 2018-07-27 2023-04-07 腾讯科技(深圳)有限公司 特征提取模型训练方法、装置及存储介质
CN110176027B (zh) * 2019-05-27 2023-03-14 腾讯科技(深圳)有限公司 视频目标跟踪方法、装置、设备及存储介质
CN110503074B (zh) * 2019-08-29 2022-04-15 腾讯科技(深圳)有限公司 视频帧的信息标注方法、装置、设备及存储介质
CN110807784B (zh) * 2019-10-30 2022-07-26 北京百度网讯科技有限公司 用于分割物体的方法和装置
CN112784638B (zh) * 2019-11-07 2023-12-08 北京京东乾石科技有限公司 训练样本获取方法和装置、行人检测方法和装置
CN112862855B (zh) * 2019-11-12 2024-05-24 北京京邦达贸易有限公司 图像标注方法、装置、计算设备及存储介质
CN110866515B (zh) * 2019-11-22 2023-05-09 盛景智能科技(嘉兴)有限公司 厂房内对象行为识别方法、装置以及电子设备
CN111242973A (zh) * 2020-01-06 2020-06-05 上海商汤临港智能科技有限公司 目标跟踪方法、装置、电子设备及存储介质
CN111260679B (zh) * 2020-01-07 2022-02-01 广州虎牙科技有限公司 图像处理方法、图像分割模型训练方法及相关装置
CN111274892B (zh) * 2020-01-14 2020-12-18 北京科技大学 一种鲁棒的遥感影像变化检测方法及系统
CN111208148A (zh) * 2020-02-21 2020-05-29 凌云光技术集团有限责任公司 一种挖孔屏漏光缺陷检测系统
CN111340101B (zh) * 2020-02-24 2023-06-30 广州虎牙科技有限公司 稳定性评估方法、装置、电子设备和计算机可读存储介质
CN111444826B (zh) * 2020-03-25 2023-09-29 腾讯科技(深圳)有限公司 视频检测方法、装置、存储介质及计算机设备
CN111476252B (zh) * 2020-04-03 2022-07-29 南京邮电大学 一种面向计算机视觉应用的轻量化无锚框目标检测方法
CN111461130B (zh) * 2020-04-10 2021-02-09 视研智能科技(广州)有限公司 一种高精度图像语义分割算法模型及分割方法
JP2021174182A (ja) 2020-04-23 2021-11-01 株式会社日立システムズ 画素レベル対象物検出システムおよびそのプログラム
CN111654746B (zh) * 2020-05-15 2022-01-21 北京百度网讯科技有限公司 视频的插帧方法、装置、电子设备和存储介质
CN112132871B (zh) * 2020-08-05 2022-12-06 天津(滨海)人工智能军民融合创新中心 一种基于特征光流信息的视觉特征点追踪方法、装置、存储介质及终端
CN112525145B (zh) * 2020-11-30 2022-05-17 北京航空航天大学 一种飞机降落相对姿态动态视觉测量方法及系统
CN112541475B (zh) * 2020-12-24 2024-01-19 北京百度网讯科技有限公司 感知数据检测方法及装置
KR20220099210A (ko) * 2021-01-05 2022-07-13 삼성디스플레이 주식회사 표시 장치, 이를 포함하는 가상 현실 표시 시스템 및 이를 이용한 입력 영상 기반 사용자 움직임 추정 방법
CN113011371A (zh) * 2021-03-31 2021-06-22 北京市商汤科技开发有限公司 目标检测方法、装置、设备及存储介质
CN113361519B (zh) * 2021-05-21 2023-07-28 北京百度网讯科技有限公司 目标处理方法、目标处理模型的训练方法及其装置
CN113518256B (zh) * 2021-07-23 2023-08-08 腾讯科技(深圳)有限公司 视频处理方法、装置、电子设备及计算机可读存储介质
CN113807185B (zh) * 2021-08-18 2024-02-27 苏州涟漪信息科技有限公司 一种数据处理方法和装置
CN114359973A (zh) * 2022-03-04 2022-04-15 广州市玄武无线科技股份有限公司 基于视频的商品状态识别方法、设备及计算机可读介质
CN114639171B (zh) * 2022-05-18 2022-07-29 松立控股集团股份有限公司 一种停车场全景安全监控方法
CN115052154B (zh) * 2022-05-30 2023-04-14 北京百度网讯科技有限公司 一种模型训练和视频编码方法、装置、设备及存储介质
CN115860275B (zh) * 2023-02-23 2023-05-05 深圳市南湖勘测技术有限公司 一种用于土地整备利益统筹测绘采集方法及系统
CN116188460B (zh) * 2023-04-24 2023-08-25 青岛美迪康数字工程有限公司 基于运动矢量的图像识别方法、装置和计算机设备

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8467570B2 (en) * 2006-06-14 2013-06-18 Honeywell International Inc. Tracking system with fused motion and object detection
CN106127807A (zh) * 2016-06-21 2016-11-16 中国石油大学(华东) 一种实时的视频多类多目标跟踪方法
CN108122247A (zh) * 2017-12-25 2018-06-05 北京航空航天大学 一种基于图像显著性和特征先验模型的视频目标检测方法
CN109035293A (zh) * 2018-05-22 2018-12-18 安徽大学 适用于视频图像中显著人体实例分割的方法
CN110176027A (zh) * 2019-05-27 2019-08-27 腾讯科技(深圳)有限公司 视频目标跟踪方法、装置、设备及存储介质

Family Cites Families (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101968884A (zh) * 2009-07-28 2011-02-09 索尼株式会社 检测视频图像中的目标的方法和装置
US9107604B2 (en) * 2011-09-26 2015-08-18 Given Imaging Ltd. Systems and methods for generating electromagnetic interference free localization data for an in-vivo device
US11100335B2 (en) * 2016-03-23 2021-08-24 Placemeter, Inc. Method for queue time estimation
CN106530330B (zh) * 2016-12-08 2017-07-25 中国人民解放军国防科学技术大学 基于低秩稀疏的视频目标跟踪方法
US11423548B2 (en) * 2017-01-06 2022-08-23 Board Of Regents, The University Of Texas System Segmenting generic foreground objects in images and videos
US20180204076A1 (en) * 2017-01-13 2018-07-19 The Regents Of The University Of California Moving object detection and classification image analysis methods and systems
CN106709472A (zh) * 2017-01-17 2017-05-24 湖南优象科技有限公司 一种基于光流特征的视频目标检测与跟踪方法
CN106934346B (zh) * 2017-01-24 2019-03-15 北京大学 一种目标检测性能优化的方法
CN107066990B (zh) * 2017-05-04 2019-10-11 厦门美图之家科技有限公司 一种目标跟踪方法及移动设备
CN108305275B (zh) * 2017-08-25 2021-02-12 深圳市腾讯计算机系统有限公司 主动跟踪方法、装置及系统
CN107679455A (zh) * 2017-08-29 2018-02-09 平安科技(深圳)有限公司 目标跟踪装置、方法及计算机可读存储介质
CN107644429B (zh) * 2017-09-30 2020-05-19 华中科技大学 一种基于强目标约束视频显著性的视频分割方法
CN108765465B (zh) * 2018-05-31 2020-07-10 西安电子科技大学 一种无监督sar图像变化检测方法
CN109145781B (zh) * 2018-08-03 2021-05-04 北京字节跳动网络技术有限公司 用于处理图像的方法和装置
CN109376603A (zh) * 2018-09-25 2019-02-22 北京周同科技有限公司 一种视频识别方法、装置、计算机设备及存储介质
CN109461168B (zh) * 2018-10-15 2021-03-16 腾讯科技(深圳)有限公司 目标对象的识别方法和装置、存储介质、电子装置
CN109635657B (zh) * 2018-11-12 2023-01-06 平安科技(深圳)有限公司 目标跟踪方法、装置、设备及存储介质
CN109492608B (zh) * 2018-11-27 2019-11-05 腾讯科技(深圳)有限公司 图像分割方法、装置、计算机设备及存储介质
CN109711445B (zh) * 2018-12-18 2020-10-16 绍兴文理学院 目标跟踪分类器在线训练样本的超像素中智相似加权方法

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8467570B2 (en) * 2006-06-14 2013-06-18 Honeywell International Inc. Tracking system with fused motion and object detection
CN106127807A (zh) * 2016-06-21 2016-11-16 中国石油大学(华东) 一种实时的视频多类多目标跟踪方法
CN108122247A (zh) * 2017-12-25 2018-06-05 北京航空航天大学 一种基于图像显著性和特征先验模型的视频目标检测方法
CN109035293A (zh) * 2018-05-22 2018-12-18 安徽大学 适用于视频图像中显著人体实例分割的方法
CN110176027A (zh) * 2019-05-27 2019-08-27 腾讯科技(深圳)有限公司 视频目标跟踪方法、装置、设备及存储介质

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3979200A4 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112733802A (zh) * 2021-01-25 2021-04-30 腾讯科技(深圳)有限公司 图像的遮挡检测方法、装置、电子设备及存储介质
CN112733802B (zh) * 2021-01-25 2024-02-09 腾讯科技(深圳)有限公司 图像的遮挡检测方法、装置、电子设备及存储介质
CN113361373A (zh) * 2021-06-02 2021-09-07 武汉理工大学 一种农业场景下的航拍图像实时语义分割方法
EP4138045A1 (en) * 2021-08-20 2023-02-22 INTEL Corporation Resource-efficient video coding and motion estimation
WO2023096685A1 (en) * 2021-11-24 2023-06-01 Microsoft Technology Licensing, Llc. Feature prediction for efficient video processing

Also Published As

Publication number Publication date
EP3979200A4 (en) 2022-07-27
US20210398294A1 (en) 2021-12-23
EP3979200A1 (en) 2022-04-06
JP2022534337A (ja) 2022-07-29
CN110176027B (zh) 2023-03-14
JP7236545B2 (ja) 2023-03-09
CN110176027A (zh) 2019-08-27

Similar Documents

Publication Publication Date Title
WO2020238560A1 (zh) 视频目标跟踪方法、装置、计算机设备及存储介质
Dvornik et al. On the importance of visual context for data augmentation in scene understanding
US20200250436A1 (en) Video object segmentation by reference-guided mask propagation
CN110738207B (zh) 一种融合文字图像中文字区域边缘信息的文字检测方法
WO2020228446A1 (zh) 模型训练方法、装置、终端及存储介质
JP4964159B2 (ja) ビデオのフレームのシーケンスにおいてオブジェクトを追跡するコンピュータに実装される方法
CN110414344B (zh) 一种基于视频的人物分类方法、智能终端及存储介质
CN113344932B (zh) 一种半监督的单目标视频分割方法
CN113076871A (zh) 一种基于目标遮挡补偿的鱼群自动检测方法
CN111860398A (zh) 遥感图像目标检测方法、系统及终端设备
CN114764868A (zh) 图像处理方法、装置、电子设备及计算机可读存储介质
CN111382647B (zh) 一种图片处理方法、装置、设备及存储介质
US11941822B2 (en) Volumetric sampling with correlative characterization for dense estimation
CN111445496B (zh) 一种水下图像识别跟踪系统及方法
US20220101539A1 (en) Sparse optical flow estimation
CN117237547B (zh) 图像重建方法、重建模型的处理方法和装置
CN110580462B (zh) 一种基于非局部网络的自然场景文本检测方法和系统
CN114821048A (zh) 目标物分割方法和相关装置
US20240031512A1 (en) Method and system for generation of a plurality of portrait effects in an electronic device
US20240070812A1 (en) Efficient cost volume processing within iterative process
Chen et al. EBANet: Efficient Boundary-Aware Network for RGB-D Semantic Segmentation
Zhan et al. Scale-equivariant Steerable Networks for Crowd Counting
CN117274588A (zh) 图像处理方法、装置、电子设备及存储介质
CN114202559A (zh) 目标跟踪方法、装置、电子设备及存储介质
CN114998953A (zh) 人脸关键点检测方法及装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20812620

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2021537733

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2020812620

Country of ref document: EP

Effective date: 20220103