WO2021036699A1 - Information labeling method, apparatus, device and storage medium for video frames - Google Patents

Information labeling method, apparatus, device and storage medium for video frames

Info

Publication number
WO2021036699A1
WO2021036699A1 (application PCT/CN2020/106575, CN2020106575W)
Authority
WO
WIPO (PCT)
Prior art keywords
video frame
target
image feature
information
video
Prior art date
Application number
PCT/CN2020/106575
Other languages
English (en)
French (fr)
Inventor
吴锐正
贾佳亚
Original Assignee
腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Company Limited)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology (Shenzhen) Company Limited (腾讯科技(深圳)有限公司)
Priority to JP2021556971A priority Critical patent/JP7147078B2/ja
Priority to EP20859548.8A priority patent/EP4009231A4/en
Publication of WO2021036699A1 publication Critical patent/WO2021036699A1/zh
Priority to US17/473,940 priority patent/US11727688B2/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/48Matching video sequences
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Definitions

  • the embodiments of the present application relate to the field of artificial intelligence, and in particular to a method, apparatus, device, and storage medium for labeling information of video frames.
  • a deep-learning-based method is usually used to model the pixel relationship between video frames with a convolutional neural network, so that annotation information is propagated between video frames through the relationship between pixels.
  • a convolutional neural network is usually used to model adjacent video frames; accordingly, when the constructed model is used to propagate annotation information, the previous video frame of the current video frame is determined as the guiding video frame, so that the label information of the guiding video frame is passed to the current video frame through the model.
  • the embodiments of the present application provide a method, apparatus, device, and storage medium for labeling information of video frames, which can improve the accuracy of the labeling information generated when labeling video frames.
  • the technical solution is as follows:
  • feature extraction is performed on the target video frame to obtain the target image feature of the target video frame
  • the guide video frame of the target video frame is determined from the marked video frames, and the marked video frames belong to the to-be-processed video;
  • the guide video frame is used to guide the target video frame to perform information labeling
  • the image feature matching degree is the matching degree between the target image feature and the corresponding image feature of the labeled video frame
  • the image feature matching degree between the guiding video frame and the target video frame is higher than the image feature matching degree between other marked video frames and the target video frame;
  • target label information corresponding to the target video frame is generated.
  • an embodiment of the present application provides an information labeling device for video frames, the device includes:
  • the acquisition module is used to acquire the video to be processed
  • the generating module is configured to generate target label information corresponding to the target video frame according to the label information corresponding to the guide video frame.
  • a computer-readable storage medium stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement the method for labeling information of video frames as described in the foregoing aspect.
  • the marked video frame that has a high degree of image feature matching with the target video frame is used as the guiding video frame, instead of directly selecting the adjacent video frame as the guiding video frame, which improves the selection quality of the guiding video frame and the accuracy of the generated annotation information; in addition, the propagation error of the label information will not accumulate, thereby improving the propagation quality of the label information.
  • FIG. 1 is a schematic diagram of an implementation of annotating objects in a video using related technologies and methods provided by embodiments of the present application;
  • FIG. 2 is a schematic diagram of the principle of a method for labeling information of a video frame provided by an embodiment of the present application
  • Figure 3 is a schematic diagram of the interface for the automatic tracking and positioning process of the object in the video
  • Fig. 4 is a schematic diagram of the interface of the process of coloring gray-scale video
  • Fig. 5 shows a schematic diagram of an implementation environment provided by an exemplary embodiment of the present application
  • Fig. 6 shows a flowchart of a method for labeling information of video frames provided by an exemplary embodiment of the present application
  • FIG. 7 shows a flowchart of a method for labeling information of a video frame provided by another exemplary embodiment of the present application.
  • FIG. 8 shows a flowchart of a method for labeling information of a video frame provided by another exemplary embodiment of the present application.
  • FIG. 9 is a schematic diagram of the implementation of the method for labeling information of the video frame shown in FIG. 8;
  • FIG. 10 is a schematic diagram of the implementation of feature extraction by the first selection branch of the selection network
  • Fig. 12 is a flowchart of a network training process provided by an exemplary embodiment
  • FIG. 13 is a structural block diagram of a device for labeling information of video frames provided by an exemplary embodiment of the present application.
  • Computer Vision is a science that studies how to make machines "see"; more specifically, it refers to using cameras and computers instead of human eyes to identify, track, and measure targets as machine vision, and to further perform graphics processing so that the computer-processed image is more suitable for human observation or for transmission to an instrument for inspection.
  • Computer vision technology usually includes image processing, image recognition, image semantic understanding, image retrieval, optical character recognition (OCR), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, and also includes common biometric recognition technologies such as facial recognition and fingerprint recognition.
  • the method provided in the embodiment of the present application relates to the application of computer vision technology in the field of video information labeling.
  • the propagation error of the annotation information will continue to accumulate during the propagation process, especially when the object to be annotated is blocked or temporarily leaves the picture in some video frames, which may make it impossible to set correct annotation information for the video frames after that frame, and will ultimately affect the propagation effect of the annotation information.
  • in the embodiment of the present application, the adjacent video frame of the target video frame t (that is, video frame t-1) is not directly determined as the guide video frame.
  • feature extraction is performed on the target video frame t to obtain the target image feature 22 of the target video frame t.
  • the computer device selects a video frame that has a high degree of image feature matching with the target video frame t from the marked video frame as the guide video frame 24. Further, the computer device generates target label information 26 for the target video frame t according to the guide video frame 24, the label information 25 of the guide video frame, and the target video frame t, and completes the information labeling of the target video frame t.
  • when generating the marking information for the 75th video frame, the computer device does not determine the 74th video frame as the guide video frame; instead, the 35th video frame is determined as the guide video frame based on the image feature matching degree, and then the object 11 is marked in the 75th video frame based on the annotation information of the guide video frame; similarly, when generating the marking information for the 100th video frame, the computer device determines the 98th video frame as the guide video frame, and finally marks the object 11 in the 100th video frame.
  • the method for labeling information of video frames provided in the embodiments of the present application can be applied to indoor monitoring applications, road monitoring applications, parking lot monitoring applications, and other applications that have the function of automatically tracking and positioning video objects.
  • the user first imports the video into the application and then marks the object that needs to be automatically tracked in a certain video frame of the video; the application generates annotation information for the other video frames in the video according to the initial annotation information, and further marks and displays the automatically tracked and positioned object in each video frame according to the annotation information.
  • the application interface displays the first video frame in the video, and prompts the user to mark the object to be tracked by frame selection.
  • when the user uses the wireframe 31 to select the object "dog" to be tracked and clicks the start tracking control, the application program generates the annotation information for each video frame in the video in sequence according to the first video frame and its annotation information, and, according to the generated annotation information, uses the wireframe 31 to frame-select and display the dog in the video frames in real time.
  • the method for labeling information of video frames provided in the embodiments of the present application can be applied to an application program with a video coloring function, such as a video editing application program.
  • the user first colorizes a certain image frame in the grayscale video and then inputs the grayscale video containing the initial color annotation information into the application program; the application program generates color label information for the other video frames in the video according to the initial color annotation information, further colors each video frame according to the generated color label information, and finally outputs a color video.
  • the method for labeling information of video frames provided in the embodiments of the present application can be applied to computer devices such as terminals or servers.
  • the method for labeling information of video frames provided in the embodiments of this application can be implemented as an application or a part of an application and installed in the terminal, so that the terminal has the function of automatically labeling information for the video frames in a video; or, it can be applied to the background server of the application, so that the server provides the video frame information labeling function for the application in the terminal.
  • FIG. 5 shows a schematic diagram of an implementation environment provided by an exemplary embodiment of the present application.
  • the implementation environment includes a terminal 510 and a server 520.
  • the terminal 510 and the server 520 communicate data through a communication network.
  • the communication network may be a wired network or a wireless network, and the communication network may be at least one of a local area network, a metropolitan area network, and a wide area network.
  • the terminal 510 is installed with an application program that requires video frame information labeling.
  • the application program may be a monitoring application program, a video coloring application program, etc., which is not limited in the embodiment of the present application.
  • the terminal 510 may be a mobile terminal such as a mobile phone, a tablet computer, a laptop computer, or an auxiliary device for the visually impaired, or a terminal such as a desktop computer or a projection computer, which is not limited in the embodiment of the present application.
  • a pre-trained memory selection network 521 and a time-series propagation network 522 are provided in the server 520, where the memory selection network 521 is used to select a guide video frame for the video frame to be labeled from the labeled video frames
  • the time-series propagation network 522 is used to generate annotation information for the video frame to be marked according to the guide video frame selected by the memory selection network 521.
  • the server 520 generates annotation information for each video frame in the to-be-processed video through the memory selection network 521 and the timing propagation network 522, and then feeds the annotation information back to the terminal 510; the terminal 510 processes the video according to the annotation information and displays the processed video.
  • when the labeling information is object segmentation information, the terminal 510 marks and displays the target object in each video frame according to the object segmentation information; when the labeling information is color information, the terminal 510 colors each object in the video frames according to the color information.
  • the aforementioned memory selection network 521 and timing propagation network 522 can also be implemented as part or all of the application program. Accordingly, the terminal 510 can locally mark the video frame information without the server 520. This embodiment does not limit this.
  • the video to be processed may be a real-time streaming video, a captured video, or a downloaded video, which is not limited in the embodiment of the present application.
  • the 0th video frame of the video to be processed is the initial annotated video frame.
  • the initial annotated video frame can also be a non-zeroth frame (that is, not the first frame), namely any frame in the video to be processed (such as the frame with the richest image content, or the frame containing all objects to be marked), which is not limited in the embodiment of the present application.
  • when the video to be processed needs to be colored, that is, when the video to be processed is a grayscale video, the label information may be the color information of the video frame, for example, the red-green-blue (RGB) value; when the object in the video to be processed needs to be tracked and positioned, the label information may be object segmentation information, for example, the pixel coordinates of the pixels corresponding to the target object in the video frame.
  • the label information may also adopt other expression forms, which are not limited in the embodiment of the present application.
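  • As a hedged illustration of the two annotation forms mentioned above (not part of the original disclosure), the snippet below represents color annotation as per-pixel RGB values and object segmentation annotation as the pixel coordinates of the target object; the array shapes and dtypes are assumptions chosen only for clarity.

```python
# Hypothetical illustration of the two annotation forms described above.
import numpy as np

height, width = 4, 6

# color annotation: an RGB value for every pixel of the video frame
color_annotation = np.zeros((height, width, 3), dtype=np.uint8)
color_annotation[1, 2] = (210, 160, 30)          # RGB value of pixel (row=1, col=2)

# object segmentation annotation: coordinates of pixels covered by the target object
object_pixels = np.array([[1, 2], [1, 3], [2, 2], [2, 3]])   # (row, col) pairs

print(color_annotation.shape, object_pixels.shape)   # (4, 6, 3) (4, 2)
```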
  • Step 602 For the target video frame in the to-be-processed video, feature extraction is performed on the target video frame to obtain the target image feature of the target video frame.
  • the computer device first performs feature extraction on the target video frame, thereby obtaining the target image feature of the target video frame.
  • the computer device inputs the target video frame into a pre-trained feature extraction network to obtain the target image feature output by the feature extraction network, where the feature extraction network may be obtained based on deep convolutional neural network training; for example, the feature extraction network can adopt a Visual Geometry Group (VGG) network structure, and the size of the output feature map (that is, the target image feature) is 1/32 of that of the input video frame.
  • the embodiment of the present application does not limit the specific method of extracting image features.
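  • As an illustrative sketch of this feature extraction step, the following PyTorch-style module uses a small stand-in backbone with five stride-2 stages, so that the output feature map is 1/32 of the input video frame as described above; the class name, channel widths, and layer count are assumptions and do not reproduce the VGG structure mentioned in the description.

```python
# Hypothetical sketch of the feature-extraction step (step 602): a small
# convolutional backbone standing in for the VGG-style network described
# above, whose output feature map is 1/32 of the input video frame.
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    def __init__(self, in_channels: int = 3, feat_channels: int = 256):
        super().__init__()
        layers, channels = [], in_channels
        for out_channels in (32, 64, 128, 256, feat_channels):
            layers += [
                nn.Conv2d(channels, out_channels, 3, stride=2, padding=1),  # halve H and W
                nn.ReLU(inplace=True),
            ]
            channels = out_channels
        self.backbone = nn.Sequential(*layers)  # five stride-2 stages -> 1/32 resolution

    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        # frame: (N, 3, H, W) video frame -> (N, C, H/32, W/32) image feature
        return self.backbone(frame)

if __name__ == "__main__":
    target_frame = torch.randn(1, 3, 256, 448)   # a target video frame x_t
    f_t = FeatureExtractor()(target_frame)       # target image feature
    print(f_t.shape)                             # torch.Size([1, 256, 8, 14])
```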
  • Step 603: According to the image feature matching degree between the target video frame and the marked video frames, determine the guide video frame of the target video frame from the marked video frames; the marked video frames belong to the video to be processed, the guide video frame is used to guide information labeling of the target video frame, the image feature matching degree is the matching degree between the target image feature and the corresponding image feature of a marked video frame, and the image feature matching degree between the guide video frame and the target video frame is higher than the image feature matching degree between other marked video frames and the target video frame.
  • the computer device caches the image features corresponding to each marked video frame (that is, realizes the memory function).
  • the matching degree between the target image feature and the corresponding image feature of each marked video frame is calculated to obtain the image feature matching degree between the target video frame and each marked video frame, and then the guide video frame is determined according to the image feature matching degree (that is, the selection function is realized).
  • the 0th video frame in the video to be processed is the initial annotated video frame, the guide video frame of the 1st video frame is the 0th video frame, the guide video frame of the 2nd video frame is determined from the 0th and 1st video frames, and so on; the guide video frame of the nth video frame is determined from the 0th to (n-1)th video frames.
  • this example only takes the determination of the guide video frame from the video frame before the target video frame as an example.
  • in other possible implementations, the guide video frame may also be determined from the video frames after the target video frame (for which information labeling has been completed), which is not limited in this embodiment.
  • Step 604 Generate target label information corresponding to the target video frame according to the label information corresponding to the guide video frame.
  • the computer device generates target label information corresponding to the target video frame according to the label information corresponding to the guide video frame.
  • the process of solving the label information y_t corresponding to the target video frame x_t can be expressed as y_t = P(x_t, x_g, y_g), where x_g is the guide video frame, y_g is the label information corresponding to the guide video frame, and P is constructed based on a convolutional neural network.
  • in the embodiments of the present application, when the target video frame in the video to be processed is annotated, feature extraction is performed on the target video frame to obtain the target image feature of the target video frame; according to the image feature matching degree between the target video frame and the marked video frames in the video to be processed, the guide video frame corresponding to the target video frame is determined from the marked video frames, and the target annotation information of the target video frame is generated based on the annotation information of the guide video frame; in the embodiments of the present application, the marked video frame with a high degree of image feature matching with the target video frame is selected as the guide video frame, instead of directly selecting the adjacent video frame as the guide video frame, which improves the selection quality of the guide video frame and further improves the accuracy of the generated annotation information; moreover, the propagation error of the annotation information will not accumulate, thereby improving the propagation quality of the annotation information.
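  • The following is a minimal, hypothetical sketch of the per-frame annotation loop summarized above, assuming a PyTorch-style implementation: for each unannotated frame, the already-annotated frame with the highest image feature matching degree is chosen as the guide frame (rather than always the adjacent frame), its annotation is propagated, and the new frame's feature is remembered. The cosine-similarity matcher and the pass-through propagate function are illustrative stand-ins for the memory selection network and the timing propagation network described later.

```python
# Hypothetical end-to-end sketch of the annotation loop summarized above.
import torch
import torch.nn.functional as F

def annotate_video(frames, initial_annotation, extract, propagate):
    features = [extract(frames[0])]      # memory of image features of annotated frames
    annotations = [initial_annotation]   # annotation of frame 0 is given
    for t in range(1, len(frames)):
        f_t = extract(frames[t])
        # image-feature matching degree against every annotated frame
        scores = [F.cosine_similarity(f_t.flatten(), f_g.flatten(), dim=0)
                  for f_g in features]
        g = int(torch.stack(scores).argmax())          # guide frame index
        annotations.append(propagate(frames[g], annotations[g], frames[t]))
        features.append(f_t)                           # remember this frame
    return annotations

if __name__ == "__main__":
    frames = [torch.randn(3, 64, 64) for _ in range(5)]
    mask0 = torch.zeros(1, 64, 64)                      # initial annotation y_0
    out = annotate_video(frames, mask0,
                         extract=lambda x: x.mean(dim=(1, 2)),   # toy feature
                         propagate=lambda x_g, y_g, x_t: y_g)    # toy propagation
    print(len(out))  # 5
```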
  • a pre-built memory selection network (Memory Selection Network, MSN) is stored in the computer device.
  • the computer device inputs the target image feature into the memory selection network, and the memory selection network selects a marked video frame from the marked video frames as the guide video frame and outputs it.
  • the memory selection network adopts a "memory pool + selection network" structure, in which the memory pool stores the image features of the marked video frames, and the selection network is used to select the guide video frame from the marked video frames according to the image features stored in the memory pool and the target image feature of the target video frame.
  • the computer equipment also includes a temporal propagation network (Temporal Propagation Network, TPN), and the information labeling of the target video frame is performed by the temporal propagation network.
  • FIG. 7 shows a flowchart of a method for labeling information of a video frame provided by another exemplary embodiment of the present application.
  • the method is used in a computer device as an example for description, and the method includes the following steps.
  • Step 701 Obtain a video to be processed.
  • Step 702 For the target video frame in the to-be-processed video, feature extraction is performed on the target video frame to obtain the target image feature of the target video frame.
  • for the implementation of steps 701 to 702, reference may be made to steps 601 to 602, and details are not described herein again in this embodiment.
  • the candidate image features corresponding to the marked video frames are sequentially stored in the memory pool, and accordingly, the computer device sequentially obtains the candidate image features from the memory pool. For example, when the target video frame is the t-th video frame, the image features of the 0th to (t-1)th video frames are stored in the memory pool in sequence, and the computer device obtains the candidate image features from the memory pool in the order of 0 to t-1.
  • the memory pool stores the image characteristics of the initial labeled video frame.
  • the computer device directly uses the initial labeled video frame as the guiding video frame in the initial stage of labeling.
  • Step 704 Input the candidate image feature and the target image feature into the selection network to obtain an image feature score output by the selection network.
  • the image feature score is used to indicate the degree of image feature matching between the candidate image feature and the target image feature.
  • the selection network is a lightweight convolutional neural network, which is used to output image feature scores between image features according to the input image features; the higher the image feature score, the higher the matching degree between the image features, the better the information propagation effect when the labeled video frame corresponding to the candidate image feature is used as the guiding video frame, and the higher the accuracy of the information labeling.
  • the computer device obtains its corresponding image feature score through the above steps.
  • the image features in the memory pool will continue to increase (that is, the number of labeled video frames will continue to increase); if all candidate image features in the memory pool are traversed, the information labeling efficiency for subsequent video frames will gradually decrease.
  • the computer device obtains some candidate image features in the memory pool. Accordingly, it is only necessary to output image feature scores corresponding to some candidate image features through the selection network.
  • the candidate image features corresponding to the marked video frame are obtained from the memory pool every predetermined number of frames .
  • the computer device obtains the candidate image features corresponding to the odd or even video frames in the memory pool (that is, obtains the candidate image features corresponding to the marked video frames every other frame, because the interval between adjacent video frames is short and the difference between their corresponding image features is small), or the computer device obtains the candidate image features corresponding to the marked video frames every two frames.
  • the computer device obtains the candidate image features corresponding to the odd-numbered video frames in the memory pool, and outputs the image feature score corresponding to the candidate image feature through the selection network.
  • the computer device may also obtain the candidate image features of the n video frames adjacent to the target video frame (for example, the adjacent 20 annotated video frames) from the memory pool, which is not limited in this embodiment of the application.
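  • A hypothetical helper illustrating the candidate retrieval strategies just described is sketched below: either take one annotated frame every few frames, or take only the annotated frames closest to the target frame; the memory pool is modeled as a list kept in frame order, and the parameter names are assumptions.

```python
# Hypothetical sketch of candidate retrieval from the memory pool.
from typing import Optional

def candidate_indices(pool_size: int, target_index: int,
                      stride: int = 2, n_adjacent: Optional[int] = None) -> list:
    if n_adjacent is not None:
        # only the annotated frames closest to the target frame
        start = max(0, target_index - n_adjacent)
        return list(range(start, min(pool_size, target_index)))
    # one annotated frame every `stride` frames (stride=2 -> e.g. even-numbered frames)
    return list(range(0, pool_size, stride))

if __name__ == "__main__":
    print(candidate_indices(pool_size=10, target_index=10, stride=2))
    # [0, 2, 4, 6, 8]
    print(candidate_indices(pool_size=10, target_index=10, n_adjacent=4))
    # [6, 7, 8, 9]
```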
  • step 708 can be included after step 703, and accordingly, step 704 can be replaced with step 709.
  • Step 708 Obtain annotated object image characteristics of the annotated object in the initial annotated video frame, the initial annotated video frame is a video frame preset with annotated information in the to-be-processed video, and the annotated object is an object that contains annotated information in the initial annotated video frame.
  • the computer device determines the labeled object according to the object segmentation information of the initial labeled video frame (which is used to segment different objects in the initial labeled video frame), and then performs image feature extraction on the labeled object; moreover, when a convolutional neural network-based feature extractor is used to perform image feature extraction, the feature extractor that performs image feature extraction on video frames shares weights with the feature extractor that performs image feature extraction on the labeled object.
  • the computer device extracts the image feature f_a of the annotated object through the feature extractor 91 according to the initial annotation information y_0 corresponding to the initial annotated video frame x_0; in addition, the computer device uses the feature extractor 91 to perform image feature extraction on the video frames x_0 to x_t-1, and stores the extracted image features f_0 to f_t-1 in the memory pool 92.
  • the computer device obtains the image feature f_a of the labeling object, and obtains the candidate image feature f_p from the memory pool 92.
  • Step 709 Input the candidate image feature, the target image feature, and the annotated object image feature into the selection network to obtain the image feature score output by the selection network.
  • the computer device inputs the candidate image feature, the target image feature, and the annotation object image feature into a selection network, and the selection network outputs an image feature score based on the three.
  • the selection network includes two branches, a first selection branch and a second selection branch, wherein the first selection branch takes the results of the correlation operation performed on pairs of image features as input, the second selection branch takes the splicing of the three image features as input, the outputs of the first selection branch and the second selection branch are spliced and input to the fully connected layer of the selection network, and finally the fully connected layer outputs the image feature score.
  • this step may include the following steps.
  • before inputting the image features to the first selection branch, the computer device first performs a correlation operation on every two of the candidate image feature, the target image feature, and the annotated object image feature to obtain the associated image features.
  • the computer device performs the correlation operation to obtain the associated image features, including corr(f_p, f_a), corr(f_p, f_t), and corr(f_t, f_a).
  • the associated image features are stitched together, and the stitched associated image features are input to the first selection branch to obtain the first feature vector output by the first selection branch.
  • the computer device splices the three associated image features obtained after the correlation operation, inputs the spliced associated image features into the first selection branch, and the first selection branch performs further feature extraction on the spliced associated image features and finally outputs the first feature vector.
  • the first selection branch is based on a convolutional neural network, that is, the first selection branch outputs the first feature vector after performing convolution, pooling, and activation operations on the spliced associated image features.
  • the embodiment of the present application does not limit the specific structure of the first selection branch.
  • the computer device performs feature extraction on the spliced associated image features through the first selection branch to obtain the first feature vector 93.
  • the computer device splices the candidate image feature, the target image feature, and the annotated object image feature, and inputs the splicing result into the second selection branch, which performs further feature extraction and finally outputs the second feature vector.
  • the second selection branch is based on a convolutional neural network, that is, the second selection branch performs convolution, pooling, and activation operations on the spliced image features, and then outputs the second feature vector.
  • the embodiment of the present application does not limit the specific structure of the second selection branch.
  • the computer device performs feature extraction on the spliced image features through the second selection branch to obtain the second feature vector 94.
  • the image feature score is determined according to the first feature vector and the second feature vector.
  • the computer device splices the first feature vector and the second feature vector, and inputs the spliced feature vector into the fully connected layer to obtain the image feature score corresponding to the candidate image feature.
  • the computer device splices the first feature vector 93 and the second feature vector 94, and inputs the spliced feature vector into the fully connected layer 95, and the fully connected layer 95 outputs the candidate image feature f_p The image feature score.
  • the computer device cyclically executes the above steps 1 to 4, so as to obtain an image feature score between the target image feature and each candidate image feature.
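  • The following is a minimal, hypothetical PyTorch-style sketch of the two-branch selection network described above: a first branch that consumes the spliced correlation results corr(f_p, f_a), corr(f_p, f_t), and corr(f_t, f_a), a second branch that consumes the spliced raw features, and a fully connected layer that outputs the image feature score; the simplified correlation operator, channel sizes, and layer depths are assumptions, and only the overall two-branch structure follows the description.

```python
# Hypothetical sketch of the two-branch selection network scoring one candidate.
import torch
import torch.nn as nn

def correlate(fa: torch.Tensor, fb: torch.Tensor) -> torch.Tensor:
    # simplified correlation of two feature maps (C, H, W): affinity between every
    # spatial location of fa and fb, reshaped so source locations become channels
    C, H, W = fa.shape
    corr = fa.reshape(C, H * W).t() @ fb.reshape(C, H * W)   # (HW, HW)
    return corr.reshape(H * W, H, W)

class SelectionNetwork(nn.Module):
    def __init__(self, channels: int = 64, height: int = 8, width: int = 8):
        super().__init__()
        hw = height * width
        self.branch1 = nn.Sequential(                 # consumes spliced correlations
            nn.Conv2d(3 * hw, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.branch2 = nn.Sequential(                 # consumes spliced raw features
            nn.Conv2d(3 * channels, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.fc = nn.Linear(128 + 128, 1)             # fully connected scoring head

    def forward(self, f_p, f_t, f_a):
        corr = torch.cat([correlate(f_p, f_a), correlate(f_p, f_t),
                          correlate(f_t, f_a)])                    # splice correlations
        v1 = self.branch1(corr.unsqueeze(0))                       # first feature vector
        v2 = self.branch2(torch.cat([f_p, f_t, f_a]).unsqueeze(0)) # second feature vector
        return self.fc(torch.cat([v1, v2], dim=1)).squeeze()       # image feature score

if __name__ == "__main__":
    f_p, f_t, f_a = (torch.randn(64, 8, 8) for _ in range(3))
    print(SelectionNetwork()(f_p, f_t, f_a))   # a scalar score for this candidate
```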
  • Step 705 Determine the labeled video frame corresponding to the highest image feature score as the guide video frame.
  • the computer device obtains the image feature score corresponding to each candidate image feature through the above steps, and further determines the highest image feature score among them, thereby determining the labeled video frame to which the candidate image feature corresponding to the highest image feature score belongs as the guide video frame.
  • the computer device determines the video frame x_k as the guide video frame of the target video frame x_t according to the image feature score.
  • Step 706 Store the target image feature of the target video frame in the memory pool.
  • after determining the guide video frame of the target video frame, the computer device stores the target image feature of the target video frame in the memory pool, so that when information labeling is performed on subsequent video frames, the target image feature can be used as a reference.
  • Step 707 Input the guide video frame, the annotation information corresponding to the guide video frame, and the target video frame into the timing propagation network to obtain target annotation information output by the timing propagation network.
  • the computer device uses a pre-trained timing propagation network to propagate the label information of the guide video frame to the target video frame, and complete the information label of the target video frame.
  • the timing propagation network includes an image branch (also called the appearance branch) and a momentum branch (also called the motion branch).
  • the image branch takes the target video frame and the annotation information of the guide video frame as input, and is used to output the image information feature.
  • the image information feature is used to characterize the expected annotation information of the pixel in the target video frame;
  • the momentum branch takes the annotation information of the guide video frame and the video frame optical flow between the guide video frame and the target video frame as input, and is used to output the momentum feature (indicating the movement of objects in the video frame).
  • this step may include the following steps.
  • the image branch in the embodiment of the present application is initialized with a pre-trained VGG16 network.
  • the computer device inputs the annotation information y_g of the guide video frame x_g and the target video frame x_t into the image branch to obtain the image information feature f_app output by the image branch.
  • the guide video frame of the target video frame x_t is x_k
  • the computer device inputs the label information y_k of the guide video frame x_k and the target video frame x_t into the image branch 96 to obtain the image information feature 97 output by the image branch 96.
  • the optical flow of the video frame is a dense optical flow between video frames, that is, it is used to indicate the motion of the object corresponding to the pixel point of the same coordinate in the guide video frame and the target video frame.
  • the computer device determines the video frame optical flow W(x_t, x_g) between the guide video frame x_g and the target video frame x_t, and then inputs the video frame optical flow W(x_t, x_g) and the annotation information y_g of the guide video frame x_g into the momentum branch.
  • the guide video frame of the target video frame x_t is x_k
  • the computer device determines the optical flow W(x_t, x_k) between the guide video frame x_k and the target video frame x_t.
  • the optical flow of the video frame between the guide video frame and the target video frame is calculated by the pre-trained flownet2.0, and the momentum branch is initialized with the pre-trained VGG16 network.
  • there is no strict sequence between steps one and two; that is, steps one and two can be executed at the same time, which is not limited in this embodiment.
  • the computer device fuses the image information feature and the momentum feature, performs convolution processing on the fused feature through a convolution layer, and finally obtains the target annotation information of the target video frame.
  • after the computer device fuses the image information feature 97 and the momentum feature 99, it finally outputs the target annotation information y_t of the target video frame x_t through a convolutional layer (not shown in the figure).
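  • The following is a minimal, hypothetical sketch of the timing propagation step described in steps one to three above: an image (appearance) branch fed with the target video frame and the guide frame's annotation information, a momentum (motion) branch fed with the guide-to-target optical flow and the same annotation information, and a convolutional head that fuses the two features into the target annotation. Channel sizes, depths, and the single-channel mask annotation are assumptions; the description states that the branches are initialized from a pre-trained VGG16 and that the optical flow is computed by FlowNet2.0, neither of which is reproduced here.

```python
# Hypothetical sketch of the timing propagation network's forward pass.
import torch
import torch.nn as nn

def branch(in_channels: int, out_channels: int = 64) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(32, out_channels, 3, padding=1), nn.ReLU(inplace=True),
    )

class TimingPropagationNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.image_branch = branch(3 + 1)      # target frame x_t + guide annotation y_g
        self.momentum_branch = branch(2 + 1)   # optical flow W(x_t, x_g) + y_g
        self.head = nn.Conv2d(64 + 64, 1, 3, padding=1)  # fused features -> annotation

    def forward(self, x_t, y_g, flow):
        f_app = self.image_branch(torch.cat([x_t, y_g], dim=1))       # image information feature
        f_mot = self.momentum_branch(torch.cat([flow, y_g], dim=1))   # momentum feature
        return torch.sigmoid(self.head(torch.cat([f_app, f_mot], dim=1)))  # y_t

if __name__ == "__main__":
    x_t = torch.randn(1, 3, 64, 64)    # target video frame
    y_g = torch.rand(1, 1, 64, 64)     # annotation of the guide video frame
    flow = torch.randn(1, 2, 64, 64)   # optical flow between guide and target frames
    y_t = TimingPropagationNetwork()(x_t, y_g, flow)
    print(y_t.shape)                   # torch.Size([1, 1, 64, 64])
```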
  • the computer device inputs candidate image features, target image features, and annotated object image features into the selection network, and the two selection branches of the selection network perform feature extraction, thereby enriching the feature extraction dimension of image features and improving The accuracy of the image feature score obtained by subsequent calculations is improved.
  • the computer device uses the image branch and the momentum branch of the time-series propagation network to perform feature extraction respectively, and merge the features extracted from the two branches to finally obtain the target annotation information of the target video frame, which is helpful to improve The accuracy of information labeling.
  • when the adjacent video frame is always used as the guiding frame (chain-type propagation): when labeling the 35th frame, the 34th frame is used as the guiding frame, and the labeling accuracy is 0.44; when labeling the 55th frame, the 54th frame is used as the guiding frame, and the labeling accuracy is 0.28; when labeling the 125th frame, the 124th frame is used as the guiding frame, and the labeling accuracy is 0.22; when labeling the 155th frame, the 154th frame is used as the guiding frame, and the labeling accuracy is 0.23.
  • when the method provided in the embodiments of the present application is used: when labeling the 35th frame, the 34th frame is used as the guiding frame, and the labeling accuracy is 0.58; when labeling the 55th frame, the 37th frame is used as the guiding frame, and the labeling accuracy is 0.80; when labeling the 125th frame, the 102nd frame is used as the guiding frame, and the labeling accuracy is 0.92; when labeling the 155th frame, the 127th frame is used as the guiding frame, and the labeling accuracy is 0.86.
  • it can be seen that with chain-type propagation the accuracy of information labeling becomes lower and lower, while the method provided by the embodiments of this application does not use chain-type information propagation, so the information labeling accuracy is not affected by the depth of information propagation; moreover, compared with the related technologies, the method provided in the embodiments of the present application can significantly improve the accuracy of the labeling information.
  • the network training process includes the following steps:
  • Step 1201 Train a time-series propagation network according to the sample video, and the sample video frame in the sample video contains annotation information.
  • the computer device first uses the sample video containing the annotation information to train the time series propagation network, and then further trains the memory selection network based on the sample video and the completed time series propagation network.
  • the computer device randomly selects two frames from the sample video as the guide video frame and the target video frame to train the timing propagation network.
  • when the timing propagation network is used to implement object segmentation, the computer equipment uses the Intersection over Union (IOU) loss function to train the timing propagation network; when the timing propagation network is used to implement video coloring, the computer equipment uses the L1 regression loss function to train the timing propagation network.
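  • The two training losses just mentioned can be sketched as follows; the soft IoU formulation and the association of the IOU loss with segmentation-style annotation are illustrative assumptions, and only the use of an IOU loss and an L1 regression loss (for video coloring) comes from the description above.

```python
# Hypothetical loss sketches for the two training settings of the propagation network.
import torch

def iou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # pred, target: (N, 1, H, W) soft masks in [0, 1]
    intersection = (pred * target).sum(dim=(1, 2, 3))
    union = (pred + target - pred * target).sum(dim=(1, 2, 3))
    return (1.0 - (intersection + eps) / (union + eps)).mean()

def coloring_l1_loss(pred_color: torch.Tensor, target_color: torch.Tensor) -> torch.Tensor:
    # pred_color, target_color: (N, C, H, W) predicted vs. reference color channels
    return (pred_color - target_color).abs().mean()

if __name__ == "__main__":
    p, t = torch.rand(2, 1, 32, 32), torch.rand(2, 1, 32, 32)
    print(iou_loss(p, t).item(), coloring_l1_loss(p, t).item())
```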
  • Step 1202 For the target sample video frame in the sample video, the target sample video frame and other sample video frames in the sample video are input to the timing propagation network to obtain prediction sample annotation information output by the timing propagation network.
  • after completing the training of the time-series propagation network, the computer equipment further uses the trained time-series propagation network to generate training samples, and then uses the training samples to train the memory selection network.
  • the computer device traverses the video frames before the target sample video frame as the sample guiding video frames x_p (0 ≤ p ≤ t-1), inputs the target sample video frame x_t and the sample guiding video frame x_p into the time-series propagation network, and obtains the predicted sample label information y_tp output by the time-series propagation network.
  • Step 1203 Determine the sample guiding video frame in the sample video frame according to the predicted sample labeling information and the sample labeling information corresponding to the target sample video frame.
  • the computer device determines the guiding quality of the sample guiding video frame by comparing the predicted sample labeling information with the sample labeling information corresponding to the target sample video frame, and then classifies the positive and negative samples of the sample guiding video frame.
  • this step may include the following steps.
  • the computer device calculates the information accuracy between the predicted sample labeling information and the sample labeling information, where the higher the information accuracy, the closer the predicted sample labeling information is to the sample labeling information, and correspondingly, the higher the guiding quality of the sample guiding video frame corresponding to the predicted sample labeling information.
  • the computer device calculates the information accuracy s_tp of the predicted sample based on the predicted sample labeling information y_tp and the labeling information y_t of the target sample video frame x_t.
  • the first information accuracy corresponding to the positive sample guiding video frame is higher than the second information accuracy corresponding to the negative sample guiding video frame; the first information accuracy is the information accuracy when the target sample video frame is labeled according to the positive sample guiding video frame, and the second information accuracy is the information accuracy when the target sample video frame is labeled according to the negative sample guiding video frame.
  • if the information accuracy is greater than the first accuracy threshold, the computer device determines the sample guiding video frame as a positive sample guiding video frame (that is, suitable as a guiding video frame); if the information accuracy is less than the second accuracy threshold, the computer device determines the sample guiding video frame as a negative sample guiding video frame (that is, not suitable as a guiding video frame).
  • the first accuracy threshold is greater than or equal to the second accuracy threshold, for example, the first accuracy threshold is 0.8, and the second accuracy threshold is 0.4.
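  • The division into positive and negative sample guiding video frames can be sketched as follows, using the example thresholds 0.8 and 0.4 given above; treating frames whose information accuracy falls between the two thresholds as belonging to neither class is an assumption of this sketch.

```python
# Hypothetical sketch of dividing sample guiding video frames into positive and
# negative training samples by their information accuracy s_tp.
from typing import Optional

def classify_guide_sample(accuracy: float,
                          high_threshold: float = 0.8,
                          low_threshold: float = 0.4) -> Optional[str]:
    if accuracy > high_threshold:
        return "positive"        # suitable as a guiding video frame
    if accuracy < low_threshold:
        return "negative"        # not suitable as a guiding video frame
    return None                  # in-between samples are assigned to neither class here

if __name__ == "__main__":
    for s_tp in (0.92, 0.55, 0.17):
        print(s_tp, classify_guide_sample(s_tp))
    # 0.92 positive / 0.55 None / 0.17 negative
```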
  • Step 1204 Train the memory selection network according to the target sample video frame and the sample guide video frame.
  • the computer device inputs the target sample video frame and the sample guiding video frame into the memory selection network to obtain the prediction result output by the memory selection network, and the memory selection is performed according to the prediction result and the positive and negative attributes of the sample guiding video frame. Select the network for training.
  • the computer device may use a back propagation algorithm or a gradient descent algorithm to train the memory selection network, which is not limited in the embodiment of the present application.
  • in the embodiment of the present application, the computer device first trains the timing propagation network based on the sample video, then divides the sample video frames in the sample video into positive and negative samples based on the trained timing propagation network, and then uses the divided positive and negative samples to train the memory selection network, without requiring users to manually label the positive and negative attributes of the training samples in advance, which reduces the difficulty of obtaining training samples and helps improve the accuracy of training sample division, thereby improving the quality of guide frame selection of the memory selection network.
  • Fig. 13 is a structural block diagram of a device for labeling information of video frames provided by an exemplary embodiment of the present application. As shown in Fig. 13, the device includes:
  • the obtaining module 1301 is used to obtain the to-be-processed video
  • the feature extraction module 1302 is configured to perform feature extraction on the target video frame of the target video frame in the to-be-processed video to obtain the target image feature of the target video frame;
  • the guide frame determination module 1303 is configured to determine the guide video frame of the target video frame from the marked video frame according to the degree of image feature matching between the target video frame and the marked video frame, and the marked video frame Belonging to the to-be-processed video, the guide video frame is used to guide the target video frame to perform information labeling, and the image feature matching degree is the match between the target image feature and the corresponding image feature of the labeled video frame Degree, and the image feature matching degree of the guide video frame and the target video frame is higher than the image feature matching degree of other marked video frames and the target video frame;
  • the generating module 1304 is configured to generate target label information corresponding to the target video frame according to the label information corresponding to the guide video frame.
  • the guiding frame determining module 1303 includes:
  • the first acquiring unit is configured to acquire candidate image features from a memory pool of a memory selection network, the memory selection network including the memory pool and a selection network, and the memory pool stores the image features of the marked video frame ;
  • the feature scoring unit is used to input the candidate image feature and the target image feature into the selection network to obtain the image feature score output by the selection network, and the image feature score is used to indicate the matching degree between the candidate image feature and the target image feature;
  • a determining unit configured to determine the marked video frame corresponding to the highest image feature score as the guide video frame
  • the device also includes:
  • the storage module is configured to store the target image characteristics of the target video frame in the memory pool.
  • the guide frame determining module 1303 further includes:
  • the second acquiring unit is configured to acquire the image features of the annotated object of the annotated object in the initial annotated video frame, where the initial annotated video frame is a video frame preset with annotation information in the to-be-processed video, and the annotated object is all The object that contains the annotation information in the initial annotated video frame;
  • the feature scoring unit is also used for:
  • the candidate image feature, the target image feature, and the annotation target image feature are input into the selection network to obtain the image feature score output by the selection network.
  • the selection network includes a first selection branch and a second selection branch;
  • the feature scoring unit is also used for:
  • the image feature score is determined according to the first feature vector and the second feature vector.
  • the first acquiring unit is configured to:
  • the candidate image feature corresponding to the marked video frame is obtained from the memory pool every predetermined number of frames, or all the candidate image features corresponding to the marked video frame are obtained from the memory pool.
  • the generating module 1304 is configured to:
  • the guide video frame, the annotation information corresponding to the guide video frame, and the target video frame are input into a timing propagation network to obtain the target annotation information output by the timing propagation network.
  • the timing propagation network includes an image branch and a momentum branch
  • the generating module 1304 includes:
  • the first output unit is configured to input the annotation information corresponding to the guide video frame and the target video frame into the image branch to obtain the image information characteristics output by the image branch;
  • the second output unit is configured to determine the optical flow of the video frame between the guide video frame and the target video frame; input the annotation information corresponding to the optical flow of the video frame and the guide video frame into the momentum branch, Obtain the momentum characteristics output by the momentum branch;
  • the determining unit is configured to determine the target label information according to the image information feature and the momentum feature.
  • the device further includes:
  • the first training module is configured to train the time-series propagation network according to sample videos, and sample video frames in the sample videos contain annotation information;
  • the annotation information prediction module is configured to, for the target sample video frame in the sample video, input the target sample video frame and other sample video frames in the sample video into the time-series propagation network to obtain the predicted sample label information output by the time-series propagation network;
  • a sample determination module configured to determine the sample guide video frame in the sample video frame according to the predicted sample labeling information and the sample labeling information corresponding to the target sample video frame;
  • the second training module is configured to train the memory selection network according to the target sample video frame and the sample guide video frame.
  • the sample determination module includes:
  • a calculation unit configured to calculate the accuracy of information between the predicted sample labeling information and the sample labeling information
  • a determining unit configured to determine a positive sample guiding video frame and a negative sample guiding video frame in the sample video frame according to the accuracy of the information
  • wherein the first information accuracy corresponding to the positive sample guiding video frame is higher than the second information accuracy corresponding to the negative sample guiding video frame; the first information accuracy is the information accuracy when information labeling is performed on the target sample video frame according to the positive sample guiding video frame, and the second information accuracy is the information accuracy when information labeling is performed on the target sample video frame according to the negative sample guiding video frame.
  • in the embodiments of the present application, when the target video frame in the video to be processed is annotated, feature extraction is performed on the target video frame to obtain the target image feature of the target video frame; according to the image feature matching degree between the target video frame and the marked video frames in the video to be processed, the guide video frame corresponding to the target video frame is determined from the marked video frames, and the target annotation information of the target video frame is generated based on the annotation information of the guide video frame; in the embodiments of the present application, the marked video frame with a high degree of image feature matching with the target video frame is selected as the guide video frame, instead of directly selecting the adjacent video frame as the guide video frame, which improves the selection quality of the guide video frame and further improves the accuracy of the generated annotation information; moreover, the propagation error of the annotation information will not accumulate, thereby improving the propagation quality of the annotation information.
  • the information labeling device for video frames provided in the above embodiment is illustrated only by the division of the above functional modules as an example; in practical applications, the above functions can be allocated to different functional modules as required, that is, the internal structure of the device is divided into different functional modules to complete all or part of the functions described above.
  • the device for labeling information of video frames provided by the above-mentioned embodiments belongs to the same concept as the embodiments of the method for labeling information of video frames; for the specific implementation process, refer to the method embodiments, which will not be repeated here.
  • the computer device 1400 includes a central processing unit (CPU) 1401, a system memory 1404 including a random access memory (RAM) 1402 and a read-only memory (ROM) 1403, and a system bus 1405 connecting the system memory 1404 and the central processing unit 1401.
  • the computer device 1400 also includes a basic input/output system (I/O system) 1406 that helps to transfer information between various components in the computer, and a mass storage device 1407 for storing an operating system 1413, application programs 1414, and other program modules 1415.
  • the basic input/output system 1406 includes a display 1408 for displaying information and an input device 1409 such as a mouse and a keyboard for the user to input information.
  • the display 1408 and the input device 1409 are both connected to the central processing unit 1401 through the input and output controller 1410 connected to the system bus 1405.
  • the basic input/output system 1406 may also include an input and output controller 1410 for receiving and processing input from multiple other devices such as a keyboard, a mouse, or an electronic stylus.
  • the input and output controller 1410 also provides output to a display screen, a printer, or other types of output devices.
  • the mass storage device 1407 is connected to the central processing unit 1401 through a mass storage controller (not shown) connected to the system bus 1405.
  • the mass storage device 1407 and its associated computer-readable medium provide non-volatile storage for the computer device 1400. That is, the mass storage device 1407 may include a computer-readable medium (not shown) such as a hard disk or a CD-ROM drive.
  • the computer-readable media may include computer storage media and communication media.
  • Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storing information such as computer readable instructions, data structures, program modules or other data.
  • Computer storage media include RAM, ROM, EPROM, EEPROM, flash memory or other solid-state storage technologies, CD-ROM, DVD or other optical storage, tape cartridges, magnetic tape, disk storage or other magnetic storage devices.
  • (RAM: random access memory; ROM: read-only memory; EPROM: erasable programmable read-only memory; EEPROM: electrically erasable programmable read-only memory)
  • Of course, the computer storage medium is not limited to the above. The system memory 1404 and the mass storage device 1407 may be collectively referred to as the memory.
  • The memory stores one or more programs configured to be executed by one or more central processing units 1401; the one or more programs contain instructions for implementing the above methods, and the central processing unit 1401 executes the one or more programs to implement the methods provided in the foregoing method embodiments.
  • According to various embodiments of the present application, the computer device 1400 may also run through a remote computer connected to a network such as the Internet. That is, the computer device 1400 may be connected to the network 1412 through the network interface unit 1411 connected to the system bus 1405, or the network interface unit 1411 may be used to connect to other types of networks or remote computer systems (not shown).
  • The memory further includes one or more programs, which are stored in the memory and contain instructions for performing the steps executed by the computer device in the methods provided in the embodiments of the present application.
  • An embodiment of the present application also provides a computer-readable storage medium storing at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to implement the information labeling method for video frames described in any of the foregoing embodiments.
  • the present application also provides a computer program product or computer program.
  • the computer program product or computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium.
  • The processor of the computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device performs the information labeling method for video frames provided in the foregoing embodiments.
  • All or part of the steps of the methods in the above embodiments can be completed by a program instructing relevant hardware, and the program can be stored in a computer-readable storage medium.
  • the medium may be a computer-readable storage medium included in the memory in the foregoing embodiment; or may be a computer-readable storage medium that exists alone and is not assembled into the terminal.
  • the computer-readable storage medium stores at least one instruction, at least one program, code set or instruction set, and the at least one instruction, the at least one program, the code set or the instruction set is loaded and executed by the processor In order to realize the information labeling method of the video frame described in any of the foregoing method embodiments.
  • the computer-readable storage medium may include: read only memory (ROM, Read Only Memory), random access memory (RAM, Random Access Memory), solid state drive (SSD, Solid State Drives), optical disks, and the like.
  • random access memory may include resistive random access memory (ReRAM, Resistance Random Access Memory) and dynamic random access memory (DRAM, Dynamic Random Access Memory).
  • A person of ordinary skill in the art can understand that all or part of the steps of the above embodiments can be completed by hardware, or by a program instructing relevant hardware; the program can be stored in a computer-readable storage medium, and the storage medium mentioned above can be a read-only memory, a magnetic disk, an optical disc, or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method, apparatus, device and storage medium for labeling information of video frames. The method includes: obtaining a video to be processed (601); for a target video frame in the video to be processed, performing feature extraction on the target video frame to obtain a target image feature of the target video frame (602); determining a guide video frame of the target video frame from labeled video frames according to an image feature matching degree between the target video frame and the labeled video frames, the guide video frame being used to guide information labeling of the target video frame, and the image feature matching degree being the matching degree between the target image feature and the image features corresponding to the labeled video frames (603); and generating target labeling information corresponding to the target video frame according to the labeling information corresponding to the guide video frame (604). By improving the selection quality of the guide video frame, the accuracy of the generated labeling information is improved, and the propagation errors of the labeling information do not accumulate, which improves the propagation quality of the labeling information.

Description

视频帧的信息标注方法、装置、设备及存储介质
本申请实施例要求于2019年08月29日提交,申请号为201910807774.8、发明名称为“视频帧的信息标注方法、装置、设备及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请实施例中。
技术领域
本申请实施例涉及人工智能领域,特别涉及一种视频帧的信息标注方法、装置、设备及存储介质。
背景技术
视频标注信息传播是视频处理领域的一项重要技术,常被用于执行视频物体追踪以及灰度视频上色等任务。
相关技术中,通常采用基于深度学习的方法,基于卷积神经网络对视频帧之间的像素关系进行建模,使视频帧之间的标注信息通过像素之间的关系进行传播。而采用上述方法时,通常使用卷积神经网络对相邻视频帧进行建模,相应的,使用构建得到的模型进行标注信息传播时,即将当前视频帧的上一帧视频帧确定为引导视频帧,从而通过模型将引导视频帧的标注信息传递给当前视频帧。
然而,采用上述方法将相邻视频帧作为引导视频帧时,若某一视频帧因物体遮挡、快速运动等原因造成标注信息缺失,将直接影响到后续所有视频帧的信息传播,且标注信息的传播误差将不断累积,导致标注信息的传播效果较差。
发明内容
本申请实施例提供了一种视频帧的信息标注方法、装置、设备及存储介质,可以提高对视频帧进行信息标注时生成的标注信息的准确性。所述技术方案如下:
一方面,本申请实施例提供了一种视频帧的信息标注方法,所述方法应用于计算机设备,所述方法包括:
获取待处理视频;
对于所述待处理视频中的目标视频帧,对所述目标视频帧进行特征提取,得到所述目标视频帧的目标图像特征;
根据所述目标视频帧与已标注视频帧的图像特征匹配度,从所述已标注视频帧中确定所述目标视频帧的引导视频帧,所述已标注视频帧属于所述待处理视频,所述引导视频帧用于引导所述目标视频帧进行信息标注,所述图像特征匹配度为所述目标图像特征与所述已标注视频帧对应图像特征之间的匹配度,且所述引导视频帧与所述目标视频帧的图像特征匹配度高于其它已标注视频帧与所述目标视频帧的图像特征匹配度;
根据所述引导视频帧对应的标注信息,生成所述目标视频帧对应的目标标注信息。
另一方面,本申请实施例提供了一种视频帧的信息标注装置,所述装置包括:
获取模块,用于获取待处理视频;
特征提取模块,用于对于所述待处理视频中的目标视频帧,对所述目标视频帧进行特征提取,得到所述目标视频帧的目标图像特征;
引导帧确定模块,用于根据所述目标视频帧与已标注视频帧的图像特征匹配度,从所述已标注视频帧中确定所述目标视频帧的引导视频帧,所述已标注视频帧属于所述待处理视频,所述引导视频帧用于引导所述目标视频帧进行信息标注,所述图像特征匹配度为所述目标图像特征与所述已标注视频帧对应图像特征之间的匹配度,且所述引导视频帧与所述目标视频帧的图像特征匹配度高于其它已标注视频帧与所述目标视频帧的图像特征匹配度;
生成模块,用于根据所述引导视频帧对应的标注信息,生成所述目标视频帧对应的目标标注信息。
另一方面,本申请实施例提供了一种计算机设备,所述计算机设备包括处理器和存储器,所述存储器中存储有至少一条指令、至少一段程序、代码集或指令集,所述至少一条指令、所述至少一段程序、所述代码集或指令集由所述处理器加载并执行以实现如上述方面所述的视频帧的信息标注方法。
另一方面,提供了一种计算机可读存储介质,所述可读存储介质中存储有至少一条指令、至少一段程序、代码集或指令集,所述至少一条指令、所述至少一段程序、所述代码集或指令集由处理器加载并执行以实现如上述方面所述的视频帧的信息标注方法。
另一方面，提供了一种计算机程序产品或计算机程序，该计算机程序产品或计算机程序包括计算机指令，该计算机指令存储在计算机可读存储介质中。计算机设备的处理器从计算机可读存储介质读取该计算机指令，处理器执行该计算机指令，使得该计算机设备执行上述方面提供的视频帧的信息标注方法。
本申请实施例提供的技术方案带来的有益效果至少包括:
对待处理视频中的目标视频帧进行信息标注时,通过对目标视频帧进行特征提取,得到目标视频帧的目标图像特征,并根据目标视频帧与待处理视频中已标注视频帧的图像特征匹配度,从已标注视频帧中确定出目标视频帧对应的引导视频帧,从而基于引导视频帧的标注信息生成目标视频帧的目标标注信息;本申请实施例中,基于目标视频帧的图像特征,选取与目标视频帧具有高图像特征匹配度的已标注视频帧作为引导视频帧,而非直接选取相邻视频帧作为引导视频帧,提高了引导视频帧的选取质量,进而提高了生成的标注信息的准确性;并且,标注信息的传播误差不会累积,进而提高了标注信息的传播质量。
附图说明
图1是采用相关技术以及本申请实施例提供的方法对视频中的物体进行标注的实施示意图;
图2是本申请实施例提供的视频帧的信息标注方法的原理示意图;
图3是对视频中对象进行自动跟踪定位过程的界面示意图;
图4是对灰度视频进行上色过程的界面示意图;
图5示出了本申请一个示例性实施例提供的实施环境的示意图;
图6示出了本申请一个示例性实施例提供的视频帧的信息标注方法的流程图;
图7示出了本申请另一个示例性实施例提供的视频帧的信息标注方法的流程图;
图8示出了本申请另一个示例性实施例提供的视频帧的信息标注方法的流程图;
图9是图8所示视频帧的信息标注方法的实施示意图;
图10是通过选择网络的第一选择分支进行特征提取的实施示意图;
图11是相关技术与本申请实施例中标注信息准确率的对比图;
图12是一个示例性实施例提供的网络训练过程的流程图;
图13是本申请一个示例性实施例提供的视频帧的信息标注装置的结构框图;
图14示出了本申请一个示例性实施例提供的计算机设备的结构示意图。
具体实施方式
为使本申请的目的、技术方案和优点更加清楚,下面将结合附图对本申请实施方式作进一步地详细描述。
计算机视觉技术(Computer Vision,CV)是一门研究如何使机器“看”的科学,更进一步的说,就是指用摄影机和电脑代替人眼对目标进行识别、跟踪和测量等机器视觉,并进一步做图形处理,使电脑处理成为更适合人眼观察或传送给仪器检测的图像。作为一个科学学科,计算机视觉研究相关的理论和技术,试图建立能够从图像或者多维数据中获取信息的人工智能系统。计算机视觉技术通常包括图像处理、图像识别、图像语义理解、图像检索、光学字符识别(Optical Character Recognition,OCR)、视频处理、视频语义理解、视频内容/行为识别、三维物体重建、3D技术、虚拟现实、增强现实、同步定位与地图构建等技术,还包括常见的人脸识别、指纹识别等生物特征识别技术。本申请实施例提供的方法涉及计算机视觉技术在视频信息标注领域的应用。
对于视频帧序列x_t(t=0,1,…,T),当预先为其中某一视频帧设置标注信息时,根据该标注信息为视频帧序列中其它视频帧设置标注信息的过程即为视频标注信息传播。比如,预先为视频帧序列中的第0帧视频帧x_0设置标注信息y_0,根据标注信息y_0依次求解第1帧视频帧的标注信息y_1,第2帧视频帧的标注信息y_2直至第T帧视频帧的标注信息y_T,这个过程即为视频标注信息传播。
相关技术中,使用相邻视频帧作为当前视频帧的引导视频帧,并利用引导视频帧中的标注信息为当前视频帧生成标注信息。比如,当视频帧序列x_t中的第0帧视频帧包含标注信息y_0时,第1帧视频帧以第0帧视频帧为引导视频帧,从而根据标注信息y_0生成第1帧视频帧的标注信息y_1;第2帧视频 帧以第1帧视频帧为引导视频帧,从而根据标注信息y_1生成第2帧视频帧的标注信息y_2,以此类推,直至为各个视频帧均设置标注信息。
然而,采用上述方式时,标注信息在传播过程中,传播误差会不断累积,尤其是在某些视频帧中待标注的物体被遮挡或暂时离开时,可能会导致无法为该视频帧之后所有的视频帧设置正确的标注信息,最终影响标注信息的传播效果。
示意性的,如图1中的(a)所示,采用人工标注的方式,为视频帧序列中的第0帧设置标注信息,该标注信息用于标记视频帧中的对象11(图1中白色线条围合的区域)。若以相邻视频帧作为引导视频帧进行标注信息传播,由于第50帧视频帧中对象11脱离视频帧画面,因此从第50帧视频帧开始,均无法为视频帧设置正确的标注信息。然而,实际上从第75帧开始,对象11逐渐进入视频帧画面。
为了提高视频中标注信息的准确性,如图2所示,本申请实施例中,当需要对待处理视频21中的目标视频帧t进行信息标注时,并非直接将目标视频帧t的相邻视频帧(即视频帧t-1)作为引导视频帧,而是首先对目标视频帧t进行特征提取,得到目标视频帧t的目标图像特征22。基于缓存的已标注视频帧的图像特征23以及目标图像特征22,计算机设备从已标注视频帧中,选取与目标视频帧t具有高图像特征匹配度的视频帧作为引导视频帧24。进一步的,计算机设备根据引导视频帧24、引导视频帧的标注信息25以及目标视频帧t,为目标视频帧t生成目标标注信息26,完成对目标视频帧t的信息标注。
示意性的,如图1中的(b)所示,采用本申请实施例提供的方法,在为第75帧视频帧生成标记信息时,计算机设备并未将第74帧视频帧确定为引导视频帧,而是基于图像特征的匹配度将第35帧视频帧确定为引导视频帧,进而在基于该引导视频帧的标注信息,在第75帧视频帧中标记出对象11;类似的,在为第100帧视频帧生成标记信息时,计算机设备将第98帧视频帧确定为引导视频帧,并最终在第100帧视频帧中标记出对象11。可见,采用本申请实施例提供的方法,即便中间视频帧中待标注的物体被遮挡或暂时离开,计算机设备也能够对后续视频帧进行准确标注。并且,由于标注信息并非在视频帧之间链式传输,因此能够避免传播过程中产生的传播误差,进而提高了视频帧的标注准确性。
下面对本申请实施例提供的视频帧的信息标注方法的应用场景进行示意性说明。
1、视频对象的自动跟踪定位
该应用场景下,本申请实施例提供的视频帧的信息标注方法可以应用于室内监控应用程序、道路监控应用程序、停车场监控应用程序等具有视频对象自动跟踪定位功能的应用程序。进行对象自动跟踪定位时,用户首先将视频导入应用程序,然后在视频的某一视频帧中标记出需要自动跟踪定位的对象,由应用程序根据初始标注信息,为视频中的其它视频帧生成标注信息,并进一步根据标注信息在各帧视频帧中标记显示出自动跟踪定位的对象。
示意性的,如图3所示,将视频导入应用程序后,应用界面显示视频中的第一视频帧,并提示用户通过框选的方式标记出需要跟踪的对象。用户使用线框31框选出需要跟踪的对象“狗”,并点击开始跟踪控件后,应用程序即根据第一视频帧及其标注信息,按序为视频中的各个视频帧生成标注信息,并根据生成的标注信息,使用线框31对视频帧中的狗进行实时框选显示。
2、灰度(黑白)视频的自动上色
该应用场景下,本申请实施例提供的视频帧的信息标注方法可以应用于具有视频上色功能的应用程序,比如视频编辑应用程序。进行视频上色时,用户首先对灰度视频中的某一图像帧进行上色,然后将包含初始色彩信息的灰度视频输入应用程序,由应用程序根据初始色彩标注信息,为视频中的其它视频帧生成色彩标注信息,并进一步根据生成的色彩标注信息对各帧视频帧进行上色,最终输出彩色视频。
示意性的,如图4所示,用户首先对灰度视频中的第一视频帧进行上色(分别对人41和狗42进行上色),然后将上色后的灰度视频输入应用程序,由应用程序根据第一视频帧的色彩标注信息,按序为视频中的各个视频帧生成色彩标注信息,并根据色彩标注信息对各帧视频帧中的人41和狗42进行上色,最终输出彩色视频。
当然,除了应用于上述场景外,本申请实施例提供方法还可以应用于其他需要对视频中的标注信息进行传播的场景,本申请实施例并不对具体的应用场景进行限定。
本申请实施例提供的视频帧的信息标注方法可以应用于终端或者服务器 等计算机设备中。在一种可能的实施方式中,本申请实施例提供的视频帧的信息标注方法可以实现成为应用程序或应用程序的一部分,并被安装到终端中,使终端具备自动为视频中的视频帧设置标注信息的功能;或者,可以应用于应用程序的后台服务器中,从而由服务器为终端中的应用程序提供视频帧的信息标注功能。
请参考图5,其示出了本申请一个示例性实施例提供的实施环境的示意图。该实施环境中包括终端510和服务器520,其中,终端510与服务器520之间通过通信网络进行数据通信,可选地,通信网络可以是有线网络也可以是无线网络,且该通信网络可以是局域网、城域网以及广域网中的至少一种。
终端510中安装有具有视频帧信息标注需求的应用程序。该应用程序可以是监控类应用程序、视频上色类应用程序等等,本申请实施例对此不作限定。可选的,终端510可以是手机、平板电脑、膝上便携式笔记本电脑、视障人士辅助设备等移动终端,也可以是台式电脑、投影式电脑等终端,本申请实施例对此不做限定。
服务器520可以实现为一台服务器,也可以实现为一组服务器构成的服务器集群,其可以是物理服务器,也可以实现为云服务器。在一种可能的实施方式中,服务器520是终端510中应用程序的后台服务器。
如图5所示,本申请实施例中,服务器520中设置有预先训练的记忆选择网络521和时序传播网络522,其中,记忆选择网络521用于从已标注的视频帧中选取待标注视频帧的引导视频帧,而时序传播网络522则用于根据记忆选择网络521选取的引导视频帧为待标注视频帧生成标注信息。
在一种可能的应用场景下,服务器520通过记忆选择网络521和时序传播网络522为待处理视频帧中各个视频帧生成标注信息后,将标注信息反馈给终端510,由终端510根据标注信息对视频进行处理,从而对处理后的视频进行显示。其中,当标注信息为物体分割信息时,终端510即根据物体分割信息对各个视频帧中的目标物体进行框选显示;当标注信息为色彩信息时,终端510即根据色彩信息对视频帧中的各个对象进行上色。
在其他可能的实施方式中,上述记忆选择网络521和时序传播网络522也可以实现成为应用程序的部分或全部,相应的,终端510可以在本地为视频帧进行信息标注,而无需借助服务器520,本实施例对此不作限定。
为了方便表述,下述各个实施例以视频帧的信息标注方法由计算机设备执 行为例进行说明。
请参考图6,其示出了本申请一个示例性实施例提供的视频帧的信息标注方法的流程图。本实施例以该方法用于计算机设备为例进行说明,该方法包括如下步骤。
步骤601,获取待处理视频。
其中,该待处理视频可以是实时流媒体视频,拍摄的视频或者下载的视频,本申请实施例对此不作限定。
在一种可能的实施方式中,该待处理视频中包含初始标注视频帧,该初始标注视频帧为预设有标注信息的视频帧。其中,初始标注视频帧的标注信息可以由用户手动设置,且初始标注视频帧的数量为至少一帧。
在一个示意性的例子中,待处理视频的第0帧视频帧为初始标注视频帧。当然,在其他可能的实现方式中,初始标注视频帧也可以是非第0帧(即非首帧),而是待处理视频中的任意一帧(比如图像内容最丰富的一帧,或者,包含所有待标注对象的一帧),本申请实施例对此不作限定。
可选的,当需要对待处理视频进行上色,即待处理视频为灰度视频时,该标注信息可以是视频帧的色彩信息,比如,标注信息为视频帧中各个像素点的红绿蓝(Red-Green-Blue,RGB)值;当需要对待处理视频中的物体进行跟踪定位时,该标注信息可以是物体分割信息,比如,标注信息为视频帧中目标物体对应像素点的像素点坐标。除了上述表现形式的标注信息外,该标注信息还可以采用其他表现形式,本申请实施例对此不作限定。
步骤602,对于待处理视频中的目标视频帧,对目标视频帧进行特征提取,得到目标视频帧的目标图像特征。
在一种可能的实施方式中,计算机设备按序对待处理视频中的各个视频帧生成标注信息,目标视频帧即计算机设备当前处理的视频帧。比如,待处理视频中初始标注视频为第0帧视频帧,计算机设备即从第1帧视频帧开始,以此对各帧视频帧生成标注信息。
不同于相关技术中直接将目标视频帧的相邻视频帧(比如目标视频帧的前一帧视频帧)作为引导视频帧,本申请实施例中,计算机设备首先对目标视频帧进行特征提取,从而得到目标视频帧的目标图像特征。
可选的,计算机设备将目标视频帧输入预训练的特征提取网络,得到特征 提取网络输出的目标图像特征,其中,该特征提取网络可以是基于深度卷积神经网络训练得到,比如,该特征提取网络可以采用视觉几何组(Visual Geometry Group,VGG)网络结构,且输出的特征图(即目标图像特征)的尺寸为输入视频帧的1/32。本申请实施例并不对提取图像特征的具体方式进行限定。
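For illustration, a minimal PyTorch sketch of the kind of feature extractor described above, assuming a VGG-style stack of convolutions; the class name, channel widths and layer counts are assumptions, and the only property taken from the text is that the output feature map is 1/32 of the input frame size.

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """VGG-style backbone: five stride-2 stages -> 1/32-resolution feature map."""
    def __init__(self, in_ch=3, width=64):
        super().__init__()
        layers, ch = [], in_ch
        for stage in range(5):                       # 2^5 = 32x downsampling
            out = min(width * 2 ** stage, 512)
            layers += [nn.Conv2d(ch, out, 3, padding=1), nn.ReLU(inplace=True),
                       nn.Conv2d(out, out, 3, padding=1), nn.ReLU(inplace=True),
                       nn.MaxPool2d(2)]              # halves H and W
            ch = out
        self.body = nn.Sequential(*layers)

    def forward(self, frame):                        # frame: (B, 3, H, W)
        return self.body(frame)                      # (B, C, H/32, W/32)

if __name__ == "__main__":
    f_t = FeatureExtractor()(torch.randn(1, 3, 224, 224))
    print(f_t.shape)                                 # torch.Size([1, 512, 7, 7])
```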
步骤603,根据目标视频帧与已标注视频帧的图像特征匹配度,从已标注视频帧中确定目标视频帧的引导视频帧,已标注视频帧属于待处理视频,引导视频帧用于引导目标视频帧进行信息标注,图像特征匹配度为目标图像特征与已标注视频帧对应图像特征之间的匹配度,且引导视频帧与目标视频帧的图像特征匹配度高于其它已标注视频帧与目标视频帧的图像特征匹配度。
在一种可能的实施方式中,计算机设备中缓存有各个已标注视频帧对应的图像特征(即实现记忆功能),选取引导视频帧时,即计算目标图像特征与各个已标注视频帧对应图像特征之间的匹配度,得到目标视频帧与各个已标注视频帧之间的图像特征匹配度,进而根据图像特征匹配度确定引导视频帧(即实现选择功能)。
在一个示意性的例子中,待处理视频中的第0帧视频帧为初始标注视频帧,第1帧视频帧的引导视频帧即为第0帧视频帧,第2帧视频帧的引导视频帧即从第0、1视频帧中确定得到,以此类推,第n帧视频帧的引导视频帧即从第0至n-1帧视频帧中确定得到。
需要说明的是,本示例仅以从目标视频帧之前的视频帧中确定引导视频帧为例进行说明,在其他可能的实现方式中,也可以从目标视频帧之后的视频帧(已完成信息标注)中确定引导视频帧,本实施例对此不作限定。
步骤604,根据引导视频帧对应的标注信息,生成目标视频帧对应的目标标注信息。
进一步,计算机设备根据引导视频帧对应的标注信息,生成目标视频帧对应的目标标注信息。
可选的,对于目标视频帧x_t,若其对应的引导视频帧为x_g,且引导视频帧x_g对应标注信息y_g,则目标视频帧x_t对应标注信息y_t的求解过程可以被表示为:
y_t=P(x_t,x_g,y_g)
其中,P基于卷积神经网络构建得到。
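The overall flow of steps 601 to 604 can be sketched as below; `extract`, `select_guide` and `propagate` are placeholder names for the feature extractor, the guide-frame selection and the propagation model P, introduced only to show that each frame picks its own guide from the already-labeled frames rather than chaining labels frame by frame.

```python
from typing import Callable, Dict, List

def annotate_video(frames: List, y0, extract: Callable,
                   select_guide: Callable, propagate: Callable) -> Dict[int, object]:
    """Label every frame of `frames`, given labels y0 for frame 0.

    extract(frame)              -> image feature
    select_guide(f_t, memory)   -> index g of the best already-labeled frame
    propagate(x_t, x_g, y_g)    -> labels for x_t (the model P in the text)
    """
    labels = {0: y0}
    memory = {0: extract(frames[0])}          # features of labeled frames
    for t in range(1, len(frames)):
        f_t = extract(frames[t])
        g = select_guide(f_t, memory)         # highest feature-matching score
        labels[t] = propagate(frames[t], frames[g], labels[g])
        memory[t] = f_t                       # frame t is now labeled
    return labels
```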
综上所述,本申请实施例中,对待处理视频中的目标视频帧进行信息标注 时,通过对目标视频帧进行特征提取,得到目标视频帧的目标图像特征,并根据目标视频帧与待处理视频中已标注视频帧的图像特征匹配度,从已标注视频帧中确定出目标视频帧对应的引导视频帧,从而基于引导视频帧的标注信息生成目标视频帧的目标标注信息;本申请实施例中,基于目标视频帧的图像特征,选取与目标视频帧具有高图像特征匹配度的已标注视频帧作为引导视频帧,而非直接选取相邻视频帧作为引导视频帧,提高了引导视频帧的选取质量,进而提高了生成的标注信息的准确性;并且,标注信息的传播误差不会累积,进而提高了标注信息的传播质量。
在一种可能的实施方式中,本申请实施例中,计算机设备中存储有预先构建的记忆选择网络(Memory Selection Network,MSN),相应的,在确定目标视频帧的引导视频帧时,对于提取到的目标图像特征,计算机设备将目标图像特征输入记忆选择网络,由记忆选择网络从已标注视频帧中,选取一帧已标注视频帧作为引导视频帧并输出。
可选的,记忆选择网络采用“记忆池+选择网络”的结构,其中,记忆池中存储有已标注视频帧的图像特征,而选择网络则用于根据记忆池中存储的图像特征以及目标视频帧的目标图像特征,从已标注视频帧中选取引导视频帧。并且,计算机设备中还包括时序传播网络(Temporal Propagation Network,TPN),目标视频帧的信息标注由该时序传播网络执行。下面结合上述两个网络对视频帧的信息标注过程进行说明。
请参考图7,其示出了本申请另一个示例性实施例提供的视频帧的信息标注方法的流程图。本实施例以该方法用于计算机设备为例进行说明,该方法包括如下步骤。
步骤701,获取待处理视频。
步骤702,对于待处理视频中的目标视频帧,对目标视频帧进行特征提取,得到目标视频帧的目标图像特征。
步骤701至702的实施方式可以参考步骤601至602，本实施例在此不再赘述。
步骤703,从记忆池中获取候选图像特征。
在一种可能的实施方式中,记忆池中顺序存储已标注视频帧对应的候选图像特征,相应的,计算机设备按序从记忆池中获取候选图像特征。比如,当目 标图像帧为第t帧图像帧时,记忆池中顺序存储有第0至第t-1帧图像帧的图像特征,计算机设备按照0到t-1的顺序从记忆池中获取候选图像特征。
其中,在信息标注初始阶段,该记忆池中存储初始标注视频帧的图像特征,相应的,计算机设备在标注初始阶段,直接将初始标注视频帧作为引导视频帧。
步骤704,将候选图像特征和目标图像特征输入选择网络,得到选择网络输出的图像特征评分,图像特征评分用于指示候选图像特征与目标图像特征之间的图像特征匹配度。
在一种可能的实施方式中,该选择网络为轻量级的卷积神经网络,用于根据输入的图像特征输出图像特征之间的图像特征评分,其中,图像特征评分越高,表示图像特征之间的匹配度越高,相应的,该候选图像特征对应的已标注视频帧作为引导视频帧时的信息传播效果越好,信息标注的准确性越高。
可选的,对于记忆池中的各个候选图像特征,计算机设备均通过上述步骤获取其对应的图像特征评分。
然而,随着信息标注的不断执行,记忆池中的图像特征将不断增多(即已标注视频帧的数量不断增多),若对记忆池中的所有候选图像特征均进行遍历,后续视频帧的信息标注效率将逐步降低。
为了进一步提高信息标注效率,可选的,计算机设备获取记忆池中的部分候选图像特征,相应的,仅需要通过选择网络输出部分候选图像特征对应的图像特征评分。
针对选取部分候选图像特征的策略,在一种可能的实施方式中,当待处理视频的帧率大于帧率阈值时,每隔预定帧数从记忆池中获取已标记视频帧对应的候选图像特征。比如,计算机设备获取记忆池中奇数或偶数视频帧对应的候选图像特征(即每隔一帧获取已标记视频帧对应的候选图像特征,因为相邻视频帧的间隔较短,相应的图像特征间的差异较小),或者,计算机设备每隔两帧获取已标记视频帧对应的候选图像特征。
比如,当待处理视频的帧率大于24帧/秒时,计算机设备获取记忆池中奇数视频帧对应的候选图像特征,并通过选择网络输出候选图像特征对应的图像特征评分。
在其他可能的实施方式中,计算机设备也可以从记忆池中获取目标视频帧相邻的n帧已标注视频帧(比如相邻的20帧已标注视频帧)的候选图像特征,本申请实施例对此不作限定。
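A sketch of the memory-pool bookkeeping and the candidate-sampling strategies just described (every k-th stored feature when the frame rate exceeds a threshold, or only the n most recent labeled frames); the class name and the default values are illustrative assumptions.

```python
class MemoryPool:
    """Stores image features of already-labeled frames, in frame order."""
    def __init__(self):
        self.features = []                       # features[i] -> feature of frame i

    def add(self, feature):
        self.features.append(feature)

    def candidates(self, fps=None, fps_threshold=24, step=2, last_n=None):
        """Return (frame_index, feature) pairs to score against the target frame."""
        idx = list(range(len(self.features)))
        if last_n is not None:                   # only the n adjacent labeled frames
            idx = idx[-last_n:]
        elif fps is not None and fps > fps_threshold:
            idx = idx[::step]                    # skip frames: neighbors look alike
        return [(i, self.features[i]) for i in idx]
```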
上述步骤中,选择网络仅基于候选图像特征和目标图像特征计算图像特征评分,评分的维度较为单一。为了进一步提高输出的图像特征评分的准确性,在一种可能的实施方式中,在图7的基础上,如图8所示,步骤703之后可以包括步骤708,相应的,步骤704可以被替换为步骤709。
步骤708,获取初始标注视频帧中标注对象的标注对象图像特征,初始标注视频帧为待处理视频中预设有标注信息的视频帧,且标注对象为初始标注视频帧中包含标注信息的对象。
为了充分利用初始标注视频帧对应的初始标注信息,在一种可能的实施方式中,计算机设备提取初始标注视频帧的图像特征时,对初始标注视频中的标注对象进行图像特征提取,得到标注对象的标注对象图像特征,其中,标注对象图像特征与各个视频帧的图像特征的尺寸相同。
可选的,计算机设备根据初始标注视频帧的物体分割信息(用于分割初始标注视频帧中不同物体)确定标注对象,进而对标注对象进行图像特征提取;并且,当使用基于卷积神经网络的特征提取器进行图像特征提取时,对视频帧进行图像特征提取的特征提取器与对标注对象进行图像特征提取的特征提取器权值共享。
示意性的,如图9所示,计算机设备根据初始标注视频帧x_0对应的初始标注信息y_0,通过特征提取器91提取得到标注对象图像特征f_a;此外,计算机设备在信息标注过程中,通过特征提取器91对视频帧x_0至x_t-1进行图像特征提取,并将提取到的图像特征f_0至f_t-1存储至记忆池92中。在确定视频帧x_t的引导视频帧时,计算机设备即获取标注对象图像特征f_a,并从记忆池92中获取候选图像特征f_p。
步骤709,将候选图像特征、目标图像特征和标注对象图像特征输入选择网络,得到选择网络输出的图像特征评分。
进一步的,计算机设备将候选图像特征、目标图像特征和标注对象图像特征共同输入选择网络,由选择网络根据三者输出图像特征评分。
在一种可能的实施方式中,该选择网络包括两个分支,分别为第一选择分支和第二选择分支,其中,第一选择分支以两两图像特征的关联操作结果作为输入,第二选择分支以三个图像特征的拼接作为输入,且第一选择分支和第二选择分支的输出进行拼接后最终输入选择网络的全连接层,最终由全连接层输出图像特征评分。可选的,本步骤可以包括如下步骤。
一、对候选图像特征、目标图像特征和标注对象图像特征中的任意两个图像特征进行关联操作,得到关联图像特征,关联图像特征即用于表征图像特征之间的相似度。
向第一选择分支输入图像特征前,计算机设备首先通过候选图像特征、目标图像特征和标注对象图像特征中的任意两个图像进行关联操作,得到关联图像特征。
在一种可能的实施方式中,由于候选图像特征、目标图像特征和标注对象图像特征均是使用相同的特征提取器提取得到,因此三者的尺寸相同。进行关联操作时,计算机设备对候选图像特征和目标图像特征进行逐像素(pixel-wise)相似度计算,得到第一关联图像特征;对候选图像特征和标注对象图像特征进行逐像素相似度计算,得到第二关联图像特征;对目标图像特征和标注对象图像特征进行逐像素相似度计算,得到第三关联图像特征。
示意性的,如图9和10所示,对于候选图像特征f_p、目标图像特征f_t和标注对象图像特征f_a,计算机设备进行关联操作操作,得到关联图像特征包括:corr(f_p,f_a),corr(f_p,f_t),corr(f_t,f_a)。
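A possible form of the pixel-wise correlation operation, assuming it is implemented as a per-location cosine similarity over channels; the text only states that a pixel-wise similarity is computed, so the exact operator is an assumption.

```python
import torch
import torch.nn.functional as F

def corr(f_a: torch.Tensor, f_b: torch.Tensor) -> torch.Tensor:
    """Pixel-wise similarity of two same-sized feature maps.

    f_a, f_b: (B, C, H, W). Returns a (B, 1, H, W) map holding the channel-wise
    cosine similarity at each spatial location.
    """
    f_a = F.normalize(f_a, dim=1)
    f_b = F.normalize(f_b, dim=1)
    return (f_a * f_b).sum(dim=1, keepdim=True)

# corr(f_p, f_a), corr(f_p, f_t) and corr(f_t, f_a) are then concatenated along
# the channel dimension and fed to the first selection branch.
```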
二、将各个关联图像特征进行拼接,并将拼接后的关联图像特征输入第一选择分支,得到第一选择分支输出的第一特征向量。
进一步的,计算机设备对关联操作后得到的三个关联图像特征进行拼接,从而将拼接后的关联图像特征输入第一选择分支,由第一选择分支对拼接后的关联图像特征进行进一步特征提取,并最终输出第一特征向量。
可选的,第一选择分支基于卷积神经网络,即第一选择分支对拼接后的关联图像特征进行卷积、池化以及激活操作后,输出第一特征向量。本申请实施例并不对第一选择分支的具体结构进行限定。
示意性的,如图9所示,计算机设备通过第一选择分支对拼接后的关联图像特征进行特征提取,得到第一特征向量93。
三、将拼接后的候选图像特征、目标图像特征和标注对象图像特征输入第二选择分支,得到第二选择分支输出的第二特征向量。
计算机设备对候选图像特征、目标图像特征和标注对象图像进行拼接,从而将拼接结果输入第二选择分支,由第二选择分支进行进一步特征提取,并最终输出第二特征向量。
可选的,第二选择分支基于卷积神经网络,即第二选择分支对拼接后的图 像特征进行卷积、池化以及激活操作后,输出第二特征向量。本申请实施例并不对第二选择分支的具体结构进行限定。
示意性的,如图9所示,计算机设备通过第二选择分支对拼接后的图像特征进行特征提取,得到第二特征向量94。
四、根据第一特征向量和第二特征向量确定图像特征评分。
在一种可能的实施方式中,计算机设备对第一特征向量和第二特征向量进行拼接,并将拼接后的特征向量输入全连接层,得到候选图像特征对应的图像特征评分。
示意性的,如图9所示,计算机设备对第一特征向量93和第二特征向量94进行拼接,并将拼接后的特征向量输入全连接层95,由全连接层95输出候选图像特征f_p的图像特征评分。
需要说明的是,对于记忆池中的各个候选图像特征,计算机设备循环执行上述步骤一至四,从而得到目标图像特征与各个候选图像特征之间的图像特征评分。
步骤705,将最高图像特征评分对应的已标注视频帧确定为引导视频帧。
对于各个候选图像特征,计算机设备通过上述步骤得到各个候选图像特征对应的图像特征评分,并进一步确定其中的最高图像特征评分,从而将最高图像特征评分对应候选图像特征所属的已标注视频帧确定为引导视频帧。
示意性的,如图9所示,计算机设备根据图像特征评分,将视频帧x_k确定为目标视频帧x_t的引导视频帧。
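Putting the scoring and selection described above together, a sketch of the two-branch selection network and the guide-frame choice; channel widths, pooling and the single-score output are assumptions, with only the overall structure (correlation branch, concatenation branch, fully connected fusion, highest score wins) taken from the text.

```python
import torch
import torch.nn as nn

class SelectionNetwork(nn.Module):
    """Branch 1 sees the three concatenated correlation maps, branch 2 sees the
    concatenated features f_p ++ f_t ++ f_a; a fully connected layer fuses both
    feature vectors into one image-feature score."""
    def __init__(self, feat_ch=512, hidden=128):
        super().__init__()
        self.branch1 = nn.Sequential(
            nn.Conv2d(3, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.branch2 = nn.Sequential(
            nn.Conv2d(3 * feat_ch, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.fc = nn.Linear(2 * hidden, 1)

    def forward(self, corr_maps, f_p, f_t, f_a):
        v1 = self.branch1(corr_maps)                           # first feature vector
        v2 = self.branch2(torch.cat([f_p, f_t, f_a], dim=1))   # second feature vector
        return self.fc(torch.cat([v1, v2], dim=1))             # (B, 1) score

def pick_guide(scores):
    """Guide frame = labeled frame whose candidate feature scored highest."""
    return max(scores, key=scores.get)            # scores: {frame_index: float}
```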
步骤706,将目标视频帧的目标图像特征存储至记忆池。
在一种可能的实施方式中,确定出目标视频帧的引导视频帧后,计算机设备将目标视频帧的目标图像特征存储至记忆池,以便为后续视频帧进行信息标识时,能够以该目标图像特征作为参考。
步骤707,将引导视频帧、引导视频帧对应的标注信息以及目标视频帧输入时序传播网络,得到时序传播网络输出的目标标注信息。
本申请实施例中,计算机设备利用预先训练的时序传播网络将引导视频帧的标注信息传播至目标视频帧,完成目标视频帧的信息标注。
在一种可能的实施方式中,时序传播网络包括图像分支(appearance branch)和动量分支(motion branch),其中,图像分支以目标视频帧以及引导视频帧的标识信息为输入,用于输出图像信息特征,该图像信息特征用于表征目标视频 帧中像素点的预计标注信息;动量分支以引导视频帧的标注信息以及引导视频帧与目标视频帧之间的视频帧光流为输入,用于输出动量特征(指示视频帧中物体的运动情况)。可选的,本步骤可以包括如下步骤。
一、将引导视频帧对应的标注信息以及目标视频帧输入图像分支,得到图像分支输出的图像信息特征。
可选的,本申请实施例中的图像分支以预训练的VGG16网络作为初始化。
在一种可能的实施方式中,计算机设备将引导视频帧x_g的标注信息y_g以及目标视频帧x_t输入图像分支,得到图像分支输出的图像信息特征f_app。
示意性的,如图9所示,目标视频帧x_t的引导视频帧为x_k,计算机设备将引导视频帧x_k的标注信息y_k以及目标视频帧x_t输入图像分支96,得到图像分支96输出的图像信息特征97。
二、确定引导视频帧与目标视频帧之间的视频帧光流;将视频帧光流和引导视频帧对应的标注信息输入动量分支,得到动量分支输出的动量特征。
其中,视频帧光流用于指示视频帧之间的图像变化情况,包含了视频帧中运动物体的运动信息,因此可以借助视频帧光流确定视频帧中对象的运行情况。
可选的,该视频帧光流为视频帧之间的稠密光流,即用于指示引导视频帧与目标视频帧中相同坐标像素点对应物体的运动情况。
在一种可能的实施方式中,计算机设备根据引导视频帧x_g和目标视频帧x_t,确定两者之间的视频帧光流W(x_t,x_g),从而将视频帧光流W(x_t,x_g)以及引导视频帧x_g的标注信息y_g输入动量分支。
示意性的,如图9所示,目标视频帧x_t的引导视频帧为x_k,计算机设备根据引导视频帧x_k和目标视频帧x_t,确定两者之间的视频帧光流W(x_t,x_k),并将视频帧光流W(x_t,x_k)以及引导视频帧x_k的标注信息y_k输入动量分支98,得到动量分支98输出的动量特征99。
可选的,引导视频帧与目标视频帧之间的视频帧光流通过预训练的flownet2.0计算得到,且动量分支以预训练的VGG16网络作为初始化。
需要说明的是，步骤一和二之间并不存在严格的先后顺序，即步骤一和二可以同时执行，本实施例对此不作限定。
三、根据图像信息特征和动量特征确定目标标注信息。
在一种可能的实施方式中,计算机设备对图像信息特征和动量特征信息特征融合,并通过卷积层对融合后的特征进行卷积处理,最终得到目标视频帧的 目标标注信息。
示意性的,如图9所示,计算机设备对图像信息特征97和动量特征99进行融合后,通过卷积层(图中未示出)最终输出目标视频帧x_t的目标标注信息y_t。
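A compact sketch of the temporal propagation network's two branches and their fusion; layer sizes and the single-convolution head are assumptions, and the dense optical flow is assumed to be computed outside the module by any flow estimator (the text mentions a pretrained FlowNet 2.0).

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1),
                         nn.ReLU(inplace=True))

class TemporalPropagationNetwork(nn.Module):
    """Appearance branch (target frame + guide labels) and motion branch
    (optical flow + guide labels); the two features are fused and decoded into
    labels for the target frame."""
    def __init__(self, label_ch=1, width=64):   # label_ch=1 for one mask channel
        super().__init__()
        self.appearance = conv_block(3 + label_ch, width)   # x_t ++ y_g
        self.motion = conv_block(2 + label_ch, width)       # flow(x_t, x_g) ++ y_g
        self.head = nn.Conv2d(2 * width, label_ch, 3, padding=1)

    def forward(self, x_t, y_g, flow):
        f_app = self.appearance(torch.cat([x_t, y_g], dim=1))   # image-information feature
        f_mot = self.motion(torch.cat([flow, y_g], dim=1))      # momentum/motion feature
        return self.head(torch.cat([f_app, f_mot], dim=1))      # y_t

# `flow` is the dense optical flow between the guide frame and the target frame,
# with shape (B, 2, H, W), produced by an external optical-flow estimator.
```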
本实施例中,计算机设备将候选图像特征、目标图像特征以及标注对象图像特征输入选择网络,并分别由选择网络的两个选择分支进行特征提取,从而丰富了图像特征的特征提取维度,进而提高了后续计算得到的图像特征评分的准确性。
此外,本实施例中,计算机设备利用时序传播网络的图像分支和动量分支分别进行特征提取,并对两个分支提取到的特征进行融合,最终得到目标视频帧的目标标注信息,有助于提高信息标注的准确性。
在一个示意性的例子中,如图11所示,采用相关技术提供的方法,在对第35帧进行标注时,以第34帧为引导帧,标注准确率为0.44;在对第55帧进行标注时,以第54帧为引导帧,标注准确率为0.28;在对第125帧进行标注时,以第124帧为引导帧,标注准确率为0.22;在对第155帧进行标注时,以第154帧为引导帧,标注准确率为0.23。
而采用本申请实施例提供的方法,在对第35帧进行标注时,以第34帧为引导帧,标注准确率为0.58;在对第55帧进行标注时,以第37帧为引导帧,标注准确率为0.80;在对第125帧进行标注时,以第102帧为引导帧,标注准确率为0.92;在对第155帧进行标注时,以第127帧为引导帧,标注准确率为0.86。
可见,采用相关技术提供的方法,随着信息传播的不断深入,信息标注的准确率越来越低;而采用本申请实施例提供的方法,由于并未采用链式信息传播,因此信息标注的准确性并不会受到信息传播深度的影响。并且,相较于相关技术,本申请实施例提供的方法能够显著提高标注信息的准确性。
针对上述实施例中时序传播网络以及记忆选择网络的训练方法,在一种可能的实施方式中,如图12所示,网络训练过程包括如下步骤:
步骤1201,根据样本视频训练时序传播网络,样本视频中的样本视频帧包含标注信息。
在一种可能的实施方式中,计算机设备首先利用包含标注信息的样本视频对时序传播网络进行训练,然后基于样本视频以及训练完成的时序传播网络进一步对记忆选择网络进行训练。
可选的,计算机设备随机从样本视频中选取两帧作为引导视频帧和目标视频帧对时序传播网络进行训练。其中,当时序传播网络用于实现视频物体分割时,计算机设备采用交并比(Intersection over Union,IOU)损失函数对时序传播网络进行训练;当时序传播网络用于实现视频上色时,计算机设备采用L1回归损失函数训练时序传播网络。
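For the segmentation case, the IoU loss mentioned above can be realized as a soft (differentiable) intersection-over-union, for example as below; the exact formulation used in the embodiment is not spelled out, so this form is an assumption.

```python
import torch

def soft_iou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6):
    """Soft IoU loss for mask propagation training.

    pred:   (B, 1, H, W) predicted mask probabilities in [0, 1]
    target: (B, 1, H, W) ground-truth masks in {0, 1}
    """
    inter = (pred * target).sum(dim=(1, 2, 3))
    union = (pred + target - pred * target).sum(dim=(1, 2, 3))
    return (1.0 - (inter + eps) / (union + eps)).mean()

# For video colorization the same propagation network would instead be trained
# with an L1 regression loss, e.g. torch.nn.functional.l1_loss(pred, target).
```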
步骤1202,对于样本视频中的目标样本视频帧,将目标样本视频帧以及样本视频中的其它样本视频帧输入时序传播网络,得到时序传播网络输出的预测样本标注信息。
完成对时序传播网络的训练后,计算机设备进一步利用训练得到的时序传播网络产生训练样本,从而利用训练样本对记忆选择网络进行训练。
在一种可能的实施方式中,对于样本视频中的目标样本视频帧x_t,计算机设备遍历目标样本视频帧之前的视频帧作为样本引导视频帧x_p(0≤p≤t-1),并将目标样本视频帧x_t和样本引导视频帧x_p输入时序传播网络,得到时序传播网络输出的预测样本标注信息y_tp。
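A sketch of how such training data for the memory selection network could be generated with the already-trained propagation network; `tpn`, `accuracy` and the data layout are assumptions used only to show the traversal over candidate guides p = 0 .. t-1.

```python
def generate_guide_accuracies(frames, labels, t, tpn, accuracy):
    """For target sample frame t, propagate labels from every earlier frame p
    and record how accurate the propagated labels y_tp are.

    tpn(x_t, x_p, y_p)   -> predicted labels y_tp
    accuracy(y_tp, y_t)  -> scalar accuracy (e.g. IoU against the ground truth)
    """
    scores = {}
    for p in range(t):                                # 0 <= p <= t - 1
        y_tp = tpn(frames[t], frames[p], labels[p])
        scores[p] = accuracy(y_tp, labels[t])         # s_tp in the text
    return scores
```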
步骤1203,根据预测样本标注信息和目标样本视频帧对应的样本标注信息,确定样本视频帧中的样本引导视频帧。
进一步的,计算机设备通过比较预测样本标注信息和目标样本视频帧对应的样本标注信息,确定样本引导视频帧的引导质量,进而对样本引导视频帧进行正负样本分类。在一种可能的实施方式中,本步骤可以包括如下步骤。
一、计算预测样本标注信息与样本标注信息之间的信息准确度。
在一种可能的实施方式中,计算机设备计算预测样本标注信息与样本标注信息之间的信息准确度,其中,信息准确度越高,表明预测样本标注信息与样本标注信息越接近,相应的,以该预测样本标注信息对应样本引导视频帧的质量越高。
在一个示意性的例子中,计算机设备根据预测样本标注信息y_tp以及目标样本视频帧x_t的标注信息y_t,计算得到两者的信息准确度s_tp。
二、根据信息准确度确定样本视频帧中的正样本引导视频帧和负样本引导视频帧。
其中,正样本引导视频帧对应的第一信息准确度高于负样本引导视频帧对应的第二信息准确度,第一信息准确度是根据正样本引导视频帧对目标样本视频帧进行信息标注时的信息准确度,第二信息准确度是根据负样本引导视频帧对目标样本视频帧进行信息标注时的信息准确度。
在一种可能的实施方式中,若信息准确度大于第一准确度阈值,计算机设备则将样本引导视频帧确定为正样本引导视频帧(即适合作为引导视频帧);若信息准确度小于第二准确度阈值,计算机设备则将样本引导视频帧确定为负样本引导视频帧(即不适合作为引导视频帧)。其中,第一准确度阈值大于等于第二准确度阈值,比如第一准确度阈值为0.8,第二准确度阈值为0.4。
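The positive/negative split of candidate guide frames can then be expressed as a small helper; the function name is an assumption, while the 0.8 and 0.4 thresholds are the example values given above.

```python
def label_guide_samples(accuracies, hi=0.8, lo=0.4):
    """Split candidate guide frames into positive / negative training samples.

    accuracies: {guide_frame_index: accuracy of the labels propagated from that
                 guide frame to the target sample frame}
    Frames scoring above `hi` become positives, below `lo` negatives; the rest
    are discarded.
    """
    pos = [g for g, s in accuracies.items() if s > hi]
    neg = [g for g, s in accuracies.items() if s < lo]
    return pos, neg
```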
步骤1204,根据目标样本视频帧和样本引导视频帧训练记忆选择网络。
在一种可能的实施方式中,计算机设备将目标样本视频帧和样本引导视频帧输入记忆选择网络,得到记忆选择网络输出的预测结果,并根据预测结果以及样本引导视频帧的正负属性对记忆选择网络进行训练。其中,计算机设备可以采用反向传播算法或梯度下降算法训练记忆选择网络,本申请实施例对此不做限定。
本实施例中，计算机设备首先根据样本视频训练时序传播网络，然后基于训练得到的时序传播网络对样本视频中的样本视频帧进行正负样本划分，进而使用划分出的正负样本对记忆选择网络进行训练，无需用户预先手动标注训练样本的正负属性，降低了训练样本的获取难度，并且有助于提高训练样本划分的准确性，进而提高了记忆选择网络的引导帧选择质量。
图13是本申请一个示例性实施例提供的视频帧的信息标注装置的结构框图,如图13所示,该装置包括:
获取模块1301,用于获取待处理视频;
特征提取模块1302,用于对于所述待处理视频中的目标视频帧,对所述目标视频帧进行特征提取,得到所述目标视频帧的目标图像特征;
引导帧确定模块1303,用于根据所述目标视频帧与已标注视频帧的图像特征匹配度,从所述已标注视频帧中确定所述目标视频帧的引导视频帧,所述已标注视频帧属于所述待处理视频,所述引导视频帧用于引导所述目标视频帧进行信息标注,所述图像特征匹配度为所述目标图像特征与所述已标注视频帧对应图像特征之间的匹配度,且所述引导视频帧与所述目标视频帧的图像特征匹 配度高于其它已标注视频帧与所述目标视频帧的图像特征匹配度;
生成模块1304,用于根据所述引导视频帧对应的标注信息,生成所述目标视频帧对应的目标标注信息。
所述引导帧确定模块1303,包括:
第一获取单元,用于从记忆选择网络的记忆池中获取候选图像特征,所述记忆选择网络包括所述记忆池和选择网络,所述记忆池中存储有所述已标注视频帧的图像特征;
特征评分单元,用于将所述候选图像特征和所述目标图像特征输入所述选择网络,得到所述选择网络输出的图像特征评分,所述图像特征评分用于指示所述候选图像特征与所述目标图像特征之间的图像特征匹配度;
确定单元,用于将最高图像特征评分对应的已标注视频帧确定为所述引导视频帧;
所述装置还包括:
存储模块,用于将所述目标视频帧的所述目标图像特征存储至所述记忆池。
可选的,所述引导帧确定模块1303还包括:
第二获取单元,用于获取初始标注视频帧中标注对象的标注对象图像特征,所述初始标注视频帧为所述待处理视频中预设有标注信息的视频帧,且所述标注对象为所述初始标注视频帧中包含标注信息的对象;
所述特征评分单元,还用于:
将所述候选图像特征、所述目标图像特征和所述标注对象图像特征输入所述选择网络,得到所述选择网络输出的所述图像特征评分。
可选的,所述选择网络包括第一选择分支和第二选择分支;
所述特征评分单元,还用于:
对所述候选图像特征、所述目标图像特征和所述标注对象图像特征中的任意两个图像特征进行关联操作,得到关联图像特征,所述关联图像特征用于表征图像特征之间的相似度;
将各个所述关联图像特征进行拼接,并将拼接后的所述关联图像特征输入所述第一选择分支,得到所述第一选择分支输出的第一特征向量;
将拼接后的所述候选图像特征、所述目标图像特征和所述标注对象图像特征输入所述第二选择分支,得到所述第二选择分支输出的第二特征向量;
根据所述第一特征向量和所述第二特征向量确定所述图像特征评分。
可选的,所述第一获取单元,用于:
若所述待处理视频的帧率大于帧率阈值,每隔预定帧数从所述记忆池中获取所述已标记视频帧对应的所述候选图像特征,或者,从所述记忆池中获取所述目标视频帧对应的n帧相邻已标注视频帧的所述候选图像特征,n为正整数。
可选的,所述生成模块1304,用于:
将所述引导视频帧、所述引导视频帧对应的标注信息以及所述目标视频帧输入时序传播网络,得到所述时序传播网络输出的所述目标标注信息。
可选的,所述时序传播网络包括图像分支和动量分支;
所述生成模块1304,包括:
第一输出单元,用于将所述引导视频帧对应的标注信息以及所述目标视频帧输入所述图像分支,得到所述图像分支输出的图像信息特征;
第二输出单元,用于确定所述引导视频帧与所述目标视频帧之间的视频帧光流;将所述视频帧光流和所述引导视频帧对应的标注信息输入所述动量分支,得到所述动量分支输出的动量特征;
确定单元,用于根据所述图像信息特征和所述动量特征确定所述目标标注信息。
可选的,所述装置还包括:
第一训练模块,用于根据样本视频训练所述时序传播网络,所述样本视频中的样本视频帧包含标注信息;
标注信息预测模块,用于对于所述样本视频中的目标样本视频帧,将所述目标样本视频帧以及所述样本视频中的其它样本视频帧输入所述时序传播网络,得到所述时序传播网络输出的预测样本标注信息;
样本确定模块,用于根据所述预测样本标注信息和所述目标样本视频帧对应的样本标注信息,确定所述样本视频帧中的样本引导视频帧;
第二训练模块,用于根据所述目标样本视频帧和所述样本引导视频帧训练所述记忆选择网络。
可选的,所述样本确定模块,包括:
计算单元,用于计算所述预测样本标注信息与所述样本标注信息之间的信息准确度;
确定单元,用于根据所述信息准确度确定所述样本视频帧中的正样本引导视频帧和负样本引导视频帧;
其中,所述正样本引导视频帧对应的第一信息准确度高于所述负样本引导视频帧对应的第二信息准确度,所述第一信息准确度是根据所述正样本引导视频帧对所述目标样本视频帧进行信息标注时的信息准确度,所述第二信息准确度是根据所述负样本引导视频帧对所述目标样本视频帧进行信息标注时的信息准确度。
综上所述,本申请实施例中,对待处理视频中的目标视频帧进行信息标注时,通过对目标视频帧进行特征提取,得到目标视频帧的目标图像特征,并根据目标视频帧与待处理视频中已标注视频帧的图像特征匹配度,从已标注视频帧中确定出目标视频帧对应的引导视频帧,从而基于引导视频帧的标注信息生成目标视频帧的目标标注信息;本申请实施例中,基于目标视频帧的图像特征,选取与目标视频帧具有高图像特征匹配度的已标注视频帧作为引导视频帧,而非直接选取相邻视频帧作为引导视频帧,提高了引导视频帧的选取质量,进而提高了生成的标注信息的准确性;并且,标注信息的传播误差不会累积,进而提高了标注信息的传播质量。
需要说明的是:上述实施例提供的视频帧的信息标注装置,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将装置的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。另外,上述实施例提供的视频帧的信息标注装置与视频帧的信息标注生成方法实施例属于同一构思,其具体实现过程详见方法实施例,这里不再赘述。
请参考图14,其示出了本申请一个示例性实施例提供的计算机设备的结构示意图。具体来讲:所述计算机设备1400包括中央处理单元(CPU)1401、包括随机存取存储器(RAM)1402和只读存储器(ROM)1403的系统存储器1404,以及连接系统存储器1404和中央处理单元1401的系统总线1405。所述计算机设备1400还包括帮助计算机内的各个器件之间传输信息的基本输入/输出系统(I/O系统)1406,和用于存储操作系统1413、应用程序1414和其他程序模块1415的大容量存储设备1407。
所述基本输入/输出系统1406包括有用于显示信息的显示器1408和用于用户输入信息的诸如鼠标、键盘之类的输入设备1409。其中所述显示器1408和 输入设备1409都通过连接到系统总线1405的输入输出控制器1410连接到中央处理单元1401。所述基本输入/输出系统1406还可以包括输入输出控制器1410以用于接收和处理来自键盘、鼠标、或电子触控笔等多个其他设备的输入。类似地,输入输出控制器1410还提供输出到显示屏、打印机或其他类型的输出设备。
所述大容量存储设备1407通过连接到系统总线1405的大容量存储控制器(未示出)连接到中央处理单元1401。所述大容量存储设备1407及其相关联的计算机可读介质为计算机设备1400提供非易失性存储。也就是说,所述大容量存储设备1407可以包括诸如硬盘或者CD-ROI驱动器之类的计算机可读介质(未示出)。
不失一般性，所述计算机可读介质可以包括计算机存储介质和通信介质。计算机存储介质包括以用于存储诸如计算机可读指令、数据结构、程序模块或其他数据等信息的任何方法或技术实现的易失性和非易失性、可移动和不可移动介质。计算机存储介质包括RAM、ROM、EPROM、EEPROM、闪存或其他固态存储技术，CD-ROM、DVD或其他光学存储、磁带盒、磁带、磁盘存储或其他磁性存储设备。当然，本领域技术人员可知所述计算机存储介质不局限于上述几种。上述的系统存储器1404和大容量存储设备1407可以统称为存储器。
存储器存储有一个或多个程序,一个或多个程序被配置成由一个或多个中央处理单元1401执行,一个或多个程序包含用于实现上述方法的指令,中央处理单元1401执行该一个或多个程序实现上述各个方法实施例提供的方法。
根据本申请的各种实施例,所述计算机设备1400还可以通过诸如因特网等网络连接到网络上的远程计算机运行。也即计算机设备1400可以通过连接在所述系统总线1405上的网络接口单元1411连接到网络1412,或者说,也可以使用网络接口单元1411来连接到其他类型的网络或远程计算机系统(未示出)。
所述存储器还包括一个或者一个以上的程序,所述一个或者一个以上程序存储于存储器中,所述一个或者一个以上程序包含用于进行本申请实施例提供的方法中由计算机设备所执行的步骤。
本申请实施例还提供一种计算机可读存储介质,该可读存储介质中存储有 至少一条指令、至少一段程序、代码集或指令集,所述至少一条指令、所述至少一段程序、所述代码集或指令集由处理器加载并执行以实现上述任一实施例所述的视频帧的信息标注方法。
本申请还提供了一种计算机程序产品或计算机程序，该计算机程序产品或计算机程序包括计算机指令，该计算机指令存储在计算机可读存储介质中。计算机设备的处理器从计算机可读存储介质读取该计算机指令，处理器执行该计算机指令，使得该计算机设备执行上述实施例提供的视频帧的信息标注方法。
本领域普通技术人员可以理解上述实施例的各种方法中的全部或部分步骤是可以通过程序来指令相关的硬件来完成,该程序可以存储于一计算机可读存储介质中,该计算机可读存储介质可以是上述实施例中的存储器中所包含的计算机可读存储介质;也可以是单独存在,未装配入终端中的计算机可读存储介质。该计算机可读存储介质中存储有至少一条指令、至少一段程序、代码集或指令集,所述至少一条指令、所述至少一段程序、所述代码集或指令集由所述处理器加载并执行以实现上述任一方法实施例所述的视频帧的信息标注方法。
可选地,该计算机可读存储介质可以包括:只读存储器(ROM,Read Only Memory)、随机存取记忆体(RAM,Random Access Memory)、固态硬盘(SSD,Solid State Drives)或光盘等。其中,随机存取记忆体可以包括电阻式随机存取记忆体(ReRAM,Resistance Random Access Memory)和动态随机存取存储器(DRAM,Dynamic Random Access Memory)。上述本申请实施例序号仅仅为了描述,不代表实施例的优劣。
本领域普通技术人员可以理解实现上述实施例的全部或部分步骤可以通过硬件来完成,也可以通过程序来指令相关的硬件完成,所述的程序可以存储于一种计算机可读存储介质中,上述提到的存储介质可以是只读存储器,磁盘或光盘等。
以上所述仅为本申请的较佳实施例,并不用以限制本申请,凡在本申请的精神和原则之内,所作的任何修改、等同替换、改进等,均应包含在本申请的保护范围之内。

Claims (16)

  1. 一种视频帧的信息标注方法,其特征在于,所述方法应用于计算机设备,所述方法包括:
    获取待处理视频;
    对于所述待处理视频中的目标视频帧,对所述目标视频帧进行特征提取,得到所述目标视频帧的目标图像特征;
    根据所述目标视频帧与已标注视频帧的图像特征匹配度,从所述已标注视频帧中确定所述目标视频帧的引导视频帧,所述已标注视频帧属于所述待处理视频,所述引导视频帧用于引导所述目标视频帧进行信息标注,所述图像特征匹配度为所述目标图像特征与所述已标注视频帧对应图像特征之间的匹配度,且所述引导视频帧与所述目标视频帧的图像特征匹配度高于其它已标注视频帧与所述目标视频帧的图像特征匹配度;
    根据所述引导视频帧对应的标注信息,生成所述目标视频帧对应的目标标注信息。
  2. 根据权利要求1所述的方法,其特征在于,所述根据所述目标视频帧与已标注视频帧的图像特征匹配度,从所述已标注视频帧中确定所述目标视频帧的引导视频帧,包括:
    从记忆选择网络的记忆池中获取候选图像特征,所述记忆选择网络包括所述记忆池和选择网络,所述记忆池中存储有所述已标注视频帧的图像特征;
    将所述候选图像特征和所述目标图像特征输入所述选择网络,得到所述选择网络输出的图像特征评分,所述图像特征评分用于指示所述候选图像特征与所述目标图像特征之间的图像特征匹配度;
    将最高图像特征评分对应的已标注视频帧确定为所述引导视频帧;
    所述根据所述目标视频帧与已标注视频帧的图像特征匹配度,从所述已标注视频帧中确定所述目标视频帧的引导视频帧之后,所述方法还包括:
    将所述目标视频帧的所述目标图像特征存储至所述记忆池。
  3. 根据权利要求2所述的方法,其特征在于,所述将所述候选图像特征和所述目标图像特征输入所述选择网络,得到所述选择网络输出的图像特征评分之前,所述方法还包括:
    获取初始标注视频帧中标注对象的标注对象图像特征,所述初始标注视频帧为所述待处理视频中预设有标注信息的视频帧,且所述标注对象为所述初始标注视频帧中包含标注信息的对象;
    所述将所述候选图像特征和所述目标图像特征输入所述选择网络,得到所述选择网络输出的图像特征评分,包括:
    将所述候选图像特征、所述目标图像特征和所述标注对象图像特征输入所述选择网络,得到所述选择网络输出的所述图像特征评分。
  4. 根据权利要求3所述的方法,其特征在于,所述选择网络包括第一选择分支和第二选择分支;
    所述将所述候选图像特征、所述目标图像特征和所述标注对象图像特征输入所述选择网络,得到所述选择网络输出的所述图像特征评分,包括:
    对所述候选图像特征、所述目标图像特征和所述标注对象图像特征中的任意两个图像特征进行关联操作,得到关联图像特征,所述关联图像特征用于表征图像特征之间的相似度;
    将各个所述关联图像特征进行拼接,并将拼接后的所述关联图像特征输入所述第一选择分支,得到所述第一选择分支输出的第一特征向量;
    将拼接后的所述候选图像特征、所述目标图像特征和所述标注对象图像特征输入所述第二选择分支,得到所述第二选择分支输出的第二特征向量;
    根据所述第一特征向量和所述第二特征向量确定所述图像特征评分。
  5. 根据权利要求2至4任一所述的方法,其特征在于,所述从记忆选择网络的记忆池中获取候选图像特征,包括:
    若所述待处理视频的帧率大于帧率阈值,每隔预定帧数从所述记忆池中获取所述已标记视频帧对应的所述候选图像特征,或者,从所述记忆池中获取所述目标视频帧对应的n帧相邻已标注视频帧的所述候选图像特征,n为正整数。
  6. 根据权利要求2至4任一所述的方法,其特征在于,所述根据所述引导视频帧对应的标注信息,生成所述目标视频帧对应的目标标注信息,包括:
    将所述引导视频帧、所述引导视频帧对应的标注信息以及所述目标视频帧输入时序传播网络,得到所述时序传播网络输出的所述目标标注信息。
  7. 根据权利要求6所述的方法,其特征在于,所述时序传播网络包括图像分支和动量分支;
    所述将所述引导视频帧、所述引导视频帧对应的标注信息以及所述目标视频帧输入时序传播网络,得到所述时序传播网络输出的所述目标标注信息,包括:
    将所述引导视频帧对应的标注信息以及所述目标视频帧输入所述图像分支,得到所述图像分支输出的图像信息特征;
    确定所述引导视频帧与所述目标视频帧之间的视频帧光流;将所述视频帧光流和所述引导视频帧对应的标注信息输入所述动量分支,得到所述动量分支输出的动量特征;
    根据所述图像信息特征和所述动量特征确定所述目标标注信息。
  8. 根据权利要求6所述的方法,其特征在于,所述获取待处理视频之前,所述方法还包括:
    根据样本视频训练所述时序传播网络,所述样本视频中的样本视频帧包含标注信息;
    对于所述样本视频中的目标样本视频帧,将所述目标样本视频帧以及所述样本视频中的其它样本视频帧输入所述时序传播网络,得到所述时序传播网络输出的预测样本标注信息;
    根据所述预测样本标注信息和所述目标样本视频帧对应的样本标注信息,确定所述样本视频帧中的样本引导视频帧;
    根据所述目标样本视频帧和所述样本引导视频帧训练所述记忆选择网络。
  9. 根据权利要求8所述的方法,其特征在于,所述根据所述预测样本标注信息和所述目标样本视频帧对应的样本标注信息,确定所述样本视频帧中的样本引导视频帧,包括:
    计算所述预测样本标注信息与所述样本标注信息之间的信息准确度;
    根据所述信息准确度确定所述样本视频帧中的正样本引导视频帧和负样本引导视频帧;
    其中,所述正样本引导视频帧对应的第一信息准确度高于所述负样本引导 视频帧对应的第二信息准确度,所述第一信息准确度是根据所述正样本引导视频帧对所述目标样本视频帧进行信息标注时的信息准确度,所述第二信息准确度是根据所述负样本引导视频帧对所述目标样本视频帧进行信息标注时的信息准确度。
  10. 一种视频帧的信息标注装置,其特征在于,所述装置包括:
    获取模块,用于获取待处理视频;
    特征提取模块,用于对于所述待处理视频中的目标视频帧,对所述目标视频帧进行特征提取,得到所述目标视频帧的目标图像特征;
    引导帧确定模块,用于根据所述目标视频帧与已标注视频帧的图像特征匹配度,从所述已标注视频帧中确定所述目标视频帧的引导视频帧,所述已标注视频帧属于所述待处理视频,所述引导视频帧用于引导所述目标视频帧进行信息标注,所述图像特征匹配度为所述目标图像特征与所述已标注视频帧对应图像特征之间的匹配度,且所述引导视频帧与所述目标视频帧的图像特征匹配度高于其它已标注视频帧与所述目标视频帧的图像特征匹配度;
    生成模块,用于根据所述引导视频帧对应的标注信息,生成所述目标视频帧对应的目标标注信息。
  11. 根据权利要求10所述的装置,其特征在于,所述引导帧确定模块,包括:
    第一获取单元,用于从记忆选择网络的记忆池中获取候选图像特征,所述记忆选择网络包括所述记忆池和选择网络,所述记忆池中存储有所述已标注视频帧的图像特征;
    特征评分单元,用于将所述候选图像特征和所述目标图像特征输入所述选择网络,得到所述选择网络输出的图像特征评分,所述图像特征评分用于指示所述候选图像特征与所述目标图像特征之间的图像特征匹配度;
    确定单元,用于将最高图像特征评分对应的已标注视频帧确定为所述引导视频帧;
    所述装置还包括:
    存储模块,用于将所述目标视频帧的所述目标图像特征存储至所述记忆池。
  12. 根据权利要求11所述的装置,其特征在于,所述引导帧确定模块还包括:
    第二获取单元,用于获取初始标注视频帧中标注对象的标注对象图像特征,所述初始标注视频帧为所述待处理视频中预设有标注信息的视频帧,且所述标注对象为所述初始标注视频帧中包含标注信息的对象;
    所述特征评分单元,还用于:
    将所述候选图像特征、所述目标图像特征和所述标注对象图像特征输入所述选择网络,得到所述选择网络输出的所述图像特征评分。
  13. 根据权利要求12所述的装置,其特征在于,所述选择网络包括第一选择分支和第二选择分支;
    所述特征评分单元,还用于:
    对所述候选图像特征、所述目标图像特征和所述标注对象图像特征中的任意两个图像特征进行关联操作,得到关联图像特征,所述关联图像特征用于表征图像特征之间的相似度;
    将各个所述关联图像特征进行拼接,并将拼接后的所述关联图像特征输入所述第一选择分支,得到所述第一选择分支输出的第一特征向量;
    将拼接后的所述候选图像特征、所述目标图像特征和所述标注对象图像特征输入所述第二选择分支,得到所述第二选择分支输出的第二特征向量;
    根据所述第一特征向量和所述第二特征向量确定所述图像特征评分。
  14. 根据权利要求11至13任一所述的装置,其特征在于,所述生成模块,用于:
    将所述引导视频帧、所述引导视频帧对应的标注信息以及所述目标视频帧输入时序传播网络,得到所述时序传播网络输出的所述目标标注信息。
  15. 一种计算机设备,其特征在于,所述计算机设备包括处理器和存储器,所述存储器中存储有至少一条指令、至少一段程序、代码集或指令集,所述至少一条指令、所述至少一段程序、所述代码集或指令集由所述处理器加载并执行以实现如权利要求1至9任一所述的视频帧的信息标注方法。
  16. 一种计算机可读存储介质,其特征在于,所述可读存储介质中存储有至少一条指令、至少一段程序、代码集或指令集,所述至少一条指令、所述至少一段程序、所述代码集或指令集由处理器加载并执行以实现如权利要求1至9任一所述的视频帧的信息标注方法。
PCT/CN2020/106575 2019-08-29 2020-08-03 视频帧的信息标注方法、装置、设备及存储介质 WO2021036699A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
JP2021556971A JP7147078B2 (ja) 2019-08-29 2020-08-03 ビデオフレームの情報ラベリング方法、装置、機器及びコンピュータプログラム
EP20859548.8A EP4009231A4 (en) 2019-08-29 2020-08-03 METHOD, DEVICE AND APPARATUS FOR LABELING VIDEO FRAME INFORMATION, AND STORAGE MEDIA
US17/473,940 US11727688B2 (en) 2019-08-29 2021-09-13 Method and apparatus for labelling information of video frame, device, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910807774.8 2019-08-29
CN201910807774.8A CN110503074B (zh) 2019-08-29 2019-08-29 视频帧的信息标注方法、装置、设备及存储介质

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/473,940 Continuation US11727688B2 (en) 2019-08-29 2021-09-13 Method and apparatus for labelling information of video frame, device, and storage medium

Publications (1)

Publication Number Publication Date
WO2021036699A1 true WO2021036699A1 (zh) 2021-03-04

Family

ID=68590435

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/106575 WO2021036699A1 (zh) 2019-08-29 2020-08-03 视频帧的信息标注方法、装置、设备及存储介质

Country Status (5)

Country Link
US (1) US11727688B2 (zh)
EP (1) EP4009231A4 (zh)
JP (1) JP7147078B2 (zh)
CN (1) CN110503074B (zh)
WO (1) WO2021036699A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114697702A (zh) * 2022-03-23 2022-07-01 咪咕文化科技有限公司 音视频标记方法、装置、设备及存储介质

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110503074B (zh) * 2019-08-29 2022-04-15 腾讯科技(深圳)有限公司 视频帧的信息标注方法、装置、设备及存储介质
CN113271424A (zh) * 2020-02-17 2021-08-17 北京沃东天骏信息技术有限公司 一种音视频通讯方法、装置和系统
CN112233171A (zh) * 2020-09-03 2021-01-15 上海眼控科技股份有限公司 目标标注质量检验方法、装置、计算机设备和存储介质
US20220180633A1 (en) * 2020-12-04 2022-06-09 Samsung Electronics Co., Ltd. Video object detection and tracking method and apparatus
CN112950667B (zh) * 2021-02-10 2023-12-22 中国科学院深圳先进技术研究院 一种视频标注方法、装置、设备及计算机可读存储介质
CN115134656A (zh) * 2021-03-26 2022-09-30 腾讯科技(深圳)有限公司 一种视频数据处理方法、装置、设备以及介质
CN113343857B (zh) * 2021-06-09 2023-04-18 浙江大华技术股份有限公司 标注方法、装置、存储介质及电子装置
CN113506610A (zh) * 2021-07-08 2021-10-15 联仁健康医疗大数据科技股份有限公司 标注规范生成方法、装置、电子设备及存储介质
CN113672143A (zh) * 2021-08-27 2021-11-19 广州市网星信息技术有限公司 图像标注方法、系统、设备和存储介质
US20230138254A1 (en) * 2021-10-29 2023-05-04 International Business Machines Corporation Temporal contrastive learning for semi-supervised video action recognition
CN114419502A (zh) * 2022-01-12 2022-04-29 深圳力维智联技术有限公司 一种数据分析方法、装置及存储介质
CN114863321B (zh) * 2022-04-08 2024-03-08 北京凯利时科技有限公司 自动视频生成方法、装置及电子设备和芯片系统
CN115294506B (zh) * 2022-10-09 2022-12-09 深圳比特微电子科技有限公司 一种视频高光检测方法和装置
CN115757871A (zh) * 2022-11-15 2023-03-07 北京字跳网络技术有限公司 视频标注方法、装置、设备、介质及产品
CN117437635B (zh) * 2023-12-21 2024-04-05 杭州海康慧影科技有限公司 一种生物组织类图像的预标注方法、装置

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160379080A1 (en) * 2015-06-25 2016-12-29 A9.Com, Inc. Image match for featureless objects
CN107886104A (zh) * 2016-09-30 2018-04-06 法乐第(北京)网络科技有限公司 一种图像的标注方法
CN108965687A (zh) * 2017-05-22 2018-12-07 阿里巴巴集团控股有限公司 拍摄方向识别方法、服务器及监控方法、系统及摄像设备
CN110163095A (zh) * 2019-04-16 2019-08-23 中国科学院深圳先进技术研究院 回环检测方法、回环检测装置及终端设备
CN110503074A (zh) * 2019-08-29 2019-11-26 腾讯科技(深圳)有限公司 视频帧的信息标注方法、装置、设备及存储介质

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7889794B2 (en) 2006-02-03 2011-02-15 Eastman Kodak Company Extracting key frame candidates from video clip
CN103324937B (zh) * 2012-03-21 2016-08-03 日电(中国)有限公司 标注目标的方法和装置
CN103914850B (zh) * 2014-04-22 2017-02-15 南京影迹网络科技有限公司 一种基于运动匹配的视频自动标注方法及自动标注系统
US10319412B2 (en) 2016-11-16 2019-06-11 Adobe Inc. Robust tracking of objects in videos
CN108012202B (zh) * 2017-12-15 2020-02-14 浙江大华技术股份有限公司 视频浓缩方法、设备、计算机可读存储介质及计算机装置
CN108965852A (zh) * 2018-08-14 2018-12-07 宁波工程学院 一种具有容错能力的半自动2d转3d的方法
CN109325967B (zh) * 2018-09-14 2023-04-07 腾讯科技(深圳)有限公司 目标跟踪方法、装置、介质以及设备
CN109753975B (zh) * 2019-02-02 2021-03-09 杭州睿琪软件有限公司 一种训练样本获得方法、装置、电子设备和存储介质
CN110176027B (zh) * 2019-05-27 2023-03-14 腾讯科技(深圳)有限公司 视频目标跟踪方法、装置、设备及存储介质

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160379080A1 (en) * 2015-06-25 2016-12-29 A9.Com, Inc. Image match for featureless objects
CN107886104A (zh) * 2016-09-30 2018-04-06 法乐第(北京)网络科技有限公司 一种图像的标注方法
CN108965687A (zh) * 2017-05-22 2018-12-07 阿里巴巴集团控股有限公司 拍摄方向识别方法、服务器及监控方法、系统及摄像设备
CN110163095A (zh) * 2019-04-16 2019-08-23 中国科学院深圳先进技术研究院 回环检测方法、回环检测装置及终端设备
CN110503074A (zh) * 2019-08-29 2019-11-26 腾讯科技(深圳)有限公司 视频帧的信息标注方法、装置、设备及存储介质

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114697702A (zh) * 2022-03-23 2022-07-01 咪咕文化科技有限公司 音视频标记方法、装置、设备及存储介质
CN114697702B (zh) * 2022-03-23 2024-01-30 咪咕文化科技有限公司 音视频标记方法、装置、设备及存储介质

Also Published As

Publication number Publication date
EP4009231A4 (en) 2022-11-23
EP4009231A1 (en) 2022-06-08
US20210406553A1 (en) 2021-12-30
JP2022526513A (ja) 2022-05-25
US11727688B2 (en) 2023-08-15
CN110503074A (zh) 2019-11-26
JP7147078B2 (ja) 2022-10-04
CN110503074B (zh) 2022-04-15

Similar Documents

Publication Publication Date Title
WO2021036699A1 (zh) 视频帧的信息标注方法、装置、设备及存储介质
US20210326597A1 (en) Video processing method and apparatus, and electronic device and storage medium
Valle et al. Multi-task head pose estimation in-the-wild
US11842487B2 (en) Detection model training method and apparatus, computer device and storage medium
CN111126272B (zh) 姿态获取方法、关键点坐标定位模型的训练方法和装置
US11514625B2 (en) Motion trajectory drawing method and apparatus, and device and storage medium
CN111354079A (zh) 三维人脸重建网络训练及虚拟人脸形象生成方法和装置
CN108710897A (zh) 一种基于ssd-t的远端在线通用目标检测系统
CN110866936A (zh) 视频标注方法、跟踪方法、装置、计算机设备及存储介质
WO2023273668A1 (zh) 图像分类方法、装置、设备、存储介质及程序产品
CN111368751A (zh) 图像处理方法、装置、存储介质及电子设备
WO2023221608A1 (zh) 口罩识别模型的训练方法、装置、设备及存储介质
CN111742345A (zh) 通过着色的视觉跟踪
CN112417947B (zh) 关键点检测模型的优化及面部关键点的检测方法及装置
CN111428650B (zh) 一种基于sp-pggan风格迁移的行人重识别方法
CN115115969A (zh) 视频检测方法、装置、设备、存储介质和程序产品
CN115905622A (zh) 视频标注方法、装置、设备、介质及产品
Zhang et al. Facial component-landmark detection with weakly-supervised lr-cnn
CN117115917A (zh) 基于多模态特征融合的教师行为识别方法、设备以及介质
Mori et al. Good keyframes to inpaint
CN110942463A (zh) 一种基于生成对抗网络的视频目标分割方法
CN116052108A (zh) 基于Transformer的交通场景小样本目标检测方法及装置
CN112862840B (zh) 图像分割方法、装置、设备及介质
US11610414B1 (en) Temporal and geometric consistency in physical setting understanding
Zhou et al. A lightweight neural network for loop closure detection in indoor visual slam

Legal Events

  • 121: Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20859548; Country of ref document: EP; Kind code of ref document: A1)
  • ENP: Entry into the national phase (Ref document number: 2021556971; Country of ref document: JP; Kind code of ref document: A)
  • NENP: Non-entry into the national phase (Ref country code: DE)
  • ENP: Entry into the national phase (Ref document number: 2020859548; Country of ref document: EP; Effective date: 20210927)