WO2022262337A1 - Video annotation method, apparatus, computing device and computer-readable storage medium - Google Patents

Video annotation method, apparatus, computing device and computer-readable storage medium

Info

Publication number
WO2022262337A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
pixel
frame
target pixel
information
Prior art date
Application number
PCT/CN2022/081027
Other languages
English (en)
French (fr)
Inventor
邬书哲
金鑫
涂丹丹
Original Assignee
华为云计算技术有限公司
Application filed by 华为云计算技术有限公司
Publication of WO2022262337A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 - Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75 - Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 - Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content

Definitions

  • the present disclosure relates to the field of artificial intelligence, and more specifically, to a video tagging method, device, computing device and computer-readable storage medium.
  • video includes videos uploaded by users on the Internet, videos saved by monitoring systems, videos of film and television dramas, etc., which can provide data input for various visual intelligence applications.
  • An exemplary embodiment of the present disclosure provides a video tagging method, which can realize automatic tagging of each frame in a video based on a position mapping model, and has high tagging efficiency.
  • In a first aspect, a video labeling method is provided. The method includes: determining, according to a position mapping model, the matching pixel in a previous frame of a target pixel in the current frame of the video, where the previous frame has annotation information; and determining the annotation information of the target pixel based on the part of the annotation information of the previous frame that is related to the matching pixel.
  • the labeling of each frame in the video is realized based on the position mapping model.
  • the annotation information can be determined for various types of objects, and high-quality annotation information can be obtained even for scenes such as irregular motion trajectories, occlusion, and various object deformations.
  • the video is a color video
  • the position mapping model is trained based on the color video and a grayscale video constructed from the color video.
  • In some embodiments of the first aspect, before determining the matching pixel, the method further includes: constructing a grayscale video based on the color video input by the user; and updating a preset position mapping model based on the color video and the grayscale video to obtain the position mapping model.
  • the location mapping model can be updated based on the video to be labeled, so that the location mapping model can adapt to different data situations before labeling, and further ensure the accuracy of labeling.
  • In some embodiments, determining the annotation information of the target pixel includes: determining, from the annotation information of the previous frame, a part related to another matching pixel of the target pixel in the previous frame; and determining the annotation information of the target pixel based on the part of the annotation information of the previous frame related to the matching pixel and the part related to the other matching pixel.
  • the labeling information of the target pixel can be determined based on multiple matching pixels, so that the labeling result is more accurate.
  • In some embodiments, determining the label information of the target pixel includes: determining a first similarity between the matching pixel and the target pixel, and a second similarity between the other matching pixel and the target pixel; and determining the label information of the target pixel by performing a weighted sum based on the first similarity and the second similarity.
  • In some embodiments, determining the label information of the target pixel includes: determining a first similarity between the matching pixel and the target pixel, and a second similarity between the other matching pixel and the target pixel; and determining the label information of the target pixel based on a comparison of the first similarity and the second similarity.
  • the previous frame includes a start frame of the video, and the annotation information of the start frame is marked by the user.
  • the annotation information of the target pixel in the current frame can be determined based on the accurate user annotation, so that the obtained annotation information is more accurate.
  • the previous frame includes adjacent or non-adjacent frames preceding the current frame in the video.
  • labeling can be performed based on adjacent frames, fully considering the mobility of objects in the video, so that the obtained labeling information is more in line with the actual situation.
  • the labeling information includes information obtained by at least one of the following labeling methods: points, straight lines, curves, rectangular boxes, irregular polygons, and segmentation masks.
  • the present disclosure can perform subsequent labeling for various forms of labeling information, and can determine labeling information for various types of objects, without requiring the user to select different preset algorithms for different objects.
  • High-quality annotation information can be obtained even for scenes with irregular motion trajectories, occlusions, and various object deformations.
  • the method further includes: determining the labeling information of the current frame based at least on the labeling information of the target pixel.
  • In a second aspect, a video tagging device is provided. The device includes: a mapping unit configured to determine, according to the position mapping model, the matching pixel in a previous frame of the target pixel in the current frame of the video, where the previous frame has labeling information; and a determining unit configured to determine the labeling information of the target pixel based on the part of the labeling information of the previous frame related to the matching pixel.
  • the video is a color video
  • the position mapping model is trained based on the color video and a grayscale video constructed from the color video.
  • In some embodiments, the device further includes: a construction unit configured to construct a grayscale video based on the color video input by the user; and an update unit configured to update the preset position mapping model based on the color video and the grayscale video to obtain the position mapping model.
  • In some embodiments, the determining unit is configured to: determine, from the annotation information of the previous frame, a part related to another matching pixel of the target pixel in the previous frame; and determine the labeling information of the target pixel based on the part of the annotation information of the previous frame related to the matching pixel and the part related to the other matching pixel.
  • In some embodiments, the determining unit is configured to: determine a first similarity between the matching pixel and the target pixel, and a second similarity between the other matching pixel and the target pixel; and determine the label information of the target pixel by performing a weighted sum based on the first similarity and the second similarity.
  • the previous frame includes a start frame of the video, and the marking information of the start frame is marked by a user.
  • the previous frame includes adjacent or non-adjacent frames preceding the current frame in the video.
  • the labeling information includes information obtained by at least one of the following labeling methods: points, straight lines, curves, rectangular boxes, irregular polygons, and segmentation masks.
  • the determining unit is further configured to determine the annotation information of the current frame based at least on the annotation information of the target pixel.
  • In a third aspect, a computing device is provided, including a processor and a memory. The memory stores instructions executable by the processor, and when the instructions are executed by the processor, the computing device is caused to: determine, according to the position mapping model, the matching pixel in a previous frame of the target pixel in the current frame of the video, where the previous frame has annotation information; and determine the annotation information of the target pixel based on the part of the annotation information of the previous frame related to the matching pixel.
  • In some embodiments, when the instructions are executed by the processor, the computing device is caused to: determine, from the annotation information of the previous frame, a part related to another matching pixel of the target pixel in the previous frame; and determine the annotation information of the target pixel based on the part of the annotation information of the previous frame related to the matching pixel and the part related to the other matching pixel.
  • In some embodiments, when the instructions are executed by the processor, the computing device is caused to: determine a first similarity between the matching pixel and the target pixel, and a second similarity between the other matching pixel and the target pixel; and determine the annotation information of the target pixel by performing a weighted sum based on the first similarity and the second similarity.
  • the video is a color video
  • the position mapping model is trained based on the color video and a grayscale video constructed from the color video.
  • In some embodiments, when the instructions are executed by the processor, the computing device is caused to: construct a grayscale video based on the color video input by the user; and update the preset position mapping model based on the color video and the grayscale video to obtain the position mapping model.
  • the previous frame includes a start frame of the video, and the marking information of the start frame is marked by the user.
  • the previous frame includes adjacent or non-adjacent frames prior to the current frame in the video.
  • the labeling information includes information obtained by at least one of the following labeling methods: points, straight lines, curves, rectangular boxes, irregular polygons, and segmentation masks.
  • In some embodiments, when the instructions are executed by the processor, the computing device is caused to determine the annotation information of the current frame based at least on the annotation information of the target pixel.
  • In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored. When the computer program is executed by a processor, the operations of the method according to the first aspect or any embodiment thereof are implemented.
  • In a fifth aspect, a chip or a chip system is provided, including a processing circuit configured to perform the operations of the method according to the first aspect or any embodiment thereof.
  • In a sixth aspect, a computer program or computer program product is provided.
  • The computer program or computer program product is tangibly stored on a computer-readable medium and includes computer-executable instructions that, when executed, cause a device to implement the operations of the method according to the first aspect or any embodiment thereof.
  • FIG. 1 shows a schematic structural diagram of a system according to an embodiment of the present disclosure
  • Fig. 2 shows a schematic diagram of a location mapping model according to an embodiment of the present disclosure
  • Fig. 3 shows a schematic diagram of a scenario in which a system according to an embodiment of the present disclosure is deployed in a cloud environment and a local computing device;
  • FIG. 4 shows a schematic structural diagram of a computing device according to an embodiment of the present disclosure
  • FIG. 5 shows a schematic flowchart of a video tagging method according to an embodiment of the present disclosure
  • Fig. 6 shows a schematic flowchart of the process of obtaining a location mapping model according to an embodiment of the present disclosure
  • FIG. 7 shows a schematic flowchart of a process of determining label information of a target pixel according to an embodiment of the present disclosure
  • Fig. 8 shows a schematic block diagram of a video tagging device according to an embodiment of the present disclosure.
  • Artificial Intelligence uses computers to simulate certain human thinking processes and intelligent behaviors.
  • The history of artificial intelligence research follows a natural and clear progression from a focus on "reasoning", to a focus on "knowledge", and then to a focus on "learning".
  • Artificial intelligence has been widely applied to various industries such as security, medical care, transportation, education, and finance.
  • Machine learning is a branch of artificial intelligence, which studies how computers simulate or implement human learning behaviors to acquire new knowledge or skills, and reorganize existing knowledge structures to continuously improve their own performance. In other words, machine learning studies how to improve the performance of specific algorithms during empirical learning.
  • Deep learning is a type of machine learning technology based on deep neural network algorithms. Its main feature is to use multiple nonlinear transformation structures to process and analyze data. It is mainly used in perception, decision-making and other scenarios in the field of artificial intelligence, such as image and speech recognition, natural language translation, computer games, etc.
  • video includes videos uploaded by users on the Internet, videos saved by monitoring systems, videos of film and television dramas, etc., providing data input for various visual intelligence applications.
  • Current vision-oriented artificial intelligence technology, especially mainstream deep learning technology, often relies on a large amount of labeled data for learning. Because video data has a time dimension that images lack and the data scale is huge, labeling videos is very difficult: it is both costly and inefficient, which limits the practical application of related technologies and the potential of video data. Therefore, efficiently acquiring labeled video data is crucial for applying artificial intelligence technology in related fields.
  • Annotating a video usually needs to be done frame by frame, that is, after disassembling the video into an image sequence, annotate each frame of image.
  • One way is to use semi-automatic labeling tools to assist labeling on the basis of user labeling.
  • For example, the semi-automatic labeling tool can use a preset object tracking algorithm to track the marked objects, which can reduce the difficulty of labeling to a certain extent and improve labeling efficiency.
  • However, such semi-automatic labeling tools can usually only be used for one labeling task, so different types of labeling tasks require different semi-automatic labeling tools, each performing labeling independently of the others.
  • the semi-automatic labeling tool can only provide prediction assistance for specific objects, and can only track one object at a time, so its use effect and scope of application are extremely limited, which makes it impossible to meet actual needs.
  • an embodiment of the present disclosure provides a method of tagging a video and determining tagging information of a target frame in the video.
  • the method is based on the location mapping model for labeling, and does not depend on the type of labeling and the target object, so it can be applied to various labels and has a wider scope of application, so as to meet the needs of various scenarios.
  • Fig. 1 shows a schematic structural diagram of a system 100 according to an embodiment of the present disclosure.
  • the system 100 may be shown in FIG. 1
  • the system architecture 100 includes an input/output (Input/Output, I/O) module 110 , a location mapping module 120 and a labeling module 130 .
  • the system 100 may further include a model update module 140 , a data storage module 150 , a model storage module 160 and a correction module 170 .
  • various modules shown in FIG. 1 may communicate with each other.
  • the input/output module 110 may be used to acquire video to be processed, for example, may receive video input by a user.
  • the video input by the user may be stored in the data storage module 150 .
  • the data storage module 150 may be a data storage resource corresponding to an object storage service (Object Storage Service, OBS) provided by a cloud service provider.
  • OBS Object Storage Service
  • a video consists of a sequence of images such as frame 0, frame 1, ... etc.
  • one or more frames in the video have annotation information.
  • the start frame of the video acquired by the input/output module 110 has annotation information, that is, the input/output module 110 can be used to acquire the video and obtain the annotation information of the start frame of the video.
  • The starting frame may refer to the first frame of the video segment that needs to be marked, where the video segment that needs to be marked may be all or part of the video. If the entire video needs to be marked, the starting frame may be the first frame of the video. If only part of the video needs to be marked, the starting frame may be the first frame of that part. For example, the beginning of the video may include one or more invalid frames, test frames, etc. that do not need to be marked; in that case, the first frame of the video segment that needs to be marked after these frames can be defined as the starting frame.
  • Hereinafter, "the video" may be understood as the video segment that needs to be marked.
  • the start frame can be defined as the 0th frame of the video
  • the frames after the start frame can be defined as the first frame, the second frame, ...
  • The frames before the start frame may be called negative frames, invalid frames, or other names.
  • this definition is only for the convenience of describing the embodiments herein, and should not be construed as limiting the protection scope of the present disclosure.
  • the annotation information of the start frame may be annotated by the user.
  • an annotator may, based on experience, annotate one or more specific parts (such as animals, human bodies, vehicles, etc.) in the starting frame.
  • the annotation information may also be called a label, a task mark, or other names, which will not be listed one by one in this disclosure.
  • the labeling information of the starting frame includes information obtained by at least one of the following labeling methods: point, straight line, curve, rectangular frame, irregular polygon, and segmentation mask (mask).
  • For example, the labeling information for the starting frame may include some or all of the following: information obtained by labeling one or more points in the starting frame, information obtained by labeling one or more straight lines in the starting frame, information obtained by labeling one or more curves in the starting frame, information obtained by labeling one or more rectangular boxes in the starting frame, information obtained by labeling one or more irregular polygons in the starting frame, and information obtained by labeling one or more segmentation masks in the starting frame.
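  • The following is a minimal sketch, not taken from the patent, of how the labeling forms listed above could be held in one unified per-frame structure; the names Annotation, kind, points, mask and label are illustrative assumptions.

```python
# Hypothetical unified container for the labeling forms named above
# (points, lines, curves, boxes, polygons, segmentation masks).
from dataclasses import dataclass, field
from typing import List, Optional, Tuple
import numpy as np

@dataclass
class Annotation:
    kind: str                                  # "point", "line", "curve", "box", "polygon" or "mask"
    points: List[Tuple[int, int]] = field(default_factory=list)  # pixel coordinates, if any
    mask: Optional[np.ndarray] = None          # H x W binary array for segmentation masks
    label: str = ""                            # semantic class, e.g. "vehicle"

# Example: user annotation of the starting frame (frame 0)
start_frame_annotations = [
    Annotation(kind="box", points=[(40, 60), (120, 200)], label="vehicle"),
    Annotation(kind="point", points=[(85, 130)], label="keypoint"),
]
```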
  • Embodiments of the present disclosure do not limit the original source of the video, for example, it may be obtained from an open source data set, for example, it may be collected by an image acquisition device, or any combination of the above, or others.
  • the input/output module 110 can be implemented as an input module and an output module that are independent of each other, or can also be implemented as a coupling module that has both an input function and an output function.
  • the input/output module 110 may be implemented by using a graphical user interface (Graphical User Interface, GUI) or a command-line interface (Command-Line Interface, CLI).
  • GUI graphical User Interface
  • CLI Command-Line Interface
  • videos acquired through the input/output module 110 may be stored in the data storage module 150 .
  • the position mapping module 120 can be used to determine the matching pixels of the target pixel in the current frame in the video in the previous frame according to the position mapping model.
  • the annotation module 130 may be configured to determine the annotation information of the target pixel based on the part related to the matching pixel in the annotation information of the previous frame.
  • the video may be tagged frame-wise in a time sequence.
  • the first frame can be marked based on the starting frame (that is, the 0th frame)
  • For example, the second frame can be marked based on the starting frame and/or the first frame; the third frame can be marked based on at least one of the starting frame, the first frame and the second frame; ...; and the t-th frame can be marked based on at least one frame from the starting frame to the (t-1)-th frame, and so on.
  • labeling may not be performed in the order of the time series.
  • the first frame can be marked based on the starting frame (ie frame 0)
  • For example, the third frame can be marked based on the starting frame and/or the first frame, and the second frame can then be marked based on at least one of the starting frame, the first frame and the third frame, and so on.
  • the frame currently to be marked may be referred to as the current frame, for example, may be the tth frame.
  • the frame that has been tagged can be called a previous frame.
  • the previous frame can include any frame from the start frame to the t-1th frame.
  • previous frames may include a start frame.
  • the previous frame may include a frame preceding and adjacent to the current frame.
  • the previous frame may include a frame that is located before and not adjacent to the current frame.
  • For example, if the current frame is the t-th frame, the previous frame may include the p-th frame, where p and t are positive integers and p is smaller than t.
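  • A hedged sketch of this frame-wise procedure follows: the current frame t is annotated from one or more previously annotated frames p with p < t (the starting frame and/or recent frames). The two callables stand in for the position mapping step and the label propagation step; their names are placeholders, not the patent's API.

```python
# Frame-wise labeling loop: annotate frame t from already-annotated frames p < t.
def annotate_video(frames, start_annotation, find_matching_pixels, propagate_labels,
                   num_previous=1):
    """frames: list of images; start_annotation: annotation of frame 0 (user-provided)."""
    annotations = {0: start_annotation}                   # frame index -> annotation
    for t in range(1, len(frames)):
        # reference frames: the starting frame plus the most recent labeled frames
        refs = sorted({0, *range(max(1, t - num_previous), t)})
        current = {}
        for p in refs:
            matches = find_matching_pixels(frames[t], frames[p])   # position mapping model
            current = propagate_labels(current, matches, annotations[p])
        annotations[t] = current
    return annotations
```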
  • the labeling information of the pixels may be first determined pixel by pixel, wherein the current pixel to be labeled may be called a target pixel.
  • Embodiments of the present disclosure do not limit the manner of selecting the target pixel.
  • a pixel may be randomly selected from unlabeled pixels in the current frame as the target pixel. For example, if the unlabeled pixels of the current frame include at least one pixel, one of the unlabeled pixels may be randomly selected as the target pixel.
  • Alternatively, target pixels may be selected sequentially from unlabeled pixels, row by row or column by column. For example, starting from the first row of the current frame, the first pixel in the first row can be used as the target pixel, then the second pixel in the first row, and so on; or, starting from the first column of the current frame, the first pixel in the first column can be used as the target pixel, then the second pixel in the first column, and so on.
  • Alternatively, the target pixel may be selected based on labeled pixels in a frame preceding the current frame. For example, if the pixel at position (x1, y1) in the previous frame (such as frame t-1) has label information, then the pixel at position (x1, y1) in the current frame can be used as the target pixel.
  • target pixel may also be determined in other manners, which will not be listed one by one in the embodiments of the present disclosure.
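  • For illustration, the three selection strategies above could be sketched as follows; the function names are hypothetical, not the patent's API.

```python
# Three ways to pick the next target pixel in the current frame.
import random

def pick_random_target(unlabeled):
    """unlabeled: set of (x, y) pixel positions not yet labeled in the current frame."""
    return random.choice(sorted(unlabeled)) if unlabeled else None

def pick_row_major_target(unlabeled, height, width):
    """Scan row by row (a column-by-column scan is analogous) and return the first unlabeled pixel."""
    for y in range(height):
        for x in range(width):
            if (x, y) in unlabeled:
                return (x, y)
    return None

def pick_from_previous_labels(prev_labels):
    """prev_labels: dict {(x, y): label} of a preceding frame; reuse those positions as targets."""
    return list(prev_labels.keys())
```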
  • the "position mapping model” may also be referred to as a position mapping algorithm, a position correspondence model, a pixel matching model, a machine learning model, or other names, or may also be referred to as a "model” for short. There is no limit to this.
  • a location-mapping model can be used to determine the corresponding pixels of a pixel in one image in another image. Specifically, for given two images (assuming the first image and the second image), the position mapping model can determine one or more corresponding pixels of any pixel in the first image in the second image.
  • the location mapping model is not limited to the relationship between a given two images.
  • the first image and the second image may be two frames in the same video, or may be two frames in different videos.
  • the first image and the second image are located in the same video, the first image may be a frame before the second image, or the first image may be a frame after the second image.
  • u and v represent the same attribute of the pixel, such as the position of the pixel, the extracted feature of the pixel, or other related information.
  • s may be a real number, for example, a value between 0 and 1, and a larger s indicates a stronger corresponding relationship.
  • Alternatively, the degree of strength may take only the values 0 and 1, that is, s is only 0 or 1.
  • The relationship between one pixel in the first image and multiple pixels in the second image can be constructed based on the above formula, for example, by extending v in the above formula to a pixel set including multiple pixels and correspondingly expanding s to a vector of the appropriate dimension.
  • one or more pixels in the second image corresponding to pixel u in the first image can be determined based on the mapping.
  • the maximum value among s1 to sn can be found, such as si, then it can be determined that the pixel vi in the second image corresponds to the pixel u in the first image.
  • Alternatively, one or more values among s1 to sn that are greater than a preset value (for example, 0.5 or 0.8) can be found, and the corresponding one or more pixels v can be determined as the pixels in the second image that correspond to pixel u in the first image.
  • the corresponding pixels in the second image 220 include v1 and v2 .
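  • A small sketch of the two selection rules just described (largest s, or all s above a preset threshold); the function names are illustrative only.

```python
# Selecting matching pixels from correspondence strengths s1..sn for candidates v1..vn.
import numpy as np

def best_match(scores):
    """scores: 1-D array of correspondence strengths s1..sn; returns the index i of vi."""
    return int(np.argmax(scores))

def matches_above_threshold(scores, threshold=0.5):
    """Return all candidate indices whose correspondence strength exceeds the preset value."""
    return np.flatnonzero(np.asarray(scores) > threshold).tolist()

# Example: s = [0.1, 0.7, 0.9] -> best_match gives 2; threshold 0.5 keeps [1, 2].
```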
  • the location mapping model may be pre-built and stored in the model storage module 160, but the embodiment of the present disclosure does not limit the construction method of the location mapping model. That is to say, in the embodiment of the present disclosure, the location mapping model used when tagging the video is a preset location mapping model.
  • the preset location mapping model may be modeled and trained based on the training image set. Specifically, it may be obtained by modeling and training based on pixel information in each training image in the training image set.
  • the pixel information may include pixel position information, pixel color information, pixel feature values, pixel motion information, and the like.
  • modeling may be performed based on motion information of each pixel to obtain a preset position mapping model. In some other embodiments, modeling may be performed based on the apparent features of each pixel to obtain a preset position mapping model. In some other embodiments, modeling may be performed based on the motion information and apparent features of each pixel to obtain a preset position mapping model.
  • the motion information may be determined by adopting techniques such as optical flow and deformation field, wherein the optical flow may be constructed based on the moving direction and moving distance of pixels between two adjacent frames.
  • the appearance feature may be a feature such as color (such as red-green-blue (RGB)), which is not limited in the embodiments of the present disclosure.
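  • As one possible illustration (not prescribed by the patent) of the motion information mentioned above, dense optical flow between two adjacent frames can be computed with OpenCV's Farneback method; the parameter values shown are assumptions.

```python
# Per-pixel motion between adjacent frames via dense optical flow.
import cv2

def dense_motion(prev_bgr, next_bgr):
    prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_bgr, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, next_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    return flow  # shape (H, W, 2): moving direction and distance of each pixel
```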
  • the embodiment of the present disclosure does not limit the model structure of the preset location mapping model.
  • For example, the structure of the preset position mapping model can be determined during modeling, for example with reference to the structure of a convolutional neural network (Convolutional Neural Network, CNN) model, optionally including an input layer, convolutional layers, deconvolutional layers, pooling layers, fully connected layers, an output layer, and so on.
  • the preset location mapping model includes a large number of parameters, which can represent the calculation formula or the weight of calculation factors in the model, and the parameters can be iteratively updated through training.
  • In some embodiments, the parameters of the preset location mapping model also include hyperparameters, which are used to guide the construction or training of the preset location mapping model. Hyperparameters include, for example, the number of iterations of model training, the initial learning rate, the batch size, the number of layers of the model, the number of neurons in each layer, and the like.
  • the hyperparameters can be parameters obtained by training the model through the training set, or they can be preset parameters, and the preset parameters will not be updated through the training of the model.
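  • A minimal sketch, assuming a CNN that maps a frame to per-pixel feature vectors that a position mapping model could compare across frames; the layer configuration and the hyperparameter values (channels, learning rate, batch size, iterations) are illustrative assumptions, not the patent's architecture.

```python
# Illustrative per-pixel embedding network using the layer types named above.
import torch
import torch.nn as nn

class PixelEmbeddingNet(nn.Module):
    def __init__(self, in_channels=1, feat_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(                    # convolutional (downsampling) layers
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(                    # deconvolutional layers restore resolution
            nn.ConvTranspose2d(64, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, feat_dim, 4, stride=2, padding=1),
        )

    def forward(self, x):                                # x: (B, C, H, W) frame
        return self.decoder(self.encoder(x))             # (B, feat_dim, H, W) per-pixel features

# Example hyperparameters of the kind mentioned in the text.
model = PixelEmbeddingNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # initial learning rate
batch_size, num_iterations = 8, 10000
```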
  • the process of obtaining the preset location mapping model through training may refer to currently known or future model training algorithms.
  • the training process can be: constructing a training set, inputting the training data items in the training set to the preset position mapping model, and adjusting the parameters of the preset position mapping model by using the loss value of the loss function .
  • Each training data item in the training set iteratively trains the preset position mapping model, so that the parameters of the preset position mapping model are continuously adjusted.
  • In some embodiments, the loss function used during training is a function that measures how well the preset position mapping model is trained.
  • In the embodiments of the present disclosure, frames in the video are marked by means of the position mapping model, which does not depend on a specific labeling task and can model different labeling forms from a unified perspective, thereby supporting simultaneous annotation of various types of objects.
  • the labeling method in the embodiments of the present disclosure has no specific requirements for the scene, and does not limit the category and movement mode of the marked object, and can achieve high labeling quality even for complex scenes such as occlusion.
  • a location-mapping model can be used to determine the corresponding pixels of a pixel in one image in another image. For example, for any pixel in the first image, the position-mapping model may determine one or more corresponding pixels in the second image.
  • the location mapping model used by the location mapping module 120 may be a preset location mapping model obtained from the model storage module 160 .
  • the model update module 140 may update the preset location mapping model in the model storage module 160 to obtain an updated location mapping model.
  • the location mapping model may be updated based on the video acquired from the input/output module 110, and accordingly, the location mapping model used by the location mapping module 120 may be the updated location mapping model. That is to say, the location mapping model used by the location mapping module 120 may be obtained based on the updated video.
  • the updated location mapping model can also be stored in the model storage module 160 .
  • the model updating module 140 may train the position mapping model based on the color video and the grayscale video constructed from the color video.
  • the model update module 140 can construct a grayscale video based on the color video input by the user. Then, the preset mapping model is updated based on the color video and the grayscale video, so as to obtain a position mapping model usable by the position mapping module 120 .
  • For example, the grayscale video can be obtained by constructing a corresponding grayscale frame for each color frame in the color video (i.e., by removing the color information from each frame).
  • the training data set can be constructed based on the color video and the grayscale video, and the preset position mapping model can be trained based on the training data set, so as to update the preset mapping model.
  • the training data set includes multiple training data items, and each training data item includes a color frame and a corresponding grayscale frame.
  • the location mapping model can be updated based on the training data set using gradient descent, or can be updated in other ways, which will not be listed here.
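  • A hedged sketch of this preparation and update step follows: (color frame, grayscale frame) training items are built from the user's color video, and one gradient-descent step updates a preset model. The assumption that the model predicts color from the grayscale frame and the reconstruction loss are illustrative choices; the patent only states that the training set is built from the color and grayscale videos and that the model can be updated, e.g., by gradient descent.

```python
# Build (color, grayscale) training items and perform one gradient-descent update.
import cv2
import torch
import torch.nn.functional as F

def build_training_items(color_frames):
    """Each training data item pairs a color frame with its grayscale counterpart."""
    items = []
    for color in color_frames:                      # color: H x W x 3 (BGR) numpy array
        gray = cv2.cvtColor(color, cv2.COLOR_BGR2GRAY)
        items.append((color, gray))
    return items

def update_step(model, optimizer, color, gray):
    """One gradient-descent step; `model` is assumed to predict color from grayscale."""
    gray_t = torch.from_numpy(gray).float().unsqueeze(0).unsqueeze(0) / 255.0        # (1,1,H,W)
    color_t = torch.from_numpy(color).float().permute(2, 0, 1).unsqueeze(0) / 255.0  # (1,3,H,W)
    pred = model(gray_t)                            # assumed output shape (1,3,H,W)
    loss = F.mse_loss(pred, color_t)                # illustrative reconstruction loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```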
  • the model update module 140 may construct another color video based on the color video input by the user. Then, the preset mapping model is updated based on the color video and another color video, so as to obtain a position mapping model usable by the position mapping module 120 . In other embodiments, the update of the preset mapping model can also be implemented based on the color video input by the user in other ways, which will not be listed in this disclosure.
  • the updating process of the above-mentioned model updating module 140 based on the video to be labeled to the position mapping model can also be referred to as the training process of the position mapping model.
  • the preset position mapping model can be updated based on the video to be marked, so as to be able to adapt to different data situations in advance, and use the updated position mapping model to mark the video, which can further ensure the quality of the mark.
  • In some embodiments, the user's instruction can be obtained through the input/output module 110. If the instruction indicates that the preset location mapping model should be updated, the model update module 140 updates the preset location mapping model based on the video input by the user to obtain an updated location mapping model, which can then be used by the location mapping module 120. If the instruction indicates not to update the preset location mapping model, the location mapping module 120 may use the preset location mapping model directly.
  • In this way, the position mapping module 120 can determine the matching pixels, in the previous frame, of the target pixel in the current frame.
  • the current frame can be understood as the first image in Figure 2
  • the previous frame can be understood as the second image in Figure 2
  • That is, the target pixel in the current frame can be mapped to its matching pixel in the previous frame through the position mapping model.
  • the position mapping module 120 can also determine the matching pixels of the target pixel in the current frame in other previous frames. In this way, the position mapping module 120 is able to determine matching pixels for the target pixel in a plurality of previous frames. This method takes both the time dimension and the space dimension into account, making the references for determining the labeling information richer and more comprehensive, thereby ensuring the precision and accuracy of labeling.
  • the annotation module 130 can be used to determine the annotation information of the target pixel.
  • the labeling information of the target pixel may be determined based on the similarity between at least one matching pixel and the target pixel. It can be understood that the at least one matching pixel may be at least one matching pixel in a previous frame, or may be multiple matching pixels in multiple previous frames.
  • the label information of the target pixel may be determined based on the matching pixel having the largest similarity with the target pixel.
  • the labeling part related to the matching pixel with the largest similarity to the target pixel may be used as the labeling information of the target pixel.
  • For example, suppose the at least one matching pixel includes one matching pixel and another matching pixel. A first similarity between the one matching pixel and the target pixel and a second similarity between the other matching pixel and the target pixel can be determined and compared: if the first similarity is greater than the second similarity, the label information of the target pixel is determined based on the one matching pixel; if the first similarity is smaller than the second similarity, the label information of the target pixel is determined based on the other matching pixel.
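  • A small sketch of this comparison rule, with hypothetical names and an inner-product similarity chosen purely for illustration:

```python
# The target pixel takes the annotation of whichever matching pixel is more similar to it.
import numpy as np

def label_by_comparison(target_feat, match_a, match_b, similarity):
    """match_a / match_b: (feature, label) tuples; similarity: callable on two features."""
    s1 = similarity(match_a[0], target_feat)   # first similarity
    s2 = similarity(match_b[0], target_feat)   # second similarity
    return match_a[1] if s1 >= s2 else match_b[1]

# Usage with inner-product similarity:
sim = lambda f, g: float(np.dot(f, g))
label = label_by_comparison(np.array([1.0, 0.0]),
                            (np.array([0.9, 0.1]), "vehicle"),
                            (np.array([0.1, 0.9]), "road"), sim)   # -> "vehicle"
```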
  • the embodiment of the present disclosure does not limit the manner of determining the similarity.
  • The similarity between two pixels can be determined by calculating the distance between the two pixels. For example, for pixel i and pixel j, the feature f_i of pixel i can be obtained through feature extraction, and the feature f_j of pixel j can be obtained through feature extraction. The similarity between feature f_i and feature f_j can then be taken as the similarity between pixel i and pixel j.
  • a feature extractor, a neural network or a local feature descriptor can be used for feature extraction.
  • the similarity can be calculated by calculating inner product, Euclidean distance and other ways.
  • the manner of calculating the first similarity and the manner of calculating the second similarity may be the same, for example, Euclidean distance is used as the similarity. In this way, the consistency of the different similarities being compared can be guaranteed, making the determined result more accurate. In other examples, the manner of calculating the first similarity may be inconsistent with the manner of calculating the second similarity, which can meet the requirements of various scenarios.
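  • For illustration, two of the similarity measures mentioned above can be sketched as follows; converting the Euclidean distance into a similarity score is an assumed choice, since the text only names the distance itself.

```python
# Inner-product similarity and a Euclidean-distance-based similarity on extracted features.
import numpy as np

def inner_product_similarity(f_i, f_j):
    return float(np.dot(f_i, f_j))              # larger value = more similar

def euclidean_similarity(f_i, f_j):
    # distance converted so that a larger value also means more similar (illustrative)
    return 1.0 / (1.0 + float(np.linalg.norm(f_i - f_j)))
```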
  • Here, c_i denotes the part of the annotation information related to matching pixel i, and c_j denotes the annotation information of pixel j (the target pixel).
  • weights may be used to measure the importance and contribution of at least one matching pixel.
  • the label information of the target pixel may be determined based on at least one matching pixel and its weight. Specifically, the weighted summation of the labeled parts related to at least one matching pixel may be used as the labeling information of the target pixel.
  • weights may be determined based on similarity between pixels. Specifically, the similarity between the at least one matching pixel and the target pixel may be determined, and then normalization is performed based on the total similarity to determine the weight of each matching pixel of the at least one matching pixel.
  • the embodiment of the present disclosure does not limit the manner of determining the similarity.
  • the similarity between the two pixels can be determined by the distance between the two pixels.
  • The feature f_i of pixel i can be obtained through feature extraction, and the feature f_j of pixel j can be obtained through feature extraction.
  • The similarity between feature f_i and feature f_j can then be taken as the similarity between pixel i and pixel j.
  • a feature extractor, a neural network or a local feature descriptor can be used for feature extraction.
  • the similarity can be calculated by calculating inner product, Euclidean distance and other ways.
  • For example, the weight corresponding to matching pixel i (of the k matching pixels) can be expressed as:

    $w_{ij} = \dfrac{\exp(f_i^\top f_j)}{\sum_{m=1}^{k} \exp(f_m^\top f_j)}$

  • where $\top$ denotes transpose and $f_i^\top f_j$ indicates the similarity between f_i and f_j. Subsequently, the labeled parts of the k matched pixels can be weighted and summed to obtain the labeling information of pixel j (i.e., the target pixel):

    $c_j = \sum_{i=1}^{k} w_{ij}\, c_i$
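  • A minimal sketch following the reconstructed formulas above; the softmax-style normalization of the similarities and the one-hot encoding of the annotation parts c_i are assumptions made for illustration.

```python
# Weighted propagation of annotation parts from k matching pixels to the target pixel.
import numpy as np

def propagate_label(f_j, match_feats, match_labels):
    """
    f_j: (D,) feature of the target pixel j.
    match_feats: (k, D) features f_i of the k matching pixels.
    match_labels: (k, C) annotation parts c_i (e.g. one-hot class vectors).
    Returns c_j, the (C,) annotation information of the target pixel.
    """
    sims = match_feats @ f_j                      # f_i^T f_j for each matching pixel
    w = np.exp(sims - sims.max())                 # numerically stable normalization
    w = w / w.sum()                               # weights w_ij sum to 1
    return w @ match_labels                       # c_j = sum_i w_ij * c_i

# Example: two matching pixels, classes [background, vehicle]
c_j = propagate_label(np.array([1.0, 0.0]),
                      np.array([[0.9, 0.1], [0.2, 0.8]]),
                      np.array([[0.0, 1.0], [1.0, 0.0]]))
```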
  • the labeling information of the target pixel in the current frame can be determined based on the position mapping model.
  • In some embodiments, the labeling module 130 may also determine the labeling information of the current frame based at least on the labeling information of the target pixels in the current frame. Exemplarily, the combination of the annotation information of all pixels in the current frame may be determined as the annotation information of the current frame.
  • the correction module 170 may correct the labeling information of each pixel, so as to obtain the labeling information of the current frame.
  • Corrections can be made based on the object being annotated or the type of annotation. For example, if the marked object is a straight line in the video, it can be adjusted by piecewise linear fitting after obtaining the marked information of the pixels. For example, if the labeling type is a segmentation mask, after obtaining the labeling information of the pixels, the wrong labeling of some pixels can be eliminated by smoothing. For example, if the label type is a rectangular frame, then after obtaining the label information of the vertex pixels of the rectangular frame, the position of the vertices of the rectangular frame can be adjusted through edge information, regional feature matching degree, etc. to obtain a more standard and compact rectangular frame.
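  • As one example of the corrections described above, isolated mislabeled pixels in a segmentation mask can be removed by smoothing. The use of a median filter and the kernel size below are assumptions; the patent only states that smoothing can eliminate wrong labels of some pixels.

```python
# Smooth a 0/1 segmentation mask to suppress isolated wrongly labeled pixels.
import cv2
import numpy as np

def smooth_mask(mask, ksize=5):
    """mask: H x W array of 0/1 labels; returns a smoothed 0/1 mask."""
    m = np.asarray(mask, dtype=np.uint8) * 255
    m = cv2.medianBlur(m, ksize)                  # majority vote in a ksize x ksize window
    return (m > 127).astype(np.uint8)
```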
  • Then the (t+1)-th frame can be used as the current frame to further obtain the annotation information of the (t+1)-th frame, and so on. In this way, the annotation information for each frame in the video can be obtained.
  • the tagging information for a video may be referred to as a tagging result.
  • the annotation information of the video may be output to the user via the input/output module 110, for example, may be presented to the user in a visual manner. Therefore, the user can manually correct the annotation information of the video, and the like.
  • the labeling of each frame in the video is realized based on the position mapping model.
  • The marking method in the embodiments of the present disclosure does not limit the type of marking or the marked object; for example, the marking may be one or more of points, straight lines, curves, rectangular frames, irregular polygons, and segmentation masks. Labeling information can be determined for various types of objects, without requiring the user to select different preset algorithms for different objects. High-quality annotation information can be obtained even for scenes with irregular motion trajectories, occlusions, and various object deformations.
  • the location mapping model in the embodiments of the present disclosure can also be updated based on the video to be tagged, so that the location mapping model can adapt to different data situations before tagging, and further ensure the accuracy of tagging.
  • system 100 shown in FIG. 1 may be a system capable of interacting with users, and the system 100 may be a software system, a hardware system, or a system combining hardware and software.
  • the system 100 can be implemented as a computing device or a part of a computing device, where the computing device includes but not limited to a desktop computer, a mobile terminal, a wearable device, a server, a cloud server, and the like.
  • System 100 can be deployed in at least one of a cloud environment and a local computing device. As an example, the system 100 is fully deployed in a cloud environment. As another example, some modules of the system 100 are deployed in a cloud environment, and other modules of the system 100 are deployed in a local computing device. As yet another example, system 100 is deployed entirely on a local computing device.
  • FIG. 3 shows a schematic diagram of a scenario 300 in which the system 100 is deployed in a cloud environment and a local computing device according to an embodiment of the present disclosure.
  • For example, the system 100 is deployed in a distributed manner across a cloud environment 310 and a local computing device 320, wherein the model update module 140 and the model storage module 160 are deployed in the cloud environment 310, and the input/output module 110, the location mapping module 120, the labeling module 130, the data storage module 150 and the correction module 170 are deployed in the local computing device 320.
  • It should be understood that FIG. 3 is only a schematic illustration, and the embodiments of the present disclosure do not limit which parts of the system 100 are deployed where; the deployment can be adapted to the actual situation or needs.
  • system shown in FIG. 1 may also be deployed on a local computing device.
  • system 100 may be independently deployed on one local computing device, or distributed and deployed on multiple local computing devices, which is not limited in the present disclosure.
  • FIG. 4 shows a schematic structural diagram of a computing device 400 according to an embodiment of the present disclosure.
  • Computing device 400 in FIG. 4 may be implemented as a device on which system 100 in FIG. 1 is deployed.
  • computing device 400 may be implemented as a device in cloud environment 310 in FIG. 3 or as local computing device 320 . It should be understood that the computing device 400 shown in FIG. 4 can also be regarded as a computing device cluster.
  • the computing device 400 includes a memory 410 , a processor 420 , a communication interface 430 and a bus 440 , wherein the bus 440 is used for communication between various components of the computing device 400 .
  • the memory 410 may be a read-only memory (Read Only Memory, ROM), a random access memory (Random Access Memory, RAM), a hard disk, a flash memory or any combination thereof.
  • the memory 410 can store programs, and when the programs stored in the memory 410 are executed by the processor 420, the processor 420 and the communication interface 430 are used to execute the processes that can be executed by the various modules in the system 100 as described above. It should be understood that the processor 420 and the communication interface 430 may also be used to implement part or all of the content in the embodiments of the video tagging method described below in this specification.
  • the memory can also store video and position mapping models. For example, a part of storage resources in the memory 410 is divided into a data storage module for storing videos, such as videos to be labeled, and a part of storage resources in the memory 410 is divided into a model storage module for storing location mapping models.
  • the processor 420 may be a central processing unit (Central Processing Unit, CPU), an application-specific integrated circuit (Application-Specific Integrated Circuit, ASIC), a graphics processing unit (Graphics Processing Unit, GPU) or any combination thereof.
  • Processor 420 may include one or more chips.
  • the processor 420 may include an accelerator, such as a neural processing unit (Neural Processing Unit, NPU).
  • the communication interface 430 uses a transceiver module such as a transceiver to implement communication between the computing device 400 and other devices or communication networks. For example, data may be acquired through communication interface 430 .
  • Bus 440 may include pathways for communicating information between various components of computing device 400 (eg, memory 410 , processor 420 , communication interface 430 ).
  • Fig. 5 shows a schematic flowchart of a video tagging method 500 according to an embodiment of the present disclosure.
  • the method 500 shown in FIG. 5 can be executed by the system 100 .
  • According to the position mapping model, the matching pixel in a previous frame of the target pixel in the current frame of the video is determined; the previous frame has annotation information.
  • The previous frame is located before the current frame and has already been annotated.
  • the previous frame includes a start frame of the video, and the annotation information of the start frame is annotated by the user.
  • the previous frame includes adjacent or non-adjacent frames in the video preceding the current frame.
  • the annotation information of the previous frame may include information obtained by at least one of the following annotation methods: points, straight lines, curves, rectangular boxes, irregular polygons, and segmentation masks.
  • the video is a color video
  • the position-mapping model is trained based on the color video and grayscale video constructed from the color video.
  • FIG. 6 shows a schematic flowchart of a process 600 of obtaining a location mapping model according to an embodiment of the present disclosure.
  • a grayscale video is constructed based on the color video input by the user.
  • Each frame in a color video can be recolored to obtain a corresponding grayscale frame.
  • the preset location mapping model is updated based on the color video and the grayscale video to obtain a location mapping model.
  • the preset position mapping model can be updated by using the method of gradient descent, so as to obtain the updated position mapping model.
  • In addition, matching pixels of the target pixel in one or more other previous frames may also be determined.
  • the annotation information of the target pixel is determined based on the portion of the annotation information of the previous frame related to the matched pixel.
  • the annotation information related to the matching pixel may be used as the annotation information of the target pixel.
  • the label information of the target pixel may be determined based on the matched pixel and one or more other matched pixels.
  • FIG. 7 shows a schematic flowchart of a process 700 of determining label information of a target pixel according to an embodiment of the present disclosure.
  • a portion related to another matching pixel of the target pixel in the previous frame is determined from the annotation information of the previous frame.
  • the position mapping model can determine multiple matching pixels in the previous frame to realize multi-point mapping.
  • the annotation information of the target pixel is determined based on the portion related to the matching pixel and the portion related to another matching pixel in the annotation information of the previous frame.
  • a first similarity between a matching pixel and a target pixel may be determined, and a second similarity between another matching pixel and the target pixel may be determined. Further, the labeling information of the target pixel may be determined by performing a weighted summation of the first similarity and the second similarity.
  • a first similarity between a matching pixel and a target pixel may be determined, and a second similarity between another matching pixel and a target pixel may be determined.
  • the labeling information of the target pixel can be determined through the comparison result of the first similarity and the second similarity. Specifically, the labeling information of the target pixel is determined based on the matching pixel corresponding to a larger similarity value.
  • the label information of the matching pixel may be used as the label information of the target pixel.
  • the label information of another matching pixel may be used as the label information of the target pixel.
  • the annotation information of the current frame may be determined.
  • the annotation information of each pixel in the current frame may be corrected, so as to obtain the annotation information of the current frame.
  • the labeling of each frame in the video is realized based on the position mapping model.
  • The marking method in the embodiments of the present disclosure does not limit the type of marking or the marked object; for example, the marking may be one or more of points, straight lines, curves, rectangular frames, irregular polygons, and segmentation masks. Labeling information can be determined for various types of objects, without requiring the user to select different preset algorithms for different objects. High-quality annotation information can be obtained even for scenes with irregular motion trajectories, occlusions, and various object deformations.
  • the location mapping model in the embodiments of the present disclosure can also be updated based on the video to be tagged, so that the location mapping model can adapt to different data situations before tagging, and further ensure the accuracy of tagging.
  • Fig. 8 shows a schematic block diagram of a video tagging device 800 according to an embodiment of the present disclosure.
  • Apparatus 800 may be implemented by software, hardware or a combination of both.
  • the device 800 may be a software or hardware device that implements part or all of the functions in the system 100 shown in FIG. 1 .
  • the apparatus 800 includes a mapping unit 810 and a determining unit 820 .
  • the mapping unit 810 is configured to determine the matching pixels of the target pixel in the current frame in the video in the previous frame according to the position mapping model, and the previous frame has annotation information.
  • the determining unit 820 is configured to determine the annotation information of the target pixel based on the part related to the matching pixel in the annotation information of the previous frame.
  • In some embodiments, the determining unit 820 is configured to: determine, from the annotation information of the previous frame, a part related to another matching pixel of the target pixel in the previous frame; and determine the annotation information of the target pixel based on the part related to the matching pixel and the part related to the other matching pixel.
  • In some embodiments, the determining unit 820 is configured to: determine a first similarity between the matching pixel and the target pixel, and a second similarity between the other matching pixel and the target pixel; and determine the label information of the target pixel by performing a weighted sum based on the first similarity and the second similarity.
  • the video is a color video
  • the position mapping model is trained based on the color video and grayscale video constructed from the color video.
  • the apparatus 800 may further include a constructing unit 802 and an updating unit 804 .
  • the construction unit 802 may be configured to construct a grayscale video based on the color video input by the user.
  • the updating unit 804 may be configured to update the preset location mapping model based on the color video and the grayscale video to obtain the location mapping model.
  • the previous frame includes a starting frame of the video, and the marking information of the starting frame is marked by a user.
  • the previous frame includes adjacent or non-adjacent frames preceding the current frame in the video.
  • the annotation information of the previous frame includes information obtained by at least one of the following annotation methods: points, straight lines, curves, rectangular boxes, irregular polygons, and segmentation masks.
  • the determining unit 820 is further configured to: determine the labeling information of the current frame based at least on the labeling information of the target pixel.
  • the apparatus 800 can be implemented as the system 100, for example, the mapping unit 810 can be implemented as the position mapping module 120, the determination unit 820 can be implemented as the labeling module 130, the construction unit 802 and the update unit 804 can be implemented as Model update module 140 .
  • the division of units in the embodiments of the present disclosure is schematic, and it is only a logical function division. In actual implementation, there may be other division methods.
  • The functional units in the disclosed embodiments can be integrated into one processing unit, each unit may also exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.
  • the apparatus 800 shown in FIG. 8 can be used to implement the above-mentioned video tagging process shown in conjunction with FIGS. 5 to 7 .
  • the present disclosure can also be implemented as a computer program product.
  • a computer program product may include computer readable program instructions for carrying out various aspects of the present disclosure.
  • the present disclosure may be implemented as a computer-readable storage medium, on which computer-readable program instructions are stored, and when a processor executes the instructions, the processor is made to execute the above-mentioned processing procedures.
  • a computer readable storage medium may be a tangible device that can retain and store instructions for use by an instruction execution device.
  • a computer readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • Computer-readable storage media include: portable computer diskettes, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (Erasable Programmable Read Only Memory, EPROM) or flash memory, Static Random Access Memory (SRAM), Portable Compact Disc Read-Only Memory (CD-ROM), Digital Versatile Disk (Digital Versatile Discs, DVDs), memory sticks, floppy disks, mechanically encoded devices such as punched cards or raised structures in grooves with instructions stored thereon, and any suitable combination of the foregoing.
  • computer-readable storage media are not to be construed as transient signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., pulses of light through fiber optic cables), or transmitted electrical signals.
  • Computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or downloaded to an external computer or external storage device over a network, such as the Internet, a local area network, a wide area network, and/or a wireless network.
  • the network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers.
  • a network adapter card or a network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in each computing/processing device.
  • Computer readable program instructions for performing the operations of the present disclosure may be assembly instructions, Instruction Set Architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages.
  • Computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server.
  • the remote computer can be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or it can be connected to an external computer (for example, through the Internet using an Internet service provider).
  • in some embodiments, electronic circuits such as programmable logic circuits, field-programmable gate arrays (Field-Programmable Gate Array, FPGA), or programmable logic arrays (Programmable Logic Array, PLA) can be personalized by utilizing state information of the computer-readable program instructions, and these electronic circuits can execute the computer-readable program instructions, thereby implementing various aspects of the present disclosure.
  • These computer readable program instructions may be provided to a processing unit of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processing unit of the computer or other programmable data processing apparatus, create an apparatus for implementing the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
  • These computer-readable program instructions can also be stored in a computer-readable storage medium, and these instructions cause computers, programmable data processing apparatuses and/or other devices to work in a specific way, so that the computer-readable medium storing the instructions comprises an article of manufacture that includes instructions for implementing various aspects of the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
  • each block in a flowchart or block diagram may represent a module, a program segment, or a portion of instructions that contains one or more executable instructions for implementing the specified logical functions.
  • in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer readable program instructions.


Abstract

Embodiments of the present disclosure provide a video annotation method and apparatus, a computing device, and a computer-readable storage medium. The method includes: determining, according to a position mapping model, a matching pixel in a previous frame for a target pixel in a current frame of a video, the previous frame having annotation information; and determining annotation information of the target pixel based on a part of the annotation information of the previous frame that relates to the matching pixel. In this way, embodiments of the present disclosure annotate each frame in a video based on the position mapping model. The type of annotation and the annotated objects are not restricted, annotation information can be determined for objects of various types, and high-quality annotation information can be obtained even for scenes with irregular motion trajectories, occlusion, and diverse object deformations.

Description

视频标注方法、装置、计算设备和计算机可读存储介质 技术领域
本公开涉及人工智能领域,并且更具体地,涉及一种视频标注方法、装置、计算设备和计算机可读存储介质。
背景技术
视频作为最主要的视觉信息载体,包括互联网上用户上传的视频、监控系统保存的视频、影视剧视频等,可以为各种视觉智能应用提供数据输入。
基于视频的很多人工智能技术要依赖于对视频的标注。然而,目前对视频进行标注的方案不仅成本高而且效率低。
发明内容
本公开的示例实施例提供了一种视频标注方法,该方法能够基于位置映射模型实现对视频中各帧的自动标注,且标注效率高。
第一方面,提供了一种视频标注方法。该方法包括:根据位置映射模型,确定视频中的当前帧中的目标像素在先前帧中的匹配像素,先前帧具有标注信息;以及基于先前帧的标注信息中与匹配像素有关的部分,确定目标像素的标注信息。
如此,本公开的实施例中,基于位置映射模型实现了对视频中各帧的标注。对标注的类型和被标注的对象不限制,对于各种类型的对象都能够确定标注信息,并且即使针对运动轨迹不规则、具有遮挡、物体形变多样等场景也能够得到高质量的标注信息。
在第一方面的一些实施例中,该视频为彩色视频,位置映射模型是基于彩色视频以及从彩色视频构建的灰度视频而训练得到的。
在第一方面的一些实施例中,在确定匹配像素之前还包括:基于用户输入的彩色视频,构建灰度视频;以及基于彩色视频和灰度视频对预置位置映射模型进行更新,以得到位置映射模型。
如此,位置映射模型可以基于待标注的视频进行更新,从而在标注之前使得位置映射模型能够适应不同的数据情况,进一步保证的标注的准确性。
在第一方面的一些实施例中,确定目标像素的标注信息包括:从先前帧的标注信息中确定与目标像素在先前帧中的另一匹配像素有关的部分;以及基于先前帧的标注信息中与匹配像素有关的部分以及与另一匹配像素有关的部分,确定目标像素的标注信息。
如此,能够基于多个匹配像素来确定目标像素的标注信息,使得标注结果更加准确。
在第一方面的一些实施例中,确定目标像素的标注信息包括:确定匹配像素与目标像素之间的第一相似度,以及另一匹配像素与目标像素之间的第二相似度;以及通过对第一相似度与第二相似度进行加权求和,确定目标像素的标注信息。
在第一方面的一些实施例中,确定目标像素的标注信息包括:确定匹配像素与目标像素之间的第一相似度,以及另一匹配像素与目标像素之间的第二相似度;以及通过第一相似度与第二相似度的比较结果,确定目标像素的标注信息。
在第一方面的一些实施例中,先前帧包括视频的起始帧,起始帧的标注信息是由用户标 注的。
如此,可以基于准确的用户标注,来确定当前帧中目标像素的标注信息,使得得到的标注信息更加准确。
在第一方面的一些实施例中,先前帧包括视频中的位于当前帧之前的与当前帧相邻或非相邻的帧。
如此,可以基于相邻帧进行标注,充分考虑视频中对象的移动性,使得得到的标注信息更能符合实际情况。
在第一方面的一些实施例中,标注信息包括通过以下至少一种标注方式所得到的信息:点、直线、曲线、矩形框、不规则多边形、和分割掩膜。
如此,本公开可以针对各种形式的标注信息进行后续的标注,对于各种类型的对象都能够确定标注信息,而无需用户针对不同的对象选择不同的预置算法。即使针对运动轨迹不规则、具有遮挡、物体形变多样等场景也能够得到高质量的标注信息。
在第一方面的一些实施例中,还包括:至少基于目标像素的标注信息,确定当前帧的标注信息。
第二方面,提供了一种视频标注装置。该装置包括:映射单元被配置为根据位置映射模型,确定视频中的当前帧中的目标像素在先前帧中的匹配像素,先前帧具有标注信息;以及确定单元被配置为基于先前帧的标注信息中与匹配像素有关的部分,确定目标像素的标注信息。
在第二方面的一些实施例中,视频为彩色视频,位置映射模型是基于彩色视频以及从彩色视频构建的灰度视频而训练得到的。
在第二方面的一些实施例中,该装置还包括:构建单元被配置为基于用户输入的彩色视频,构建灰度视频;以及更新单元被配置为基于彩色视频和灰度视频对预置位置映射模型进行更新,以得到位置映射模型。
在第二方面的一些实施例中,确定单元被配置为从先前帧的标注信息中确定与目标像素在先前帧中的另一匹配像素有关的部分;以及基于先前帧的标注信息中与匹配像素有关的部分以及与另一匹配像素有关的部分,确定目标像素的标注信息。
在第二方面的一些实施例中,确定单元被配置为:确定匹配像素与目标像素之间的第一相似度,以及另一匹配像素与目标像素之间的第二相似度;以及通过对第一相似度与第二相似度进行加权求和,确定目标像素的标注信息。
在第二方面的一些实施例中,先前帧包括视频的起始帧,起始帧的标注信息是由用户标注的。
在第二方面的一些实施例中,先前帧包括视频中的位于当前帧之前的与当前帧相邻或非相邻的帧。
在第二方面的一些实施例中,标注信息包括通过以下至少一种标注方式所得到的信息:点、直线、曲线、矩形框、不规则多边形、和分割掩膜。
在第二方面的一些实施例中,确定单元还被配置为至少基于目标像素的标注信息,确定当前帧的标注信息。
第三方面,提供了一种计算设备,包括处理器以及存储器,所述存储器上存储有由处理器执行的指令,当该指令被处理器执行时使得所述计算设备实现:根据位置映射模型,确定 视频中的当前帧中的目标像素在先前帧中的匹配像素,先前帧具有标注信息;以及基于先前帧的标注信息中与匹配像素有关的部分,确定目标像素的标注信息。
在第三方面的一些实施例中,当该指令被处理器执行时使得所述计算设备实现:从先前帧的标注信息中确定与目标像素在先前帧中的另一匹配像素有关的部分;以及基于先前帧的标注信息中与匹配像素有关的部分以及与另一匹配像素有关的部分,确定目标像素的标注信息。
在第三方面的一些实施例中,当该指令被处理器执行时使得所述计算设备实现:确定匹配像素与目标像素之间的第一相似度,以及另一匹配像素与目标像素之间的第二相似度;以及通过对第一相似度与第二相似度进行加权求和,确定目标像素的标注信息。
在第三方面的一些实施例中,视频为彩色视频,位置映射模型是基于彩色视频以及从彩色视频构建的灰度视频而训练得到的。
在第三方面的一些实施例中,当该指令被处理器执行时使得所述计算设备实现:基于用户输入的彩色视频,构建灰度视频;以及基于彩色视频和灰度视频对预置位置映射模型进行更新,以得到位置映射模型。
在第三方面的一些实施例中,先前帧包括视频的起始帧,起始帧的标注信息是由用户标注的。
在第三方面的一些实施例中,先前帧包括视频中的位于当前帧之前的与当前帧相邻或非相邻的帧。
在第三方面的一些实施例中,标注信息包括通过以下至少一种标注方式所得到的信息:点、直线、曲线、矩形框、不规则多边形、和分割掩膜。
在第三方面的一些实施例中,当该指令被处理器执行时使得所述计算设备实现:至少基于目标像素的标注信息,确定当前帧的标注信息。
第四方面,提供了一种计算机可读存储介质,该计算机可读存储介质上存储有计算机程序,该计算机程序被处理器执行时实现根据上述第一方面或其任一实施例中的方法的操作。
第五方面,提供了一种芯片或芯片系统。该芯片或芯片系统包括处理电路,被配置为执行根据上述第一方面或其任一实施例中的方法的操作。
第六方面,提供了一种计算机程序或计算机程序产品。该计算机程序或计算机程序产品被有形地存储在计算机可读介质上并且包括计算机可执行指令,计算机可执行指令在被执行时使设备实现根据上述第一方面或其任一实施例中的方法的操作。
附图说明
结合附图并参考以下详细说明,本公开各实施例的上述和其他特征、优点及其他方面将变得更加明显。在附图中,相同或相似的附图标注表示相同或相似的元素,其中:
图1示出了根据本公开的实施例的系统的结构示意图;
图2示出了根据本公开的实施例的位置映射模型的一个示意图;
图3示出了根据本公开的实施例的系统被部署于云环境和本地计算设备中的场景的示意图;
图4示出了根据本公开的实施例的计算设备的结构示意图;
图5示出了根据本公开的实施例的视频标注方法的示意流程图;
图6示出了根据本公开的实施例的得到位置映射模型的过程的示意流程图;
图7示出了根据本公开的实施例的确定目标像素的标注信息的过程的示意流程图;
图8示出了根据本公开的实施例的视频标注装置的示意框图。
具体实施方式
下面将参照附图更详细地描述本公开的实施例。虽然附图中显示了本公开的某些实施例,然而应当理解的是,本公开可以通过各种形式来实现,而且不应该被解释为限于这里阐述的实施例,相反提供这些实施例是为了更加透彻和完整地理解本公开。应当理解的是,本公开的附图及实施例仅用于示例性作用,并非用于限制本公开的保护范围。
在本公开的实施例的描述中,术语“包括”及其类似用语应当理解为开放性包含,即“包括但不限于”。术语“基于”应当理解为“至少部分地基于”。术语“一个实施例”或“该实施例”应当理解为“至少一个实施例”。术语“第一”、“第二”等等可以指代不同的或相同的对象。下文还可能包括其他明确的和隐含的定义。
人工智能(Artificial Intelligence,AI)利用计算机来模拟人的某些思维过程和智能行为。人工智能的研究历史有着一条从以“推理”为重点,到以“知识”为重点,再到以“学习”为重点的自然、清晰的脉络。人工智能已经被广泛地应用到了安防、医疗、交通、教育、金融等各个行业。
机器学习(Machine Learning)是人工智能的一个分支,其研究计算机怎样模拟或实现人类的学习行为,以获取新的知识或技能,重新组织已有的知识结构使之不断改善自身的性能。也就是说,机器学习研究的是如何在经验学习中改善具体算法的性能。
深度学习(Deep Learning)是一类基于深层次神经网络算法的机器学习技术,其主要特征是使用多重非线性变换结构对数据进行处理和分析。主要应用于人工智能领域的感知、决策等场景,例如图像和语音识别、自然语言翻译、计算机博弈等。
视频作为一种重要的视觉信息载体,包括互联网上用户上传的视频、监控系统保存的视频、影视剧视频等,为各种视觉智能应用提供数据输入。当前面向视觉的人工智能技术,尤其是作为主流的深度学习技术,往往依赖于大量的有标注数据进行学习。视频数据由于相比图像多出了时间维度,且数据规模巨大,导致对视频的标注非常困难,不仅成本高而且效率低,进而限制了相关技术在实际中的应用,也限制了视频数据发挥其应有的价值,因此高效地获取带标注的视频数据对于人工智能技术在相关领域的落地至关重要。
对视频进行标注通常需要逐帧进行,即将视频拆解为图像序列后对每帧图像进行标注。一种方式是在用户标注的基础上,使用半自动标注工具来辅助标注,其中半自动标注工具可以使用预置的物体跟踪算法对被标注的物体进行跟踪,这样可以在一定程度上降低标注的难度,提高标注的效率。但是这样的半自动标注工具通常只能针对一种标注任务,因此如果针对不同类型的标注任务,需要不同的半自动标注工具相互独立地进行标注。另外,而且半自动标注工具只能针对特定的物体提供预测辅助能力,同时每次只能跟踪一个物体,从而其使用效果和适用范围是极其有限的,如此导致了无法满足实际需求。
有鉴于此,本公开的实施例提供了一种对视频进行标注,确定视频中目标帧的标注信息。该方法基于位置映射模型进行标注,不依赖于标注的类型和针对的对象,因此能够适用于各种标注,适用范围更广,从而能够满足各种场景的需求。
图1示出了根据本公开的实施例的系统100的结构示意图。如图1所示,系统100可以如图1所示,系统架构100包括输入/输出(Input/Output,I/O)模块110、位置映射模块120和标注模块130。可选地,如图1所示,系统100还可以包括模型更新模块140、数据存储模块150、模型存储模块160和校正模块170。根据各个实施例中操作的需要,图1所示的各个模块之间可以彼此进行通信。
输入/输出模块110可以用于获取待处理视频,例如可以接收由用户输入的视频。
可选地,用户输入的视频可以被存储在数据存储模块150中。作为一个示例,数据存储模块150可以是云服务提供商提供的对象存储服务(Object Storage Service,OBS)对应的数据存储资源。
视频包括图像序列,如第0帧、第1帧、…等。作为示例,该视频中的一帧或多帧具有标注信息。
在一些实施例中,输入/输出模块110所获取的视频的起始帧具有标注信息,也就是说,输入/输出模块110可以用于获取视频,并获取视频的起始帧的标注信息。
起始帧可以是指该视频中需要进行标注的视频段的第一帧,其中,需要进行标注的视频段可以是该视频的全部或部分。在该视频的全部需要进行标注的场景下,该起始帧可以是指位于该视频的开端的第一帧。在仅该视频的部分需要进行标注的场景下,该起始帧可以是指需要进行标注的视频的部分的开端的第一帧。例如,视频的开始部分可能包括一个或多个无效帧、测试帧等,这些帧不需要被进行标注,那么可以将这些不需要被标注的帧之后的需要进行视频标注的视频段的第一帧定义为起始帧。
为了简化描述,便于理解,下文中可以将视频视为“需要进行标注的视频段”。相应地,可以将起始帧定义为视频的第0帧,将位于起始帧之后的帧顺次地定义为第1帧、第2帧、…,将位于起始帧之前的帧(如果存在的话)定义为负帧或无效帧或其他名称等。但是可理解的是,这种定义方式仅是为了本文中实施例描述的方便,不应解释为对本公开的保护范围的限定。
可选地,起始帧的标注信息可以是由用户标注的。例如,标注人员可以根据经验,针对该起始帧中的一个或多个特定部分(如动物、人体、车辆等)进行标注。该标注信息也可以被称为标签、任务标记或其他名称等,本公开中不再一一罗列。
在本公开的实施例中,起始帧的标注信息包括通过以下至少一种标注方式所得到的信息:点、直线、曲线、矩形框、不规则多边形、和分割掩膜(mask)等。作为示例,对起始帧的标注信息可以包括以下部分或全部信息:对起始帧中一个或多个点进行标注所得到的信息、对起始帧中一条或多条直线进行标注所得到的信息、对起始帧中一条或多条曲线进行标注所得到的信息、对起始帧中一个或多个矩形框进行标注所得到信息、对起始帧中一个或多个不规则多边形进行标注所得到的信息、对起始帧中一个或多个分割掩膜进行标注所得到的信息等。
应理解的是,本公开的实施例中对各种标注方式的具体含义不做限定。例如,点可以表示行人,直线可以表示道路,矩形框可以表示动物区域,不规则多边形可以表示姿态等等。
本公开的实施例对视频的原始来源不作限定,例如可以是从开源数据集获取的,例如可以是由图像采集设备采集的,或上述所列的任意组合,或其他等等。
输入/输出模块110可以被实现为彼此独立的输入模块和输出模块,或者也可以被实现 为同时具备输入功能和输出功能的耦合模块。作为示例,可以采用图形用户界面(Graphical User Interface,GUI)或命令行界面(Command-Line Interface,CLI)实现输入/输出模块110。
作为示例,通过输入/输出模块110所获取的视频可以被存储在数据存储模块150中。
位置映射模块120可以用于根据位置映射模型,确定视频中的当前帧中的目标像素在先前帧中的匹配像素。标注模块130可以用于基于先前帧的标注信息中与匹配像素有关的部分,确定目标像素的标注信息。
在一种实现方式中,可以按照时间序列对视频以逐帧(frame-wise)的方式进行标注。示例性地,可以基于起始帧(即第0帧)标注第1帧,基于起始帧和/或第1帧标注第2帧,基于起始帧、第1帧和第2帧中的至少一帧标注第3帧,…,基于起始帧至第t-1帧中的至少一帧标注第t帧,等等。在另一种实现方式中,可以不按照时间序列的顺序进行标注。例如,可以基于起始帧(即第0帧)标注第1帧,基于起始帧和/或第1帧标注第3帧,基于起始帧、第1帧和第3帧中的至少一帧标注第2帧,…,等等。
可以将当前要标注的帧称为当前帧,例如可以是第t帧。可以将已经完成标注的帧称为先前帧,例如在按照时间序列逐帧标注的情形,先前帧可以包括起始帧至第t-1帧中的任一帧。作为一个示例,先前帧可以包括起始帧。作为另一个示例,先前帧可以包括位于当前帧之前且与当前帧相邻的帧。作为再一个示例,先前帧可以包括位于当前帧之前且与当前帧不相邻的帧。简而言之,当前帧为第t帧,先前帧可以包括第p帧,且p和t为正整数,p小于t。
在对当前帧进行标注时,可以逐像素地先确定像素的标注信息,其中当前要标注的像素可以被称为目标像素。
本公开的实施例对选取目标像素的方式不作限定。
在一些实施例中,可以随机地从当前帧的未标注像素中选择一个像素作为目标像素。举例来说,当前帧的未标注像素包括至少一个像素,那么可以从未标注像素中随机选取其中之一作为目标像素。
在一些实施例中,可以以行或列的方式依次地从未标注像素中选择目标像素。举例来说,可以从当前帧的第一行开始,将第一行的第一个像素作为目标像素;随后再将第一行的第二个像素作为目标像素…。举例来说,可以从当前帧的第一列开始,将第一列的第一个像素作为目标像素;随后再将第一列的第二个像素作为目标像素…。
在一些实施例中,可以基于当前帧的前一帧中的被标注像素来选择目标像素。举例来说,如果当前帧的前一帧(如第t-1帧)中的位置(x1,y1)处的像素具有标注信息,那么可以将当前帧中的位置(x1,y1)处的像素作为目标像素。
应理解的是,也可以通过其他方式来确定目标像素,本公开的实施例中不再一一罗列。
本公开的实施例中,“位置映射模型”也可以被称为位置映射算法、位置对应模型、像素匹配模型、机器学习模型或其他名称等,或者也可以被简称为“模型”等,本公开对此不限定。
位置映射模型可以用于确定某图像中的像素在另一图像中的对应像素。具体地,对于给定的两幅图像(假如为第一图像和第二图像),位置映射模型可以确定第一图像中的任一个像素在第二图像中的一个或多个对应像素。
应注意的是,位置映射模型对于给定的两幅图像之间的关系不作限定。例如,第一图像 和第二图像可以是位于同一视频中的两帧,或者可以是位于不同视频中的两帧。例如,第一图像和第二图像位于同一视频中,第一图像可以是位于第二图像之前的帧,第一图像也可以是第二图像之后的帧。
如果用u表示第一图像中的任一个像素,用v表示第二图像中的任一个像素,用g表示两幅图像之间的映射,那么映射g的基本功能可以表示为:s=g(u,v)。
可理解的是,u和v表示像素的相同属性,例如像素的位置、像素的提取特征或者其他相关的信息等等。
等式s=g(u,v)得到的s可以表示第一图像中的像素u和第二图像中的像素v之间对应关系的强弱程度。s可以是一个实数,例如可以是0至1之间的值,且s越大表示对应关系越强。
在上式的一个特例中,可以定义强弱程度仅有0和1两种情况,即s仅为0或1。此时,可以将上式表示为如下的衍生形式:r=g1(u),其中r表示和u对应的像素的表示,例如r可以表示该对应的像素的坐标。可理解,该式实际上等价于判定g1(u)和v的对应关系,此处不再详述。
进一步地,可以基于上式构建第一图像中的一个像素与第二图像中的多个像素之间的关系,例如将上式中的v扩展为包括多个像素的像素集合,相应地s被扩展为相应维度的向量。作为一例,可以将式s=g(u,v)扩展为:[s1,s2,…,sn]=g(u:[v1,v2,…,vn])。
如此,可以基于该映射,确定与第一图像中的像素u所对应的第二图像中的一个或多个像素。在一例中,可以找到s1至sn中最大值,例如为si,那么可以确定第二图像中的像素vi是与第一图像中的像素u所对应的像素。
在另一例中,可以找到s1至sn中大于预设值(例如0.5或0.8等)的一个或多个s,那么可以确定一个或多个s对应的一个或多个v是第二图像中的与第一图像中的像素u所对应的像素。如图2所示,通过位置映射g,针对在第一图像210中的像素u,可以确定在第二图像220中的对应像素包括v1和v2。
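The mapping s = g(u, v) above can be thought of as an affinity score between every pixel of the first image and every pixel of the second image. As a purely illustrative, minimal Python/NumPy sketch (not taken from the disclosure), assuming per-pixel feature vectors have already been extracted by some feature extractor, the scores and the corresponding pixels could be computed as follows; the cosine-similarity choice and the 0.5 threshold are assumptions of this example.

```python
import numpy as np

def affinity(feats_a: np.ndarray, feats_b: np.ndarray) -> np.ndarray:
    """Correspondence strength s = g(u, v) for every pixel pair.

    feats_a: (Na, C) per-pixel features of the first image (Na = Ha * Wa).
    feats_b: (Nb, C) per-pixel features of the second image (Nb = Hb * Wb).
    Returns an (Na, Nb) matrix of scores in [0, 1].
    """
    a = feats_a / (np.linalg.norm(feats_a, axis=1, keepdims=True) + 1e-8)
    b = feats_b / (np.linalg.norm(feats_b, axis=1, keepdims=True) + 1e-8)
    scores = a @ b.T              # cosine similarity, in [-1, 1]
    return (scores + 1.0) / 2.0   # rescale to [0, 1]

def matching_pixels(score_row: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Indices v in the second image whose score with a given pixel u exceeds the threshold."""
    return np.flatnonzero(score_row > threshold)
```

Under these assumptions, `affinity(feats_a, feats_b)[u]` plays the role of [s1, s2, …, sn] = g(u: [v1, v2, …, vn]) for a single pixel u, and `matching_pixels` keeps the pixels whose scores exceed the preset value.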
本公开的实施例中,位置映射模型可以被预先构建并被存储在模型存储模块160中,但是本公开的实施例对位置映射模型的构建方式不作限定。也就是说,本公开的实施例中,对视频进行标注时所使用的位置映射模型为预置位置映射模型。
预置位置映射模型可以是基于训练图像集进行建模并训练得到的。具体地,可以基于训练图像集中各个训练图像中的像素信息进行建模并训练得到的。像素信息可以包括像素的位置信息、像素的颜色信息、像素的特征值、像素的运动信息等等。
在一些实施例中,可以基于各个像素的运动信息进行建模,以得到预置位置映射模型。在另一些实施例中,可以基于各个像素的表观特征进行建模,以得到预置位置映射模型。在另一些实施例中,可以基于各个像素的运动信息和表观特征进行建模,以得到预置位置映射模型。
示例性地,运动信息可以是采用诸如光流等的运动场、形变场等技术所确定的,其中光流可以是基于相邻两帧之间的像素的移动方向和移动距离等构建的。示例性地,表观特征可以是诸如颜色(如红-绿-蓝(RGB))等特征,本公开实施例对此不限定。
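As one concrete way to obtain the motion information mentioned above, dense optical flow between two frames can map a pixel position in the current frame back to the previous frame. The sketch below is only an illustration of that idea: the use of OpenCV's Farneback flow, the function name, and the grayscale `uint8` inputs are assumptions of this example, not something fixed by the disclosure.

```python
import cv2
import numpy as np

def map_pixel_to_previous(curr_gray: np.ndarray, prev_gray: np.ndarray, x: int, y: int):
    """Estimate where pixel (x, y) of the current frame was located in the previous frame.

    curr_gray, prev_gray: single-channel uint8 frames of the same size.
    """
    # Dense flow from the current frame to the previous frame.
    flow = cv2.calcOpticalFlowFarneback(curr_gray, prev_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    dx, dy = flow[y, x]  # per-pixel displacement (horizontal, vertical)
    px = int(round(np.clip(x + dx, 0, prev_gray.shape[1] - 1)))
    py = int(round(np.clip(y + dy, 0, prev_gray.shape[0] - 1)))
    return px, py
```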
可理解,本公开实施例对预置位置映射模型的模型结构不作限定。作为一例,可以在建模时确定预置位置映射模型的结构,例如可以参照卷积神经网络(Convolutional Neural  Network,CNN)模型的结构,可选地包括输入层、卷积层、反卷积层、池化层、全连接层、输出层等。
预置位置映射模型中包括大量的参数,可以表示该模型中的计算公式或计算因子的权重,并且可以通过训练对参数进行迭代更新。预置位置映射模型的参数还包括超参数(hyper-parameter),用于指导预置位置映射模型的构建或训练。超参数例如包括模型训练的迭代(iteration)次数、初始学习率(leaning rate)、批尺寸(batch size)、模型的层数、每层神经元的个数等。超参数可以是通过训练集对模型进行训练获得的参数,也可以是预先设定的参数,预先设定的参数指不会通过对模型的训练而被更新。
通过训练得到预置位置映射模型的过程可以参照当前已知或将来待开发的模型训练算法。作为示意性描述,该训练过程可以是:构建训练集,将训练集中的训练数据项输入到预置位置映射模型,利用损失函数(loss function)的损失值对预置位置映射模型的参数进行调整。训练集中的每个训练数据项迭代地对预置位置映射模型进行训练,进而使得预置位置映射模型的参数不断调整。在训练过程中的损失函数是用于衡量预置位置映射模型被训练的程度的函数。
可见,本公开的实施例中借助于位置映射模型来对视频中的帧进行标注,不依赖于特定的标注任务,可以采用统一视角来建模不同的标注形式,从而可以支持对全图各种不同类型的对象同时进行标注。并且,本公开实施例中的标注方式对于场景没有特定的要求,对被标注的物体的类别和运动方式不作限定,即使针对诸如遮挡等复杂场景,也能够达到较高的标注质量。
如上所述,位置映射模型可以用于确定某图像中的像素在另一图像中的对应像素。例如针对第一图像中的任一像素,位置映射模型可以确定第二图像中的一个或多个对应像素。
在一些实施例中,位置映射模块120所使用的位置映射模型可以是从模型存储模块160中获取的预置位置映射模型。
在另一些实施例中,模型更新模块140可以对模型存储模块160中的预置位置映射模型进行更新,以得到更新后的位置映射模型。具体地,可以基于从输入/输出模块110所获取的视频对位置映射模型进行更新,相应地,位置映射模块120所使用的位置映射模型可以是该更新后的位置映射模型。也就是说,位置映射模块120所使用的位置映射模型可以是基于视频被更新而得到的。可选地,该更新后的位置映射模型也可以被存储在模型存储模块160中。
具体地,如果输入/输出模块110所获取的视频为彩色视频,那么模型更新模块140可以基于该彩色视频以及从彩色视频所构建的灰度视频来训练位置映射模型。
在一些实施例中,模型更新模块140可以基于用户输入的彩色视频,构建灰度视频。随后基于该彩色视频和灰度视频对预置映射模型进行更新,从而得到位置映射模块120可使用的位置映射模型。
可以通过重新着色,基于彩色视频中的每一彩色帧构建对应的灰度帧,以得到灰度视频。可以基于彩色视频和灰度视频构建训练数据集,并基于该训练数据集对预置位置映射模型进行训练,以实现对预置映射模型的更新。可理解,训练数据集包括多个训练数据项,每一训练数据项包括彩色帧和对应的灰度帧。还可理解,可以采用梯度下降基于训练数据集对位置映射模型进行更新,也可以采用其他方式进行更新,这里不再罗列。
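A minimal sketch of how the color/grayscale training pairs described above could be assembled and used to update a preset model; frames are assumed to be available as OpenCV BGR arrays, and `model.fit_step` is a hypothetical interface standing in for one gradient-descent step on whatever loss the preset position mapping model is trained with.

```python
import cv2

def build_training_pairs(color_frames):
    """Pair each color frame with a grayscale version of itself."""
    pairs = []
    for frame in color_frames:                       # frame: H x W x 3, BGR
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        pairs.append((frame, gray))
    return pairs

def update_mapping_model(model, color_frames, epochs=1):
    """Fine-tune a preset position mapping model on the video to be annotated."""
    pairs = build_training_pairs(color_frames)
    for _ in range(epochs):
        for color, gray in pairs:
            model.fit_step(color, gray)              # hypothetical: one update step
    return model
```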
在另一些实施例中,模型更新模块140可以基于用户输入的彩色视频,构建另一彩色视频。随后基于该彩色视频和另一彩色视频对预置映射模型进行更新,从而得到位置映射模块120可使用的位置映射模型。在其他实施例中,也可以通过其他的方式,基于用户输入的彩色视频来实现对预置映射模型的更新,本公开中不再罗列。
可理解,上述模型更新模块140基于待标注视频对位置映射模型的更新过程也可以称为对位置映射模型的训练过程。如此,可以基于待标注的视频对预置位置映射模型进行更新,从而能够预先适应不同的数据情况,使用更新的位置映射模型对视频进行标注,能够进一步确保标注的质量。
在一些实施例中,可以通过输入/输出模块110获取用户的指令。如果该指令指示对预置位置更新模型进行更新,则模型更新模块140基于用户输入的视频对预置位置更新模型进行更新,以得到更新后的位置更新模型。进一步地,该更新后的位置更新模型可以在之后由位置映射模块120使用。如果该指令指示不对预置位置更新模型进行更新,则位置映射模块120可以使用预置位置映射模型。
在从模型存储模块160获取预置位置映射模型后或者在从模型更新模块140获取更新后的位置映射模型后,基于位置映射模型,位置映射模块120可以确定当前帧中的目标像素在先前帧中的匹配像素。具体地,当前帧可以理解为如图2中的第一图像,先前帧可以理解为如图2中的第二图像,那么可以通过位置映射模型将当前帧中的目标像素映射到先前帧中的一个或多个对应像素,即至少一个匹配像素。
可理解的是,位置映射模块120还可以确定当前帧中的目标像素在其他先前帧中的匹配像素。这样,位置映射模块120能够确定目标像素在多个先前帧中的匹配像素。这种方式同时考虑了时间维度和空间维度,使得确定标注信息的参考更加丰富全面,进而能够确保标注的精度和准确性。
标注模块130可以用于确定目标像素的标注信息。
在一些实施例中,可以基于至少一个匹配像素与目标像素之间的相似度来确定目标像素的标注信息。可理解的是,至少一个匹配像素可以是在一个先前帧中的至少一个匹配像素,也可以是在多个先前帧中的多个匹配像素。
可选地,可以基于与目标像素的相似度最大的匹配像素,来确定目标像素的标注信息。具体地,可以将与目标像素的相似度最大的匹配像素有关的标注部分,作为目标像素的标注信息。
以两个匹配像素为例,假设至少一个匹配像素包括一个匹配像素和另一匹配像素。那么可以确定一个匹配像素与目标像素之间的第一相似度,确定另一匹配像素与目标像素之间的第二相似度。比较第一相似度和第二相似度,如果第一相似度大于第二相似度,则基于一个匹配像素来确定目标像素的标注信息。如果第一相似度小于第二相似度,则基于另一匹配像素来确定目标像素的标注信息。
本公开实施例对确定相似度的方式不作限定。示例性地,可以通过计算两个像素之间的距离来确定这两个像素之间的相似度。举例而言,针对像素i和像素j,可以通过特征提取得到像素i的特征f_i,通过特征提取得到像素j的特征f_j。随后可以将特征f_i与特征f_j之间的相似度作为像素i与像素j之间的相似度。可选地,可以采用特征提取器,采用神经网络或局部特征描述子等方式进行特征提取。可选地,可以通过计算内积、欧式距离等方式来计算相似度。
在一些示例中,计算第一相似度的方式和计算第二相似度的方式可以是一致的,例如都采用欧式距离作为相似度。这样,能够保证被比较的不同相似度的一致性,使得确定的结果更加准确。在另一些示例中,计算第一相似度的方式和计算第二相似度的方式可以是不一致的,这样能够满足多样化场景的需求。
假设目标像素表示为像素j,至少一个匹配像素中与目标像素的相似度最大的匹配像素为像素i,且该像素i在先前帧中被标注有c_i,那么可以确定像素j(即目标像素)的标注信息为:y_j = c_i。
本另一些实施例中,可以使用权重来衡量至少一个匹配像素的重要性和贡献大小。进一步地,可以基于至少一个匹配像素和其权重来确定目标像素的标注信息。具体地,可以将至少一个匹配像素有关的标注部分的加权求和,作为目标像素的标注信息。
在一些实施例中,可以基于像素之间的相似度来确定权重。具体地,可以确定至少一个匹配像素分别与目标像素之间的相似度,随后基于总的相似度进行归一化来确定至少一个匹配像素的每个匹配像素的权重。
本公开实施例对确定相似度的方式不作限定。示例性地,针对任意两个像素,可以通过这两个像素之间的距离来确定这两个像素之间的相似度。
举例而言,针对像素i和像素j,可以通过特征提取得到像素i的特征f_i,通过特征提取得到像素j的特征f_j。随后可以将特征f_i与特征f_j之间的相似度作为像素i与像素j之间的相似度。可选地,可以采用特征提取器,采用神经网络或局部特征描述子等方式进行特征提取。可选地,可以通过计算内积、欧式距离等方式来计算相似度。
假设目标像素表示为像素j,至少一个匹配像素包括k个匹配像素,那么至少一个匹配像素中的像素i所对应的权重可以表示为:
w_i = (f_i^T f_j) / Σ_{m=1..k} (f_m^T f_j)
上式中,T表示转置,f_i^T f_j表示f_i与f_j之间的相似度。随后,可以将这k个匹配像素的标注部分进行加权求和,得到像素j(即目标像素)的标注信息为:
y_j = Σ_{i=1..k} w_i c_i
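Restated as code, the weighted propagation above can be sketched as follows (a minimal NumPy example, not part of the original text). Feature extraction is assumed to have been done elsewhere; the normalization follows the normalized-similarity reading of the formula above, and a softmax over the similarities would be an equally plausible variant.

```python
import numpy as np

def propagate_label(f_target: np.ndarray, matched_feats: np.ndarray,
                    matched_labels: np.ndarray) -> np.ndarray:
    """Weighted sum of the matched pixels' labels, with weights from normalized similarities.

    f_target:       (C,)   feature f_j of the target pixel j in the current frame.
    matched_feats:  (k, C) features f_i of the k matched pixels in the previous frame.
    matched_labels: (k, L) label vectors c_i (e.g. one-hot class labels) of the matched pixels.
    """
    sims = matched_feats @ f_target              # f_i^T f_j for each matched pixel
    sims = np.clip(sims, 0.0, None)              # keep the weights non-negative
    weights = sims / (sims.sum() + 1e-8)         # w_i, normalized over the k matches
    return weights @ matched_labels              # y_j = sum_i w_i * c_i
```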
如此,通过上面结合位置映射模块120和标注模块130的相关描述,可以基于位置映射模型,确定当前帧中目标像素的标注信息。
标注模块130还可以至少基于当前帧中目标像素的标注信息,确定当前帧的标注信息。示例性地,可以将当前帧中各个像素的标注信息的综合,确定为当前帧的标志信息。
在一些实施例中,在得到当前帧中各个像素的标注信息之后,校正模块170可以对各个像素的标注信息进行校正,从而得到当前帧的标注信息。
可以基于被标注的对象或标注的类型进行校正。例如,被标注的对象是视频中的直线,那么可以在得到像素的标注信息之后,通过分段线性拟合等方式进行调整。例如,标注的类型为分割掩膜,那么可以在得到像素的标注信息之后,通过平滑处理来剔除对部分像素的错误标注。例如,标注的类型为矩形框,那么可以在得到矩形框的顶点像素的标注信息之后,通过边缘信息、区域特征匹配程度等来调整矩形框的顶点位置以得到更加标准和紧致的矩形 框。
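As an illustration of the kind of correction described here, the sketch below first smooths a propagated segmentation mask to remove isolated mislabeled pixels and then derives a tight rectangle from it; the use of OpenCV morphology and the kernel size are assumptions of this example.

```python
import cv2
import numpy as np

def correct_mask_and_box(mask: np.ndarray, kernel_size: int = 5):
    """Smooth a binary per-pixel mask and fit a tight bounding box.

    mask: H x W array with values in {0, 1} produced by per-pixel propagation.
    Returns (smoothed_mask, (x, y, w, h)) or (smoothed_mask, None) if the mask is empty.
    """
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    smoothed = cv2.morphologyEx(mask.astype(np.uint8), cv2.MORPH_OPEN, kernel)
    smoothed = cv2.morphologyEx(smoothed, cv2.MORPH_CLOSE, kernel)
    ys, xs = np.nonzero(smoothed)
    if xs.size == 0:
        return smoothed, None
    x, y = int(xs.min()), int(ys.min())
    w, h = int(xs.max()) - x + 1, int(ys.max()) - y + 1
    return smoothed, (x, y, w, h)
```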
可理解的是,在将第t帧作为当前帧,得到该第t帧的标注信息之后,可以将第t+1帧作为当前帧,进一步得到第t+1帧的标注信息,…,通过这样的方式,能够得到对于该视频中各帧的标注信息。示例性地,可以将对视频的标注信息称为标注结果。
在一些实施例中,视频的标注信息(即标注结果)可以经由输入/输出模块110输出给用户,例如可以通过可视化方式呈现给用户。从而,用户能够对视频的标注信息进行手动修正等。
如此,本公开的实施例中,基于位置映射模型实现了对视频中各帧的标注。并且本公开的实施例中的标注方式对标注的类型和被标注的对象不限制,例如可以是点、直线、曲线、矩形框、不规则多边形、和分割掩膜中的一种或多种。对于各种类型的对象都能够确定标注信息,而无需用户针对不同的对象选择不同的预置算法。即使针对运动轨迹不规则、具有遮挡、物体形变多样等场景也能够得到高质量的标注信息。另外,本公开的实施例中的位置映射模型还可以基于待标注的视频进行更新,从而在标注之前使得位置映射模型能够适应不同的数据情况,进一步保证的标注的准确性。
可理解,图1所示的系统100可以是能够与用户进行交互的系统,该系统100可以是软件系统、硬件系统、或软硬结合的系统。
在一些示例中,该系统100可以被实现为计算设备或者计算设备的一部分,其中计算设备包括但不限于台式机、移动终端、可穿戴设备、服务器、云服务器等。
系统100可以被部署于云环境和本地计算设备的至少一个中。作为一例,系统100被全部部署在云环境中。作为另一例,系统100的部分模块被部署在云环境中,系统100的另部分模块被部署在本地计算设备中。作为又一例,系统100被全部部署在本地计算设备中。
作为一个示例,图3示出了根据本公开的实施例的系统100被部署于云环境和本地计算设备中的场景300的示意图。如图3所示,系统100被分布式地部署在云环境310和终端计算设备320中,其中,模型更新模块140和模型存储模块160被部署在云环境310中,输入/输出模块110、位置映射模块120、标注模块130、数据存储模块150和校正模块170被部署在本地计算设备320中。
应理解的是,图3仅是示意,本公开的实施例对系统100的哪些部分具体被部署在哪里不作限定,实际应用时,可以根据本地计算设备320的计算能力、云环境310的资源占用情况或实际的需求等进行适应性的部署。
作为另一个示例,图1中所示的系统也可以被部署在本地计算设备中。例如系统100可以被单独部署在一台本地计算设备上,或者可以被分布式地部署在多台本地计算设备上,本公开对此不限定。
图4示出了根据本公开的实施例的计算设备400的结构示意图。图4中的计算设备400可以被实现为图1中的系统100被部署的设备。例如计算设备400可以被实现为图3中的云环境310中的设备或者本地计算设备320。应理解,图4所示的计算设备400也可以被视为计算设备集群。
如图4所示,计算设备400包括存储器410、处理器420、通信接口430以及总线440,其中,总线440用于计算设备400的各个部件彼此之间的通信。
存储器410可以是只读存储器(Read Only Memory,ROM),随机存取存储器(Random Access Memory,RAM),硬盘,快闪存储器或其任意组合。存储器410可以存储程序,当存储器410中存储的程序被处理器420执行时,处理器420和通信接口430用于执行如上所述的系统100中各个模块能够执行的过程。应理解,处理器420和通信接口430也可以用于执行本说明书下文所述的视频标注方法的实施例中的部分或全部内容。存储器还可以存储视频和位置映射模型。例如,存储器410中的一部分存储资源被划分成一个数据存储模块,用于存储视频,如待标注视频等,存储器410中的一部分存储资源被划分成模型存储模块,用于存储位置映射模型。
处理器420可以采用中央处理单元(Central Processing Unit,CPU),专用集成电路(Application-Specific Integrated Circuit,ASIC),图形处理单元(Graphics Processing Unit,GPU)或其任意组合。处理器420可以包括一个或多个芯片。处理器420可以包括加速器,例如神经处理单元(Neural Processing Unit,NPU)。
通信接口430使用例如收发器一类的收发模块,来实现计算设备400与其他设备或通信网络之间的通信。例如,可以通过通信接口430获取数据。
总线440可包括在计算设备400各个部件(例如,存储器410、处理器420、通信接口430)之间传送信息的通路。
图5示出了根据本公开的实施例的视频标注方法500的示意流程图。图5所示的方法500可以由系统100执行。
如图5所示,在框510,根据位置映射模型,确定视频中的当前帧中的目标像素在先前帧中的匹配像素,先前帧具有标注信息。
在一些实施例中,先前帧位于目标帧之前且已经完成标注。在一些示例中,先前帧包括视频的起始帧,该起始帧的标注信息是由用户标注的。在一些示例中,先前帧包括视频中的位于当前帧之前的与当前帧相邻或非相邻的帧。
在一些实施例中,先前帧的标注信息可以包括通过以下至少一种标注方式所得到的信息:点、直线、曲线、矩形框、不规则多边形、和分割掩膜。
在一些实施例中,视频为彩色视频,且该位置映射模型是基于彩色视频以及从彩色视频构建的灰度视频而训练得到的。
图6示出了根据本公开的实施例的得到位置映射模型的过程600的示意流程图。
在框610,基于用户输入的彩色视频,构建灰度视频。
可以通过将彩色视频中的每一帧进行重新着色,以得到对应的灰度帧。
在框620,基于彩色视频和灰度视频对预置位置映射模型进行更新,以得到位置映射模型。
可以基于彩色视频和灰度视频,采用梯度下降的方法对预置位置映射模型进行更新,从而得到更新后的位置映射模型。
在一些实施例中,还可以确定目标像素在其他一个或多个当前帧中的匹配像素。
在框520,基于先前帧的标注信息中与匹配像素有关的部分,确定目标像素的标注信息。
作为一例,可以将与匹配像素有关的标注信息,作为目标像素的标注信息。作为另一例,可以基于匹配像素和另外一个或多个匹配像素,来确定目标像素的标注信息。
图7示出了根据本公开的实施例的确定目标像素的标注信息的过程700的示意流程图。
在框710,从先前帧的标注信息中确定与目标像素在所述先前帧中的另一匹配像素有关的部分。
也就是说,针对当前帧中的目标像素,位置映射模型可以确定在先前帧中的多个匹配像素,实现多点映射。
在框720,基于先前帧的标注信息中与匹配像素有关的部分以及与另一匹配像素有关的部分,确定目标像素的标注信息。
在一些示例中,可以确定匹配像素与目标像素之间的第一相似度,确定另一匹配像素与目标像素之间的第二相似度。进一步,可以通过对第一相似度和第二相似度进行加权求和,确定目标像素的标注信息。
在另一些示例中,可以确定匹配像素与目标像素之间的第一相似度,确定另一匹配像素与目标像素之间的第二相似度。进一步,可以通过第一相似度和第二相似度的比较结果,确定目标像素的标注信息。具体地,基于相似度较大值对应的匹配像素,来确定目标像素的标注信息。
举例来说,第一相似度大于第二相似度,那么可以将匹配像素的标注信息作为目标像素的标注信息。举例来说,第二相似度大于第一相似度,那么可以将另一匹配像素的标注信息作为目标像素的标注信息。
可选地,如图5所示,在框530,还可以至少基于目标像素的标注信息,确定当前帧的标注信息。
在一些实施例中,可以对当前帧中各像素的标注信息进行校正,从而得到当前帧的标注信息。
如此,本公开的实施例中,基于位置映射模型实现了对视频中各帧的标注。并且本公开的实施例中的标注方式对标注的类型和被标注的对象不限制,例如可以是点、直线、曲线、矩形框、不规则多边形、分割掩膜中的一种或多种。对于各种类型的对象都能够确定标注信息,而无需用户针对不同的对象选择不同的预置算法。即使针对运动轨迹不规则、具有遮挡、物体形变多样等场景也能够得到高质量的标注信息。另外,本公开的实施例中的位置映射模型还可以基于待标注的视频进行更新,从而在标注之前使得位置映射模型能够适应不同的数据情况,进一步保证的标注的准确性。
可理解的是,本公开实施例中结合图5至图7所描述的过程,可以参照上面结合图1至图4所描述的模块等的功能,为了简洁,不再重复。
图8示出了根据本公开的实施例的视频标注装置800的示意框图。装置800可以通过软件、硬件或者两者结合的方式实现。在一些实施例中,装置800可以为实现图1所示的系统100中的部分或全部功能的软件或硬件装置。
如图8所示,装置800包括映射单元810和确定单元820。映射单元810被配置为根据位置映射模型,确定视频中的当前帧中的目标像素在先前帧中的匹配像素,先前帧具有标注信息。确定单元820被配置为基于先前帧的标注信息中与匹配像素有关的部分,确定目标像素的标注信息。
在一些实施例中,确定单元820被配置为从先前帧的所述标注信息中确定与目标像素在先前帧中的另一匹配像素有关的部分;以及基于先前帧的标注信息中与匹配像素有关的部分以及与另一匹配像素有关的部分,确定目标像素的标注信息。
在一些实施例中,确定单元820被配置为确定匹配像素与目标像素之间的第一相似度,以及另一匹配像素与目标像素之间的第二相似度;以及通过对第一相似度与第二相似度进行加权求和,确定目标像素的标注信息。
在一些实施例中,视频为彩色视频,并且位置映射模型是基于彩色视频以及从彩色视频构建的灰度视频而训练得到的。
在一些实施例中,如图8所示,该装置800还可以包括构建单元802和更新单元804。构建单元802可以被配置为基于用户输入的彩色视频,构建灰度视频。更新单元804可以被配置为基于彩色视频和灰度视频对预置位置映射模型进行更新,以得到位置映射模型。
在一些实施例中,先前帧包括所述视频的起始帧,起始帧的标注信息是由用户标注的。
在一些实施例中,先前帧包括视频中的位于当前帧之前的与当前帧相邻或非相邻的帧。
在一些实施例中,先前帧的标注信息包括通过以下至少一种标注方式所得到的信息:点、直线、曲线、矩形框、不规则多边形、和分割掩膜。
在一些实施例中,确定单元820还被配置为:至少基于目标像素的标注信息,确定当前帧的标注信息。
可选地,装置800可以实现为系统100,示例性地,映射单元810可以被实现为位置映射模块120,确定单元820可以被实现为标注模块130,构建单元802和更新单元804可以被实现为模型更新模块140。
本公开的实施例中对单元的划分是示意性的,仅仅为一种逻辑功能划分,实际实现时也可以有另外的划分方式,另外,在公开的实施例中的各功能单元可以集成在一个处理器中,也可以是单独物理存在,也可以两个或两个以上单元集成为一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
图8所示的装置800能够用于实现上述结合图5至图7所示的视频标注的过程。
本公开还可以实现为计算机程序产品。计算机程序产品可以包括用于执行本公开的各个方面的计算机可读程序指令。本公开可以实现为计算机可读存储介质,其上存储有计算机可读程序指令,当处理器运行所述指令时,使得处理器执行上述的处理过程。
计算机可读存储介质可以是可以保持和存储由指令执行设备使用的指令的有形设备。计算机可读存储介质例如可以是――但不限于――电存储设备、磁存储设备、光存储设备、电磁存储设备、半导体存储设备或者上述的任意合适的组合。计算机可读存储介质的更具体的例子(非穷举的列表)包括:便携式计算机盘、硬盘、随机存取存储器(RAM)、只读存储器(ROM)、可擦式可编程只读存储器(Erasable Programmable Read Only Memory,EPROM)或闪存、静态随机存取存储器(Static Random Access Memory,SRAM)、便携式压缩盘只读存储器(Compact Disc Read-Only Memory,CD-ROM)、数字多功能盘(Digital Versatile Disc,DVD)、记忆棒、软盘、机械编码设备、例如其上存储有指令的打孔卡或凹槽内凸起结构、以及上述的任意合适的组合。这里所使用的计算机可读存储介质不被解释为瞬时信号本身,诸如无线电波或者其他自由传播的电磁波、通过波导或其他传输媒介传播的电磁波(例如,通过光纤电缆的光脉冲)、或者通过电线传输的电信号。
这里所描述的计算机可读程序指令可以从计算机可读存储介质下载到各个计算/处理设备,或者通过网络、例如因特网、局域网、广域网和/或无线网下载到外部计算机或外部存储设备。网络可以包括铜传输电缆、光纤传输、无线传输、路由器、防火墙、交换机、网 关计算机和/或边缘服务器。每个计算/处理设备中的网络适配卡或者网络接口从网络接收计算机可读程序指令,并转发该计算机可读程序指令,以供存储在各个计算/处理设备中的计算机可读存储介质中。
用于执行本公开操作的计算机可读程序指令可以是汇编指令、指令集架构(Instruction Set Architecture,ISA)指令、机器指令、机器相关指令、微代码、固件指令、状态设置数据、或者以一种或多种编程语言的任意组合编写的源代码或目标代码,编程语言包括面向对象的编程语言—诸如Smalltalk、C++等,以及常规的过程式编程语言—诸如“C”语言或类似的编程语言。计算机可读程序指令可以完全地在用户计算机上执行、部分地在用户计算机上执行、作为一个独立的软件包执行、部分在用户计算机上部分在远程计算机上执行、或者完全在远程计算机或服务器上执行。在涉及远程计算机的情形中,远程计算机可以通过任意种类的网络—包括局域网(Local Area Network,LAN)或广域网(Wide Area Network,WAN)—连接到用户计算机,或者,可以连接到外部计算机(例如利用因特网服务提供商来通过因特网连接)。在一些实施例中,通过利用计算机可读程序指令的状态信息来个性化定制电子电路,例如可编程逻辑电路、现场可编程门阵列(Field-Programmable Gate Array,FPGA)或可编程逻辑阵列(Programmable Logic Array,PLA),该电子电路可以执行计算机可读程序指令,从而实现本公开的各个方面。
这里参照根据本公开实施例的方法、装置(系统)和计算机程序产品的流程图和/或框图描述了本公开的各个方面。应当理解,流程图和/或框图的每个方框以及流程图和/或框图中各方框的组合,都可以由计算机可读程序指令实现。
这些计算机可读程序指令可以提供给通用计算机、专用计算机或其它可编程数据处理装置的处理单元,从而生产出一种机器,使得这些指令在通过计算机或其它可编程数据处理装置的处理单元执行时,产生了实现流程图和/或框图中的一个或多个方框中规定的功能/动作的装置。也可以把这些计算机可读程序指令存储在计算机可读存储介质中,这些指令使得计算机、可编程数据处理装置和/或其他设备以特定方式工作,从而,存储有指令的计算机可读介质则包括一个制造品,其包括实现流程图和/或框图中的一个或多个方框中规定的功能/动作的各个方面的指令。
也可以把计算机可读程序指令加载到计算机、其它可编程数据处理装置、或其它设备上,使得在计算机、其它可编程数据处理装置或其它设备上执行一系列操作步骤,以产生计算机实现的过程,从而使得在计算机、其它可编程数据处理装置、或其它设备上执行的指令实现流程图和/或框图中的一个或多个方框中规定的功能/动作。
附图中的流程图和框图显示了根据本公开的多个实施例的系统、方法和计算机程序产品的可能实现的体系架构、功能和操作。在这点上,流程图或框图中的每个方框可以代表一个模块、程序段或指令的一部分,模块、程序段或指令的一部分包含一个或多个用于实现规定的逻辑功能的可执行指令。在有些作为替换的实现中,方框中所标注的功能也可以以不同于附图中所标注的顺序发生。例如,两个连续的方框实际上可以基本并行地执行,它们有时也可以按相反的顺序执行,这依所涉及的功能而定。也要注意的是,框图和/或流程图中的每个方框、以及框图和/或流程图中的方框的组合,可以用执行规定的功能或动作的专用的基于硬件的系统来实现,或者可以用专用硬件与计算机可读程序指令的组合来实现。

Claims (20)

  1. 一种视频标注方法,其特征在于,包括:
    根据位置映射模型,确定视频中的当前帧中的目标像素在先前帧中的匹配像素,所述先前帧具有标注信息;以及
    基于所述先前帧的标注信息中与所述匹配像素有关的部分,确定所述目标像素的标注信息。
  2. 根据权利要求1所述的方法,其特征在于,所述视频为彩色视频,所述位置映射模型是基于所述彩色视频以及从所述彩色视频构建的灰度视频而训练得到的。
  3. 根据权利要求2所述的方法,其特征在于,在所述确定所述匹配像素之前还包括:
    基于用户输入的所述彩色视频,构建灰度视频;以及
    基于所述彩色视频和所述灰度视频对预置位置映射模型进行更新,以得到所述位置映射模型。
  4. 根据权利要求1至3中任一项所述的方法,其特征在于,确定所述目标像素的标注信息包括:
    从所述先前帧的标注信息中确定与所述目标像素在所述先前帧中的另一匹配像素有关的部分;以及
    基于所述先前帧的标注信息中与所述匹配像素有关的部分以及与所述另一匹配像素有关的部分,确定所述目标像素的标注信息。
  5. 根据权利要求4所述的方法,其特征在于,确定所述目标像素的标注信息包括:
    确定所述匹配像素与所述目标像素之间的第一相似度,以及所述另一匹配像素与所述目标像素之间的第二相似度;以及
    通过对所述第一相似度与所述第二相似度进行加权求和,确定所述目标像素的标注信息。
  6. 根据权利要求1至5中任一项所述的方法,其特征在于,所述先前帧包括所述视频的起始帧,所述起始帧的标注信息是由用户标注的。
  7. 根据权利要求1至6中任一项所述的方法,其特征在于,所述先前帧包括所述视频中的位于所述当前帧之前的与所述当前帧相邻或非相邻的帧。
  8. 根据权利要求1至7中任一项所述的方法,其特征在于,所述先前帧的所述标注信息包括通过以下至少一种标注方式所得到的信息:
    点、直线、曲线、矩形框、不规则多边形、和分割掩膜。
  9. 根据权利要求1至8中任一项所述的方法,其特征在于,还包括:
    至少基于所述目标像素的标注信息,确定所述当前帧的标注信息。
  10. 一种视频标注装置,其特征在于,包括:
    映射单元,被配置为根据位置映射模型,确定视频中的当前帧中的目标像素在先前帧中的匹配像素,所述先前帧具有标注信息;以及
    确定单元,被配置为基于所述先前帧的标注信息中与所述匹配像素有关的部分,确定所述目标像素的标注信息。
  11. 根据权利要求10所述的装置,其特征在于,所述视频为彩色视频,所述位置映射模型是基于所述彩色视频以及从所述彩色视频构建的灰度视频而训练得到的。
  12. 根据权利要求11所述的装置,其特征在于,所述装置还包括:
    构建单元,被配置为基于用户输入的所述彩色视频,构建灰度视频;以及
    更新单元,被配置为基于所述彩色视频和所述灰度视频对预置位置映射模型进行更新,以得到所述位置映射模型。
  13. 根据权利要求10至12中任一项所述的装置,其特征在于,所述确定单元被配置为:
    从所述先前帧的所述标注信息中确定与所述目标像素在所述先前帧中的另一匹配像素有关的部分;以及
    基于所述先前帧的标注信息中与所述匹配像素有关的部分以及与所述另一匹配像素有关的部分,确定所述目标像素的标注信息。
  14. 根据权利要求13所述的装置,其特征在于,所述确定单元被配置为:
    确定所述匹配像素与所述目标像素之间的第一相似度,以及所述另一匹配像素与所述目标像素之间的第二相似度;以及
    通过对所述第一相似度与所述第二相似度进行加权求和,确定所述目标像素的标注信息。
  15. 根据权利要求10至14中任一项所述的装置,其特征在于,所述先前帧包括所述视频的起始帧,所述起始帧的标注信息是由用户标注的。
  16. 根据权利要求10至15中任一项所述的装置,其特征在于,所述先前帧包括所述视频中的位于所述当前帧之前的与所述当前帧相邻或非相邻的帧。
  17. 根据权利要求10至16中任一项所述的装置,其特征在于,所述先前帧的所述标注信息包括通过以下至少一种标注方式所得到的信息:
    点、直线、曲线、矩形框、不规则多边形、和分割掩膜。
  18. 根据权利要求10至17中任一项所述的装置,其特征在于,所述确定单元还被配置为:
    至少基于所述目标像素的标注信息,确定所述当前帧的标注信息。
  19. 一种计算设备,包括处理器和存储器,所述存储器存储有计算机程序,当所述处理器读取并执行所述计算机程序时,使得所述计算设备执行根据权利要求1至9中任一项所述的方法。
  20. 一种计算机可读存储介质,所述计算机可读存储介质上存储有计算机程序,所述计算机程序被处理器执行时实现根据权利要求1至9中任一项所述的方法。
PCT/CN2022/081027 2021-06-16 2022-03-15 视频标注方法、装置、计算设备和计算机可读存储介质 WO2022262337A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110666108.4 2021-06-16
CN202110666108.4A CN115482426A (zh) 2021-06-16 2021-06-16 视频标注方法、装置、计算设备和计算机可读存储介质

Publications (1)

Publication Number Publication Date
WO2022262337A1 (zh)

Family

ID=84419151

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/081027 WO2022262337A1 (zh) 2021-06-16 2022-03-15 视频标注方法、装置、计算设备和计算机可读存储介质

Country Status (2)

Country Link
CN (1) CN115482426A (zh)
WO (1) WO2022262337A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115880338B (zh) * 2023-03-02 2023-06-02 浙江大华技术股份有限公司 标注方法、标注装置及计算机可读存储介质

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102663015A (zh) * 2012-03-21 2012-09-12 上海大学 基于特征袋模型和监督学习的视频语义标注方法
CN110148105A (zh) * 2015-05-22 2019-08-20 中国科学院西安光学精密机械研究所 基于迁移学习和视频帧关联学习的视频分析方法
CN109753975A (zh) * 2019-02-02 2019-05-14 杭州睿琪软件有限公司 一种训练样本获得方法、装置、电子设备和存储介质
CN110705405A (zh) * 2019-09-20 2020-01-17 阿里巴巴集团控股有限公司 目标标注的方法及装置

Also Published As

Publication number Publication date
CN115482426A (zh) 2022-12-16


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 22823828; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 22823828; Country of ref document: EP; Kind code of ref document: A1)