CN112070071B - Method and device for labeling objects in video, computer equipment and storage medium - Google Patents


Info

Publication number
CN112070071B
CN112070071B (application CN202011254673.1A)
Authority
CN
China
Prior art keywords
image
sub
similarity
area
image frame
Prior art date
Legal status
Active
Application number
CN202011254673.1A
Other languages
Chinese (zh)
Other versions
CN112070071A (en)
Inventor
马聪
Current Assignee
Tencent Technology Shenzhen Co Ltd
Tencent Cloud Computing Beijing Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202011254673.1A
Publication of CN112070071A
Application granted
Publication of CN112070071B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00: Scenes; Scene-specific elements
    • G06V 20/40: Scenes; Scene-specific elements in video content
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00: Scenes; Scene-specific elements
    • G06V 20/50: Context or environment of the image
    • G06V 20/56: Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V 20/58: Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • G06V 20/584: Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads of vehicle lights or traffic lights
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/103: Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00: Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/08: Detecting or categorising vehicles

Abstract

The present application relates to a method and apparatus for labeling an object in a video, a computer device and a storage medium, and relates to the technical field of image processing. The method includes: acquiring sub-image regions in each image frame of a video; associating the sub-image regions in the image frames based on their similarity to obtain at least two track segments; merging the at least two track segments to obtain at least one area track; and, in each image frame, labeling each sub-image region in an area track as the region where the same specified type object is located. With this scheme, when a training data set is constructed for artificial-intelligence scenarios such as automatic driving, the specified type objects in the image frames do not need to be labeled manually, which can greatly improve the efficiency of labeling objects in the video.

Description

Method and device for labeling objects in video, computer equipment and storage medium
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a method and an apparatus for labeling an object in a video, a computer device, and a storage medium.
Background
In Artificial Intelligence (AI) scenarios involving image recognition, a large amount of image training data is typically required for machine learning. Labeling the required objects in the image frames of a video is an important way of acquiring such image training data.
In the related art, when a desired object (such as a person or a thing) is to be marked out in the image frames of a video, a developer is usually required to perform the labeling manually. For example, for a video, a developer manually frames the position of the required object in each image frame of the video and assigns the same number to the same object in each image frame, thereby completing the labeling of the object in the video.
However, the above manual labeling scheme requires the user to select objects frame by frame from the video, which takes a long time and results in low efficiency of labeling objects in the video.
Disclosure of Invention
The embodiment of the application provides a method and a device for marking an object in a video, a computer device and a storage medium, which can improve the efficiency of marking the same object in the video.
In one aspect, a method for labeling an object in a video is provided, and the method includes:
acquiring a sub-image area in each image frame of a video; the sub-image area is an area of an image of a specified type object in a corresponding image frame;
associating the sub-image areas in each image frame based on the similarity of the sub-image areas in each image frame to obtain at least two track segments; the track segment is composed of one sub-image area in each continuous image frame;
merging the at least two track segments to obtain at least one area track;
and in each image frame, marking each sub-image area in the area track as the area where the same specified type object is located.
In one aspect, an apparatus for labeling an object in a video is provided, the apparatus comprising:
the area acquisition module is used for acquiring sub-image areas in each image frame of the video; the sub-image area is an area of an image of a specified type object in a corresponding image frame;
the area association module is used for associating the sub-image areas in the image frames based on the similarity of the sub-image areas in the image frames to obtain at least two track segments; the track segment is composed of one sub-image area in each continuous image frame;
a merging module, configured to merge the at least two track segments to obtain at least one area track;
and the object marking module is used for marking each sub-image area in the area track as the area where the same specified type object is located in each image frame.
In one possible implementation manner, the area association module includes:
the image processing device comprises a region determining unit, a matching unit and a matching unit, wherein the region determining unit is used for determining a second region meeting a matching condition from each sub-image region in a second image frame based on the region similarity between a first region in a first image frame and each sub-image region in the second image frame; the first image frame and the second image frame are two adjacent image frames of the respective image frames; the first region is a region of a target track segment corresponding in the first image frame;
a region adding unit, configured to add the second region as a region of the target track segment in the second image frame.
In a possible implementation manner, the area determination unit is configured to,
acquiring a first type similarity between the first region and each sub-image region in the second image frame, wherein the first type similarity is obtained based on the position similarity, the size similarity and the apparent feature similarity between the two regions;
and determining the corresponding area, of the sub-image areas in the second image frame, of which the first type similarity meets the specified condition as the second area.
In a possible implementation, the region determining unit is further configured to,
in response to no corresponding region, of which the first type similarity meets the specified condition, in each sub-image region in the second image frame, acquiring a second type similarity between the first region and each sub-image region in the second image frame, wherein the second type similarity is obtained based on an apparent feature similarity between the two regions;
and determining the corresponding area, of which the second type similarity meets the specified condition, in each sub-image area in the second image frame as the second area.
In a possible implementation manner, the region adding unit is further configured to,
in response to there being no second region that satisfies the matching condition among the sub-image regions in the second image frame and the number of consecutive virtual regions in the target track segment being below a number threshold, predicting the virtual region of the target track segment in the second image frame based on each region in the target track segment;
and adding the virtual area obtained by prediction as the area of the target track segment in the second image frame.
In one possible implementation manner, the merging module includes:
a similarity obtaining unit, configured to obtain a similarity between the at least two track segments;
a merging unit, configured to merge the at least two trajectory segments based on a similarity between the at least two trajectory segments, so as to obtain the at least one position trajectory.
In a possible implementation manner, the similarity obtaining unit includes:
a position similarity obtaining subunit, configured to obtain a position similarity between a first track segment and a second track segment in the at least two track segments; the first image frame where the last sub-image area in the first track segment is located before the second image frame where the first sub-image area in the second track segment is located in the time domain;
a size similarity obtaining subunit, configured to obtain a size similarity between the first track segment and the second track segment based on a size of a last sub-image area in the first track segment and a size of a first sub-image area in the second track segment;
an apparent similarity obtaining subunit, configured to obtain an apparent feature similarity between the first track segment and the second track segment based on an apparent feature of at least one sub-image region in the first track segment and an apparent feature of at least one sub-image region in the second track segment;
a segment similarity obtaining subunit, configured to obtain a similarity between the first track segment and the second track segment based on the position similarity, the size similarity, and the apparent feature similarity between the first track segment and the second track segment.
In a possible implementation manner, the location similarity obtaining subunit is configured to,
performing track prediction in image frames subsequent to the first image frame based on the position of each sub-image region in the first track segment to obtain a prediction region extending from the first track segment to the second image frame;
and acquiring the position similarity between the prediction area of the first track segment extending to the second image frame and the first sub-image area of the second track segment as the position similarity between the first track segment and the second track segment.
In one possible implementation, the size similarity obtaining subunit is configured to,
obtaining an initial size similarity based on the size of the last sub-image region in the first track segment and the size of the first sub-image region in the second track segment;
correcting the initial size similarity through an objective exponential function to obtain the size similarity of the first track segment and the second track segment; the value of the objective exponential function is inversely related to the time interval between the first image frame and the second image frame.
In a possible implementation manner, the segment similarity obtaining subunit is configured to,
fusing the apparent features of at least two sub-image regions which are discrete in time domain in the first track segment to obtain the apparent feature of the first track segment;
fusing the apparent features of at least two sub-image regions which are discrete in the time domain in the second track segment to obtain the apparent feature of the second track segment;
and acquiring the similarity between the apparent features of the first track segment and the apparent features of the second track segment as the similarity of the apparent features of the first track segment and the second track segment.
In one possible implementation manner, the merging unit includes:
an undirected graph constructing subunit, configured to construct a weighted undirected graph by using the at least two track segments as nodes and using the similarity between the at least two track segments as a weight of an edge;
the graph partitioning subunit is used for performing graph partitioning on the weighted undirected graph to obtain at least one optimal subgraph;
and the merging subunit is used for merging the track segments belonging to the same optimal subgraph to obtain the at least one position track.
In a possible implementation manner, the graph partitioning subunit is configured to perform graph partitioning on the weighted undirected graph by using a depth-first search algorithm in combination with a greedy matching algorithm, so as to obtain the at least one optimal subgraph.
In one aspect, a computer device is provided, which includes a processor and a memory, where at least one instruction, at least one program, a set of codes, or a set of instructions is stored in the memory, and the at least one instruction, the at least one program, the set of codes, or the set of instructions is loaded and executed by the processor to implement the object labeling method in the video.
In one aspect, a computer-readable storage medium is provided, in which at least one instruction, at least one program, a set of codes, or a set of instructions is stored, and the at least one instruction, the at least one program, the set of codes, or the set of instructions is loaded and executed by a processor to implement the object labeling method in the video.
In one aspect, a computer program product or computer program is provided that includes computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions to enable the computer device to execute the object labeling method in the video.
The technical scheme provided by the application can comprise the following beneficial effects:
the method comprises the steps of firstly determining sub-image areas corresponding to specified type objects in each image frame in a video, then combining the track sections based on the similarity between the sub-image areas in adjacent image frames or track sections formed by the areas of the same object in continuous image frames to obtain a plurality of area tracks, and finally marking the objects corresponding to the same area track as the same object.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
Fig. 1 is a system configuration diagram of an object labeling system according to each embodiment of the present application.
Fig. 2 is a flowchart illustrating an object labeling method in a video according to an exemplary embodiment.
FIG. 3 is a flow diagram illustrating object annotation in a video according to an exemplary embodiment.
Fig. 4 is a schematic diagram of a vehicle annotation in a video according to the embodiment shown in fig. 3.
Fig. 5 is a flowchart illustrating an object labeling method in a video according to an exemplary embodiment.
Fig. 6 is a schematic diagram of the execution flow of the tracking algorithm according to the embodiment shown in fig. 5.
FIG. 7 is a flow diagram illustrating vehicle annotation in a video according to an exemplary embodiment.
Fig. 8 is a block diagram illustrating a structure of an object labeling apparatus in a video according to an exemplary embodiment.
FIG. 9 is a block diagram illustrating a configuration of a computer device, according to an example embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
Before describing the various embodiments shown herein, several concepts related to the present application will be described.
1) Artificial Intelligence (AI).
AI is a theory, method, technology and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines can perceive, reason and make decisions.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. The basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
2) Machine Learning (ML).
Machine learning is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other disciplines. It specifically studies how a computer can simulate or implement human learning behavior so as to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence, is the fundamental way to make computers intelligent, and is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and teaching learning.
3) Automatic driving technology, which generally includes technologies such as high-precision maps, environment perception, behavior decision-making, path planning and motion control, and has broad application prospects.
With the research and progress of artificial intelligence technology, the artificial intelligence technology is developed and applied in a plurality of fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical care, smart customer service, and the like.
4) Cloud technology refers to a hosting technology that unifies a series of resources, such as hardware, software and networks, in a wide area network or a local area network to realize the calculation, storage, processing and sharing of data.
Cloud technology is a general term for the network technology, information technology, integration technology, management platform technology, application technology and the like applied under the cloud computing business model; it can form a resource pool that is used on demand, flexibly and conveniently. Cloud computing technology will become an important support. Background services of technical network systems, such as video websites, image websites and other web portals, require a large amount of computing and storage resources. With the rapid development and application of the internet industry, every article may have its own identification mark, which needs to be transmitted to a background system for logical processing; data at different levels are processed separately, and all kinds of industry data require strong system background support, which can only be realized through cloud computing.
The scheme provided by the embodiment of the application is applied to scenes such as machine learning technology, automatic driving technology, cloud technology and the like related to artificial intelligence, so that the same specified type object can be rapidly marked out from each image frame of the video.
Referring to fig. 1, a system configuration diagram of an object labeling system according to various embodiments of the present application is shown. As shown in fig. 1, the system includes a server 120, a database 140, and a number of terminals 160.
The server 120 is a server, or a plurality of servers, or a virtualization platform, or a cloud computing service center.
Server 120 may be a server that provides a service to mark the same object from a video. The server 120 may be composed of one or more functional units.
The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like.
The terminal 160 may be, but is not limited to, a vehicle-mounted terminal, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the application is not limited herein.
The database 140 may be a Redis database, or may be another type of database. The database 140 is used to store various types of data, such as videos to be annotated, annotation results for each image frame in the videos, and the like.
The terminal 160 is connected to the server 120 via a communication network. Optionally, the communication network is a wired network or a wireless network.
Optionally, the system may further include a management device (not shown in fig. 1), which is connected to the server 120 through a communication network. Optionally, the communication network is a wired network or a wireless network.
Optionally, the wireless network or wired network described above uses standard communication techniques and/or protocols. The Network is typically the Internet, but may be any Network including, but not limited to, a Local Area Network (LAN), a Metropolitan Area Network (MAN), a Wide Area Network (WAN), a mobile, wireline or wireless Network, a private Network, or any combination of virtual private networks. In some embodiments, data exchanged over a network is represented using techniques and/or formats including Hypertext Mark-up Language (HTML), Extensible Markup Language (XML), and the like. All or some of the links may also be encrypted using conventional encryption techniques such as Secure Socket Layer (SSL), Transport Layer Security (TLS), Virtual Private Network (VPN), Internet Protocol Security (IPsec). In other embodiments, custom and/or dedicated data communication techniques may also be used in place of, or in addition to, the data communication techniques described above.
Reference is now made to FIG. 2, which is a flowchart illustrating an object annotation method in a video according to an exemplary embodiment, where the object annotation method in a video can be used in a computer device. The computer device may be a terminal or a server, for example, the computer device may be the terminal 160 or the server 120 in the system shown in fig. 1. As shown in fig. 2, the object labeling in the video may include the following steps.
Step 21, obtaining a sub-image area in each image frame of the video; the sub-image area is an area of the image of the specified type object in the corresponding image frame.
The sub-image region is the image region corresponding to the position of a specified type object in the current image frame; for example, it may be the image region enclosed by the smallest bounding box of the specified type object in the current image frame.
The object of the specified type is an object which needs to be marked out from the image frame.
In one possible implementation, the object type of the specified type object may be set by a developer according to task requirements of the annotation task. For example, when the labeling task is a task of labeling a person from each image frame, the object type of the specified type object is the person type.
Step 22, associating the sub-image areas in each image frame based on the similarity of the sub-image areas in each image frame to obtain at least two track segments; the track segment is formed by one sub-image region in each of the successive image frames.
In one possible implementation, the computer device associates the sub-image regions in two adjacent image frames based on their similarity; that is, sub-image regions in the two adjacent frames whose similarity satisfies a condition are regarded as the regions of the same object in different image frames and are associated into one track segment. Processing the image frames frame by frame in this manner yields a plurality of track segments.
And step 23, merging the at least two track segments to obtain at least one area track.
An object in the video may be occluded in some image frames or may temporarily leave the field of view, so a single object may produce multiple track segments across the image frames of the video, and these track segments are discontinuous in the time domain. The solution shown in the embodiment of the present application therefore further merges the track segments that belong to the same object among the at least two obtained track segments, so as to obtain the complete area track of the object in the video.
In one possible implementation manner, the computer device merges at least two track segments based on the similarity between the at least two track segments to obtain at least one region track.
Step 24, marking each sub-image area in the area track as the area where the same specified type object is located in each image frame.
In a possible implementation manner, for each sub-image region in the same region track, the computer device may label the same object Identifier (ID) for each sub-image region, so as to label the same specified type of object in different image frames in the video.
To sum up, in the scheme shown in the embodiment of the present application, the sub-image regions corresponding to objects of a specified type are first determined in each image frame of the video; then, based on the similarity between sub-image regions in adjacent image frames, track segments are formed from the regions of the same object in consecutive image frames, and the track segments are merged to obtain a plurality of region tracks; finally, the objects corresponding to the same region track are labeled as the same object. In this way, the specified type objects in the image frames do not need to be labeled manually when a training data set is constructed for artificial-intelligence scenarios such as automatic driving, which can greatly improve the efficiency of labeling objects in the video.
The scheme shown in the embodiment of the present application can be applied to offline tracking of videos and to labeling objects across consecutive frames of a video. Reference is now made to FIG. 3, which is a flowchart illustrating object annotation in a video according to an exemplary embodiment. As shown in fig. 3, based on the method shown in fig. 2, the process of labeling objects in a video may be: a sequence of consecutive frame images from the video is input, and the object detection step 31, the offline tracking step 32 and the consecutive-frame object labeling step are performed in sequence to obtain a labeling result.
Object detection: objects of the specified type are identified in each image frame of the consecutive image frame sequence. For example, each input image frame may be processed by a pre-trained machine learning model (such as a deep learning model or a convolutional neural network model) to determine the position region where each specified type object is located. This corresponds to step 21 in the embodiment shown in fig. 2.
Offline tracking: after the entire video has been recorded, global information is used to track targets continuously throughout the video. This corresponds to step 22 in the embodiment shown in fig. 2.
Consecutive-frame object labeling: the same object appearing in multiple consecutive frames is labeled in the video, and the track and unique identity of each object are determined. This corresponds to steps 23 and 24 in the embodiment shown in fig. 2.
The scheme shown in each embodiment of the present application can be used to label various specified types of objects in an offline video, so that an image-based training set can subsequently be constructed from the labeling results and used to train machine learning models related to image processing.
For example, the solutions shown in the embodiments of the present application may be applied to automatically mark tracks of vehicles from a video in an automatic driving related training data set construction scenario, and mark the same ID for the same vehicle in different image frames.
For example, when training data for an automatic driving scenario is constructed, a camera is mounted on a vehicle, for example on its roof, and records video while the vehicle is driving. The recorded offline video is uploaded or copied to a computer device. The computer device identifies the vehicles in each image frame of the offline video to obtain the sub-image regions where the vehicles are located, matches the sub-image regions frame by frame to obtain a plurality of track segments, and then merges the track segments that correspond to the same vehicle to obtain the complete track of each vehicle in the video. Finally, according to the result of merging the track segments, vehicles are labeled in the video, with the same vehicle labeled with the same ID in different image frames. After subsequent manual review, a training data set for automatic driving can be constructed from the labeling results.
According to the scheme shown in the embodiment of the present application, when the same vehicle appears in two or more discontinuous video segments, the vehicle can automatically be labeled with the same ID across those discontinuous segments. For example, please refer to fig. 4, which shows a schematic diagram of vehicle labeling in a video according to an embodiment of the present application. As shown in fig. 4, in the image frame 41 captured at time t, the vehicle 41a is partially occluded by the vehicle 41b; in the image frame 42 captured at time t + n, the vehicle 41a is completely occluded by the vehicle 41b with ID 1 (at this time, the vehicle 41a is not visible in the image frame 42); and in the image frame 43 captured at time t + 2n, the vehicle 41a has passed the vehicle 41b and is fully visible again. That is, the time intervals in which the vehicle 41a appears in the video are discontinuous. With the solution of the above embodiment, the computer device can label the vehicle 41a with the same ID in the image frame 41, the image frame 42 and the image frame 43 (fig. 4 shows the ID of the vehicle 41a as 5 in the different image frames).
Besides vehicle identification in an automatic driving scene, the scheme shown in the embodiment of the present application can also be used for other automatic object labeling scenes, such as labeling the same pedestrian, the same building, and the like from each image frame of a video.
Reference is now made to FIG. 5, which is a flowchart illustrating an object annotation method in a video according to an exemplary embodiment, where the object annotation method in a video can be used in a computer device. The computer device may be a terminal or a server, for example, the computer device may be the terminal 160 or the server 120 in the system shown in fig. 1. As shown in fig. 5, the object labeling in the video may include the following steps.
Step 501, obtaining a sub-image area in each image frame of a video; the sub-image area is an area of the image of the specified type object in the corresponding image frame.
The image frames of the video are also referred to as a consecutive frame image sequence, i.e., a series of images formed by the consecutive frames of the video to be labeled. For example, in an image data labeling scenario related to automatic driving, the consecutive frame image sequence may be recorded by a 60-degree camera in front of an autonomous vehicle.
In the embodiment of the present application, target detection is first performed on each image frame, that is, the sub-image region where each specified type object is located is identified in each image frame; for example, a bounding box may be generated for the region of each specified type object in the image frame. In an automatic driving scenario, the specified type object may be a vehicle or a pedestrian.
Step 502, associating the sub-image areas in each image frame based on the similarity of the sub-image areas in each image frame to obtain at least two track segments; the track segment is formed by one sub-image region in each of the successive image frames.
In one possible implementation, when the sub-image regions in each image frame are associated based on the similarity of the sub-image regions in each image frame to obtain at least two track segments, the computer device may perform the following steps.
S502a, determining a second area satisfying a matching condition from among the respective sub-image areas in the second image frame based on the area similarity between the first area in the first image frame and the respective sub-image areas in the second image frame; the first image frame and the second image frame are two adjacent image frames of the respective image frames; the first region is a region of the target track segment corresponding to the first image frame.
In one possible implementation, the determining, based on the region similarity between the first region in the first image frame and each sub-image region in the second image frame, a second region that satisfies the matching condition from each sub-image region in the second image frame includes:
acquiring a first type similarity between the first area and each sub-image area in the second image frame, wherein the first type similarity is obtained based on the position similarity, the size similarity and the apparent feature similarity between the two areas;
and determining the corresponding area, of the sub-image areas in the second image frame, of which the first type similarity meets the specified condition as the second area.
In the embodiment of the present application, the algorithm for obtaining the track segments may be referred to as a tracking algorithm, and its input may be the rectangular bounding boxes of all specified type objects in each image frame, as output by the target detection algorithm. For each image frame, a 2-dimensional object detection algorithm is first executed, yielding the bounding box set of frame t-1 and the bounding box set of frame t; these two sets are the input data of the tracking algorithm.
In order to label the same object with the same ID in consecutive image frames, the same object needs to be matched between two adjacent image frames. Therefore, the scheme shown in this application calculates the pairwise similarity of the sub-image regions between two adjacent image frames based on the position, size and apparent similarity information of the sub-image regions, and associates the matched sub-image regions (for example, using a greedy algorithm) to form track segments of sub-image regions in consecutive image frames.
The position similarity, size similarity and apparent feature similarity of two sub-image regions can be calculated as follows.

The computer device may calculate the position similarity between two sub-image regions as the ratio of the area of their intersection to the area of their union (i.e., the intersection-over-union of the two bounding boxes).

The size similarity between two sub-image regions may be calculated from the widths w and heights h of the bounding boxes of the two sub-image regions.

For the apparent feature similarity, the apparent feature values of the images inside the bounding boxes of the two sub-image regions are first computed with a feature extraction algorithm, such as the fast Histogram of Oriented Gradient (fastHOG) algorithm; the similarity is then computed from the two feature vectors, the average feature value of each sub-image region, and the feature dimension N.

Finally, the overall similarity between two sub-image regions is obtained by adding the three terms (for example, by direct addition or weighted addition). Alternatively, the overall similarity between two sub-image regions may be obtained as the average or weighted average of the three terms.
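For illustration, the three similarity terms and their combination described above can be sketched as follows. The exact formulas are not reproduced in this text, so the size and appearance measures below use common forms (intersection-over-union for position, a width/height comparison for size, and zero-mean normalized correlation of fastHOG-style feature vectors for appearance); the function names, the specific size and appearance formulas, and the equal weighting are illustrative assumptions rather than the patent's definitions.

```python
import numpy as np

def position_similarity(box_a, box_b):
    # Boxes are (x1, y1, x2, y2). Position similarity is the ratio of the
    # intersection area to the union area (intersection-over-union).
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def size_similarity(box_a, box_b):
    # Assumed form: compare bounding-box widths and heights; 1.0 means
    # identical size, and the value decays as the sizes diverge.
    wa, ha = box_a[2] - box_a[0], box_a[3] - box_a[1]
    wb, hb = box_b[2] - box_b[0], box_b[3] - box_b[1]
    return float(np.exp(-(abs(wa - wb) / (wa + wb) + abs(ha - hb) / (ha + hb))))

def appearance_similarity(feat_a, feat_b):
    # feat_a / feat_b: apparent feature vectors of the two regions (e.g. from
    # a fastHOG-style descriptor). Assumed form: zero-mean normalized
    # correlation over the N feature dimensions.
    a = feat_a - feat_a.mean()
    b = feat_b - feat_b.mean()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def overall_similarity(box_a, feat_a, box_b, feat_b, weights=(1.0, 1.0, 1.0)):
    # Weighted addition of the three terms, as described above.
    terms = (position_similarity(box_a, box_b),
             size_similarity(box_a, box_b),
             appearance_similarity(feat_a, feat_b))
    return sum(w * t for w, t in zip(weights, terms))
```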
In one possible implementation manner, in response to that there is no corresponding area, in which the first type similarity satisfies the specified condition, in each sub-image area in the second image frame, a second type similarity between the first area and each sub-image area in the second image frame is obtained, where the second type similarity is obtained based on an apparent feature similarity between the two areas; and determining the corresponding area, of the sub-image areas in the second image frame, of which the second type similarity meets the specified condition as the second area.
In a 2-dimensional field of view, both the distance and the direction of motion of an object affect its apparent speed. For example, in an automatic driving scenario, the relative speed between a nearby vehicle (i.e., a specified type object) and an oncoming vehicle in another lane is high. When the frame rate of the video is not high, the position of a fast-moving vehicle may change drastically between frames. In both cases, the vehicle in the current frame may become separated from the corresponding vehicle in the previous frame, i.e., their overlap is very small or zero, which increases the tracking difficulty. Therefore, the embodiment of the present application adopts a two-stage association strategy: association is first performed based on position, size and apparent similarity, and if no matching second region is found, association is performed based on apparent similarity alone, which saves computing resources while improving the association effect.
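A minimal sketch of this two-stage association between two adjacent frames is shown below. It reuses the similarity functions from the previous sketch; the region objects with `box` and `feat` attributes, the greedy matching order and the thresholds are illustrative assumptions.

```python
def greedy_match(regions_prev, regions_cur, score_fn, threshold):
    # Score all pairs, then greedily accept the highest-scoring pair whose
    # regions are both still unmatched and whose score exceeds the threshold.
    pairs = [(score_fn(p, c), i, j)
             for i, p in enumerate(regions_prev)
             for j, c in enumerate(regions_cur)]
    pairs.sort(reverse=True, key=lambda x: x[0])
    matches, used_prev, used_cur = [], set(), set()
    for score, i, j in pairs:
        if score < threshold:
            break
        if i not in used_prev and j not in used_cur:
            matches.append((i, j))
            used_prev.add(i); used_cur.add(j)
    return matches

def two_stage_associate(regions_prev, regions_cur):
    # Stage 1: combined position + size + appearance similarity.
    stage1 = greedy_match(
        regions_prev, regions_cur,
        lambda p, c: overall_similarity(p.box, p.feat, c.box, c.feat),
        threshold=1.5)                       # assumed threshold
    matched_prev = {i for i, _ in stage1}
    matched_cur = {j for _, j in stage1}
    # Stage 2: for regions still unmatched (e.g. fast motion, little overlap),
    # fall back to appearance similarity alone.
    rest_prev = [i for i in range(len(regions_prev)) if i not in matched_prev]
    rest_cur = [j for j in range(len(regions_cur)) if j not in matched_cur]
    stage2 = greedy_match(
        [regions_prev[i] for i in rest_prev],
        [regions_cur[j] for j in rest_cur],
        lambda p, c: appearance_similarity(p.feat, c.feat),
        threshold=0.5)                       # assumed threshold
    stage2 = [(rest_prev[i], rest_cur[j]) for i, j in stage2]
    return stage1 + stage2
```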
S502b, adding the second region as the region of the target track segment in the second image frame.
In the embodiment of the present application, for the first region of the target track segment in the first image frame, if the computer device associates it with a second region in the second image frame, the second region is added to the target track segment, and region matching continues in the image frame after the second image frame according to the above steps.
In one possible implementation, in response to that, of the sub-image regions in the second image frame, there is no second region that satisfies the matching condition, and the number of consecutive virtual regions in the target track segment is lower than a number threshold, predicting the virtual region of the target track segment in the second image frame based on each region in the target track segment; and adding the predicted virtual area as the area of the target track segment in the second image frame.
Since the object detection algorithm still has certain limitations, some objects may be missed in some frames. In addition, the number of targets in a scene is not limited, and situations such as a new target entering, a target disappearing, or a target being temporarily occluded may occur. Therefore, in the embodiment of the present application, a single-target tracker may additionally be used to supplement the sub-image regions produced by the detection algorithm, so as to bridge missed detections.
In the embodiment of the present application, please refer to fig. 6, which shows a schematic flowchart of the execution flow of the tracking algorithm according to the embodiment of the present application. As shown in fig. 6, the computer device may assign a single-target tracker to each track. After a track is initialized, it is associated frame by frame with the specified type objects in the next image frame; if the association succeeds, one step of track extension is completed (S61). If the association fails, the single-target tracker is used to track the existing track, forming a track hypothesis (S62). If the hypothesized virtual region can be re-associated with a detected sub-image region within a certain period, for example within the next 5 image frames (S63), the normal track extension process continues. If the track relies on the single-target tracking algorithm for a long time, its confidence is considered low and it is suspended (S64); that is, the track is no longer extended, but the last region in the current track is still used for association matching in subsequent image frames. If matching fails for many consecutive frames (for example, 50 consecutive frames), or the video boundary is reached, the track is ended (S65). Conversely, if a sub-image region in a subsequent image frame is matched again, the track can still be reactivated, and matching and extension resume.
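The life-cycle logic of fig. 6 can be summarized with a small state sketch. The 5-frame re-association window and the 50-frame termination limit follow the examples given above, while the class layout and state names are illustrative assumptions.

```python
from enum import Enum

class TrackState(Enum):
    ACTIVE = 1      # extended by matched detections (S61)
    HYPOTHESIS = 2  # extended by the single-target tracker (S62)
    SUSPENDED = 3   # waiting, no further extension (S64)
    ENDED = 4       # terminated (S65)

class Track:
    MAX_HYPOTHESIS_FRAMES = 5   # window for re-associating a hypothesis (S63)
    MAX_LOST_FRAMES = 50        # consecutive unmatched frames before ending

    def __init__(self, first_region):
        self.regions = [first_region]
        self.state = TrackState.ACTIVE
        self.frames_without_detection = 0

    def step(self, matched_region, tracker_prediction):
        if matched_region is not None:
            # Successful association: normal one-step track extension,
            # also reactivating a suspended track.
            self.regions.append(matched_region)
            self.state = TrackState.ACTIVE
            self.frames_without_detection = 0
            return
        self.frames_without_detection += 1
        if self.frames_without_detection <= self.MAX_HYPOTHESIS_FRAMES:
            # No detection matched: extend with a virtual region predicted by
            # the single-target tracker, forming a track hypothesis.
            self.regions.append(tracker_prediction)
            self.state = TrackState.HYPOTHESIS
        elif self.frames_without_detection <= self.MAX_LOST_FRAMES:
            # Low confidence: stop extending, but keep matching against the
            # last region of the current track.
            self.state = TrackState.SUSPENDED
        else:
            self.state = TrackState.ENDED
```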
The above flow describes how the tracking algorithm manages the life cycle of the track formed by each object, and it provides a certain robustness against target loss caused by occlusion, missed detection and the like.
However, the above flow uses only the historical information of a target. When an object is occluded for a long time, its track is often unrecoverable, and the flow has no ability to correct association confusion that may occur between different objects. It is therefore necessary to further perform offline track segment association, i.e., to merge the at least two track segments to obtain at least one region track.
Step 503, obtaining the similarity between the at least two track segments.
In a possible implementation manner, the obtaining the similarity between the at least two track segments includes:
acquiring the position similarity between a first track segment and a second track segment in the at least two track segments; the first image frame where the last sub-image area in the first track segment is located before the second image frame where the first sub-image area in the second track segment is located in time domain;
acquiring the size similarity of the first track segment and the second track segment based on the size of the last sub-image area in the first track segment and the size of the first sub-image area in the second track segment;
acquiring the apparent feature similarity of the first track segment and the second track segment based on the apparent feature of at least one sub-image region in the first track segment and the apparent feature of at least one sub-image region in the second track segment;
and acquiring the similarity between the first track segment and the second track segment based on the position similarity, the size similarity and the apparent feature similarity between the first track segment and the second track segment.
Taking an automatic driving scene as an example, in an actual driving scene, vehicles in front may severely occlude one another; in addition, due to vehicle motion, the camera view may shift substantially as a whole, making it difficult to associate targets between consecutive frames. Therefore, after the preliminary tracking process, the track formed by the same object may still be broken into multiple segments. That is, assuming that N target track segments are generated within a time interval T, several of them may belong to the same target, and there are usually multiple targets within the interval. To solve this problem, the track segments need to be re-associated within a certain time interval.
For the N track segments in a certain time interval T, the computer equipment can calculate the similarity between every two track segments according to the positions, the target sizes and the apparent similarities of the track segments. However, different track segments of the same object are in different times and spaces, and therefore, the computer device may take the following measures to enhance the stability of the similarity comparison.
S503a, when the position similarity between the first track segment and the second track segment of the at least two track segments is obtained,
performing track prediction in image frames subsequent to the first image frame based on the positions of the sub-image areas in the first track segment to obtain a prediction area extending from the first track segment to the second image frame;
and acquiring the position similarity between the first track segment extending to the prediction area in the second image frame and the first sub-image area in the second track segment as the position similarity between the first track segment and the second track segment.
In terms of position association, the computer device may fit a motion model to a track segment to predict the corresponding positions across the time interval, so as to improve the association success rate. That is, the consecutive positions of a track segment are fitted with a Kalman filter and predicted forward over a period of time (i.e., over the following several image frames) until the starting time of the target track segment is reached, and the position similarity between the predicted region and the first region of the target track segment is then calculated.
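As a rough illustration of this position term, the sketch below fits a constant-velocity motion model to a track segment's box centers and extrapolates it to the start frame of the later segment; the constant-velocity fit is a lightweight stand-in for the Kalman filter mentioned above, and the data layout is an assumption.

```python
import numpy as np

def predict_region(track_segment, target_frame):
    # track_segment: list of (frame_index, box); box is (x1, y1, x2, y2).
    # Assumes the segment contains at least two regions. Fit a constant-velocity
    # model to the box centers and extrapolate to target_frame.
    frames = np.array([f for f, _ in track_segment], dtype=float)
    boxes = np.array([b for _, b in track_segment], dtype=float)
    cx_all = (boxes[:, 0] + boxes[:, 2]) / 2
    cy_all = (boxes[:, 1] + boxes[:, 3]) / 2
    cx = np.polyval(np.polyfit(frames, cx_all, 1), target_frame)
    cy = np.polyval(np.polyfit(frames, cy_all, 1), target_frame)
    w = boxes[-1, 2] - boxes[-1, 0]   # keep the last observed size
    h = boxes[-1, 3] - boxes[-1, 1]
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

def segment_position_similarity(first_segment, second_segment):
    start_frame, first_box = second_segment[0]
    predicted = predict_region(first_segment, start_frame)
    # Intersection-over-union from the earlier sketch.
    return position_similarity(predicted, first_box)
```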
S503b, when the similarity of the sizes of the first track segment and the second track segment is obtained based on the size of the last sub-image area in the first track segment and the size of the first sub-image area in the second track segment,
obtaining an initial size similarity based on the size of the last sub-image region in the first track segment and the size of the first sub-image region in the second track segment;
correcting the initial size similarity through an objective index function to obtain the size similarity of the first track segment and the second track segment; the value of the objective exponential function is inversely related to the time interval between the first image frame and the second image frame.
In the embodiment of the present application, in terms of size association, a time-interval parameter is introduced, and an exponential function that is negatively correlated with the time interval is applied to the original similarity value, so that the association between two track segments that are close in time is stricter, and false associations caused by abrupt size changes are suppressed.
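The exact form of this correction is not given in the text; one plausible instantiation consistent with the description (the use of exponentiation and the rate parameter λ are assumptions) is

$$\tilde{S}_{\mathrm{size}} = \left(S_{\mathrm{size}}\right)^{\,e^{-\lambda \Delta t}}, \qquad \lambda > 0,$$

where Δt is the time interval between the first image frame and the second image frame. The exponent e^(-λΔt) is inversely related to Δt: for segments that are close in time it stays near 1, so a low size similarity remains low and the association stays strict, suppressing false associations caused by abrupt size changes; for larger gaps the corrected similarity moves toward 1, tolerating legitimate size changes over longer intervals.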
S503c, when the similarity of the apparent features of the first track segment and the second track segment is obtained based on the apparent features of at least one sub-image region in the first track segment and the apparent features of at least one sub-image region in the second track segment,
fusing the apparent features of at least two sub-image regions which are discrete in time domain in the first track segment to obtain the apparent feature of the first track segment;
fusing the apparent features of at least two sub-image regions which are discrete in time domain in the second track segment to obtain the apparent feature of the second track segment;
and acquiring the similarity between the apparent features of the first track segment and the apparent features of the second track segment as the similarity of the apparent features of the first track segment and the second track segment.
In the embodiment of the present application, in terms of appearance association, the computer device samples the target appearance at multiple time points within a track segment and extracts features from these samples, so as to suppress changes in target appearance caused by different illumination and other conditions at different times.
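A compact sketch of this sampling-and-fusion step is given below; sampling about five evenly spaced regions and fusing by averaging are illustrative assumptions, and `extract_feature` stands for any apparent-feature extractor such as fastHOG.

```python
import numpy as np

def segment_appearance_feature(track_segment, extract_feature, num_samples=5):
    # Sample the target appearance at several temporally discrete points of the
    # track segment, extract a feature vector for each sample, and fuse them by
    # averaging to suppress illumination changes at individual frames.
    indices = np.linspace(0, len(track_segment) - 1, num_samples).astype(int)
    feats = [extract_feature(track_segment[i]) for i in sorted(set(indices))]
    return np.mean(feats, axis=0)

def segment_appearance_similarity(seg_a, seg_b, extract_feature):
    fa = segment_appearance_feature(seg_a, extract_feature)
    fb = segment_appearance_feature(seg_b, extract_feature)
    # Normalized correlation from the earlier sketch.
    return appearance_similarity(fa, fb)
```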
In the embodiment of the application, when calculating the similarity between two track segments, the computer device may select two track segments with an interval time not exceeding a preset time interval threshold value to perform similarity calculation and merging judgment.
In this embodiment of the present application, the computer device limits the time interval between two track segments considered for merging, that is, it merges track segments with a relatively small time interval, so as to ensure the accuracy of track segment merging.
Step 504, merging the at least two track segments based on the similarity between the at least two track segments to obtain the at least one position track.
In a possible implementation manner, the merging the at least two track segments based on the similarity between the at least two track segments to obtain the at least one position track includes:
taking the at least two track segments as nodes, and taking the similarity between the at least two track segments as the weight of an edge to construct a weighted undirected graph;
carrying out graph segmentation on the weighted undirected graph to obtain at least one optimal subgraph;
and merging the track segments belonging to the same optimal subgraph to obtain the at least one position track.
In a possible implementation manner, the graph partitioning the weighted undirected graph to obtain at least one optimal sub-graph includes:
and carrying out graph segmentation on the weighted undirected graph by combining a depth-first search algorithm and a greedy matching algorithm to obtain the at least one optimal subgraph.
In the embodiment of the present application, the association relationships between different segments can be regarded as a weighted undirected graph, which may contain cycles (for example, three segments belonging to the same track that are pairwise similar to each other). The similarity computed between track segments is the weight of the corresponding edge. Re-associating the target track segments amounts to partitioning this weighted graph into optimal subgraphs, so that the nodes of each subgraph correspond to the same target ID. The problem can be solved using a depth-first search algorithm combined with a greedy matching algorithm.
The scheme in step 504 of associating track segments with a depth-first search algorithm combined with a greedy matching algorithm may also be implemented with other types of algorithms; for example, after the similarities between track segments have been computed, the problem may also be solved with methods such as minimum-cost network flow.
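A sketch of the segment re-association described above follows: track segments become graph nodes, pairwise similarities above a threshold become weighted edges, a greedy pass accepts edges in order of decreasing weight, and a depth-first search then collects each connected component as one optimal subgraph, i.e., one region track. The threshold, the `start_frame`/`end_frame` attributes and the one-predecessor/one-successor greedy rule are illustrative assumptions.

```python
from collections import defaultdict

def merge_track_segments(segments, pairwise_similarity, threshold=0.5):
    # segments: list of track segments; pairwise_similarity(i, j) gives the
    # similarity of segment i (earlier in time) to segment j (later in time).
    n = len(segments)
    candidate_edges = []
    for i in range(n):
        for j in range(n):
            if i != j and segments[i].end_frame < segments[j].start_frame:
                s = pairwise_similarity(i, j)
                if s >= threshold:
                    candidate_edges.append((s, i, j))

    # Greedy matching: accept edges in order of decreasing weight, allowing
    # each segment at most one successor and one predecessor, since the
    # segments of one target form a chain in time.
    candidate_edges.sort(reverse=True)
    adj = defaultdict(list)
    has_succ, has_pred = set(), set()
    for s, i, j in candidate_edges:
        if i not in has_succ and j not in has_pred:
            adj[i].append(j); adj[j].append(i)
            has_succ.add(i); has_pred.add(j)

    # Depth-first search over the accepted edges: each connected component is
    # one optimal subgraph, i.e. one region track / one target ID.
    visited, tracks = set(), []
    for start in range(n):
        if start in visited:
            continue
        stack, component = [start], []
        while stack:
            node = stack.pop()
            if node in visited:
                continue
            visited.add(node)
            component.append(node)
            stack.extend(adj[node])
        tracks.append(sorted(component, key=lambda k: segments[k].start_frame))
    return tracks  # each entry lists the segment indices forming one region track
```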
Step 505, marking each sub-image area in the area track as the area where the same specified type object is located in each image frame.
Through the above scheme, the track segments can be re-associated, and the globally unique ID of each target in the video can be determined. After offline re-association of the segments, the consistency of a target's ID before and after complete occlusion can be maintained.
The labeling result obtained by this scheme can be input into a manual review terminal; after manual review and correction, a model training set built from the object labels in the images can be output.
That is to say, in each image frame, each sub-image region in a region track is labeled as the region where the same specified type object is located, so as to obtain labeling information of the target object in each image frame, where the target object is any specified type object in the image frames, and the labeling information indicates the region of the target object in each image frame and the object identifier of the target object. Based on the image frames and the labeling information of the target object (optionally after manual review), a training sample corresponding to the target object is constructed; the label in the training sample is set according to the image processing task of the machine learning model to be trained; the machine learning model is trained with the training samples; and the trained machine learning model then performs the image processing task on input images.
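As a rough illustration, labeled frames can be turned into training samples along the following lines; the record layout and field names are assumptions, and the label field depends on the downstream image processing task.

```python
def build_training_samples(frames, annotations):
    # frames: list of image arrays; annotations: per-frame list of
    # (object_id, box) pairs produced by the labeling scheme above
    # (optionally after manual review).
    samples = []
    for frame_idx, (image, frame_ann) in enumerate(zip(frames, annotations)):
        for object_id, box in frame_ann:
            samples.append({
                "frame_index": frame_idx,
                "image": image,
                "box": box,              # region of the target object
                "object_id": object_id,  # identity kept consistent across frames
                # "label" would be set here according to the image processing
                # task of the model to be trained (e.g. the object class).
            })
    return samples
```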
In one possible implementation, the specified type object is a vehicle, and the machine learning model is an image processing model used in an automatic driving scene.
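Purely as an illustrative sketch and not as part of the claimed method, the following Python code shows one way the labeling information described above could be turned into training samples. The dictionary keys (`object_id`, `bbox`, `category`), the `task` parameter and the function name are assumptions introduced for illustration.

```python
def build_training_samples(frames, frame_annotations, task="detection"):
    """Assemble training samples from per-frame labeling information.

    frames: list of image arrays, one per image frame
    frame_annotations: per-frame lists of dicts such as
        {"object_id": 3, "bbox": (x1, y1, x2, y2), "category": "vehicle"}
    task: the image processing task the machine learning model will be
          trained for; it determines how the label is set
    """
    samples = []
    for image, annotations in zip(frames, frame_annotations):
        for ann in annotations:
            if task == "detection":
                # Detection-style label: bounding box plus category.
                label = {"bbox": ann["bbox"], "category": ann["category"]}
            else:
                # e.g. re-identification: the global object ID is the label.
                label = {"object_id": ann["object_id"]}
            samples.append({"image": image, "label": label})
    return samples
```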
To sum up, in the scheme shown in the embodiment of the present application, for each image frame in a video, the sub-image regions corresponding to objects of a specified type in each image frame are determined; then, based on the similarity between sub-image regions in adjacent image frames, track segments each formed by the regions of the same object in consecutive image frames are obtained; the track segments are merged to obtain a plurality of region tracks; and finally, the objects corresponding to the same region track are labeled as the same object.
Reference is now made to FIG. 7, which is a flow chart illustrating vehicle annotation in a video according to an exemplary embodiment. As shown in fig. 7, the video capturing vehicle captures an offline video 71 through a camera mounted on the vehicle and inputs the offline video 71 into the computer device.
The computer device identifies vehicles in each image frame of the offline video 71 to obtain the vehicle position areas 72 in each image frame; it then associates the vehicle position areas frame by frame through a greedy algorithm based on position similarity, size similarity and apparent feature similarity to obtain a plurality of track segments 73; next, it associates track segments 73 that are similar in the time domain based on their similarity to obtain at least one area track 74; finally, vehicles belonging to the same area track 74 in the video are labeled with the same ID to obtain a labeling result 75.
The computer device inputs the labeling result 75 into the manual review terminal; after the labeling result 75 is manually reviewed, a manually reviewed labeling result 76 is output. A training data set for an automatic driving scenario may subsequently be constructed based on the manually reviewed labeling result 76.
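As an illustrative sketch of the frame-by-frame association stage only (the exact similarity definitions and weights in the embodiment may differ), the following Python code combines position, size and apparent-feature similarity and matches detections between two adjacent frames greedily. The dictionary fields, weights and threshold are assumptions introduced for illustration.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def combined_similarity(det_a, det_b, w_pos=0.4, w_size=0.2, w_app=0.4):
    """Weighted sum of position, size and appearance similarity."""
    pos_sim = iou(det_a["bbox"], det_b["bbox"])
    area_a = (det_a["bbox"][2] - det_a["bbox"][0]) * (det_a["bbox"][3] - det_a["bbox"][1])
    area_b = (det_b["bbox"][2] - det_b["bbox"][0]) * (det_b["bbox"][3] - det_b["bbox"][1])
    size_sim = min(area_a, area_b) / (max(area_a, area_b) + 1e-9)
    f_a, f_b = det_a["feature"], det_b["feature"]
    app_sim = float(np.dot(f_a, f_b) / (np.linalg.norm(f_a) * np.linalg.norm(f_b) + 1e-9))
    return w_pos * pos_sim + w_size * size_sim + w_app * app_sim

def greedy_associate(prev_dets, curr_dets, min_sim=0.3):
    """Greedily match detections of two adjacent frames, best pairs first."""
    pairs = sorted(((combined_similarity(a, b), i, j)
                    for i, a in enumerate(prev_dets)
                    for j, b in enumerate(curr_dets)), reverse=True)
    used_prev, used_curr, matches = set(), set(), []
    for sim, i, j in pairs:
        if sim < min_sim:
            break
        if i not in used_prev and j not in used_curr:
            used_prev.add(i)
            used_curr.add(j)
            matches.append((i, j))
    return matches
```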
Fig. 8 is a block diagram illustrating an exemplary embodiment of an apparatus for annotating objects in video. The device for labeling the object in the video can implement all or part of the steps in the method provided by the embodiment shown in fig. 2 or fig. 5. The device for labeling the object in the video can comprise:
an area obtaining module 801, configured to obtain sub-image areas in each image frame of a video; the sub-image area is an area of an image of a specified type object in a corresponding image frame;
a region association module 802, configured to associate sub-image regions in each image frame based on similarity of the sub-image regions in each image frame, so as to obtain at least two track segments; the track segment is composed of one sub-image area in each continuous image frame;
a merging module 803, configured to merge the at least two track segments to obtain at least one area track;
an object labeling module 804, configured to label, in each image frame, each sub-image region in the region track as a region where an object of the same specified type is located.
In one possible implementation manner, the area association module includes:
the image processing device comprises a region determining unit, a matching unit and a matching unit, wherein the region determining unit is used for determining a second region meeting a matching condition from each sub-image region in a second image frame based on the region similarity between a first region in a first image frame and each sub-image region in the second image frame; the first image frame and the second image frame are two adjacent image frames of the respective image frames; the first region is a region of a target track segment corresponding in the first image frame;
a region adding unit, configured to add the second region as a region of the target track segment in the second image frame.
In a possible implementation manner, the area determination unit is configured to,
acquiring a first type similarity between the first region and each sub-image region in the second image frame, wherein the first type similarity is obtained based on the position similarity, the size similarity and the apparent feature similarity between the two regions;
and determining the corresponding area, of the sub-image areas in the second image frame, of which the first type similarity meets the specified condition as the second area.
In a possible implementation, the region determining unit is further configured to,
in response to no corresponding region, of which the first type similarity meets the specified condition, in each sub-image region in the second image frame, acquiring a second type similarity between the first region and each sub-image region in the second image frame, wherein the second type similarity is obtained based on an apparent feature similarity between the two regions;
and determining the corresponding area, of which the second type similarity meets the specified condition, in each sub-image area in the second image frame as the second area.
In a possible implementation manner, the region adding unit is further configured to,
in response to there being no second region that satisfies the matching condition among the sub-image regions in the second image frame and the number of consecutive virtual regions in the target track segment being below a number threshold, predicting the virtual region of the target track segment in the second image frame based on each region in the target track segment;
and adding the virtual area obtained by prediction as the area of the target track segment in the second image frame.
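As an illustrative sketch only, the following Python code shows one way a virtual region could be predicted when no detection matches the target track segment; a constant-velocity extrapolation from the last two regions is assumed here, whereas the embodiment merely requires prediction based on the regions already in the track segment. The function name and parameters are illustrative.

```python
def predict_virtual_region(track_regions, virtual_count, max_virtual=5):
    """Predict a virtual bounding box when no detection matches the track.

    track_regions: (x1, y1, x2, y2) boxes already in the target track
                   segment, ordered by frame.
    virtual_count: number of consecutive virtual regions already added.
    Returns the predicted box, or None when the virtual-region budget is
    exhausted or the track is too short to estimate motion.
    """
    if virtual_count >= max_virtual or len(track_regions) < 2:
        return None
    last, prev = track_regions[-1], track_regions[-2]
    # Constant-velocity assumption: shift the last box by the last motion.
    return tuple(l + (l - p) for l, p in zip(last, prev))
```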
In a possible implementation manner, the merging module 803 includes:
a similarity obtaining unit, configured to obtain a similarity between the at least two track segments;
a merging unit, configured to merge the at least two track segments based on the similarity between the at least two track segments, so as to obtain the at least one area track.
In a possible implementation manner, the similarity obtaining unit includes:
a position similarity obtaining subunit, configured to obtain a position similarity between a first track segment and a second track segment in the at least two track segments; the first image frame, in which the last sub-image area in the first track segment is located, precedes, in the time domain, the second image frame in which the first sub-image area in the second track segment is located;
a size similarity obtaining subunit, configured to obtain a size similarity between the first track segment and the second track segment based on a size of a last sub-image area in the first track segment and a size of a first sub-image area in the second track segment;
an apparent similarity obtaining subunit, configured to obtain an apparent feature similarity between the first track segment and the second track segment based on an apparent feature of at least one sub-image region in the first track segment and an apparent feature of at least one sub-image region in the second track segment;
a segment similarity obtaining subunit, configured to obtain a similarity between the first track segment and the second track segment based on the position similarity, the size similarity, and the apparent feature similarity between the first track segment and the second track segment.
In a possible implementation manner, the location similarity obtaining subunit is configured to,
performing track prediction in image frames subsequent to the first image frame based on the position of each sub-image region in the first track segment to obtain a prediction region extending from the first track segment to the second image frame;
and acquiring the position similarity between the prediction area of the first track segment extending to the second image frame and the first sub-image area of the second track segment as the position similarity between the first track segment and the second track segment.
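As an illustrative sketch only (assuming a constant-velocity extrapolation and an overlap-based similarity, neither of which is mandated by the embodiment), the following Python code estimates the position similarity between two track segments by extending the first segment to the frame where the second segment starts.

```python
def segment_position_similarity(first_track, second_track, gap_frames):
    """Position similarity between two track segments.

    first_track / second_track: lists of (x1, y1, x2, y2) boxes, ordered by
    frame; gap_frames: frames between the end of the first segment and the
    start of the second. Assumes the first segment has at least two boxes.
    """
    last, prev = first_track[-1], first_track[-2]
    # Constant-velocity extrapolation of the first segment's last box.
    predicted = [l + (l - p) * gap_frames for l, p in zip(last, prev)]
    observed = second_track[0]
    # Overlap (IoU) between the predicted box and the observed first box.
    x1, y1 = max(predicted[0], observed[0]), max(predicted[1], observed[1])
    x2, y2 = min(predicted[2], observed[2]), min(predicted[3], observed[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_p = (predicted[2] - predicted[0]) * (predicted[3] - predicted[1])
    area_o = (observed[2] - observed[0]) * (observed[3] - observed[1])
    return inter / (area_p + area_o - inter + 1e-9)
```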
In one possible implementation, the size similarity obtaining subunit is configured to,
acquiring an initial size similarity based on the size of the last sub-image region in the first track segment and the size of the first sub-image region in the second track segment;
correcting the initial size similarity through an objective exponential function to obtain the size similarity of the first track segment and the second track segment; the value of the objective exponential function is inversely related to the time interval between the first image frame and the second image frame.
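As an illustrative sketch only, the following Python code computes an initial size similarity as a ratio of box areas and discounts it with an exponential function of the frame gap, so that the value decreases as the time interval grows; the ratio form, the decay constant `alpha` and the function name are assumptions introduced for illustration.

```python
import math

def segment_size_similarity(last_box_first, first_box_second, frame_gap, alpha=0.1):
    """Size similarity between two track segments, discounted by the gap.

    last_box_first / first_box_second: (x1, y1, x2, y2) boxes.
    frame_gap: frames between the end of the first segment and the start
    of the second; alpha controls how quickly the correction decays.
    """
    area_a = (last_box_first[2] - last_box_first[0]) * (last_box_first[3] - last_box_first[1])
    area_b = (first_box_second[2] - first_box_second[0]) * (first_box_second[3] - first_box_second[1])
    initial_sim = min(area_a, area_b) / (max(area_a, area_b) + 1e-9)
    # Exponential correction: the longer the gap, the less the sizes are
    # expected to match, so the similarity is scaled down.
    return initial_sim * math.exp(-alpha * frame_gap)
```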
In a possible implementation manner, the apparent similarity obtaining subunit is configured to,
fusing the apparent features of at least two sub-image regions which are discrete in time domain in the first track segment to obtain the apparent feature of the first track segment;
fusing the apparent features of at least two sub-image regions which are discrete in the time domain in the second track segment to obtain the apparent feature of the second track segment;
and acquiring the similarity between the apparent features of the first track segment and the apparent features of the second track segment as the similarity of the apparent features of the first track segment and the second track segment.
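As an illustrative sketch only, the following Python code fuses the sampled appearance vectors of each track segment by averaging them and compares the fused descriptors with cosine similarity; both choices are assumptions, since the embodiment only specifies fusing the apparent features and measuring their similarity.

```python
import numpy as np

def segment_appearance_similarity(features_first, features_second):
    """Appearance similarity between two track segments.

    features_first / features_second: lists of per-region appearance
    vectors sampled at discrete frames within each segment.
    """
    # Fuse each segment's appearance by averaging its sampled vectors.
    fused_a = np.mean(np.stack(features_first), axis=0)
    fused_b = np.mean(np.stack(features_second), axis=0)
    # Cosine similarity between the two fused descriptors.
    return float(np.dot(fused_a, fused_b) /
                 (np.linalg.norm(fused_a) * np.linalg.norm(fused_b) + 1e-9))
```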
In one possible implementation manner, the merging unit includes:
an undirected graph constructing subunit, configured to construct a weighted undirected graph by using the at least two track segments as nodes and using the similarity between the at least two track segments as a weight of an edge;
the graph partitioning subunit is used for performing graph partitioning on the weighted undirected graph to obtain at least one optimal subgraph;
and the merging subunit is used for merging the track segments belonging to the same optimal subgraph to obtain the at least one area track.
In a possible implementation manner, the graph partitioning subunit is configured to perform graph partitioning on the weighted undirected graph by using a depth-first search algorithm in combination with a greedy matching algorithm, so as to obtain the at least one optimal subgraph.
To sum up, in the scheme shown in the embodiment of the present application, for each image frame in a video, the sub-image regions corresponding to objects of a specified type in each image frame are determined; then, based on the similarity between sub-image regions in adjacent image frames, track segments each formed by the regions of the same object in consecutive image frames are obtained; the track segments are merged to obtain a plurality of region tracks; and finally, the objects corresponding to the same region track are labeled as the same object.
FIG. 9 is a schematic diagram illustrating a configuration of a computer device, according to an example embodiment. The computer device may be implemented as the computer device in the above-described method embodiments. The computer device 900 includes a central processing unit 901, a system Memory 904 including a Random Access Memory (RAM) 902 and a Read-Only Memory (ROM) 903, and a system bus 905 connecting the system Memory 904 and the central processing unit 901. The computer device 900 also includes a basic input/output system 906 for facilitating information transfer between the various elements within the computer, and a mass storage device 907 for storing an operating system 913, application programs 914, and other program modules 915.
The mass storage device 907 is connected to the central processing unit 901 through a mass storage controller (not shown) connected to the system bus 905. The mass storage device 907 and its associated computer-readable media provide non-volatile storage for the computer device 900. That is, the mass storage device 907 may include a computer-readable medium (not shown) such as a hard disk or Compact disk Read-Only Memory (CD-ROM) drive.
Without loss of generality, the computer-readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, flash memory or other solid state storage technology, CD-ROM, or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will appreciate that the computer storage media is not limited to the foregoing. The system memory 904 and mass storage device 907 described above may be collectively referred to as memory.
The computer device 900 may be connected to the internet or other network device through a network interface unit 911 connected to the system bus 905.
The memory further stores one or more programs, and the central processing unit 901 implements all or part of the steps of the method shown in fig. 2 or fig. 5 by executing the one or more programs.
In an exemplary embodiment, a non-transitory computer readable storage medium comprising instructions, such as a memory comprising computer programs (instructions), executable by a processor of a computer device to perform the methods shown in the various embodiments of the present application, is also provided. For example, the non-transitory computer readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a Compact Disc Read-Only Memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, a computer program product or computer program is also provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the methods shown in the various embodiments described above.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (12)

1. A method for labeling objects in a video, the method comprising:
acquiring a sub-image area in each image frame of a video; the sub-image area is an area of an image of a specified type object in a corresponding image frame;
associating the sub-image areas in each image frame based on the similarity of the sub-image areas in each image frame to obtain at least two track segments; the track segment is composed of one sub-image area in each continuous image frame;
merging the at least two track segments to obtain at least one area track;
in each image frame, marking the same object identifier for each sub-image area in the area track, wherein the sub-image areas with the same object identifier in each image frame are the areas where the same specified type of object is located;
wherein, the associating the sub-image areas in each image frame based on the similarity of the sub-image areas in each image frame to obtain at least two track segments comprises:
determining a second area meeting a matching condition from each sub-image area in a second image frame based on the area similarity between a first area in a first image frame and each sub-image area in the second image frame; the first image frame and the second image frame are two adjacent image frames of the respective image frames; the first region is the region corresponding to a target track segment in the first image frame;
adding the second region as a region of the target track segment in the second image frame;
the determining, based on the region similarity between the first region in the first image frame and each sub-image region in the second image frame, a second region satisfying a matching condition from each sub-image region in the second image frame includes:
acquiring a first type similarity between the first region and each sub-image region in the second image frame, wherein the first type similarity is obtained based on the position similarity, the size similarity and the apparent feature similarity between the two regions;
determining, as the second area, an area in each sub-image area in the second image frame, where the corresponding first-type similarity satisfies the matching condition;
in response to no corresponding region, of which the first type similarity meets the matching condition, in each sub-image region in the second image frame, acquiring a second type similarity between the first region and each sub-image region in the second image frame, wherein the second type similarity is obtained based on an apparent feature similarity between the two regions;
and determining the corresponding area, of the sub-image areas in the second image frame, of which the second type similarity meets the matching condition as the second area.
2. The method of claim 1, further comprising:
in response to there being no second region that satisfies the matching condition among the sub-image regions in the second image frame and the number of consecutive virtual regions in the target track segment being below a number threshold, predicting the virtual region of the target track segment in the second image frame based on each region in the target track segment;
and adding the virtual area obtained by prediction as the area of the target track segment in the second image frame.
3. The method according to claim 1, wherein said merging the at least two track segments to obtain at least one region track comprises:
acquiring the similarity between the at least two track segments;
and combining the at least two track segments based on the similarity between the at least two track segments to obtain the at least one region track.
4. The method of claim 3, wherein the obtaining the similarity between the at least two track segments comprises:
acquiring the position similarity between a first track segment and a second track segment in the at least two track segments; the third image frame, in which the last sub-image area in the first track segment is located, precedes, in the time domain, the fourth image frame in which the first sub-image area in the second track segment is located;
acquiring the size similarity of the first track segment and the second track segment based on the size of the last sub-image area in the first track segment and the size of the first sub-image area in the second track segment;
acquiring the apparent feature similarity of the first track segment and the second track segment based on the apparent feature of at least one sub-image region in the first track segment and the apparent feature of at least one sub-image region in the second track segment;
and acquiring the similarity between the first track segment and the second track segment based on the position similarity, the size similarity and the apparent feature similarity between the first track segment and the second track segment.
5. The method of claim 4, wherein obtaining the position similarity between a first track segment and a second track segment of the at least two track segments comprises:
performing track prediction in image frames subsequent to the third image frame based on the position of each sub-image region in the first track segment to obtain a prediction region extending from the first track segment to the fourth image frame;
and acquiring the position similarity between the prediction area of the first track segment extending into the fourth image frame and the first sub-image area in the second track segment as the position similarity between the first track segment and the second track segment.
6. The method of claim 4, wherein obtaining a similarity between the sizes of the first track segment and the second track segment based on the size of the last sub-image area in the first track segment and the size of the first sub-image area in the second track segment comprises:
acquiring an initial size similarity based on the size of the last sub-image area in the first track segment and the size of the first sub-image area in the second track segment;
correcting the initial size similarity through an objective exponential function to obtain the size similarity of the first track segment and the second track segment; the value of the objective exponential function is inversely related to the time interval between the third image frame and the fourth image frame.
7. The method of claim 4, wherein the obtaining the apparent feature similarity of the first track segment and the second track segment based on the apparent feature of the at least one sub-image region in the first track segment and the apparent feature of the at least one sub-image region in the second track segment comprises:
fusing the apparent features of at least two sub-image regions which are discrete in time domain in the first track segment to obtain the apparent feature of the first track segment;
fusing the apparent features of at least two sub-image regions which are discrete in the time domain in the second track segment to obtain the apparent feature of the second track segment;
and acquiring the similarity between the apparent features of the first track segment and the apparent features of the second track segment as the similarity of the apparent features of the first track segment and the second track segment.
8. The method according to claim 3, wherein the merging the at least two track segments based on the similarity between the at least two track segments to obtain the at least one region track comprises:
taking the at least two track segments as nodes, and taking the similarity between the at least two track segments as the weight of an edge to construct a weighted undirected graph;
carrying out graph segmentation on the weighted undirected graph to obtain at least one optimal subgraph;
and merging track segments belonging to the same optimal subgraph to obtain the at least one region track.
9. The method of claim 8, wherein the graph partitioning the weighted undirected graph to obtain at least one optimal subgraph comprises:
and carrying out graph segmentation on the weighted undirected graph by combining a depth-first search algorithm and a greedy matching algorithm to obtain the at least one optimal subgraph.
10. An apparatus for labeling objects in a video, the apparatus comprising:
the area acquisition module is used for acquiring sub-image areas in each image frame of the video; the sub-image area is an area of an image of a specified type object in a corresponding image frame;
the area association module is used for associating the sub-image areas in the image frames based on the similarity of the sub-image areas in the image frames to obtain at least two track segments; the track segment is composed of one sub-image area in each continuous image frame;
a merging module, configured to merge the at least two track segments to obtain at least one area track;
the object marking module is used for marking the same object identifier for each sub-image area in the area track in each image frame, and the sub-image areas with the same object identifier in each image frame are the areas where the same specified type of objects are located;
the area association module comprises an area determining unit and an area adding unit;
the area determining unit is used for determining a second area meeting a matching condition from each sub-image area in a second image frame based on the area similarity between a first area in a first image frame and each sub-image area in the second image frame; the first image frame and the second image frame are two adjacent image frames of the respective image frames; the first region is the region corresponding to a target track segment in the first image frame;
the region adding unit is used for adding the second region as the region of the target track segment in the second image frame;
the region determining unit is used for acquiring a first type similarity between the first region and each sub-image region in the second image frame, wherein the first type similarity is obtained based on the position similarity, the size similarity and the apparent feature similarity between the two regions; determining, as the second area, an area in each sub-image area in the second image frame, where the corresponding first-type similarity satisfies the matching condition;
the region determining unit is further configured to, in response to that there is no corresponding region, in each sub-image region in the second image frame, for which the first type similarity satisfies the matching condition, acquire a second type similarity between the first region and each sub-image region in the second image frame, where the second type similarity is obtained based on an apparent feature similarity between two regions; and determining the corresponding area, of the sub-image areas in the second image frame, of which the second type similarity meets the matching condition as the second area.
11. A computer device comprising a processor and a memory, the memory having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, the at least one instruction, the at least one program, the set of codes, or the set of instructions being loaded and executed by the processor to implement the method of object annotation in video according to any one of claims 1 to 9.
12. A computer readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by a processor to implement the method of object annotation in video according to any one of claims 1 to 9.
CN202011254673.1A 2020-11-11 2020-11-11 Method and device for labeling objects in video, computer equipment and storage medium Active CN112070071B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011254673.1A CN112070071B (en) 2020-11-11 2020-11-11 Method and device for labeling objects in video, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011254673.1A CN112070071B (en) 2020-11-11 2020-11-11 Method and device for labeling objects in video, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112070071A CN112070071A (en) 2020-12-11
CN112070071B true CN112070071B (en) 2021-03-26

Family

ID=73655053

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011254673.1A Active CN112070071B (en) 2020-11-11 2020-11-11 Method and device for labeling objects in video, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112070071B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112712013B (en) * 2020-12-29 2024-01-05 杭州海康威视数字技术股份有限公司 Method and device for constructing moving track
CN113033449A (en) * 2021-04-02 2021-06-25 上海国际汽车城(集团)有限公司 Vehicle detection and marking method and system and electronic equipment
CN113627534A (en) * 2021-08-11 2021-11-09 百度在线网络技术(北京)有限公司 Method and device for identifying type of dynamic image and electronic equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104915970B (en) * 2015-06-12 2019-03-29 南京邮电大学 A kind of multi-object tracking method based on Track association

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106980658A (en) * 2017-03-15 2017-07-25 北京旷视科技有限公司 Video labeling method and device
CN110688940A (en) * 2019-09-25 2020-01-14 北京紫睛科技有限公司 Rapid face tracking method based on face detection
CN110929560A (en) * 2019-10-11 2020-03-27 杭州电子科技大学 Video semi-automatic target labeling method integrating target detection and tracking

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Multi-object tracking based on appearance features and interaction information; Xu Liang; China Master's Theses Full-text Database, Information Science and Technology Series; 2018-01-15 (No. 01); pp. I138-1720 *
Research on video object tracking algorithms based on recurrent neural networks; Zhao Kaikai; China Master's Theses Full-text Database, Information Science and Technology Series; 2019-12-15 (No. 12); pp. I138-604 *
Research on multi-object tracking and trajectory prediction based on long short-term memory networks; Liang Yiming; China Master's Theses Full-text Database, Information Science and Technology Series; 2020-06-15 (No. 06); pp. I138-805 *

Also Published As

Publication number Publication date
CN112070071A (en) 2020-12-11

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40035425

Country of ref document: HK

TR01 Transfer of patent right

Effective date of registration: 20221101

Address after: 518057 Tencent Building, No. 1 High-tech Zone, Nanshan District, Shenzhen City, Guangdong Province, 35 floors

Patentee after: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd.

Patentee after: TENCENT CLOUD COMPUTING (BEIJING) Co.,Ltd.

Address before: 518057 Tencent Building, No. 1 High-tech Zone, Nanshan District, Shenzhen City, Guangdong Province, 35 floors

Patentee before: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd.

TR01 Transfer of patent right