CN117275025A - Processing system for batch image annotation


Info

Publication number: CN117275025A
Application number: CN202311438309.4A
Authority: CN (China)
Prior art keywords: detection frame, image, annotation, frame, detection
Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Other languages: Chinese (zh)
Inventors: 张学森, 孙涤非, 任轶
Current Assignee: Beijing Daoyi Shuhui Technology Co., Ltd.
Original Assignee: Beijing Daoyi Shuhui Technology Co., Ltd.
Application filed by Beijing Daoyi Shuhui Technology Co., Ltd.

Classifications

    • G06V — IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING (G — PHYSICS; G06 — COMPUTING; CALCULATING OR COUNTING); G06V 30/00 — Character recognition; recognising digital ink; document-oriented image-based pattern recognition
    • G06V 30/414 — Analysis of document content: extracting the geometrical structure, e.g. layout tree; block segmentation, e.g. bounding boxes for graphics or text
    • G06V 30/1448 — Image acquisition: selective acquisition, locating or processing of specific regions, based on markings or identifiers characterising the document or the area
    • G06V 30/148 — Image acquisition: segmentation of character regions
    • G06V 30/19007 — Recognition using electronic means: matching; proximity measures

Abstract

The embodiment of the invention relates to a processing system for batch image annotation, which comprises: a task scheduling module, a task input module, a manual labeling module, a manual auditing module, a task output module, a multi-mode target detection model, an image feature learning model and an image segmentation model. The system can shorten the working time of the labeling work, improve the working efficiency of the labeling work and reduce the labeling cost of the labeling work.

Description

Processing system for batch image annotation
Technical Field
The invention relates to the field of data processing, in particular to a processing system for batch image annotation.
Background
In the field of automatic driving, massive numbers of images must be acquired for training various models, and these acquired images must then be annotated. At present, conventional image annotation is performed manually; this way of working has low efficiency, long annotation times and high annotation costs.
Disclosure of Invention
The object of the present invention, addressing the defects of the prior art, is to provide a processing system for batch image annotation, the system comprising: a task scheduling module, a task input module, a manual labeling module, a manual auditing module, a task output module, a multi-mode target detection model, an image feature learning model and an image segmentation model; the task scheduling module is used for sorting the image sequence to be detected out of the labeling task received by the task input module; the manual annotation module is used for confirming the target type text through interaction with the user according to the annotation mode, and for selecting part of the images to be detected from the image sequence to be detected for pre-annotation processing; the task scheduling module then invokes the multi-mode target detection model, the image feature learning model and the image segmentation model, and performs target detection, low-score detection frame filtering and semantic segmentation processing on the image sequence to be detected according to the target type text sequence and the annotation frame data set output by the manual annotation module, to obtain a corresponding detection frame segmentation data set; the manual auditing module carries out manual auditing according to the image sequence to be detected, the detection frame data set and the detection frame segmentation data set; and the task scheduling module forms the audit output of the manual auditing module into a corresponding task output data packet and outputs the data packet through the task output module. Each time the system processes a massive image labeling task, only a few images need to be selected from the mass of images and pre-annotated according to the target types to be labeled; the system then automatically labels the remaining images according to the pre-annotated target types and annotation frames, and provides a manual auditing interface for auditing the labeling results. The system can shorten the working time of the labeling work, improve the working efficiency of the labeling work and reduce the labeling cost of the labeling work.
To achieve the above object, an embodiment of the present invention provides a processing system for batch image annotation, the system including: a task scheduling module, a task input module, a manual labeling module, a manual auditing module, a task output module, a multi-mode target detection model, an image feature learning model and an image segmentation model;
the task scheduling module is respectively connected with the task input module, the manual annotation module, the manual auditing module, the task output module, the multi-mode target detection model, the image feature learning model and the image segmentation model; the multi-mode target detection model defaults to a Grounding DINO model; the image feature learning model defaults to a DINOv2 model; the image segmentation model adopts a SAM model by default;
the task input module is used for sending a first labeling task input by a user to the task scheduling module; the first labeling task comprises a first labeling mode, a first task data type and first task data; the first annotation mode comprises a simple annotation mode and a complex annotation mode; the first task data type comprises an image type and a video type; when the first task data type is an image type the corresponding first task data is an image sequence, and when the first task data type is a video type the corresponding first task data is video data;
The task scheduling module is used for extracting the corresponding first annotation mode, the first task data type and the first task data from the received first labeling task; the first task data type is identified, and if the first task data type is an image type, the first task data is used as the corresponding first image sequence to be detected, while if the first task data type is a video type, video frame extraction processing is carried out on the first task data and all the extracted images are formed into the corresponding first image sequence to be detected in time order; and the first labeling mode and the first image sequence to be detected are sent to the manual labeling module;
the manual annotation module is used for confirming the target type text through interaction with the user according to the first annotation mode to obtain a corresponding first target type text sequence when the first annotation mode and the first image sequence to be detected are received; selecting part of the images to be detected from the first image sequence to be detected according to the first labeling mode, the first target type text sequence and user interaction, and performing pre-labeling processing to obtain a corresponding first annotation frame data set; and returning the first target type text sequence and the first annotation frame data set to the task scheduling module;
The task scheduling module is further used for calling the multi-mode target detection model to perform target detection processing on the first image sequence to be detected according to the first target type text sequence to obtain a corresponding first detection frame data set when the first target type text sequence and the first annotation frame data set are received; calling the image feature learning model to respectively carry out corresponding annotation/detection frame image feature recognition processing on the first annotation frame data set and the first detection frame data set to obtain a corresponding first annotation frame feature set and a corresponding first detection frame feature set; performing low-score detection frame filtering processing on the first detection frame data set according to the first annotation frame feature set and the first detection frame feature set; invoking the image segmentation model to carry out detection frame image semantic segmentation processing on the filtered first detection frame data set to obtain a corresponding first detection frame segmentation data set; and sending the first image sequence to be detected, the first detection frame data set and the first detection frame segmentation data set to the manual auditing module;
The manual auditing module is used for conducting manual auditing processing according to the received first image sequence to be detected, the first detection frame data set and the first detection frame segmentation data set, outputting a corresponding first audited image sequence, first audited detection frame data set and first audited detection frame segmentation data set, and sending these back to the task scheduling module;
the task scheduling module is further used for forming the received first audited image sequence, first audited detection frame data set and first audited detection frame segmentation data set into a corresponding first task output data packet; and outputting the first task output data packet to the user through the task output module.
Preferably, the first image sequence to be detected includes a plurality of first images to be detected, and each first image to be detected corresponds to a first image identifier;
the first target type text sequence includes one or more first target type texts; when the first annotation mode is a simple annotation mode, the first target type text sequence consists of a plurality of first target type texts, and each first target type text is a target type noun without attributive modifiers; when the first annotation mode is a complex annotation mode, the first target type text sequence only comprises one first target type text, and this unique first target type text is a target type noun phrase with one or more attributive modifiers;
The first annotation frame data set comprises a plurality of first annotation frame data; the first annotation frame data comprises a first parent image identifier, a first annotation frame image, a first annotation frame center point coordinate, a first annotation frame size, a first annotation frame orientation and a first annotation frame type; the first parent image identifier corresponds to one first image identifier; the first annotation frame type corresponds to one first target type text;
the first detection frame data set comprises a plurality of first detection frame data; the first detection frame data comprises a second parent image identifier, a first detection frame image, a first detection frame center point coordinate, a first detection frame size, a first detection frame orientation and a first detection frame type; the second parent image identifier corresponds to one of the first image identifiers; the first detection frame type corresponds to one first target type text;
the first detection frame segmentation data set comprises a plurality of first detection frame segmentation data; the first detection frame segmentation data comprise a second detection frame identifier and a first detection frame semantic segmentation map; the second detection frame identifier corresponds to one of the first detection frame identifiers; the pixel semantics of the first detection frame semantic segmentation map include foreground semantics and background semantics, and the foreground semantics correspond to one of the first detection frame types.
Preferably, the manual labeling module is specifically configured, when performing target type text confirmation through interaction with the user according to the first labeling mode to obtain a corresponding first target type text sequence, to identify the first labeling mode;
if the first annotation mode is a simple annotation mode, providing a first simple target type input page for a user; receiving a plurality of target type nouns input by a user through the first simple target type input page, taking each input target type noun as a corresponding first target type text, and forming a corresponding first target type text sequence by all obtained first target type texts;
if the first annotation mode is a complex annotation mode, providing a first complex target type input page for the user; and receiving a target type noun phrase with one or more attributive modifiers input by the user through the first complex target type input page as the corresponding first target type text, the corresponding first target type text sequence being formed from this unique first target type text.
Preferably, the manual labeling module is specifically configured, when selecting part of the images to be detected from the first image sequence to be detected according to the first labeling mode, the first target type text sequence and user interaction and performing pre-labeling processing to obtain the corresponding first annotation frame data set, to provide a first pre-annotation page for the user and to arrange and display all the first images to be detected of the first image sequence to be detected on the first pre-annotation page;
When any one of the first images to be detected is selected by the user, the currently selected first image to be detected is used as the corresponding current image; an annotation frame drawing function is provided for the user to draw annotation frames on the current image so as to obtain one or more corresponding first annotation frames; the first image identifier of the current image is used as the first parent image identifier of each first annotation frame; the annotation frame images of the first annotation frames on the current image are extracted as the corresponding first annotation frame images; and the annotation frame center point coordinate, annotation frame size and annotation frame orientation of each first annotation frame on the current image are used as the corresponding first annotation frame center point coordinate, first annotation frame size and first annotation frame orientation;
when any first annotation frame is selected by a user, taking the currently selected first annotation frame as a corresponding current annotation frame; identifying the first labeling mode; if the first annotation mode is a simple annotation mode, providing an annotation frame type marking function for a user to optionally select one first target type text from the first target type text sequence as a corresponding first annotation frame type to mark the current annotation frame; if the first annotation mode is a complex annotation mode, taking a unique first target type text in the first target type text sequence as a corresponding current target type text, displaying a first prompt message with a confirmation option and a cancel option to a user, prompting whether the current target type text is to be used as the first annotation frame type corresponding to the current annotation frame or not through the first prompt message, and setting the first annotation frame type corresponding to the current annotation frame as the corresponding current target type text when the user selects the confirmation option of the first prompt message;
When a pre-annotation submitting option preset on the first pre-annotation page is selected by the user, the first parent image identifier, the first annotation frame image, the first annotation frame center point coordinate, the first annotation frame size, the first annotation frame orientation and the first annotation frame type corresponding to each first annotation frame form the corresponding first annotation frame data; and the corresponding first annotation frame data set is composed of all the obtained first annotation frame data.
Preferably, the task scheduling module is specifically configured to traverse the first to-be-detected image of the first to-be-detected image sequence when the multi-mode target detection model is invoked to perform target detection processing on the first to-be-detected image sequence according to the first target type text sequence to obtain a corresponding first detection frame data set; the first image to be detected which is traversed currently is used as a corresponding current image to be detected, and the first image identifier corresponding to the current image to be detected is used as a corresponding current image identifier; inputting the first target type text sequence and the current to-be-detected image into the multi-mode target detection model, and carrying out directional target detection on the current to-be-detected image by the multi-mode target detection model according to one or more first target type texts in the first target type text sequence and outputting a corresponding first detection frame-text pair set; if the first detection frame-text pair set is not empty, carrying out detection frame data assembly according to the current image identification, the current image to be detected and the first detection frame-text pair set to obtain a corresponding first detection frame data subset; when the traversing is finished, combining all the obtained first detection frame data subsets to form a corresponding first detection frame data set;
Wherein the first set of detection box-text pairs comprises a plurality of first detection box-text pairs; the first detection box-text pair comprises a first target detection box and a first text; the first target detection frame comprises a first target detection frame center point coordinate, a first target detection frame size and a first target detection frame orientation; when the first target type text sequence contains more than one first target type text, the first text corresponds to one of the first target type texts in the sequence; and when the first target type text sequence contains only one first target type text, the first text corresponds to that unique first target type text.
Further, the task scheduling module is specifically configured to traverse the first detection frame-text pairs of the first detection frame-text pair set when detection frame data assembly is performed according to the current image identifier, the current image to be detected and the first detection frame-text pair set to obtain the corresponding first detection frame data subset; during the traversal, the first detection frame-text pair currently traversed is used as the corresponding current detection frame-text pair; the current image identifier is used as the corresponding second parent image identifier; a unique detection frame identifier is allocated to the first target detection frame of the current detection frame-text pair as the corresponding first detection frame identifier; a detection frame image of the first target detection frame of the current detection frame-text pair is extracted from the current image to be detected as the corresponding first detection frame image; the first target detection frame center point coordinate, first target detection frame size and first target detection frame orientation of the first target detection frame of the current detection frame-text pair are used as the corresponding first detection frame center point coordinate, first detection frame size and first detection frame orientation; the first text of the current detection frame-text pair is used as the corresponding first detection frame type; the obtained second parent image identifier, first detection frame image, first detection frame center point coordinate, first detection frame size, first detection frame orientation and first detection frame type form the corresponding first detection frame data; and when the traversal is finished, the corresponding first detection frame data subset is formed from all the obtained first detection frame data.
Preferably, the task scheduling module is specifically configured to, when the invoking the image feature learning model performs corresponding labeling/detection frame image feature recognition processing on the first labeling frame data set and the first detection frame data set to obtain a corresponding first labeling frame feature set and a first detection frame feature set, input the first labeling frame images of the first labeling frame data set into the image feature learning model, and perform image feature extraction processing on the first labeling frame images by using the image feature learning model to obtain corresponding first labeling frame features; inputting the first detection frame images of the first detection frame data set into the image feature learning model, and carrying out image feature extraction processing on the first detection frame images by the image feature learning model to obtain corresponding first detection frame features; and the corresponding first labeling frame feature set is formed by all the obtained first labeling frame features, and the corresponding first detection frame feature set is formed by all the obtained first detection frame features.
Preferably, the task scheduling module is specifically configured to traverse the first detection frame features of the first detection frame feature set when the low-score detection frame filtering processing is performed on the first detection frame data set according to the first annotation frame feature set and the first detection frame feature set; the first detection frame feature currently traversed is used as the corresponding current detection frame feature, and the first detection frame type of the first detection frame data corresponding to the current detection frame feature is used as the corresponding current detection frame type; each first annotation frame data in the first annotation frame data set whose first annotation frame type matches the current detection frame type is taken as corresponding matched annotation frame data; the first annotation frame features in the first annotation frame feature set corresponding to the matched annotation frame data are taken as corresponding same-type annotation frame features; matching and scoring are performed between the current detection frame feature and each same-type annotation frame feature based on the Hungarian matching algorithm to obtain corresponding first scores, and all the obtained first scores are averaged to generate a corresponding first average score; and the first detection frame data corresponding to the current detection frame feature is deleted from the first detection frame data set when the first average score is lower than a preset scoring threshold.
Preferably, the task scheduling module is specifically configured to traverse the first detection frame data of the first detection frame data set when the image segmentation model is invoked to perform detection frame image semantic segmentation processing on the filtered first detection frame data set to obtain a corresponding first detection frame segmentation data set; during the traversal, the first detection frame data currently traversed is used as the corresponding current detection frame data; the first detection frame image of the current detection frame data is input into the image segmentation model, and the image segmentation model performs pixel-level foreground and background pixel semantic segmentation processing on the first detection frame image to generate a corresponding first detection frame semantic segmentation map; each pixel point on the first detection frame semantic segmentation map whose pixel semantics are not background semantics is marked as a corresponding first foreground pixel point, and the pixel semantics of each first foreground pixel point are set to the first detection frame type of the current detection frame data; the first detection frame identifier of the current detection frame data is used as the corresponding second detection frame identifier; the obtained second detection frame identifier and first detection frame semantic segmentation map form the corresponding first detection frame segmentation data; and when the traversal is finished, the corresponding first detection frame segmentation data set is formed from all the obtained first detection frame segmentation data.
Preferably, the manual auditing module is specifically configured, when performing manual auditing processing according to the received first image sequence to be detected, first detection frame data set and first detection frame segmentation data set and outputting a corresponding first audited image sequence, first audited detection frame data set and first audited detection frame segmentation data set to send back to the task scheduling module, to merge the first detection frame data set and the first detection frame segmentation data set according to the correspondence of detection frame identifiers to obtain a corresponding second detection frame data set; wherein the second detection frame data set includes a plurality of second detection frame data; the second detection frame data comprises the second parent image identifier, the first detection frame image, the first detection frame center point coordinate, the first detection frame size, the first detection frame orientation, the first detection frame type and the first detection frame semantic segmentation map;
traversing each first image to be detected of the first image sequence to be detected; during the traversal, the first image to be detected currently traversed is used as the corresponding current image to be detected; the first image identifier corresponding to the current image to be detected is used as the corresponding current image identifier; the second detection frame data in the second detection frame data set whose second parent image identifier matches the current image identifier are recorded as corresponding first matching detection frame data; identifying whether the number of the first matching detection frame data is zero; if the number of the first matching detection frame data is zero, marking the current image to be detected as a corresponding first image to be filtered; if the number of the first matching detection frame data is not zero, performing corresponding detection frame drawing, foreground semantic pixel coloring and text prompt frame drawing processing on the current image to be detected according to all the first matching detection frame data to obtain a corresponding first review image; and when the traversal is finished, providing a first image review page for the user and displaying all the first review images in an array on the first image review page;
When any one of the first review images is selected by the user, displaying a second prompt message with a confirmation option and a cancel option to the user, prompting through the second prompt message whether the currently selected first review image is to be marked as a disqualified image, and marking the currently selected first review image as the corresponding first image to be filtered when the user selects the confirmation option of the second prompt message;
when a review ending option preset on the first image review page is selected by the user, deleting the first images to be detected corresponding to the first images to be filtered from the first image sequence to be detected, and taking the image sequence obtained after deletion as the corresponding first audited image sequence; the first detection frame data in the first detection frame data set whose second parent image identifier corresponds to one of the first images to be filtered are taken as corresponding first detection frame data to be deleted; the first detection frame segmentation data in the first detection frame segmentation data set whose second detection frame identifier corresponds to one of the first detection frame data to be deleted are deleted, and the data set obtained after deletion is taken as the corresponding first audited detection frame segmentation data set; all the first detection frame data to be deleted in the first detection frame data set are deleted, and the data set obtained after deletion is taken as the corresponding first audited detection frame data set; and the obtained first audited image sequence, first audited detection frame data set and first audited detection frame segmentation data set are sent back to the task scheduling module.
Further, the manual auditing module is specifically configured to traverse each first matching detection frame data when the corresponding detection frame drawing, foreground semantic pixel coloring and text prompt box drawing processing are performed on the current image to be detected according to all the first matching detection frame data to obtain the corresponding first review image; during the traversal, the first matching detection frame data currently traversed is used as the corresponding current matching detection frame data; a detection frame is drawn on the current image to be detected according to the first detection frame center point coordinate, the first detection frame size and the first detection frame orientation of the current matching detection frame data to obtain a corresponding first drawing frame; foreground semantic pixel point marking is performed on the image inside the first drawing frame according to the first detection frame semantic segmentation map of the current matching detection frame data, and a preset first color is used to set the color of the foreground semantic pixel points of the first drawing frame; a text prompt box is drawn at a designated position on the first drawing frame as a corresponding first text box, and the text content of the first text box is set to the first detection frame type of the current matching detection frame data; and when the traversal is finished, the current image to be detected with the added drawing information is taken as the corresponding first review image.
The embodiment of the invention provides a processing system for batch image annotation, which comprises: a task scheduling module, a task input module, a manual labeling module, a manual auditing module, a task output module, a multi-mode target detection model, an image feature learning model and an image segmentation model; the task scheduling module is used for sorting the image sequence to be detected out of the labeling task received by the task input module; the manual annotation module is used for confirming the target type text through interaction with the user according to the annotation mode, and for selecting part of the images to be detected from the image sequence to be detected for pre-annotation processing; the task scheduling module then invokes the multi-mode target detection model, the image feature learning model and the image segmentation model, and performs target detection, low-score detection frame filtering and semantic segmentation processing on the image sequence to be detected according to the target type text sequence and the annotation frame data set output by the manual annotation module, to obtain a corresponding detection frame segmentation data set; the manual auditing module carries out manual auditing according to the image sequence to be detected, the detection frame data set and the detection frame segmentation data set; and the task scheduling module forms the audit output of the manual auditing module into a corresponding task output data packet and outputs the data packet through the task output module. Each time the system processes a massive image labeling task, only a few images need to be selected from the mass of images and pre-annotated according to the target types to be labeled; the system then automatically labels the remaining images according to the pre-annotated target types and annotation frames, and provides a manual auditing interface for auditing the labeling results. The system not only shortens the working time of the labeling work, but also improves the working efficiency of the labeling work and reduces the labeling cost of the labeling work.
Drawings
FIG. 1 is a schematic block diagram of a processing system for batch image annotation according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
FIG. 1 is a schematic block diagram of a processing system for batch image labeling according to an embodiment of the present invention, where, as shown in FIG. 1, the system includes: the system comprises a task scheduling module 1, a task input module 2, a manual labeling module 3, a manual auditing module 4, a task output module 5, a multi-mode target detection model 6, an image feature learning model 7 and an image segmentation model 8. The task scheduling module 1 is respectively connected with the task input module 2, the manual annotation module 3, the manual auditing module 4, the task output module 5, the multi-mode target detection model 6, the image feature learning model 7 and the image segmentation model 8.
The task input module 2 is configured to send a first labeling task input by a user to the task scheduling module 1. The first labeling task comprises a first labeling mode, a first task data type and first task data; the first annotation mode comprises a simple annotation mode and a complex annotation mode; the first task data type includes an image type and a video type; when the first task data type is an image type the corresponding first task data is an image sequence, and when the first task data type is a video type the corresponding first task data is video data.
The task scheduling module 1 is used for extracting the corresponding first labeling mode, first task data type and first task data from the received first labeling task; the first task data type is identified, and if the first task data type is the image type, the first task data is used as the corresponding first image sequence to be detected, while if the first task data type is the video type, video frame extraction processing is carried out on the first task data and all the extracted images are formed into the corresponding first image sequence to be detected in time order; and the first annotation mode and the first image sequence to be detected are sent to the manual annotation module 3. The first image sequence to be detected comprises a plurality of first images to be detected, and each first image to be detected corresponds to one first image identifier.
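The patent does not name a concrete frame-extraction implementation; as a minimal sketch, assuming OpenCV is used for video decoding, the video frame extraction step might look like the following (the stride parameter is an illustrative addition, not part of the patent):

import cv2

def extract_frames(video_path: str, stride: int = 1) -> list:
    """Split a video into a time-ordered image sequence.

    `stride` (a hypothetical parameter) keeps every Nth frame to
    bound the size of the image sequence to be detected.
    """
    cap = cv2.VideoCapture(video_path)
    frames = []
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % stride == 0:
            frames.append(frame)  # frames stay in time order
        index += 1
    cap.release()
    return frames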
The manual annotation module 3 is used, when the first annotation mode and the first image sequence to be detected are received, for confirming the target type text through interaction with the user according to the first annotation mode to obtain a corresponding first target type text sequence; selecting part of the images to be detected from the first image sequence to be detected according to the first annotation mode, the first target type text sequence and user interaction and performing pre-annotation processing to obtain a corresponding first annotation frame data set; and returning the first target type text sequence and the first annotation frame data set to the task scheduling module 1.
Wherein the first target type text sequence comprises one or more first target type texts; when the first annotation mode is a simple annotation mode, the first target type text sequence consists of a plurality of first target type texts, and each first target type text is a target type noun without attributive modifiers; when the first annotation mode is a complex annotation mode, the first target type text sequence comprises only one first target type text, and this unique first target type text is a target type noun phrase with one or more attributive modifiers. The first annotation frame data set comprises a plurality of first annotation frame data; the first annotation frame data comprises a first parent image identifier, a first annotation frame image, a first annotation frame center point coordinate, a first annotation frame size, a first annotation frame orientation and a first annotation frame type; the first parent image identifier corresponds to a first image identifier; the first annotation frame type corresponds to a first target type text.
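For readability, the annotation frame records defined above, and the detection frame records defined later in the description, can be modelled as plain data structures; the field names below are hypothetical, chosen only to mirror the fields the patent lists:

from dataclasses import dataclass
from typing import Tuple
import numpy as np

@dataclass
class AnnotationBox:
    parent_image_id: str          # first parent image identifier
    box_image: np.ndarray         # first annotation frame image (crop)
    center: Tuple[float, float]   # first annotation frame center point coordinate
    size: Tuple[float, float]     # first annotation frame size (width, height)
    orientation: float            # first annotation frame orientation
    box_type: str                 # first annotation frame type (a target type text)

@dataclass
class DetectionBox:
    parent_image_id: str          # second parent image identifier
    box_id: str                   # first detection frame identifier (assigned on detection)
    box_image: np.ndarray         # first detection frame image (crop)
    center: Tuple[float, float]
    size: Tuple[float, float]
    orientation: float
    box_type: str                 # first detection frame type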
In a specific implementation manner of the embodiment of the present invention, the manual labeling module 3 is specifically configured to, when performing target type text confirmation with user interaction according to the first labeling mode to obtain a corresponding first target type text sequence:
step A1, identifying a first labeling mode;
step A2, if the first labeling mode is a simple labeling mode, providing a first simple target type input page for a user; receiving a plurality of target type nouns input by a user through a first simple target type input page, taking each input target type noun as a corresponding first target type text, and forming a corresponding first target type text sequence by all obtained first target type texts;
for example, in the case that the first annotation mode is a simple annotation mode, the first simple object type input page receives three object type nouns input by the user: "automobile", "tree", "pedestrian"; then the first target type text sequence is { "car", "tree", "pedestrian" };
step A3, if the first labeling mode is a complex labeling mode, providing a first complex target type input page for a user; and receiving a target type noun phrase with one or more fixed languages input by a user through the first complex target type input page as a corresponding first target type text, and forming a corresponding first target type text sequence by the unique first target type text.
For example, in the case that the first annotation mode is a complex annotation mode, the first complex object type input page receives a noun phrase of an object type input by the user: pedestrians on a lane; the resulting text sequence of the first target type is { "pedestrian on lane" }.
In another specific implementation manner of the embodiment of the present invention, the manual labeling module 3 is specifically configured to, when selecting a portion of an image to be detected from the first image sequence to be detected according to the first labeling mode and the first target type text sequence and user interaction, perform pre-labeling processing to obtain a corresponding first labeling frame data set:
step B1, providing a first pre-annotation page for the user, and displaying all the first images to be detected of the first image sequence to be detected in an array on the first pre-annotation page;
step B2, when any one of the first images to be detected is selected by the user, taking the currently selected first image to be detected as the corresponding current image; providing an annotation frame drawing function for the user to draw annotation frames on the current image so as to obtain one or more corresponding first annotation frames; the first image identifier of the current image is used as the first parent image identifier of each first annotation frame; the annotation frame images of the first annotation frames on the current image are extracted as the corresponding first annotation frame images; and the annotation frame center point coordinate, annotation frame size and annotation frame orientation of each first annotation frame on the current image are used as the corresponding first annotation frame center point coordinate, first annotation frame size and first annotation frame orientation;
Step B3, when any first annotation frame is selected by a user, taking the currently selected first annotation frame as a corresponding current annotation frame; identifying the first labeling mode; if the first annotation mode is a simple annotation mode, providing an annotation frame type marking function for a user to select a first target type text from a first target type text sequence as a corresponding first annotation frame type to mark the current annotation frame; if the first annotation mode is a complex annotation mode, taking a unique first target type text in a first target type text sequence as a corresponding current target type text, displaying a first prompt message with a confirmation option and a cancel option to a user, prompting whether the current target type text is to be used as a first annotation frame type corresponding to the current annotation frame or not through the first prompt message, and setting the first annotation frame type corresponding to the current annotation frame as a corresponding current target type text when the user selects the confirmation option of the first prompt message;
step B4, when a pre-annotation submitting option preset on the first pre-annotation page is selected by the user, the first parent image identifier, first annotation frame image, first annotation frame center point coordinate, first annotation frame size, first annotation frame orientation and first annotation frame type corresponding to each first annotation frame form the corresponding first annotation frame data; and the corresponding first annotation frame data set is formed from all the obtained first annotation frame data.
The task scheduling module 1 is further configured to, when receiving the first target type text sequence and the first annotation frame data set, invoke the multi-mode target detection model 6 to perform target detection processing on the first image sequence to be detected according to the first target type text sequence to obtain a corresponding first detection frame data set; call the image feature learning model 7 to perform corresponding annotation/detection frame image feature recognition processing on the first annotation frame data set and the first detection frame data set respectively, obtaining a corresponding first annotation frame feature set and first detection frame feature set; perform low-score detection frame filtering processing on the first detection frame data set according to the first annotation frame feature set and the first detection frame feature set; invoke the image segmentation model 8 to perform detection frame image semantic segmentation processing on the filtered first detection frame data set to obtain a corresponding first detection frame segmentation data set; and send the first image sequence to be detected, the first detection frame data set and the first detection frame segmentation data set to the manual auditing module 4.
Wherein, the multi-mode target detection model 6 defaults to a Grounding DINO model; the image feature learning model 7 adopts a DINOv2 model by default; the image segmentation model 8 defaults to the SAM model. The first detection frame data set comprises a plurality of first detection frame data; the first detection frame data comprises a second parent image identifier, a first detection frame image, a first detection frame center point coordinate, a first detection frame size, a first detection frame orientation and a first detection frame type; the second parent image identifier corresponds to a first image identifier; the first detection frame type corresponds to a first target type text. The first detection frame segmentation data set comprises a plurality of first detection frame segmentation data; the first detection frame segmentation data comprise a second detection frame identifier and a first detection frame semantic segmentation map; the second detection frame identifier corresponds to one first detection frame identifier; the pixel semantics of the first detection frame semantic segmentation map include foreground semantics and background semantics, the foreground semantics corresponding to one first detection frame type.
In another specific implementation manner of the embodiment of the present invention, the task scheduling module 1 is specifically configured to traverse a first to-be-detected image of the first to-be-detected image sequence when invoking the multi-mode target detection model 6 to perform target detection processing on the first to-be-detected image sequence according to the first target type text sequence to obtain a corresponding first detection frame data set; the first image to be detected which is traversed at present is used as a corresponding current image to be detected, and a first image identifier corresponding to the current image to be detected is used as a corresponding current image identifier; inputting a first target type text sequence and a current image to be detected into a multi-mode target detection model 6, and carrying out directional target detection on the current image to be detected by the multi-mode target detection model 6 according to one or more first target type texts in the first target type text sequence and outputting a corresponding first detection frame-text pair set; if the first detection frame-text pair set is not empty, carrying out detection frame data assembly according to the current image identification, the current image to be detected and the first detection frame-text pair set to obtain a corresponding first detection frame data subset; when the traversing is finished, combining all the obtained first detection frame data subsets to form a corresponding first detection frame data set;
Wherein the first set of detection box-text pairs comprises a plurality of first detection box-text pairs; the first detection box-text pair comprises a first target detection box and a first text; the first target detection frame comprises a first target detection frame center point coordinate, a first target detection frame size and a first target detection frame orientation; when the first target type text sequence contains more than one first target type text, the first text corresponds to one of the first target type texts in the sequence; when the first target type text sequence contains only one first target type text, the first text corresponds to that unique first target type text.
Here, the multi-mode target detection model 6 in the embodiment of the present invention adopts a Grounding DINO model by default. Grounding DINO is a large multi-modal object detection model built on the Transformer architecture; as described in the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection", the model is composed of an image feature extraction module, a text feature extraction module, a feature enhancement module, a language-guided query selection module and a cross-modality decoder module. The model performs dual-modality (text, image) feature extraction and fusion on the input target type text and input image, performs object detection based on the fused features, and combines the input target type text and each detected object detection box (bbox) into a detection box-text pair for output.
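As a sketch of how this directed detection could be driven, assuming the reference GroundingDINO repository and its inference helpers are used (the config and checkpoint paths below are placeholders for locally downloaded files):

from groundingdino.util.inference import load_model, load_image, predict

# Placeholder paths for the Grounding DINO config and weights.
model = load_model("GroundingDINO_SwinT_OGC.py", "groundingdino_swint_ogc.pth")

def detect(image_path, target_type_texts, box_thr=0.35, text_thr=0.25):
    """Return (boxes, phrases): Grounding DINO pairs each detected box
    with the matching fragment of the input text prompt."""
    image_source, image = load_image(image_path)
    # Grounding DINO expects the categories joined into one caption,
    # conventionally separated by " . ".
    caption = " . ".join(target_type_texts)
    boxes, logits, phrases = predict(
        model=model, image=image, caption=caption,
        box_threshold=box_thr, text_threshold=text_thr)
    return boxes, phrases  # boxes are normalized cxcywh tensors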
In another specific implementation manner of the embodiment of the present invention, the task scheduling module 1 is specifically configured to traverse the first detection frame-text pairs of the first detection frame-text pair set when detection frame data assembly is performed according to the current image identifier, the current image to be detected and the first detection frame-text pair set to obtain the corresponding first detection frame data subset; during the traversal, the first detection frame-text pair currently traversed is used as the corresponding current detection frame-text pair; the current image identifier is used as the corresponding second parent image identifier; a unique detection frame identifier is allocated to the first target detection frame of the current detection frame-text pair as the corresponding first detection frame identifier; a detection frame image of the first target detection frame of the current detection frame-text pair is extracted from the current image to be detected as the corresponding first detection frame image; the first target detection frame center point coordinate, first target detection frame size and first target detection frame orientation of the first target detection frame of the current detection frame-text pair are used as the corresponding first detection frame center point coordinate, first detection frame size and first detection frame orientation; the first text of the current detection frame-text pair is used as the corresponding first detection frame type; the obtained second parent image identifier, first detection frame image, first detection frame center point coordinate, first detection frame size, first detection frame orientation and first detection frame type form the corresponding first detection frame data; and when the traversal is finished, the corresponding first detection frame data subset is formed from all the obtained first detection frame data.
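This assembly step amounts to converting each normalized box to pixel coordinates, cropping the detection frame image and attaching an identifier; a sketch reusing the hypothetical DetectionBox record from the earlier sketch (the UUID identifier scheme is an assumption):

import uuid

def assemble_detection_data(image_id, image, boxes, phrases):
    """Build first-detection-frame records from one image's box-text pairs."""
    h, w = image.shape[:2]
    records = []
    for (cx, cy, bw, bh), text in zip(boxes.tolist(), phrases):
        # normalized cxcywh -> pixel corner coordinates
        x0 = int((cx - bw / 2) * w); y0 = int((cy - bh / 2) * h)
        x1 = int((cx + bw / 2) * w); y1 = int((cy + bh / 2) * h)
        records.append(DetectionBox(
            parent_image_id=image_id,
            box_id=uuid.uuid4().hex,        # unique detection frame identifier
            box_image=image[y0:y1, x0:x1],  # cropped detection frame image
            center=(cx * w, cy * h),
            size=(bw * w, bh * h),
            orientation=0.0,                # axis-aligned by default
            box_type=text))
    return records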
In another specific implementation manner of the embodiment of the present invention, the task scheduling module 1 is specifically configured to, when the image feature learning model 7 is invoked, perform corresponding labeling/detection frame image feature recognition processing on the first labeling frame data set and the first detection frame data set to obtain a corresponding first labeling frame feature set and a corresponding first detection frame feature set, input first labeling frame images of each first labeling frame data of the first labeling frame data set into the image feature learning model 7, and perform image feature extraction processing on each first labeling frame image by the image feature learning model 7 to obtain a corresponding first labeling frame feature; inputting first detection frame images of the first detection frame data set into an image feature learning model 7, and carrying out image feature extraction processing on the first detection frame images by the image feature learning model 7 to obtain corresponding first detection frame features; and the corresponding first marking frame feature set is formed by all the obtained first marking frame features, and the corresponding first detection frame feature set is formed by all the obtained first detection frame features.
Here, the image feature learning model 7 in the embodiment of the present invention adopts a DINOv2 model by default. DINOv2 is a large vision model, described in the paper "DINOv2: Learning Robust Visual Features without Supervision", which can perform image feature learning (extraction) on input images of arbitrary scale.
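A minimal feature-extraction sketch, assuming the publicly released DINOv2 weights are fetched through torch.hub (the fixed 224x224 input size is an assumption; DINOv2 only requires dimensions divisible by its 14-pixel patch size):

import torch
from torchvision import transforms

# Weights fetched via torch.hub from the official DINOv2 release.
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
dinov2.eval()

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Resize((224, 224), antialias=True),
    transforms.Normalize(mean=(0.485, 0.456, 0.406),
                         std=(0.229, 0.224, 0.225)),
])

@torch.no_grad()
def box_feature(box_image) -> torch.Tensor:
    """Extract one global feature vector for a cropped box image."""
    x = preprocess(box_image).unsqueeze(0)
    return dinov2(x).squeeze(0)  # class-token embedding, 384-d for ViT-S/14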
In another specific implementation manner of the embodiment of the present invention, the task scheduling module 1 is specifically configured to traverse the first detection frame features of the first detection frame feature set when performing low-score detection frame filtering processing on the first detection frame data set according to the first annotation frame feature set and the first detection frame feature set; the first detection frame feature currently traversed is used as the corresponding current detection frame feature, and the first detection frame type of the first detection frame data corresponding to the current detection frame feature is used as the corresponding current detection frame type; each first annotation frame data in the first annotation frame data set whose first annotation frame type matches the current detection frame type is taken as corresponding matched annotation frame data; the first annotation frame features in the first annotation frame feature set corresponding to the matched annotation frame data are taken as corresponding same-type annotation frame features; matching and scoring are performed between the current detection frame feature and each same-type annotation frame feature based on the Hungarian matching algorithm to obtain corresponding first scores, and all the obtained first scores are averaged to generate a corresponding first average score; and the first detection frame data corresponding to the current detection frame feature is deleted from the first detection frame data set when the first average score is lower than a preset scoring threshold.
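The patent names the Hungarian matching algorithm for scoring but does not state the underlying cost; the sketch below assumes cosine similarity between the extracted features as the pairwise score, and the threshold value is illustrative rather than taken from the patent:

import torch
import torch.nn.functional as F

SCORE_THRESHOLD = 0.5  # assumed value; the patent only says "preset"

def filter_low_score(det_records, det_feats, ann_records, ann_feats):
    """Drop detection frames whose average similarity to the same-type
    annotation frame features falls below the scoring threshold."""
    kept = []
    for rec, feat in zip(det_records, det_feats):
        same_type = [f for r, f in zip(ann_records, ann_feats)
                     if r.box_type == rec.box_type]
        if not same_type:
            continue  # no reference features of this type
        sims = [F.cosine_similarity(feat, f, dim=0).item() for f in same_type]
        if sum(sims) / len(sims) >= SCORE_THRESHOLD:
            kept.append(rec)
    return kept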
In another specific implementation manner of the embodiment of the present invention, the task scheduling module 1 is specifically configured to traverse the first detection frame data of the first detection frame data set when the image segmentation model 8 is invoked to perform detection frame image semantic segmentation processing on the filtered first detection frame data set to obtain a corresponding first detection frame segmentation data set; during the traversal, the first detection frame data currently traversed is used as the corresponding current detection frame data; the first detection frame image of the current detection frame data is input into the image segmentation model 8, and the image segmentation model 8 performs pixel-level foreground and background pixel semantic segmentation processing on the first detection frame image to generate a corresponding first detection frame semantic segmentation map; each pixel point on the first detection frame semantic segmentation map whose pixel semantics are not background semantics is marked as a corresponding first foreground pixel point, and the pixel semantics of each first foreground pixel point are set to the first detection frame type of the current detection frame data; the first detection frame identifier of the current detection frame data is used as the corresponding second detection frame identifier; the obtained second detection frame identifier and first detection frame semantic segmentation map form the corresponding first detection frame segmentation data; and when the traversal is finished, all the obtained first detection frame segmentation data form the corresponding first detection frame segmentation data set.
Here, the image segmentation model 8 in the embodiment of the present invention adopts the SAM model by default; SAM is short for Segment Anything Model, a large model for image segmentation.
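A sketch of how the SAM model could produce the foreground/background segmentation for one detection frame image is shown below. The model type and checkpoint filename are assumptions based on the public facebookresearch/segment-anything release; in practice the foreground pixels would then carry the detection frame type as their semantics, as described above.

```python
import numpy as np
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

# Model type and checkpoint name are assumptions from the public SAM release.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

def segment_detection_frame(frame_image: np.ndarray) -> np.ndarray:
    """Return a boolean foreground mask for one detection frame image.

    frame_image is an HxWx3 uint8 RGB array; True pixels carry foreground
    semantics (the detection frame type), False pixels are background.
    """
    masks = mask_generator.generate(frame_image)
    foreground = np.zeros(frame_image.shape[:2], dtype=bool)
    for mask in masks:
        foreground |= mask["segmentation"]  # union of all SAM-proposed regions
    return foreground
```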
The manual auditing module 4 is configured to perform manual auditing processing according to the received first image sequence to be detected, the first detection frame data set and the first detection frame segmentation data set, to output a corresponding first audit image sequence, first audit detection frame data set and first audit detection frame segmentation data set, and to send them back to the task scheduling module 1, specifically:
step C1, merging the first detection frame data set and the first detection frame segmentation data set according to the correspondence of detection frame identifiers to obtain a corresponding second detection frame data set (a data-handling sketch for this merge, and for the filtering in step C4, follows step C5 below);
wherein the second detection frame data set includes a plurality of second detection frame data; the second detection frame data comprises a second parent image identifier, a first detection frame image, first detection frame center point coordinates, a first detection frame size, a first detection frame orientation, a first detection frame type and a first detection frame semantic segmentation map;
step C2, traversing each first to-be-detected image of the first to-be-detected image sequence; during the traversal, taking the currently traversed first to-be-detected image as the corresponding current to-be-detected image; taking the first image identifier corresponding to the current to-be-detected image as the corresponding current image identifier; recording each second detection frame data in the second detection frame data set whose second parent image identifier matches the current image identifier as corresponding first matching detection frame data; identifying whether the number of first matching detection frame data is zero; if the number of first matching detection frame data is zero, marking the current to-be-detected image as a corresponding first image to be filtered; if the number of first matching detection frame data is not zero, performing corresponding detection frame drawing, foreground semantic pixel coloring and text prompt box drawing processing on the current to-be-detected image according to all the first matching detection frame data to obtain a corresponding first review image; and, when the traversal is finished, providing a first image review page for the user and displaying all the first review images in a tiled arrangement on the first image review page;
In another specific implementation manner of the embodiment of the present invention, the manual auditing module 4 is specifically configured, when performing corresponding detection frame drawing, foreground semantic pixel coloring and text prompt box drawing processing on the current to-be-detected image according to all the first matching detection frame data to obtain the corresponding first review image, to traverse each first matching detection frame data; during the traversal, to take the currently traversed first matching detection frame data as the corresponding current matching detection frame data; to draw a detection frame on the current to-be-detected image according to the first detection frame center point coordinates, first detection frame size and first detection frame orientation of the current matching detection frame data to obtain a corresponding first drawing frame; to mark the foreground semantic pixel points of the image inside the first drawing frame according to the first detection frame semantic segmentation map of the current matching detection frame data, and to set the color of the foreground semantic pixel points of the first drawing frame with a preset first color; to draw a text prompt box at a designated position on the first drawing frame as a corresponding first text box, and to set the text content of the first text box to the first detection frame type of the current matching detection frame data; and, when the traversal is finished, to take the current to-be-detected image with the added drawing information as the corresponding first review image;
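A rendering sketch for one review image, using OpenCV under stated assumptions, is given below. The field names of the matching detection frame data are illustrative, the foreground color is an arbitrary stand-in for the preset first color, and the segmentation mask is assumed to be stored frame-locally and axis-aligned for simplicity.

```python
import cv2
import numpy as np

FOREGROUND_COLOR = (0, 0, 255)  # stand-in for the preset first color (BGR)

def draw_review_image(image: np.ndarray, matches: list) -> np.ndarray:
    """Render a review image: box outline, foreground tint, and type label.

    Each match is assumed to carry 'center' (x, y), 'size' (w, h),
    'angle' (degrees), 'type' (str) and 'mask' (frame-local HxW bool array).
    """
    canvas = image.copy()
    for m in matches:
        # Detection frame drawing: rotated rectangle from center, size, orientation.
        box = cv2.boxPoints((m["center"], m["size"], m["angle"])).astype(np.int32)
        cv2.polylines(canvas, [box.reshape(-1, 1, 2)], isClosed=True,
                      color=(0, 255, 0), thickness=2)

        # Foreground semantic pixel coloring inside the frame's bounding region.
        x, y, w, h = cv2.boundingRect(box)
        roi = canvas[y:y + h, x:x + w]
        mh = min(m["mask"].shape[0], roi.shape[0])
        mw = min(m["mask"].shape[1], roi.shape[1])
        roi[:mh, :mw][m["mask"][:mh, :mw]] = FOREGROUND_COLOR

        # Text prompt box: the detection frame type above the drawing frame.
        cv2.putText(canvas, m["type"], (x, max(y - 5, 12)),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
    return canvas
```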
step C3, when any first review image is selected by the user, displaying a second prompt message with a confirmation option and a cancel option to the user, prompting the user through the second prompt message whether the currently selected first review image is to be marked as an unqualified image, and marking the currently selected first review image as a corresponding first image to be filtered when the user selects the confirmation option of the second prompt message;
step C4, when the preset end-of-review option on the first image review page is selected by the user, deleting the first to-be-detected images corresponding to the first images to be filtered from the first to-be-detected image sequence, and taking the deleted image sequence as the corresponding first audit image sequence; taking each first detection frame data in the first detection frame data set whose second parent image identifier corresponds to a first image to be filtered as corresponding first detection frame data to be deleted; deleting from the first detection frame segmentation data set the first detection frame segmentation data whose second detection frame identifier corresponds to each first detection frame data to be deleted, and taking the deleted data set as the corresponding first audit detection frame segmentation data set; deleting all the first detection frame data to be deleted from the first detection frame data set, and taking the deleted data set as the corresponding first audit detection frame data set;
and step C5, sending the obtained first audit image sequence, first audit detection frame data set and first audit detection frame segmentation data set back to the task scheduling module 1.
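The data handling in steps C1 and C4 amounts to a join on the detection frame identifier followed by a filtered deletion; a minimal sketch under assumed dictionary layouts (all field and variable names here are illustrative, not prescribed by the embodiment) follows.

```python
def merge_detection_data(detections: dict, segmentations: dict) -> dict:
    """Step C1 sketch: join detection frame data with segmentation data
    on the detection frame identifier.

    detections maps frame_id -> detection frame fields (including an assumed
    'parent_image_id'); segmentations maps frame_id -> semantic segmentation map.
    """
    merged = {}
    for frame_id, det in detections.items():
        seg_map = segmentations.get(frame_id)
        if seg_map is not None:
            merged[frame_id] = {**det, "semantic_segmentation": seg_map}
    return merged

def drop_filtered_images(images: dict, merged: dict, filtered_ids: set):
    """Step C4 sketch: remove rejected images and every detection frame
    whose parent image was rejected."""
    audited_images = {i: img for i, img in images.items() if i not in filtered_ids}
    audited_frames = {fid: d for fid, d in merged.items()
                      if d["parent_image_id"] not in filtered_ids}
    return audited_images, audited_frames
```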
The task scheduling module 1 is further configured to assemble the received first audit image sequence, first audit detection frame data set and first audit detection frame segmentation data set into a corresponding first task output data packet, and to output the first task output data packet to the user through the task output module 5.
The embodiment of the invention provides a processing system for batch image annotation, which comprises: a task scheduling module, a task input module, a manual annotation module, a manual auditing module, a task output module, a multi-modal target detection model, an image feature learning model and an image segmentation model. The task scheduling module extracts the image sequence to be detected from the annotation task received by the task input module; the manual annotation module performs target type text confirmation through interaction with the user according to the annotation mode, and selects part of the images to be detected from the image sequence to be detected for pre-annotation processing; the task scheduling module then invokes the multi-modal target detection model, the image feature learning model and the image segmentation model, and performs target detection, low-resolution detection frame filtering and semantic segmentation processing on the image sequence to be detected according to the target type text sequence and the annotation frame data set output by the manual annotation module, obtaining a corresponding detection frame segmentation data set; the manual auditing module performs manual auditing according to the image sequence to be detected, the detection frame data set and the detection frame segmentation data set; and the task scheduling module assembles the audit output of the manual auditing module into a corresponding task output data packet and outputs it through the task output module. Each time the system processes a massive image annotation task, only a small number of images need to be selected from the mass of images and pre-annotated according to the target types to be labeled; the system then automatically annotates the remaining images according to the pre-annotated target types and annotation frames, and provides a manual auditing interface for reviewing the annotation results. The system thus shortens the working time of the annotation work, improves its working efficiency, and reduces its cost.
Those of skill in the art will further appreciate that the systems, modules, units and algorithm steps described in connection with the embodiments disclosed herein may be implemented in electronic hardware, in computer software, or in a combination of the two. To clearly illustrate the interchangeability of hardware and software, the various illustrative components and steps have been described above generally in terms of their function. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a system, module, unit or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The foregoing description of the embodiments has been provided to illustrate the general principles of the invention and is not intended to limit the invention to the particular embodiments disclosed; any modifications, equivalent substitutions, improvements, etc. made within the spirit and principles of the invention are intended to be included within its scope.

Claims (11)

1. A processing system for batch image annotation, the system comprising: the system comprises a task scheduling module, a task input module, a manual labeling module, a manual auditing module, a task output module, a multi-mode target detection model, an image feature learning model and an image segmentation model;
the task scheduling module is respectively connected with the task input module, the manual annotation module, the manual auditing module, the task output module, the multi-mode target detection model, the image feature learning model and the image segmentation model; the multi-mode target detection model defaults to a Grounding DINO model; the image feature learning model defaults to a DINOv2 model; the image segmentation model adopts a SAM model by default;
The task input module is used for sending a first annotation task input by a user to the task scheduling module; the first annotation task comprises a first annotation mode, a first task data type and first task data; the first annotation mode comprises a simple annotation mode and a complex annotation mode; the first task data type comprises an image type and a video type; when the first task data type is an image type the corresponding first task data is an image sequence, and when the first task data type is a video type the corresponding first task data is video data;
the task scheduling module is used for extracting the corresponding first annotation mode, first task data type and first task data from the received first annotation task; identifying the first task data type, and, if the first task data type is an image type, taking the first task data as a corresponding first image sequence to be detected, or, if the first task data type is a video type, performing video frame extraction processing on the first task data and forming all the extracted images into the corresponding first image sequence to be detected in chronological order; and sending the first annotation mode and the first image sequence to be detected to the manual annotation module;
The manual annotation module is used for, when the first annotation mode and the first image sequence to be detected are received, performing target type text confirmation through interaction with the user according to the first annotation mode to obtain a corresponding first target type text sequence; selecting part of the images to be detected from the first image sequence to be detected according to the first annotation mode, the first target type text sequence and interaction with the user, and performing pre-annotation processing to obtain a corresponding first annotation frame data set; and sending the first target type text sequence and the first annotation frame data set back to the task scheduling module;
the task scheduling module is further used for, when the first target type text sequence and the first annotation frame data set are received, invoking the multi-mode target detection model to perform target detection processing on the first image sequence to be detected according to the first target type text sequence to obtain a corresponding first detection frame data set; invoking the image feature learning model to respectively perform corresponding annotation/detection frame image feature recognition processing on the first annotation frame data set and the first detection frame data set to obtain a corresponding first annotation frame feature set and a corresponding first detection frame feature set; performing low-resolution detection frame filtering processing on the first detection frame data set according to the first annotation frame feature set and the first detection frame feature set; invoking the image segmentation model to perform detection frame image semantic segmentation processing on the filtered first detection frame data set to obtain a corresponding first detection frame segmentation data set; and sending the first image sequence to be detected, the first detection frame data set and the first detection frame segmentation data set to the manual auditing module;
The manual auditing module is used for performing manual auditing processing according to the received first image sequence to be detected, the first detection frame data set and the first detection frame segmentation data set, outputting a corresponding first audit image sequence, first audit detection frame data set and first audit detection frame segmentation data set, and sending them back to the task scheduling module;
the task scheduling module is further used for generating a corresponding first task output data packet from the received first audit image sequence, first audit detection frame data set and first audit detection frame segmentation data set; and outputting the first task output data packet to the user through the task output module.
2. The processing system for batch image annotation of claim 1 wherein,
the first image sequence to be detected comprises a plurality of first images to be detected, and each first image to be detected corresponds to one first image identifier;
the first target type text sequence includes one or more first target type texts; when the first annotation mode is a simple annotation mode, the first target type text sequence consists of a plurality of first target type texts, each first target type text being a target type noun without any modifier; when the first annotation mode is a complex annotation mode, the first target type text sequence comprises only one first target type text, the unique first target type text being a target type noun phrase with one or more modifiers;
The first annotation frame data set comprises a plurality of first annotation frame data; the first annotation frame data comprises a first parent image identifier, a first annotation frame image, first annotation frame center point coordinates, a first annotation frame size, a first annotation frame orientation and a first annotation frame type; the first parent image identifier corresponds to one first image identifier; the first annotation frame type corresponds to one first target type text;
the first detection frame data set comprises a plurality of first detection frame data; the first detection frame data comprises a second parent image identifier, a first detection frame image, first detection frame center point coordinates, a first detection frame size, a first detection frame orientation and a first detection frame type; the second parent image identifier corresponds to one of the first image identifiers; the first detection frame type corresponds to one first target type text;
the first detection frame segmentation data set comprises a plurality of first detection frame segmentation data; the first detection frame segmentation data comprises a second detection frame identifier and a first detection frame semantic segmentation map; the second detection frame identifier corresponds to one of the first detection frame identifiers; the pixel semantics of the first detection frame semantic segmentation map include foreground semantics and background semantics, and the foreground semantics correspond to one of the first detection frame types.
3. The processing system for batch image annotation of claim 2 wherein,
the manual annotation module is specifically configured to identify the first annotation mode when performing target type text confirmation through interaction with the user according to the first annotation mode to obtain the corresponding first target type text sequence;
if the first annotation mode is a simple annotation mode, providing a first simple target type input page for the user; receiving a plurality of target type nouns input by the user through the first simple target type input page, taking each input target type noun as a corresponding first target type text, and forming the corresponding first target type text sequence from all the obtained first target type texts;
if the first annotation mode is a complex annotation mode, providing a first complex target type input page for the user; and receiving a target type noun phrase with one or more modifiers input by the user through the first complex target type input page as the corresponding first target type text, the unique first target type text forming the corresponding first target type text sequence.
4. The processing system for batch image annotation of claim 2 wherein,
the manual annotation module is specifically configured, when pre-annotating part of the images to be detected selected from the first image sequence to be detected according to the first annotation mode and the first target type text sequence to obtain the corresponding first annotation frame data set, to provide a first pre-annotation page for the user, the first pre-annotation page being used for displaying all the first to-be-detected images of the first to-be-detected image sequence in a tiled arrangement;
when any first to-be-detected image is selected by the user, taking the currently selected first to-be-detected image as the corresponding current image; providing an annotation frame drawing function for the user to draw annotation frames on the current image so as to obtain one or more corresponding first annotation frames; taking the first image identifier of the current image as the first parent image identifier of each first annotation frame; extracting the annotation frame image of each first annotation frame on the current image as the corresponding first annotation frame image; and taking the annotation frame center point coordinates, annotation frame size and annotation frame orientation of each first annotation frame on the current image as the corresponding first annotation frame center point coordinates, first annotation frame size and first annotation frame orientation;
when any first annotation frame is selected by the user, taking the currently selected first annotation frame as the corresponding current annotation frame; identifying the first annotation mode; if the first annotation mode is a simple annotation mode, providing an annotation frame type marking function for the user to select one first target type text from the first target type text sequence as the corresponding first annotation frame type to mark the current annotation frame; if the first annotation mode is a complex annotation mode, taking the unique first target type text in the first target type text sequence as the corresponding current target type text, displaying a first prompt message with a confirmation option and a cancel option to the user, prompting the user through the first prompt message whether the current target type text is to be used as the first annotation frame type corresponding to the current annotation frame, and setting the first annotation frame type corresponding to the current annotation frame to the corresponding current target type text when the user selects the confirmation option of the first prompt message;
when a pre-annotation submission option preset on the first pre-annotation page is selected by the user, forming corresponding first annotation frame data from the first parent image identifier, the first annotation frame image, the first annotation frame center point coordinates, the first annotation frame size, the first annotation frame orientation and the first annotation frame type corresponding to each first annotation frame; and forming the corresponding first annotation frame data set from all the obtained first annotation frame data.
5. The processing system for batch image annotation of claim 2 wherein,
the task scheduling module is specifically configured to traverse the first to-be-detected image of the first to-be-detected image sequence when the multi-mode target detection model is invoked to perform target detection processing on the first to-be-detected image sequence according to the first target type text sequence to obtain a corresponding first detection frame data set; the first image to be detected which is traversed currently is used as a corresponding current image to be detected, and the first image identifier corresponding to the current image to be detected is used as a corresponding current image identifier; inputting the first target type text sequence and the current to-be-detected image into the multi-mode target detection model, and carrying out directional target detection on the current to-be-detected image by the multi-mode target detection model according to one or more first target type texts in the first target type text sequence and outputting a corresponding first detection frame-text pair set; if the first detection frame-text pair set is not empty, carrying out detection frame data assembly according to the current image identification, the current image to be detected and the first detection frame-text pair set to obtain a corresponding first detection frame data subset; when the traversing is finished, combining all the obtained first detection frame data subsets to form a corresponding first detection frame data set;
Wherein the first set of detection box-text pairs comprises a plurality of first detection box-text pairs; the first detection box-text pair comprises a first target detection box and a first text; the first target detection frame comprises a first target detection frame center point coordinate, a first target detection frame size and a first target detection frame orientation; the first text corresponds to one of the first target type texts in the sequence when the number of the first target type texts in the first target type text sequence is not unique; and when the number of the first target type texts in the first target type text sequence is unique, the first text corresponds to the unique first target type text in the sequence.
6. The processing system for batch image annotation of claim 5 wherein,
the task scheduling module is specifically configured to traverse the first detection frame-text pairs of the first detection frame-text pair set when performing detection frame data assembly according to the current image identifier, the current to-be-detected image and the first detection frame-text pair set to obtain the corresponding first detection frame data subset; during the traversal, taking the currently traversed first detection frame-text pair as the corresponding current detection frame-text pair; taking the current image identifier as the corresponding second parent image identifier; allocating a unique detection frame identifier to the first target detection frame of the current detection frame-text pair as the corresponding first detection frame identifier; extracting the detection frame image of the first target detection frame of the current detection frame-text pair on the current to-be-detected image as the corresponding first detection frame image; taking the first target detection frame center point coordinates, first target detection frame size and first target detection frame orientation of the first target detection frame of the current detection frame-text pair as the corresponding first detection frame center point coordinates, first detection frame size and first detection frame orientation; taking the first text of the current detection frame-text pair as the corresponding first detection frame type; forming corresponding first detection frame data from the obtained second parent image identifier, first detection frame image, first detection frame center point coordinates, first detection frame size, first detection frame orientation and first detection frame type; and, when the traversal is finished, forming the corresponding first detection frame data subset from all the obtained first detection frame data.
7. The processing system for batch image annotation of claim 2 wherein,
the task scheduling module is specifically configured, when invoking the image feature learning model to perform corresponding annotation/detection frame image feature recognition processing on the first annotation frame data set and the first detection frame data set to obtain the corresponding first annotation frame feature set and first detection frame feature set, to input each first annotation frame image of the first annotation frame data set into the image feature learning model, the image feature learning model performing image feature extraction processing on the first annotation frame image to obtain the corresponding first annotation frame feature; to input each first detection frame image of the first detection frame data set into the image feature learning model, the image feature learning model performing image feature extraction processing on the first detection frame image to obtain the corresponding first detection frame feature; and to form the corresponding first annotation frame feature set from all the obtained first annotation frame features, and the corresponding first detection frame feature set from all the obtained first detection frame features.
8. The processing system for batch image annotation of claim 2 wherein,
the task scheduling module is specifically configured to traverse the first detection frame features of the first detection frame feature set when performing the low-resolution detection frame filtering processing on the first detection frame data set according to the first annotation frame feature set and the first detection frame feature set; during the traversal, to take the currently traversed first detection frame feature as the corresponding current detection frame feature, and the first detection frame type of the first detection frame data corresponding to the current detection frame feature as the corresponding current detection frame type; to take each first annotation frame data in the first annotation frame data set matching the current detection frame type as corresponding matching annotation frame data; to take the first annotation frame features corresponding to the matching annotation frame data in the first annotation frame feature set as corresponding similar annotation frame features; to match and score the current detection frame feature against each similar annotation frame feature based on the Hungarian matching algorithm to obtain corresponding first scores, and to average all the obtained first scores to generate a corresponding first average score; and to delete the first detection frame data corresponding to the current detection frame feature from the first detection frame data set when the first average score is lower than a preset scoring threshold.
9. The processing system for batch image annotation of claim 2 wherein,
the task scheduling module is specifically configured to traverse the first detection frame data of the first detection frame data set when invoking the image segmentation model to perform the detection frame image semantic segmentation processing on the filtered first detection frame data set to obtain the corresponding first detection frame segmentation data set; during the traversal, to take the currently traversed first detection frame data as the corresponding current detection frame data; to input the first detection frame image of the current detection frame data into the image segmentation model, the image segmentation model performing pixel-level foreground/background semantic segmentation processing on the first detection frame image to generate the corresponding first detection frame semantic segmentation map; to mark each pixel point on the first detection frame semantic segmentation map whose pixel semantics are not background semantics as a corresponding first foreground pixel point, and to set the pixel semantics of each first foreground pixel point to the first detection frame type of the current detection frame data; to take the first detection frame identifier of the current detection frame data as the corresponding second detection frame identifier; to form corresponding first detection frame segmentation data from the obtained second detection frame identifier and first detection frame semantic segmentation map; and, when the traversal is finished, to form the corresponding first detection frame segmentation data set from all the obtained first detection frame segmentation data.
10. The processing system for batch image annotation of claim 2 wherein,
the manual auditing module is specifically configured, when the first image sequence to be detected, the first detection frame data set and the first detection frame segmentation data set are received, to merge the first detection frame data set and the first detection frame segmentation data set according to the correspondence of detection frame identifiers to obtain a corresponding second detection frame data set, to perform manual auditing processing, and to output the corresponding first audit image sequence, first audit detection frame data set and first audit detection frame segmentation data set to the task scheduling module; wherein the second detection frame data set includes a plurality of second detection frame data; the second detection frame data comprises the second parent image identifier, the first detection frame image, the first detection frame center point coordinates, the first detection frame size, the first detection frame orientation, the first detection frame type and the first detection frame semantic segmentation map;
traversing each first to-be-detected image of the first to-be-detected image sequence; during the traversal, taking the currently traversed first to-be-detected image as the corresponding current to-be-detected image; taking the first image identifier corresponding to the current to-be-detected image as the corresponding current image identifier; recording each second detection frame data in the second detection frame data set whose second parent image identifier matches the current image identifier as corresponding first matching detection frame data; identifying whether the number of first matching detection frame data is zero; if the number of first matching detection frame data is zero, marking the current to-be-detected image as a corresponding first image to be filtered; if the number of first matching detection frame data is not zero, performing corresponding detection frame drawing, foreground semantic pixel coloring and text prompt box drawing processing on the current to-be-detected image according to all the first matching detection frame data to obtain a corresponding first review image; and, when the traversal is finished, providing a first image review page for the user and displaying all the first review images in a tiled arrangement on the first image review page;
when any first review image is selected by the user, displaying a second prompt message with a confirmation option and a cancel option to the user, prompting the user through the second prompt message whether the currently selected first review image is to be marked as an unqualified image, and marking the currently selected first review image as the corresponding first image to be filtered when the user selects the confirmation option of the second prompt message;
when the preset end-of-review option on the first image review page is selected by the user, deleting the first to-be-detected images corresponding to the first images to be filtered from the first to-be-detected image sequence, and taking the deleted image sequence as the corresponding first audit image sequence; taking each first detection frame data in the first detection frame data set whose second parent image identifier corresponds to a first image to be filtered as corresponding first detection frame data to be deleted; deleting from the first detection frame segmentation data set the first detection frame segmentation data whose second detection frame identifier corresponds to each first detection frame data to be deleted, and taking the deleted data set as the corresponding first audit detection frame segmentation data set; deleting all the first detection frame data to be deleted from the first detection frame data set, and taking the deleted data set as the corresponding first audit detection frame data set; and sending the obtained first audit image sequence, first audit detection frame data set and first audit detection frame segmentation data set back to the task scheduling module.
11. The processing system for batch image annotation of claim 10 wherein,
the manual auditing module is specifically configured, when performing the corresponding detection frame drawing, foreground semantic pixel coloring and text prompt box drawing processing on the current to-be-detected image according to all the first matching detection frame data to obtain the corresponding first review image, to traverse each first matching detection frame data; during the traversal, to take the currently traversed first matching detection frame data as the corresponding current matching detection frame data; to draw a detection frame on the current to-be-detected image according to the first detection frame center point coordinates, first detection frame size and first detection frame orientation of the current matching detection frame data to obtain a corresponding first drawing frame; to mark the foreground semantic pixel points of the image inside the first drawing frame according to the first detection frame semantic segmentation map of the current matching detection frame data, and to set the color of the foreground semantic pixel points of the first drawing frame with the preset first color; to draw a text prompt box at a designated position on the first drawing frame as a corresponding first text box, and to set the text content of the first text box to the first detection frame type of the current matching detection frame data; and, when the traversal is finished, to take the current to-be-detected image with the added drawing information as the corresponding first review image.
CN202311438309.4A 2023-11-01 2023-11-01 Processing system for batch image annotation Pending CN117275025A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311438309.4A CN117275025A (en) 2023-11-01 2023-11-01 Processing system for batch image annotation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311438309.4A CN117275025A (en) 2023-11-01 2023-11-01 Processing system for batch image annotation

Publications (1)

Publication Number Publication Date
CN117275025A true CN117275025A (en) 2023-12-22

Family

ID=89214476

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311438309.4A Pending CN117275025A (en) 2023-11-01 2023-11-01 Processing system for batch image annotation

Country Status (1)

Country Link
CN (1) CN117275025A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117690031A (en) * 2024-02-04 2024-03-12 中科星图数字地球合肥有限公司 SAM model-based small sample learning remote sensing image detection method
CN117690031B (en) * 2024-02-04 2024-04-26 中科星图数字地球合肥有限公司 SAM model-based small sample learning remote sensing image detection method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination