CN117152553A - Image label generation method, device and system, medium and computing device

Info

Publication number
CN117152553A
CN117152553A (application CN202310947864.3A)
Authority
CN
China
Prior art keywords
target object
image
pose
pixel region
surgical instrument
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310947864.3A
Other languages
Chinese (zh)
Inventor
Name not to be published at the inventor's request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Konuositeng Technology Co ltd
Original Assignee
Shenzhen Konuositeng Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Konuositeng Technology Co ltd filed Critical Shenzhen Konuositeng Technology Co ltd
Priority to CN202310947864.3A
Publication of CN117152553A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B 34/00 Computer-aided surgery; Manipulators or robots specially adapted for use in surgery
    • A61B 34/10 Computer-aided planning, simulation or modelling of surgical operations
    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B 34/00 Computer-aided surgery; Manipulators or robots specially adapted for use in surgery
    • A61B 34/30 Surgical robots
    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B 34/00 Computer-aided surgery; Manipulators or robots specially adapted for use in surgery
    • A61B 34/70 Manipulators specially adapted for use in surgery
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/70 Determining position or orientation of objects or cameras
    • G06T 7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]

Abstract

A label generation method, apparatus and system, medium and computing device for an image, the method comprising: acquiring a first image comprising a target object and acquiring an initial pose of the target object when the first image is acquired; acquiring a three-dimensional model of the target object; projecting the three-dimensional model onto the first image based on the initial pose to obtain a projection pixel region; calibrating the initial pose based on the overlapping degree between the projection pixel region and the target pixel region where the target object is located in the first image to obtain a calibrated pose; generating a second image including the target object based on the first image; and generating tag information of the target object in the second image based on the calibration pose.

Description

Image label generation method, device and system, medium and computing device
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to a method, an apparatus, a system, a medium, and a computing device for generating a label of an image.
Background
Typically, a neural network is employed to track a target object. Training a neural network, however, relies on a large number of labeled images. If the labels of the images are inaccurate, the trained neural network is inaccurate, and the tracking result for the target object is in turn inaccurate. In the related art, the label of a target object is generally obtained from sensor measurements of the target object, and errors in the sensor measurements make the obtained label inaccurate.
Disclosure of Invention
Based on this, the embodiments of the disclosure provide a label generation method, device, system, medium and computing device for an image, so as to generate more accurate labels for images.
In a first aspect, an embodiment of the present disclosure provides a method for generating a label of an image, the method including: acquiring a first image comprising a target object and acquiring an initial pose of the target object when the first image is acquired; acquiring a three-dimensional model of the target object; projecting the three-dimensional model onto the first image based on the initial pose to obtain a projection pixel region; calibrating the initial pose based on the overlapping degree between the projection pixel region and the target pixel region where the target object is located in the first image to obtain a calibrated pose; generating a second image including the target object based on the first image; and generating tag information of the target object in the second image based on the calibration pose.
In some embodiments, the first image is obtained by image acquisition of the target object in a first preset background.
In some embodiments, the generating a second image including the target object based on the first image includes: and replacing the first preset background in the first image with a second preset background to obtain the second image.
In some embodiments, the method further comprises: post-processing the second image; the post-processing includes at least one of: blurring processing, sharpening processing, noise reduction processing, and enhancement processing.
In some embodiments, the calibration pose is a pose of the target object when the degree of overlap is maximized.
In some embodiments, the calibrating the initial pose based on the overlapping degree between the projected pixel region and the target pixel region where the target object is located in the first image, to obtain a calibrated pose includes: after optimizing the initial pose by adopting a preset pose optimization algorithm, recalculating the overlapping degree between the projection pixel region and the target pixel region; and determining the pose corresponding to the projection pixel region with the largest overlapping degree of the target pixel region as the calibration pose.
In some embodiments, the degree of overlap between the projected pixel region and the target pixel region is determined based on the IoU, GIoU, or Dice loss of the projected pixel region and the target pixel region.
In some embodiments, the method further comprises: acquiring a mask of the target object in the first image; the degree of overlap is determined based on the mask of the target object and the projected pixel region.
In some embodiments, prior to determining the overlap based on the mask of the target object and the projected pixel region, the method further comprises: and smoothing the mask.
In some embodiments, the target object comprises at least one surgical instrument, each surgical instrument being held on one of the robotic arms of the surgical robot, and the robotic arm being provided with a sensor for acquiring an initial pose of the surgical instrument held on the robotic arm; the first image is acquired by an image acquisition device.
In some embodiments, the three-dimensional model of the surgical instrument corresponds to a type and model of the surgical instrument; the obtaining the three-dimensional model of the target object comprises the following steps: according to the type and model of the surgical instrument held on the mechanical arm, a three-dimensional model of the surgical instrument held on the mechanical arm is obtained.
In some embodiments, the method further comprises: determining the type and model of the surgical instrument held on each robotic arm based on an operation log of the surgical robot; or determining the type and model of the surgical instrument held on each robotic arm based on user input.
In some embodiments, the tag information includes the calibration pose and the type and model of the surgical instrument.
In some embodiments, the second image and tag information of the target object in the second image are used to train a neural network, which is used to track the target object.
In a second aspect, an embodiment of the present disclosure provides a label generating apparatus for an image, the apparatus including: the acquisition module is used for acquiring a first image comprising a target object, an initial pose of the target object when the first image is acquired, and a three-dimensional model of the target object; the projection module is used for projecting the three-dimensional model of the target object onto the first image based on the initial pose to obtain a projection pixel region; the calibration module is used for calibrating the initial pose based on the overlapping degree between the projection pixel region and the target pixel region where the target object is located in the first image, so as to obtain a calibration pose; a generation module for generating a second image including the target object based on the first image; and the determining module is used for determining the label information of the target object in the second image based on the calibration pose.
In a third aspect, embodiments of the present disclosure provide a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the method according to any of the embodiments of the first aspect.
In a fourth aspect, embodiments of the present disclosure provide a computing device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the method of any one of the embodiments of the first aspect when the program is executed.
In a fifth aspect, embodiments of the present disclosure provide a label generation system for an image, the system comprising: the image acquisition device is used for acquiring a first image of the target object; the pose sensor is used for acquiring the initial pose of the target object when the first image is acquired; and a computing device according to the fourth aspect.
In some embodiments, the target object is a surgical instrument; the system further comprises: a surgical robot comprising at least one mechanical arm, wherein each mechanical arm is used for holding one surgical instrument and is provided with the pose sensor.
In some embodiments, the target object is a surgical instrument; the system further comprises: a surgical robot comprising at least two mechanical arms, wherein the image acquisition device and the surgical instrument are held by different mechanical arms, and the pose sensor is arranged at least on the mechanical arm holding the surgical instrument.
In the embodiments of the disclosure, a first image including a target object is acquired, and the three-dimensional model of the target object is projected onto the first image based on the initial pose of the target object at the time the first image was acquired, so as to obtain a projected pixel region; the initial pose is then calibrated based on the degree of overlap between the projected pixel region and the target pixel region, yielding a more accurate calibrated pose of the target object in the first image. Generating the label information of the target object in a second image including the target object based on the calibrated pose therefore produces more accurate label information and improves the accuracy of the labels.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the technical aspects of the disclosure.
Fig. 1 is a schematic diagram of a robotic surgical system of an embodiment of the present disclosure.
Fig. 2 is a schematic view of a patient side robot of an embodiment of the present disclosure.
Fig. 3 is a flowchart of a label generation method of an image of an embodiment of the present disclosure.
Fig. 4 is a general flow chart of the method shown in fig. 3.
Fig. 5 is a schematic diagram of an image in the process flow shown in fig. 4.
Fig. 6 is a flowchart of a label generation method of an image of another embodiment of the present disclosure.
Fig. 7 is a general flow chart of the method shown in fig. 6.
Fig. 8 is a flowchart of a method of tracking a target object of an image of an embodiment of the present disclosure.
Fig. 9A is a schematic structural diagram of a neural network of an embodiment of the present disclosure.
Fig. 9B is a schematic diagram of a more specific neural network according to an embodiment of the present disclosure.
Fig. 10A is a general flow chart of the method shown in fig. 8.
Fig. 10B is a schematic diagram of a multi-target tracking process.
Fig. 11 is a block diagram of a label generating apparatus of an image of an embodiment of the present disclosure.
Fig. 12 is a block diagram of a label generating apparatus of an image of another embodiment of the present disclosure.
Fig. 13 is a block diagram of a target object tracking device of an embodiment of the present disclosure.
Fig. 14 is a schematic diagram of a computing device of an embodiment of the present disclosure.
Fig. 15 is a schematic diagram of a label generation system for an image of an embodiment of the present disclosure.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the appended claims.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used in this disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality.
It should be understood that although the terms first, second, third, etc. may be used in this disclosure to describe various information, these information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present disclosure. The word "if" as used herein may be interpreted as "when" or "upon" or "in response to determining", depending on the context.
In order to better understand the technical solutions in the embodiments of the present disclosure and make the above objects, features and advantages of the embodiments of the present disclosure more comprehensible, the technical solutions in the embodiments of the present disclosure are described in further detail below with reference to the accompanying drawings.
A large number of tagged images need to be used to train the neural network Net before it can be used to track the target object Obj. The following is illustrative in connection with a specific application scenario. It is to be understood that the following description is merely exemplary in nature and is in no way intended to limit the present disclosure.
In a surgical scenario, the target object Obj includes a surgical instrument X. Surgical instruments X are widely used in a variety of procedures, and a surgeon may manipulate a surgical instrument X through the robotic surgical system 10 to perform a surgical procedure. Fig. 1 shows a schematic diagram of the robotic surgical system 10. In operation, a patient is positioned in front of a patient side robot (Patient Side Robot, PSR) 101. The patient side robot 101 includes one or more robotic arms 101a, and the distal end of each robotic arm 101a is configured to hold one or more surgical instruments X. The surgeon may control the robotic arms 101a via a Surgeon Console (SGC) 102 so as to control the surgical instruments X to perform a surgical procedure on the patient. A robotic arm 101a may also hold an image acquisition device (e.g., an endoscope camera, not shown); the surgeon may control the robotic arm 101a holding the endoscope camera via the surgeon console 102 to move and hold the endoscope camera near the patient's focal region, so as to acquire a surgical view including the patient's focal region and its surrounding tissue as well as the surgical instruments X. During surgery, the surgical instrument X and/or the endoscope camera on the robotic arm 101a is inserted into the patient through a pre-set aperture and may be rotated about the center point of that aperture (commonly referred to as the remote center of motion (RCM) point). The images captured by the endoscope camera are transmitted to a Vision Cart (VCT) 103 for image processing and recording, and the processed images are displayed on the respective display devices of the vision cart 103 and the surgeon console 102 for viewing by the surgeon and other surgical staff.
During surgery, it is often desirable to have accurate pose information of the surgical instrument available so that the surgical instrument can be tracked in real time; this is particularly valuable when the surgical instrument is outside the field of view of the endoscope camera or is occluded within it. In some cases, a neural network is utilized to track surgical instruments. The neural network model needs to be trained before being deployed to the actual application scenario.
Data set preparation
Training a neural network model requires preparing, as a dataset, a large number of images containing the target object together with labels that carry information about the target object in those images. In the application scenario of the disclosed embodiments, the target object Obj includes a surgical instrument X, and the label information may include accurate pose information of the surgical instrument X.
Fig. 2 shows a schematic view of the patient side robot 101. As shown in fig. 2, the patient side robot 101 includes a chassis 101b, a push handle 101c, and at least one robotic arm 101a (only one robotic arm 101a is shown in the figure for convenience of illustration), and each robotic arm 101a includes an adjustment arm 101a-1 and an operation arm 101a-2. The operation arm 101a-2 carries one or more sensors, such as a displacement meter, an orientation sensor, and/or a position sensor. The detected values of these sensors can be used to obtain kinematic data of the robotic arm 101a and of the surgical instrument X held on it, for example, pose information of the surgical instrument X. However, because the sensor measurements contain errors and the transmission errors of the robotic arm 101a accumulate stage by stage, the kinematic data acquired from the sensors is noisy; the pose information of the surgical instrument X generated from this noisy kinematic data is therefore of relatively low accuracy and cannot be used for training the neural network.
Example 1
To solve at least the above problems, an embodiment of the present disclosure provides a label generating method of an image, referring to fig. 3, the method includes:
step S11: acquiring a first image Img1 comprising a target object Obj and acquiring an initial Pose Pose0 of the target object Obj when the first image Img1 is acquired;
step S12: acquiring a three-dimensional model Mod of a target object Obj;
step S13: projecting a three-dimensional model Mod onto a first image Img1 based on the initial Pose Pose0 to obtain a projection pixel region Rm;
step S14: based on the overlapping degree between the projection pixel region Rm and a target pixel region Ro of a target object Obj in the first image Img1, calibrating the initial Pose Pose0 to obtain a calibrated Pose Pose1;
step S15: generating a second image Img2 including the target object Obj based on the first image Img 1;
step S16: generating tag information of the target object Obj in the second image Img2 based on the calibration Pose Pose1.
Details of the implementation of label generation of the images of the present disclosure are exemplified below.
In step S11, the target object Obj may be a surgical instrument X. However, it is understood that in other application scenarios, the target object Obj may be another object, for example, in an image monitoring scenario, the target object Obj may be a monitored object such as a person or an animal; in a traffic scenario, the target object Obj may be a vehicle. For ease of explanation, the following describes aspects of embodiments of the present disclosure, taking the surgical scenario illustrated in fig. 1 and 2 as an example.
An image acquisition device may capture an image of the surgical instrument X, resulting in a first image Img1 comprising the surgical instrument X. The surgical instrument X includes, but is not limited to, one or more of a scalpel, tissue shears, forceps, a needle holder, vascular forceps, and the like. Each surgical instrument X may be held on one of the robotic arms 101a of the surgical robot. The image acquisition device may also be held on a robotic arm 101a of the surgical robot, or mounted on a stand, or fixed at another location (e.g., a wall or a table top). A pose sensor may be provided on the mechanical arm 101a holding the surgical instrument X, for acquiring the initial Pose Pose0 of the surgical instrument X held on that arm when the first image Img1 is acquired. In one example, the robotic arm 101a includes a plurality of sequentially coupled arms connected by rotational joints, with the surgical instrument X mounted on the distal arm; the pose sensor may include an encoder disposed at each rotational joint, a displacement meter disposed on a linear drive module of the distal arm, an encoder disposed on a tool drive module of the distal arm, and the like. Owing to errors of the pose sensor and other factors, the initial Pose Pose0 is noisy and cannot accurately reflect the true pose of the surgical instrument. Further, a pose sensor may be disposed on the mechanical arm 101a holding the image capturing device, for detecting the pose of the image capturing device.
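As a non-limiting illustration only, the following Python sketch shows how per-joint readings of this kind might be chained into the noisy initial Pose Pose0; the function name and the assumption that each joint reading has already been converted into a 4x4 homogeneous transform are illustrative and not part of the disclosure.

```python
import numpy as np

def forward_kinematics(joint_transforms):
    """Chain per-joint 4x4 homogeneous transforms (assumed to be derived from the
    arm's kinematic model and its encoder/displacement readings) from the arm base
    to the distal tool, yielding the noisy initial pose Pose0 of the instrument."""
    pose = np.eye(4)
    for T in joint_transforms:   # base -> ... -> distal tool
        pose = pose @ T
    return pose
```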
In some embodiments, the first image Img1 may be obtained by image acquisition of the surgical instrument X against a first preset background, where the difference between the pixel values of the first preset background and the pixel values of the surgical instrument X may be greater than a preset value. For example, in the case where the surgical instrument X is white, the first preset background may be black. Furthermore, the first preset background may be a solid-colored background (i.e., containing only one color) with little texture. In this way, interference from the color and texture of the first preset background with the subsequent processing of the first image Img1 can be reduced, thereby improving the accuracy of the acquired label information. Alternatively, the image of the surgical instrument X may be acquired in an actual application scenario (e.g., a surgical scene) to obtain the first image Img1.
In step S12, a three-dimensional model Mod of the surgical instrument X may be acquired. Each surgical instrument X has a determined type and model and a corresponding three-dimensional model Mod. For example, surgical instruments can be classified by function into the following types: scalpels, tissue shears, forceps, vascular forceps, and the like. Each type of surgical instrument may further be classified into different models according to its structure, size, and other characteristics. The three-dimensional model Mod of the surgical instrument X is established when its design is completed or before production; the present disclosure does not specifically limit the method of establishing the three-dimensional model Mod. The three-dimensional model Mod of the surgical instrument X may be stored in advance in a corresponding memory, so that the three-dimensional model Mod of the surgical instrument X held on the robotic arm 101a may be acquired from the memory according to the type and model of that surgical instrument X. For example, if the surgical instrument X held by the No. 1 mechanical arm 101a is a No. 10 scalpel, the obtained three-dimensional model Mod is the one corresponding to the No. 10 scalpel; if the surgical instrument X held by the No. 2 mechanical arm 101a is a straight vascular clamp, the obtained three-dimensional model Mod is the one corresponding to the straight vascular clamp.
In some embodiments, the surgical robot may automatically recognize the type and model of the surgical instrument X held on each robotic arm 101a of the patient side robot 101 and record them in the corresponding operation log. The type and model of the surgical instrument X held on each robotic arm 101a may then be determined based on the operation log of the surgical robot. In other embodiments, the type and model of the surgical instrument X held on each robotic arm 101a may also be determined based on user input. For example, the surgeon may manually enter the type and model of the surgical instrument X held on each robotic arm 101a via an input interface of the surgeon console 102.
Further, the correspondence between each type and model of surgical instrument X and the corresponding three-dimensional model Mod may be established in advance. After the type and model of the surgical instrument X held on each robotic arm 101a are acquired, the three-dimensional model Mod corresponding to that surgical instrument X can be acquired based on this correspondence. In this way, the three-dimensional model Mod corresponding to the surgical instrument X can be obtained automatically, reducing manual operation and labor cost.
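A minimal sketch of such a correspondence table is given below; the dictionary keys, mesh file paths, and function name are illustrative assumptions only, not part of the disclosure.

```python
# Illustrative correspondence between (type, model) of a surgical instrument X
# and its pre-built three-dimensional model Mod; all entries are hypothetical.
MODEL_LIBRARY = {
    ("scalpel", "No. 10"): "meshes/scalpel_no10.stl",
    ("vascular_clamp", "straight"): "meshes/vascular_clamp_straight.stl",
}

def get_instrument_model(instrument_type: str, instrument_model: str) -> str:
    """Return the stored 3D model path for the instrument held on a robotic arm."""
    try:
        return MODEL_LIBRARY[(instrument_type, instrument_model)]
    except KeyError:
        raise ValueError(
            f"no 3D model registered for {instrument_type}/{instrument_model}")
```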
In step S13, the three-dimensional model Mod may be a three-dimensional model in a physical coordinate system, for example, a standard model located at the origin of the physical coordinate system and having a specified pose. In the case where the target object is the surgical instrument X, the initial Pose Pose0 of the surgical instrument X may be a pose measured by encoders on the robotic arm holding the surgical instrument X, expressed in a physical coordinate system such as a PSR-based coordinate system or the world coordinate system. Based on the initial Pose Pose0, the three-dimensional model Mod may be projected onto the first image Img1. Specifically, a transformation matrix of the image acquisition device, obtained by calibrating the image acquisition device, may be used to realize the transformation between the physical coordinate system (e.g., the above-mentioned PSR-based coordinate system) and the coordinate system of the image acquisition device. Based on this transformation matrix, the three-dimensional model Mod can be projected onto the first image Img1.
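A minimal point-splatting sketch of this projection step is shown below, assuming a pinhole camera model with a known intrinsic matrix and a calibrated physical-to-camera transform; a full implementation would rasterise the mesh surface rather than individual vertices, and all function and parameter names are assumptions.

```python
import numpy as np

def project_model(points_3d: np.ndarray,      # Nx3 vertices of the model Mod
                  pose: np.ndarray,           # 4x4 instrument pose (e.g. Pose0), physical frame
                  cam_extrinsic: np.ndarray,  # 4x4 physical-frame -> camera-frame transform
                  cam_intrinsic: np.ndarray,  # 3x3 intrinsic matrix of the image acquisition device
                  image_shape: tuple) -> np.ndarray:
    """Splat the model vertices into a binary projection pixel region Rm."""
    n = points_3d.shape[0]
    homo = np.hstack([points_3d, np.ones((n, 1))])       # Nx4 homogeneous points
    cam_pts = (cam_extrinsic @ pose @ homo.T)[:3]        # 3xN points in camera frame
    uv = cam_intrinsic @ cam_pts
    uv = (uv[:2] / uv[2]).T                              # Nx2 pixel coordinates
    mask = np.zeros(image_shape[:2], dtype=np.uint8)
    u = np.clip(uv[:, 0].astype(int), 0, image_shape[1] - 1)
    v = np.clip(uv[:, 1].astype(int), 0, image_shape[0] - 1)
    mask[v, u] = 1                                       # sparse projection pixel region Rm
    return mask
```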
In an ideal case, the projection pixel region Rm of the three-dimensional model Mod on the first image Img1 is completely coincident with the target pixel region Ro in which the target object Obj is located in the first image Img 1. However, since there is a certain error in the initial Pose Pose0, the two do not overlap completely in actual cases, and the error in the initial Pose Pose0 is inversely related to the degree of overlap between the projection pixel region Rm and the target pixel region Ro to some extent. Therefore, in step S14, the initial Pose0 may be calibrated based on the degree of overlap between the projection pixel region Rm and the target pixel region Ro.
The degree of overlap between the projection pixel region Rm and the target pixel region Ro may be determined based on the intersection over union (IoU), the generalized intersection over union (GIoU), a Dice loss, or other parameters that can characterize the degree of overlap.
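For illustration, the sketch below computes IoU and a Dice-based overlap for two binary pixel regions (GIoU would additionally require the smallest region enclosing both); the function names are assumptions.

```python
import numpy as np

def iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection over union of two binary masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / union if union else 0.0

def dice(a: np.ndarray, b: np.ndarray) -> float:
    """Dice coefficient of two binary masks (1 - Dice gives the Dice loss)."""
    inter = np.logical_and(a, b).sum()
    total = a.sum() + b.sum()
    return 2.0 * inter / total if total else 0.0
```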
In some embodiments, a mask of the target object Obj in the first image Img1 may be acquired, and the degree of overlap between the projection pixel region Rm and the target pixel region Ro may be determined based on the mask of the target object Obj and the projection pixel region Rm. Image processing may be performed on the first image Img1 to remove the background region and thereby obtain the mask of the target object Obj in the first image Img1; alternatively, the mask may be obtained by manual labeling. Optionally, when the first image Img1 is acquired with the surgical instrument X against the first preset background, the mask may be obtained automatically by image processing, since the first preset background generally differs from the target object Obj; when the first image Img1 is acquired in an actual application scenario, the mask may be obtained by manual labeling, which reduces the influence of a complex background and improves the accuracy of the mask. By acquiring the mask, the influence of the background region on the overlap calculation is reduced, which improves the accuracy of the calculated degree of overlap and, in turn, of the acquired label information. Further, before the degree of overlap between the projection pixel region Rm and the target pixel region Ro is determined based on the mask of the target object Obj and the projection pixel region Rm, the mask may be smoothed. Smoothing reduces the influence of random noise and eliminates outlier pixels, thereby improving the accuracy and reliability of the obtained mask.
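Assuming the solid-colour first preset background described above, a rough mask of the target object and its smoothing could look like the following OpenCV sketch; the background colour, threshold, and kernel size are assumptions.

```python
import cv2
import numpy as np

def instrument_mask(img_bgr: np.ndarray,
                    background_bgr=(0, 0, 0),
                    diff_threshold=40) -> np.ndarray:
    """Rough mask of the target object Obj against a solid first preset background."""
    diff = np.linalg.norm(img_bgr.astype(np.int16) - np.array(background_bgr), axis=2)
    mask = (diff > diff_threshold).astype(np.uint8)
    # Smoothing: morphological open/close to suppress random noise and outlier pixels.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    return mask
```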
In some embodiments, the calibration Pose Pose1 is the pose of the target object Obj at which the degree of overlap is maximized. Specifically, the calibration Pose Pose1 can be obtained as follows: after optimizing the initial Pose Pose0 with a preset pose optimization algorithm, the degree of overlap between the projection pixel region Rm and the target pixel region Ro is recalculated; the pose whose projection pixel region Rm has the largest degree of overlap with the target pixel region Ro is determined as the calibration Pose Pose1. The pose optimization algorithm may be a gradient-based optimization algorithm or another global optimization algorithm. Several iterations of optimization may be used to determine the calibration Pose Pose1. In the first iteration, the degree of overlap between the projection pixel region Rm corresponding to the initial Pose Pose0 and the target pixel region Ro is determined, and the initial Pose Pose0 is optimized to obtain the pose after the first iteration. In the second iteration, the degree of overlap between the projection pixel region Rm corresponding to the pose after the first iteration and the target pixel region Ro is determined, and that pose is optimized again to obtain the pose after the second iteration. This continues until a preset termination condition is met, for example, the number of iterations reaches a preset count threshold, the algorithm execution time reaches a preset time threshold, the maximum degree of overlap obtained during the iterations reaches a preset overlap threshold, or the degree of overlap reaches a local maximum. The calibration Pose Pose1 obtained in this way can be regarded as the true pose of the target object Obj.
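A simplified random-search version of this iterative calibration is sketched below; `overlap_of` is an assumed callable that projects the model with a candidate pose and returns the degree of overlap, the step size and iteration count are assumptions, and a gradient-based or global optimiser could replace the random perturbations.

```python
import numpy as np

def calibrate_pose(pose0: np.ndarray, overlap_of, n_iters=200, step=1e-3, seed=0):
    """Perturb the pose and keep any perturbation that increases the overlap
    between the projection pixel region Rm and the target pixel region Ro."""
    rng = np.random.default_rng(seed)
    best_pose, best_score = pose0.copy(), overlap_of(pose0)
    for _ in range(n_iters):
        delta = np.eye(4)
        delta[:3, 3] = rng.normal(scale=step, size=3)   # small translation tweak
        # (a fuller implementation would also perturb the orientation)
        candidate = delta @ best_pose
        score = overlap_of(candidate)
        if score > best_score:                          # keep only improving poses
            best_pose, best_score = candidate, score
    return best_pose, best_score                        # calibration Pose1 and its overlap
```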
In step S15, a second image Img2 including the target object Obj may be generated based on the first image Img1. For example, the first preset background in the first image Img1 may be replaced by a second preset background to obtain the second image Img2. The second preset background is typically obtained by photographing human tissue with an image acquisition device such as an endoscope. Further, before the background is replaced, operations such as brightness adjustment, orientation transformation and/or scale transformation may be performed on the target object Obj in the first image Img1. The second preset background may be related to the application scenario; for example, in the surgical scene of the foregoing embodiments, the second preset background may be a background from a surgical scene.
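A minimal compositing sketch of this background replacement, assuming a binary mask of the target object is available, is given below; the function name is an assumption.

```python
import cv2
import numpy as np

def replace_background(first_img: np.ndarray, mask: np.ndarray,
                       surgical_bg: np.ndarray) -> np.ndarray:
    """Compose the second image Img2: keep the target-object pixels from Img1 and
    take everything else from the second preset background (e.g. endoscope tissue)."""
    bg = cv2.resize(surgical_bg, (first_img.shape[1], first_img.shape[0]))
    m = mask[..., None].astype(bool)          # (H, W, 1) for broadcasting
    return np.where(m, first_img, bg)
```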
In some embodiments, the second image Img2 may also be post-processed so that it more closely resembles an actually captured surgical picture. The post-processing may include, but is not limited to, at least one of: blurring, sharpening, noise reduction, and enhancement.
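For illustration only, a simple post-processing pass combining slight blurring with additive noise might look as follows; the parameter values are assumptions.

```python
import cv2
import numpy as np

def post_process(img: np.ndarray, blur_sigma=1.0, noise_std=3.0) -> np.ndarray:
    """Blur Img2 slightly and add sensor-like noise so that it resembles a
    genuinely captured surgical frame."""
    out = cv2.GaussianBlur(img, (0, 0), blur_sigma)
    noise = np.random.normal(0.0, noise_std, out.shape)
    return np.clip(out.astype(np.float32) + noise, 0, 255).astype(np.uint8)
```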
In step S16, tag information of the target object Obj in the second image Img2 may be generated based on the calibration Pose Pose1. The tag information may include the calibration Pose Pose1 and, in the case where the target object Obj is the surgical instrument X, the type and model of the surgical instrument X. When the second image Img2 is obtained from the first image Img1 by processing such as gray-scale or brightness adjustment, or background replacement, the pose of the target object Obj in the second image Img2 is the same as in the first image Img1, so the calibration Pose Pose1 can be used directly as part of the tag information of the target object Obj in the second image Img2. When processing such as orientation transformation or scale transformation is applied to the target object Obj of the first image Img1, the pose of the target object Obj in the second image Img2 differs from that in the first image Img1; in this case, the calibration Pose Pose1 can be mapped according to the pixel mapping relationship between the second image Img2 and the first image Img1, and the mapped pose is used as part of the tag information of the target object Obj in the second image Img2.
In the case where a plurality of surgical instruments X are included in the first image, the above-described process may be performed for each surgical instrument X separately, thereby obtaining tag information of each surgical instrument X in the second image corresponding to the first image.
The overall flow of the label generation method of the image of the embodiment of the present disclosure will be described below with reference to fig. 4 by taking a surgical scenario as an example. The label generation method of the image comprises the following steps:
step S21: a first image Img1 comprising the surgical instrument X is acquired in a first preset context.
Step S22: an initial Pose Pose0 of the surgical instrument X is acquired.
Step S23: a three-dimensional model Mod of the surgical instrument X is acquired.
Step S24: the mask of the surgical instrument X is extracted from the first image Img1.
Step S25: the three-dimensional model Mod is projected onto the first image Img1.
Step S26: the degree of overlap between the mask of the surgical instrument X and the projected image area of the three-dimensional model Mod on the first image Img1 is calculated.
Step S27: it is determined whether the degree of overlap is maximized. If yes, go to step S29, otherwise go to step S28.
Step S28: the initial Pose Pose0 is adjusted and the process returns to step S26.
Step S29: the Pose when the overlap is maximized is determined as the calibration Pose Pose1.
Step S30: the first preset background in the first image Img1 is replaced by the second preset background.
Step S31: the second image Img2 is post-processed. The calibration Pose Pose1 can be used as label information corresponding to the post-processed second image Img2.
It will be appreciated that the steps of the above method need not be executed in the order of their reference numerals; for example, steps S21, S22 and S23, as well as steps S24 and S25, may be performed in parallel or in any order.
Fig. 5 shows a schematic diagram of the images generated in the process flow shown in fig. 4. First, the surgical instrument X is imaged against a solid-colored background to obtain the first image Img1, and the three-dimensional model Mod of the target object Obj is projected onto the first image Img1 to obtain the projection pixel region Rm. The first image Img1 is segmented to obtain the mask of the surgical instrument X in the first image Img1. Pose optimization is carried out based on the mask and the projection pixel region Rm to obtain the calibration Pose Pose1. Then, the background in the first image Img1 is replaced with a surgical scene, resulting in the second image Img2.
In the above embodiment, by acquiring the first image Img1 including the target object Obj, projecting the three-dimensional model Mod of the target object Obj onto the first image Img1 based on the initial Pose Pose0 of the target object Obj when the first image Img1 is acquired to obtain the projection pixel region Rm, and calibrating the initial Pose Pose0 based on the overlapping degree between the projection pixel region Rm and the target pixel region Ro, the more accurate calibration Pose Pose1 of the target object Obj in the first image Img1 is obtained. Based on the calibration Pose Pose1, generating the tag information of the target object Obj in the second image Img2 including the target object Obj can acquire more accurate tag information.
In some embodiments, the second image Img2 and the tag information of the target object Obj in the second image Img2 may be used to train the neural network Net, and the trained neural network Net may be used to track the target object Obj. For example, in a surgical scene, the neural network Net may be trained using, as a dataset, second images Img2 acquired by the method of the foregoing embodiments (with a surgical instrument as the target object and actually photographed human tissue as the background) together with the tag information acquired by that method. The trained neural network Net can then be used to track the surgical instrument X during surgery. Improving the accuracy of the tracking result requires a large dataset to train the neural network Net, and the method can generate such a dataset automatically and quickly without depending on actually captured surgical pictures. The dataset generated in this way is rich in variety, comprising images of different surgical instruments in different surgical scenes together with the calibration poses of the surgical instruments in those images.
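As an illustration of how the generated pairs (second image, label information) could be consumed, a minimal PyTorch-style training loop is sketched below; the dataset class, the network architecture, and the pose loss are assumptions rather than part of the disclosure.

```python
import torch
from torch.utils.data import DataLoader

def train(net, dataset, epochs=10, lr=1e-4, device="cuda"):
    """Sketch: fit a tracking network Net on (Img2, Pose1) pairs.
    `dataset` is assumed to yield image tensors and matching pose-label tensors."""
    loader = DataLoader(dataset, batch_size=16, shuffle=True)
    optimiser = torch.optim.Adam(net.parameters(), lr=lr)
    net.to(device).train()
    for _ in range(epochs):
        for images, pose_labels in loader:        # pose_labels hold the calibration Pose1
            pred = net(images.to(device))
            loss = torch.nn.functional.mse_loss(pred, pose_labels.to(device))
            optimiser.zero_grad()
            loss.backward()
            optimiser.step()
    return net
```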
In addition, corresponding to the above method, the embodiment of the present disclosure further provides a label generating apparatus for an image, referring to fig. 11, the apparatus including:
an obtaining module 110, configured to obtain a first image Img1 including a target object Obj, an initial Pose Pose0 of the target object Obj when the first image Img1 is acquired, and a three-dimensional model Mod of the target object Obj;
the projection module 120 is configured to project the three-dimensional model Mod of the target object Obj onto the first image Img1 based on the initial Pose Pose0, to obtain a projection pixel region Rm;
the calibration module 130 is configured to calibrate the initial Pose Pose0 based on the degree of overlap between the projection pixel region Rm and a target pixel region Ro where the target object Obj is located in the first image Img1, to obtain a calibrated Pose Pose1;
a generating module 140 for generating a second image Img2 including the target object Obj based on the first image Img 1;
the determining module 150 is configured to determine tag information of the target object Obj in the second image Img2 based on the calibration Pose Pose1.
In some embodiments, the first image is obtained by image acquisition of the target object in a first preset background.
In some embodiments, the generating module is specifically configured to: and replacing the first preset background in the first image Img1 with a second preset background to obtain a second image Img2.
In some embodiments, the apparatus further comprises: the post-processing module is used for carrying out post-processing on the second image Img 2; the post-processing includes at least one of: blurring processing, sharpening processing, noise reduction processing, and enhancement processing.
In some embodiments, the calibration Pose Pose1 is the Pose of the target object Obj when the degree of overlap is maximized.
In some embodiments, the calibration module is specifically configured to: after optimizing the initial Pose Pose0 by adopting a preset Pose optimization algorithm, recalculating the overlapping degree between the projection pixel region Rm and the target pixel region Ro; the Pose corresponding to the projection pixel region Rm having the largest overlapping degree with the target pixel region Ro is determined as the calibration Pose1.
In some embodiments, the degree of overlap between the projected pixel region Rm and the target pixel region Ro is determined based on the IoU, GIoU, or dice loss of the projected pixel region Rm and the target pixel region Ro.
In some embodiments, the apparatus further comprises: the mask acquisition module is used for acquiring a mask of the target object Obj in the first image Img 1; and the overlapping degree determining module is used for determining the overlapping degree based on the mask of the target object Obj and the projection pixel region Rm.
In some embodiments, the apparatus further comprises: a smoothing processing module, configured to smooth the mask before the degree of overlap is determined based on the mask of the target object Obj and the projection pixel region Rm.
In some embodiments, the target object Obj includes at least one surgical instrument X, each surgical instrument X being held on one of the robotic arms 101a of the surgical robot, and the robotic arm 101a being provided with a sensor for acquiring an initial pose of the surgical instrument X held on the robotic arm 101 a; the first image Img1 is acquired by an image acquisition device.
In some embodiments, the three-dimensional model Mod of the surgical instrument X corresponds to the type and model of the surgical instrument X; the acquisition module is specifically configured to: a three-dimensional model Mod of the surgical instrument X held on the robot arm 101a is acquired according to the type and model of the surgical instrument X held on the robot arm 101 a.
In some embodiments, the apparatus further comprises: a type and model determining module, configured to determine the type and model of the surgical instrument X held on each robotic arm 101a based on the operation log of the surgical robot, or to determine the type and model of the surgical instrument X held on each robotic arm 101a based on user input.
In some embodiments, the tag information includes the calibration Pose Pose1 and the type and model of the surgical instrument X.
In some embodiments, the second image Img2 and the tag information of the target object Obj in the second image Img2 are used to train a neural network that is used to track the target object Obj.
In some embodiments, functions or modules included in an apparatus provided by the embodiments of the present disclosure may be used to perform a method described in the foregoing method embodiments, and specific implementations thereof may refer to descriptions of the foregoing method embodiments, which are not repeated herein for brevity.
Example 2
Referring to fig. 6, an embodiment of the present disclosure further provides a label generating method of an image, the method including:
step S41: acquiring an original video frame f0 comprising a target object Obj, and acquiring an initial Pose Pose0 of the target object Obj when the original video frame f0 is acquired;
step S42: acquiring a three-dimensional model Mod of a target object Obj;
step S43: rendering the three-dimensional model Mod based on the initial Pose Pose0 to obtain a rendered video frame fr;
step S44: determining a pose conversion relation T between the target object Obj in the original video frame f0 and the target object Obj in the rendered video frame fr based on the optical flow field of the target object Obj in the original video frame f0 and the optical flow field of the target object Obj in the rendered video frame fr;
step S45: calibrating the initial Pose Pose0 based on the pose conversion relation T to obtain a calibrated Pose Pose1;
step S46: and generating tag information of the target object Obj in the original video frame f0 based on the calibration Pose Pose 1.
In step S41, the original video frame f0 may include one or more frames of a video. The original video frame f0 may be a video frame acquired in a specified scene; generally, the specified scene is a scene in which the target object Obj is actually used, for example, a surgical scene. Besides the target object Obj, the original video frame f0 may include the background of the specified scene. For example, when the specified scene is a surgical scene, the background may include tissue within the body of the subject undergoing the operation, such as a human or animal body. In some embodiments, each original video frame f0 may be the first image Img1 of the first embodiment.
The target object Obj may be the surgical instrument X or another object; the specific type of the target object Obj may differ depending on the actual application scenario. In embodiments where the target object Obj is a surgical instrument X, the surgical instrument X may be held on a robotic arm 101a of the surgical robot. A mechanical arm 101a of the surgical robot may also hold an image acquisition device for acquiring the original video frame f0; alternatively, the image acquisition device may be mounted on a stand or fixed at another location (e.g., a wall, a table top, or the patient's bedside). A pose sensor may be provided on the mechanical arm 101a holding the surgical instrument X, for acquiring the initial Pose Pose0 of the surgical instrument X held on that arm when the original video frame f0 is acquired. Further, a pose sensor may be disposed on the mechanical arm 101a holding the image capturing device, for detecting the pose of the image capturing device. In one example, the robotic arm 101a includes a plurality of sequentially connected link arms, adjacent link arms being connected by a revolute joint, and the pose sensor may include an encoder disposed at each revolute joint for measuring the relative rotation angle between two adjacent link arms.
In step S42, a three-dimensional model Mod of the surgical instrument X may be acquired. The specific embodiment of step S42 can be referred to the previous description of step S12, and will not be repeated here.
In step S43, the three-dimensional model Mod may be a three-dimensional model in a physical coordinate system, for example, a standard model located at the origin of the physical coordinate system and having a specified pose. In the case where the target object is the surgical instrument X, the initial Pose Pose0 of the surgical instrument X may be a pose measured by encoders on the robotic arm holding the surgical instrument X, expressed in a physical coordinate system such as a PSR-based coordinate system or the world coordinate system. Based on the initial Pose Pose0, the three-dimensional model Mod can be rendered to obtain a rendered video frame fr that includes the three-dimensional model Mod. During rendering, the three-dimensional model Mod can be projected into the coordinate system of the image acquisition device according to the initial Pose Pose0 to obtain a projected video frame, and the projected video frame is then rendered to obtain the rendered video frame fr. For example, the initial Pose Pose0 may be converted into a pose in the coordinate system of the image acquisition device based on the transformation matrix of the image acquisition device, and the three-dimensional model Mod is then projected into that coordinate system based on the converted pose to obtain the projected video frame. The transformation matrix represents the transformation relationship between the coordinate system of the image acquisition device and the physical coordinate system. After the projected video frame is obtained, the target object Obj in the projected video frame may be rendered based on a pre-generated color map and texture map.
Assuming that the number of the original video frames f0 is N (N is a positive integer), the three-dimensional model Mod can be rendered based on the initial Pose Pose0 of the target object Obj when each original video frame f0 is acquired, so as to obtain N frames of rendered video frames fr, where each frame of rendered video frame fr corresponds to one frame of original video frame f0.
In step S44, an optical flow field (optical flow) is used to describe motion information in an image sequence; it can be understood as the displacement of each pixel in an image over time. Based on the optical flow field of the target object Obj in the original video frame f0 and the optical flow field of the target object Obj in the rendered video frame fr, the motion and change of the target object Obj between the two video frames can be analyzed to determine the pose difference of the target object Obj between them, which can be represented by the pose conversion relation T (which may be a conversion matrix). For example, the pose conversion relation T may be estimated using a RANSAC (Random Sample Consensus) algorithm or a two-dimensional template matching algorithm.
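A hedged two-dimensional sketch of this estimation is shown below: dense optical flow between the rendered frame fr and the original frame f0 is fitted with a RANSAC 2D transform as a stand-in for the pose conversion relation T; recovering a full 6-DoF relation would additionally use the camera intrinsics and the model geometry, and the flow parameters are assumptions.

```python
import cv2
import numpy as np

def estimate_pose_conversion(original_gray: np.ndarray,
                             rendered_gray: np.ndarray,
                             mask: np.ndarray):
    """Estimate a 2D transform between the target object Obj in the rendered
    frame fr and in the original frame f0, using dense optical flow restricted
    to the target pixel region (mask of Obj in the rendered frame)."""
    flow = cv2.calcOpticalFlowFarneback(rendered_gray, original_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    ys, xs = np.nonzero(mask)
    src = np.stack([xs, ys], axis=1).astype(np.float32)
    dst = src + flow[ys, xs]                         # where each rendered pixel moved to
    T2d, inliers = cv2.estimateAffinePartial2D(src, dst, method=cv2.RANSAC,
                                               ransacReprojThreshold=2.0)
    return T2d                                       # 2x3 proxy for the pose conversion T
```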
In some embodiments, image segmentation may further be performed on the original video frame f0 to obtain the target pixel region Ro where the target object Obj is located in the original video frame f0. The image segmentation of the original video frame f0 may be implemented by a pre-trained image segmentation network, or performed based on information manually labeled by a user. Image segmentation removes the background region of the original video frame and keeps only the target pixel region Ro where the target object Obj is located, thereby reducing the influence of the background region on subsequent operations. Similarly, the target pixel region Ro where the target object Obj is located in a subsequent video frame of the original video frame f0 may also be acquired. The subsequent video frame of the original video frame f0 may be the h-th video frame (h being a positive integer) after the original video frame f0; for example, if the original video frame f0 is the 1st video frame in the video, a subsequent video frame may be the 2nd or a later video frame. Similar to the processing of the original video frame f0, image segmentation may be performed on the subsequent video frame to obtain the target pixel region Ro where the target object Obj is located in that frame. Then, the optical flow field of the target object Obj in the original video frame f0 may be determined based on the target pixel region Ro of the target object Obj in the original video frame f0 and the target pixel region Ro of the target object Obj in the subsequent video frame.
In other embodiments, image segmentation may be omitted, and the optical flow field of the target object Obj in the original video frame f0 and the optical flow field of the target object Obj in the rendered video frame fr may be computed directly on the full frames.
In step S45, the initial Pose Pose0 may be converted based on the Pose conversion relationship T obtained in step S44, thereby obtaining the calibration Pose Pose1.
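If the pose conversion relationship T and the initial Pose Pose0 are both expressed as 4×4 homogeneous matrices in the same coordinate system (an assumption made for illustration), step S45 reduces to a single matrix product:

```python
import numpy as np

def calibrate_pose(pose0: np.ndarray, T: np.ndarray) -> np.ndarray:
    """Apply the pose conversion relationship T to the initial Pose Pose0
    to obtain the calibration Pose Pose1 (both 4x4 homogeneous matrices)."""
    return T @ pose0
```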
For step S46, reference may be made to step S16 in the first embodiment, which will not be described again here.
According to the embodiment of the disclosure, the optical flow field of the target object Obj in the original video frame f0 is acquired; the three-dimensional model Mod of the target object Obj is rendered based on the initial Pose Pose0 of the target object Obj at the time the original video frame f0 was acquired to obtain the rendered video frame fr; the optical flow field of the target object Obj in the rendered video frame fr is acquired; and the Pose conversion relationship T between the target object Obj in the original video frame f0 and the target object Obj in the rendered video frame fr is then determined based on the two optical flow fields. The Pose conversion relationship T reflects the difference between the motion information of the target object Obj in the original video frame f0 and in the rendered video frame fr, so calibrating the initial Pose Pose0 based on the Pose conversion relationship T yields a more accurate calibration Pose Pose1, and the label information generated based on the calibration Pose Pose1 is accordingly more accurate.
In some embodiments, the tag information may include a calibration Pose Pose1. In the case where the target object Obj is the surgical instrument X, the tag information may also include the type and model of the surgical instrument X.
The overall flow of the label generation method of the image of the embodiment of the present disclosure will be described below with reference to fig. 7 by taking a surgical scenario as an example. The label generation method of the image comprises the following steps:
step S51: the original video frame f0 is acquired.
Step S52: the rendered video frame fr is acquired.
Step S53: the original video frame f0 is input into an image segmentation network for image segmentation.
Step S54: the foreground region (i.e., the target pixel region Ro including the target object Obj) in the original video frame f0 is acquired through the image segmentation network.
Step S55: the optical flow field of the target object Obj in the original video frame f0 is calculated.
Step S56: the optical flow field of the target object Obj in the rendered video frame fr is calculated.
Step S57: and calibrating the initial Pose Pose0 of the target object Obj in the original video frame f0 based on the optical flow field of the target object Obj in the original video frame f0 and the optical flow field of the target object Obj in the rendered video frame fr to obtain a calibrated Pose Pose1, and taking the calibrated Pose Pose1 as label information corresponding to the original video frame f0.
It will be appreciated that the steps of the above method need not be executed in the order of their reference numerals; for example, steps S51 and S52, and steps S55 and S56, may be executed in parallel or in either order.
In some embodiments, the original video frame f0 and the tag information of the target object Obj in the original video frame f0 may be used to train a neural network Net that may be used to track the target object Obj. For example, in a surgical scene, the neural network Net may be trained using, as a data set, video frames in a surgical operation video captured during an actual surgical procedure, and tag information acquired by the method in the foregoing embodiment. The trained neural network Net can be used to track surgical instrument X during surgery. In order to improve the accuracy of the tracking result, a large number of data sets are needed to train the neural network Net, the method can automatically and quickly generate a large number of data sets based on the recorded real operation video, the generated data sets can reflect the real operation scene, and the training quality of the neural network Net can be improved.
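For illustration only, a minimal PyTorch-style training loop over such a generated data set might look as follows; the dataset class and the `compute_loss` method are hypothetical names, not components defined in this disclosure:

```python
import torch
from torch.utils.data import DataLoader

def train(dataset, net, epochs=10, lr=1e-4, device="cuda"):
    """dataset yields (original video frame, label information) pairs produced
    by the label generation method above; net is the tracking neural network Net."""
    loader = DataLoader(dataset, batch_size=8, shuffle=True)
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    net.to(device).train()
    for _ in range(epochs):
        for frames, labels in loader:
            frames, labels = frames.to(device), labels.to(device)
            loss = net.compute_loss(frames, labels)  # hypothetical combined loss
            opt.zero_grad()
            loss.backward()
            opt.step()
    return net
```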
In addition, corresponding to the above method, the embodiment of the present disclosure further provides a label generating apparatus for an image, referring to fig. 12, the apparatus including:
an obtaining module 210, configured to obtain an original video frame f0 including the target object Obj and an initial Pose Pose0 of the target object Obj when the original video frame f0 is acquired, and to obtain a three-dimensional model Mod of the target object Obj;
the rendering module 220 is configured to render the three-dimensional model Mod based on the initial Pose Pose0, so as to obtain a rendered video frame fr;
a determining module 230, configured to determine a pose conversion relationship T between the target object Obj in the original video frame f0 and the target object Obj in the rendered video frame fr based on the optical flow field of the target object Obj in the original video frame f0 and the optical flow field of the target object Obj in the rendered video frame fr;
the calibration module 240 is configured to calibrate the initial Pose Pose0 based on the Pose conversion relationship T to obtain a calibrated Pose Pose1;
the generating module 250 is configured to generate tag information of the target object Obj in the original video frame f0 based on the calibration Pose Pose1.
In some embodiments, the apparatus further comprises: the image segmentation module is used for carrying out image segmentation on the original video frame f0 to obtain a target pixel area Ro where a target object Obj in the original video frame f0 is located; the optical flow field determining module is configured to determine an optical flow field of the target object Obj in the original video frame f0 based on the target pixel area Ro where the target object Obj is located in the original video frame f0 and the target pixel area Ro where the target object Obj is located in the subsequent video frame of the original video frame f 0.
In some embodiments, the rendering module is specifically configured to: projecting a three-dimensional model Mod into a coordinate system of an image acquisition device according to the initial Pose Pose0 to obtain a projection video frame; and rendering the projection video frame to obtain a rendered video frame.
In some embodiments, the target object Obj includes at least one surgical instrument X, each surgical instrument X being held on one of the robotic arms 101a of the surgical robot, and the robotic arm 101a being provided with a sensor for acquiring an initial Pose0 of the surgical instrument X held on the robotic arm 101 a; the original video frame f0 is acquired by an image acquisition device.
In some embodiments, the three-dimensional model Mod of the surgical instrument X corresponds to the type and model of the surgical instrument X; the acquisition module is specifically configured to: a three-dimensional model Mod of the surgical instrument X held on the robot arm 101a is acquired according to the type and model of the surgical instrument X held on the robot arm 101 a.
In some embodiments, the apparatus further comprises: a type and model determining module for determining the type and model of the surgical instrument X held on the respective robotic arms 101a based on the operation log of the surgical robot; or determines the type and model of surgical instrument X held on each robotic arm 101a based on user input.
In some embodiments, the tag information includes the calibration Pose Pose1, the type and model of the surgical instrument X.
In some embodiments, the original video frame f0 and the tag information of the target object Obj in the original video frame f0 are used to train a neural network that is used to track the target object Obj.
In some embodiments, functions or modules included in an apparatus provided by the embodiments of the present disclosure may be used to perform a method described in the foregoing method embodiments, and specific implementations thereof may refer to descriptions of the foregoing method embodiments, which are not repeated herein for brevity.
The embodiment of the disclosure can generate a large number of accurate calibration poses based on noisy initial poses. In the related art, acquiring accurate pose information requires either precise control of the mechanical arm or manual annotation, both of which are costly. By adopting the first or second embodiment of the present disclosure, the accuracy of pose acquisition can be effectively improved while the cost is reduced.
Neural network model and tracking of target objects
Referring to fig. 8, an embodiment of the present disclosure further provides a tracking method of a target object Obj, where the method includes:
step S61: acquiring a video frame f comprising a target object Obj and acquiring an initial Pose Pose0 of the target object Obj when the video frame f is acquired;
Step S62: acquiring a three-dimensional model Mod of a target object Obj;
step S63: based on the video frame f, obtaining predicted Pose information Pose of the target object Obj through a pre-trained neural network Net pre
Step S64: based on the initial Pose Pose0 and the three-dimensional model Mod, acquiring detection Pose information Pose of the target object Obj det
Step S65: for predicted Pose information Pose pre And detecting Pose information Pose det Matching is carried out, and a matching result is obtained;
step S66: and tracking the target object Obj based on the matching result.
In step S61, the target object Obj may be the surgical instrument X or another object. The surgical instrument X may be imaged by an image acquisition device to obtain a video frame f including the surgical instrument X. Both the surgical instrument X and the image acquisition device may be held on the robotic arm 101a of the surgical robot. The image acquisition device may also be mounted on a stand or fixed at another location (e.g., a wall, a table top, or the patient's bedside). A pose sensor may be provided on the mechanical arm 101a holding the surgical instrument X, for acquiring the initial Pose Pose0 of the surgical instrument X held on the mechanical arm 101a when the video frame f is acquired. In one example, the robotic arm 101a includes a plurality of sequentially connected link arms, adjacent link arms are connected by a revolute joint, and the pose sensor may include an encoder disposed at each revolute joint for measuring the relative rotation angle between two adjacent link arms.
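As a sketch of how the encoder readings can yield the initial Pose Pose0, the joint transforms of the link arms may be chained by forward kinematics. The callback `link_transform` below is hypothetical (e.g., built from the arm's kinematic parameters); the snippet only illustrates the chaining:

```python
import numpy as np

def forward_kinematics(joint_angles, link_transform):
    """Chain per-joint 4x4 transforms to obtain the pose of the held
    surgical instrument in the robot base frame.

    joint_angles   : encoder readings, one angle per revolute joint
    link_transform : callable (theta, joint_index) -> 4x4 transform
    """
    T = np.eye(4)
    for i, theta in enumerate(joint_angles):
        T = T @ link_transform(theta, i)
    return T  # initial Pose Pose0 of the held surgical instrument
```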
In step S62, a three-dimensional model Mod of the target object Obj may be acquired. The specific embodiment of step S62 may be referred to the previous description of step S12, and will not be repeated here.
In step S63, the predicted Pose information Pose_pre of the target object Obj in the video frame f may be obtained through the pre-trained neural network Net. The neural network Net can be trained based on sample images and the label information of the sample images. The sample image may be the second image Img2 in the first embodiment, in which case the label information of the sample image may be acquired based on the method in the first embodiment. Alternatively, the sample image may be the original video frame f0 in the second embodiment, in which case the label information of the sample image may be acquired based on the method in the second embodiment.
The predicted Pose information Pose_pre acquired by the neural network Net may include the predicted pixel region, the predicted keypoint information, and the predicted direction information of the target object Obj. The predicted pixel region is the pixel region where the target object Obj is located in the video frame; the predicted keypoint information may include position information of one or more keypoints of the target object Obj in the video frame; and the predicted direction information indicates the posture of the target object Obj in the video frame and may include the yaw angle, roll angle, and/or pitch angle of the target object Obj.
In some embodiments, a bounding box of the target object Obj in the video frame f may be obtained, and the predicted Pose information Pose_pre of the target object Obj in the video frame f may be obtained based on the bounding box of the target object Obj in the video frame f.
Further, video frame f may be based on video frame f and a preceding video frame f of video frame f prior And (3) obtaining a bounding box of the target object Obj in the video frame f according to the tracking result of the target object Obj. Wherein the preceding video frame f prior May include at least one frame of video frame preceding the video frame f in the video to which the video frame f belongs. Assuming video frame f is the mth frame video frame (m is a positive integer) in the video, the previous video frame f prior May include at least one of: m-1 st video frame, m-2 nd video frame, m-3 rd video frame, etc. in the video. Preceding video frame f prior The tracking result of the target object Obj in (a) may comprise the preceding video frame f prior And (3) detecting a result of the bounding box of the target object Obj. Since the position of the target object Obj satisfies the constraint of the physical world, no abrupt change occurs, and thus, based on the previous video frame f prior The bounding box of the target object Obj in the video frame f is obtained through the tracking result of the target object Obj, and the obtaining accuracy of the bounding box can be effectively improved.
After obtaining the bounding box of the target object Obj, the features within the bounding box of the target object Obj in the video frame f can be pooled to obtain the predicted Pose information Pose_pre of the target object Obj. The pooling processing may use max pooling, average pooling, or the like. Pooling facilitates localization of the target object during pose estimation. After the target object is localized, the bounding box of the target object can be restored to its original size, and the predicted Pose information Pose_pre can then be computed.
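The pooling of features inside the bounding box can be illustrated with torchvision's ROI Align operator (one possible pooling choice, used here purely as an example; the output size is illustrative):

```python
import torch
from torchvision.ops import roi_align

def pool_box_features(feats, box, out_size=14):
    """feats: (1, C, H, W) feature map; box: (x1, y1, x2, y2) bounding box of
    the target object in feature-map coordinates. Returns a fixed-size
    (1, C, out_size, out_size) tensor for subsequent pose prediction."""
    rois = torch.tensor([[0.0, *box]], dtype=feats.dtype, device=feats.device)
    return roi_align(feats, rois, output_size=(out_size, out_size),
                     spatial_scale=1.0, aligned=True)
```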
In some embodiments, the neural network Net may include a plurality of feature extraction layers l_f for feature extraction of the video frame f. Among the plurality of feature extraction layers l_f, the features output by at least one first feature extraction layer are used to acquire the bounding box of the target object Obj in the video frame, the features output by at least one second feature extraction layer are used to acquire the predicted Pose information Pose_pre of the target object Obj in the video frame, and each first feature extraction layer is located after each second feature extraction layer.
Fig. 9A and 9B illustrate the structure of the neural network Net of the embodiment of the present disclosure. Assume that the neural network Net includes n-1 (n is a positive integer greater than 1) feature extraction layers l_f in total, where the features output by the 1st to the k-th (k is a positive integer less than n-1) feature extraction layers l_f are used to acquire the predicted Pose information Pose_pre of the target object Obj in the video frame, and the features output by the (k+1)-th to the (n-1)-th feature extraction layers l_f are used to acquire the bounding box of the target object Obj in the video frame. The disclosed embodiments employ a two-stage model as the neural network Net for bounding box detection and pose estimation. In the first stage, the bounding box is detected from the video frame using high-level features, so that more feature information can be acquired and the accuracy of bounding box detection improved. In the second stage, the features within the bounding box are used for pose estimation using low-level features, which are typically associated with geometric structure in the video frame. By using low-level features, the neural network Net can acquire geometric structure information such as edges and corner points from the video frame, improving the accuracy of pose estimation. In addition, low-level features generally have better stability and are less susceptible to factors such as illumination changes and noise, which improves the stability of pose estimation. Furthermore, bounding box information may be detected based on the features output by each of the (k+1)-th to the (n-1)-th feature extraction layers l_f, where the detected bounding box information includes the geometric information (width and height) of the bounding box and the confidence corresponding to the bounding box. By performing non-maximum suppression (NMS) on the detected bounding boxes, the bounding box with the highest confidence can be determined as the bounding box of the target object Obj and the other bounding boxes filtered out. In this way, the detection accuracy of the bounding box can be effectively improved.
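The NMS-based selection of the final bounding box can be sketched as follows (torchvision's standard NMS is used here as a stand-in; the IoU threshold is illustrative):

```python
import torch
from torchvision.ops import nms

def select_bounding_box(boxes, scores, iou_thr=0.5):
    """boxes: (K, 4) candidate boxes from the higher feature extraction layers;
    scores: (K,) confidences. Suppresses overlapping candidates and returns the
    most confident one as the bounding box of the target object."""
    keep = nms(boxes, scores, iou_thr)   # indices sorted by descending score
    best = keep[0]
    return boxes[best], scores[best]
```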
In some embodiments, as shown in Figs. 9A and 9B, each feature extraction layer l_f includes an encoder and a decoder. The output of the encoder C_i of the i-th feature extraction layer l_f is connected to the input of the encoder C_{i+1} of the (i+1)-th feature extraction layer l_f and to the input of the decoder P_i of the i-th feature extraction layer l_f; the input of the decoder P_i of the i-th feature extraction layer l_f is connected to the output of the decoder P_{i+1} of the (i+1)-th feature extraction layer l_f; i is a positive integer. Each encoder is configured to downsample the features input to the encoder, and each decoder is configured to upsample the features input to the decoder. Each encoder and each decoder may include a convolutional neural network, multiple transformer layers, or multiple pooling layers. Each encoder downsamples the features output by the encoder of the preceding feature extraction layer l_f to reduce the feature dimension. Each decoder upsamples the features output by the decoder of the (i+1)-th feature extraction layer l_f and by the encoder of the corresponding feature extraction layer l_f to restore the feature dimension, thereby obtaining a high-resolution pose prediction result.
Referring to fig. 9B, the encoders include encoders formed by connecting several Bottleneck structures of the ResNet network and encoders formed by connecting several BasicBlock structures of the ResNet network. Each decoder may be formed by connecting several convolutional layers. For example, in the figure, the encoder of the 1st feature extraction layer is formed by connecting 3 BasicBlock structures, the encoder of the 2nd feature extraction layer is formed by connecting 4 BasicBlock structures, and the encoders of the 3rd and 4th feature extraction layers are each formed by connecting 2 Bottleneck structures. The BasicBlock and Bottleneck structures are shown as ResNet BasicBlock and ResNet Bottleneck in the figure, respectively, and "×x" in each rectangular box indicates the number of corresponding structures; for example, ResNet BasicBlock ×3 in the encoder of the 1st feature extraction layer indicates that the encoder is formed by connecting 3 BasicBlock structures.
If the encoder of a feature extraction layer is formed by connecting Bottleneck structures, the decoder of that feature extraction layer is formed by connecting several depthwise separable convolution layers; in the embodiment shown in fig. 9B, the decoders of the 3rd and 4th feature extraction layers are each formed by connecting 2 depthwise separable convolution layers, shown as the depthwise separable convolution blocks in the figure. Conv2d in the figure represents a two-dimensional convolution. By adding encoders formed from Bottleneck structures and decoders formed from depthwise separable convolution layers, the number of feature extraction layers can be increased, thereby extracting more features.
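A highly simplified PyTorch sketch of one feature extraction layer with the connection pattern described above is shown below; the block compositions (plain convolutions instead of ResNet BasicBlock/Bottleneck or depthwise separable convolutions) and the channel widths are placeholders chosen for brevity, not the exact structure of Figs. 9A and 9B:

```python
import torch.nn as nn

class FeatureExtractionLayer(nn.Module):
    """Encoder downsamples; decoder upsamples the sum of this layer's encoder
    output and the deeper layer's decoder output (assumed to share the same
    channel width), mirroring the C_i / P_i connections described above."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
            nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(c_out, c_in, 3, padding=1), nn.ReLU(inplace=True))

    def forward_encoder(self, x):
        return self.encoder(x)

    def forward_decoder(self, enc_feat, deeper_dec_feat=None):
        # Fuse the encoder skip connection with the deeper decoder output.
        x = enc_feat if deeper_dec_feat is None else enc_feat + deeper_dec_feat
        return self.decoder(x)
```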
An atrous spatial pyramid pooling (ASPP) module may further be provided between the encoder of the last feature extraction layer and the decoder of the last feature extraction layer. The ASPP module first uses a plurality of parallel convolution modules to perform atrous (dilated) convolution on the features output by the last encoder, with different convolution modules using different dilation rate parameters, so as to obtain different receptive fields. The ASPP module shown in the figure uses 4 parallel convolution modules, whose dilation rate parameters are 1, 3, 6 and 9, respectively.
Then, the features output by the convolution modules are concatenated, the concatenated features are subjected to a depthwise separable convolution, and the resulting features can be output to the decoder of the last feature extraction layer. By using the ASPP module, the receptive field can be enlarged, thereby extracting more features.
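A compact ASPP sketch with the dilation rates 1, 3, 6 and 9 mentioned above (channel handling simplified for illustration) is given below:

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Parallel dilated convolutions, concatenation, then a depthwise
    separable convolution, as a minimal stand-in for the ASPP module."""
    def __init__(self, c_in, c_out, rates=(1, 3, 6, 9)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(c_in, c_out, 3, padding=r, dilation=r) for r in rates)
        c_cat = c_out * len(rates)
        self.fuse = nn.Sequential(                 # depthwise separable conv
            nn.Conv2d(c_cat, c_cat, 3, padding=1, groups=c_cat),
            nn.Conv2d(c_cat, c_out, 1))

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))
```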
The bounding box information obtained after NMS processing can be output to the pooling layer, so that the pooling layer pools the features output by the 1st to the k-th (k is a positive integer less than n-1) feature extraction layers l_f based on the bounding box information. The pooling layer may pool the acquired features using ROI Align. The pooled features are used, on the one hand, to obtain the mask of the target object and, on the other hand, to obtain the predicted Pose information Pose_pre of the target object (including the predicted keypoint positions and predicted keypoint poses of the target object). Conv2d and DeConv2d in the figure represent two-dimensional convolution and two-dimensional deconvolution, respectively, Softmax denotes normalization processing, and Regression denotes regression processing. When determining the positions and poses of the keypoints, the features output by each feature extraction layer can be processed by a Swin Transformer, so that global features are effectively extracted and the accuracy of keypoint detection improved.
It is understood that the structures shown in the figures are merely exemplary structures of neural networks. In addition to the above structures, neural networks with other structures may be used in the embodiments of the present disclosure, which are not described herein.
In step S64, corresponding to the predicted pose information, the detection pose information may include a detection pixel region, detection keypoint information, and detection direction information of the target object Obj.
In some embodiments, the three-dimensional model Mod may be projected into the coordinate system of the image acquisition device according to the initial Pose Pose0 to obtain the detection keypoint information and detection direction information of the target object Obj in the coordinate system of the image acquisition device, and the three-dimensional model Mod may be projected onto the two-dimensional image plane corresponding to the video frame according to the initial Pose Pose0 to obtain the detection pixel region of the target object Obj in the two-dimensional image plane.
The initial Pose Pose0 may represent a relative Pose between the three-dimensional model Mod and the image acquisition device. Based on the initial Pose Pose0, three-dimensional key points on the three-dimensional model Mod can be projected into a coordinate system of an image acquisition device to obtain detection key point information and detection direction information. In addition, based on the initial Pose Pose0, the three-dimensional model Mod may be projected onto a two-dimensional image plane corresponding to the video frame, and the image projected onto the two-dimensional image plane may be detected through the neural network Net, so as to obtain a detection pixel region.
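The projection of the model's keypoints and principal direction for step S64 can be illustrated as follows; the use of a single principal axis and of a pinhole intrinsic matrix K are assumptions made for this sketch:

```python
import numpy as np

def detection_pose_info(keypoints_3d, axis_3d, pose0_cam, K):
    """keypoints_3d: (M, 3) keypoints on the three-dimensional model Mod;
    axis_3d: (3,) model axis; pose0_cam: (4, 4) initial pose expressed in the
    image acquisition device coordinate system; K: (3, 3) intrinsics."""
    kp_h = np.hstack([keypoints_3d, np.ones((len(keypoints_3d), 1))])
    kp_cam = (pose0_cam @ kp_h.T).T[:, :3]        # detection keypoints (3D)

    kp_px = (K @ kp_cam.T).T
    kp_px = kp_px[:, :2] / kp_px[:, 2:3]          # detection keypoints (pixels)

    direction = pose0_cam[:3, :3] @ axis_3d       # detection direction
    return kp_px, direction / np.linalg.norm(direction)
```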
In step S65, the predicted pixel area and the detected pixel area, the predicted key point information and the detected key point information, and the predicted direction information and the detected direction information may be respectively matched, so as to correspondingly obtain a pixel area matching result, a key point matching result, and a direction information matching result. For example, the matching of the above information can be achieved by adopting a bipartite graph matching (bipartite matching) mode.
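Bipartite graph matching between the predicted and detected poses can be realized with the Hungarian algorithm; the construction of the cost matrix below (how region, keypoint and direction terms are weighted) is an assumption for illustration:

```python
from scipy.optimize import linear_sum_assignment

def bipartite_match(cost):
    """cost[i, j]: dissimilarity between predicted pose i and detected pose j
    (e.g. a weighted sum of 1 - region IoU, keypoint distance and direction
    angle). Returns the matched (prediction, detection) index pairs."""
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows.tolist(), cols.tolist()))
```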
In step S66, a first confidence that the predicted pixel region matches the detected pixel region may be determined based on the pixel region matching result, a second confidence that the predicted key point information matches the detected key point information may be determined based on the key point matching result, a third confidence that the predicted direction information matches the detected direction information may be determined based on the direction information matching result, and the target object Obj may be tracked based on the first confidence, the second confidence, and the third confidence.
For example, the first confidence, the second confidence, and the third confidence may be weighted averaged to obtain a weighted average confidence. If the weighted average confidence coefficient is larger than a preset confidence coefficient threshold value, successful matching is determined, the initial pose of the target object Obj is calibrated based on the predicted pose information, and a calibration result is stored. If the weighted average confidence is less than or equal to the preset confidence threshold, the matching is determined to be unsuccessful.
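A minimal decision rule combining the three confidences might be written as follows; the weights and threshold are illustrative values, not values specified by the disclosure:

```python
def match_decision(conf_region, conf_keypoint, conf_direction,
                   weights=(0.4, 0.4, 0.2), threshold=0.5):
    """Weighted average of the first, second and third confidences; the match
    is accepted when the average exceeds the preset confidence threshold."""
    w1, w2, w3 = weights
    score = w1 * conf_region + w2 * conf_keypoint + w3 * conf_direction
    return score > threshold, score
```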
The following describes the overall flow of the tracking method of the target object Obj of the embodiment of the present disclosure, taking a surgical scenario as an example, with reference to fig. 10A. The tracking method of the target object Obj comprises the following steps:
step S71: a video frame f including the surgical instrument X is acquired.
Step S72: and detecting the bounding box of the video frame f to obtain the bounding box of the surgical instrument X in the video frame f.
Step S73: Pose prediction is performed on the surgical instrument X in the video frame f based on the bounding box obtained in step S72, so as to obtain the predicted Pose information Pose_pre.
Step S74: the detection Pose information Pose_det is obtained based on the three-dimensional model Mod of the surgical instrument X and the initial Pose Pose0 of the surgical instrument X, and bipartite graph matching is performed between the predicted Pose information Pose_pre and the detection Pose information Pose_det.
Step S75: and calibrating the initial Pose Pose0 based on the matching result and the confidence coefficient to obtain a calibrated Pose Pose1.
According to the embodiment of the disclosure, the predicted pose information of the target object is obtained through the neural network, the detected pose information of the target object is obtained based on the initial pose of the target object and the three-dimensional model of the target object, the predicted pose information and the detected pose information are matched, the target object is tracked according to the matching result, and the tracking accuracy can be effectively improved.
By the tracking method, the single target object Obj can be tracked, and a plurality of target objects Obj can be tracked.
The multi-target tracking process is illustrated below in connection with the example shown in fig. 10B. In this example, three surgical instruments X (shown as X1, X2 and X3 in the figure) need to be tracked, so after the video frame f is input into the neural network, three sets of predicted Pose information Pose_pre are obtained (shown as Pose_pre,1, Pose_pre,2 and Pose_pre,3 in the figure). Based on the kinematic data (including the initial poses) and the three-dimensional models of the surgical instruments X1, X2 and X3, the detection Pose information of the surgical instruments X1, X2 and X3 can be correspondingly obtained, denoted Pose_det,X1, Pose_det,X2 and Pose_det,X3, respectively. Each of the three sets of predicted Pose information {Pose_pre,1, Pose_pre,2, Pose_pre,3} is matched one by one, via bipartite graph matching, with each of the three sets of detection Pose information {Pose_det,X1, Pose_det,X2, Pose_det,X3}, and the matching confidence is calculated. The matching confidence is compared with a preset confidence threshold to determine the matching result; for example, when the matching confidence is greater than the preset confidence threshold, the matching is determined to be successful and the initial pose is calibrated according to the predicted pose information. In one example, when the predicted Pose information Pose_pre,1 and the detection Pose information Pose_det,X1 are successfully matched, the predicted Pose information Pose_pre,1 is determined to be the predicted Pose information of the surgical instrument X1 and is used to calibrate the initial Pose of the surgical instrument X1; for example, the predicted Pose information Pose_pre,1 and its pose offset from the initial pose may be saved. The processing of the other two surgical instruments X2 and X3 is similar and is not described in detail here. The tracking results of the surgical instruments X1, X2 and X3 can finally be obtained, shown as Pose_X1, Pose_X2 and Pose_X3 in the figure.
Corresponding to the above method, the embodiment of the present disclosure further provides a tracking device for the target object Obj, see fig. 13, where the device includes:
a first obtaining module 310, configured to obtain a video frame f including the target object Obj and an initial Pose Pose0 of the target object Obj when the video frame f is acquired, and to obtain a three-dimensional model Mod of the target object Obj;
a second obtaining module 320, configured to obtain the predicted Pose information Pose_pre of the target object Obj through a pre-trained neural network Net based on the video frame f;
a third obtaining module 330, configured to obtain the detection Pose information Pose_det of the target object Obj based on the initial Pose Pose0 and the three-dimensional model Mod;
a matching module 340, configured to match the predicted Pose information Pose_pre with the detection Pose information Pose_det to obtain a matching result;
and the tracking module 350 is used for tracking the target object Obj based on the matching result.
In some embodiments, the predicted Pose information Pose_pre includes the predicted pixel region, predicted keypoint information and predicted direction information of the target object, and the detection Pose information Pose_det includes the detection pixel region, detection keypoint information and detection direction information of the target object; the matching module is specifically configured to: respectively match the predicted pixel region with the detection pixel region, the predicted keypoint information with the detection keypoint information, and the predicted direction information with the detection direction information.
In some embodiments, the matching result includes a pixel region matching result, a key point matching result, and a direction information matching result, and the tracking module is specifically configured to: determining a first confidence that the predicted pixel region matches the detection pixel region based on the pixel region matching result; determining a second confidence level of matching the predicted key point information with the detected key point information based on the key point matching result; determining a third confidence level of matching the predicted direction information with the detection direction information based on the direction information matching result; and tracking the target object based on the first confidence, the second confidence and the third confidence.
In some embodiments, the third obtaining module is specifically configured to: projecting the three-dimensional model Mod into an image acquisition device coordinate system according to the initial Pose Pose0, and acquiring detection key point information and detection direction information of the target object Obj in the image acquisition device coordinate system, wherein the image acquisition device is used for acquiring the video frame f; and projecting the three-dimensional model Mod to a two-dimensional image plane corresponding to the video frame f according to the initial Pose Pose0, and acquiring a detection pixel region of the target object Obj in the two-dimensional image plane.
In some embodiments, the second obtaining module is specifically configured to: acquire a bounding box of the target object Obj in the video frame f; and acquire the predicted Pose information Pose_pre of the target object Obj in the video frame f based on the bounding box of the target object Obj in the video frame f.
In some embodiments, the neural network Net comprises a plurality of feature extraction layers l_f for extracting features of the video frame f; among the plurality of feature extraction layers l_f, the features output by at least one first feature extraction layer are used to acquire the bounding box of the target object in the video frame f, and the features output by at least one second feature extraction layer are used to acquire the predicted pose information of the target object Obj in the video frame f; wherein each first feature extraction layer is located after a respective second feature extraction layer.
In some embodiments, each of the feature extraction layers includes an encoder and a decoder, the output of the encoder of the i-th feature extraction layer being connected to the input of the encoder of the i+1-th feature extraction layer and to the input of the decoder of the i-th feature extraction layer; the input end of the decoder of the ith feature extraction layer is connected to the output end of the decoder of the (i+1) th feature extraction layer; i is a positive integer; wherein each encoder is configured to downsample a feature input to the encoder and each decoder is configured to upsample a feature input to the decoder.
In some embodiments, the neural network Net obtains the bounding box of the target object Obj in the video frame f by: acquiring the bounding box of the target object Obj in the video frame f based on the video frame f and the tracking result of the target object Obj in a preceding video frame of the video frame f.
In some embodiments, the neural network Net obtains the predicted Pose information Pose_pre of the target object Obj in the video frame f by: pooling the features within the bounding box of the target object Obj in the video frame f to obtain the predicted Pose information Pose_pre of the target object Obj.
In some embodiments, the target object Obj includes at least one surgical instrument X, each surgical instrument X being held on one of the robotic arms 101a of the surgical robot, and the robotic arm 101a being provided with a sensor for acquiring an initial Pose0 of the surgical instrument X held on the robotic arm 101 a; the video frame f is acquired by an image acquisition device.
In some embodiments, the three-dimensional model Mod of the surgical instrument X corresponds to a type and model of the surgical instrument X; the first obtaining module is specifically configured to: a three-dimensional model Mod of the surgical instrument X held on the robot arm 101a is acquired according to the type and model of the surgical instrument X held on the robot arm 101 a.
In some embodiments, the apparatus further comprises: a type and model acquisition module for determining the type and model of the surgical instrument X held on the respective robotic arms 101a based on the operation log of the surgical robot; or determines the type and model of surgical instrument X held on each robotic arm 101a based on user input.
In some embodiments, functions or modules included in an apparatus provided by the embodiments of the present disclosure may be used to perform a method described in the foregoing method embodiments, and specific implementations thereof may refer to descriptions of the foregoing method embodiments, which are not repeated herein for brevity.
The disclosed embodiments also provide a computing device comprising at least a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any of the preceding embodiments when executing the program.
FIG. 14 illustrates a more specific hardware architecture diagram of a computing device 400 provided by embodiments of the present disclosure, which may include: processor 410, memory 420, input/output interface 430, communication interface 440, and bus 450. Wherein processor 410, memory 420, input/output interface 430, and communication interface 440 enable communication connections within the device between each other via bus 450.
The processor 410 may be implemented by a general-purpose central processing unit (Central Processing Unit, CPU), a microprocessor, an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), or one or more integrated circuits, etc. for executing relevant programs to implement the technical solutions provided by the embodiments of the present disclosure. The processor 410 may also include a graphics card, such as an Nvidia Titan X graphics card, a 1080 Ti graphics card, or the like.
The Memory 420 may be implemented in the form of Read Only Memory (ROM), random access Memory (Random Access Memory, RAM), static storage devices, dynamic storage devices, etc. Memory 420 may store an operating system and other application programs, and when the techniques provided by embodiments of the present disclosure are implemented in software or firmware, the associated program code is stored in memory 420 and invoked for execution by processor 410.
The input/output interface 430 is used to connect with an input/output module to realize information input and output. The input/output module may be configured as a component in a device (not shown) or may be external to the device to provide corresponding functionality. Wherein the input devices may include a keyboard, mouse, touch screen, microphone, various types of sensors, etc., and the output devices may include a display, speaker, vibrator, indicator lights, etc.
The communication interface 440 is used to connect communication modules (not shown) to enable communication interactions of the device with other devices. The communication module may implement communication through a wired manner (such as USB, network cable, etc.), or may implement communication through a wireless manner (such as mobile network, WIFI, bluetooth, etc.).
Bus 450 includes a path to transfer information between components of the device (e.g., processor 410, memory 420, input/output interface 430, and communication interface 440).
It should be noted that although the above device only shows the processor 410, the memory 420, the input/output interface 430, the communication interface 440, and the bus 450, in the implementation, the device may further include other components necessary to achieve normal operation. Furthermore, those skilled in the art will appreciate that the above-described apparatus may include only the components necessary to implement the embodiments of the present disclosure, and not all of the components shown in the figures.
Referring to fig. 15, an embodiment of the present disclosure further provides a label generation system of an image, the system including:
an image acquisition device 510, configured to acquire a first image Img1 or an original video frame f0 of a target object;
a pose sensor 520 for acquiring an initial Pose Pose0 of the target object Obj when the first image Img1 or the original video frame f0 is acquired; and
computing device 400 in the previous embodiments.
The disclosed embodiments also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any of the previous embodiments.
Computer readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium which can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
From the foregoing description of the embodiments, it will be apparent to those skilled in the art that the disclosed embodiments may be implemented in software plus a necessary general purpose hardware platform. Based on such understanding, the technical solutions of the embodiments of the present disclosure may be embodied in essence or portions contributing to the prior art in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions to cause a computing device (which may be a personal computer, a server, or a network device, etc.) to perform the method described in the various embodiments or portions of the embodiments of the present disclosure.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer apparatus or entity, or by an article of manufacture having some function. A typical implementation device is a computer, which may be in the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email device, game console, tablet computer, wearable device, or a combination of any of these devices.
The various embodiments in this disclosure are described in a progressive manner, and identical and similar parts of the various embodiments are all referred to each other, and each embodiment is mainly described as different from other embodiments. In particular, for the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points. The apparatus embodiments described above are merely illustrative, in which the modules illustrated as separate components may or may not be physically separate, and the functions of the modules may be implemented in the same piece or pieces of software and/or hardware when implementing embodiments of the present disclosure. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
The foregoing is merely a specific implementation of the embodiments of this disclosure, and it should be noted that, for a person skilled in the art, several improvements and modifications may be made without departing from the principles of the embodiments of this disclosure, which should also be considered as the protection scope of the embodiments of this disclosure.

Claims (20)

1. A method of generating a label for an image, the method comprising:
acquiring a first image comprising a target object and acquiring an initial pose of the target object when the first image is acquired;
acquiring a three-dimensional model of the target object;
projecting the three-dimensional model onto the first image based on the initial pose to obtain a projection pixel region;
calibrating the initial pose based on the overlapping degree between the projection pixel region and the target pixel region where the target object is located in the first image to obtain a calibrated pose;
generating a second image including the target object based on the first image;
and generating tag information of the target object in the second image based on the calibration pose.
2. The method of claim 1, wherein the first image is obtained by image acquisition of the target object in a first preset context.
3. The method of claim 2, wherein the generating a second image comprising the target object based on the first image comprises:
and replacing the first preset background in the first image with a second preset background to obtain the second image.
4. A method according to claim 3, characterized in that the method further comprises:
post-processing the second image; the post-processing includes at least one of: blurring processing, sharpening processing, noise reduction processing, and enhancement processing.
5. The method of claim 1, wherein the calibration pose is a pose of the target object when the overlap is maximized.
6. The method of claim 5, wherein calibrating the initial pose based on the degree of overlap between the projected pixel region and a target pixel region in the first image where the target object is located, comprises:
after optimizing the initial pose by adopting a preset pose optimization algorithm, recalculating the overlapping degree between the projection pixel region and the target pixel region;
And determining the pose corresponding to the projection pixel region with the largest overlapping degree of the target pixel region as the calibration pose.
7. The method of claim 1, wherein the degree of overlap between the projected pixel region and the target pixel region is determined based on IoU, GIoU, or Dice loss of the projected pixel region and the target pixel region.
8. The method according to claim 1, wherein the method further comprises:
acquiring a mask of the target object in the first image;
the degree of overlap is determined based on the mask of the target object and the projected pixel region.
9. The method of claim 8, wherein prior to determining the overlap based on the mask of the target object and the projected pixel region, the method further comprises:
and smoothing the mask.
10. The method according to any one of claims 1 to 9, wherein the target object comprises at least one surgical instrument, each surgical instrument being held on one of the robotic arms of the surgical robot, and the robotic arm being provided with a sensor for acquiring an initial pose of the surgical instrument held on the robotic arm; the first image is acquired by an image acquisition device.
11. The method of claim 10, wherein the three-dimensional model of the surgical instrument corresponds to a type and model of the surgical instrument; the obtaining the three-dimensional model of the target object comprises the following steps:
according to the type and model of the surgical instrument held on the mechanical arm, a three-dimensional model of the surgical instrument held on the mechanical arm is obtained.
12. The method of claim 11, wherein the method further comprises:
determining a type and model of surgical instrument to be held on each robotic arm based on an operation log of the surgical robot; or alternatively
The type and model of surgical instrument held on each robotic arm is determined based on the user input.
13. The method of claim 10, wherein the tag information includes the calibration pose, a type and model of the surgical instrument.
14. The method according to any one of claims 1 to 9, wherein the second image and the label information of the target object in the second image are used for training a neural network for tracking the target object.
15. A label generating apparatus for an image, the apparatus comprising:
the acquisition module is used for acquiring a first image comprising a target object, an initial pose of the target object when the first image is acquired, and a three-dimensional model of the target object;
the projection module is used for projecting the three-dimensional model of the target object onto the first image based on the initial pose to obtain a projection pixel region;
the calibration module is used for calibrating the initial pose based on the overlapping degree between the projection pixel region and the target pixel region where the target object is located in the first image, so as to obtain a calibration pose;
a generation module for generating a second image including the target object based on the first image;
and the determining module is used for determining the label information of the target object in the second image based on the calibration pose.
16. A computer readable storage medium having stored thereon a computer program, characterized in that the program, when executed by a processor, implements the method of any of claims 1 to 14.
17. A computing device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any one of claims 1 to 14 when the program is executed by the processor.
18. A label generation system for an image, the system comprising:
the image acquisition device is used for acquiring a first image of the target object;
the pose sensor is used for acquiring the initial pose of the target object when the first image is acquired; and
the computing device of claim 17.
19. The system of claim 18, wherein the target object is a surgical instrument; the system further comprises:
the surgical robot comprises at least one mechanical arm, each mechanical arm is used for holding one surgical instrument, and each mechanical arm is provided with the pose sensor.
20. The system of claim 18, wherein the target object is a surgical instrument; the system further comprises:
the surgical operation robot comprises at least two mechanical arms, the image acquisition device and the surgical operation instrument are respectively held by different mechanical arms, and the pose sensor is arranged on the mechanical arms which at least hold the surgical operation instrument.
