CN112967276A - Object detection method, object detection device, endoscope system, electronic device, and storage medium - Google Patents

Object detection method, object detection device, endoscope system, electronic device, and storage medium

Info

Publication number
CN112967276A
CN112967276A (application CN202110348217.1A)
Authority
CN
China
Prior art keywords
video frame
frame
target
type
state attribute
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110348217.1A
Other languages
Chinese (zh)
Other versions
CN112967276B (en)
Inventor
王晶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Hikvision Digital Technology Co Ltd filed Critical Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN202110348217.1A priority Critical patent/CN112967276B/en
Publication of CN112967276A publication Critical patent/CN112967276A/en
Application granted granted Critical
Publication of CN112967276B publication Critical patent/CN112967276B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/0002 Inspection of images, e.g. flaw detection
    • G06T 7/0012 Biomedical image inspection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10068 Endoscopic image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Radiology & Medical Imaging (AREA)
  • Quality & Reliability (AREA)
  • Endoscopes (AREA)
  • Image Analysis (AREA)
  • Closed-Circuit Television Systems (AREA)

Abstract

Embodiments of the present application provide an object detection method, an object detection device, an endoscope system, an electronic device, and a storage medium. Video data to be detected are acquired; target detection is performed on each video frame in the video data based on a pre-trained deep learning target detection network to obtain attribute information of the objects in each video frame; each object is tracked according to its attribute information to obtain a tracking result for each object; and, according to the tracking results, a video frame in which a new object appears compared with the previous frame is determined as a first type video frame, and a video frame in which an object that has appeared is about to disappear compared with the next frame is determined as a second type video frame. In addition to detecting the position of each object, the first type and second type video frames make clear when a new object appears and when an existing object disappears, so that situations in which objects such as gauze are left in a patient's body cavity can be reduced.

Description

Object detection method, object detection device, endoscope system, electronic device, and storage medium
Technical Field
The present application relates to the field of image processing technologies, and in particular, to an object detection method and apparatus, an endoscope system, an electronic device, and a storage medium.
Background
An endoscope is a common medical instrument comprising a light guide structure and a set of lenses. It enters the human body through a natural orifice or a small incision, and by imaging onto equipment outside the body it is used for the examination and surgical treatment of human organs or tissues. Compared with open surgery, endoscopic surgery causes smaller wounds and allows faster recovery, and is therefore favored clinically by both patients and doctors.
During surgery or diagnosis using an endoscope, one or more pieces of surgical gauze are sometimes required. For example, surgical gauze may be placed in the body cavity around the anatomical region to absorb blood or other body fluids that may leak. Surgical gauze poses a risk to the patient if it is left in the body cavity after the surgical or diagnostic procedure is completed. Therefore, how to detect gauze effectively, so as to reduce gauze retained in the patient's body cavity, is a problem that urgently needs to be solved.
Disclosure of Invention
An object of the embodiments of the present application is to provide an object detection method, an object detection apparatus, an endoscope system, an electronic device, and a storage medium, so as to reduce the situation that objects such as gauze are left in a body cavity of a patient. The specific technical scheme is as follows:
In a first aspect, an embodiment of the present application provides an object detection method, where the method includes: acquiring video data to be detected; performing target detection on each video frame in the video data based on a pre-trained deep learning target detection network to obtain attribute information of the objects in each video frame, where, for any object, the attribute information of the object includes position information of the object; tracking each object according to the attribute information of each object to obtain a tracking result of each object; and, according to the tracking result of each object, determining a video frame in which a new object appears compared with the previous frame as a first type video frame, and determining a video frame in which an object that has appeared is about to disappear compared with the next frame as a second type video frame.
In one possible embodiment, the object is gauze, and the deep learning target detection network is a gauze detection network; the deep learning target detection network based on pre-training respectively performs target detection on each video frame in the video data to obtain attribute information of an object in each video frame, and the method comprises the following steps: respectively extracting the characteristics of each video frame in the video data by using a characteristic extraction network of the gauze detection network to obtain the image characteristics of each video frame; and analyzing the image characteristics of each video frame by using the detection head network of the gauze detection network to obtain the attribute information of the gauze in each video frame.
In one possible embodiment, the method further comprises: generating index information at least comprising a first state attribute and a frame number of the first type video frame aiming at any determined first type video frame, wherein the first state attribute indicates that a new object appears; generating index information at least comprising a second state attribute and the frame number of the second type video frame aiming at any determined second type video frame, wherein the second state attribute represents that the appeared object disappears; and packaging each index information and the video data into code stream data.
In a possible implementation manner, for each of the first type video frame and the second type video frame, the index information of the video frame further includes at least the number of objects of the video frame and the position information of the objects in the video frame, where, for each of the first type video frame and the second type video frame, the number of objects of the video frame indicates the number of objects that will disappear and newly appear in the video frame.
In one possible embodiment, the method further comprises: decapsulating the code stream data to obtain each index information; and playing back each video frame in the first type video frame and the second type video frame according to each index information.
In a possible implementation manner, the playing back each video frame of the first type video frame and the second type video frame according to each piece of index information includes: acquiring each first type video frame and/or second type video frame represented by the frame number of the video frame in each index information based on the frame number of the video frame in each index information to obtain each target video frame; and for each target video frame, performing associated playback on the target video frame and the first state attribute and/or the second state attribute corresponding to the target video frame.
In one possible embodiment, the method further comprises: after acquiring detailed display information of a user for a specified target video frame, acquiring each video frame with the difference value of the frame number of the specified target video frame within a first preset frame number difference value range according to the frame number of the specified target video frame to obtain a target video frame segment corresponding to the specified target video frame; and performing associated playback on the target video frame segment and the state attribute corresponding to the target video frame segment, wherein the state attribute corresponding to the target video frame segment is a first state attribute and/or a second state attribute of the specified target video frame, or the state attribute corresponding to the target video frame segment is a first state attribute of a first type of video frame and/or a second state attribute of a second type of video frame included in the target video frame segment.
In a possible implementation manner, the playing back each video frame of the first type video frame and the second type video frame according to each piece of index information includes: aiming at each index information, acquiring each video frame of which the difference value with the frame number of the video frame in the index information is within a second preset frame number difference value range according to the frame number of the video frame in the index information to obtain a target video frame set corresponding to the index information; and for each target video frame set, performing associated playback on the target video frame set and the state attribute corresponding to the target video frame set, wherein for each target video frame set, the state attribute corresponding to the target video frame set is a first state attribute and/or a second state attribute in index information used for determining the target video frame set, or the state attribute corresponding to the target video frame set is a first state attribute of a first type of video frame and/or a second state attribute of a second type of video frame included in the target video frame set.
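A short illustrative sketch of this frame-number-based playback retrieval follows. It is a minimal sketch only; the record layout (IndexInfo), the helper name collect_playback_sets, and the window size are assumptions and are not part of the original disclosure.

```python
# Minimal sketch, assuming each index record carries a frame number, the state
# attributes, and the per-object boxes; names and the window size are hypothetical.
from dataclasses import dataclass
from typing import List, Tuple

NEW_OBJECT = 1            # first state attribute: a new object appears
OBJECT_DISAPPEARING = 0   # second state attribute: an appeared object is about to disappear

@dataclass
class IndexInfo:
    frame_no: int
    state_attrs: List[int]                   # first and/or second state attributes
    boxes: List[Tuple[int, int, int, int]]   # (x, y, w, h) per object

def collect_playback_sets(index_infos, total_frames, window=15):
    """For each index record, gather the frames whose frame-number difference from
    the indexed frame lies within a preset range (the 'second preset frame number
    difference value range'), so they can be played back with the state attribute."""
    playback_sets = []
    for info in index_infos:
        start = max(0, info.frame_no - window)
        end = min(total_frames - 1, info.frame_no + window)
        playback_sets.append((list(range(start, end + 1)), info.state_attrs))
    return playback_sets
```

Each (frame set, state attributes) pair would then be played back together, for example by overlaying a "new object" or "object about to disappear" label on the frames.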
In a possible implementation, for any object, the attribute information of the object further includes an image feature of the object; the tracking the objects respectively according to the attribute information of the objects to obtain the tracking result of the objects comprises: calculating the cosine distance of the image characteristics of every two objects between adjacent video frames according to the image characteristics of the objects; and associating the same objects between the adjacent video frames according to the position information of each object and the cosine distances to obtain the tracking result of each object.
In a second aspect, embodiments of the present application provide an endoscope system comprising: an endoscope, a light source apparatus, and an imaging system host; the endoscope is used for acquiring image data of a subject; the light source equipment is used for providing a shooting light source for the endoscope; the camera system host is used for realizing the object detection method in the application at run time.
In one possible embodiment, the endoscope system further comprises: a display device and a storage device; the camera system host is also used for sending the image data acquired by the endoscope to the display equipment and storing the processed image data into the storage equipment; the display device is used for displaying the image data and playing back each video frame in the first type video frame and the second type video frame; the storage device is used for storing the processed image data.
In a third aspect, an embodiment of the present application provides an object detection apparatus, where the apparatus includes: the video data acquisition module is used for acquiring video data to be detected; the attribute information determining module is used for respectively carrying out target detection on each video frame in the video data based on a pre-trained deep learning target detection network to obtain attribute information of an object in each video frame, wherein the attribute information of the object comprises position information of the object aiming at any object; the tracking result determining module is used for respectively tracking each object according to the attribute information of each object to obtain the tracking result of each object; and the video frame marking module is used for determining a video frame with a new object appearing compared with the previous frame as a first-class video frame and determining a video frame with an object appearing but to be disappeared compared with the next frame as a second-class video frame according to the tracking result of each object.
In one possible embodiment, the object is gauze, and the deep learning target detection network is a gauze detection network; the attribute information determination module is specifically configured to: respectively extracting the characteristics of each video frame in the video data by using a characteristic extraction network of the gauze detection network to obtain the image characteristics of each video frame; and analyzing the image characteristics of each video frame by using the detection head network of the gauze detection network to obtain the attribute information of the gauze in each video frame.
In a possible embodiment, the apparatus further comprises: the index information generation module is used for generating index information at least comprising a first state attribute and the frame number of the first type of video frame aiming at any determined first type of video frame, wherein the first state attribute indicates that a new object appears; generating index information at least comprising a second state attribute and the frame number of the second type video frame aiming at any determined second type video frame, wherein the second state attribute represents that the appeared object disappears; and packaging each index information and the video data into code stream data.
In a possible implementation manner, for each of the first type video frame and the second type video frame, the index information of the video frame further includes at least the number of objects of the video frame and the position information of the objects in the video frame, where, for each of the first type video frame and the second type video frame, the number of objects of the video frame indicates the number of objects that will disappear and newly appear in the video frame.
In a possible embodiment, the apparatus further comprises: the data decapsulation module is used for decapsulating the code stream data to obtain each index information; and the video frame display module is used for playing back each video frame in the first type of video frame and the second type of video frame according to each index information.
In a possible implementation manner, the video frame presentation module is specifically configured to: acquiring each first type video frame and/or second type video frame represented by the frame number of the video frame in each index information based on the frame number of the video frame in each index information to obtain each target video frame; and for each target video frame, performing associated playback on the target video frame and the first state attribute and/or the second state attribute corresponding to the target video frame.
In a possible embodiment, the apparatus further comprises: the target video frame segment determining module is used for acquiring each video frame of which the difference value with the frame number of the specified target video frame is within a first preset frame number difference value range according to the frame number of the specified target video frame after acquiring the detailed display message of a user for the specified target video frame, so as to obtain a target video frame segment corresponding to the specified target video frame; and the associated playing module is used for performing associated playback on the target video frame segment and the state attribute corresponding to the target video frame segment, wherein the state attribute corresponding to the target video frame segment is a first state attribute and/or a second state attribute of the specified target video frame, or the state attribute corresponding to the target video frame segment is a first state attribute of a first type of video frame and/or a second state attribute of a second type of video frame included in the target video frame segment.
In a possible implementation manner, the video frame presentation module is specifically configured to: aiming at each index information, acquiring each video frame of which the difference value with the frame number of the video frame in the index information is within a second preset frame number difference value range according to the frame number of the video frame in the index information to obtain a target video frame set corresponding to the index information; and for each target video frame set, performing associated playback on the target video frame set and the state attribute corresponding to the target video frame set, wherein for each target video frame set, the state attribute corresponding to the target video frame set is a first state attribute and/or a second state attribute in index information used for determining the target video frame set, or the state attribute corresponding to the target video frame set is a first state attribute of a first type of video frame and/or a second state attribute of a second type of video frame included in the target video frame set.
In a possible implementation manner, the tracking result determining module is specifically configured to: calculating the cosine distance of the image characteristics of every two objects between adjacent video frames according to the image characteristics of the objects; and associating the same objects between the adjacent video frames according to the position information of each object and the cosine distances to obtain the tracking result of each object.
In a fourth aspect, an embodiment of the present application provides an electronic device, including a processor and a memory; the memory is used for storing a computer program; the processor is configured to implement the object detection method according to any one of the present applications when executing the program stored in the memory.
In a fifth aspect, an embodiment of the present application provides a computer-readable storage medium, in which a computer program is stored, and the computer program, when executed by a processor, implements an object detection method described in any of the present application.
The embodiment of the application has the following beneficial effects: the object detection method, the object detection device, the endoscope system, the electronic equipment and the storage medium provided by the embodiment of the application acquire video data to be detected; respectively carrying out target detection on each video frame in video data based on a pre-trained deep learning target detection network to obtain attribute information of an object in each video frame, wherein the attribute information of the object comprises position information of the object aiming at any object; tracking each object respectively according to the attribute information of each object to obtain the tracking result of each object; and according to the tracking result of each object, determining a video frame with a new object appearing compared with the previous frame as a first-class video frame, and determining a video frame with an object appearing compared with the next frame and about to disappear as a second-class video frame. In addition to detecting the position of the object, the video frame with a new object appearing is used as a first type video frame, and the video frame with an existing object which is about to disappear is used as a second type video frame; through the first type of video frame and the second type of video frame, the appearance time of a new object and the disappearance time of an existing object can be clearly known, so that the condition that objects such as gauze and the like are left in the body cavity of a patient can be reduced. Of course, not all advantages described above need to be achieved at the same time in the practice of any one product or method of the present application.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a first schematic view of an endoscopic system according to an embodiment of the present application;
FIG. 2 is a second schematic view of an endoscopic system of an embodiment of the present application;
FIG. 3 is a schematic view of a modified portion of an endoscope system according to an embodiment of the present application;
FIG. 4 is a diagram illustrating target video frame extraction according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a training deep learning target detection network according to an embodiment of the present application;
FIG. 6 is a schematic diagram of sample image annotation according to an embodiment of the present application;
FIG. 7 is a diagram illustrating a deep learning object detection network according to an embodiment of the present application;
FIG. 8 is a schematic diagram of object tracking according to an embodiment of the present application;
FIG. 9 is a schematic diagram illustrating a method for determining a first type of video frame and a second type of video frame according to an embodiment of the present disclosure;
FIG. 10 is a first diagram illustrating the packaging of index information according to an embodiment of the present application;
FIG. 11 is a second exemplary diagram of packaging index information according to an embodiment of the present application;
FIG. 12 is a diagram illustrating a first type of video frame and a second type of video frame according to an embodiment of the present application;
FIG. 13 is a first schematic diagram of an object detection method according to an embodiment of the present application;
FIG. 14 is a second schematic diagram of an object detection method according to an embodiment of the present application;
FIG. 15 is a third exemplary diagram of packaging index information according to an embodiment of the present application;
fig. 16 is a third schematic diagram of an object detection method according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
First, terms in the present application are explained:
target detection: given an image, objects of interest are found and their location and category are determined.
Multi-target tracking: given a piece of video, multiple objects of interest are located simultaneously, and their motion trajectories are recorded maintaining the respective IDs.
In the related art, surgical video images acquired by an endoscope are analyzed with computer vision techniques to obtain the position of gauze in the image. However, only the position of the gauze is detected, and medical staff cannot be effectively reminded of whether gauze has been left in the patient's body cavity. For example, the shooting range of the endoscope is limited, and during surgery the gauze may leave the endoscope's field of view due to tissue compression, collisions with medical instruments, and the like, so that it ends up being left in the patient's body cavity.
In view of the above, the present application further provides an endoscope system, referring to fig. 1, including: an endoscope, a light source apparatus, and an imaging system host; the endoscope is used for acquiring image data of a subject; the light source equipment is used for providing a shooting light source for the endoscope; the camera system host is used for realizing any object detection method in the application during running.
In one possible embodiment, the endoscope system further includes: a display device and a storage device; the camera system host is also used for sending image data acquired by the endoscope to the display equipment and storing the processed image data into the storage equipment; the display device is used for displaying the image data and playing back each video frame in the first type video frame and the second type video frame; the storage device is used for storing the processed image data.
The endoscope system includes an endoscope, a light source apparatus, an image pickup system host, a display apparatus, and a storage apparatus. An endoscope in an endoscope system is capable of being inserted into a subject such as a patient, capturing images of the inside of the subject, and outputting captured in-vivo images to an external display device and storage device. The user checks the presence or absence of a bleeding part, a tumor part, and an abnormal part, which are parts to be detected, by observing the in-vivo image displayed by the display device, and provides a real-time image of the surgical treatment. The user can perform postoperative review and surgical training by accessing the video data in the storage device. An endoscope is inserted into a subject to capture an observation site of the subject and generate image data. The light source device supplies illumination light emitted from the distal end of the endoscope. The imaging system main unit performs the image data processing method on the image data acquired by the endoscope, and controls the overall operation of the endoscope system in a unified manner. The display device displays an image corresponding to image data of the endoscope system host and plays back each of the first type video frame and the second type video frame. The display device plays back each video frame in the first type of video frame and the second type of video frame, and may perform associated playback on a target video frame and a first state attribute and/or a second state attribute corresponding to the target video frame, or perform associated playback on a target video frame segment and a state attribute corresponding to the target video frame segment, or perform associated playback on a target video frame set and a state attribute corresponding to the target video frame set, and the like; for a specific playback manner, reference may be made to relevant portions in the method embodiments, and details are not described here. The storage device stores image data processed by the endoscope system host.
In one possible embodiment, referring to fig. 2, the endoscope includes an image pickup optical unit, a processing unit, an imaging unit, and a first operation unit, the light source apparatus includes an illumination control unit and an illumination unit, and the image pickup system host includes a control unit, a second operation unit, an image input unit, an image processing unit, an intelligent processing unit, and a video encoding unit.
The endoscope has an imaging optical unit, an imaging unit, a processing unit, and a first operation unit. The imaging optical unit condenses light from the observation site and may be constituted by one or more lenses. The imaging unit photoelectrically converts the light received by each pixel to generate image data, and may be composed of an image sensor such as a CMOS (complementary metal oxide semiconductor) or CCD (charge coupled device) sensor. The processing unit converts the image data generated by the imaging unit into a digital signal and transmits the converted signal to the camera system host. The first operation unit receives input of an instruction signal for switching the operation of the endoscope and an instruction signal for causing the light source device to switch the illumination light, and outputs these instruction signals to the camera system host. The first operation unit includes, but is not limited to, switches, buttons, and a touch panel.
The light source device includes an illumination control unit and an illumination unit. The illumination control unit receives an indication signal of the camera system host to control the illumination unit to provide illumination light to the endoscope.
The camera system host processes the image data received from the endoscope and transmits the processed image data to the display device and the storage device, which may be external devices. The camera system host comprises an image input unit, an image processing unit, an intelligent processing unit, a video coding unit, a control unit, and a second operation unit. The image input unit receives the signal sent by the endoscope and transmits it to the image processing unit. The image processing unit performs ISP (Image Signal Processor) operations on the image from the image input unit, including but not limited to luminance transformation, sharpening, moiré removal, and scaling, and transmits the processed image to the intelligent processing unit, the video coding unit, or the display device. The intelligent processing unit intelligently analyzes the ISP-processed image, including but not limited to deep-learning-based scene classification, instrument head detection, gauze detection, moiré classification, and dense fog classification; the image processed by the intelligent processing unit is transmitted to the image processing unit or the video coding unit. The image processing unit processes the image output by the intelligent processing unit in a manner including, but not limited to, luminance transformation, degressing, frame folding, and scaling. The video coding unit encodes and compresses the image processed by the image processing unit or the intelligent processing unit and transmits it to the storage device. The control unit controls the parts of the endoscope system, including but not limited to the illumination mode of the light source, the image processing mode, the intelligent processing mode, and the video coding mode. The second operation unit includes, but is not limited to, switches, buttons, and a touch panel; it receives external instruction signals and outputs them to the control unit.
The application relates to improvements of an intelligent processing unit and a video coding unit, wherein the intelligent processing unit carries out intelligent analysis on images processed by an image processing unit, including but not limited to instrument head detection and gauze detection. And the image processed by the intelligent processing unit is transmitted to an image processing unit or a video coding unit. The image processing unit processes the image processed by the intelligent processing unit in a manner including, but not limited to, luminance transformation, degressing, frame folding, and scaling. And the video coding unit codes and compresses the image processed by the image processing unit or the intelligent processing unit and transmits the image to the storage device.
During the operation, gauze is detected in each video frame and tracked across consecutive frames. If the current frame is a target video frame (gauze newly appears in the current frame, or gauze present in the previous frame has disappeared), several frames before and after the current frame are selected to be stored or marked in the video stream, and the gauze position is marked in the image. After the operation is finished (before suturing), these target video frame sequences are displayed on the screen so that the doctor can trace back and examine the gauze, which speeds up the gauze count and reduces the risk of retained gauze.
In one example, such as shown in fig. 3, the improvement may be embodied in three parts: the image acquisition part acquires an endoscope video, the image processing part processes the input endoscope video, the gauze is detected and tracked to obtain the position of the gauze, mark or store a target video frame sequence, and the image display part displays the extracted gauze target video frame sequence for a doctor to use. The following is a detailed description.
In the operation process, the real-time video data collected by the image collecting part is firstly detected by using a gauze detection model, then the detected gauze is tracked, and if the video frame is a target video frame (the current frame has gauze appearing or disappearing in comparison with the previous frame), a plurality of frames before and after the target video frame are selected for marking (for example, the frame number a to the frame number b have gauze appearing or disappearing) or storing, for example, as shown in fig. 4.
In the present application, a deep learning method is adopted, that is, a convolutional neural network is used to learn image features. The deep-learning-based gauze detection method is divided into two stages: training and testing. The training stage produces the gauze detection model, and the testing stage uses the gauze detection model to detect objects in input images. During network training, the inputs are the training images and labels, the loss function, and the network structure, and the output is a detection model; during testing, forward inference is performed on the test images with the trained detection model to obtain the gauze detection results, as shown in fig. 5.
In one example, gauze detection requires design of data calibration, loss function and network structure, and the following description is provided for possible embodiments.
(1) Calibration: gauze target detection requires the definition of a gauze label; one possible calibration method is the minimum enclosing rectangle of the area where the gauze is located, as shown in fig. 6.
(2) Loss function: a typical detection loss function is divided into two parts, a localization loss and a classification loss, where the localization loss is used for object localization and the classification loss is used for object classification. Regression is performed between the target box (the four corner points of the labeled gauze rectangle) and the box predicted by the network; a minimal loss sketch follows this list.
(3) The network structure is as follows: the deep learning target detection network mainly comprises two parts, namely a feature extraction network and a detection head network. An example of a possible network structure is shown in figure 7.
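As referenced in item (2), the following is a minimal sketch of such a two-part loss. The particular choices of cross-entropy for classification and smooth-L1 for box regression are assumptions made for illustration; the text only states that the loss contains a classification part and a localization part.

```python
# Two-part detection loss sketch: classification loss + localization loss.
# The specific loss functions used here are illustrative assumptions.
import torch
import torch.nn.functional as F

def detection_loss(cls_logits, cls_targets, box_preds, box_targets, reg_weight=1.0):
    # classification loss: is the candidate region gauze or background
    cls_loss = F.cross_entropy(cls_logits, cls_targets)
    # localization loss: regress the predicted box towards the labeled gauze box
    loc_loss = F.smooth_l1_loss(box_preds, box_targets)
    return cls_loss + reg_weight * loc_loss
```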
For tracking, a deep learning method is likewise adopted: a convolutional neural network is used to learn image features. For each detected object, the feature extraction network first extracts apparent features (which can be understood as a feature code of the object), and then object association is performed. For example, the cosine distance between the image features of every two objects in adjacent video frames can be calculated, and two objects whose cosine distance is the shortest and also smaller than a preset distance threshold are considered to be the same object, as shown in fig. 8, finally yielding a tracking result; the preset distance threshold is an empirical or experimental value. Gauze tracking can fall into three cases: (1) the object detected in the k-th frame is found among the object tracks of the (k-1)-th frame, which indicates that the object is tracked normally; (2) the object detected in the k-th frame is not found among the object tracks of the (k-1)-th frame, which indicates that the object newly appears in the k-th frame; (3) an object exists in the (k-1)-th frame but no object in the k-th frame is associated with it, which indicates that the object disappears in the k-th frame.
In the present application, a target video frame is defined as a frame in which gauze appears, or in which gauze that was present in the previous frame has disappeared; as shown in fig. 9, the (k-1)-th frame and the k-th frame are target video frames. After a target video frame is found, the related information needs to be described and labeled, and the labeled index information is placed in the code stream so that the related information can be stored and transmitted with the encoded video. The labeling method is shown in fig. 10, where the frame number indicates the frame number of the target video frame, the number of objects indicates how many objects are about to disappear or newly appear in the video frame, the position attribute indicates the position of the gauze (recorded as the coordinates of the top-left corner point together with the width and height), and the state attribute indicates whether the gauze is about to disappear or has newly appeared (0 indicates about to disappear, 1 indicates newly appeared). In one example, a sequence number of the target video frame may also be marked, indicating which target video frame in order the frame is. In the example shown in fig. 9, if the (k-1)-th frame and the k-th frame are the 1st and 2nd target video frames, part of the code stream data may be as shown in fig. 11, where for key frame 1, k-1 is the frame number, 1 is the number of objects, x, y, w, and h are the position attributes (x and y are the horizontal and vertical coordinates of the top-left corner of the object's target box, w is its width, and h is its height), and 0 is the state attribute, indicating that the object is about to disappear. After the operation is finished (before suturing), the image display part displays the gauze target video frame sequences as the key gauze positions for the doctor to trace back and examine, as shown in fig. 12.
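The record format of fig. 10 could be serialized into the code stream roughly as sketched below. The field order follows the description above; the fixed-width little-endian layout, the function name, and the example coordinate values are assumptions for illustration only.

```python
# Sketch of packing one index record (frame number, object count, then per-object
# position and state attribute) into bytes for the code stream; the binary layout
# is an assumption, not the format defined by the patent.
import struct

def pack_index_record(frame_no, objects):
    """objects: list of (x, y, w, h, state), state 0 = about to disappear, 1 = newly appeared."""
    data = struct.pack("<II", frame_no, len(objects))
    for x, y, w, h, state in objects:
        data += struct.pack("<IIIIB", x, y, w, h, state)
    return data

# Example in the spirit of fig. 11: frame k-1 contains one gauze box that is about
# to disappear (all numeric values here are made up for illustration).
record = pack_index_record(frame_no=99, objects=[(120, 80, 64, 48, 0)])
```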
In the embodiment of the present application, deep learning is used to detect and track gauze during endoscopic surgery and to extract the target video frame sequences; after the surgery is finished, the key gauze positions are provided to the doctor as a reference for tracing back and examination. This speeds up the post-operative gauze count and reduces the risk of retained gauze and of medical accidents caused by gauze left behind.
An embodiment of the present application further provides an object detection method, see fig. 13, where the method includes:
s101, video data to be detected are obtained.
The object detection method in the embodiment of the application can be implemented by an electronic device, and specifically, the electronic device can be an endoscope, a hard disk video recorder or other devices with image processing capability. In one example, the video data to be detected is video data captured by an endoscope.
And S102, respectively carrying out target detection on each video frame in the video data based on a pre-trained deep learning target detection network to obtain attribute information of an object in each video frame, wherein the attribute information of the object comprises position information of the object for any object.
The deep learning target detection network is used for detecting objects in video frames, the objects in the embodiment of the application include, but are not limited to, gauze, catheters, medical tapes and other articles, and specific types of the objects can be set according to actual detection scenes.
The deep learning target detection network may be any target detection network based on a deep learning algorithm. In one embodiment, the object is gauze, and the deep learning target detection network is a gauze detection network; the deep learning target detection network based on pre-training respectively performs target detection on each video frame in the video data to obtain attribute information of an object in each video frame, and the method includes: respectively extracting the characteristics of each video frame in the video data by using a characteristic extraction network of the gauze detection network to obtain the image characteristics of each video frame; and analyzing the image characteristics of each video frame by using the detection head network of the gauze detection network to obtain the attribute information of the gauze in each video frame.
The deep learning target detection network can be a gauze detection network comprising a feature extraction network and a detection head network, where the feature extraction network is used to extract image features of the video frames, and the detection head network performs pooling, regression, and other operations on the image features to obtain the position information of the objects in the video frames. In one example, the deep learning object detection network may be as shown in fig. 7, where an RPN (Region Proposal Network) is used to generate object candidate boxes based on the image features, and ROI (Region Of Interest) pooling is used to pool each object candidate box.
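The RPN-plus-ROI-pooling structure described above has the general shape of a Faster R-CNN-style detector. The patent does not name a specific architecture, so the sketch below uses torchvision's Faster R-CNN purely as an illustrative stand-in for a gauze detection network built from a feature extraction backbone and a detection head.

```python
# Illustrative stand-in only: a two-class (background / gauze) Faster R-CNN,
# i.e. a feature extraction backbone plus an RPN and ROI-pooling detection head.
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

gauze_detector = fasterrcnn_resnet50_fpn(num_classes=2)
gauze_detector.eval()

with torch.no_grad():
    frame = torch.rand(3, 720, 1280)           # one video frame, CHW, values in [0, 1]
    detections = gauze_detector([frame])[0]    # dict with "boxes", "labels", "scores"
    boxes, scores = detections["boxes"], detections["scores"]
```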
The training method of the deep learning target detection network may refer to a network training method in the related art, and in an example, as shown in fig. 5, a process of training the deep learning target detection network in advance includes a training process and a testing process, taking an object as a gauze as an example, a plurality of sample images are obtained, and a gauze position in each sample image is calibrated, for example, a schematic diagram of a possible sample image with a gauze position calibrated may be shown in fig. 6. Dividing the sample image into a training set and a testing set; training process: inputting the sample images in the training set into a deep learning target detection network to obtain predicted gauze position information, calculating loss according to the predicted gauze position information and gauze positions calibrated by the sample images, adjusting parameters of the deep learning target detection network according to the loss, and turning to a testing process after the training times reach preset times; the testing process comprises the following steps: and verifying the deep learning target detection network by using the sample image in the test set, obtaining the trained deep learning target detection network if the loss is converged, and returning to the training process if the loss is not converged.
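A compact sketch of the training-and-testing procedure just described, under the assumption that the model returns classification logits and box predictions for a batch and that detection_loss is the two-part loss sketched earlier; data loading, label calibration, and the convergence check are simplified placeholders.

```python
# Training sketch under assumed interfaces; not the patent's actual implementation.
import torch

def train_gauze_detector(model, train_loader, test_loader, detection_loss,
                         epochs=20, lr=1e-3):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for epoch in range(epochs):
        model.train()
        for images, cls_targets, box_targets in train_loader:   # calibrated gauze labels
            cls_logits, box_preds = model(images)
            loss = detection_loss(cls_logits, cls_targets, box_preds, box_targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        # testing phase: verify on the held-out set and watch whether the loss converges
        model.eval()
        with torch.no_grad():
            test_losses = []
            for images, cls_targets, box_targets in test_loader:
                cls_logits, box_preds = model(images)
                test_losses.append(
                    detection_loss(cls_logits, cls_targets, box_preds, box_targets).item())
            print(f"epoch {epoch}: mean test loss {sum(test_losses) / len(test_losses):.4f}")
    return model
```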
And S103, tracking each object according to the attribute information of each object to obtain the tracking result of each object.
For any object, the attribute information of the object comprises the position information of the object; each object may be tracked based on a related target tracking method according to the position information of each object, so as to obtain a tracking result of each object, and in one example, the tracking result of the object may be a motion trajectory of the object.
In a possible implementation, for any object, the attribute information of the object further includes an image feature of the object; the tracking result of each object obtained by tracking each object according to the attribute information of each object includes:
step one, calculating the cosine distance of the image characteristics of every two objects between adjacent video frames according to the image characteristics of all the objects.
The image characteristics of the object can be directly obtained by using the deep learning target detection network, and can also be extracted according to the position information of the object by using a characteristic extraction network different from the deep learning target detection network. The cosine distance of the image features of every two objects between every two adjacent video frames in the video data is calculated. For example, the K-th video frame includes an object 1 and an object 2, and the (K+1)-th video frame includes an object a, an object b, and an object c, so the cosine distance of the image features of object 1 and object a, of object 1 and object b, of object 1 and object c, of object 2 and object a, of object 2 and object b, and of object 2 and object c need to be calculated. In an example, the cosine distance in the embodiment of the present application may be replaced with another parameter representing image similarity, such as a Euclidean distance. A minimal sketch of this pairwise computation is given after the example below.
And step two, associating the same object between every two adjacent video frames according to the position information of each object and the cosine distance to obtain the tracking result of each object.
And determining each object which is the same target in the adjacent video frames according to each cosine distance, and generating the track of the object which is the same target according to the position information of each object, thereby obtaining the tracking result of each object.
In one example, two objects whose cosine distance is the shortest and also smaller than the preset distance threshold are considered to be the same target. For example, the K-th video frame includes an object 1 and an object 2, and the (K+1)-th video frame includes an object a, an object b, and an object c. If the cosine distance between the image features of object 1 and object a is less than that between object 1 and object b, which is less than that between object 1 and object c, and the cosine distance between the image features of object 1 and object a is less than the preset distance threshold, then object 1 and object a are judged to be the same object, the position information of object 1 and object a is associated, and the motion trajectory of the object is obtained as the tracking result of the object. If the cosine distance between the image features of object 2 and object a is less than that between object 2 and object b, which is less than that between object 2 and object c, but the cosine distance between the image features of object 2 and object a is not less than the preset distance threshold, it is determined that object 2 and object a are not the same object, that is, object 2 is an object that is about to disappear.
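A minimal sketch of the pairwise cosine-distance computation used in steps one and two above (and referenced there); the feature vectors are assumed to be fixed-length arrays produced by the feature extraction network, and the function name is hypothetical.

```python
# Pairwise cosine distances between object features of two adjacent frames.
import numpy as np

def cosine_distance_matrix(feats_prev, feats_curr):
    """feats_prev: (M, D) features of the objects in frame K,
    feats_curr: (N, D) features of the objects in frame K+1.
    Returns an (M, N) matrix of cosine distances (1 - cosine similarity)."""
    a = feats_prev / np.linalg.norm(feats_prev, axis=1, keepdims=True)
    b = feats_curr / np.linalg.norm(feats_curr, axis=1, keepdims=True)
    return 1.0 - a @ b.T

# Example: 2 objects in frame K, 3 objects in frame K+1, 128-dimensional features.
dist = cosine_distance_matrix(np.random.rand(2, 128), np.random.rand(3, 128))
```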
In one example, the associating the same object between adjacent video frames according to the position information of each object and the cosine distance to obtain the tracking result of each object includes:
step 1, for each group of adjacent video frames, all possible object combinations between the two video frames of the group may be constructed, and the sum of cosine distances of each object combination is determined, where the cosine distance contributed by each unassociated object is taken to be the preset distance threshold (an equivalent assignment-based implementation is sketched after the example below).
The preset distance threshold is an empirical value or an experimental value, in one example, a plurality of groups of negative sample objects may be selected in advance, two objects in the same group of negative sample objects are not the same object, cosine distances of the two objects in each group of negative sample objects are respectively calculated, and a mean value of the cosine distances of each group of negative sample objects is calculated as the preset distance threshold.
And 2, aiming at each group of adjacent video frames, selecting an object combination with the minimum sum of cosine distances in the group of adjacent video frames as a tracking result of the group of adjacent video frames.
In one example, a group of adjacent video frames are a K frame video frame and a K +1 frame video frame, the K frame video frame includes an object 1 and an object 2, and the K +1 frame video frame includes an object a, an object b, and an object c, so that all possible object combinations are: the object combination A, the object 1 is related to the object a, the object 2 is related to the object b, and the object c is not related to the object; object combination B, object 1 is related to object a, object 2 is related to object c, and object B is not related to object; an object combination C, an object 1 is associated with an object b, an object 2 is associated with an object a, and an object C is not associated with an object; an object combination D, an object 1 is associated with an object b, an object 2 is associated with an object c, and an object a is not associated with an object; an object combination E, an object 1 is associated with an object c, an object 2 is associated with an object a, and an object b is not associated with an object; an object combination F, an object 1 is associated with an object c, an object 2 is associated with an object b, and an object a is not associated with an object; object combination G, object 1 associated object a, object 2 unrelated object, object b unrelated object, object c unrelated object; an object combination H, an object 1 is associated with an object b, an object 2 is not associated with an object, an object a is not associated with an object, and an object c is not associated with an object; object combination I, object 1 associated object c, object 2 unrelated object, object a unrelated object, object b unrelated object; object combination J, object 2 associated object a, object 1 unassociated object, object b unassociated object, object c unassociated object; the object combination K, the object 2 is related to the object b, the object 1 is not related to the object, the object a is not related to the object, and the object c is not related to the object; an object combination L, an object 2 related to an object c, an object 1 unrelated object, an object a unrelated object and an object b unrelated object; object combination M, object 1 unrelated object, object 2 unrelated object, object a unrelated object, object b unrelated object, object c unrelated object.
The sum of cosine distances is calculated for each object combination, with each unassociated object contributing the preset distance threshold. If, in terms of the sum of cosine distances, object combination A < object combination B < object combination C < object combination D < object combination E < object combination F < object combination G < object combination H < object combination I < object combination J < object combination K < object combination L < object combination M, then object combination A is the tracking result for the K-th and (K+1)-th video frames, that is, object 1 and object a are the same object, object 2 and object b are the same object, and object c is a newly appearing object in the (K+1)-th video frame.
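The enumeration of object combinations above is equivalent to a linear assignment problem: padding the cosine-distance matrix with the preset distance threshold for unassociated objects and solving it with the Hungarian algorithm (SciPy's linear_sum_assignment) selects the same minimum-sum combination without listing every case. This reformulation is offered only as an illustrative observation and is not stated in the original text.

```python
# Assignment-based sketch equivalent to the minimum-sum combination search above.
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(dist, threshold):
    """dist: (M, N) cosine distances between the objects of frame K and frame K+1.
    Each unassociated object costs `threshold`, matching the enumeration rule above."""
    m, n = dist.shape
    cost = np.full((m + n, m + n), threshold, dtype=float)
    cost[:m, :n] = dist          # real object vs. real object
    cost[m:, n:] = 0.0           # dummy vs. dummy pairs cost nothing
    rows, cols = linear_sum_assignment(cost)
    matches = [(i, j) for i, j in zip(rows, cols) if i < m and j < n]
    vanished = [i for i, j in zip(rows, cols) if i < m and j >= n]   # about to disappear
    appeared = [j for i, j in zip(rows, cols) if i >= m and j < n]   # newly appeared
    return matches, vanished, appeared
```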
And S104, according to the tracking result of each object, determining a video frame with a new object appearing compared with the previous frame as a first-class video frame, and determining a video frame with an object appearing compared with the next frame and about to disappear as a second-class video frame.
The tracking of an object may have one of the following outcomes:
(1) Among the object tracks of the (k-1)-th frame, a track matching an object detected in the k-th frame is found, which indicates that the object is tracked normally.
(2) Among the object tracks of the (k-1)-th frame, no track matching an object detected in the k-th frame is found, which indicates that the object newly appears in the k-th frame.
(3) An object exists in the (k-1)-th frame, but no object in the k-th frame is associated with it, which indicates that the object disappears in the k-th frame.
The first type video frame is a video frame in which a new object appears compared with the previous video frame; the second type video frame is a video frame in which an object that has appeared will disappear in the next video frame. For example, as shown in fig. 9, the k-th frame is a first type video frame, and the (k-1)-th frame is a second type video frame. In one example, when a new object appears in a video frame and an object that has appeared also disappears, the video frame is both a first type video frame and a second type video frame.
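As an illustration of how the three tracking outcomes above translate into frame labels, the following sketch assumes that tracking has already assigned a persistent track id to every object in every frame; the function and variable names are illustrative only, not part of the described method.

# Label frames as "first type" (a new object appears compared with frame k-1)
# and "second type" (an existing object will disappear in frame k+1).
def label_frames(track_ids_per_frame):
    """track_ids_per_frame: list of sets, one set of track ids per video frame.
    Returns two sets of frame indices: first_type and second_type."""
    first_type, second_type = set(), set()
    for k, ids in enumerate(track_ids_per_frame):
        prev_ids = track_ids_per_frame[k - 1] if k > 0 else set()
        next_ids = track_ids_per_frame[k + 1] if k + 1 < len(track_ids_per_frame) else set()
        if ids - prev_ids:            # some object is new compared with frame k-1
            first_type.add(k)
        if ids - next_ids:            # some object is gone in frame k+1
            second_type.add(k)
    return first_type, second_type

# toy usage: object 7 appears in frame 1 and disappears after frame 2
print(label_frames([{3}, {3, 7}, {3, 7}, {3}]))
# -> ({0, 1}, {2, 3}): frame 1 is first type because object 7 appears and frame 2
# is second type because object 7 disappears; the boundary frames 0 and 3 are also
# flagged because everything counts as new at the start and as disappearing at the
# end under this simple rule.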
In the embodiment of the application, in addition to detecting the position of the object, a video frame in which a new object appears is determined as a first type video frame and a video frame in which an existing object is about to disappear is determined as a second type video frame; through the first type video frames and the second type video frames, the appearance time of a new object and the disappearance time of an existing object can be clearly known, so that the risk of objects such as gauze being left in the body cavity of a patient can be reduced.
In a possible embodiment, referring to fig. 14, the method further comprises:
S105: for any determined first type video frame, generate index information that includes at least a first state attribute and the frame number of the first type video frame, where the first state attribute indicates that a new object appears.
For each video frame in the first type video frames, index information of the video frame is generated. The index information carries a label, the label includes the first state attribute and the frame number of the video frame, the first state attribute indicates that a new object appears, and the frame number indicates which frame of the video data the video frame is.
In a possible implementation manner, for each video frame in the first type video frames, the index information of the video frame further includes the number of objects of the video frame and the position information of the objects in the video frame, where the number of objects indicates the number of objects that newly appear or are about to disappear in the video frame. In one example, the index information of the video frame further includes a sequence number of the video frame. The position information of an object may be coordinate information of the target frame of the object; in one example, the position information is expressed by the coordinates of the top-left corner point of the target frame of the object together with the width and height of the target frame. The sequence number of the video frame may indicate which first type video frame in the video data the video frame is, or may indicate which labeled video frame in the video data the video frame is, where the labeled video frames include the first type video frames and the second type video frames. The sequence number in the index information makes it convenient to distinguish the video frames in time order, and the position information of the objects in the index information makes it convenient to locate the objects in the video frame.
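The following sketch shows one possible in-memory representation of such index information; the field names, types and constant values are assumptions made for illustration, not the patent's specified format.

# One possible per-frame index record: state attributes, frame number, object
# count, optional sequence number, and one (x, y, w, h) target frame per object.
from dataclasses import dataclass, field
from typing import List, Tuple

NEW_OBJECT = 1        # first state attribute: a new object appears
OBJECT_GONE = 0       # second state attribute: an object that appeared disappears

@dataclass
class IndexInfo:
    frame_number: int                       # which frame of the video data
    sequence_number: int                    # which labeled frame this is
    # one (x, y, w, h, state) tuple per newly appearing or about-to-disappear object
    objects: List[Tuple[int, int, int, int, int]] = field(default_factory=list)

    @property
    def object_count(self) -> int:
        return len(self.objects)

# example: in frame 255 a new piece of gauze appears at (120, 80, 60, 40)
info = IndexInfo(frame_number=255, sequence_number=3,
                 objects=[(120, 80, 60, 40, NEW_OBJECT)])
print(info.object_count, info)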
S106: for any determined second type video frame, generate index information that includes at least a second state attribute and the frame number of the second type video frame, where the second state attribute indicates that an object that has appeared disappears.
For each video frame in the second type video frames, index information of the video frame is generated. The index information carries a label, the label includes the second state attribute and the frame number of the video frame, the second state attribute indicates that an object that has appeared disappears, and the frame number indicates which frame of the video data the video frame is.
In a possible implementation manner, for each video frame in the second type video frames, the index information of the video frame further includes at least the number of objects of the video frame and the position information of the objects in the video frame, where the number of objects indicates the number of objects that newly appear or are about to disappear in the video frame. In one example, the index information of the video frame further includes a sequence number of the video frame. The position information of an object may be coordinate information of the target frame of the object; in one example, the position information is expressed by the coordinates of the top-left corner point of the target frame of the object together with the width and height of the target frame. The sequence number of the video frame may indicate which second type video frame in the video data the video frame is, or may indicate which labeled video frame in the video data the video frame is, where the labeled video frames include the first type video frames and the second type video frames. The sequence number in the index information makes it convenient to distinguish the video frames in time order, and the position information of the objects in the index information makes it convenient to locate the objects in the video frame.
S107: encapsulate each piece of index information and the video data into code stream data.
Each piece of index information is packaged into the code stream obtained after the video data is encoded, so as to obtain the code stream data. In one example, as shown in fig. 10 and fig. 15, the index information may be encapsulated in the header of the code stream data, so that the index information can be obtained quickly after decapsulation; the data header is used for identifying the index information and may include information such as the data length of the index information, and its specific content may be set by the user according to the actual situation. In one example, as shown in fig. 10, the state attribute indicates whether a new object appears or an object that has appeared disappears; for example, the second state attribute (an object that has appeared disappears) is indicated by 0, and the first state attribute (a new object appears) is indicated by 1.
In one example, there may be a video frame in which a new object appears and an object that has appeared disappears at the same time. When the number of newly appearing and about-to-disappear objects in a video frame is greater than 1, the position information and state attribute of each such object may be arranged in sequence in the index information of the video frame, and the arrangement order may be user-defined. As shown in fig. 15, key frame 1 includes an object that is about to disappear and a newly appearing object, where k-1 denotes the frame number of key frame 1, 2 denotes the number of about-to-disappear and newly appearing objects in key frame 1, the x, y, w, h preceding the 0 are the position attributes of the about-to-disappear object and 0 is its state attribute, indicating that the object is about to disappear; the x, y, w, h preceding the 1 are the position attributes of the newly appearing object and 1 is its state attribute, indicating that the object newly appears.
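As an illustration of the encapsulation described above, the sketch below serializes index information ahead of the encoded video payload; the byte layout, the magic tag and the length field are assumptions made for the sketch, since the text only specifies which fields are present (frame number, object count, x, y, w, h and the 0/1 state attribute).

# Pack and unpack index information as a small binary block placed before the
# encoded video payload; unpack_index also illustrates step S108 below, since
# the index information can be read back without decoding the video payload.
import struct

MAGIC = b"IDX0"

def pack_index(entries):
    """entries: list of (frame_number, [(x, y, w, h, state), ...]) tuples."""
    body = b""
    for frame_number, objects in entries:
        body += struct.pack("<IH", frame_number, len(objects))
        for x, y, w, h, state in objects:
            body += struct.pack("<HHHHB", x, y, w, h, state)
    # data header: magic tag + total length of the index information
    return MAGIC + struct.pack("<I", len(body)) + body

def unpack_index(blob):
    assert blob[:4] == MAGIC, "not index data"
    (length,) = struct.unpack_from("<I", blob, 4)
    entries, offset, end = [], 8, 8 + length
    while offset < end:
        frame_number, count = struct.unpack_from("<IH", blob, offset)
        offset += 6
        objects = []
        for _ in range(count):
            objects.append(struct.unpack_from("<HHHHB", blob, offset))
            offset += 9
        entries.append((frame_number, objects))
    return entries

# round trip: key frame k-1 with one disappearing and one newly appearing object
entries = [(254, [(120, 80, 60, 40, 0), (300, 200, 50, 30, 1)])]
blob = pack_index(entries) + b"...encoded video payload..."
print(unpack_index(blob))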
In the embodiment of the application, encapsulating the index information and the video data into the code stream data facilitates subsequent viewing and object tracing.
In a possible embodiment, referring to fig. 16, the method further comprises:
S108: decapsulate the code stream data to obtain each piece of index information.
The code stream data is decapsulated to obtain the index information and the video data.
S109: play back each video frame in the first type video frames and the second type video frames according to each piece of index information.
In one embodiment, playing back each video frame in the first type video frames and the second type video frames according to each piece of index information includes:
Step 1: based on the frame number of the video frame in each piece of index information, acquire each first type video frame and/or second type video frame indicated by that frame number, to obtain each target video frame.
For example, if the frame numbers of the video frames in the respective pieces of index information are 5, 99, 255, 1245 and 3455, the 5th, 99th, 255th, 1245th and 3455th video frames are obtained from the video data as the target video frames.
Step 2: for each target video frame, play back the target video frame in association with the first state attribute and/or the second state attribute corresponding to the target video frame.
The first state attribute and/or the second state attribute corresponding to the target video frame are the first state attribute and/or the second state attribute in the index information of the target video frame. If the index information of the target video frame carries the first state attribute but not the second state attribute, the target video frame is displayed in association with the first state attribute; if the index information carries the second state attribute but not the first state attribute, the target video frame is displayed in association with the second state attribute; and if the index information carries both the first state attribute and the second state attribute, the target video frame is displayed in association with both the first state attribute and the second state attribute. In addition, information such as the sequence number corresponding to the target video frame and the target frames of the objects can be displayed. In one example, a schematic diagram of the associated display may be as shown in fig. 12.
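A minimal sketch of this associated display logic is given below, assuming index entries in the (frame number, object list) form used earlier; the display label strings are illustrative only.

# Derive the label(s) to display alongside each target video frame from the
# per-object state attributes in its index information.
FIRST_STATE = 1    # a new object appears
SECOND_STATE = 0   # an object that has appeared disappears

def playback_labels(entries):
    """entries: list of (frame_number, [(x, y, w, h, state), ...]).
    Returns a dict: frame_number -> list of display labels."""
    labels = {}
    for frame_number, objects in entries:
        states = {obj[-1] for obj in objects}
        frame_labels = []
        if FIRST_STATE in states:
            frame_labels.append("new object appears")
        if SECOND_STATE in states:
            frame_labels.append("object that appeared disappears")
        labels[frame_number] = frame_labels
    return labels

# frame 254 has both a disappearing and a newly appearing object
print(playback_labels([(254, [(120, 80, 60, 40, 0), (300, 200, 50, 30, 1)])]))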
In one embodiment, a user can play back a video segment near a target video frame, and the method further comprises:
Step 3: after acquiring a detail display message from the user for a specified target video frame, acquire, according to the frame number of the specified target video frame, each video frame whose frame number difference from the specified target video frame is within a first preset frame number difference range, to obtain a target video frame segment corresponding to the specified target video frame.
The first preset frame number difference range may be user-defined according to the actual situation, for example -50 to 50, -50 to 100, -100 to 50, -100 to 100, or -400 to 500. Taking a first preset frame number difference range of -50 to 100 as an example, the 50 frames before the specified target video frame, the specified target video frame itself, and the 100 frames after it are selected as the target video frame segment corresponding to the specified target video frame.
Step 4: play back the target video frame segment in association with the state attribute corresponding to the target video frame segment, where the state attribute corresponding to the target video frame segment is the first state attribute and/or the second state attribute of the specified target video frame, or the state attribute corresponding to the target video frame segment is the first state attribute of the first type video frames and/or the second state attribute of the second type video frames included in the target video frame segment.
The state attribute corresponding to the target video frame segment is the first state attribute and/or the second state attribute of the specified target video frame used for determining the target video frame segment; in one example, it may further include the first state attribute and/or the second state attribute corresponding to other target video frames in the target video frame segment. The target video frame segment is played, and the first state attribute and/or the second state attribute corresponding to it are displayed in association. In one example, the first state attribute and/or the second state attribute corresponding to the target video frame segment may be displayed throughout the playback of the target video frame segment; in another example, the first state attribute and/or the second state attribute corresponding to a target video frame may be displayed when that target video frame, or a video frame whose frame number is within a third preset frame number difference range of the frame number of that target video frame, is being played.
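The sketch below illustrates the segment selection of steps 3 and 4 under the assumption that the first preset frame number difference range is -50 to 100; clamping to the valid frame range is an added assumption for robustness rather than something the text mandates.

# Select the target video frame segment around a specified target video frame.
def target_segment(target_frame, total_frames, lo=-50, hi=100):
    """Return the list of frame numbers forming the target video frame segment."""
    start = max(0, target_frame + lo)
    end = min(total_frames - 1, target_frame + hi)
    return list(range(start, end + 1))

# e.g. the user asks for details of target frame 255 in a 10000-frame video:
segment = target_segment(255, 10000)
print(segment[0], segment[-1], len(segment))   # 205 355 151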
In one embodiment, playing back each video frame in the first type video frames and the second type video frames according to each piece of index information includes:
Step A: for each piece of index information, acquire, according to the frame number of the video frame in the index information, each video frame whose frame number difference from that frame number is within a second preset frame number difference range, to obtain a target video frame set corresponding to the index information.
The second preset frame number difference range may be user-defined according to the actual situation, for example -5 to 5, -5 to 10, -10 to 10, -20 to 20, or -50 to 100. Taking a second preset frame number difference range of -5 to 10 as an example, the 5 frames before the frame number of the video frame in the index information, the video frame indicated by that frame number, and the 10 frames after it are selected as the target video frame set corresponding to the index information.
Step B: for each target video frame set, play back the target video frame set in association with the state attribute corresponding to the target video frame set, where, for each target video frame set, the state attribute corresponding to the target video frame set is the first state attribute and/or the second state attribute in the index information used for determining the target video frame set, or the state attribute corresponding to the target video frame set is the first state attribute of the first type video frames and/or the second state attribute of the second type video frames included in the target video frame set.
In the embodiment of the application, each video frame in the first type video frames and the second type video frames is displayed, so that medical personnel can review the increase and decrease of objects such as gauze during the operation before suturing, which effectively reduces the risk that objects such as gauze are left in the body cavity of the patient.
An embodiment of the present application further provides an object detection apparatus, including: a video data acquisition module, used for acquiring video data to be detected; an attribute information determining module, used for respectively performing target detection on each video frame in the video data based on a pre-trained deep learning target detection network to obtain attribute information of objects in each video frame, wherein, for any object, the attribute information of the object includes position information of the object; a tracking result determining module, used for tracking each object according to the attribute information of each object to obtain a tracking result of each object; and a video frame marking module, used for determining, according to the tracking result of each object, a video frame in which a new object appears compared with the previous frame as a first type video frame, and a video frame in which an object that has appeared is about to disappear compared with the next frame as a second type video frame.
In one possible embodiment, the object is gauze, and the deep learning target detection network is a gauze detection network; the attribute information determining module is specifically used for: extracting features of each video frame in the video data respectively by using a feature extraction network of the gauze detection network to obtain image features of each video frame; and analyzing the image features of each video frame by using a detection head network of the gauze detection network to obtain the attribute information of the gauze in each video frame.
In a possible embodiment, the apparatus further includes: an index information generating module, used for: for any determined first type video frame, generating index information including at least a first state attribute and the frame number of the first type video frame, where the first state attribute indicates that a new object appears; for any determined second type video frame, generating index information including at least a second state attribute and the frame number of the second type video frame, where the second state attribute indicates that an object that has appeared disappears; and encapsulating each piece of index information and the video data into code stream data.
In a possible implementation manner, for each video frame in the first type video frames and the second type video frames, the index information of the video frame further includes at least the number of objects of the video frame and the position information of the objects in the video frame, where the number of objects of the video frame indicates the number of objects that newly appear or are about to disappear in the video frame.
In a possible embodiment, the apparatus further comprises: the data decapsulation module is used for decapsulating the code stream data to obtain each index information; and the video frame display module is used for playing back each video frame in the first type of video frame and the second type of video frame according to each index information.
In a possible implementation manner, the video frame display module is specifically used for: acquiring, based on the frame number of the video frame in each piece of index information, each first type video frame and/or second type video frame indicated by that frame number, to obtain each target video frame; and, for each target video frame, playing back the target video frame in association with the first state attribute and/or the second state attribute corresponding to the target video frame.
In a possible embodiment, the apparatus further includes: a target video frame segment determining module, used for, after acquiring a detail display message from the user for a specified target video frame, acquiring, according to the frame number of the specified target video frame, each video frame whose frame number difference from the specified target video frame is within a first preset frame number difference range, to obtain a target video frame segment corresponding to the specified target video frame; and an associated playing module, used for playing back the target video frame segment in association with the state attribute corresponding to the target video frame segment, where the state attribute corresponding to the target video frame segment is the first state attribute and/or the second state attribute of the specified target video frame, or the state attribute corresponding to the target video frame segment is the first state attribute of the first type video frames and/or the second state attribute of the second type video frames included in the target video frame segment.
In a possible implementation manner, the video frame display module is specifically used for: for each piece of index information, acquiring, according to the frame number of the video frame in the index information, each video frame whose frame number difference from that frame number is within a second preset frame number difference range, to obtain a target video frame set corresponding to the index information; and, for each target video frame set, playing back the target video frame set in association with the state attribute corresponding to the target video frame set, where, for each target video frame set, the state attribute corresponding to the target video frame set is the first state attribute and/or the second state attribute in the index information used for determining the target video frame set, or the state attribute corresponding to the target video frame set is the first state attribute of the first type video frames and/or the second state attribute of the second type video frames included in the target video frame set.
In a possible implementation manner, the tracking result determining module is specifically used for: calculating the cosine distance between the image features of every two objects in adjacent video frames according to the image features of the objects; and associating the same objects between adjacent video frames according to the position information of each object and the cosine distances, to obtain the tracking result of each object.
An embodiment of the present application further provides an electronic device, including: a processor and a memory; the memory is used for storing computer programs; the processor is configured to implement any one of the object detection methods of the present application when executing the computer program stored in the memory.
Optionally, in addition to the memory and the processor, the electronic device according to the embodiment of the present application further includes a communication interface and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus.
The communication bus mentioned in the electronic device may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the electronic equipment and other equipment.
The Memory may include a RAM (Random Access Memory) or an NVM (Non-Volatile Memory), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The Processor may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
An embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements any object detection method in the present application.
In yet another embodiment provided herein, there is also provided a computer program product containing instructions that, when run on a computer, cause the computer to perform any of the object detection methods of the present application.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, it may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, digital subscriber line) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It should be noted that, in this document, the technical features in the various alternatives can be combined to form the scheme as long as the technical features are not contradictory, and the scheme is within the scope of the disclosure of the present application. Relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for embodiments of the apparatus, the system, the electronic device, the computer program product, and the storage medium, since they are substantially similar to the method embodiments, the description is relatively simple, and for relevant points, reference may be made to the partial description of the method embodiments.
The above description is only for the preferred embodiment of the present application and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application are included in the protection scope of the present application.

Claims (14)

1. An object detection method, characterized in that the method comprises:
acquiring video data to be detected;
respectively carrying out target detection on each video frame in the video data based on a pre-trained deep learning target detection network to obtain attribute information of an object in each video frame, wherein, for any object, the attribute information of the object comprises position information of the object;
tracking each object according to the attribute information of each object to obtain a tracking result of each object;
and according to the tracking result of each object, determining a video frame in which a new object appears compared with the previous frame as a first type video frame, and determining a video frame in which an object that has appeared is about to disappear compared with the next frame as a second type video frame.
2. The method of claim 1, wherein the object is gauze and the deep learning target detection network is a gauze detection network;
and the respectively carrying out target detection on each video frame in the video data based on the pre-trained deep learning target detection network to obtain attribute information of an object in each video frame comprises:
extracting features of each video frame in the video data respectively by using a feature extraction network of the gauze detection network to obtain image features of each video frame;
and analyzing the image features of each video frame by using a detection head network of the gauze detection network to obtain the attribute information of the gauze in each video frame.
3. The method of claim 1, further comprising:
for any determined first type video frame, generating index information comprising at least a first state attribute and the frame number of the first type video frame, wherein the first state attribute indicates that a new object appears;
for any determined second type video frame, generating index information comprising at least a second state attribute and the frame number of the second type video frame, wherein the second state attribute indicates that an object that has appeared disappears;
and packaging each index information and the video data into code stream data.
4. The method according to claim 3, wherein, for each video frame in the first type video frames and the second type video frames, the index information of the video frame further comprises at least the number of objects of the video frame and the position information of the objects in the video frame, wherein the number of objects of the video frame indicates the number of objects that newly appear or are about to disappear in the video frame.
5. The method according to claim 3 or 4, characterized in that the method further comprises:
decapsulating the code stream data to obtain each index information;
and playing back each video frame in the first type video frame and the second type video frame according to each index information.
6. The method according to claim 5, wherein the playing back each video frame of the first type and the second type according to each index information comprises:
acquiring each first type video frame and/or second type video frame represented by the frame number of the video frame in each index information based on the frame number of the video frame in each index information to obtain each target video frame;
and for each target video frame, performing associated playback on the target video frame and the first state attribute and/or the second state attribute corresponding to the target video frame.
7. The method of claim 6, further comprising:
after acquiring a detail display message from a user for a specified target video frame, acquiring, according to the frame number of the specified target video frame, each video frame whose frame number difference from the specified target video frame is within a first preset frame number difference range, to obtain a target video frame segment corresponding to the specified target video frame;
and performing associated playback on the target video frame segment and the state attribute corresponding to the target video frame segment, wherein the state attribute corresponding to the target video frame segment is a first state attribute and/or a second state attribute of the specified target video frame, or the state attribute corresponding to the target video frame segment is a first state attribute of a first type of video frame and/or a second state attribute of a second type of video frame included in the target video frame segment.
8. The method according to claim 5, wherein the playing back each video frame of the first type and the second type according to each index information comprises:
aiming at each index information, acquiring each video frame of which the difference value with the frame number of the video frame in the index information is within a second preset frame number difference value range according to the frame number of the video frame in the index information to obtain a target video frame set corresponding to the index information;
and for each target video frame set, performing associated playback on the target video frame set and the state attribute corresponding to the target video frame set, wherein for each target video frame set, the state attribute corresponding to the target video frame set is a first state attribute and/or a second state attribute in index information used for determining the target video frame set, or the state attribute corresponding to the target video frame set is a first state attribute of a first type of video frame and/or a second state attribute of a second type of video frame included in the target video frame set.
9. The method according to claim 1, wherein, for any object, the attribute information of the object further comprises image features of the object; and the tracking each object according to the attribute information of each object to obtain the tracking result of each object comprises:
calculating the cosine distance between the image features of every two objects in adjacent video frames according to the image features of the objects;
and associating the same objects between adjacent video frames according to the position information of each object and the cosine distances, to obtain the tracking result of each object.
10. An endoscopic system, comprising:
an endoscope, a light source apparatus, and an imaging system host;
the endoscope is used for acquiring image data of a subject;
the light source equipment is used for providing a shooting light source for the endoscope;
the camera system host is configured to implement the object detection method according to any one of claims 1 to 9 when running.
11. The system of claim 10, wherein the endoscopic system further comprises: a display device and a storage device;
the camera system host is also used for sending the image data acquired by the endoscope to the display equipment and storing the processed image data into the storage equipment;
the display device is used for displaying the image data and playing back each video frame in the first type video frame and the second type video frame;
the storage device is used for storing the processed image data.
12. An object detection apparatus, characterized in that the apparatus comprises:
the video data acquisition module is used for acquiring video data to be detected;
the attribute information determining module is used for respectively carrying out target detection on each video frame in the video data based on a pre-trained deep learning target detection network to obtain attribute information of an object in each video frame, wherein, for any object, the attribute information of the object comprises position information of the object;
the tracking result determining module is used for respectively tracking each object according to the attribute information of each object to obtain the tracking result of each object;
and the video frame marking module is used for determining, according to the tracking result of each object, a video frame in which a new object appears compared with the previous frame as a first type video frame, and a video frame in which an object that has appeared is about to disappear compared with the next frame as a second type video frame.
13. An electronic device comprising a processor and a memory;
the memory is used for storing a computer program;
the processor is configured to implement the object detection method according to any one of claims 1 to 9 when executing the program stored in the memory.
14. A computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, implements the object detection method according to any one of claims 1 to 9.
CN202110348217.1A 2021-03-31 2021-03-31 Object detection method, object detection device, endoscope system, electronic device, and storage medium Active CN112967276B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110348217.1A CN112967276B (en) 2021-03-31 2021-03-31 Object detection method, object detection device, endoscope system, electronic device, and storage medium


Publications (2)

Publication Number Publication Date
CN112967276A true CN112967276A (en) 2021-06-15
CN112967276B CN112967276B (en) 2023-09-05

Family

ID=76280602

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110348217.1A Active CN112967276B (en) 2021-03-31 2021-03-31 Object detection method, object detection device, endoscope system, electronic device, and storage medium

Country Status (1)

Country Link
CN (1) CN112967276B (en)



Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102551646A (en) * 2010-10-21 2012-07-11 富士胶片株式会社 Electronic endoscope system and processor thereof, high-sensitivity method for fluorescence image
CN104809437A (en) * 2015-04-28 2015-07-29 无锡赛睿科技有限公司 Real-time video based vehicle detecting and tracking method
JP2018068863A (en) * 2016-11-02 2018-05-10 カリーナシステム株式会社 Gauze detection system
CN108229294A (en) * 2017-09-08 2018-06-29 北京市商汤科技开发有限公司 A kind of motion capture method, apparatus, electronic equipment and storage medium
US20190347806A1 (en) * 2018-05-09 2019-11-14 Figure Eight Technologies, Inc. Video object tracking
CN108932496A (en) * 2018-07-03 2018-12-04 北京佳格天地科技有限公司 The quantity statistics method and device of object in region
CN110866428A (en) * 2018-08-28 2020-03-06 杭州海康威视数字技术股份有限公司 Target tracking method and device, electronic equipment and storage medium
CN110717414A (en) * 2019-09-24 2020-01-21 青岛海信网络科技股份有限公司 Target detection tracking method, device and equipment
CN111192296A (en) * 2019-12-30 2020-05-22 长沙品先信息技术有限公司 Pedestrian multi-target detection and tracking method based on video monitoring
CN111311557A (en) * 2020-01-23 2020-06-19 腾讯科技(深圳)有限公司 Endoscope image processing method, endoscope image processing device, electronic apparatus, and storage medium
CN111882586A (en) * 2020-06-23 2020-11-03 浙江工商大学 Multi-actor target tracking method oriented to theater environment
CN112270347A (en) * 2020-10-20 2021-01-26 西安工程大学 Medical waste classification detection method based on improved SSD

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
LINGXIAO HE, ET.AL: "FastReID: a pytorch toolbox for general instance re-identification", 《ARXIV:2006.02631V4》, pages 1 - 10 *
MIAO HE, ET.AL: "Fast online multi-pedestrian tracking via integrating motion model and deep appearance model", 《IEEE ACCESS》, vol. 7, pages 114 - 4 *
MUSTAFA AYAZOGLU, ET.AL: "Dynamic subspace-based coordinated multicamera tracking", 《INTERNATIONAL CONFERENCE ON COMPUTER VISION》, pages 2462 - 2469 *
姚轩: "基于目标检测的跟踪算法研究", 《中国优秀硕士学位论文全文数据库 (信息科技辑)》, no. 2, pages 138 - 1840 *
步洲: "基于多摄像机网络的行人再识别系统的研究与实现", 《中国优秀硕士学位论文全文数据库 (信息科技辑)》, no. 2, pages 138 - 2124 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114638963A (en) * 2022-05-18 2022-06-17 青岛美迪康数字工程有限公司 Method and device for identifying and tracking suspicious tissues in endoscopy
CN116030418A (en) * 2023-02-14 2023-04-28 北京建工集团有限责任公司 Automobile lifting line state monitoring system and method
CN116030418B (en) * 2023-02-14 2023-09-12 北京建工集团有限责任公司 Automobile lifting line state monitoring system and method

Also Published As

Publication number Publication date
CN112967276B (en) 2023-09-05

Similar Documents

Publication Publication Date Title
US10354382B2 (en) Method and device for examining or imaging an interior surface of a cavity
EP1769729B1 (en) System and method for in-vivo feature detection
US7641609B2 (en) Endoscope device and navigation method for endoscope device
WO2014155778A1 (en) Image processing device, endoscopic device, program and image processing method
CN112967276B (en) Object detection method, object detection device, endoscope system, electronic device, and storage medium
US20080303898A1 (en) Endoscopic image processing apparatus
CN111295127B (en) Examination support device, endoscope device, and recording medium
WO2023103467A1 (en) Image processing method, apparatus and device
KR20190090150A (en) An apparatus for creating description of capsule endoscopy and method thereof, a method for searching capsule endoscopy image based on decsription, an apparatus for monitoring capsule endoscopy
JP6883662B2 (en) Endoscope processor, information processing device, endoscope system, program and information processing method
JPWO2019130924A1 (en) Image processing equipment, endoscopic systems, image processing methods, and programs
Cao et al. Computer-aided detection of diagnostic and therapeutic operations in colonoscopy videos
CN114332019A (en) Endoscope image detection assistance system, method, medium, and electronic apparatus
JP7081862B1 (en) Surgery support system, surgery support method, and surgery support program
JPWO2019130868A1 (en) Image processing equipment, processor equipment, endoscopic systems, image processing methods, and programs
WO2021139672A1 (en) Medical operation assisting method, apparatus, and device, and computer storage medium
JP4022114B2 (en) Endoscope device
EP4093025A1 (en) Medical image processing system
JP2007307395A (en) Image display device, image display method and image display program
JP7493285B2 (en) Information processing device, information processing method, and computer program
CN117058139B (en) Lower digestive tract focus tracking and key focus selecting method and system
JPWO2019088008A1 (en) Image processing equipment, image processing methods, programs, and endoscopic systems
JP2007307397A (en) Image display device, image display method and image display program
EP4285810A1 (en) Medical image processing device, method, and program
JP7154274B2 (en) Endoscope processor, information processing device, endoscope system, program and information processing method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant