CN112967276B - Object detection method, object detection device, endoscope system, electronic device, and storage medium


Info

Publication number
CN112967276B
Authority
CN
China
Prior art keywords: video frame, frame, target, type, index information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110348217.1A
Other languages
Chinese (zh)
Other versions
CN112967276A
Inventor
王晶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN202110348217.1A
Publication of CN112967276A
Application granted
Publication of CN112967276B
Legal status: Active

Classifications

    • G PHYSICS
        • G06 COMPUTING; CALCULATING OR COUNTING
            • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
                • G06T 7/00 Image analysis
                    • G06T 7/0002 Inspection of images, e.g. flaw detection
                        • G06T 7/0012 Biomedical image inspection
                • G06T 2207/00 Indexing scheme for image analysis or image enhancement
                    • G06T 2207/10 Image acquisition modality
                        • G06T 2207/10016 Video; Image sequence
                        • G06T 2207/10068 Endoscopic image
                    • G06T 2207/20 Special algorithmic details
                        • G06T 2207/20081 Training; Learning
            • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
                • G06N 3/00 Computing arrangements based on biological models
                    • G06N 3/02 Neural networks
                        • G06N 3/04 Architecture, e.g. interconnection topology
                            • G06N 3/045 Combinations of networks
                        • G06N 3/08 Learning methods
            • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
                • G06V 10/00 Arrangements for image or video recognition or understanding
                    • G06V 10/40 Extraction of image or video features
                        • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
                • G06V 20/00 Scenes; Scene-specific elements
                    • G06V 20/40 Scenes; Scene-specific elements in video content
                        • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Abstract

The embodiment of the application provides an object detection method, an object detection device, an endoscope system, an electronic device and a storage medium. Video data to be detected are obtained; target detection is performed on each video frame in the video data based on a pre-trained deep learning target detection network to obtain attribute information of the objects in each video frame; each object is tracked according to its attribute information to obtain a tracking result for each object; and, according to the tracking results, a video frame in which a new object appears compared with the previous frame is determined as a first-type video frame, and a video frame in which an object that has appeared is about to disappear compared with the next frame is determined as a second-type video frame. Besides detecting the position of the object, the first-type and second-type video frames make it clear when a new object appears and when an existing object disappears, so the situation where an object such as gauze is left in the body cavity of a patient can be reduced.

Description

Object detection method, object detection device, endoscope system, electronic device, and storage medium
Technical Field
The present application relates to the field of image processing technology, and in particular, to an object detection method, an object detection device, an endoscope system, an electronic device, and a storage medium.
Background
An endoscope is a commonly used medical device consisting of a light guide structure and a set of lenses. It enters the human body through a natural orifice or a small incision and, through the external imaging equipment it feeds, is used to examine and surgically treat human organs or tissues. Compared with open surgery, endoscopic surgery causes smaller wounds and allows quicker recovery, and is therefore favored clinically by both patients and doctors.
In surgical or diagnostic procedures using an endoscope, one or more pieces of surgical gauze are sometimes required. For example, surgical gauze may be placed in the body cavity around the anatomical region to absorb blood or other body fluids that may exude. Surgical gauze poses a risk to the patient if it is left in the body cavity after the procedure is completed. How to detect gauze effectively, so as to reduce the situations in which gauze is left in the body cavity of a patient, has therefore become a problem to be solved urgently.
Disclosure of Invention
An object of the embodiments of the present application is to provide an object detection method, an apparatus, an endoscope system, an electronic device and a storage medium, so as to reduce the situations in which an object such as gauze remains in the body cavity of a patient. The specific technical solution is as follows:
In a first aspect, an embodiment of the present application provides an object detection method, the method including: acquiring video data to be detected; performing target detection on each video frame in the video data based on a pre-trained deep learning target detection network to obtain attribute information of the objects in each video frame, where the attribute information of any object includes the position information of the object; tracking each object according to its attribute information to obtain a tracking result for each object; and, according to the tracking result of each object, determining a video frame in which a new object appears compared with the previous frame as a first-type video frame, and determining a video frame in which an object that has appeared is about to disappear compared with the next frame as a second-type video frame.
In one possible embodiment, the object is gauze and the deep learning target detection network is a gauze detection network; performing target detection on each video frame in the video data based on the pre-trained deep learning target detection network to obtain the attribute information of the objects in each video frame includes: extracting features from each video frame in the video data using the feature extraction network of the gauze detection network to obtain the image features of each video frame; and analyzing the image features of each video frame using the detection head network of the gauze detection network to obtain the attribute information of the gauze in each video frame.
In one possible embodiment, the method further comprises: for any determined first-type video frame, generating index information including at least a first state attribute and the frame number of the first-type video frame, the first state attribute indicating that a new object appears; for any determined second-type video frame, generating index information including at least a second state attribute and the frame number of the second-type video frame, the second state attribute indicating that an object that has appeared disappears; and encapsulating each piece of index information and the video data into code stream data.
In one possible implementation, for each video frame of the first type and the second type, the index information of the video frame further includes at least the number of objects of the video frame and the position information of the objects in the video frame, where the number of objects of the video frame indicates how many objects are about to disappear from, or newly appear in, that video frame.
In one possible embodiment, the method further comprises: decapsulating the code stream data to obtain each piece of index information; and playing back each video frame of the first type and the second type according to the index information.
In one possible implementation, playing back each video frame of the first type and the second type according to the index information includes: obtaining, based on the frame numbers in the pieces of index information, the first-type and/or second-type video frames identified by those frame numbers, so as to obtain the target video frames; and, for each target video frame, playing back the target video frame in association with the first state attribute and/or the second state attribute corresponding to it.
In one possible embodiment, the method further comprises: after a detailed-display message of a user for a specified target video frame is received, obtaining, according to the frame number of the specified target video frame, each video frame whose frame number differs from it by no more than a first preset frame-number difference, so as to obtain a target video frame segment corresponding to the specified target video frame; and playing back the target video frame segment in association with the state attribute corresponding to it, where the state attribute corresponding to the target video frame segment is the first state attribute and/or the second state attribute of the specified target video frame, or the first state attribute and/or the second state attribute of the first-type video frames included in the target video frame segment.
In one possible implementation, playing back each video frame of the first type and the second type according to the index information includes: for each piece of index information, obtaining, according to the frame number in that index information, each video frame whose frame number differs from it by no more than a second preset frame-number difference, so as to obtain a target video frame set corresponding to that index information; and playing back each target video frame set in association with the state attribute corresponding to it, where, for each target video frame set, the corresponding state attribute is the first state attribute and/or the second state attribute in the index information used to determine that set, or the first state attribute and/or the second state attribute of the first-type video frames included in that set.
In one possible implementation, for any object, the attribute information of the object further includes the image feature of the object; tracking each object according to its attribute information to obtain a tracking result for each object includes: calculating, according to the image features of the objects, the cosine distance between the image features of every two objects in adjacent video frames; and associating the same objects between adjacent video frames according to the position information of the objects and the cosine distances, so as to obtain the tracking result of each object.
In a second aspect, embodiments of the present application provide an endoscope system including: an endoscope, a light source device, and an imaging system host; the endoscope is used for collecting image data of a subject; the light source device is used for providing a shooting light source for the endoscope; the camera system host is used for realizing any one of the object detection methods in the application during operation.
In one possible embodiment, the endoscope system further comprises: a display device and a storage device; the camera system host is also used for sending the image data acquired by the endoscope to the display equipment and storing the processed image data into the storage equipment; the display device is used for displaying the image data and playing back each video frame in the first type video frame and the second type video frame; the storage device is used for storing the processed image data.
In a third aspect, an embodiment of the present application provides an object detection apparatus, including: the video data acquisition module is used for acquiring video data to be detected; the attribute information determining module is used for respectively carrying out target detection on each video frame in the video data based on a pre-trained deep learning target detection network to obtain attribute information of an object in each video frame, wherein the attribute information of any object comprises the position information of the object; the tracking result determining module is used for respectively tracking each object according to the attribute information of each object to obtain the tracking result of each object; and the video frame labeling module is used for determining a video frame with a new object appearing compared with the previous frame as a first type of video frame and determining a video frame with an object appearing and about to disappear compared with the next frame as a second type of video frame according to the tracking result of each object.
In one possible embodiment, the object is gauze and the deep learning object detection network is a gauze detection network; the attribute information determining module is specifically configured to: respectively extracting the characteristics of each video frame in the video data by utilizing a characteristic extraction network of the gauze detection network to obtain the image characteristics of each video frame; and analyzing the image characteristics of each video frame by using a detection head network of the gauze detection network to obtain the attribute information of the gauze in each video frame.
In one possible embodiment, the apparatus further comprises: the index information generation module is used for generating index information at least comprising a first state attribute and a frame number of the first type video frame for any determined first type video frame, wherein the first state attribute indicates that a new object appears; generating index information at least comprising a second state attribute and a frame number of the second type video frame aiming at any determined second type video frame, wherein the second state attribute indicates that the object which has appeared disappears; and packaging each index information and the video data into code stream data.
In one possible implementation, for each video frame of the first type and the second type, the index information of the video frame further includes at least the number of objects of the video frame and the position information of the objects in the video frame, where the number of objects of the video frame indicates how many objects are about to disappear from, or newly appear in, that video frame.
In one possible embodiment, the apparatus further comprises: a data unpacking module, configured to unpack the code stream data to obtain each index information; and the video frame display module is used for playing back each video frame in the first type of video frames and the second type of video frames according to each index information.
In one possible implementation manner, the video frame display module is specifically configured to: acquiring each first type video frame and/or each second type video frame represented by the frame number of the video frame in each index information based on the frame number of the video frame in each index information to obtain each target video frame; and for each target video frame, performing associated playback on the target video frame and the first state attribute and/or the second state attribute corresponding to the target video frame.
In one possible embodiment, the apparatus further comprises: a target video frame segment determining module, configured to, after a detailed-display message of a user for a specified target video frame is received, obtain, according to the frame number of the specified target video frame, each video frame whose frame number differs from it by no more than a first preset frame-number difference, so as to obtain a target video frame segment corresponding to the specified target video frame; and an associated playing module, configured to play back the target video frame segment in association with the state attribute corresponding to it, where the state attribute corresponding to the target video frame segment is the first state attribute and/or the second state attribute of the specified target video frame, or the first state attribute and/or the second state attribute of the first-type video frames included in the target video frame segment.
In one possible implementation, the video frame display module is specifically configured to: for each piece of index information, obtain, according to the frame number in that index information, each video frame whose frame number differs from it by no more than a second preset frame-number difference, so as to obtain a target video frame set corresponding to that index information; and play back each target video frame set in association with the state attribute corresponding to it, where, for each target video frame set, the corresponding state attribute is the first state attribute and/or the second state attribute in the index information used to determine that set, or the first state attribute and/or the second state attribute of the first-type video frames included in that set.
In a possible implementation manner, the tracking result determining module is specifically configured to: according to the image characteristics of each object, calculating the cosine distance of the image characteristics of each two objects between adjacent video frames; and associating the same objects among the adjacent video frames according to the position information of the objects and the cosine distances to obtain tracking results of the objects.
In a fourth aspect, an embodiment of the present application provides an electronic device, including a processor and a memory; the memory is used for storing a computer program; the processor is configured to implement any one of the object detection methods according to the present application when executing the program stored in the memory.
In a fifth aspect, an embodiment of the present application provides a computer readable storage medium having stored therein a computer program which, when executed by a processor, implements the object detection method according to any of the present application.
The embodiments of the application have the following beneficial effects: the object detection method, apparatus, endoscope system, electronic device and storage medium provided by the embodiments of the application acquire video data to be detected; perform target detection on each video frame in the video data based on a pre-trained deep learning target detection network to obtain attribute information of the objects in each video frame, where the attribute information of any object includes its position information; track each object according to its attribute information to obtain a tracking result for each object; and, according to the tracking results, determine a video frame in which a new object appears compared with the previous frame as a first-type video frame and a video frame in which an object that has appeared is about to disappear compared with the next frame as a second-type video frame. In addition to detecting the position of the object, the video frame in which a new object appears is treated as a first-type video frame and the video frame in which an object that has appeared is about to disappear as a second-type video frame; from these frames, the moment a new object appears and the moment an existing object disappears can be clearly determined, so the situation where an object such as gauze is left in a patient's body cavity can be reduced. Of course, it is not necessary for any product or method implementing the application to achieve all of the advantages set forth above at the same time.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a first schematic view of an endoscope system according to an embodiment of the present application;
FIG. 2 is a second schematic view of an endoscope system according to an embodiment of the present application;
FIG. 3 is a schematic view of an improved portion of an endoscope system in accordance with an embodiment of the present application;
FIG. 4 is a schematic diagram of target video frame extraction according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a training deep learning object detection network according to an embodiment of the present application;
FIG. 6 is a schematic illustration of sample image annotation according to an embodiment of the application;
FIG. 7 is a schematic diagram of a deep learning object detection network structure according to an embodiment of the present application;
FIG. 8 is a schematic diagram of object tracking according to an embodiment of the present application;
FIG. 9 is a schematic diagram illustrating a first type of video frame and a second type of video frame determination method according to an embodiment of the present application;
FIG. 10 is a first diagram illustrating a method for encapsulating index information according to an embodiment of the present application;
FIG. 11 is a second diagram illustrating a method for encapsulating index information according to an embodiment of the present application;
FIG. 12 is a schematic diagram showing a first type of video frame and a second type of video frame in an embodiment of the present application;
FIG. 13 is a first schematic diagram of an object detection method according to an embodiment of the present application;
FIG. 14 is a second schematic diagram of an object detection method according to an embodiment of the present application;
FIG. 15 is a third diagram illustrating a method for encapsulating index information according to an embodiment of the present application;
FIG. 16 is a third schematic diagram of an object detection method according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
First, terms in the present application will be explained:
Target detection: given an image, find the objects of interest in it and determine their locations and classes.
Multi-target tracking: given a video, a plurality of objects of interest are located simultaneously, and their respective IDs are maintained and their motion trajectories recorded.
In the related art, the position of gauze in a video image is obtained by detecting, based on computer vision technology, the surgical video images collected by the endoscope. However, only the position of the gauze in the image is detected, which does not effectively remind medical staff whether gauze has been left in the patient's body cavity. For example, the shooting range of the endoscope is limited, and during surgery the gauze may leave the endoscope's field of view because of tissue compression, collision with medical instruments and the like, so that the gauze is left in the body cavity of the patient.
In view of this, an embodiment of the present application also provides an endoscope system, referring to fig. 1, including: an endoscope, a light source device, and an imaging system host; the endoscope is used for collecting image data of a subject; the light source device is used for providing a shooting light source for the endoscope; the camera system host is used for realizing any object detection method in the application when running.
In one possible embodiment, the above endoscope system further includes: a display device and a storage device; the camera system host is also used for sending the image data acquired by the endoscope to the display equipment and storing the processed image data into the storage equipment; the display device is used for displaying the image data and playing back each video frame in the first type video frame and the second type video frame; the storage device is used for storing the processed image data.
The endoscope system includes an endoscope, a light source device, an imaging system host, a display device and a storage device. The endoscope can be inserted into a subject such as a patient to capture in-vivo images, and the captured images are output to the external display device and storage device. By observing the in-vivo image displayed by the display device, the user checks for bleeding, tumors and abnormal sites, which are the target sites, and is provided with a real-time image for the surgical treatment. The user may perform postoperative review and surgical training by accessing the video data in the storage device. The endoscope is inserted into the subject to capture the observation site and generate image data. The light source device supplies the illumination light emitted from the front end of the endoscope. The imaging system host performs the image data processing described above on the image data collected by the endoscope and controls the operation of the entire endoscope system in a unified manner. The display device displays the image corresponding to the image data from the endoscope system host and plays back each video frame of the first type and the second type. This playback may be associated playback of a target video frame together with its corresponding first state attribute and/or second state attribute, associated playback of a target video frame segment together with its corresponding state attribute, associated playback of a target video frame set together with its corresponding state attribute, and so on; the specific playback manner is described in the relevant parts of the method embodiments and is not repeated here. The storage device stores the image data processed by the endoscope system host.
In one possible embodiment, referring to fig. 2, the endoscope includes an image capturing optical unit, a processing unit, an imaging unit, and a first operation unit, the light source device includes an illumination control unit and an illumination unit, and the image capturing system host includes a control unit, a second operation unit, an image input unit, an image processing unit, an intelligent processing unit, and a video encoding unit.
The endoscope has an imaging optical unit, an imaging unit, a processing unit and a first operation unit. The imaging optical unit condenses light from the observation site and may be constituted by one or more lenses. The imaging unit photoelectrically converts the light received by each pixel to generate image data, and may be composed of an image sensor such as a CMOS (complementary metal oxide semiconductor) or CCD (charge-coupled device) sensor. The processing unit converts the image data generated by the imaging unit into a digital signal and transmits the converted signal to the imaging system host. The first operation unit receives instruction signals for switching the operation of the endoscope and for switching the illumination light of the light source device, and outputs the instruction signals to the imaging system host. The first operation unit includes, but is not limited to, a switch, a button and a touch panel.
The light source device includes a lighting control unit and a lighting unit. The illumination control unit receives an instruction signal of the camera system host to control the illumination unit to provide illumination light to the endoscope.
The image capture system host processes image data received from the endoscope and transmits it to the display device and the storage device. The display device and the storage device may be external devices. The camera system host comprises an image input unit, an image processing unit, an intelligent processing unit, a video coding unit, a control unit and a second operation unit. The image input unit receives a signal transmitted from the endoscope and transmits the received signal to the image processing unit. The image processing unit performs ISP (Image Signal Processor, image signal processing) operations on the image of the image input unit, including but not limited to brightness conversion, sharpening, de-moire, scaling, and the like. The image processing unit transmits the image after ISP operation to the intelligent processing unit, the video encoding unit or the display device. The intelligent processing unit performs intelligent analysis on the image after the operation of the image processing unit ISP, including but not limited to scene classification based on deep learning, instrument head detection, gauze detection, moire classification and dense fog classification. And the image processed by the intelligent processing unit is transmitted to the image processing unit or the video coding unit. The image processing unit processes the image processed by the intelligent processing unit in a manner including but not limited to brightness conversion, moire removal, frame overlapping and scaling. The video coding unit carries out coding compression on the image processed by the image processing unit or the intelligent processing unit and transmits the image to the storage device. The control unit controls various portions of the endoscope system including, but not limited to, illumination mode of the light source, image processing mode, intelligent processing mode, and video encoding mode. The second operation means includes, but is not limited to, a switch, a button, and a touch panel, receives an external instruction signal, and outputs the received instruction signal to the control means.
The application relates to an improvement of an intelligent processing unit and a video coding unit, wherein the intelligent processing unit performs intelligent analysis on an image processed by the image processing unit, including but not limited to instrument head detection and gauze detection. And the image processed by the intelligent processing unit is transmitted to the image processing unit or the video coding unit. The image processing unit processes the image processed by the intelligent processing unit in a manner including but not limited to brightness conversion, moire removal, frame overlapping and scaling. The video coding unit carries out coding compression on the image processed by the image processing unit or the intelligent processing unit and transmits the image to the storage device.
During the operation, gauze is detected in each video frame and tracked across consecutive frames. If the current frame is a target video frame (gauze has appeared in the current frame compared with the previous frame, or gauze present in the previous frame has disappeared in the current frame), several frames before and after it are selected for storage or marking in the video stream, and the position of the gauze in the image is marked. After the operation is finished (before suturing), the target video frame sequences are displayed on the screen so that the doctor can trace back and check the gauze, which speeds up checking the gauze count and reduces the risk of gauze being left behind.
In one example, as shown in FIG. 3, the improvement may be embodied in three parts: the image acquisition part acquires the endoscope video; the image processing part processes the input endoscope video, detecting and tracking the gauze to obtain the gauze position and the marked or stored target video frame sequence; and the image display part displays the extracted gauze target video frame sequence for the doctor to use. A detailed description follows.
During the operation, the real-time video data acquired by the image acquisition part is first detected with the gauze detection model, and the detected gauze is then tracked. If a frame is a target video frame (gauze appears or disappears in the current frame compared with the previous frame), several frames before and after it are selected for marking (for example, frame a to frame b is where gauze appears or disappears) or for storage, as shown in FIG. 4.
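For illustration only, the following Python sketch shows how this flow (per-frame detection, cross-frame tracking, and keeping a window of frames around each target video frame) might be organized. The helper names detect_gauze and track, the representation of tracks as ID sets, and the window size N_CONTEXT are assumptions and not components specified by the application.

    # Illustrative sketch, not the application's actual implementation.
    # detect_gauze(frame) and track(detections) are hypothetical helpers;
    # track() is assumed to return, per frame, an object with an 'ids' set.
    N_CONTEXT = 5  # number of frames kept before/after a target frame (assumed)

    def collect_target_segments(frames, detect_gauze, track):
        detections = [detect_gauze(f) for f in frames]       # per-frame gauze detection
        tracks = track(detections)                           # associate detections across frames
        segments = []
        for k in range(1, len(frames)):
            new_ids = tracks[k].ids - tracks[k - 1].ids      # gauze newly appearing in frame k
            lost_ids = tracks[k - 1].ids - tracks[k].ids     # gauze that disappears after frame k-1
            if new_ids or lost_ids:                          # frame k-1 / k is a target video frame
                lo = max(0, k - N_CONTEXT)
                hi = min(len(frames), k + N_CONTEXT + 1)
                segments.append((lo, hi, new_ids, lost_ids)) # store or mark frames lo..hi-1
        return segments

The returned segments correspond to the frame ranges that would be marked in the video stream or stored for postoperative review.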
The application adopts a deep learning method, that is, a convolutional neural network is used to learn image features. Deep-learning-based gauze detection is divided into two stages: training and testing. The training stage produces the gauze detection model, and the testing stage uses the gauze detection model to detect the input image. During network training, the inputs are the training images, labels, loss function and network structure, and the output is a detection model; during testing, forward inference is performed on the test image using the trained detection model to obtain the gauze detection result, as shown in FIG. 5.
In one example, gauze detection requires designing the data calibration scheme, the loss function and the network structure; possible embodiments are described below.
(1) Calibration: gauze target detection requires defining the gauze label. One possible calibration method is the minimum bounding rectangle of the region where the gauze is located, as shown in fig. 6.
(2) Loss function: the usual detection loss function is largely divided into two parts, a localization loss for object localization and a classification loss for object classification. Regression is performed between the target box (the four points of the rectangular box calibrated for the gauze) and the box predicted by the network; a sketch of such a two-part loss is given after this list.
(3) Network structure: the deep learning target detection network mainly comprises two parts, namely a feature extraction network and a detection head network. An example of a possible network structure is shown in fig. 7.
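As a sketch of the two-part loss mentioned in item (2), the following PyTorch-style function combines a smooth L1 regression term over the box coordinates with a cross-entropy classification term. The weighting factor, tensor shapes and function name are assumptions for illustration and are not prescribed by the application.

    import torch
    import torch.nn.functional as F

    def detection_loss(pred_boxes, gt_boxes, pred_logits, gt_labels, loc_weight=1.0):
        # Localization loss: regression between the predicted boxes and the
        # calibrated gauze boxes, as described in item (2).
        loc_loss = F.smooth_l1_loss(pred_boxes, gt_boxes)
        # Classification loss: gauze vs. background.
        cls_loss = F.cross_entropy(pred_logits, gt_labels)
        return loc_weight * loc_loss + cls_loss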
The application adopts a deep learning method, that is, a convolutional neural network is used to learn image features. A tracking method based on object feature modeling may be adopted: for each detected object, a feature extraction network is first used to extract its appearance feature (which can be understood as a feature code of the object), and object association is then performed. For example, the cosine distance between the image features of every two objects in adjacent video frames may be calculated, and the two objects whose cosine distance is the shortest and also below a preset distance threshold are considered to be the same object, as shown in fig. 8, finally yielding the tracking result. The preset distance threshold is an empirical or experimental value; in one example, multiple groups of negative-sample objects may be selected in advance, where the two objects in the same group are not the same object, the cosine distance of the two objects in each group is calculated, and the average of these cosine distances is taken as the preset distance threshold. Three cases can occur in gauze tracking: (1) among the object tracks in frame k-1, the object detected in frame k is found, indicating that the object is tracked normally; (2) an object detected in frame k is not found among the object tracks in frame k-1, indicating that the object newly appears in frame k; (3) an object exists in frame k-1 but no object in frame k is associated with it, indicating that this object has disappeared in frame k.
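A minimal sketch of this appearance-based association, assuming each detected object is represented as a dict holding a 'feature' vector, is given below. It uses a simple greedy nearest-neighbor rule with the preset distance threshold, which is only one possible realization of the association step.

    import numpy as np

    def cosine_distance(f1, f2):
        # Cosine distance between two appearance feature vectors.
        return 1.0 - np.dot(f1, f2) / (np.linalg.norm(f1) * np.linalg.norm(f2))

    def associate(prev_objects, curr_objects, dist_threshold):
        """Greedy association between adjacent frames (illustrative only).

        prev_objects / curr_objects: lists of dicts with a 'feature' vector.
        Returns (matches, new_objects, lost_objects) as index lists.
        """
        matches, used_prev = [], set()
        for j, curr in enumerate(curr_objects):
            dists = [(cosine_distance(prev['feature'], curr['feature']), i)
                     for i, prev in enumerate(prev_objects) if i not in used_prev]
            if dists:
                d, i = min(dists)
                if d < dist_threshold:       # shortest distance and below threshold -> same object
                    matches.append((i, j))
                    used_prev.add(i)
                    continue
            # no track in frame k-1 matches -> object j newly appears in frame k
        matched_curr = {j for _, j in matches}
        new_objects = [j for j in range(len(curr_objects)) if j not in matched_curr]
        lost_objects = [i for i in range(len(prev_objects)) if i not in used_prev]
        return matches, new_objects, lost_objects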
In the application, a target video frame is defined as a frame in which gauze newly appears compared with the previous frame, or a frame whose gauze disappears in the next frame; as shown in fig. 9, frame k-1 and frame k are both target video frames. After a target video frame is found, the relevant information needs to be described and marked; after marking, the index information carrying the mark is placed in the code stream so that the relevant information can be stored, encoded and transmitted. The marking method is shown in fig. 10: the frame number indicates the frame number of the target video frame; the object number indicates the number of objects that are about to disappear from or newly appear in the video frame; the position attribute indicates the position of the gauze (recorded as the top-left corner coordinates plus width and height); and the state attribute indicates whether the gauze is about to disappear or newly appears (0 means about to disappear, 1 means newly appears). In one example, the sequence number of the target video frame may also be marked, indicating which target video frame it is. In the example shown in fig. 9, frame k-1 and frame k are the 1st and 2nd target video frames, and part of the code stream data may then be as shown in fig. 11, where, for key frame 1, k-1 is the frame number, 1 is the object number, x, y, w, h is the position attribute (x and y are the horizontal and vertical coordinates of the top-left corner of the object's target box, w is its width and h is its height), and 0 is the state attribute, indicating that the object is about to disappear. After the operation is finished (before suturing), the image display part displays the gauze target video frame sequence as the key gauze positions for the doctor to trace back and check, as shown in fig. 12.
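For illustration, the mark of fig. 10 and fig. 11 could be represented and serialized as follows. The field order follows the description above (frame number, object number, then x, y, w, h and state attribute per object, with 0 meaning about to disappear and 1 meaning newly appearing), while the concrete byte layout and the example values are assumptions rather than a format defined by the application.

    import struct
    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class IndexInfo:
        frame_number: int
        objects: List[Tuple[int, int, int, int, int]]  # (x, y, w, h, state) per object

        def pack(self) -> bytes:
            # Hypothetical layout: frame number, object count, then x, y, w, h, state per object.
            data = struct.pack('<II', self.frame_number, len(self.objects))
            for x, y, w, h, state in self.objects:
                data += struct.pack('<IIIIB', x, y, w, h, state)
            return data

    # Example loosely corresponding to key frame 1 in fig. 11: frame k-1, one object
    # at (x, y, w, h), state 0 (about to disappear); all values are placeholders.
    info = IndexInfo(frame_number=100, objects=[(320, 240, 64, 48, 0)])
    payload = info.pack()  # would be encapsulated into the code stream alongside the video data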
In the embodiment of the application, deep learning is used to detect and track the gauze during endoscopic surgery and to extract the target video frame sequence, which is displayed after the operation. This provides the doctor with the key gauze positions as a reference for tracing back and checking, speeds up the postoperative check of the gauze count, and reduces the risk of gauze being left behind and of the medical accidents this can cause.
The embodiment of the application also provides an object detection method, referring to fig. 13, the method comprises the following steps:
s101, obtaining video data to be detected.
The object detection method of the embodiment of the application can be realized through electronic equipment, and in particular, the electronic equipment can be an endoscope, a hard disk video recorder or other equipment with image processing capability. In one example, the video data to be detected is video data collected by an endoscope.
S102, respectively carrying out target detection on each video frame in the video data based on a pre-trained deep learning target detection network to obtain attribute information of an object in each video frame, wherein the attribute information of any object comprises the position information of the object.
The deep learning target detection network is used for detecting objects in video frames, and the objects in the embodiment of the application include but are not limited to articles such as gauze, catheters and medical adhesive tapes, and the specific types of the objects can be set according to actual detection scenes.
The deep learning object detection network may be any object detection network based on a deep learning algorithm. In one embodiment, the object is gauze, and the deep learning target detection network is a gauze detection network; the deep learning target detection network based on pre-training performs target detection on each video frame in the video data to obtain attribute information of an object in each video frame, and the method comprises the following steps: respectively extracting the characteristics of each video frame in the video data by utilizing the characteristic extraction network of the gauze detection network to obtain the image characteristics of each video frame; and respectively analyzing the image characteristics of each video frame by using the detection head network of the gauze detection network to obtain the attribute information of the gauze in each video frame.
The deep learning target detection network can be a gauze detection network that includes a feature extraction network and a detection head network, where the feature extraction network extracts the image features of a video frame and the detection head network performs operations such as pooling and regression on the image features to obtain the position information of the objects in the video frame. In one example, the deep learning target detection network may be as shown in fig. 7, where an RPN (Region Proposal Network) is used to generate object candidate boxes based on the image features, and ROI (Region of Interest) pooling is used to pool each object candidate box.
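The structure of fig. 7 (feature extraction network, RPN, ROI pooling and detection head) is reminiscent of a Faster R-CNN-style detector. As one illustrative, non-authoritative way to instantiate such a single-class gauze detector with an existing library, a torchvision-based sketch follows; the application itself does not mandate this library, backbone or class layout.

    import torchvision
    from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

    def build_gauze_detector(num_classes=2):  # background + gauze (assumed class layout)
        # Backbone (feature extraction), RPN and ROI pooling come with the
        # torchvision model; only the detection head is replaced for gauze.
        model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights=None)
        in_features = model.roi_heads.box_predictor.cls_score.in_features
        model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
        return model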
The training method of the deep learning target detection network may follow network training methods in the related art. In one example, as shown in fig. 5, pre-training the deep learning target detection network includes a training process and a testing process. Taking gauze as the object, a plurality of sample images are obtained and the gauze position in each sample image is calibrated; a schematic diagram of a possible sample image calibrated with the gauze position is shown in fig. 6. The sample images are divided into a training set and a test set. Training process: the sample images in the training set are input into the deep learning target detection network to obtain predicted gauze position information, the loss is calculated from the predicted gauze position information and the gauze positions calibrated in the sample images, the parameters of the network are adjusted according to the loss, and once the number of training iterations reaches a preset number the testing process begins. Testing process: the deep learning target detection network is verified with the sample images in the test set; if the loss has converged, the trained deep learning target detection network is obtained, and if not, the training process is resumed.
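A condensed sketch of this training and testing loop, assuming a torchvision-style detection model (such as the one sketched above) that returns a dict of losses in training mode, might look as follows; the data loaders, optimizer choice and stopping criterion are assumptions.

    import torch

    def train_gauze_detector(model, train_loader, test_loader, epochs=20, lr=1e-4):
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        for epoch in range(epochs):                       # training process
            model.train()
            for images, targets in train_loader:          # targets: calibrated gauze boxes/labels
                loss_dict = model(images, targets)        # detection model returns losses in train mode
                loss = sum(loss_dict.values())
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            model.eval()                                  # testing process
            with torch.no_grad():
                for images, _ in test_loader:
                    _ = model(images)                     # forward inference -> gauze detections
        return model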
And S103, tracking each object according to the attribute information of each object to obtain a tracking result of each object.
For any object, the attribute information of the object comprises the position information of the object; each object may be tracked based on a related target tracking method according to the position information of each object, so as to obtain a tracking result of each object, where in an example, the tracking result of each object may be a motion track of the object.
In a possible implementation manner, for any object, the attribute information of the object further comprises image characteristics of the object; the tracking of each object according to the attribute information of each object to obtain a tracking result of each object includes:
according to the image characteristics of each object, calculating the cosine distance of the image characteristics of every two objects between adjacent video frames.
The image features of the objects can be obtained directly from the deep learning target detection network, or extracted according to the position information of the objects using a feature extraction network different from the deep learning target detection network. The cosine distance between the image features of every two objects in every pair of adjacent video frames in the video data is calculated. For example, if the K-th video frame contains object 1 and object 2 and the (K+1)-th video frame contains objects a, b and c, the cosine distances of the image-feature pairs (1, a), (1, b), (1, c), (2, a), (2, b) and (2, c) must be calculated. In one example, the cosine distance in the embodiment of the application may be replaced by the Euclidean distance or another measure of image similarity.
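For reference, the cosine distance between two image-feature vectors u and v used above is commonly defined (in standard notation, not reproduced from the application) as:

    d_{\cos}(u, v) = 1 - \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}

so a smaller value indicates more similar features; the Euclidean distance mentioned as an alternative would simply replace this expression.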
And step two, associating the same objects among adjacent video frames according to the position information of the objects and the cosine distances to obtain tracking results of the objects.
And determining each object which is the same target in the adjacent video frames according to each cosine distance, and generating the track of the object which is the same target according to the position information of each object so as to obtain the tracking result of each object.
In one example, the two objects whose cosine distance is the shortest and below the preset distance threshold are considered to be the same target. For example, the K-th video frame contains object 1 and object 2, and the (K+1)-th video frame contains objects a, b and c. If d(1, a) < d(1, b) < d(1, c), where d denotes the cosine distance between image features, and d(1, a) is below the preset distance threshold, object 1 and object a are determined to be the same object, and the position information of object 1 is associated with that of object a to obtain the motion trajectory of the object as its tracking result. If d(2, a) < d(2, b) < d(2, c) but d(2, a) is not below the preset distance threshold, object 2 and object a are determined not to be the same object, that is, object 2 is an object that is about to disappear.
In one example, the associating the same object between adjacent video frames according to the position information of each object and the cosine distance to obtain the tracking result of each object includes:
step 1, for each group of adjacent video frames, all possible object combinations between two video frames of the group of adjacent video frames can be constructed, and the sum of cosine distances of each object combination is determined, wherein the cosine distance of each unassociated object is a preset distance threshold.
In one example, a plurality of groups of negative sample objects may be selected in advance, two objects in the same group of negative sample objects are not the same object, cosine distances of the two objects in each group of negative sample objects are calculated respectively, and an average value of the cosine distances of the negative sample objects in each group is calculated as the preset distance threshold.
And 2, selecting the minimum object combination of the sum of cosine distances in the group of adjacent video frames as a tracking result of the group of adjacent video frames aiming at each group of adjacent video frames.
In one example, a group of adjacent video frames consists of the K-th video frame, containing object 1 and object 2, and the (K+1)-th video frame, containing objects a, b and c. All possible object combinations are: combination A, 1-a and 2-b associated, c unassociated; combination B, 1-a and 2-c associated, b unassociated; combination C, 1-b and 2-a associated, c unassociated; combination D, 1-b and 2-c associated, a unassociated; combination E, 1-c and 2-a associated, b unassociated; combination F, 1-c and 2-b associated, a unassociated; combination G, only 1-a associated; combination H, only 1-b associated; combination I, only 1-c associated; combination J, only 2-a associated; combination K, only 2-b associated; combination L, only 2-c associated; combination M, no objects associated.
The sum of cosine distances is calculated for each object combination, where each unassociated object contributes the preset distance threshold. If, for example, combination A has the smallest sum (combination A < combination B < ... < combination M), then combination A is the tracking result for the K-th and (K+1)-th video frames: object 1 and object a are the same object, object 2 and object b are the same object, and object c is a newly appearing object in the (K+1)-th video frame.
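Enumerating every object combination as in the example grows combinatorially with the number of objects. The same minimal-sum criterion can be posed as an assignment problem; the sketch below pads the cost matrix with the preset distance threshold so that any object may remain unassociated, and solves it with the Hungarian algorithm from SciPy. This is one possible realization and not the method prescribed by the application.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def best_combination(cost, dist_threshold):
        """cost[i, j]: cosine distance between object i of frame K and object j of frame K+1."""
        n, m = cost.shape
        # Pad to a square matrix so that any object may stay unassociated
        # at a cost equal to the preset distance threshold.
        size = n + m
        padded = np.full((size, size), dist_threshold, dtype=float)
        padded[:n, :m] = cost
        rows, cols = linear_sum_assignment(padded)   # minimizes the total cost
        matches = [(i, j) for i, j in zip(rows, cols)
                   if i < n and j < m and cost[i, j] < dist_threshold]
        return matches                               # associated pairs; the rest disappear or newly appear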
S104, according to the tracking result of each object, determining a video frame in which a new object appears compared with the previous frame as a first-type video frame, and determining a video frame in which an object that has appeared is about to disappear compared with the next frame as a second-type video frame.
Tracking of objects may occur as follows:
(1) In a number of object trajectories in the k-1 frame, the object detected in the k frame is found, indicating that the object is normally tracked.
(2) In several object trajectories in the k-1 frame, no object detected in the k frame is found, indicating that the object is new in the k frame.
(3) There is an object in the k-1 frame, but the k frame has no object associated with it, indicating that this object has disappeared in the k frame.
A first-type video frame is a video frame in which a new object appears compared with the previous frame; a second-type video frame is a video frame in which an object that has appeared is about to disappear compared with the next frame. For example, as shown in fig. 9, frame k is a first-type video frame and frame k-1 is a second-type video frame. In one example, when a new object appears in a video frame and an object that has already appeared also disappears, the video frame is both a first-type and a second-type video frame.
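Assuming the tracking result exposes, for each frame, the set of IDs of the objects present in it, the classification of S104 could be sketched as follows (the ID-set representation is an assumption for illustration):

    def classify_frames(frame_ids):
        """frame_ids: list of sets, frame_ids[k] = IDs of tracked objects in frame k."""
        first_type, second_type = set(), set()
        for k in range(1, len(frame_ids)):
            if frame_ids[k] - frame_ids[k - 1]:    # a new object appears compared with frame k-1
                first_type.add(k)
            if frame_ids[k - 1] - frame_ids[k]:    # an object of frame k-1 disappears in frame k
                second_type.add(k - 1)             # frame k-1 is the second-type video frame
        return first_type, second_type

A frame index may appear in both sets, matching the case where a new object appears while an existing object disappears.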
In the embodiment of the application, besides detecting the position of the object, the video frame in which a new object appears is treated as a first-type video frame, and the video frame in which an object that has appeared is about to disappear as a second-type video frame; from the first-type and second-type video frames, the moment a new object appears and the moment an existing object disappears can be clearly determined, so the situation where an object such as gauze is left in a patient's body cavity can be reduced.
In one possible embodiment, referring to fig. 14, the method further includes:
S105, generating, for any determined first type video frame, index information at least comprising a first state attribute and a frame number of the first type video frame, wherein the first state attribute indicates that a new object appears.
For each video frame in the first type video frames, index information of the video frame is generated. The index information carries a label, and the label comprises a first state attribute and the frame number of the video frame, wherein the first state attribute indicates that a new object appears, and the frame number indicates which frame in the video data the video frame is.
In one possible implementation manner, for each video frame in the first type video frames, the index information of the video frame further includes the number of objects of the video frame and the position information of the objects in the video frame, where the number of objects indicates the number of objects that are about to disappear and that newly appear in the video frame. In one example, the index information of the video frame further includes a sequence number of the video frame. The position information of an object may be coordinate information of the target frame of the object; in one example, the position information is represented by the coordinates of the upper left corner point of the target frame and the width and height of the target frame. The sequence number of the video frame may indicate which first type video frame in the video data the video frame is, or may indicate which marked video frame in the video data the video frame is, where the marked video frames include the first type video frames and the second type video frames. The sequence number in the index information makes it convenient to distinguish video frames in time order, and the position information of the objects in the index information makes it convenient to locate the objects in the video frame.
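A minimal sketch of one such index record, assuming a Python representation; the dataclass and its field names are illustrative, the patent only fixes the logical content (frame number, optional sequence number, object count, per-object position information and state attribute):

    from dataclasses import dataclass, field
    from typing import List, Tuple

    # (x, y, w, h, state): x, y are the upper left corner of the object's target frame,
    # w, h its width and height; state is 1 for a newly appearing object and
    # 0 for an object that is about to disappear.
    ObjectEntry = Tuple[int, int, int, int, int]

    @dataclass
    class IndexInfo:
        frame_number: int                          # which frame in the video data
        sequence_number: int = 0                   # order among the marked video frames
        objects: List[ObjectEntry] = field(default_factory=list)

        @property
        def object_count(self) -> int:
            return len(self.objects)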
And S106, generating index information at least comprising a second state attribute and a frame number of the second type video frame for any determined second type video frame, wherein the second state attribute indicates that the object which has appeared disappears.
For each video frame in the second type video frames, index information of the video frame is generated. The index information carries a label, and the label comprises a second state attribute and the frame number of the video frame, wherein the second state attribute indicates that an object which has appeared disappears, and the frame number indicates which frame in the video data the video frame is.
In one possible implementation manner, for each video frame in the second type video frames, the index information of the video frame at least further includes the number of objects of the video frame and the position information of the objects in the video frame, where the number of objects indicates the number of objects that are about to disappear and that newly appear in the video frame. In one example, the index information of the video frame further includes a sequence number of the video frame. The position information of an object may be, for example, coordinate information of the target frame of the object, represented by the coordinates of the upper left corner point of the target frame and the width and height of the target frame. The sequence number of the video frame may indicate which second type video frame in the video data the video frame is, or may indicate which marked video frame in the video data the video frame is, where the marked video frames include the first type video frames and the second type video frames. The sequence number in the index information makes it convenient to distinguish video frames in time order, and the position information of the objects in the index information makes it convenient to locate the objects in the video frame.
And S107, packaging each index information and the video data into code stream data.
The index information is encapsulated into the code stream obtained after the video data is encoded, so as to obtain the code stream data. In one example, as shown in fig. 10 and fig. 15, the index information may be encapsulated after the data header of the code stream data, so that the index information can be obtained quickly after decapsulation, where the data header is used for identifying the index information and may include information such as the data length of the index information; the specific content may be set in a customized manner according to the actual situation. In one example, as shown in fig. 10, the state attribute indicates whether a new object appears or an object that has appeared disappears; for example, the second state attribute (an object that has appeared disappears) is indicated by 0, and the first state attribute (a new object appears) is indicated by 1.
In one example, there may be a video frame in which a new object appears and an object that has appeared disappears at the same time. When the number of newly appearing objects and objects about to disappear in one video frame is greater than 1, the position information and the state attribute of each newly appearing object and each object about to disappear may be arranged sequentially in the index information of the video frame, and the arrangement order may be set in a customized manner. As shown in fig. 15, key frame 1 includes an object about to disappear and a newly appearing object, where k-1 represents the frame number of key frame 1, 2 represents the number of objects about to disappear and newly appearing objects in key frame 1, the x, y, w, h before the 0 is the position attribute of the object about to disappear, and the 0 is its state attribute, indicating an object about to disappear; the x, y, w, h before the 1 is the position attribute of the newly appearing object, and the 1 is its state attribute, indicating a newly appearing object.
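As a rough illustration of how such an index record might be serialized behind a data header, the sketch below packs the frame number, the object count, and each object's x, y, w, h plus its one-byte state attribute into a byte string; the field widths, byte order, and the "IX" tag are assumptions made for illustration only.

    import struct

    def pack_index_info(info: "IndexInfo") -> bytes:
        """Serialize one index record: frame number, object count, then
        (x, y, w, h, state) for each object. Field widths are illustrative."""
        payload = struct.pack("<IH", info.frame_number, info.object_count)
        for x, y, w, h, state in info.objects:
            payload += struct.pack("<4HB", x, y, w, h, state)
        # a simple data header identifying the index record and carrying its length
        header = struct.pack("<2sI", b"IX", len(payload))
        return header + payload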
In the embodiment of the application, the index information and the video data are encapsulated into the code stream data, which facilitates subsequent viewing and tracing of objects.
In one possible embodiment, referring to fig. 16, the method further includes:
S108, decapsulating the code stream data to obtain each piece of index information.
And decapsulating the code stream data to obtain index information and video data.
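Decapsulation is then simply the inverse read of the serialization sketch above, with the same assumed layout:

    import struct

    def unpack_index_info(buf: bytes) -> "IndexInfo":
        """Parse one index record produced by pack_index_info (illustrative layout)."""
        tag, payload_len = struct.unpack_from("<2sI", buf, 0)
        assert tag == b"IX", "not an index record"
        offset = struct.calcsize("<2sI")
        frame_number, count = struct.unpack_from("<IH", buf, offset)
        offset += struct.calcsize("<IH")
        objects = []
        for _ in range(count):
            entry = struct.unpack_from("<4HB", buf, offset)
            objects.append(entry)
            offset += struct.calcsize("<4HB")
        return IndexInfo(frame_number=frame_number, objects=objects)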
And S109, playing back each video frame in the first type video frame and the second type video frame according to each index information.
In one embodiment, the playing back each of the first type video frame and the second type video frame according to each of the index information includes:
step one, based on the frame numbers of the video frames in the index information, obtaining the first-type video frames and/or the second-type video frames represented by the frame numbers of the video frames in the index information, and obtaining the target video frames.
For example, if the frame numbers of the video frames in the pieces of index information are 5, 99, 255, 1245, and 3455, the 5th, 99th, 255th, 1245th, and 3455th video frames are obtained from the video data as the target video frames.
And step two, for each target video frame, performing associated playback on the target video frame and the first state attribute and/or the second state attribute corresponding to the target video frame.
The first state attribute and/or the second state attribute corresponding to the target video frame are the first state attribute and/or the second state attribute in the index information of the target video frame. If the index information of the target video frame carries the first state attribute and does not carry the second state attribute, the target video frame and the first state attribute are displayed in an associated manner; if the index information of the target video frame carries the second state attribute and does not carry the first state attribute, the target video frame and the second state attribute are displayed in an associated manner; and if the index information of the target video frame carries both the first state attribute and the second state attribute, the target video frame, the first state attribute, and the second state attribute are displayed in an associated manner. In addition, information such as the sequence number corresponding to the target video frame and the target frame of the object can be displayed. In one example, a schematic diagram of the associated display may be as shown in fig. 12.
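A minimal sketch of this associated playback, assuming OpenCV for decoding and display and reusing the IndexInfo records sketched above; the window title and the text labels are illustrative:

    import cv2

    STATE_LABELS = {0: "object disappears", 1: "new object appears"}

    def play_back_target_frames(video_path: str, index_records):
        """Show each target video frame together with its target frames and state attributes."""
        cap = cv2.VideoCapture(video_path)
        for info in index_records:
            cap.set(cv2.CAP_PROP_POS_FRAMES, info.frame_number)
            ok, frame = cap.read()
            if not ok:
                continue
            for x, y, w, h, state in info.objects:
                cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
                cv2.putText(frame, STATE_LABELS[state], (x, max(y - 5, 0)),
                            cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
            cv2.imshow("target video frame", frame)
            cv2.waitKey(0)  # wait for a key press before showing the next target frame
        cap.release()
        cv2.destroyAllWindows()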
In one embodiment, the user may play back a video segment near the target video frame, and the method further includes:
And step three, after a detailed display message of the user for a specified target video frame is acquired, each video frame whose frame number difference from the specified target video frame is within a first preset frame number difference range is acquired according to the frame number of the specified target video frame, so as to obtain a target video frame segment corresponding to the specified target video frame.
The first preset frame number difference range may be set in a customized manner according to the actual situation, for example, to -50 to 50, -50 to 100, -100 to 50, -100 to 100, or -400 to 500, etc. Taking a first preset frame number difference range of -50 to 100 as an example, the 50 frames before the specified target video frame, the specified target video frame itself, and the 100 frames after it are selected as the target video frame segment corresponding to the specified target video frame.
And step four, carrying out associated playback on the target video frame segment and the state attribute corresponding to the target video frame segment, wherein the state attribute corresponding to the target video frame segment is the first state attribute and/or the second state attribute of the appointed target video frame, or the state attribute corresponding to the target video frame segment is the first state attribute and/or the second state attribute of the first type video frame included in the target video frame segment.
The state attribute corresponding to the target video frame segment is the first state attribute and/or the second state attribute of the specified target video frame used for determining the target video frame segment. In one example, the state attribute corresponding to the target video frame segment may further include the first state attribute and/or the second state attribute corresponding to other target video frames in the target video frame segment. The target video frame segment is played, and the first state attribute and/or the second state attribute corresponding to the target video frame segment are displayed in an associated manner. In one example, the first state attribute and/or the second state attribute corresponding to the target video frame segment may be displayed during the whole playing of the target video frame segment; in one example, the first state attribute and/or the second state attribute corresponding to a target video frame may be displayed when the target video frame, or a video frame whose frame number difference from it is within a third preset frame number difference range, is played.
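The following sketch, with assumed names, computes the target video frame segment around a specified target video frame for a given preset frame number difference range:

    def target_segment(specified_frame: int, frame_range=(-50, 100), total_frames=None):
        """Frame numbers of the target video frame segment around the specified target video frame.

        frame_range is the preset frame number difference range; (-50, 100) keeps the 50
        frames before the specified frame, the specified frame itself, and the 100 frames after it.
        """
        low, high = frame_range
        start = max(specified_frame + low, 0)
        end = specified_frame + high
        if total_frames is not None:
            end = min(end, total_frames - 1)
        return list(range(start, end + 1))

    # e.g. target_segment(255, (-50, 100)) covers frames 205 to 355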
In one embodiment, the playing back each of the first type video frame and the second type video frame according to each of the index information includes:
and step A, aiming at each index information, acquiring each video frame with the difference value of the frame number of the video frame in the index information within a second preset frame number difference range according to the frame number of the video frame in the index information, and obtaining a target video frame set corresponding to the index information.
The second preset frame number difference range may be set in a customized manner according to the actual situation, for example, to -5 to 5, -5 to 10, -10 to 10, -20 to 20, or -50 to 100, etc. Taking a second preset frame number difference range of -5 to 10 as an example, the 5 frames before the video frame in the index information, the video frame represented by the frame number in the index information, and the 10 frames after it are selected as the target video frame set corresponding to the index information.
And B, carrying out associated playback on each target video frame set and state attributes corresponding to the target video frame set, wherein the state attributes corresponding to the target video frame set are first state attributes and/or second state attributes in index information used for determining the target video frame set or the state attributes corresponding to the target video frame set are first state attributes and/or second state attributes of the first type video frames included in the target video frame set for each target video frame set.
In the embodiment of the application, each video frame in the first type video frames and the second type video frames is displayed, so that medical staff can conveniently and quickly review, before suturing, the increase and decrease of gauze and other objects during the operation, which can effectively reduce the situation in which gauze and other objects remain in the body cavity of a patient.
The embodiment of the application also provides an object detection device, which comprises: the video data acquisition module is used for acquiring video data to be detected; the attribute information determining module is used for respectively carrying out target detection on each video frame in the video data based on a pre-trained deep learning target detection network to obtain attribute information of an object in each video frame, wherein the attribute information of any object comprises the position information of the object; the tracking result determining module is used for respectively tracking each object according to the attribute information of each object to obtain the tracking result of each object; and the video frame labeling module is used for determining a video frame with a new object appearing compared with the previous frame as a first type of video frame and determining a video frame with an object appearing and about to disappear compared with the next frame as a second type of video frame according to the tracking result of each object.
In one possible embodiment, the object is gauze and the deep learning object detection network is a gauze detection network; the attribute information determining module is specifically configured to: respectively extracting the characteristics of each video frame in the video data by utilizing a characteristic extraction network of the gauze detection network to obtain the image characteristics of each video frame; and analyzing the image characteristics of each video frame by using a detection head network of the gauze detection network to obtain the attribute information of the gauze in each video frame.
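As a hedged sketch of such a detector in PyTorch: a feature extraction backbone produces image features of the video frames, and a detection head predicts box coordinates and a confidence per anchor at every feature map location. The layer sizes and the overall architecture are illustrative assumptions; the patent does not prescribe a specific network structure.

    import torch
    import torch.nn as nn

    class GauzeDetectionNet(nn.Module):
        """Illustrative feature extraction network plus detection head network."""

        def __init__(self, num_anchors: int = 3):
            super().__init__()
            # feature extraction network: a small convolutional backbone
            self.backbone = nn.Sequential(
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            )
            # detection head network: box offsets (4) + objectness score (1) per anchor
            self.head = nn.Conv2d(128, num_anchors * 5, kernel_size=1)

        def forward(self, frames: torch.Tensor):
            features = self.backbone(frames)   # image features of the video frames
            return self.head(features)         # raw box and score predictions per cell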
In one possible embodiment, the apparatus further comprises: the index information generation module is used for generating index information at least comprising a first state attribute and a frame number of the first type video frame for any determined first type video frame, wherein the first state attribute indicates that a new object appears; generating index information at least comprising a second state attribute and a frame number of the second type video frame aiming at any determined second type video frame, wherein the second state attribute indicates that the object which has appeared disappears; and packaging each index information and the video data into code stream data.
In one possible implementation, for each of the first type of video frame and the second type of video frame, the index information of the video frame further includes at least a number of objects of the video frame and location information of objects in the video frame, wherein, for each of the first type of video frame and the second type of video frame, the number of objects of the video frame indicates a number of objects that will disappear and appear newly in the video frame.
In one possible embodiment, the apparatus further comprises: a data unpacking module, configured to unpack the code stream data to obtain each index information; and the video frame display module is used for playing back each video frame in the first type of video frames and the second type of video frames according to each index information.
In one possible implementation manner, the video frame display module is specifically configured to: acquiring each first type video frame and/or each second type video frame represented by the frame number of the video frame in each index information based on the frame number of the video frame in each index information to obtain each target video frame; and for each target video frame, performing associated playback on the target video frame and the first state attribute and/or the second state attribute corresponding to the target video frame.
In one possible embodiment, the apparatus further comprises: the target video frame segment determining module is used for acquiring each video frame with the difference value of the frame number of the designated target video frame within a first preset frame number difference range according to the frame number of the designated target video frame after acquiring the detailed display message of the designated target video frame of a user, so as to acquire a target video frame segment corresponding to the designated target video frame; and the associated playing module is used for carrying out associated playback on the target video frame segment and the state attribute corresponding to the target video frame segment, wherein the state attribute corresponding to the target video frame segment is the first state attribute and/or the second state attribute of the appointed target video frame, or the state attribute corresponding to the target video frame segment is the first state attribute and/or the second state attribute of the first type video frame included in the target video frame segment.
In one possible implementation manner, the video frame display module is specifically configured to: for each index information, according to the frame numbers of the video frames in the index information, obtaining each video frame with a difference value of the frame numbers of the video frames in the index information within a second preset frame number difference range, and obtaining a target video frame set corresponding to the index information; and carrying out associated playback on each target video frame set and state attributes corresponding to the target video frame set, wherein the state attributes corresponding to the target video frame set are first state attributes and/or second state attributes in index information used for determining the target video frame set or the state attributes corresponding to the target video frame set are first state attributes and/or second state attributes of first type video frames included in the target video frame set for each target video frame set.
In a possible implementation manner, the tracking result determining module is specifically configured to: according to the image characteristics of each object, calculating the cosine distance of the image characteristics of each two objects between adjacent video frames; and associating the same objects among the adjacent video frames according to the position information of the objects and the cosine distances to obtain tracking results of the objects.
The embodiment of the application also provides electronic equipment, which comprises: a processor and a memory; the memory is used for storing a computer program; the processor is used for executing the computer program stored in the memory to realize any object detection method.
Optionally, in addition to the memory and the processor, the electronic device according to the embodiment of the present application further includes a communication interface and a communication bus, where the processor and the communication interface, and the memory complete communication with each other through the communication bus.
The communication bus mentioned for the above electronic device may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in the figure, but this does not mean that there is only one bus or only one type of bus.
The communication interface is used for communication between the electronic device and other devices.
The memory may include RAM (Random Access Memory) or NVM (Non-Volatile Memory), for example at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a CPU (Central Processing Unit), an NP (Network Processor), and the like; it may also be a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
The embodiment of the application also provides a computer readable storage medium, wherein the computer readable storage medium stores a computer program, and the computer program realizes any object detection method when being executed by a processor.
In yet another embodiment of the present application, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform any of the object detection methods of the present application.
In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, produces a flow or function in accordance with embodiments of the present application, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by a wired (e.g., coaxial cable, fiber optic, digital subscriber line), or wireless (e.g., infrared, wireless, microwave, etc.). The computer readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that contains an integration of one or more available media. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid State Disk (SSD)), etc.
It should be noted that, in this document, the technical features in each alternative may be combined to form a solution, so long as they are not contradictory, and all such solutions are within the scope of the disclosure of the present application. Relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In this specification, each embodiment is described in a related manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for embodiments of the apparatus, system, electronic device, computer program product, and storage medium, the description is relatively simple, as it is substantially similar to the method embodiments, with reference to the description of the method embodiments as relevant.
The foregoing description is only of the preferred embodiments of the present application and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application are included in the protection scope of the present application.

Claims (11)

1. An object detection method, the method comprising:
acquiring video data to be detected;
respectively carrying out target detection on each video frame in the video data based on a pre-trained deep learning target detection network to obtain attribute information of an object in each video frame, wherein the attribute information of any object comprises the position information of the object;
tracking each object according to the attribute information of each object to obtain a tracking result of each object;
according to the tracking result of each object, determining a video frame with a new object appearing compared with the previous frame as a first type video frame, and determining a video frame with an object appearing compared with the next frame to disappear as a second type video frame;
generating index information at least comprising a first state attribute and a frame number of the first type video frame aiming at any determined first type video frame, wherein the first state attribute indicates that a new object appears;
Generating index information at least comprising a second state attribute and a frame number of the second type video frame aiming at any determined second type video frame, wherein the second state attribute indicates that the object which has appeared disappears;
encapsulating each index information and the video data into code stream data;
decapsulating the code stream data to obtain each piece of index information;
according to each index information, playing back each video frame in the first type video frame and the second type video frame;
and playing back each video frame in the first type video frame and the second type video frame according to each index information, wherein the method comprises the following steps: for each index information, according to the frame numbers of the video frames in the index information, obtaining each video frame with a difference value of the frame numbers of the video frames in the index information within a second preset frame number difference range, and obtaining a target video frame set corresponding to the index information; and carrying out associated playback on each target video frame set and state attributes corresponding to the target video frame set, wherein the state attributes corresponding to the target video frame set are first state attributes and/or second state attributes in index information used for determining the target video frame set or the state attributes corresponding to the target video frame set are first state attributes and/or second state attributes of first type video frames included in the target video frame set for each target video frame set.
2. The method of claim 1, wherein the object is gauze and the deep learning target detection network is a gauze detection network;
the pre-training-based deep learning target detection network respectively carries out target detection on each video frame in the video data to obtain attribute information of an object in each video frame, and the method comprises the following steps:
respectively extracting the characteristics of each video frame in the video data by utilizing a characteristic extraction network of the gauze detection network to obtain the image characteristics of each video frame;
and analyzing the image characteristics of each video frame by using a detection head network of the gauze detection network to obtain the attribute information of the gauze in each video frame.
3. The method of claim 1, wherein for each of the first type of video frame and the second type of video frame, the index information of the video frame further comprises at least a number of objects of the video frame and location information of objects in the video frame, wherein for each of the first type of video frame and the second type of video frame, the number of objects of the video frame indicates a number of objects in the video frame that are to disappear and to appear newly.
4. The method of claim 1, wherein playing back each of the first type of video frame and the second type of video frame according to each of the index information, comprises:
acquiring each first type video frame and/or each second type video frame represented by the frame number of the video frame in each index information based on the frame number of the video frame in each index information to obtain each target video frame;
and for each target video frame, performing associated playback on the target video frame and the first state attribute and/or the second state attribute corresponding to the target video frame.
5. The method according to claim 4, wherein the method further comprises:
after a detailed display message of a user aiming at a specified target video frame is acquired, acquiring each video frame with a difference value of the frame number of the specified target video frame within a first preset frame number difference range according to the frame number of the specified target video frame, and acquiring a target video frame segment corresponding to the specified target video frame;
and performing associated playback on the target video frame segment and the state attribute corresponding to the target video frame segment, wherein the state attribute corresponding to the target video frame segment is the first state attribute and/or the second state attribute of the specified target video frame, or the state attribute corresponding to the target video frame segment is the first state attribute and/or the second state attribute of the first type video frame included in the target video frame segment.
6. The method of claim 1, wherein for any object, the attribute information of the object further comprises image characteristics of the object; the tracking of each object according to the attribute information of each object to obtain a tracking result of each object includes:
according to the image characteristics of each object, calculating the cosine distance of the image characteristics of each two objects between adjacent video frames;
and associating the same objects among the adjacent video frames according to the position information of the objects and the cosine distances to obtain tracking results of the objects.
7. An endoscope system, the endoscope system comprising:
an endoscope, a light source device, and an imaging system host;
the endoscope is used for collecting image data of a subject;
the light source device is used for providing a shooting light source for the endoscope;
the camera system host is configured to implement the object detection method according to any one of the preceding claims 1-6 at run-time.
8. The system of claim 7, wherein the endoscope system further comprises: a display device and a storage device;
the camera system host is also used for sending the image data acquired by the endoscope to the display equipment and storing the processed image data into the storage equipment;
The display device is used for displaying the image data and playing back each video frame in the first type video frame and the second type video frame;
the storage device is used for storing the processed image data.
9. An object detection apparatus, the apparatus comprising:
the video data acquisition module is used for acquiring video data to be detected;
the attribute information determining module is used for respectively carrying out target detection on each video frame in the video data based on a pre-trained deep learning target detection network to obtain attribute information of an object in each video frame, wherein the attribute information of any object comprises the position information of the object;
the tracking result determining module is used for respectively tracking each object according to the attribute information of each object to obtain the tracking result of each object;
the video frame labeling module is used for determining a video frame with a new object appearing compared with the previous frame as a first type video frame and determining a video frame with an object appearing and about to disappear compared with the next frame as a second type video frame according to the tracking result of each object;
the index information generation module is used for generating index information at least comprising a first state attribute and a frame number of the first type video frame for any determined first type video frame, wherein the first state attribute indicates that a new object appears; generating index information at least comprising a second state attribute and a frame number of the second type video frame aiming at any determined second type video frame, wherein the second state attribute indicates that the object which has appeared disappears; encapsulating each index information and the video data into code stream data;
A data unpacking module, configured to unpack the code stream data to obtain each index information;
the video frame display module is used for playing back each video frame in the first type of video frames and the second type of video frames according to each index information;
the video frame display module is specifically configured to: for each index information, according to the frame numbers of the video frames in the index information, obtaining each video frame with a difference value of the frame numbers of the video frames in the index information within a second preset frame number difference range, and obtaining a target video frame set corresponding to the index information; and carrying out associated playback on each target video frame set and state attributes corresponding to the target video frame set, wherein the state attributes corresponding to the target video frame set are first state attributes and/or second state attributes in index information used for determining the target video frame set or the state attributes corresponding to the target video frame set are first state attributes and/or second state attributes of first type video frames included in the target video frame set for each target video frame set.
10. An electronic device, comprising a processor and a memory;
The memory is used for storing a computer program;
the processor is configured to implement the object detection method according to any one of claims 1 to 6 when executing the program stored in the memory.
11. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program which, when executed by a processor, implements the object detection method of any of claims 1-6.
CN202110348217.1A 2021-03-31 2021-03-31 Object detection method, object detection device, endoscope system, electronic device, and storage medium Active CN112967276B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110348217.1A CN112967276B (en) 2021-03-31 2021-03-31 Object detection method, object detection device, endoscope system, electronic device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110348217.1A CN112967276B (en) 2021-03-31 2021-03-31 Object detection method, object detection device, endoscope system, electronic device, and storage medium

Publications (2)

Publication Number Publication Date
CN112967276A CN112967276A (en) 2021-06-15
CN112967276B true CN112967276B (en) 2023-09-05

Family

ID=76280602

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110348217.1A Active CN112967276B (en) 2021-03-31 2021-03-31 Object detection method, object detection device, endoscope system, electronic device, and storage medium

Country Status (1)

Country Link
CN (1) CN112967276B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114638963B (en) * 2022-05-18 2022-08-16 青岛美迪康数字工程有限公司 Method and device for identifying and tracking suspicious tissues in endoscopy
CN116030418B (en) * 2023-02-14 2023-09-12 北京建工集团有限责任公司 Automobile lifting line state monitoring system and method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10489918B1 (en) * 2018-05-09 2019-11-26 Figure Eight Technologies, Inc. Video object tracking

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102551646A (en) * 2010-10-21 2012-07-11 富士胶片株式会社 Electronic endoscope system and processor thereof, high-sensitivity method for fluorescence image
CN104809437A (en) * 2015-04-28 2015-07-29 无锡赛睿科技有限公司 Real-time video based vehicle detecting and tracking method
JP2018068863A (en) * 2016-11-02 2018-05-10 カリーナシステム株式会社 Gauze detection system
CN108229294A (en) * 2017-09-08 2018-06-29 北京市商汤科技开发有限公司 A kind of motion capture method, apparatus, electronic equipment and storage medium
CN108932496A (en) * 2018-07-03 2018-12-04 北京佳格天地科技有限公司 The quantity statistics method and device of object in region
CN110866428A (en) * 2018-08-28 2020-03-06 杭州海康威视数字技术股份有限公司 Target tracking method and device, electronic equipment and storage medium
CN110717414A (en) * 2019-09-24 2020-01-21 青岛海信网络科技股份有限公司 Target detection tracking method, device and equipment
CN111192296A (en) * 2019-12-30 2020-05-22 长沙品先信息技术有限公司 Pedestrian multi-target detection and tracking method based on video monitoring
CN111311557A (en) * 2020-01-23 2020-06-19 腾讯科技(深圳)有限公司 Endoscope image processing method, endoscope image processing device, electronic apparatus, and storage medium
CN111882586A (en) * 2020-06-23 2020-11-03 浙江工商大学 Multi-actor target tracking method oriented to theater environment
CN112270347A (en) * 2020-10-20 2021-01-26 西安工程大学 Medical waste classification detection method based on improved SSD

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
FastReID: a pytorch toolbox for general instance re-identification; Lingxiao He, et al.; arXiv:2006.02631v4; pp. 1-10 *

Also Published As

Publication number Publication date
CN112967276A (en) 2021-06-15

Similar Documents

Publication Publication Date Title
JP6348078B2 (en) Branch structure determination apparatus, operation method of branch structure determination apparatus, and branch structure determination program
CN107920722B (en) Reconstruction by object detection for images captured from a capsule camera
US9805469B2 (en) Marking and tracking an area of interest during endoscopy
JP6150583B2 (en) Image processing apparatus, endoscope apparatus, program, and operation method of image processing apparatus
CN112967276B (en) Object detection method, object detection device, endoscope system, electronic device, and storage medium
JP2019188223A (en) Video endoscopic system
CN111295127B (en) Examination support device, endoscope device, and recording medium
JP6949999B2 (en) Image processing equipment, endoscopic systems, image processing methods, programs and recording media
WO2023103467A1 (en) Image processing method, apparatus and device
CN110662476B (en) Information processing apparatus, control method, and program
CN112566540B (en) Endoscope processor, information processing device, endoscope system, program, and information processing method
JP2016179121A (en) Endoscope inspection support device, method and program
KR20190090150A (en) An apparatus for creating description of capsule endoscopy and method thereof, a method for searching capsule endoscopy image based on decsription, an apparatus for monitoring capsule endoscopy
CN109328028A (en) Image for capturing in selective body is with the system and method for display
CN110867233B (en) System and method for generating electronic laryngoscope medical test reports
Cao et al. Computer-aided detection of diagnostic and therapeutic operations in colonoscopy videos
CN108042090A (en) Configure the medical endoscope system and image processing method of artificial intelligence chip
JPWO2014148184A1 (en) Endoscope system
KR20160118037A (en) Apparatus and method for detecting lesion from medical image automatically
JP6840263B2 (en) Endoscope system and program
US20220361739A1 (en) Image processing apparatus, image processing method, and endoscope apparatus
CN114332025B (en) Digestive endoscopy oropharynx passing time automatic detection system and method
EP4285810A1 (en) Medical image processing device, method, and program
CN110974121B (en) Method and system for judging whether digestive endoscopy is stained or not
KR20220069333A (en) Apparatus and method for generating clinical record data

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant