CN115393670A - Method for training lung endoscope image recognition model and recognition method - Google Patents
Method for training lung endoscope image recognition model and recognition method
- Publication number
- CN115393670A (application CN202211003172.5A)
- Authority
- CN
- China
- Prior art keywords
- model
- training
- image
- endoscope
- seqyolo
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000012549 training Methods 0.000 title claims abstract description 63
- 238000000034 method Methods 0.000 title claims abstract description 62
- 210000004072 lung Anatomy 0.000 title claims abstract description 28
- 238000003062 neural network model Methods 0.000 claims abstract description 12
- 238000004422 calculation algorithm Methods 0.000 claims abstract description 6
- 238000013135 deep learning Methods 0.000 claims abstract description 5
- 210000000621 bronchi Anatomy 0.000 claims description 62
- 238000001514 detection method Methods 0.000 claims description 26
- 238000004364 calculation method Methods 0.000 claims description 16
- 230000008569 process Effects 0.000 claims description 14
- 230000002685 pulmonary effect Effects 0.000 claims description 13
- 238000000605 extraction Methods 0.000 claims description 8
- 238000005457 optimization Methods 0.000 claims description 6
- 210000002409 epiglottis Anatomy 0.000 claims description 4
- 238000011156 evaluation Methods 0.000 claims description 4
- 210000001260 vocal cord Anatomy 0.000 claims description 4
- 238000010276 construction Methods 0.000 claims description 3
- 230000003247 decreasing effect Effects 0.000 claims description 3
- 238000009795 derivation Methods 0.000 claims description 3
- 238000001839 endoscopy Methods 0.000 claims description 3
- 238000011478 gradient descent method Methods 0.000 claims description 3
- 238000011895 specific detection Methods 0.000 claims description 3
- 238000012360 testing method Methods 0.000 claims description 3
- 230000000007 visual effect Effects 0.000 claims description 3
- 210000000988 bone and bone Anatomy 0.000 claims 2
- 238000005516 engineering process Methods 0.000 abstract description 2
- 239000000523 sample Substances 0.000 description 12
- 238000010586 diagram Methods 0.000 description 8
- 210000003437 trachea Anatomy 0.000 description 7
- 230000000694 effects Effects 0.000 description 4
- 238000012795 verification Methods 0.000 description 3
- 230000006870 function Effects 0.000 description 2
- 238000002372 labelling Methods 0.000 description 2
- 230000003902 lesion Effects 0.000 description 2
- 230000002035 prolonged effect Effects 0.000 description 2
- 230000006978 adaptation Effects 0.000 description 1
- 238000013473 artificial intelligence Methods 0.000 description 1
- 238000013528 artificial neural network Methods 0.000 description 1
- 210000004704 glottis Anatomy 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 238000012544 monitoring process Methods 0.000 description 1
- 210000002345 respiratory system Anatomy 0.000 description 1
- 230000009466 transformation Effects 0.000 description 1
- 238000013519 translation Methods 0.000 description 1
- 238000010200 validation analysis Methods 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/03—Recognition of patterns in medical or anatomical images
- G06V2201/031—Recognition of patterns in medical or anatomical images of internal organs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- Multimedia (AREA)
- Medical Informatics (AREA)
- Databases & Information Systems (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- General Engineering & Computer Science (AREA)
- Molecular Biology (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a method for training a lung endoscope image recognition model and a recognition method. The method comprises constructing a data set and a neural network model and performing lung endoscope image recognition with the constructed recognition model, wherein the neural network model is SeqYOLO, built by combining a YOLOv5 model with an LSTM (Long Short-Term Memory) network to recognize endoscope video. In other words, a model dedicated to lung endoscope recognition is trained by an artificial-intelligence deep learning algorithm; when a doctor uses an endoscope carrying this technology, the position of the endoscope can be read intuitively from the terminal screen, which shortens the time spent on manual judgment, improves the efficiency of the operation, shortens the operation time, and reduces the patient's pain.
Description
Technical Field
The invention relates to the technical field of neural networks, in particular to a method for training a lung endoscope image recognition model and a recognition method.
Background
Pulmonary endoscopy is the process of placing an elongated endoscope into the lower respiratory tract of a patient orally or nasally, i.e., through the glottis into the trachea and bronchi and beyond, directly observing lesions in the trachea and bronchi, and performing corresponding examination and treatment according to the lesions.
The lung trachea and bronchi form a many-branched structure resembling an inverted tree, and even a professionally trained surgeon can easily become lost during an examination, which prolongs the operation and with it the patient's discomfort; only a doctor with rich experience can operate with ease.
At the present stage, the position of the endoscope is judged solely by manually observing the endoscope image and searching for the passage to advance into, so a doctor with insufficient experience can easily fail to judge the current position of the endoscope accurately.
Disclosure of Invention
The invention aims to provide a method for training a lung endoscope image recognition model, and a recognition method, which automatically recognize and label the current tracheal position of the endoscope probe from the endoscope image.
In order to achieve this purpose, the technical scheme of the invention is as follows:
A method for training a lung endoscope image recognition model comprises constructing a data set and a neural network model, training the neural network model on 80% of the sample data of the constructed data set, testing on the remaining 20% of the sample data after training is complete and then deploying, and using mAP0.5 as the evaluation index for measuring accuracy on the endoscope image target detection task.
The data set is constructed as follows: a number of clear bronchoscope images are captured from several bronchoscope videos and divided into categories; the images are then annotated by drawing one or more bounding boxes; finally the data set is expanded by data enhancement;
the data enhancement of the data set comprises three different levels of enhancement intensity;
the neural network model is SeqYOLO constructed by combining a YOLOv5 model and LSTM;
the SeqYOLO code establishes a reference by YOLOv5, the YOLOv5 part provides target detection capability, and a new LSTM module is introduced to learn information with time series relation in a video;
SeqYOLO obtains image characteristic values of each frame in a video sequence through a YOLO backbone respectively, the characteristic values are calculated through an LSTM model to learn information with time sequence relation in the video, the LSTM outputs an image characteristic value corresponding to an image to be inferred after reading the image characteristic values of the whole sequence, and a head component in a YOLO series algorithm is used for inferring and calculating specific image detection frames and object classes in the frames.
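As a concrete illustration, the following is a minimal, runnable sketch of this forward pass, assuming stand-in layers: a small CNN takes the place of the multi-scale YOLOv5 Backbone and a linear layer takes the place of the YOLO Head (which in reality predicts anchor boxes at several scales). All class and layer names here are illustrative, not the patent's actual code.

```python
import torch
import torch.nn as nn

class SeqYOLO(nn.Module):
    """Backbone per frame -> LSTM over the sequence -> Head on the last step."""
    def __init__(self, feat_dim=256, num_classes=18):
        super().__init__()
        # Stand-in for the YOLOv5 Backbone: one shared CNN applied to each frame.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(32, feat_dim, 3, stride=2, padding=1), nn.SiLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # The LSTM learns the time-series relations between frame features.
        self.lstm = nn.LSTM(feat_dim, feat_dim, batch_first=True)
        # Stand-in for the YOLO Head: box (4) + objectness (1) + 18 classes.
        self.head = nn.Linear(feat_dim, 5 + num_classes)

    def forward(self, clip):                          # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        feats = self.backbone(clip.flatten(0, 1))     # (B*T, feat_dim)
        seq_out, _ = self.lstm(feats.view(b, t, -1))  # read the whole sequence
        return self.head(seq_out[:, -1])              # predict the inferred frame

pred = SeqYOLO()(torch.randn(2, 20, 3, 224, 224))     # two 20-frame clips
print(pred.shape)                                     # torch.Size([2, 23])
```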
According to the above, each picture corresponds to one frame of a bronchoscope video, and the frames before and after it on the video time axis contain information useful for predicting the current frame; a video clip is therefore taken for model training, the labeled frame of the clip is the prediction target, and the other frames of the clip serve as context information to assist its prediction. Common data enhancement, using image transformations such as left-right flipping, up-down flipping, zooming, translation and rotation, is used to expand the data set, which addresses the facts that the endoscope is not always located in the middle of the trachea and that the captured images vary in angle and orientation. Meanwhile, the enhancement intensity is divided into three levels; the highest level has more enhancement items and alters the original image more heavily. The three levels were tuned specifically for this data set, and the results show that the lower-level enhancement achieves the highest mAP0.5, i.e. the best detection effect. In this way the integrity of the data set is preserved, the target detection neural network model is adapted to different human-body environments, and mAP0.5, used as the evaluation index of accuracy on the endoscope image target detection task, objectively reflects the model's performance on the task.
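A sketch of the three enhancement intensity levels follows, using torchvision; the specific transform sets and parameter values are assumptions for illustration, and a real detection pipeline would apply the geometric transforms to the bounding boxes as well, which is omitted here.

```python
from torchvision import transforms as T

# Three intensity levels; the higher the level, the more enhancement items
# and the heavier the change to the original image. All values are assumed.
AUG_LEVELS = {
    "low": T.Compose([
        T.RandomHorizontalFlip(), T.RandomVerticalFlip(),
        T.RandomAffine(degrees=15, translate=(0.1, 0.1), scale=(0.9, 1.1)),
    ]),
    "medium": T.Compose([
        T.RandomHorizontalFlip(), T.RandomVerticalFlip(),
        T.RandomAffine(degrees=30, translate=(0.2, 0.2), scale=(0.8, 1.2)),
        T.ColorJitter(brightness=0.2, contrast=0.2),
    ]),
    "high": T.Compose([
        T.RandomHorizontalFlip(), T.RandomVerticalFlip(),
        T.RandomAffine(degrees=45, translate=(0.3, 0.3), scale=(0.7, 1.3)),
        T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
        T.GaussianBlur(kernel_size=5),
    ]),
}
augment = AUG_LEVELS["low"]  # the level the text reports as best by mAP0.5
```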
Because the detection target data is video data, SeqYOLO, an algorithm capable of video target detection, is constructed: it accepts continuous video frames as input and learns the context information in the video to improve detection on the specified frame. The target detection model used in the system is built from the lightweight YOLOv5n model and attains a higher mAP0.5 value than other large, high-latency models, that is, it improves detection accuracy while shortening inference time.
Preferably, the data set classification comprises 18 categories, respectively: epiglottis, vocal cords, main bronchus, left main bronchus, carina, right main bronchus, left superior lobar bronchus, left inferior lobar bronchus, right superior lobar bronchus, right middle bronchus, right inferior lobar bronchus, left intrinsic superior lobar bronchus, left lingual lobar bronchus, left inferior lobar dorsal rampart bronchus, left inferior lobar basal part bronchus, right middle lobar bronchus, right inferior lobar basal part bronchus, and right inferior lobar dorsal part bronchus.
In this way the data set is classified, and data enhancement together with frame capture from the input videos conveniently keeps the number of samples in each category balanced, so there is no data-imbalance problem and the trained model recognizes more accurately.
Preferably, the structure of the YOLOv5 model is divided into two main parts: the YOLOv5 Backbone, which extracts information, and an independent YOLOv5 module that regresses the detection boxes, called the YOLOv5 Head.
It can be seen that the functions of the YOLOv5 Backbone and YOLOv5 Head are clearly separated, and with corresponding adjustments they are used as independent components of the SeqYOLO model; because they retain the functions they had in YOLOv5, model weights pre-trained in YOLOv5 can serve as their initial weights, which accelerates the later training on video data.
Preferably, after reading in continuous video-frame information, SeqYOLO gives the prediction result for the last frame image: the YOLO Backbone first extracts features from the images, the obtained feature values are input into an LSTM model for time-series learning, the LSTM learns the feature information of each frame and retains the information useful for predicting the next frame, and after the last frame is read in, the LSTM outputs an overall video-segment feature value that integrates all previous-frame information with the last-frame information; this feature value is passed through the YOLO Head structure to identify the detection box and object class.
Preferably, the training of the SeqYOLO model is divided into two stages. In the first stage, a YOLOv5 model is trained on single labeled pictures until mAP0.5 reaches a fairly accurate value, and the weights of the Backbone and Head components obtained at this stage are used to initialize the model weights of the next stage. In the second stage, the built SeqYOLO model receives the YOLO Backbone and YOLO Head weights from the previous stage in its corresponding parts; this stage trains on video data, that is, continuous video frames are input to the model: 20 frames obtained by sampling at intervals are fed into SeqYOLO for learning, and the last of the 20 frames is the target image to be predicted, so only the last frame carries annotation information.
In this way the SeqYOLO model built in the second stage inherits the YOLO Backbone and YOLO Head weights obtained in the previous stage, which markedly improves the training efficiency of the model.
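A minimal sketch of this weight transfer, reusing the SeqYOLO class from the sketch above; the second instance stands in for the stage-one YOLOv5 checkpoint, whereas in practice the Backbone and Head weights would be loaded from the saved stage-one model.

```python
stage1 = SeqYOLO()    # stand-in for the stage-one YOLOv5 model (trained weights)
seq_yolo = SeqYOLO()  # the stage-two video model

# Copy only the parts whose function is preserved: Backbone and Head.
seq_yolo.backbone.load_state_dict(stage1.backbone.state_dict())
seq_yolo.head.load_state_dict(stage1.head.state_dict())
# The new LSTM module keeps its random initialization and learns on video data.
```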
Preferably, when training of the SeqYOLO model continues, the first stage uses the YOLOv5 model to learn the labeled last frame of each video, and random data enhancement is applied as the data is loaded;
in the forward propagation of the model, the data-enhanced pictures first pass through the feature extraction of the YOLOv5 Backbone and then enter the YOLOv5 Head, which predicts the specific detection-box positions and object classes;
in the back propagation of the model, the PyTorch deep learning framework performs automatic differentiation and updates the model weights by stochastic mini-batch gradient descent; an SGD optimizer is used during training with the learning rate set to 0.01 and the momentum to 0.937, and a OneCycleLR learning-rate schedule is used throughout training, under which the learning rate rises gradually from zero to the configured 0.01 at the start and then decays slowly during the subsequent convergence.
In this way random data enhancement is applied as the data is loaded; compared with traditional precomputed data enhancement this saves storage space and generates a large number of distinct enhanced samples without repetition. The OneCycleLR schedule, by raising the learning rate gradually from zero to the configured 0.01 and then lowering it slowly during convergence, avoids the failure to converge that an overly high learning rate can cause at the very start of training.
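The optimizer settings above map onto PyTorch as sketched below; the placeholder model, loss, and step count are assumptions, while the SGD and OneCycleLR parameters are the ones stated in the text.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # placeholder; stage one actually trains YOLOv5
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.937)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.01, total_steps=1000)  # total_steps is an assumption

for step in range(1000):
    loss = model(torch.randn(8, 10)).pow(2).mean()  # stand-in loss
    loss.backward()
    optimizer.step()      # weight update by mini-batch gradient descent
    optimizer.zero_grad()
    scheduler.step()      # warm up toward 0.01, then decay during convergence
```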
Preferably, in the SeqYOLO training stage the model weights trained in the previous stage are first migrated to the YOLOv5 Backbone and YOLOv5 Head components of SeqYOLO; after the previous stage of training, these two components are respectively able to extract endoscopic image features and to predict and classify targets from the extracted features. Each input to SeqYOLO is a video sequence 20 frames long whose last frame is the endoscopic image to be predicted, and the output of the SeqYOLO model is the target prediction for that target frame, i.e. the last frame of the 20-frame input sequence. After the model receives a 20-frame image sequence, all images are passed simultaneously through the same YOLOv5 Backbone for feature extraction, finally yielding 20 groups of feature values that correspond in order to the original 20 frames. These 20 groups of feature values enter the LSTM module for sequence-information learning; at this stage attention is paid to the image information from before the endoscope reached its current position. After receiving the 20 groups of feature values, the LSTM outputs one encoded feature value whose format is the same as the output of the YOLO Backbone, so the YOLO Head can be used directly for target position and class calculation: the LSTM-encoded feature value is input into the YOLO Head to obtain the final prediction of target position and class information;
the SeqYOLO model is likewise optimized by updating the model weights through back propagation after forward propagation; the optimizer uses the Adam method with the learning rate set to 0.001, and the learning-rate schedule uses OneCycleLR.
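One stage-two update might look like the sketch below: forward propagation on a 20-frame clip, back propagation, and an Adam step at the stated learning rate of 0.001 under OneCycleLR. The MSE loss and random tensors are stand-ins for the real YOLO detection loss and annotated clips, and SeqYOLO is the class sketched earlier.

```python
import torch
import torch.nn.functional as F

model = SeqYOLO()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # as specified
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.001, total_steps=500)  # step budget is an assumption

clip = torch.randn(4, 20, 3, 224, 224)  # a batch of 20-frame sequences
target = torch.randn(4, 23)             # stand-in annotation for the last frame

loss = F.mse_loss(model(clip), target)  # stand-in for the YOLO loss
loss.backward()                         # back propagation
optimizer.step()                        # weight update
scheduler.step()
optimizer.zero_grad()
```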
Preferably, SeqYOLO is extended with attention, and three attention modules are provided: a CBAM-based attention module, the Vision Transformer model ViT, and the Swin Transformer.
In this way the model structure is improved by adding an attention module, giving the model a stronger learning capacity: the attention module lets the model actively focus on the important context information in the picture, which has an important effect on the accuracy of the target detection model, and at the same time the attention module can generate an image attention heat map that prompts the doctor toward details in the image that may need attention.
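Of the three options, the CBAM-based module is the simplest to sketch; the block below follows the usual channel-then-spatial attention recipe of CBAM (Woo et al.). Where exactly it is inserted into SeqYOLO is not specified in the text, so treat the placement and sizes as assumptions.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Channel attention followed by spatial attention, CBAM-style."""
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        self.mlp = nn.Sequential(                     # shared channel MLP
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(2, 1, spatial_kernel,
                                 padding=spatial_kernel // 2)

    def forward(self, x):                             # x: (B, C, H, W)
        b, c = x.shape[:2]
        avg = self.mlp(x.mean(dim=(2, 3)))            # channel attention
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        stacked = torch.cat([x.mean(1, keepdim=True), # spatial attention
                             x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(stacked))

out = CBAM(256)(torch.randn(1, 256, 16, 16))          # same shape in and out
```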
A pulmonary endoscope image recognition method, the method comprising: acquiring a lung endoscope image to be recognized; and recognizing the lung endoscope image with a pulmonary endoscope image recognition model, wherein the pulmonary endoscope image recognition model is trained based on the method for training the image recognition model according to any one of claims 1 to 8.
Preferably, the method is further provided with an inference calculation, and the inference calculation compiles rules according to the structure diagram of the human lung.
Because the method is used to assist a doctor during an endoscopic operation, the real-time requirement is high and a lightweight model is used to achieve millisecond-level inference latency. The target detection model used in the system is built on the lightweight YOLOv5n model, adapted to the data set and optimized for inference deployment; through this task-specific optimization the model attains a higher mAP0.5 value than other large, high-latency models, that is, it improves detection accuracy while shortening inference time.
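A hedged sketch of such rule compilation: the bronchial tree is encoded as an adjacency map, and a predicted class is accepted only if it is anatomically reachable from the previously confirmed position. The abbreviated tree and the acceptance rule below are illustrative assumptions, not the patent's actual rule table.

```python
# Abbreviated bronchial adjacency map; branch names follow the 18 categories.
BRONCHIAL_TREE = {
    "vocal cords":        {"main trachea"},
    "main trachea":       {"carina"},
    "carina":             {"left main bronchus", "right main bronchus"},
    "left main bronchus": {"left superior lobar bronchus",
                           "left inferior lobar bronchus"},
    # ... remaining branches omitted
}

def plausible(prev: str, pred: str) -> bool:
    """Accept a prediction if it stays put, advances one branch, or retreats."""
    return (pred == prev
            or pred in BRONCHIAL_TREE.get(prev, set())
            or prev in BRONCHIAL_TREE.get(pred, set()))

print(plausible("carina", "left main bronchus"))  # True
print(plausible("vocal cords", "carina"))         # False: must pass the trachea
```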
According to the invention, a model dedicated to lung endoscope recognition is trained by an artificial-intelligence deep learning algorithm; when a doctor uses an endoscope carrying this technology, the position of the endoscope can be read intuitively from the terminal screen, which reduces the time spent on manual judgment, improves the efficiency of the operation, shortens the operation time, and reduces the patient's pain.
Drawings
Fig. 1 is an overall working principle diagram of the present invention.
FIG. 2 is a data set presentation diagram of the present invention.
FIG. 3 is a diagram of the real-time reasoning effect of the present invention.
FIG. 4 is a schematic representation of data enhancement for three different intensity levels according to the present invention.
Fig. 5 is a diagram illustrating a SeqYOLO network structure according to the present invention.
Fig. 6 is a diagram of an inference assistant model structure of the present invention.
Detailed Description
The present invention is described in detail below with reference to the attached drawings.
Example 1
As shown in figs. 1-6, a method for training a lung endoscope image recognition model comprises constructing a data set and a neural network model, training the neural network model on 80% of the sample data of the constructed data set, testing on the remaining 20% of the sample data after training is complete and then deploying, and using mAP0.5 as the evaluation index for measuring accuracy on the endoscope image target detection task.
The data set is constructed as follows: a number of clear bronchoscope images are captured from several bronchoscope videos and divided into categories; the images are then annotated by drawing one or more bounding boxes; finally the data set is expanded by data enhancement;
the data enhancement of the data set comprises three different levels of enhancement intensity;
in this embodiment, the method is further provided with reasoning calculation, and the reasoning calculation is used for compiling rules according to the human lung structure diagram.
The specific training process comprises the following steps:
1. 2451 clear bronchoscope images captured from bronchoscope videos were extracted from a pediatric institution's database; each image contains key positional features inside the bronchus that a doctor can judge from, and positions were annotated on the images manually to ensure the accuracy, usability and completeness of the manual annotation. Each image was annotated simultaneously by two medical experts and a data analyst: the medical experts labeled one or more exact site names and locations based on clinical experience, the annotation tool LabelImg was then used to draw one or more bounding boxes on each bronchoscope image and record the correct category, and the coordinates of each bounding box were recorded.
2. The data of all sites was organized into example image data for 18 categories: epiglottis, vocal cords, main trachea, left main bronchus, carina, right main bronchus, left superior lobar bronchus, left inferior lobar bronchus, right superior lobar bronchus, right intermediate bronchus, right inferior lobar bronchus, left proper superior lobar bronchus, left lingular bronchus, left inferior lobar dorsal segmental bronchus, left inferior lobar basal segmental bronchus, right middle lobar bronchus, right inferior lobar basal segmental bronchus, and right inferior lobar dorsal segmental bronchus; to ensure sample balance, the amount of sample data in each category was controlled. At the same time, the specific frame position of each image in its video was determined and a fixed span of time before and after that frame was captured as the image's context information; the number of samples in each category was kept at about 300, avoiding the data-imbalance problem.
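A minimal sketch of the interval frame sampling: given the index of a labeled frame, take 20 frames at a fixed stride ending at that frame, so the clip carries the preceding context. The stride value is an assumption.

```python
def sample_clip(labeled_idx: int, num_frames: int = 20, stride: int = 3):
    """Return the video frame indices for one training clip."""
    first = labeled_idx - (num_frames - 1) * stride
    if first < 0:
        raise ValueError("not enough preceding context for this frame")
    return list(range(first, labeled_idx + 1, stride))

print(sample_clip(120))  # [63, 66, ..., 120]; frame 120 is the labeled target
```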
3. Data enhancement. In the application scenario, after the endoscope enters the human body the angle and orientation of the image shown on the screen vary, so data enhancement by up-down flipping, left-right flipping and rotation is used to increase sample diversity; likewise, the endoscope cannot be guaranteed to sit in the middle of the trachea, so translation of the input data is used to preserve the diversity of the data samples. In addition, a dynamic Gaussian-blur data enhancement method is used to simulate the motion blur caused by the endoscope moving inside the human lung.
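The dynamic blur could be sketched as below: a 1-D blur kernel at a random angle approximates the smear of a moving endoscope. The kernel sizes and angle range are illustrative assumptions, not the patent's exact method.

```python
import random
import torch
import torch.nn.functional as F
import torchvision.transforms.functional as TF

def motion_blur(img: torch.Tensor, max_kernel: int = 9) -> torch.Tensor:
    """img: (3, H, W) float tensor; returns a randomly motion-blurred copy."""
    k = random.randrange(3, max_kernel + 1, 2)       # random odd kernel size
    kernel = torch.zeros(k, k)
    kernel[k // 2, :] = 1.0 / k                      # horizontal streak
    kernel = TF.rotate(kernel[None, None], random.uniform(0, 180))[0, 0]
    kernel = (kernel / kernel.sum()).repeat(3, 1, 1, 1)  # one kernel per channel
    return F.conv2d(img[None], kernel, padding=k // 2, groups=3)[0]

blurred = motion_blur(torch.rand(3, 224, 224))
```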
4. Model training. Model training is performed with 80% of the picture sample data in the collected samples, combined with the data enhancement configuration best suited to the sample data set; the specific training logic is described below.
The neural network model is SeqYOLO constructed by combining a YOLOv5 model and LSTM;
the SeqYOLO code establishes a reference by YOLOv5, the YOLOv5 part provides target detection capability, and a new LSTM module is introduced to learn information with time series relation in a video;
SeqYOLO obtains image characteristic values of each frame in a video sequence through a YOLO backbone respectively, the characteristic values are calculated by using an LSTM model to learn information with time sequence relation in the video, the LSTM outputs an image characteristic value corresponding to an image for reasoning after reading the image characteristic values of the whole sequence, and a head component in a YOLO series algorithm uses the image characteristic values to carry out reasoning and calculate specific image detection frames and object types in the frames.
In this embodiment, the data set classification includes 18 categories, which are: epiglottis, vocal cords, main trachea, left main bronchus, carina, right main bronchus, left superior lobal bronchus, left inferior lobal bronchus, right superior lobal bronchus, right intermediate bronchus, right inferior lobal bronchus, left proper superior lobal bronchus, left lingual bronchus, left inferior lobal dorsal rambronchus, left inferior lobal basal segmental bronchus, right intermediate lobal bronchus, right inferior lobal basal segmental bronchus, and right inferior lobal dorsal segmental bronchus.
In this embodiment, the structure of the YOLOv5 model is divided into two main parts: the YOLOv5 Backbone, which extracts information, and an independent YOLOv5 module that regresses the detection boxes, called the YOLOv5 Head.
In this embodiment, after reading in continuous video-frame information, SeqYOLO gives the prediction result for the last frame image: the YOLO Backbone first extracts features from the images, the obtained feature values are input into an LSTM model for time-series learning, the LSTM learns the feature information of each frame and retains the information useful for predicting the next frame, and after the last frame is read in, the LSTM outputs an overall video-segment feature value that integrates all previous-frame information with the last-frame information; this feature value identifies the detection box and object class through the YOLO Head structure.
In this embodiment, the training of the SeqYOLO model is divided into two stages. In the first stage, a YOLOv5 model is trained on single labeled pictures until mAP0.5 reaches a fairly accurate value, and the weights of the Backbone and Head components obtained at this stage are used to initialize the model weights of the next stage. In the second stage, the built SeqYOLO model receives the YOLO Backbone and YOLO Head weights from the previous stage in its corresponding parts; this stage trains on video data, that is, continuous video frames are input to the model: 20 frames obtained by sampling at intervals are fed into SeqYOLO for learning, and the last of the 20 frames is the target image to be predicted, so only the last frame carries annotation information.
In this embodiment, when training of the SeqYOLO model continues, the first stage uses the YOLOv5 model to learn the labeled last frame of each video, and random data enhancement is applied as the data is loaded;
in the forward propagation of the model, the data-enhanced pictures pass through the feature extraction of the YOLOv5 Backbone and then enter the YOLOv5 Head, which predicts the specific detection-box positions and object classes;
in the back propagation of the model, the PyTorch deep learning framework performs automatic differentiation and updates the model weights by stochastic mini-batch gradient descent; an SGD optimizer is used during training with the learning rate set to 0.01 and the momentum to 0.937, and a OneCycleLR learning-rate schedule is used throughout training, under which the learning rate rises gradually from zero to the configured 0.01 at the start and then decays slowly during the subsequent convergence.
In this embodiment, in the SeqYOLO training stage the model weights trained in the previous stage are first migrated to the YOLOv5 Backbone and YOLOv5 Head components of SeqYOLO; after the previous stage of training, these two components are respectively able to extract endoscopic image features and to predict and classify targets from the extracted features. Each input to SeqYOLO is a video sequence 20 frames long whose last frame is the endoscopic image to be predicted, and the output of the SeqYOLO model is the target prediction for that target frame, i.e. the last frame of the 20-frame input sequence. After the model receives a 20-frame image sequence, all images are passed simultaneously through the same YOLOv5 Backbone for feature extraction, finally yielding 20 groups of feature values that correspond in order to the original 20 frames. These 20 groups of feature values enter the LSTM module for sequence-information learning; at this stage attention is paid to the image information from before the endoscope reached its current position. After receiving the 20 groups of feature values, the LSTM outputs one encoded feature value whose format is the same as the output of the YOLO Backbone, so the YOLO Head can be used directly for target position and class calculation: the LSTM-encoded feature value is input into the YOLO Head to obtain the final prediction of target position and class information;
the SeqYOLO model is likewise optimized by updating the model weights through back propagation after forward propagation; the optimizer uses the Adam method with the learning rate set to 0.001, and the learning-rate schedule uses OneCycleLR.
In this embodiment, SeqYOLO is extended with attention, and three attention modules are provided: a CBAM-based attention module, the Vision Transformer model ViT, and the Swin Transformer.
5. Model validation. The trained model is checked with the remaining 20% of the sample data to verify the recognition accuracy of the model. The validation set is used to verify the state and convergence of the model during training; it is usually used to tune the hyper-parameters, deciding which group of hyper-parameters performs best according to the model's results on several validation runs, and it also serves to monitor whether the model is overfitting during training, i.e. whether the training effect is heading in the desired direction.
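For the mAP0.5 check, a library such as torchmetrics can be used as sketched below (it needs the pycocotools backend installed); the boxes and labels are dummy stand-ins for real model output on the held-out 20%.

```python
import torch
from torchmetrics.detection.mean_ap import MeanAveragePrecision

metric = MeanAveragePrecision(iou_thresholds=[0.5])
preds = [{"boxes": torch.tensor([[10., 10., 80., 90.]]),
          "scores": torch.tensor([0.9]),
          "labels": torch.tensor([4])}]            # e.g. the carina class
targets = [{"boxes": torch.tensor([[12., 11., 78., 92.]]),
            "labels": torch.tensor([4])}]
metric.update(preds, targets)
print(metric.compute()["map_50"])                  # mAP at IoU threshold 0.5
```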
6. Inference and terminal deployment. After model training is finished, together with the developed system, once the model has been successfully deployed on the terminal, real-time inference and display of the endoscope image can begin; during inference the model's output is post-processed to improve the prediction accuracy of the model.
A lung endoscope image recognition method: in actual use by a doctor, the lung endoscope acquires the lung endoscope image to be recognized, and the lung endoscope image recognition model, obtained by the training method above, recognizes the image.
The present invention has been described in detail with reference to specific embodiments, but the embodiments of the invention are not limited to these descriptions. For a person skilled in the art to which the invention pertains, equivalent alternatives or obvious modifications with the same properties or uses, made without departing from the inventive concept, shall all be considered to fall within the scope of patent protection of the invention as determined by the submitted claims.
Claims (10)
1. A method for training a lung endoscope image recognition model, characterized in that: the method comprises constructing a data set and a neural network model, training the neural network model on 80% of the sample data of the constructed data set, testing on the remaining 20% of the sample data after training is finished and then deploying, and using mAP0.5 as the evaluation index for measuring accuracy on the endoscope image target detection task;
the data set is constructed as follows: a number of clear bronchoscope images are captured from several bronchoscope videos and divided into categories; the images are then annotated by drawing one or more bounding boxes; finally the data set is expanded by data enhancement;
the data enhancement of the data set comprises three different levels of enhancement intensity;
the neural network model is SeqYOLO constructed by combining a YOLOv5 model and LSTM;
the SeqYOLO code establishes a reference by YOLOv5, the YOLOv5 part provides target detection capability, and a new LSTM module is introduced to learn information with time series relation in a video;
the SeqYOLO obtains image characteristic values of each frame in a video sequence through a YOLO backbone respectively, calculates the characteristic values by using an LSTM model to learn information with time sequence relation in the video, outputs an image characteristic value corresponding to an image for reasoning after the LSTM reads the image characteristic values of the whole sequence, and uses the image characteristic values to carry out reasoning and calculate the specific image detection frame and object type in the frame by using a head component in a YOLO series algorithm.
2. The method for training the image recognition model of the pulmonary endoscope of claim 1, wherein: the data set classification comprises 18 categories, which are respectively: epiglottis, vocal cords, main bronchus, left main bronchus, carina, right main bronchus, left superior lobar bronchus, left inferior lobar bronchus, right superior lobar bronchus, right middle bronchus, right inferior lobar bronchus, left intrinsic superior lobar bronchus, left lingual lobar bronchus, left inferior lobar dorsal rampart bronchus, left inferior lobar basal part bronchus, right middle lobar bronchus, right inferior lobar basal part bronchus, and right inferior lobar dorsal part bronchus.
3. The method for training the image recognition model of the pulmonary endoscope of claim 1, wherein: the structure of the YOLOv5 model is divided into two main parts, namely the YOLOv5 Backbone, which extracts information, and an independent YOLOv5 module that regresses the detection boxes, called the YOLOv5 Head.
4. The method for training the image recognition model of the pulmonary endoscope of claim 1, wherein: after reading in continuous video-frame information, the SeqYOLO gives the prediction result for the last frame image, that is, the YOLO Backbone first extracts features from the images, the obtained feature values are input into an LSTM model for time-series learning, the LSTM learns the feature information of each frame and retains the information useful for predicting the next frame, and after the last frame is read in, the LSTM outputs an overall video-segment feature value that integrates all previous-frame information with the last-frame information; this feature value identifies the detection box and object class through the YOLO Head structure.
5. The method for training the image recognition model of the pulmonary endoscope of claim 1, wherein: the training of the SeqYOLO model is divided into two stages: in the first stage, a YOLOv5 model is trained on single labeled pictures until mAP0.5 reaches a fairly accurate value, and the weights of the Backbone and Head components obtained at this stage are used to initialize the model weights of the next stage; in the second stage, the built SeqYOLO model receives the YOLO Backbone and YOLO Head weights from the previous stage in its corresponding parts, and this stage trains on video data, that is, continuous video frames are input to the model: 20 frames obtained by sampling at intervals are fed into the SeqYOLO for learning, and the last of the 20 frames is the target image to be predicted, so only the last frame carries annotation information.
6. The method for training the image recognition model of the pulmonary endoscope of claim 1, wherein: when the SeqYOLO model continues to be trained, the last frame with labels in the video is learned by using a YOLOv5 model in the first stage, and random data enhancement is performed when data are loaded;
in the forward propagation of the model, the data-enhanced pictures pass through the feature extraction of the YOLOv5 Backbone and then enter the YOLOv5 Head, which predicts the specific detection-box positions and object classes;
in the back propagation of the model, the PyTorch deep learning framework performs automatic differentiation and updates the model weights by stochastic mini-batch gradient descent; an SGD optimizer is used during training with the learning rate set to 0.01 and the momentum to 0.937, and a OneCycleLR learning-rate schedule is used throughout training, under which the learning rate rises gradually from zero to the configured 0.01 at the start and then decays slowly during the subsequent convergence.
7. The method for training the image recognition model of the pulmonary endoscope of claim 1, wherein: in the SeqYOLO training stage, the model weights trained in the previous stage are first migrated to the YOLOv5 Backbone and YOLOv5 Head components of the SeqYOLO, and after the previous stage of training these two components are respectively able to extract endoscopic image features and to predict and classify targets from the extracted image features; each input to the SeqYOLO is a video sequence 20 frames long whose last video frame is the endoscopic image to be predicted, and the output of the SeqYOLO model is the target prediction result for the target frame, namely the last frame of the 20-frame video input sequence; after the model obtains a 20-frame image sequence, all images are passed simultaneously through the same YOLOv5 Backbone for feature extraction, finally obtaining 20 groups of feature values that correspond in order to the original 20 frames; the 20 groups of feature values enter an LSTM module for sequence-information learning, at which stage attention is paid to the image information from before the endoscope moved to its current position; after the LSTM receives the 20 groups of feature values it outputs one encoded feature value whose format is the same as the output of the YOLO Backbone, so the YOLO Head is used directly for target position and class calculation, and the LSTM-encoded feature value is input into the YOLO Head to obtain the final prediction result of target position information and class information;
the optimization of the SeqYOLO model likewise updates the model weights through back propagation after forward propagation; the optimizer adopts the Adam method with the learning rate set to 0.001, and the learning-rate schedule uses OneCycleLR.
8. The method for training the image recognition model of the pulmonary endoscope of claim 1, wherein: the SeqYOLO is extended with attention and comprises three attention modules, namely a CBAM-based attention module, the Vision Transformer model ViT, and the Swin Transformer.
9. A lung endoscope image recognition method, characterized in that the method comprises: acquiring a lung endoscope image to be recognized; and recognizing the lung endoscope image to be recognized with a lung endoscope image recognition model, wherein the lung endoscope image recognition model is trained based on the method for training the image recognition model according to any one of claims 1 to 8.
10. The pulmonary endoscope image recognition method of claim 9, wherein: the method is further provided with an inference calculation, and the inference calculation compiles rules based on the structure diagram of the human lung.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211003172.5A CN115393670B (en) | 2022-08-19 | 2022-08-19 | Method for training lung endoscope image recognition model and recognition method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211003172.5A CN115393670B (en) | 2022-08-19 | 2022-08-19 | Method for training lung endoscope image recognition model and recognition method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115393670A true CN115393670A (en) | 2022-11-25 |
CN115393670B CN115393670B (en) | 2024-07-19 |
Family
ID=84121094
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211003172.5A Active CN115393670B (en) | 2022-08-19 | 2022-08-19 | Method for training lung endoscope image recognition model and recognition method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115393670B (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9760806B1 (en) * | 2016-05-11 | 2017-09-12 | TCL Research America Inc. | Method and system for vision-centric deep-learning-based road situation analysis |
CN110288597A (en) * | 2019-07-01 | 2019-09-27 | 哈尔滨工业大学 | Wireless capsule endoscope saliency detection method based on attention mechanism |
US20190325584A1 (en) * | 2018-04-18 | 2019-10-24 | Tg-17, Llc | Systems and Methods for Real-Time Adjustment of Neural Networks for Autonomous Tracking and Localization of Moving Subject |
CN111027461A (en) * | 2019-12-06 | 2020-04-17 | 长安大学 | Vehicle track prediction method based on multi-dimensional single-step LSTM network |
CN112686856A (en) * | 2020-12-29 | 2021-04-20 | 杭州优视泰信息技术有限公司 | Real-time enteroscopy polyp detection device based on deep learning |
CN112932663A (en) * | 2021-03-02 | 2021-06-11 | 成都与睿创新科技有限公司 | Intelligent auxiliary method and system for improving safety of laparoscopic cholecystectomy |
CN113076683A (en) * | 2020-12-08 | 2021-07-06 | 国网辽宁省电力有限公司锦州供电公司 | Modeling method of convolutional neural network model for substation behavior monitoring |
WO2021250951A1 (en) * | 2020-06-08 | 2021-12-16 | Hoya株式会社 | Program, information processing method, and information processing device |
- 2022
- 2022-08-19 CN CN202211003172.5A patent CN115393670B, status Active
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9760806B1 (en) * | 2016-05-11 | 2017-09-12 | TCL Research America Inc. | Method and system for vision-centric deep-learning-based road situation analysis |
US20190325584A1 (en) * | 2018-04-18 | 2019-10-24 | Tg-17, Llc | Systems and Methods for Real-Time Adjustment of Neural Networks for Autonomous Tracking and Localization of Moving Subject |
CN110288597A (en) * | 2019-07-01 | 2019-09-27 | 哈尔滨工业大学 | Wireless capsule endoscope saliency detection method based on attention mechanism |
CN111027461A (en) * | 2019-12-06 | 2020-04-17 | 长安大学 | Vehicle track prediction method based on multi-dimensional single-step LSTM network |
WO2021250951A1 (en) * | 2020-06-08 | 2021-12-16 | Hoya株式会社 | Program, information processing method, and information processing device |
CN113076683A (en) * | 2020-12-08 | 2021-07-06 | 国网辽宁省电力有限公司锦州供电公司 | Modeling method of convolutional neural network model for substation behavior monitoring |
CN112686856A (en) * | 2020-12-29 | 2021-04-20 | 杭州优视泰信息技术有限公司 | Real-time enteroscopy polyp detection device based on deep learning |
CN112932663A (en) * | 2021-03-02 | 2021-06-11 | 成都与睿创新科技有限公司 | Intelligent auxiliary method and system for improving safety of laparoscopic cholecystectomy |
Non-Patent Citations (2)
Title |
---|
KATERYNA ZINCHENKO ET AL.: "Autonomous Endoscope Robot Positioning Using Instrument Segmentation With Virtual Reality Visualization", 《IEEE ACCESS》, 21 May 2021 (2021-05-21), pages 72614 - 72623, XP011855387, DOI: 10.1109/ACCESS.2021.3079427 * |
- GUO Rui et al.: "Research on Robust Video Tracking Technology Based on Object Detection", Software Guide (《软件导刊》), vol. 20, no. 11, 30 November 2021 (2021-11-30), pages 47 - 52 *
Also Published As
Publication number | Publication date |
---|---|
CN115393670B (en) | 2024-07-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP6947759B2 (en) | Systems and methods for automatically detecting, locating, and semantic segmenting anatomical objects | |
CN111161290B (en) | Image segmentation model construction method, image segmentation method and image segmentation system | |
CN107368859A (en) | Training method, verification method and the lesion pattern recognition device of lesion identification model | |
CN111341437B (en) | Digestive tract disease judgment auxiliary system based on tongue image | |
CN111179252B (en) | Cloud platform-based digestive tract disease focus auxiliary identification and positive feedback system | |
CN109948671B (en) | Image classification method, device, storage medium and endoscopic imaging equipment | |
CN109002846B (en) | Image recognition method, device and storage medium | |
CN109063643B (en) | Facial expression pain degree identification method under condition of partial hiding of facial information | |
Zhao et al. | An intelligent augmented reality training framework for neonatal endotracheal intubation | |
CN117524402A (en) | Method for analyzing endoscope image and automatically generating diagnostic report | |
KR20210104190A (en) | Method for analysis and recognition of medical image | |
CN113435236A (en) | Home old man posture detection method, system, storage medium, equipment and application | |
CN110742690A (en) | Method for configuring endoscope and terminal equipment | |
CN113222932B (en) | Small intestine endoscope picture feature extraction method based on multi-convolution neural network integrated learning | |
CN113662664B (en) | Instrument tracking-based objective and automatic evaluation method for surgical operation quality | |
CN111462082A (en) | Focus picture recognition device, method and equipment and readable storage medium | |
CN113763360A (en) | Digestive endoscopy simulator inspection quality assessment method and system | |
Zhao et al. | Automated assessment system for neonatal endotracheal intubation using dilated convolutional neural network | |
CN117137435B (en) | Rehabilitation action recognition method and system based on multi-mode information fusion | |
CN115393670A (en) | Method for training lung endoscope image recognition model and recognition method | |
CN114167993B (en) | Information processing method and device | |
US20240020829A1 (en) | Automatic detection of erosions and ulcers in crohn's capsule endoscopy | |
CN115273176A (en) | Pain multi-algorithm objective assessment method based on vital signs and expressions | |
Palanimeera et al. | Yoga posture recognition by learning spatial-temporal feature with deep learning techniques | |
Liu et al. | Specular reflections detection and removal based on deep neural network for endoscope images |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||