CN113393449A - Endoscope video image automatic storage method based on artificial intelligence

Endoscope video image automatic storage method based on artificial intelligence

Info

Publication number
CN113393449A
Authority
CN
China
Prior art keywords
image
frame
video
similarity
frames
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110710489.1A
Other languages
Chinese (zh)
Inventor
俞晔
方圆圆
姜婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai First Peoples Hospital
Original Assignee
Shanghai First Peoples Hospital
Application filed by Shanghai First Peoples Hospital filed Critical Shanghai First Peoples Hospital
Priority to CN202110710489.1A priority Critical patent/CN113393449A/en
Publication of CN113393449A publication Critical patent/CN113393449A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/0002 Inspection of images, e.g. flaw detection
    • G06T 7/0012 Biomedical image inspection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 11/00 2D [Two Dimensional] image generation
    • G06T 11/001 Texturing; Colouring; Generation of texture or colour
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/40 Analysis of texture
    • G06T 7/41 Analysis of texture based on statistical description of texture
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 9/00 Image coding
    • G06T 9/002 Image coding using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10068 Endoscopic image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30004 Biomedical image processing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Radiology & Medical Imaging (AREA)
  • Quality & Reliability (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of medical image storage and discloses an endoscope video image automatic storage method based on artificial intelligence, which comprises the following steps. S1: selecting a video frame from an endoscope video as a reference frame. S2: starting from the reference frame and moving toward the preceding frames, comparing the similarity of each pair of adjacent frames one by one, stopping when a pair of adjacent frames whose similarity is lower than a similarity threshold appears, and selecting the later frame of the last compared pair as the first frame. S3: starting from the reference frame and moving toward the following frames, comparing the similarity of each pair of adjacent frames one by one until a pair of adjacent frames whose similarity is lower than the similarity threshold appears, and selecting the earlier frame of the last compared pair as the tail frame. S4: selecting all video frames from the first frame to the last frame as target images. S5: constructing and training a convolutional neural network, and compressing the target images with the convolutional neural network. S6: storing the compressed target images. The method reduces the space occupied by the required video frames and improves the space utilization rate of the memory.

Description

Endoscope video image automatic storage method based on artificial intelligence
Technical Field
The invention relates to the technical field of medical image storage, in particular to an endoscope video image automatic storage method based on artificial intelligence.
Background
Capsule endoscopy is a commonly used examination method for gastrointestinal diseases: it visually shows the condition of the patient's gastrointestinal tract, is non-invasive and reduces the patient's discomfort. In an existing capsule endoscopy examination, the patient swallows the capsule endoscope, the capsule passes through the part to be inspected by means of the patient's gastrointestinal peristalsis, and the captured video is sent wirelessly to an electronic device for medical personnel to watch. Storing the video frames of the entire capsule endoscopy video in the hospital database occupies a large amount of storage space and makes it inconvenient for medical personnel to search and view the images.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide an endoscope video image automatic storage method based on artificial intelligence.
In order to achieve the above purpose, the invention provides the following technical scheme:
An endoscope video image automatic storage method based on artificial intelligence is characterized by comprising the following steps. S1: selecting a video frame from an endoscope video as a reference frame. S2: starting from the reference frame and moving toward the preceding frames, comparing the similarity of each pair of adjacent frames one by one, stopping when a pair of adjacent frames whose similarity is lower than a similarity threshold appears, and selecting the later frame of the last compared pair as the first frame. S3: starting from the reference frame and moving toward the following frames, comparing the similarity of each pair of adjacent frames one by one until a pair of adjacent frames whose similarity is lower than the similarity threshold appears, and selecting the earlier frame of the last compared pair as the tail frame. S4: selecting all video frames from the first frame to the last frame as target images. S5: constructing and training a convolutional neural network, and compressing the target images with the convolutional neural network. S6: storing the compressed target images.
In the present invention, preferably, the comparing the similarity between two adjacent frames in S2 and S3 includes: s21: respectively extracting color features of two video frames; s22: judging whether the color characteristics of the two video frames are similar, if so, executing S23, otherwise, executing S26; s23: respectively extracting texture features of two video frames; s24: judging whether the texture features of the two video frames are similar, if so, executing S25, otherwise, executing S26; s25: identifying two video frames as similar images; s26: two video frames are not considered to be similar images.
In the present invention, preferably, S21 includes: S211: converting the video frame from the RGB color space to the HSI color space; S212: sampling the color saturation component of the video frame in the HSI color space to form a saturation vector; S213: carrying out standard normalization processing on the saturation vector to form the color feature.
In the present invention, preferably, S23 includes: s231: establishing an image pyramid for tone components of a video frame of an HSI color space; s232: and each layer of the image pyramid adopts a local binary mode operator to extract texture features, so that the texture features are formed.
In the present invention, preferably, the determination of whether the color features of the two video frames are similar in S22 is implemented by comparing cosine values of vector angles of the color features of the two video frames.
In the present invention, preferably, the determining whether the texture features of the two video frames are similar in S24 is implemented by a method of statistical matching of global texture features.
In the present invention, preferably, the compressing the target image by using the convolutional neural network in S5 includes: s51: carrying out feature extraction on the target image to form a feature image corresponding to the target image; s52: removing redundant information in the characteristic image to form a concise characteristic image; s53: and reconstructing the concise feature image to form a reconstructed image corresponding to the target image.
In the present invention, preferably, S51 includes: s511: performing convolution on the target image by utilizing two cascaded first convolution layers to form a first characteristic image; s512: learning the first characteristic image by utilizing three cascaded residual modules to form a second characteristic image; s513: and forming a third characteristic image by convolving the second characteristic image by using a second convolution layer.
In the present invention, it is preferable that the removing of redundant information in the feature image in S52 is performed by a Round function.
In the present invention, preferably, S53 includes: S531: convolving the concise feature image by using a third convolution layer to form a fourth feature image, wherein the convolution kernel size of the third convolution layer is 1 × 1, the number of convolution kernels is 512, and the convolution step length is 1; S532: convolving the fourth characteristic image by using a sub-pixel convolution layer to form a fifth characteristic image; S533: learning the fifth characteristic image by using three cascaded residual modules to form a sixth characteristic image; S534: convolving the sixth characteristic image by using a sub-pixel convolution layer to form a seventh characteristic image; S535: convolving the seventh characteristic image by using a sub-pixel convolution layer to form a reconstructed image.
Compared with the prior art, the invention has the beneficial effects that:
the endoscope video image automatic storage method based on artificial intelligence rapidly intercepts the required video frames in the endoscope video by manually selecting the reference frames and compresses the video frames through the convolutional neural network, so that the storage space occupied by the required video frames is obviously reduced, the space utilization rate of a storage is improved, and the searching by medical personnel is facilitated; the similarity of the video frames is compared by adopting the color features and the texture features, so that the accuracy of judging the similarity is ensured; and the convolutional neural network is adopted for image compression, so that the image compression efficiency is improved.
Drawings
FIG. 1 is a flow chart of an artificial intelligence based endoscopic video image automatic storage method.
Fig. 2 is a flowchart of comparing the similarity between two adjacent frames in S2 and S3 in the method for automatically storing endoscopic video images based on artificial intelligence.
Fig. 3 is a flowchart of S21 in the artificial intelligence based endoscopic video image automatic storage method.
Fig. 4 is a flowchart of S23 in the artificial intelligence based endoscopic video image automatic storage method.
Fig. 5 is a flowchart of compressing the target image by using the convolutional neural network in S5 of the method for automatically storing an endoscopic video image based on artificial intelligence.
Fig. 6 is a flowchart of S51 in the artificial intelligence based endoscopic video image automatic storage method.
Fig. 7 is a flowchart of S53 in the artificial intelligence based endoscopic video image automatic storage method.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It will be understood that when an element is referred to as being "secured to" another element, it can be directly on the other element or intervening elements may also be present. When a component is referred to as being "connected" to another component, it can be directly connected to the other component or intervening components may also be present. When a component is referred to as being "disposed on" another component, it can be directly on the other component or intervening components may also be present. The terms "vertical," "horizontal," "left," "right," and the like as used herein are for illustrative purposes only.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
Referring to fig. 1 to 3, a preferred embodiment of the present invention provides an endoscope video image automatic storage method based on artificial intelligence, including:
s1: one video frame is selected from the endoscopic video as a reference frame.
The endoscopic video shot by the capsule endoscope contains footage of the esophagus, stomach, small intestine, large intestine and other parts of the human body; the video is long, and viewing it in full takes several hours, whereas an examination is usually aimed at one specific part. An image segment therefore needs to be intercepted from the whole endoscopic video for disease diagnosis. The interception can be carried out in the semi-automatic manner of this embodiment: the user selects a video frame containing the examined part as the reference frame, the boundaries of the video segment are then determined from the reference frame, and the required video segment is intercepted accurately.
S2: and comparing the similarity of two adjacent frames from the reference frame to the reference frame one by one, stopping until two adjacent frames with the similarity lower than a similarity threshold appear, and selecting the next frame in the two adjacent frames which are compared at last as the first frame.
In this step, if the reference frame selected by the user is the N-th frame of the endoscope video, its preceding frame is the (N-1)-th frame. The similarity of the two frames is compared; if the similarity is lower than the similarity threshold, the two frames are determined to be dissimilar, the comparison stops, and the N-th frame is selected as the first frame. If the similarity is not lower than the similarity threshold, the two frames are determined to be similar, and the similarity comparison continues with the (N-1)-th and (N-2)-th frames. Similarity comparison of adjacent frames proceeds in this way until two adjacent video frames whose similarity is lower than the similarity threshold are found, namely the (X-1)-th frame and the X-th frame, and the X-th frame is selected as the first frame.
S3: and comparing the similarity of two adjacent frames from the reference frame to the next frame by frame until two adjacent frames with the similarity lower than the similarity threshold appear, and selecting the previous frame in the two adjacent frames with the last comparison as the tail frame.
The procedure of this step is similar to S2, except that the comparison now proceeds toward the following frames. If the reference frame selected by the user is the N-th frame of the endoscope video, its following frame is the (N+1)-th frame. The similarity of the two frames is compared; if the similarity is lower than the similarity threshold, the two frames are determined to be dissimilar, the comparison stops, and the N-th frame is selected as the tail frame. If the similarity is not lower than the similarity threshold, the two frames are determined to be similar, and the similarity comparison continues with the (N+1)-th and (N+2)-th frames. Similarity comparison of adjacent frames proceeds in this way until two adjacent video frames whose similarity is lower than the similarity threshold are found, namely the Y-th frame and the (Y+1)-th frame, and the Y-th frame is selected as the tail frame.
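By way of illustration only, the boundary search of S2 and S3 can be sketched in a few lines of Python; the helper frames_are_similar stands for the color/texture comparison of S21 to S26 described below, and all names here are illustrative rather than part of the patented method.

    def find_segment(frames, ref_idx, frames_are_similar):
        """Return (first, last) frame indices of the clip around the reference frame."""
        # S2: walk toward the preceding frames until an adjacent pair is
        # dissimilar; the later frame of that pair becomes the first frame.
        first = ref_idx
        while first > 0 and frames_are_similar(frames[first - 1], frames[first]):
            first -= 1
        # S3: walk toward the following frames until an adjacent pair is
        # dissimilar; the earlier frame of that pair becomes the tail frame.
        last = ref_idx
        while last < len(frames) - 1 and frames_are_similar(frames[last], frames[last + 1]):
            last += 1
        return first, last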
In this embodiment, specifically, the similarity comparing two adjacent frames in S2 and S3 includes:
s21: and respectively extracting the color features of the two video frames.
Among the feature extraction algorithms of computer vision, color features do not require a large amount of computation: the pixel values of the digital image only need to be converted and expressed as numerical values, which makes color a good low-complexity feature. Common color feature extraction methods include the color histogram and the color moments of an image.
Although the endoscope video image is in the RGB color space, each component of the RGB color space is closely related to brightness: whenever the brightness changes, all three components change with it, so the three components are not independent. The illumination conditions inside the human body, however, vary greatly, so the RGB color space is not suitable for processing endoscope video images.
To this end, the video frames may be converted to the HSI color space. The HSI color space has the obvious advantage that the intensity component I is separated from the color saturation S and the hue component H, so that the problem of uneven brightness distribution of the images of the capsule endoscopy can be effectively solved only by extracting and analyzing the characteristics of the color saturation S. Specifically, S21 includes:
s211: the video frame is converted from the RBG color space to the HSI color space.
The conversion of the original video frame from the RGB color space to the HSI color space consists of three parts, the hue component H, the color saturation S and the intensity I. In the standard form of the conversion, the hue component H is

    H = θ          when B ≤ G
    H = 360° - θ   when B > G

where

    θ = arccos{ [(R - G) + (R - B)] / [ 2 · sqrt( (R - G)² + (R - B)(G - B) ) ] }

The color saturation S is calculated by the formula

    S = 1 - 3 · min(R, G, B) / (R + G + B)

The intensity I is calculated by the formula

    I = (R + G + B) / 3
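For illustration, a minimal NumPy sketch of this standard conversion is given below; it assumes an 8-bit RGB frame and returns H, S and I scaled to [0, 1], and it is not taken from the patent itself.

    import numpy as np

    def rgb_to_hsi(frame_rgb):
        """Standard RGB -> HSI conversion; returns H, S, I scaled to [0, 1]."""
        rgb = frame_rgb.astype(np.float64) / 255.0
        r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
        eps = 1e-8
        num = 0.5 * ((r - g) + (r - b))
        den = np.sqrt((r - g) ** 2 + (r - b) * (g - b)) + eps
        theta = np.arccos(np.clip(num / den, -1.0, 1.0))
        h = np.where(b <= g, theta, 2 * np.pi - theta) / (2 * np.pi)  # hue component H
        i = (r + g + b) / 3.0                                         # intensity I
        s = 1.0 - np.minimum(np.minimum(r, g), b) / (i + eps)         # saturation S
        return h, s, i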
S212: the color saturation components of the video frames of the HSI color space are sampled to form a saturation vector.
After the color space conversion is completed, the saturation component of the video frame is sampled at intervals: sampling proceeds from left to right and from top to bottom, taking a fixed number of pixels (chosen as required, for example 5) as the row-column interval. After the sampling points of the invalid edge region are removed, an n-dimensional vector C is obtained; this vector C is the saturation vector of the endoscope image.
S213: and carrying out standard normalization processing on the saturation vector to form color characteristics.
In order to improve the robustness of the extracted saturation vector, the vector C needs to be normalized by the following formula:
    E_i = (C_i - μ) / σ

where μ is the mean of all components of the vector C, σ is the standard deviation of the components of C, C_i is the i-th component of C, and E_i is the i-th component after standard normalization; the normalized vector E is the color feature of the video frame.
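Continuing the sketch above, S212 and S213 could look as follows; the sampling interval of 5 pixels and the edge handling (omitted here) are assumptions.

    import numpy as np

    def color_feature(frame_rgb, step=5):
        """Saturation-vector colour feature of one frame (sketch of S212-S213)."""
        _, s, _ = rgb_to_hsi(frame_rgb)            # conversion from the S211 sketch
        c = s[::step, ::step].ravel()              # S212: interval sampling of saturation
        return (c - c.mean()) / (c.std() + 1e-8)   # S213: standard normalisation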
S22: and judging whether the color features of the two video frames are similar, if so, executing S23, and if not, executing S26.
In this step, in order to determine the color similarity of two adjacent frames, a similarity measure is applied to the color feature vectors of the video frames. A similarity measure is a comprehensive assessment of how similar two things are based on their features. Common similarity measures include the Euclidean distance, the cosine of the included angle, the Manhattan distance, the Mahalanobis distance and the correlation coefficient. This embodiment determines whether the color features of two video frames are similar by comparing the cosine of the angle between their color feature vectors. The cosine of the included angle is calculated as

    AS = (E · E') / (|E| · |E'|)

that is, the dot product of the color feature vectors E and E' of two adjacent video frames divided by the product of their moduli; the result AS is the cosine of the angle between the two vectors, with a value range of [-1, 1]. The larger the value of AS, the smaller the angle θ between the two feature vectors E and E', and the more similar the color features of the two video frames are. In the similarity judgment, the similarity threshold for the color features is determined according to actual needs.
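The angle-cosine comparison of S22 can then be written as below; the threshold value used here is only an assumed placeholder.

    import numpy as np

    def color_similar(e1, e2, threshold=0.9):
        """True if the angle cosine AS of two colour-feature vectors exceeds the threshold."""
        cos = np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2) + 1e-12)
        return cos >= threshold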
S23: and respectively extracting texture features of the two video frames.
The video frames contain abundant texture information, and the texture characteristics can be extracted to be used as a basis for judging the similarity of the two video frames. Since the hue H of the HSI space is not sensitive to illumination and the image contains rich information, texture analysis is performed for the hue H. The extraction method includes an LBP (local binary pattern) feature method, a gray level co-occurrence matrix method and the like. The present embodiment employs an LBP feature method. Specifically, S23 includes:
s231: an image pyramid is established for the tone components of the video frame of the HSI color space.
In order to extract the features of textures of different coarseness in a video frame, textures of different scales are expressed by an image pyramid. Let the total number of layers of the image pyramid be n; the image of the (m + 1)-th layer is obtained by filtering and downsampling the image of the m-th layer, which can be written as

    G_{m+1}(i, j) = Σ_{s,t} ω(s, t) · G_m(a·i + s, a·j + t)

where G_m is the image of the m-th layer of the pyramid, G_{m+1} is the image of the (m + 1)-th layer, 0 ≤ m ≤ 2, and G_0 (m = 0) is the original image of the video frame; ω is the template of the mean filter, whose size is a × a, and a is the row-column interval of the downsampling. A mean filter is used in the filtering step because its computation cost is small and the constructed pyramid still has a good visual effect.
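A small sketch of a three-layer pyramid built with an a × a mean filter and stride-a downsampling follows; a = 2 is assumed here for illustration.

    import numpy as np

    def build_hue_pyramid(hue, levels=3, a=2):
        """Mean-filter-and-downsample pyramid of the hue component (sketch of S231)."""
        pyramid = [hue.astype(np.float64)]
        for _ in range(levels - 1):
            g = pyramid[-1]
            h, w = g.shape
            g = g[: h - h % a, : w - w % a]            # crop to a multiple of a
            blocks = g.reshape(h // a, a, w // a, a)   # split into a x a blocks
            pyramid.append(blocks.mean(axis=(1, 3)))   # mean filter + downsample
        return pyramid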
S232: and each layer of the image pyramid adopts a local binary mode operator to extract texture features, so that the texture features are formed.
The local binary pattern (LBP) operator works on a rectangular window of size 3 × 3. With the central pixel value g_c as the threshold, the p pixel values g_i in a circular neighborhood of radius r are binarized: a point smaller than the central pixel value g_c is binarized to 0, and a point greater than or equal to the central pixel value is binarized to 1, which yields an eight-bit binary number. The binarization result at the position of g_i is then weighted by 2^i and the weighted results are summed to obtain the corresponding LBP value:

    LBP = Σ_{i=0}^{p-1} s(g_i - g_c) · 2^i

where s(x) is the binarization function, defined as

    s(x) = 1 if x ≥ 0, and s(x) = 0 if x < 0

Extracting texture features with the local binary pattern operator LBP at each layer of the image pyramid yields the LBP feature spectra O_m, where m = 1, 2, 3 corresponds to the three layers of the pyramid; these spectra are the multi-scale local texture features extracted from the hue component H of the video frame.
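For illustration, the 8-neighbour, radius-1 LBP spectra O_m of the pyramid layers can be computed with scikit-image; the choice of library is an assumption and not part of the patent.

    from skimage.feature import local_binary_pattern

    def lbp_spectra(pyramid):
        """LBP feature spectrum O_m for each pyramid layer (sketch of S232)."""
        return [local_binary_pattern(layer, P=8, R=1, method='default')
                for layer in pyramid]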
S24: and judging whether the texture features of the two video frames are similar, if so, executing S25, and if not, executing S26.
In this step, in order to determine the texture similarity between two adjacent frames of images, similarity measurement needs to be performed on the texture feature vectors of the video frames. Specifically, the texture similarity measurement is implemented by a statistical matching method of global texture features.
The method first performs a global analysis of the extracted feature spectra O_m. Texture feature extraction on the hue component of the image pyramid of a video frame yields the LBP feature spectra O_m, m = 1, 2, 3. Texture statistics on the feature spectrum of the m-th layer give the frequency y_mn of the texture code LBP_m = n, with n = 0, 1, 2, 3, 4, and the counted frequencies form the texture feature vector Y_m of that layer. Carrying out the texture statistics for the feature spectrum of every layer gives the global texture feature vector [Y_1, Y_2, Y_3].

The similarity of the global texture features is measured with a weighted Manhattan distance. After the feature vectors [Y_1, Y_2, Y_3] of two adjacent frames have been obtained, the similarity of their texture features is calculated as a weighted Manhattan distance of the form

    D_m = (λ_1 / β_m) · Σ_n | y_mn - y'_mn |

where y_mn and y'_mn are the frequencies of the LBP code LBP_m = n in the m-th layer feature spectra of the two video frames, β_m is the number of pixel points in the texture-effective area of the m-th layer of the image pyramid, λ_1 is a weighting coefficient, and D_m represents the global texture similarity of the m-th pyramid layer of the two video frames. In the similarity judgment, the similarity threshold for the texture features is determined according to actual needs.
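A sketch of the per-layer statistics and of the weighted Manhattan comparison follows; the binning of the LBP codes, the choice of β_m, and the values of λ_1 and the threshold are assumptions made for illustration.

    import numpy as np

    def texture_feature(spectra, n_codes=256):
        """Frequency vectors Y_m of the LBP codes, one per pyramid layer (sketch)."""
        return [np.bincount(s.astype(np.int64).ravel(), minlength=n_codes)
                for s in spectra]

    def texture_similar(ys_a, ys_b, lam=1.0, threshold=0.1):
        """Weighted Manhattan distance D_m per layer; frames are treated as similar
        when every layer distance stays below the (assumed) threshold."""
        for ya, yb in zip(ys_a, ys_b):
            beta = max(int(ya.sum()), 1)           # pixel count of the layer's texture area
            d = lam * np.abs(ya - yb).sum() / beta
            if d > threshold:
                return False
        return True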
S25: two video frames are considered similar images.
S26: two video frames are not considered to be similar images.
Through the steps of S21, S22, S23, and S24, it can be concluded whether or not two video frames are similar images.
S4: and selecting all video frames from the first frame to the last frame as target images.
After the selection of the first frame and the last frame is completed, all the video frames between the two frames are video images of the capsule endoscope at the part to be inspected, so the video frames should be determined as target images for medical personnel to diagnose.
S5: and constructing and training a convolutional neural network, and compressing the target image by using the convolutional neural network.
The core of compressing and encoding a video frame with a convolutional neural network is to convert the original image information into a feature image through convolutional downsampling. Compared with the original image, the feature image is smaller, has lower information entropy and is better suited to binary encoding. The feature image is then restored to a reconstructed image by a deconvolution operation, so that a large amount of redundant information is removed while the key information is retained; the image is thereby compressed and occupies less storage space. Specifically, S5 includes:
s51: and performing feature extraction on the target image to form a feature image corresponding to the target image.
In the step, an original target image is input into a convolution layer, the convolution layer performs convolution operation to extract a characteristic image, and each convolution is a characteristic extraction process. Specifically, S51 includes:
s511: and performing convolution on the target image by utilizing the two cascaded first convolution layers to form a first characteristic image.
In this step, the shallow features of the target image are extracted by the two first convolution layers, and the resulting image is recorded as the first feature image for ease of distinction. The number of channels of the first convolution layer is 128, the convolution kernel size is 5 × 5, the step size is 2, and each first convolution layer is followed by a ReLU activation layer. After an image of size M × N passes through the two first convolution layers, 128 first feature maps of size M/4 × N/4 are obtained.
S512: and learning the first characteristic image by utilizing three cascaded residual modules to form a second characteristic image.
In this step, the first feature image is learned by three cascaded residual modules to extract deeper features, and the resulting image is recorded as the second feature image for ease of distinction. Each residual module contains several residual convolution layers, with a skip connection between its input and output. After the 128 first feature images of size M/4 × N/4 pass through the three cascaded residual modules, 128 second feature images of size M/4 × N/4 are obtained.
S513: and forming a third characteristic image by convolving the second characteristic image by using a second convolution layer.
This step convolves the second feature image with a second convolution layer to change the number of channels, and the resulting image is recorded as the third feature image for ease of distinction. After the 128 second feature images of size M/4 × N/4 pass through the second convolution layer, whose convolution kernel size is 5 × 5, step size is 2 and number of convolution kernels is F, F third feature images of size M/8 × N/8 are obtained.
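A PyTorch sketch of the S511-S513 encoder is given below; the internal layout of the residual modules and the value of F (64 here) are assumptions, since the text does not fix them.

    import torch.nn as nn

    class ResidualBlock(nn.Module):
        """Residual module with a skip connection (internal layout assumed)."""
        def __init__(self, ch=128):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(ch, ch, 3, padding=1))

        def forward(self, x):
            return x + self.body(x)

    class Encoder(nn.Module):
        """Sketch of S511-S513: two stride-2 5x5 convolutions, three residual
        modules, then a stride-2 5x5 convolution with F output channels."""
        def __init__(self, f_channels=64):
            super().__init__()
            self.first = nn.Sequential(
                nn.Conv2d(3, 128, 5, stride=2, padding=2), nn.ReLU(inplace=True),
                nn.Conv2d(128, 128, 5, stride=2, padding=2), nn.ReLU(inplace=True))
            self.res = nn.Sequential(*[ResidualBlock(128) for _ in range(3)])
            self.second = nn.Conv2d(128, f_channels, 5, stride=2, padding=2)

        def forward(self, x):          # x: (B, 3, M, N) target image
            x = self.first(x)          # (B, 128, M/4, N/4) first feature image
            x = self.res(x)            # (B, 128, M/4, N/4) second feature image
            return self.second(x)      # (B, F, M/8, N/8) third feature image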
S52: and removing redundant information in the characteristic image to form a concise characteristic image.
The step can remove redundant information in the third characteristic image by inputting the third characteristic image into a specific estimation function for processing, and key information is reserved to obtain an indirect characteristic image. Preferably, the present embodiment implements the above process by using a Round function. The Round function is a quantization function, and quantizes the input third feature image and rounds the floating point number of the third feature image to an integer number. The Round function quantizes the input floating point number into an integer, and directly quantizes each third feature map during network forward propagation. The derivative of the quantization function itself is mostly 0 and is not conducive in other places, such as directly using the Round function itself to calculate the gradient and applying it to the network, which will make the gradient unable to be transmitted to the next layer through the Round layer. Therefore, it is necessary to approximate the Round function as a continuous function r (x), and to replace the Round derivative with the derivative of r (x) in the reverse propagation, i.e. to approximate the Round function as a continuous function r (x), i.e. to use the derivative of r (x) in the reverse propagation
Figure BDA0003133523320000121
Where round (x) is a quantization function, and r (x) is an approximation function of round (x). It can take round (x) r (x) x, where round (x) is used for the forward process when the network is actually trained, and r (x) x is used for the reverse propagation of the gradient to calculate the reverse derivative. When r (x) is equal to x, the derivative of r (x) to x is 1, so that after the gradient of the upper layer of the network passes through the Round function according to the chain rule, the gradient value is multiplied by 1, namely the gradient value is kept unchanged, and then the lower layer of the Round layer is transmitted, so that after the approximation according to the method, the Round layer is only substantially equivalent to one wire connecting the upper layer and the lower layer of the Round layer during back propagation, and the gradient of the network is not influenced.
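This straight-through behaviour can be expressed directly as a custom autograd function in PyTorch; a minimal sketch (not the patent's own code) is:

    import torch

    class RoundSTE(torch.autograd.Function):
        """Round quantizer with a straight-through gradient: the forward pass
        uses round(x); the backward pass uses the derivative of R(x) = x,
        i.e. it passes the incoming gradient through unchanged."""
        @staticmethod
        def forward(ctx, x):
            return torch.round(x)

        @staticmethod
        def backward(ctx, grad_output):
            return grad_output

    # usage: concise_features = RoundSTE.apply(third_feature_images)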
S53: and reconstructing the concise feature image to form a reconstructed image corresponding to the target image.
Inputting the concise feature image into a deconvolution layer, and performing deconvolution operation on the deconvolution layer to recover the concise feature image. Specifically, S53 includes:
s531: and (3) convolving the concise feature image by using a third convolution layer to form a fourth feature image, wherein the convolution kernel size of the third convolution layer is 1 multiplied by 1, the number of the convolution kernels is 512, and the convolution step size is 1.
The step changes the number of channels, converts the concise feature images into 512 new feature images with the size of M/8 multiplied by N/8, and marks the new feature images as a fourth feature image for convenient distinguishing.
S532: and performing convolution on the fourth characteristic image by using a sub-pixel convolution layer to form a fifth characteristic image.
In the step, 256 new feature images with the size of M/4 multiplied by N/4 are obtained through the sub-pixel convolution layer and are marked as a fifth feature image for convenient distinguishing.
S533: and learning the fifth characteristic image by utilizing three cascaded residual modules to form a sixth characteristic image.
In the step, 256 new characteristic images with the size of M/4 multiplied by N/4 are obtained through three cascaded residual modules and are marked as a sixth characteristic image for convenient distinguishing.
S534: and performing convolution on the sixth characteristic image by using a sub-pixel convolution layer to form a seventh characteristic image.
In the step, 128 new characteristic images with the size of M/2 XN/2 are obtained through one sub-pixel convolution layer and are marked as a seventh characteristic image for convenient distinguishing.
S535: and performing convolution on the seventh characteristic image by utilizing a sub-pixel convolution layer to form a reconstructed image.
The seventh characteristic image is restored to a new image with the same size as the original image through a sub-pixel convolution layer, namely the reconstructed image.
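A matching PyTorch sketch of the S531-S535 decoder is given below, reusing the ResidualBlock of the encoder sketch above; each sub-pixel stage is written as a 3 × 3 convolution followed by PixelShuffle so that the feature-map sizes stated above are reproduced, which is an assumption about the layer details.

    import torch.nn as nn

    def subpixel_up(in_ch, out_ch):
        """Sub-pixel convolution stage: a 3x3 conv expands the channels, then
        PixelShuffle rearranges them into a feature map twice as large."""
        return nn.Sequential(nn.Conv2d(in_ch, out_ch * 4, 3, padding=1),
                             nn.PixelShuffle(2))

    class Decoder(nn.Module):
        """Sketch of S531-S535, mirroring the feature-map sizes given in the text."""
        def __init__(self, f_channels=64):
            super().__init__()
            self.expand = nn.Conv2d(f_channels, 512, kernel_size=1, stride=1)   # S531
            self.up1 = subpixel_up(512, 256)                                    # S532
            self.res = nn.Sequential(*[ResidualBlock(256) for _ in range(3)])   # S533
            self.up2 = subpixel_up(256, 128)                                    # S534
            self.up3 = subpixel_up(128, 3)                                      # S535

        def forward(self, z):          # z: (B, F, M/8, N/8) concise feature image
            x = self.expand(z)         # (B, 512, M/8, N/8) fourth feature image
            x = self.up1(x)            # (B, 256, M/4, N/4) fifth feature image
            x = self.res(x)            # (B, 256, M/4, N/4) sixth feature image
            x = self.up2(x)            # (B, 128, M/2, N/2) seventh feature image
            return self.up3(x)         # (B, 3, M, N) reconstructed image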
S6: and storing the compressed target image.
Through the steps, redundant information is removed from the reconstructed image, the information amount of the reconstructed image is obviously less than that of the original target image, and the compressed target image, namely the reconstructed image can be stored in a memory in a bitmap storage mode.
The above description is intended to describe in detail the preferred embodiments of the present invention, but the embodiments are not intended to limit the scope of the claims of the present invention, and all equivalent changes and modifications made within the technical spirit of the present invention should fall within the scope of the claims of the present invention.

Claims (10)

1. An endoscope video image automatic storage method based on artificial intelligence is characterized by comprising the following steps:
S1: selecting a video frame from an endoscope video as a reference frame;
S2: starting from the reference frame and moving toward the preceding frames, comparing the similarity of each pair of adjacent frames one by one, stopping when a pair of adjacent frames whose similarity is lower than a similarity threshold appears, and selecting the later frame of the last compared pair as the first frame;
S3: starting from the reference frame and moving toward the following frames, comparing the similarity of each pair of adjacent frames one by one until a pair of adjacent frames whose similarity is lower than the similarity threshold appears, and selecting the earlier frame of the last compared pair as the tail frame;
S4: selecting all video frames from the first frame to the last frame as target images;
S5: constructing and training a convolutional neural network, and compressing the target images with the convolutional neural network;
S6: storing the compressed target images.
2. The method of claim 1, wherein the comparing the similarity between two adjacent frames in S2 and S3 comprises:
s21: respectively extracting color features of two video frames;
s22: judging whether the color characteristics of the two video frames are similar, if so, executing S23, otherwise, executing S26;
s23: respectively extracting texture features of two video frames;
s24: judging whether the texture features of the two video frames are similar, if so, executing S25, otherwise, executing S26;
s25: identifying two video frames as similar images;
s26: two video frames are not considered to be similar images.
3. The method for automatically storing endoscopic video images based on artificial intelligence as claimed in claim 2, wherein S21 includes:
s211: converting the video frame from the RGB color space to the HSI color space;
s212: sampling color saturation components of video frames of an HSI color space to form saturation vectors;
s213: and carrying out standard normalization processing on the saturation vector to form color characteristics.
4. The artificial intelligence based endoscopic video image automatic storage method according to claim 3, wherein S23 includes:
s231: establishing an image pyramid for tone components of a video frame of an HSI color space;
s232: and each layer of the image pyramid adopts a local binary mode operator to extract texture features, so that the texture features are formed.
5. The method for automatically storing endoscopic video images based on artificial intelligence as claimed in claim 3, wherein said determining whether the color features of the two video frames are similar in S22 is performed by comparing cosine values of vector angles of the color features of the two video frames.
6. The method for automatically storing endoscopic video images based on artificial intelligence as claimed in claim 3, wherein said determining whether the texture features of two video frames are similar in S24 is implemented by statistical matching of global texture features.
7. The method for automatically storing artificial intelligence based endoscopic video images as claimed in claim 1, wherein said compressing the target image with convolutional neural network in S5 comprises:
s51: carrying out feature extraction on the target image to form a feature image corresponding to the target image;
s52: removing redundant information in the characteristic image to form a concise characteristic image;
s53: and reconstructing the concise feature image to form a reconstructed image corresponding to the target image.
8. The method for automatically storing endoscopic video images based on artificial intelligence as claimed in claim 7, wherein S51 includes:
s511: performing convolution on the target image by utilizing two cascaded first convolution layers to form a first characteristic image;
s512: learning the first characteristic image by utilizing three cascaded residual modules to form a second characteristic image;
s513: and forming a third characteristic image by convolving the second characteristic image by using a second convolution layer.
9. The method for automatically storing endoscope video images based on artificial intelligence as claimed in claim 7, wherein the removing redundant information in the characteristic images in S52 is implemented by Round function.
10. The method for automatically storing endoscopic video images based on artificial intelligence as claimed in claim 7, wherein S53 includes:
s531: convolving the concise feature image by using a third convolution layer to form a fourth feature image, wherein the convolution kernel size of the third convolution layer is 1 × 1, the number of the convolution kernels is 512, and the convolution step length is 1;
s532: convolving the fourth characteristic image by using a sub-pixel convolution layer to form a fifth characteristic image;
s533: learning the fifth characteristic image by utilizing three cascaded residual modules to form a sixth characteristic image;
s534: convolving the sixth characteristic image by using a sub-pixel convolution layer to form a seventh characteristic image;
s535: and performing convolution on the seventh characteristic image by utilizing a sub-pixel convolution layer to form a reconstructed image.
CN202110710489.1A 2021-06-25 2021-06-25 Endoscope video image automatic storage method based on artificial intelligence Pending CN113393449A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110710489.1A CN113393449A (en) 2021-06-25 2021-06-25 Endoscope video image automatic storage method based on artificial intelligence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110710489.1A CN113393449A (en) 2021-06-25 2021-06-25 Endoscope video image automatic storage method based on artificial intelligence

Publications (1)

Publication Number Publication Date
CN113393449A true CN113393449A (en) 2021-09-14

Family

ID=77623889

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110710489.1A Pending CN113393449A (en) 2021-06-25 2021-06-25 Endoscope video image automatic storage method based on artificial intelligence

Country Status (1)

Country Link
CN (1) CN113393449A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090203964A1 (en) * 2008-02-13 2009-08-13 Fujifilm Corporation Capsule endoscope system and endoscopic image filing method
CN102117329A (en) * 2011-03-04 2011-07-06 南方医科大学 Capsule endoscope image retrieval method based on wavelet transformation
CN110913243A (en) * 2018-09-14 2020-03-24 华为技术有限公司 Video auditing method, device and equipment
CN109635871A (en) * 2018-12-12 2019-04-16 浙江工业大学 A kind of capsule endoscope image classification method based on multi-feature fusion
CN111327945A (en) * 2018-12-14 2020-06-23 北京沃东天骏信息技术有限公司 Method and apparatus for segmenting video
CN112070702A (en) * 2020-09-14 2020-12-11 中南民族大学 Image super-resolution reconstruction system and method for multi-scale residual error feature discrimination enhancement
CN112330542A (en) * 2020-11-18 2021-02-05 重庆邮电大学 Image reconstruction system and method based on CRCSAN network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
彭同胜, 刘小燕, 龚军辉, 蒋笑笑: "Capsule endoscopy video reduction based on color matching and improved LBP" (基于颜色匹配和改进LBP的胶囊内镜视频缩减), Journal of Electronic Measurement and Instrumentation (《电子测量与仪器学报》) *

Similar Documents

Publication Publication Date Title
CN111127412B (en) Pathological image recognition device based on generation countermeasure network
CN113257413A (en) Cancer prognosis survival prediction method and device based on deep learning and storage medium
CN110738655B (en) Image report generation method, device, terminal and storage medium
US20130188845A1 (en) Device, system and method for automatic detection of contractile activity in an image frame
CN111407245A (en) Non-contact heart rate and body temperature measuring method based on camera
CN111667453A (en) Gastrointestinal endoscope image anomaly detection method based on local feature and class mark embedded constraint dictionary learning
CN113012140A (en) Digestive endoscopy video frame effective information region extraction method based on deep learning
CN115985505B (en) Multidimensional fusion myocardial ischemia auxiliary diagnosis model and construction method thereof
CN112200162A (en) Non-contact heart rate measuring method, system and device based on end-to-end network
CN111784668A (en) Digestive endoscopy image automatic freezing method based on perceptual hash algorithm
CN114511502A (en) Gastrointestinal endoscope image polyp detection system based on artificial intelligence, terminal and storage medium
CN114707530A (en) Bimodal emotion recognition method and system based on multi-source signal and neural network
Xie et al. Digital tongue image analyses for health assessment
US8929629B1 (en) Method and system for image-based ulcer detection
CN117542103A (en) Non-contact heart rate detection method based on multi-scale space-time feature map
CN113421250A (en) Intelligent fundus disease diagnosis method based on lesion-free image training
Yang et al. Lesion classification of wireless capsule endoscopy images
CN116189902B (en) Myocardial ischemia prediction model based on magnetocardiogram video data and construction method thereof
CN113393449A (en) Endoscope video image automatic storage method based on artificial intelligence
CN115581435A (en) Sleep monitoring method and device based on multiple sensors
CN114842104A (en) Capsule endoscope image super-resolution reconstruction method based on multi-scale residual errors
CN110334582B (en) Method for intelligently identifying and recording polyp removing video of endoscopic submucosal dissection
Zhong et al. Denoising auto-encoder network combined classfication module for brain tumors detection
Chen et al. Camera-based peripheral edema measurement using machine learning
CN113255781A (en) Representative picture selecting method and device for CP-EBUS and diagnosis system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210914