CN113743384B - Stomach picture identification method and device - Google Patents

Stomach picture identification method and device

Info

Publication number
CN113743384B
CN113743384B (application CN202111303012.8A)
Authority
CN
China
Prior art keywords
sample
picture
stomach
video image
trained
Prior art date
Legal status
Active
Application number
CN202111303012.8A
Other languages
Chinese (zh)
Other versions
CN113743384A (en)
Inventor
吴家豪
李青原
方堉欣
王羽嗣
Current Assignee
Guangzhou Side Medical Technology Co ltd
Original Assignee
Guangzhou Side Medical Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Guangzhou Side Medical Technology Co ltd filed Critical Guangzhou Side Medical Technology Co ltd
Priority to CN202111303012.8A
Publication of CN113743384A
Application granted
Publication of CN113743384B
Current legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a stomach picture identification method and device. The method comprises the following steps: dividing stomach video data, according to the video frame sequence, into a plurality of video image sets with the same number of frames; and inputting the plurality of video image sets into a trained picture recognition model to obtain a stomach part recognition result corresponding to each frame of picture in the stomach video data, wherein the trained picture recognition model is constructed from a convolutional neural network, a Transformer network and a fully connected layer, and is obtained by training on sample video image sets labeled with stomach part category labels. By combining the convolutional neural network with the Transformer network, the method obtains temporal information about the picture features while extracting the features of the stomach pictures, so that the stomach picture category can be judged more accurately by combining local picture information with temporal information.

Description

Stomach picture identification method and device
Technical Field
The invention relates to the technical field of image recognition, in particular to a stomach image recognition method and device.
Background
In existing endoscopy, the stomach part at which a video image was captured needs to be determined from the video images shot by the endoscope.
Although techniques that analyze gastroscope pictures with artificial-intelligence image recognition already exist, most of them simply apply a convolutional neural network to extract features from and classify each picture, considering only the local feature information within a single picture when judging the part category to which the picture belongs, so the accuracy of the recognition result is low.
Therefore, a stomach picture recognition method and device are needed to solve the above problems.
Disclosure of Invention
To address the problems in the prior art, the invention provides a stomach picture identification method and device.
The invention provides a stomach picture identification method, which comprises the following steps:
dividing stomach video data, according to the video frame sequence, into a plurality of video image sets with the same number of frames; and
inputting the plurality of video image sets into a trained picture recognition model to obtain a stomach part recognition result corresponding to each frame of picture in the stomach video data, wherein the trained picture recognition model is constructed from a convolutional neural network, a Transformer network and a fully connected layer, and is obtained by training on sample video image sets labeled with stomach part category labels.
According to the stomach picture identification method provided by the invention, the trained picture recognition model is obtained through the following steps:
acquiring a plurality of sample video image sets with the same number of frames, labeling each frame of sample picture in each sample video image set with a corresponding first sample label, and constructing a training sample set, wherein the first sample label is a stomach part category label;
inputting the training sample set into a convolutional neural network for training, outputting a first picture feature for the sample pictures in each sample video image set, and obtaining a pre-trained convolutional neural network;
inputting the first picture features into a Transformer network for training according to the video frame order of the sample pictures in each sample video image set, outputting second picture features, and obtaining a pre-trained Transformer network;
inputting the second picture features into a fully connected layer for training, outputting a sample picture prediction result, and performing back propagation based on the error between the sample picture prediction result and the corresponding actual sample picture labeling result, so as to perform gradient optimization on the pre-trained convolutional neural network and the pre-trained Transformer network and obtain the trained picture recognition model.
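As an illustration only, the CNN, Transformer and fully connected pipeline described in these steps can be sketched as follows (a minimal PyTorch sketch under assumed shapes and hyperparameters; the tiny stand-in CNN replaces the ShuffleNet V2 backbone, and all names are hypothetical):

```python
import torch
import torch.nn as nn

class GastricPartClassifier(nn.Module):
    """Per-frame CNN features -> Transformer over the frame sequence -> FC head."""
    def __init__(self, feat_dim=64, num_classes=10):
        super().__init__()
        # Stand-in CNN backbone (the patent uses ShuffleNet V2 here).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, feat_dim),
        )
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(feat_dim, num_classes)   # one class-score set per frame

    def forward(self, clips):                     # clips: (B, T, 3, H, W)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1))     # (B*T, feat_dim): "first picture features"
        feats = feats.view(b, t, -1)              # restore the video frame order
        feats = self.temporal(feats)              # (B, T, feat_dim): "second picture features"
        return self.head(feats)                   # (B, T, num_classes)

clips = torch.randn(2, 10, 3, 64, 64)             # two ten-frame video image sets
logits = GastricPartClassifier()(clips)
print(logits.shape)                               # torch.Size([2, 10, 10])
```

Because the Transformer encoder mixes information across the time axis, each frame's logits depend on the neighbouring frames in its set, which is the property the method relies on.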
According to the stomach picture identification method provided by the invention, after the plurality of sample video image sets with the same number of frames are acquired and each frame of sample picture in each sample video image set is labeled with a corresponding first sample label, the method further comprises:
labeling with a corresponding second sample label each frame of sample picture in any sample video image set whose picture pixels or image resolution are lower than a preset threshold;
and constructing the training sample set from the sample video image sets labeled with the first sample label and the sample video image sets labeled with the second sample label.
According to the stomach picture identification method provided by the invention, the stomach part category labels comprise an esophageal dentate line part, a fundus cardia part, a fundus junction part, a lesser curvature part, a greater curvature part, a lower stomach part, a gastric angle part, an antrum-pylorus part, a duodenum part, and an outside-the-body category.
According to the stomach picture identification method provided by the invention, the convolutional neural network is a ShuffleNet V2 network.
According to the stomach picture identification method provided by the invention, after the plurality of sample video image sets with the same number of frames are acquired, the method further comprises:
performing image enhancement processing on the sample pictures in each sample video image set, and constructing the training sample set from the image-enhanced sample pictures.
The invention also provides a stomach picture recognition device, comprising:
the video image acquisition module is used for dividing the stomach video data into a plurality of video image sets with the same frame number according to the video frame sequence;
and the part recognition module is used for inputting the plurality of video image sets into a trained picture recognition model to obtain a stomach part recognition result corresponding to each frame of picture in the stomach video data, wherein the trained picture recognition model is constructed from a convolutional neural network, a Transformer network and a fully connected layer, and is obtained by training on sample video image sets labeled with stomach part category labels.
The invention also provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the steps of any one of the above stomach picture identification methods.
The invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method for stomach image recognition as described in any of the above.
The invention also provides a computer program product comprising a computer program which, when executed by a processor, performs the steps of the stomach picture identification method as described in any one of the above.
According to the stomach picture identification method and device provided by the invention, by combining the convolutional neural network with the Transformer network, temporal information about the picture features is obtained while the features of the stomach pictures are extracted, so that the stomach picture category can be judged more accurately by combining local picture information with temporal information.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or of the prior art, the drawings needed for describing the embodiments or the prior art are briefly introduced below; obviously, those skilled in the art can also obtain other drawings from these drawings without creative effort.
Fig. 1 is a schematic flow chart of a stomach image identification method provided by the present invention;
fig. 2 is a schematic structural diagram of a stomach image recognition device provided by the present invention;
fig. 3 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the original video images acquired by the invention are captured by a capsule gastroscope. The working process and characteristics of a capsule gastroscope mainly include: 1. the capsule gastroscope enters the alimentary canal through the oral cavity and is later discharged from the body naturally; 2. the capsule gastroscope has limited battery endurance, and its effective working range covers parts of the oral cavity, esophagus, stomach, duodenum, small intestine and large intestine; 3. each pass of the capsule gastroscope produces in-domain examination pictures and out-of-domain examination pictures, where an in-domain examination picture is taken within a given section of the alimentary canal and an out-of-domain examination picture is any other picture taken by the capsule gastroscope; 4. the number of original pictures each capsule gastroscope can shoot in one pass is typically 2000-3000, i.e. the number of pictures in the picture set acquired by the capsule gastroscope.
Fig. 1 is a schematic flow diagram of a stomach image recognition method provided by the present invention, and as shown in fig. 1, the present invention provides a stomach image recognition method, including:
step 101, according to the sequence of video frames, dividing the stomach video data into a plurality of video image sets with the same number of frames.
In the invention, stomach video data (stored as JPG-format frames) is acquired with a capsule gastroscope and may come from a hospital information system; the stomach video data is then divided, according to the video frame sequence, into a plurality of small video segments of equal frame length (for example, segments of ten consecutive frames), yielding a plurality of video image sets with the same number of frames.
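A minimal sketch of this equal-length splitting step (the ten-frame segment length is the example from the text; padding a short final segment by repeating its last frame is an assumption, since the patent does not specify remainder handling):

```python
def split_into_clips(frames, clip_len=10):
    """Divide an ordered list of video frames into consecutive sets of clip_len frames.

    A final remainder shorter than clip_len is padded by repeating the last
    frame, so every set ends up with the same number of frames (an assumed
    policy, not stated in the patent).
    """
    clips = []
    for start in range(0, len(frames), clip_len):
        clip = frames[start:start + clip_len]
        clip += [clip[-1]] * (clip_len - len(clip))  # pad the tail set if needed
        clips.append(clip)
    return clips

frames = [f"frame_{i:04d}.jpg" for i in range(23)]
clips = split_into_clips(frames)
print(len(clips), len(clips[-1]))  # 3 10
```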
Step 102, inputting the plurality of video image sets into a trained picture recognition model to obtain a stomach part recognition result corresponding to each frame of picture in the stomach video data, wherein the trained picture recognition model is constructed from a convolutional neural network, a Transformer network and a fully connected layer, and is obtained by training on sample video image sets labeled with stomach part category labels.
In the present invention, the plurality of video image sets obtained above are fed in random order into the trained picture recognition model for forward propagation. Specifically, the trained picture recognition model is constructed from a convolutional neural network, a Transformer network and a fully connected layer; when a video image set undergoes feature extraction by the convolutional neural network, since the pictures in the video image set follow the video frame sequence, the extracted picture features are also input into the Transformer network in that frame order, so that the recognition process can take the picture information of preceding and following frames into account; finally, the temporally ordered picture features output by the Transformer network are input into the classification fully connected layer to obtain a prediction result. The process is repeated until all pictures have been predicted, and all prediction results for the stomach video data are packaged and output.
According to the stomach picture identification method provided by the invention, by combining the convolutional neural network with the Transformer network, temporal information about the picture features is obtained while the features of the stomach pictures are extracted, so that the stomach picture category can be judged more accurately by combining local picture information with temporal information.
On the basis of the above embodiment, the trained image recognition model is obtained by the following steps:
step S1, obtaining a plurality of sample video image sets with the same frame number, marking a corresponding first sample label for each frame of sample image in each sample video image set, and constructing to obtain a training sample set, wherein the first sample label is a stomach part category label.
In the invention, when the sample video image sets are labeled, the category of the picture data can be labeled manually and the labeled pictures then reviewed: within each piece of sample video data, a labeled sample picture passes review only when the agreement among the categories assigned by the three annotators reaches 95%; otherwise it is sent back for re-labeling. Each piece of sample video data is then divided, according to the video frame sequence, into multiple small video segments of equal length (for example, ten consecutive sample pictures), constructing the training sample set. It should be noted that the order of labeling and of dividing the video data into segments is not specifically limited; in another embodiment, the picture data may be labeled after the sample video data has been divided into segments.
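The 95% consistency check among the three annotators could be implemented along these lines (a hedged sketch; the exact agreement metric and the function name are assumptions):

```python
def passes_review(labels_per_frame, threshold=0.95):
    """Check whether three annotators' labels for a video agree often enough.

    labels_per_frame: list of (label_a, label_b, label_c) tuples, one per frame.
    A frame counts as consistent when all three annotators chose the same
    category; the video passes review when the consistent fraction reaches
    the threshold (95% in the text). The per-frame agreement metric is an
    assumption, as the patent does not define how consistency is measured.
    """
    consistent = sum(1 for trio in labels_per_frame if len(set(trio)) == 1)
    return consistent / len(labels_per_frame) >= threshold

frames = [("antrum", "antrum", "antrum")] * 19 + [("antrum", "fundus", "antrum")]
print(passes_review(frames))  # 19/20 = 0.95 -> True
```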
Step S2, inputting the training sample set into the convolutional neural network for training, outputting a first picture feature for the sample pictures in each sample video image set, and obtaining a pre-trained convolutional neural network;
Step S3, inputting the first picture features into a Transformer network for training according to the video frame order of the sample pictures in each sample video image set, outputting second picture features, and obtaining a pre-trained Transformer network;
Step S4, inputting the second picture features into a fully connected layer for training, outputting a sample picture prediction result, and performing back propagation based on the error between the sample picture prediction result and the corresponding actual sample picture labeling result, so as to perform gradient optimization on the pre-trained convolutional neural network and the pre-trained Transformer network and obtain the trained picture recognition model.
In the invention, the plurality of sample video image sets are fed into the convolutional neural network in random order for forward propagation; after the convolutional neural network extracts features, the extracted picture features are input into the Transformer network for training, so that during subsequent testing and actual picture recognition the trained model can take the information of preceding and following frames into account; finally, the features output by the Transformer network during training are input into the classification fully connected layer to obtain prediction results. Further, a cross-entropy loss function is used to compute the loss between the actual label of each sample picture in a sample video image set and the network's prediction for that picture, the error between actual and predicted values is back-propagated through the network, and the network is gradient-optimized according to the loss function. Steps S2 to S4 are repeated until the loss drops to a preset value, completing network training and yielding the trained picture recognition model.
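The cross-entropy loss, back propagation and gradient optimization described here follow a standard supervised training step, which can be sketched as follows (a stand-in linear model replaces the full CNN+Transformer stack; all shapes and hyperparameters are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Stand-in for the CNN+Transformer+FC stack: any module mapping a clip of
# per-frame feature vectors to per-frame class logits works for this sketch.
model = nn.Linear(32, 10)                      # 32-dim features, 10 part categories
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()              # loss between labels and predictions

features = torch.randn(4, 10, 32)              # batch of 4 clips x 10 frames
labels = torch.randint(0, 10, (4, 10))         # per-frame stomach-part labels

logits = model(features)                       # forward propagation: (4, 10, 10)
loss = criterion(logits.flatten(0, 1), labels.flatten())  # per-frame cross-entropy
optimizer.zero_grad()
loss.backward()                                # back propagation of the error
optimizer.step()                               # gradient optimization step
print(float(loss) >= 0.0)                      # True
```

In practice this step would loop over the training set until the loss falls below the preset value mentioned in the text.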
In the invention, by introducing the Transformer, the temporal information of the sample pictures is added to network training, so that in subsequent model testing and practical application the category of the current frame is predicted by combining information from several preceding and following frames rather than by considering the current frame alone, and exploiting the temporal order of the pictures makes the network's predicted recognition results more accurate.
On the basis of the above embodiment, after acquiring the plurality of sample video image sets with the same number of frames and labeling each frame of sample picture in each sample video image set with a corresponding first sample label, the method further includes:
labeling with a corresponding second sample label each frame of sample picture in any sample video image set whose picture pixels or image resolution are lower than a preset threshold;
and constructing the training sample set from the sample video image sets labeled with the first sample label and the sample video image sets labeled with the second sample label.
In the invention, when labeling the category of the sample pictures, in order to handle the low pixel count and low clarity of capsule-gastroscope pictures (a preset threshold is set, and sample pictures below the threshold are judged to have low pixel count or low clarity), sample pictures in these conditions are treated as low-quality pictures and labeled with the corresponding second sample label; preferably, conditions such as occlusion by floating matter are also labeled as negative samples. In the invention, low pixel count, low clarity and occlusion are further subdivided into 7 fine categories: too-close adherence, severe contraction, bile reflux, air bubbles, overexposure or underexposure, floating matter, and turbid or blurry pictures; these labels are used as negative samples in training. During training and testing, such pictures can be grouped into one class in the training network, so that interfering pictures are effectively excluded and the subsequent judgment and recognition of stomach part pictures is improved.
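The threshold-based negative-sample labeling might look like the following sketch (the threshold values and the sharpness score are illustrative assumptions, not taken from the patent):

```python
NEGATIVE_CLASSES = [  # the 7 fine low-quality categories from the text
    "too_close", "severe_contraction", "bile_reflux", "bubbles",
    "over_or_under_exposed", "floaters", "turbid_blurry",
]

def second_sample_label(width, height, sharpness,
                        min_pixels=100_000, min_sharpness=0.3):
    """Flag a frame as a negative (low-quality) sample when its pixel count or
    clarity falls below preset thresholds. The concrete thresholds and the
    0..1 sharpness score are hypothetical; the patent only states that a
    preset threshold is used."""
    if width * height < min_pixels or sharpness < min_sharpness:
        return "low_quality"   # to be refined into one of NEGATIVE_CLASSES
    return None                # keep the first (stomach-part) label

print(second_sample_label(240, 240, 0.8))   # 57600 px < 100000 -> 'low_quality'
print(second_sample_label(480, 480, 0.8))   # None
```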
On the basis of the above embodiment, the stomach part category labels include an esophageal dentate line region, a fundus cardia region, a fundus junction region, a lesser curvature region, a greater curvature region, a lower stomach region, a gastric angle region, an antrum-pylorus region, a duodenum region, and an outside-the-body category.
In the invention, for the different parts of the stomach, in order to let the network learn finer picture features during training and better learn the differences between picture categories, the labels used to predict the stomach part category are divided into the above 10 classes when the sample pictures are labeled, improving the recognition precision of the model.
On the basis of the above embodiment, the convolutional neural network is a ShuffleNet V2 network.
In the invention, the lightweight ShuffleNet V2 network is used as the backbone of the classification network on the mobile terminal, and the trained picture recognition model is quantized, so that the test and recognition speed of the network algorithm on the mobile terminal meets real-time requirements.
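Post-training dynamic quantization is one common way to realize such a quantization step for mobile CPU inference (the patent does not name a specific quantization scheme, so this PyTorch example is an assumption):

```python
import torch
import torch.nn as nn

# A stand-in classifier head; the patented model uses ShuffleNet V2 as backbone.
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10)).eval()

# Convert the linear layers' weights to int8 after training, trading a little
# accuracy for smaller size and faster CPU inference.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 64)
print(quantized(x).shape)  # torch.Size([1, 10])
```

Static quantization or a mobile runtime export would be alternative routes to the same real-time goal.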
On the basis of the above embodiment, after the acquiring a plurality of sample video image sets with the same number of frames, the method further includes:
and carrying out image enhancement processing on the sample pictures in each sample video image set, and constructing a training sample set according to the sample pictures subjected to the image enhancement processing.
In the invention, during model training, TTA (test-time augmentation) applies several different picture transformations to the same sample picture, yielding more sample pictures for training. In the model-testing stage, the model is tested on the augmented sample pictures, and the predictions are fused by weighted averaging to serve as the network's prediction; combined with ensemble learning, the results obtained from several prediction networks are fused into the final prediction. The resulting predictions generalize better and are more accurate than a single network predicting a single picture.
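Fusing predictions over augmented views can be sketched as follows (uniform weights are an assumption, as the patent does not specify fusion weights; the model and augmentations here are toy stand-ins):

```python
import numpy as np

def tta_predict(predict_fn, image, augments):
    """Average class-probability predictions over several augmented views of
    the same picture. Equal weighting of the views is an assumed choice."""
    probs = [predict_fn(aug(image)) for aug in augments]
    return np.mean(probs, axis=0)

# Toy stand-ins: a "model" returning fixed probabilities, plus two augmentations
# (identity, and a reverse-and-negate transform) to exercise both branches.
predict_fn = lambda img: np.array([0.6, 0.4]) if img.sum() >= 0 else np.array([0.4, 0.6])
augments = [lambda img: img, lambda img: img[::-1] * -1]

fused = tta_predict(predict_fn, np.array([1.0, 2.0]), augments)
print(fused)  # [0.5 0.5]
```

The same averaging applies across an ensemble: replace the augmented views with the outputs of several trained networks on the same picture.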
Aiming at the problem of low recognition accuracy when the prior art performs part recognition on stomach pictures acquired by a capsule gastroscope, the invention provides a method combining a convolutional neural network and a Transformer network to better perform part recognition on stomach pictures. The method first extracts single-picture features from each stomach picture with the convolutional neural network, then inputs the picture features extracted from consecutive frames into the Transformer network, and obtains global features of the whole stomach by exploiting the temporal order within the video, thereby better judging the part category to which each picture belongs.
Fig. 2 is a schematic structural diagram of the stomach picture recognition device provided by the present invention. As shown in Fig. 2, the present invention provides a stomach picture recognition device comprising a video image acquisition module 201 and a part recognition module 202, wherein the video image acquisition module 201 is configured to divide stomach video data, according to the video frame sequence, into a plurality of video image sets with the same number of frames; the part recognition module 202 is configured to input the plurality of video image sets into a trained picture recognition model and obtain a stomach part recognition result corresponding to each frame of picture in the stomach video data, wherein the trained picture recognition model is constructed from a convolutional neural network, a Transformer network and a fully connected layer, and is obtained by training on sample video image sets labeled with stomach part category labels.
According to the stomach picture recognition device provided by the invention, by combining the convolutional neural network with the Transformer network, temporal information about the picture features is obtained while the features of the stomach pictures are extracted, so that the stomach picture category can be judged more accurately by combining local picture information with temporal information.
On the basis of the above embodiment, the apparatus further includes:
the training set construction module is used for acquiring a plurality of sample video image sets with the same number of frames, labeling each frame of sample picture in each sample video image set with a corresponding first sample label, and constructing a training sample set, wherein the first sample label is a stomach part category label;
the first training module is configured to input the training sample set into a convolutional neural network for training, output a first picture feature for the sample pictures in each sample video image set, and obtain a pre-trained convolutional neural network;
the second training module is used for inputting the first picture features into a Transformer network for training according to the video frame order of the sample pictures in each sample video image set, outputting second picture features, and obtaining a pre-trained Transformer network;
and the third training module is used for inputting the second picture features into a fully connected layer for training, outputting a sample picture prediction result, and performing back propagation based on the error between the sample picture prediction result and the corresponding actual sample picture labeling result, so as to perform gradient optimization on the pre-trained convolutional neural network and the pre-trained Transformer network and obtain the trained picture recognition model.
On the basis of the above embodiment, the training set constructing module further includes a negative sample generating unit and a sample set constructing unit, wherein the negative sample generating unit is configured to mark a corresponding second sample label for each frame of sample picture in a sample video image set in which picture pixels or image resolution is lower than a preset threshold; the sample set constructing unit is used for constructing a training sample set according to the sample video image set marked with the first sample label and the sample video image set marked with the second sample label.
On the basis of the above embodiment, the training set constructing module further includes: and the image enhancement module is used for carrying out image enhancement processing on the sample pictures in each sample video image set and constructing a training sample set according to the sample pictures after the image enhancement processing.
The apparatus provided by the present invention is used for executing the above method embodiments, and for details and flow, reference is made to the above embodiments, which are not described herein again.
Fig. 3 is a schematic structural diagram of an electronic device provided in the present invention, and as shown in fig. 3, the electronic device may include: a processor (processor) 301, a communication interface (communication interface) 302, a memory (memory) 303 and a communication bus 304, wherein the processor 301, the communication interface 302 and the memory 303 complete communication with each other through the communication bus 304. Processor 301 may invoke logic instructions in memory 303 to perform a gastric picture recognition method comprising: according to the sequence of video frames, dividing stomach video data into a plurality of video image sets with the same number of frames; and inputting a plurality of video image sets into a trained image recognition model to obtain a stomach part recognition result corresponding to each frame of image in the stomach video data, wherein the trained image recognition model is constructed by a convolutional neural network, a Transformer network and a full connection layer and is obtained by training a sample video image set marked with a stomach part category label.
In addition, the logic instructions in the memory 303 may be implemented in the form of software functional units and stored in a computer readable storage medium when the logic instructions are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-only memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
In another aspect, the present invention also provides a computer program product, including a computer program stored on a non-transitory computer-readable storage medium, the computer program including program instructions, which when executed by a computer, enable the computer to perform the stomach image recognition method provided by the above methods, the method including: according to the sequence of video frames, dividing stomach video data into a plurality of video image sets with the same number of frames; and inputting a plurality of video image sets into a trained image recognition model to obtain a stomach part recognition result corresponding to each frame of image in the stomach video data, wherein the trained image recognition model is constructed by a convolutional neural network, a Transformer network and a full connection layer and is obtained by training a sample video image set marked with a stomach part category label.
In yet another aspect, the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the stomach picture identification method provided in the foregoing embodiments, the method including: dividing stomach video data into a plurality of video image sets with the same number of frames according to the video frame order; and inputting the plurality of video image sets into a trained picture recognition model to obtain a stomach part recognition result corresponding to each frame of image in the stomach video data, wherein the trained picture recognition model is built from a convolutional neural network, a Transformer network, and a fully connected layer, and is obtained by training on sample video image sets labeled with stomach part category labels.
The above-described apparatus embodiments are merely illustrative. Units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units; they may be located in one place or distributed across a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment, which one of ordinary skill in the art can understand and implement without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform or, of course, by hardware. Based on this understanding, the above technical solutions may be embodied as a software product stored in a computer-readable storage medium such as ROM/RAM, a magnetic disk, or an optical disk, including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute the methods described in the embodiments or in parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (7)

1. A stomach picture identification method is characterized by comprising the following steps:
according to the sequence of video frames, dividing stomach video data into a plurality of video image sets with the same number of frames;
inputting the plurality of video image sets into a trained picture recognition model to obtain a stomach part recognition result corresponding to each frame of image in the stomach video data, wherein the trained picture recognition model is built from a convolutional neural network, a Transformer network, and a fully connected layer, and is obtained by training on sample video image sets labeled with stomach part category labels;
the trained picture recognition model is obtained through the following steps:
acquiring a plurality of sample video image sets with the same number of frames, labeling each frame of sample picture in each sample video image set with a corresponding first sample label, and constructing a training sample set, wherein the first sample label is a stomach part category label;
inputting the training sample set into a convolutional neural network for training, outputting a first picture feature for each sample picture in each sample video image set, and obtaining a pre-trained convolutional neural network;
inputting the first picture features into a Transformer network for training in the video frame order of the sample pictures in each sample video image set, outputting second picture features, and obtaining a pre-trained Transformer network;
inputting the second picture features into a fully connected layer for training, outputting a sample picture prediction result, and performing back propagation based on the error between the sample picture prediction result and the corresponding actual sample picture labeling result, so as to gradient-optimize the pre-trained convolutional neural network and the pre-trained Transformer network and obtain a trained picture recognition model;
after the acquiring of the plurality of sample video image sets with the same number of frames and the labeling of each frame of sample picture in each sample video image set with a corresponding first sample label, the method further includes:
labeling each frame of sample picture in any sample video image set whose picture pixel count or image resolution is below a preset threshold with a corresponding second sample label;
labeling any occluded area in each frame of sample picture in the sample video image set with a corresponding second sample label;
and constructing the training sample set from the sample video image sets labeled with the first sample label and the sample video image sets labeled with the second sample label.
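The pipeline of claim 1 (per-frame CNN features, a Transformer over the frame sequence, and a fully connected head producing a per-frame prediction, trained with back propagation) might be sketched in PyTorch as follows. The tiny convolutional stem, layer sizes, and class count are placeholders, not the patented implementation; claim 3 names ShuffleNet V2 as the actual convolutional backbone.

```python
import torch
import torch.nn as nn

class GastricSiteRecognizer(nn.Module):
    """CNN -> Transformer -> fully connected head, per claim 1 (toy sizes)."""

    def __init__(self, num_classes=10, feat_dim=64):
        super().__init__()
        # Stand-in per-frame CNN; claim 3 names ShuffleNet V2 as the real backbone.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, feat_dim),
        )
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.head = nn.Linear(feat_dim, num_classes)  # one prediction per frame

    def forward(self, clips):            # clips: (batch, frames, 3, H, W)
        b, t = clips.shape[:2]
        first = self.cnn(clips.flatten(0, 1)).view(b, t, -1)  # first picture features
        second = self.transformer(first)                      # second picture features
        return self.head(second)                              # (batch, frames, classes)

model = GastricSiteRecognizer()
logits = model(torch.randn(2, 8, 3, 32, 32))  # 2 video image sets of 8 frames each
```

Training would flatten the per-frame logits, compare them against the per-frame part labels with a cross-entropy loss, and back-propagate the error through both the Transformer and the CNN, matching the joint gradient optimization that claim 1 describes.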
2. The method of claim 1, wherein the stomach part category labels include an esophageal dentate line region, a fundus cardia region, a fundus junction region, a lesser curvature region, a greater curvature region, a lower stomach region, a corner region, an antrum pylorus region, a duodenum region, and an external region.
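For training with a cross-entropy loss, the ten category labels of claim 2 would typically be mapped to integer class indices; the particular ordering below is hypothetical and not specified by the patent.

```python
# Ten stomach-part categories from claim 2, mapped to class indices.
# The index assignment itself is illustrative, not fixed by the patent.
STOMACH_REGIONS = [
    "esophageal dentate line",
    "fundus cardia",
    "fundus junction",
    "lesser curvature",
    "greater curvature",
    "lower stomach",
    "corner",
    "antrum pylorus",
    "duodenum",
    "external",
]
REGION_TO_INDEX = {name: i for i, name in enumerate(STOMACH_REGIONS)}
```

With such a mapping, the fully connected head of the model outputs one score per index, and the "external" class absorbs frames showing anything outside the stomach.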
3. The stomach picture identification method according to claim 1, wherein the convolutional neural network is a ShuffleNet V2 network.
4. The stomach picture identification method of claim 1, wherein after the acquiring of the plurality of sample video image sets with the same number of frames, the method further comprises:
performing image enhancement processing on the sample pictures in each sample video image set, and constructing the training sample set from the sample pictures after image enhancement processing.
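The patent states only that image enhancement is applied before building the training sample set; a toy pass with two common enhancement transforms (random horizontal flip and brightness jitter, both assumptions) could look like this, here operating on a nested list of pixel intensities for self-containment.

```python
import random

def augment(image, rng=None):
    """Toy image-enhancement pass: random horizontal flip plus brightness jitter.

    `image` is a nested list of pixel intensities (rows of ints in 0-255).
    The specific transforms are illustrative; any standard augmentation
    library would serve the same purpose on real endoscope frames.
    """
    rng = rng or random.Random()
    if rng.random() < 0.5:
        image = [row[::-1] for row in image]          # horizontal flip
    gain = rng.uniform(0.8, 1.2)                      # brightness jitter
    return [[min(255, max(0, int(p * gain))) for p in row] for row in image]

augmented = augment([[10, 200], [30, 255]])
```

Each training epoch would apply such a pass independently per sample picture, enlarging the effective training sample set without new annotations.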
5. A stomach picture recognition device, comprising:
the video image acquisition module, configured to divide stomach video data into a plurality of video image sets with the same number of frames according to the video frame order;
the part recognition module, configured to input the plurality of video image sets into a trained picture recognition model to obtain a stomach part recognition result corresponding to each frame of picture in the stomach video data, wherein the trained picture recognition model is built from a convolutional neural network, a Transformer network, and a fully connected layer, and is obtained by training on sample video image sets labeled with stomach part category labels;
the trained picture recognition model is obtained through the following steps:
acquiring a plurality of sample video image sets with the same number of frames, labeling each frame of sample picture in each sample video image set with a corresponding first sample label, and constructing a training sample set, wherein the first sample label is a stomach part category label;
inputting the training sample set into a convolutional neural network for training, outputting a first picture feature for each sample picture in each sample video image set, and obtaining a pre-trained convolutional neural network;
inputting the first picture features into a Transformer network for training in the video frame order of the sample pictures in each sample video image set, outputting second picture features, and obtaining a pre-trained Transformer network;
inputting the second picture features into a fully connected layer for training, outputting a sample picture prediction result, and performing back propagation based on the error between the sample picture prediction result and the corresponding actual sample picture labeling result, so as to gradient-optimize the pre-trained convolutional neural network and the pre-trained Transformer network and obtain a trained picture recognition model;
after the acquiring of the plurality of sample video image sets with the same number of frames and the labeling of each frame of sample picture in each sample video image set with a corresponding first sample label, the method further includes:
labeling each frame of sample picture in any sample video image set whose picture pixel count or image resolution is below a preset threshold with a corresponding second sample label;
labeling any occluded area in each frame of sample picture in the sample video image set with a corresponding second sample label;
and constructing the training sample set from the sample video image sets labeled with the first sample label and the sample video image sets labeled with the second sample label.
6. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the stomach picture identification method according to any one of claims 1 to 4.
7. A non-transitory computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, carries out the steps of the stomach picture identification method according to any one of claims 1 to 4.
CN202111303012.8A 2021-11-05 2021-11-05 Stomach picture identification method and device Active CN113743384B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111303012.8A CN113743384B (en) 2021-11-05 2021-11-05 Stomach picture identification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111303012.8A CN113743384B (en) 2021-11-05 2021-11-05 Stomach picture identification method and device

Publications (2)

Publication Number Publication Date
CN113743384A CN113743384A (en) 2021-12-03
CN113743384B (en) 2022-04-05

Family

ID=78727396

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111303012.8A Active CN113743384B (en) 2021-11-05 2021-11-05 Stomach picture identification method and device

Country Status (1)

Country Link
CN (1) CN113743384B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114399465B (en) * 2021-12-08 2022-11-25 紫东信息科技(苏州)有限公司 Benign and malignant ulcer identification method and system
CN114283192A (en) * 2021-12-10 2022-04-05 厦门影诺医疗科技有限公司 Gastroscopy blind area monitoring method, system and application based on scene recognition
CN114463782A (en) * 2022-01-19 2022-05-10 佳都科技集团股份有限公司 Palm vein identification method and device based on hybrid network, electronic equipment and medium
CN114677536B (en) * 2022-03-02 2022-12-16 北京医准智能科技有限公司 Pre-training method and device based on Transformer structure
CN114494247B (en) * 2022-04-01 2022-06-21 武汉大学 Jaggy line segmentation method, jaggy line segmentation device, computer device, and storage medium
CN116596927B (en) * 2023-07-17 2023-09-26 浙江核睿医疗科技有限公司 Endoscope video processing method, system and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109102491A (en) * 2018-06-28 2018-12-28 武汉大学人民医院(湖北省人民医院) A kind of gastroscope image automated collection systems and method
CN113034500A (en) * 2021-05-25 2021-06-25 紫东信息科技(苏州)有限公司 Digestive tract endoscope picture focus identification system based on multi-channel structure
CN113095370A (en) * 2021-03-18 2021-07-09 北京达佳互联信息技术有限公司 Image recognition method and device, electronic equipment and storage medium
CN113177940A (en) * 2021-05-26 2021-07-27 复旦大学附属中山医院 Gastroscope video part identification network structure based on Transformer
CN113297936A (en) * 2021-05-17 2021-08-24 北京工业大学 Volleyball group behavior identification method based on local graph convolution network

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113096131A (en) * 2021-06-09 2021-07-09 紫东信息科技(苏州)有限公司 Gastroscope picture multi-label classification system based on VIT network
CN113486990B (en) * 2021-09-06 2021-12-21 北京字节跳动网络技术有限公司 Training method of endoscope image classification model, image classification method and device


Also Published As

Publication number Publication date
CN113743384A (en) 2021-12-03

Similar Documents

Publication Publication Date Title
CN113743384B (en) Stomach picture identification method and device
US10860930B2 (en) Learning method, image recognition device, and computer-readable storage medium
EP3553742B1 (en) Method and device for identifying pathological picture
CN109919928B (en) Medical image detection method and device and storage medium
KR102117860B1 (en) Computer program and theminal for providing individual animal information based on the facial and nose pattern imanges of the animal
CN110288597B (en) Attention mechanism-based wireless capsule endoscope video saliency detection method
US20230085605A1 (en) Face image processing method, apparatus, device, and storage medium
KR102487825B1 (en) Computer program and theminal for providing individual animal information based on the facial and nose pattern imanges of the animal
CN111415358B (en) Image segmentation method, device, electronic equipment and storage medium
CN111986785B (en) Medical image labeling method, device, equipment and storage medium
CN110674759A (en) Monocular face in-vivo detection method, device and equipment based on depth map
CN113762422B (en) Image training set composition method and system
CN112528782A (en) Underwater fish target detection method and device
CN114140844A (en) Face silence living body detection method and device, electronic equipment and storage medium
JP2019095980A (en) Classification device, method for classification, program, and information recording medium
CN112862746B (en) Tissue lesion identification method and system based on artificial neural network
CN116596927B (en) Endoscope video processing method, system and device
CN110097080B (en) Construction method and device of classification label
CN116310637A (en) Cervical fluid-based cell image data set generation, model training and image recognition system
CN115131361A (en) Training of target segmentation model, focus segmentation method and device
CN114557670A (en) Physiological age prediction method, apparatus, device and medium
CN114581402A (en) Capsule endoscope quality inspection method, device and storage medium
CN114359965A (en) Training method and training device
CN110974121B (en) Method and system for judging whether digestive endoscopy is stained or not
JP7019104B2 (en) Threshold learning method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant