CN109523535B - Pretreatment method of lesion image - Google Patents

Pretreatment method of lesion image

Info

Publication number
CN109523535B
CN109523535B (application CN201811361336.5A; also published as CN109523535A)
Authority
CN
China
Prior art keywords
lesion
image
frame selection
doctors
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811361336.5A
Other languages
Chinese (zh)
Other versions
CN109523535A (en)
Inventor
张澍田
朱圣韬
闵力
陈蕾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Friendship Hospital
Original Assignee
Beijing Friendship Hospital
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Friendship Hospital filed Critical Beijing Friendship Hospital
Priority to CN201811361336.5A priority Critical patent/CN109523535B/en
Publication of CN109523535A publication Critical patent/CN109523535A/en
Application granted granted Critical
Publication of CN109523535B publication Critical patent/CN109523535B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/0002Inspection of images, e.g. flaw detection
    • G06T7/0012Biomedical image inspection
    • G06T7/0014Biomedical image inspection using an image reference approach
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/60Analysis of geometric attributes
    • G06T7/62Analysis of geometric attributes of area, perimeter, diameter or volume
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • G06T7/74Determining position or orientation of objects or cameras using feature-based methods involving reference images or patches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/77Determining position or orientation of objects or cameras using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10068Endoscopic image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30004Biomedical image processing
    • G06T2207/30092Stomach; Gastric
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30004Biomedical image processing
    • G06T2207/30096Tumor; Lesion
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The invention relates to a preprocessing method for lesion images in a medical-image artificial intelligence auxiliary identification system, which comprises the step of accurately framing the lesion site in each picture under strict quality control.

Description

Pretreatment method of lesion image
Technical Field
The invention belongs to the field of medicine, and in particular relates to a preprocessing method for lesion images in an artificial intelligence auxiliary identification system for biomedical images.
Background
The application of artificial intelligence techniques in biomedical image processing is increasingly widespread, but existing research and pilot applications are limited to classifying rather than identifying lesions, i.e. to distinguishing among two or more artificially predetermined lesion categories rather than detecting lesions against a broad normal background during real-time examination. Examples include the classification of colonic polyps ("Computer-based classification of small colorectal polyps by using narrow-band imaging with optical magnification", Sebastian Gross et al., Gastrointestinal Endoscopy, 74(6), 2011), the classification of skin cancers ("Dermatologist-level classification of skin cancer with deep neural networks", Andre Esteva et al., Nature, 542, 2017), the classification of breast tumors ("Representation learning for mammography mass lesion classification with convolutional neural networks", John Arevalo et al., Computer Methods and Programs in Biomedicine, 127, 248-257, 2016), and the identification of benign and malignant lung nodules ("Deep learning aided decision support for pulmonary nodule diagnosing: a review", Yixin Yang et al., Journal of Thoracic Disease, 2018 (Suppl 7): S867-S875). This is far from meeting the requirements of clinical application.
Therefore, there is an urgent need in the art to develop an artificial intelligence image recognition system that can recognize lesion sites, even in real time, and that is truly suitable for clinical diagnosis.
Disclosure of Invention
The inventors have discovered that strictly preprocessing the images used for neural network learning, namely accurately framing the lesion site in each endoscopic image of the training database, feeding the framed lesions into the neural network as positive samples, and using the regions outside the rectangular frames as negative samples for neural network learning, enables the trained neural network model to accurately identify both the lesion type and the lesion site, with a recognition rate that even exceeds that of physicians.
In a first aspect of the present invention, there is provided a method of preprocessing a lesion image, comprising the step of framing a lesion region in the lesion image, the inside of the framing being defined as a positive sample and the outside being defined as a negative sample.
In one embodiment, the image is a medical diagnostic image, such as an endoscopic image.
In another embodiment, the lesion is preferably a digestive tract disease, such as gastric cancer or chronic atrophic gastritis.
In another embodiment, the image is used for training a lesion image recognition model based on a neural network (e.g. a convolutional neural network); the model is trained to recognize whether a lesion is present in the image to be analyzed and/or where the lesion is located (automatic recognition).
In another embodiment, the frame selection can generate a rectangular frame or a square frame containing a lesion, and simultaneously generate and/or record coordinate information; preferably, the coordinate information is coordinate information of points at upper left and lower right corners of the rectangular or square frame.
In another embodiment, the framed site (i.e. the lesion in the image) is determined by the following method: 2n endoscopists perform frame selection in a back-to-back manner, i.e. the 2n physicians are randomly divided into n groups of 2, and all images are simultaneously randomly divided into n portions, which are randomly assigned to the groups for frame selection; when the framing is complete, the framing results of the two physicians in each group are compared and their consistency is evaluated, leading to the determination of the framed site, wherein n is a natural number between 1 and 100, such as 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90 or 100.
In another embodiment, the method of assessing the consistency of the framing results between two physicians is as follows:
for each lesion picture, the overlapping regions of the two physicians' frame selections are compared; if the area of the overlapping part (i.e. the intersection) is greater than 50% of the area covered by the union of the two selections, the framing judgments of the two physicians are considered to be in good agreement, and the diagonal coordinates corresponding to the intersection, namely the coordinates of its upper-left and lower-right corner points, are stored as the final localization of the target lesion;
if the area of the overlapping part (i.e. the intersection) is smaller than 50% of the area covered by the union, the framing judgments of the two physicians are considered to differ substantially; such lesion pictures are set aside separately, and all 2n physicians participating in the framing work jointly discuss and determine the final position of the target lesion.
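For illustration only, the following Python sketch (not part of the patented software; all names are hypothetical) applies the 50% intersection-over-union criterion described above to two physicians' rectangular frames, each given by its upper-left and lower-right corner coordinates:

```python
def box_overlap_consistent(box_a, box_b, threshold=0.5):
    """Check whether two framed regions agree under the 50% rule.

    Each box is (x1, y1, x2, y2): upper-left and lower-right corners in pixels.
    Returns (consistent, intersection_box); intersection_box holds the diagonal
    coordinates to store as the final lesion localization, or None.
    """
    # Intersection rectangle of the two frame selections
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter_area = max(0, ix2 - ix1) * max(0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union_area = area_a + area_b - inter_area

    if union_area > 0 and inter_area / union_area > threshold:
        return True, (ix1, iy1, ix2, iy2)
    # Otherwise the picture is set aside for joint discussion by all 2n doctors
    return False, None


# Example: two doctors frame roughly the same lesion
ok, final_box = box_overlap_consistent((100, 80, 300, 260), (120, 90, 320, 270))
print(ok, final_box)   # True (120, 90, 300, 260)
```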
In a second aspect of the present invention, there is provided a method of training a neural network, comprising the steps of the method of the first aspect, preferably the neural network is a convolutional neural network.
In a third aspect of the invention, there is provided a training method for a lesion image recognition model based on a neural network (e.g. convolutional neural network), comprising the steps of the method of the first aspect.
Drawings
FIG. 1 shows a frame selection screenshot
Detailed description of the preferred embodiments
Unless otherwise indicated, terms used in this disclosure have the ordinary meaning as understood by one of ordinary skill in the art. The following are meanings of some terms in the present disclosure, and if there is an inconsistency with other definitions, the following definitions shall apply.
The method of image preprocessing of the present disclosure can be used in a gastric cancer image recognition system or an auxiliary diagnosis system (apparatus) of gastric cancer, the system (apparatus) including the following modules:
a. a data input module for inputting an image containing a stomach cancer focus, the image preferably being an endoscopic image;
b. the data preprocessing module is used for receiving the image from the data input module, accurately framing the focus part of the gastric cancer, defining the part in the frame as a positive sample, defining the part outside the frame as a negative sample, and outputting the position coordinate information and focus type information of the focus; preferably, before the frame selection, the module also pre-desensitizes the image to remove personal information of patients;
preferably, the frame selection can generate a rectangular frame or a square frame containing a focus part; the coordinate information is preferably coordinate information of points at the upper left corner and the lower right corner of the rectangular frame or the square frame;
Also preferably, the framing location is determined by the following method: 2n endoscopists perform frame selection in a back-to-back mode, namely, 2n persons are randomly divided into n groups, 2 persons/group, all images are randomly divided into n parts at the same time, and the n parts are randomly distributed to each group of doctors for frame selection; when the framing is completed, comparing the framing results of each group of two physicians and evaluating the consistency of the framing results between the two physicians to finally determine the framing position, wherein n is a natural number between 1 and 100, such as 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90 or 100;
further preferably, the criteria for evaluating the consistency of the framing results between two physicians are as follows:
comparing the overlapping areas of the frame selection results of each group of 2 doctors aiming at each lesion picture, if the area (i.e. intersection) of the overlapping parts of the frame selection parts of each group of two doctors is greater than 50% of the area covered by the union of the two, considering that the frame selection judgment results of the 2 doctors are good in consistency, and storing diagonal coordinates corresponding to the intersection as final positioning of the target lesion;
if the area of the overlapped part (i.e. intersection) is smaller than 50% of the area covered by the union of the two, the frame selection judgment result of 2 doctors is considered to be quite different, the pathological change pictures are selected independently, and all 2n doctors participating in the frame selection work discuss and determine the final position of the target pathological change together;
c. The image recognition model construction module can receive the image processed by the data preprocessing module and is used for constructing and training an image recognition model based on a neural network, and the neural network is preferably a convolutional neural network;
d. and the lesion recognition module is used for inputting the image to be detected into the trained image recognition model and judging whether the focus and/or the position of the focus exist in the image to be detected or not based on the output result of the image recognition model.
In one embodiment, the image recognition model construction module includes a feature extractor, a candidate region generator, and a target recognizer, wherein:
the feature extractor is used for extracting features of the image from the data preprocessing module so as to obtain a feature map, and preferably, the feature extraction is performed through convolution operation;
the candidate region generator is used for generating a plurality of candidate regions based on the feature map;
the target identifier calculates a classification score for each candidate region, the score indicating the probability that the region belongs to the positive and/or negative sample; at the same time, the target identifier provides an adjustment value for the frame position of each region, so that the frame position is adjusted and the lesion position accurately determined; preferably, a loss function is used in training the classification score and the adjustment value;
It is also preferred that a mini-batch based gradient descent method is used during training, i.e. one mini-batch containing multiple positive and negative candidate regions is generated for each training picture; 256 candidate regions are then randomly sampled from each picture so that the ratio of positive to negative candidate regions approaches 1:1, and the loss function of the corresponding mini-batch is calculated. If the number of positive candidate regions in a picture is less than 128, the mini-batch is filled with negative candidate regions;
further preferably, the learning rate of the first 50000 mini-batches is set to 0.001 and the learning rate of the next 50000 mini-batches is set to 0.0001; the momentum term is preferably set to 0.9 and the weight decay is preferably set to 0.0005.
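As a minimal illustration of the sampling rule and learning-rate schedule above (the helper names are the editor's own, and it is assumed that enough negative candidate regions are always available):

```python
import random

def sample_minibatch(positives, negatives, batch_size=256):
    """Sample candidate regions for one training picture at roughly 1:1.

    positives / negatives: candidate regions already labelled against the
    framed lesion. If fewer than batch_size // 2 positives exist, the
    mini-batch is filled with negatives (assumes enough negatives exist).
    """
    n_pos = min(len(positives), batch_size // 2)   # at most 128 positives
    n_neg = batch_size - n_pos                     # pad the rest with negatives
    batch = random.sample(positives, n_pos) + random.sample(negatives, n_neg)
    random.shuffle(batch)
    return batch

def learning_rate(minibatch_index):
    """Piecewise-constant schedule: 0.001 for the first 50000 mini-batches,
    0.0001 afterwards; momentum 0.9 and weight decay 0.0005 would be set in
    the optimizer itself."""
    return 0.001 if minibatch_index < 50000 else 0.0001
```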
In another embodiment, the feature extractor can perform feature extraction on an input image with any size and/or resolution, where the image may be an original size and/or resolution, or an image input after the size and/or resolution is changed, so as to obtain a multi-dimensional (e.g. 256-dimensional or 512-dimensional) feature map;
in a specific embodiment, the feature extractor comprises X convolutional layers and Y sample layers, wherein the ith (i between 1-X) convolutional layer comprises Q i The size is m p i Wherein m represents the long and wide pixel values of the convolution kernel, p i The number of convolution kernels Q equal to the last convolution layer i-1 In the ith convolution layer, the convolution kernel convolves data from the previous stage (e.g., the original, the (i-1) th convolution layer, or the sampling layer) by a step L; each sampling layer comprises 1 convolution kernel which moves by a step length of 2L and has a size of 2L x 2L, and convolution operation is carried out on an image input by the convolution layer; after feature extraction is performed by a feature extractor, a Qx-dimensional feature map is finally obtained;
wherein X is between 1 and 20, e.g., 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, or 20; y is between 1 and 10, for example 1, 2, 3, 4, 5, 6, 7, 8, 9 or 10; m is between 2 and 10, for example 2, 3, 4, 5, 6, 7, 8, 9 or 10; p is between 1 and 1024, Q is between 1 and 1024, and p or Q have values such as 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 32, 64, 128, 256, 512 or 1024, respectively.
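A minimal runnable sketch of such a feature extractor, assuming PyTorch and illustrative values X = 5, Y = 2, m = 3, L = 1 with arbitrarily chosen channel counts Q_i (the ReLU activations are an added assumption, not specified in the text above):

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """X = 5 convolutional layers and Y = 2 sampling (pooling) layers."""
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1),    # Q_1 = 64 kernels, m = 3, L = 1
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),                   # sampling layer: 2L x 2L window, step 2L
            nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1),  # Q_2 = 128, p_2 = Q_1
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 256, kernel_size=3, stride=1, padding=1), # Q_3 = 256
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),                   # sampling layer
            nn.Conv2d(256, 512, kernel_size=3, stride=1, padding=1), # Q_4 = 512
            nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, kernel_size=3, stride=1, padding=1), # Q_5 = Q_X = 512
            nn.ReLU(inplace=True),
        )

    def forward(self, image):
        # Accepts an input of any size and returns a Q_X-channel feature map
        return self.layers(image)

feature_map = FeatureExtractor()(torch.randn(1, 3, 600, 800))
print(feature_map.shape)   # torch.Size([1, 512, 150, 200])
```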
In another embodiment, the candidate region generator sets a sliding window in the feature map, the sliding window having a size of n × n, e.g. 3 × 3; the sliding window is slid along the feature map, and for each position of the sliding window there is a correspondence between the centre point of the window and a position in the original image; k candidate regions with different scales and aspect ratios are generated in the original image centred on that corresponding position, where k = x² (e.g. k = 9) when the candidate regions have x (e.g. 3) different scales and x different aspect ratios;
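The anchor construction can be sketched as follows (the scale and aspect-ratio values are illustrative only, not taken from the patent); 3 scales and 3 aspect ratios give k = 9 candidate regions per sliding-window position:

```python
import itertools

def generate_anchors(center_x, center_y, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Generate k = len(scales) * len(ratios) candidate boxes centred on the
    original-image point corresponding to one sliding-window position."""
    anchors = []
    for scale, ratio in itertools.product(scales, ratios):
        w = scale * ratio ** 0.5      # keep the area close to scale**2
        h = scale / ratio ** 0.5      # while varying the aspect ratio w/h
        anchors.append((center_x - w / 2, center_y - h / 2,
                        center_x + w / 2, center_y + h / 2))
    return anchors

print(len(generate_anchors(300, 200)))   # 9
```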
in another embodiment, the object identifier includes an intermediate layer, a classification layer, and a bounding box regression layer, wherein the intermediate layer is used to map the data of the candidate region formed by the sliding window operation to a multi-dimensional (e.g., 256-dimensional or 512-dimensional) vector;
the classification layer and the frame regression layer are respectively connected with the middle layer, the classification layer is used for judging whether the target candidate region is a foreground (namely a positive sample) or a background (namely a negative sample), and the frame regression layer is used for generating an x coordinate and a y coordinate of a center point of the candidate region and a width w and a height h of the candidate region.
The method of the invention can also be used in a gastric cancer image processing and/or identifying method, which comprises the following steps:
s1, acquiring lesion data
For obtaining an image, preferably an endoscopic image, of a lesion-containing region of a patient diagnosed with gastric cancer;
s2, image preprocessing
Accurately framing a focus part of gastric cancer, defining a part in the framing as a positive sample, defining a part out of the framing as a negative sample, and outputting position coordinate information and focus type information of a focus; the frame selection is preferably performed by frame selection software; preferably, before the frame selection, the module also pre-desensitizes the image to remove personal information of patients;
Preferably, the frame selection can generate a rectangular frame or a square frame containing a focus part; the coordinate information is preferably coordinate information of points at the upper left corner and the lower right corner of the rectangular frame or the square frame;
also preferably, the specific operations of the block selection are as follows: 2n endoscopists perform frame selection in a back-to-back mode, namely, 2n persons are randomly divided into n groups, 2 persons/group, all images are randomly divided into n parts at the same time, and the n parts are randomly distributed to each group of doctors for frame selection; when the framing is completed, comparing the framing results of each group of two physicians and evaluating the consistency of the framing results between the two physicians to finally determine the framing position, wherein n is a natural number between 1 and 100, such as 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90 or 100;
further preferably, the criteria for evaluating the consistency of the framing results between two physicians are as follows:
comparing the overlapping areas of the frame selection results of each group of 2 doctors aiming at each lesion picture, if the area (i.e. intersection) of the overlapping parts of the frame selection parts of each group of two doctors is greater than 50% of the area covered by the union of the two, considering that the frame selection judgment results of the 2 doctors are good in consistency, and storing diagonal coordinates corresponding to the intersection as final positioning of the target lesion;
If the area of the overlapped part (i.e. intersection) is smaller than 50% of the area covered by the union of the two, the frame selection judgment result of 2 doctors is considered to be quite different, the pathological change pictures are selected independently, and all 2n doctors participating in the frame selection work discuss and determine the final position of the target pathological change together;
s3, training an image recognition model
Performing supervised training on the trainable image recognition model by using the image obtained in the step S2, and adjusting trainable parameters through a preset algorithm so as to obtain the image recognition model capable of recognizing gastric cancer;
preferably, the training is performed using a mini-batch based gradient descent method, i.e. a mini-batch containing multiple positive and negative candidate regions is generated for each training picture; 256 candidate regions are then randomly sampled from each picture so that the ratio of positive to negative candidate regions approaches 1:1, and the loss function of the corresponding mini-batch is calculated. If the number of positive candidate regions in a picture is less than 128, the mini-batch is filled with negative candidate regions;
further preferably, the learning rate of the first 50000 mini-batches is set to 0.001 and the learning rate of the next 50000 mini-batches is set to 0.0001; the momentum term is preferably set to 0.9 and the weight decay is preferably set to 0.0005.
S4, pathological image identification
Utilizing the image recognition program obtained by training in the step S3 to recognize the pathological image to be analyzed, and calculating the type and probability of the focus in the image; the pathological image to be analyzed can be an endoscopic photo or a real-time image;
preferably, the classification score threshold is set to 0.85, i.e. any region to which the deep learning network assigns a lesion probability of more than 85% is marked, and the picture is then judged positive; conversely, if no suspicious lesion region is detected in a picture, the picture is judged negative.
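A sketch of this decision rule (the detection dictionary format shown is hypothetical):

```python
def judge_picture(detections, score_threshold=0.85):
    """Mark detections whose lesion probability exceeds the threshold and judge
    the whole picture positive if at least one such region exists.

    detections: e.g. [{"box": (x1, y1, x2, y2), "label": "gastric cancer", "score": 0.93}]
    """
    suspicious = [d for d in detections if d["score"] > score_threshold]
    return len(suspicious) > 0, suspicious

is_positive, marked = judge_picture([{"box": (40, 60, 180, 200),
                                      "label": "gastric cancer", "score": 0.91}])
print(is_positive)   # True: one region exceeds the 0.85 classification score
```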
In one embodiment, wherein the image recognition model in step S3 comprises:
a) Feature extractor
The feature extractor consists of X convolutional layers and Y sampling layers, wherein the i-th convolutional layer (i between 1 and X) contains Q_i convolution kernels of size m × m × p_i, where m is the height and width of the kernel in pixels and p_i equals the number of kernels Q_(i-1) of the preceding convolutional layer; in the i-th convolutional layer, the kernels convolve the data from the previous stage (the original image, the (i-1)-th convolutional layer, or a sampling layer) with a step L; each sampling layer comprises 1 convolution kernel of size 2L × 2L that moves with a step of 2L and performs a convolution operation on the image output by the preceding convolutional layer; after feature extraction by the feature extractor, a Q_X-dimensional feature map is finally obtained;
Wherein X is between 1 and 20, Y is between 1 and 10, m is between 2 and 10, p is between 1 and 1024, and Q is between 1 and 1024;
b) Candidate region selector
A sliding window of size n × n (e.g. 3 × 3) is set in the feature map extracted by the original-image feature extraction layer; the sliding window is slid along the feature map, and for each position of the sliding window there is a correspondence between the centre point of the window and a position in the original image; k candidate regions with different scales and aspect ratios are generated in the original image centred on that corresponding position, where k = x² (e.g. k = 9) when the candidate regions have x (e.g. 3) different scales and x different aspect ratios;
c) Target identifier
The target identifier comprises a middle layer, a classification layer and a frame regression layer, wherein the middle layer maps the data of the candidate regions formed by the sliding-window operation to a multi-dimensional (for example 256-dimensional or 512-dimensional) vector;
the classification layer and the frame regression layer are respectively connected with the middle layer and are used for judging whether the target candidate region is a foreground (namely a positive sample) or a background (namely a negative sample), and generating an x coordinate and a y coordinate of a central point of the candidate region and the width w and the height h of the candidate region.
Definitions
The term "gastric cancer" refers to malignant tumors derived from gastric mucosal epithelial cells, including early gastric cancer and progressive gastric cancer.
The term "chronic atrophic gastritis", also known as atrophic gastritis, is a chronic digestive system disease characterized by atrophy of the gastric mucosa epithelium and glands, a reduced number, thinning of the gastric mucosa, thickening of the mucosal base layer, or concomitant pyloroadenogenesis and intestinal adenogenesis, or atypical hyperplasia. Is a multi-pathogenic disease and precancerous lesion.
The term "chronic superficial gastritis" refers to a disease in which gastric mucosa presents chronic superficial inflammation, is a common disease of the digestive system, and belongs to one of chronic gastritis. Can be caused by alcoholism, espresso, bile reflux, or helicobacter pylori infection. Patients may have varying degrees of dyspepsia symptoms. In the present invention, chronic superficial gastritis represents a relatively normal change of gastric mucosa, and if diagnosed "chronic superficial gastritis" corresponds to no clear gastric mucosal lesions.
In one embodiment of the invention, endoscopic images of chronic superficial gastritis are used as interference samples in the atrophic gastritis test data set; this arrangement makes it possible to effectively evaluate the trained deep learning network's ability to distinguish atrophic gastritis from relatively normal gastric mucosa.
The term "module" refers to a collection of functions capable of achieving a particular effect, which may be performed by a computer alone, by a human being, or by both a computer and a human being.
Obtaining lesion data
The key role of the step of acquiring lesion data is to obtain sample material for deep learning.
In one embodiment, the acquisition process may specifically include the steps of acquisition and prescreening.
By "acquiring" is meant that all endoscopic diagnostic images of all patients suffering from gastric cancer or chronic atrophic gastritis are searched and acquired in all endoscopic databases according to the standard of "diagnosing gastric cancer or chronic atrophic gastritis", for example, all pictures in the folder to which the patient diagnosed with gastric cancer or chronic atrophic gastritis belongs, that is, all stored pictures of a certain patient in the whole endoscopic examination process, so that gastroscopy pictures except lesions of a target site may be included, for example, pictures stored in various sites in the examination process of esophagus, fundus, stomach body, duodenum and the like are included in the folder under the name of the patient, which is diagnosed with benign ulcer, polyp and the like.
"Primary screening" is the step of screening the acquired lesion images of gastric cancer patients; it can be carried out by an experienced endoscopist according to the "endoscopic findings" combined with the "pathological diagnosis" recorded in the case. The pictures used for the deep learning network must be of clear quality and show accurate features, otherwise learning becomes more difficult or the recognition results become inaccurate; the module and/or step of primary screening of lesion data therefore selects, from a set of examination pictures, those in which the gastric cancer lesion site is clearly visible.
Importantly, the primary screening can combine the description of the atrophic site in the patient's post-biopsy histopathological result, i.e. the "pathological diagnosis", to accurately localize the lesion, while also taking into account the clarity, shooting angle and degree of magnification of the picture, selecting wherever possible endoscopic images that are of high definition, moderately magnified and show the full view of the lesion.
Through this preliminary screening, the images entering the training set are guaranteed to be high-quality images containing confirmed lesion sites, which improves the feature accuracy of the training data set, allows the artificial intelligence network to better summarize the image features of atrophic lesions, and improves diagnostic accuracy.
Pretreatment of lesion images
The pretreatment is to finish the process of accurately framing the focus part of the gastric cancer, wherein the part in the frame is defined as a positive sample, the part outside the frame is defined as a negative sample, and the position coordinate information and focus type information of the focus are output.
In one embodiment, the frame selection is implemented by an image preprocessing program.
The term "image preprocessing program" refers to a program that enables the framing of a target area in an image, thereby marking the type and extent of the target area.
In one embodiment, the image pre-processing program is also capable of desensitizing the image to remove patient personal information.
In one embodiment, the image preprocessing program is a piece of software written in a computer programming language capable of performing this function.
In another embodiment, the image preprocessing program is software capable of performing the framing function.
In a specific embodiment, the software performing the framing function imports the picture to be processed and displays it on the operating interface; the operator performing the frame selection (for example a physician) then only needs to drag the mouse across the target lesion site from upper left to lower right (or along another diagonal direction) to form a rectangular or square frame covering the target lesion, while the background generates and stores the exact coordinates of the upper-left and lower-right corners of the frame for unique positioning.
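One possible record format for what such frame-selection software could store per framed lesion (field names are illustrative only, not the patented software's actual format):

```python
from dataclasses import dataclass

@dataclass
class LesionAnnotation:
    image_id: str        # picture identifier after desensitisation
    lesion_type: str     # e.g. "gastric cancer" or "chronic atrophic gastritis"
    x1: int              # upper-left corner of the dragged rectangle (pixels)
    y1: int
    x2: int              # lower-right corner
    y2: int
    annotator: str       # which endoscopist framed this lesion

# One record per framed lesion; the two corner coordinates uniquely position the frame.
record = LesionAnnotation("img_000123", "gastric cancer", 152, 88, 410, 335, "doctor_A")
print(record)
```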
In order to ensure the accuracy of preprocessing (or frame selection), the invention emphasizes the quality control of frame selection, which is also an important guarantee that the method/system of the invention can obtain greater accuracy, and the specific mode is as follows: 2n endoscopists perform frame selection in a back-to-back mode, namely, 2n persons are randomly divided into n groups, 2 persons/group, all images are randomly divided into n parts at the same time, and the n parts are randomly distributed to each group of doctors for frame selection; when the framing is complete, the framing results of each group of two physicians are compared and the consistency of the framing results between the two physicians is evaluated, resulting in a determination of the framing site, wherein n is a natural number between 1 and 100, such as 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90 or 100.
In one embodiment, the evaluation criterion for consistency is as follows: for the same lesion picture, the framing results of the two physicians in a group are compared, i.e. the overlapping regions of the rectangular frames determined by their diagonal coordinates are compared; if the area of the overlapping part (i.e. the intersection) of the two rectangular frames is greater than 50% of the area covered by their union, the framing judgments of the two physicians are considered to be in good agreement, and the diagonal coordinates corresponding to the intersection are stored as the final localization of the target lesion. Conversely, if the area of the overlapping part (i.e. the intersection) is smaller than 50% of the area covered by the union, the framing judgments of the two physicians are considered to differ substantially; such lesion pictures are set aside separately by the software background, and the final positions of the target lesions are later discussed and determined jointly by all physicians participating in the framing work.
Image recognition model
The term "image recognition model" refers to an algorithm built based on principles of machine learning and/or deep learning, and may also be referred to as an "image recognition model" or an "image recognition program.
In one embodiment, the program is a neural network, preferably a convolutional neural network; in another embodiment, the neural network is a convolutional neural network based on the LeNet-5, RCNN, SPP, Fast-RCNN and/or Faster-RCNN architectures; Faster-RCNN may be regarded as a combination of Fast-RCNN and an RPN, and one embodiment is based on a Faster-RCNN network.
The image recognition program comprises at least the following layers: an original-image feature extraction layer, a candidate region selection layer and a target recognition layer; its trainable parameters are adjusted through a preset algorithm.
The term "artwork feature extraction layer" refers to a hierarchy or combination of hierarchies that is capable of performing mathematical calculations on an input image to be trained to extract artwork information in multiple dimensions. The layers may actually represent a combination of a plurality of different functional layers.
In one embodiment, the artwork feature extraction layer may be based on a ZF or VGG16 network.
The term "convolution layer" refers to a network layer in the original image feature extraction layer, which is responsible for performing convolution operation on an original input image or image information processed by a sampling layer, so as to extract information. The convolution operation is actually implemented by sliding a convolution kernel (for example, 3*3) with a specific size on an input image in a certain step (for example, 1 pixel), multiplying the corresponding weights of the pixel on the picture and the convolution kernel in the process of moving the convolution kernel, and finally adding all the products to obtain an output. In image processing, an image is often represented as a vector of pixels, so a digital image can be regarded as a discrete function in two dimensions, for example, denoted as f (x, y), and if an operation function C (u, v) is performed on two dimensions, an output image g (x, y) =f (x, y) ×c (u, v) is generated, and image blurring and information extraction can be performed by convolution.
The term training refers to repeated self-adjustment of parameters of a trainable image recognition program by inputting a large number of manually marked samples, so that the expected purpose of recognizing the lesion part in the gastric cancer image is achieved.
In one embodiment, the present invention is based on a Faster-RCNN network and employs the following end-to-end training procedure in step S3:
(1) Initialize the parameters of the target candidate region generation network (RPN) with a model pre-trained on ImageNet, and fine-tune the network;
(2) Initialize the Fast R-CNN network parameters, likewise with a model pre-trained on ImageNet, and then train the network using the region proposals extracted by the RPN in (1);
(3) Re-initialize the RPN using the Fast R-CNN network from (2), fix the shared convolutional layers and fine-tune the RPN network, wherein only the layers specific to the RPN (its cls and/or reg layers) are fine-tuned;
(4) Fix the convolutional layers of the Fast R-CNN from (2) and fine-tune the Fast R-CNN network using the region proposals extracted by the RPN in (3), wherein only the fully connected layers of the Fast R-CNN are fine-tuned.
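A toy, runnable illustration of this four-stage alternating schedule; "models" are simple dicts of parameter groups and "training" only records which groups each stage may update, so it sketches the schedule, not the actual optimisation:

```python
IMAGENET = {"shared_conv": "imagenet", "rpn_cls_reg": "random", "fc": "random"}

def train(model, trainable, stage):
    """Return a copy of `model` in which only the groups in `trainable` are updated."""
    return {name: (f"tuned@stage{stage}" if name in trainable else value)
            for name, value in model.items()}

# (1) RPN initialised from ImageNet weights and fine-tuned as a whole
rpn = train(dict(IMAGENET), {"shared_conv", "rpn_cls_reg"}, stage=1)

# (2) Fast R-CNN initialised from ImageNet weights, trained on the RPN's proposals
fast_rcnn = train(dict(IMAGENET), {"shared_conv", "fc"}, stage=2)

# (3) RPN re-initialised from Fast R-CNN's shared convolution layers, which are
#     frozen; only the RPN-specific cls/reg layers are fine-tuned
rpn = train({**rpn, "shared_conv": fast_rcnn["shared_conv"]}, {"rpn_cls_reg"}, stage=3)

# (4) Fast R-CNN's convolution layers stay frozen; only its fully connected
#     layers are fine-tuned on proposals from the updated RPN
fast_rcnn = train(fast_rcnn, {"fc"}, stage=4)

print(rpn)        # {'shared_conv': 'tuned@stage2', 'rpn_cls_reg': 'tuned@stage3', 'fc': 'random'}
print(fast_rcnn)  # {'shared_conv': 'tuned@stage2', 'rpn_cls_reg': 'random', 'fc': 'tuned@stage4'}
```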
The term "candidate region selection layer": it means that the selection of specific areas on the original image for classification recognition and frame regression is realized through an algorithm, and similar to the original image feature extraction layer, the layer can also represent a combination of a plurality of different layers.
In one embodiment the candidate region selection layer is directly connected to the original input layer.
In one embodiment, the candidate region selection layer is directly connected to the last layer of the artwork feature extraction layer.
In one embodiment, the "candidate region selection layer" may be based on the RPN.
The term "object recognition layer" the term "sampling layer", sometimes called pooling layer, operates similar to a convolution layer except that the convolution kernel of the sampling layer takes only the maximum value, average value, etc. of the corresponding position (maximum pooling, average pooling).
The term "feature map", also called featuremap, refers to a small-area high-dimensional multi-channel image obtained by performing convolution operation on an original image through an original image feature extraction layer, and as an example, the feature map may be a 256-channel image with a scale of 51×39.
The term "sliding window" refers to a small sized (e.g., 2 x 2,3 x 3) window generated on the feature map that moves along each position of the feature map, although the feature map size is not large, a larger field of view can be achieved with a smaller sliding window on the feature map because the feature map has been subjected to multiple layers of data extraction (e.g., convolution).
The term "candidate region" may also be referred to as a candidate window, a target candidate region, referencebox, boundingbox, and may also be used interchangeably herein with anchor or anchor box.
In one embodiment, the sliding window is first positioned at a location of the feature map, and for that location k rectangular or square windows with different areas and different proportions, for example 9, are generated and anchored at the centre of the location (hence also called anchors or anchor boxes); based on the relationship between each sliding-window position in the feature map and the corresponding centre position in the original image, candidate regions are formed, which can essentially be regarded as the range of the original image corresponding to the sliding window (3 × 3) moved over the last convolutional layer.
In one embodiment of the invention, k=9, when generating the candidate region comprises the steps of:
(1) Firstly, 9 anchors are generated according to different areas and aspect ratios; these anchors do not change with the size of the feature map or of the original input image;
(2) For each input image, calculating the center point of the original image corresponding to each sliding window according to the image size;
(3) And establishing a mapping relation between the sliding window position and the original image position based on the calculation.
The term "middle layer" refers to a layer that can be considered a new hierarchy, herein referred to as a middle layer, by further mapping the feature map into a multi-dimensional (e.g., 256-dimensional or 512-dimensional) vector after the target candidate region is formed using a sliding window. And the classification layer and the window regression layer are connected behind the middle layer.
The term "classification layer" (cls_score), a branch of the connection with the middle layer output, is capable of outputting 2k scores, two scores for k target candidate regions, one of which is a foreground (i.e., positive sample) score and one of which is a background (i.e., negative sample) score, which can determine whether the target candidate region is a true target or background. The classification layer can thus output probabilities belonging to the foreground (i.e., positive samples) and the background (i.e., negative samples) from the high-dimensional (e.g., 256-dimensional) features for each sliding window position.
Specifically, in one embodiment, when the IoU (intersection over union) of a candidate region with any ground truth (the real sample boundary, i.e. the boundary of the object to be identified in the original image) is greater than 0.7, the region is regarded as a positive sample (positive label); when its IoU with any ground truth is less than 0.3, it is regarded as background (i.e. a negative sample). In this way a class label is assigned to each anchor. The IoU mathematically represents the overlap between the candidate region and the ground truth and is calculated as follows:
IoU = area(A ∩ B) / area(A ∪ B)
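The anchor-labelling rule then reads as below; the handling of anchors whose IoU falls between 0.3 and 0.7 (ignored during training) is an assumption consistent with common Faster-RCNN practice rather than something stated above:

```python
def label_anchor(anchor_iou):
    """anchor_iou: the anchor's highest IoU (intersection area / union area)
    with any ground-truth lesion box in the picture."""
    if anchor_iou > 0.7:
        return "positive"   # treated as a lesion (foreground) sample
    if anchor_iou < 0.3:
        return "negative"   # treated as background
    return "ignored"        # assumption: intermediate anchors are not used for training

print(label_anchor(0.82), label_anchor(0.12), label_anchor(0.5))
# positive negative ignored
```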
The classification layer may also output a (k+1)-dimensional array p representing the probabilities of belonging to each of the k classes and to the background. For each RoI (Region of Interest) it outputs a discrete probability distribution, where p is computed by a softmax over the k+1 outputs of the fully connected layer. The mathematical expression is:
p = (p_0, p_1, …, p_k)
the term "windowed regression layer" (bbox_pred), another leg of the connection to the middle layer output, is juxtaposed to the classification layer. This layer is able to output the parameters that the 9 anchor corresponding windows should pan and zoom at each location. Respectively corresponding to k target candidate regions, wherein each target candidate region has 4 frame position adjustment values, and the 4 frame position adjustment values refer to x of the upper left corner of the target candidate region a Coordinates, y a Height h of coordinate and target candidate region a Sum width w a Is set in the above-described table. The effect of the branch is to finely adjust the position of the target candidate area, so that the position of the obtained result is more accurate.
The window regression layer may output the displacement of the boundingbox regression, output 4*K dimension array t, representing the parameters that should be scaled by translation when belonging to k classes, respectively. The mathematical expression is as follows:
k represents the index of the category,refers to a translation that is unchanged relative to the objectproposal scale, +.>Refers to the height and width in log space relative to objectproposal.
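A PyTorch sketch of the middle layer together with the cls_score and bbox_pred branches (the channel counts and k = 9 are illustrative choices; this is not the patented network definition):

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    def __init__(self, in_channels=512, mid_channels=512, k=9):
        super().__init__()
        # middle layer: maps each sliding-window position to a multi-dimensional vector
        self.middle = nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1)
        # classification layer: 2 scores (foreground / background) per anchor
        self.cls_score = nn.Conv2d(mid_channels, 2 * k, kernel_size=1)
        # window regression layer: 4 position-adjustment values per anchor
        self.bbox_pred = nn.Conv2d(mid_channels, 4 * k, kernel_size=1)

    def forward(self, feature_map):
        x = torch.relu(self.middle(feature_map))
        return self.cls_score(x), self.bbox_pred(x)

scores, deltas = RPNHead()(torch.randn(1, 512, 38, 50))
print(scores.shape, deltas.shape)   # (1, 18, 38, 50) and (1, 36, 38, 50)
```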
In one embodiment, the present invention trains the classification layer and the window regression layer simultaneously through a loss function composed of the classification loss (i.e. the softmax loss of the classification layer) and the regression loss (an L1 loss) combined in a certain proportion.
The softmax loss requires the ground-truth calibration of each candidate region together with the corresponding prediction; three sets of information are needed to calculate the regression loss:
(1) The predicted centre coordinates x, y and the width and height w, h of the candidate region;
(2) The centre coordinates x_a, y_a and the width and height w_a, h_a of each of the 9 anchor reference boxes around the candidate region;
(3) The centre coordinates and the width and height of the corresponding real calibration frame (ground truth).
The regression loss and the total loss are calculated as follows:
t_x = (x - x_a) / w_a,  t_y = (y - y_a) / h_a,
t_w = log(w / w_a),  t_h = log(h / h_a),
L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i*) + λ (1/N_reg) Σ_i p_i* L_reg(t_i, t_i*),
where p_i is the predicted probability that anchor i is a target;
p_i* takes one of two values: 0 for a negative label and 1 for a positive label;
t_i is the vector of the 4 parameterized coordinates of the predicted candidate region;
and t_i* is the coordinate vector of the ground-truth bounding box corresponding to a positive anchor.
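The parameterization can be computed as follows (boxes are assumed to be given as centre coordinates with width and height; applying the same formulas to the ground-truth box gives the regression targets t*):

```python
import math

def regression_targets(box, anchor):
    """Compute (t_x, t_y, t_w, t_h) for a box relative to its anchor.

    Both arguments are (x, y, w, h): centre coordinates, width and height.
    """
    x, y, w, h = box
    x_a, y_a, w_a, h_a = anchor
    return ((x - x_a) / w_a,
            (y - y_a) / h_a,
            math.log(w / w_a),
            math.log(h / h_a))

print(regression_targets((110, 105, 190, 210), (100, 100, 200, 200)))
# approximately (0.05, 0.025, -0.0513, 0.0488)
```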
In one embodiment, a mini-batch based gradient descent method is used when training with the loss function, i.e. a mini-batch containing multiple positive and negative candidate regions (anchors) is generated for each training picture; 256 anchors are then randomly sampled from each picture so that the ratio of positive to negative anchors is approximately 1:1, and the loss function of the corresponding mini-batch is calculated. If the number of positive anchors in a picture is less than 128, the mini-batch is filled with negative anchors.
In one specific embodiment, the learning rate of the first 50000 mini-batches is set to 0.001 and the learning rate of the next 50000 mini-batches is set to 0.0001; the momentum term is preferably set to 0.9 and the weight decay is preferably set to 0.0005.
After the training, the trained deep learning network is used for identifying the endoscopic picture of the target lesion. In one embodiment, the classification score is set to 0.85, i.e., a lesion with a probability of greater than 85% confirmed by the deep learning network is marked, such that the picture is judged positive; conversely, if no suspicious lesion area is detected in a picture, the picture is determined to be negative.
Examples
1. Exempt from informed consent statement:
(1) The study only uses endoscopic pictures and related clinical data obtained during past clinical diagnosis and treatment at the digestive endoscopy center of Beijing Friendship Hospital to carry out a retrospective observational study; the condition, treatment, prognosis and even life safety of the patients are not affected;
(2) A principal researcher completes all data acquisition alone and, immediately after the picture data are acquired, applies dedicated software to erase personal information from all pictures, so that no patient privacy information is disclosed during the subsequent physician screening, frame selection, and the training, debugging and testing performed by the artificial intelligence programming experts;
(3) The electronic medical record query system of the digestive endoscopy center does not display items such as contact information or home address, i.e. the system does not store patients' contact information, so the study cannot trace back to the patients to have them sign an informed consent form.
2. Pathological image acquisition
Criteria for inclusion
(1) Patients who underwent endoscopy (including electronic gastroscopy, electronic colonoscopy, endoscopic ultrasonography, electronic chromoendoscopy, magnifying endoscopy and dye chromoendoscopy) at the digestive endoscopy center of Beijing Friendship Hospital from January 1, 2013 to June 10, 2017;
(2) Patients diagnosed under the scope with "gastric cancer" (including and not distinguishing between early gastric cancer and progressive gastric cancer);
(3) Patients diagnosed under endoscopy with chronic atrophic gastritis and confirmed by pathology, with clear endoscopic pictures and relevant clinical data available;
exclusion criteria
Stomach cancer
(1) Extensive or undefined malignant tumor of digestive tract;
(2) Those suffering from malignant tumors of the pancreatic and biliary system only;
(3) Combining malignant tumors of other systems;
(4) Endoscopic pictures that are unclear and/or taken at an unsatisfactory angle.
Atrophic gastritis
(1) The biopsy site of atrophic gastritis under endoscopy is unclear, making identification of the lesion in the endoscopic picture difficult;
(2) Endoscopic pictures that are unclear and/or taken at an unsatisfactory angle.
3. Experimental procedure and results
(1) Data acquisition: endoscopic pictures and related clinical data of patients who underwent endoscopy (including electronic gastroscopy, electronic colonoscopy, endoscopic ultrasonography, electronic chromoendoscopy, magnifying endoscopy and dye chromoendoscopy) between January 2013 and June 2017 and were diagnosed under endoscopy with "gastric cancer" (including and not distinguishing early gastric cancer and progressive gastric cancer) or "chronic atrophic gastritis" were retrieved from the electronic medical record system of the digestive endoscopy center of Beijing Friendship Hospital;
(2) Removal of personal information: immediately after acquisition, all pictures are processed to erase personal information.
(3) Picture screening: all processed pictures were finely screened; endoscopic pictures corresponding to cases with pathology-confirmed gastric cancer were selected, and, according to the biopsied lesion sites, the pictures in each case that clearly show the target lesion site with little background interference were finally selected, 3774 pictures in total; and
endoscopic pictures corresponding to cases with pathology-confirmed atrophic gastritis were likewise selected, and, according to the biopsied lesion sites, clear pictures containing the target lesion site with little background interference were finally selected from each case, 10064 pictures in total;
For gastric cancer:
(4) Construction of the test data set: the test pictures comprise 50 pictures of pathology-confirmed "gastric cancer" (both early and progressive gastric cancer) and 50 pictures of pathology-confirmed gastric "non-neoplastic lesions" (including benign gastric ulcer, polyp, stromal tumor, lipoma and ectopic pancreas) randomly collected from the database. The specific operations are as follows:
firstly, randomly selecting 50 stomach cancer pictures from all stomach cancer pictures screened in the step (3);
then, 50 endoscopic pictures of the stomach (including benign ulcer, polyp, interstitial tumor, lipoma and ectopic pancreas) with confirmed pathological results are randomly collected in a database, and the 50 pictures are immediately subjected to personal information erasing treatment;
(5) Building a training data set: excluding the pictures randomly selected in the step (4) from the gastric cancer pictures screened in the step (3) for constructing a test data set, and using the remaining 3724 pictures for deep learning network training so as to form a training data set;
for atrophic gastritis:
(4) ' build test dataset: the test pictures comprise 50 pictures of the "chronic atrophic gastritis" endoscope with pathological results and 50 pictures of the "chronic superficial gastritis" endoscope with pathological results. The specific operation comprises the following steps:
Firstly, randomly selecting 50 pictures from all the atrophic gastritis pictures screened in the step (3);
then 50 endoscope pictures of the chronic superficial gastritis with the pathological result being confirmed are randomly collected in a database, and the 50 pictures are immediately wiped out for personal information processing;
(5) ' building a training dataset: excluding the pictures randomly selected in the step (4)' from the atrophic gastritis pictures screened in the step (3) for constructing a test data set, and using the rest 10014 pictures for deep learning network training so as to form a training data set;
(6) Target lesion frame selection: 6 endoscopists were randomly divided into 3 groups of 2, working in a "back-to-back" manner; all screened training pictures were randomly divided into 3 equal portions and randomly assigned to the groups for frame selection. The lesion frame-selection step is implemented with self-written software: after a picture to be processed is imported, the software displays it on the operating interface, and the physician drags the mouse across the target lesion site from upper left to lower right, forming a rectangular frame covering the target lesion, while the background generates and stores the exact coordinates of the upper-left and lower-right corners of the frame for unique positioning.
After the frame selection is completed, comparing the frame selection results of each group of 2 doctors, comparing the overlapping areas of the rectangular frames determined by the diagonal coordinates for the same lesion picture, and finally determining that if the area (i.e. intersection) of the overlapping parts of the two rectangular frames is greater than 50% of the area covered by the union of the two rectangular frames after the test, the frame selection judgment results of the 2 doctors are considered to be good in consistency, and the diagonal coordinates corresponding to the intersection are stored as the final positioning of the target lesion. If the area of the overlapping part (i.e. intersection) of the two rectangular frames is smaller than 50% of the area covered by the union of the two rectangular frames, the frame selection judgment results of 2 doctors are considered to be quite different, then such lesion pictures are singly selected by a software background (or manual mark), and the final position of the target lesion is determined through the joint discussion of all doctors participating in the frame selection work in the later stage.
(7) Training input: all framed pictures were input into a Faster-RCNN convolutional neural network for training, and the two network structures ZF and VGG16 were tested; the training adopts an end-to-end mode;
wherein the ZF network has 5 convolutional layers, 3 fully connected layers and one softmax classification output layer, and the VGG16 network has 13 convolutional layers, 3 fully connected layers and one softmax classification output layer; within the Faster-RCNN framework, both the ZF and VGG16 models serve as the base CNN used to extract features from the training images.
During training, a mini-batch-based gradient descent method is adopted, namely a mini-batch containing a plurality of positive and negative candidate regions (anchors) is generated for each training picture. 256 anchors were then randomly sampled from each picture until the ratio of positive and negative anchors was approximately 1:1, and then the loss function (Lossfunction) of the corresponding mini-batch was calculated. If the number of positive anchors in a picture is less than 128, then the negative anchors are used to fill the mini-batch.
The learning rate is set to 0.001 for the first 50000 mini-batches and to 0.0001 for the next 50000 mini-batches; the momentum term is preferably set to 0.9 and the weight decay is preferably set to 0.0005.
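A sketch of an optimizer configuration matching these values, assuming PyTorch; the placeholder module stands in for the detection network above, and stepping the scheduler once per mini-batch reproduces the 0.001 to 0.0001 schedule.

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 2)  # placeholder; in practice, the detection network above
optimizer = torch.optim.SGD(
    model.parameters(), lr=0.001, momentum=0.9, weight_decay=0.0005
)
# Calling scheduler.step() once per mini-batch multiplies the learning rate
# by 0.1 after 50 000 mini-batches (0.001 -> 0.0001).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50_000, gamma=0.1)
```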
The loss function (Loss Function) used in training is as follows:

L({p_i}, {t_i}) = (1/N_cls) · Σ_i L_cls(p_i, p_i*) + λ · (1/N_reg) · Σ_i p_i* · L_reg(t_i, t_i*)
In the above formula, i denotes the index of an anchor within a mini-batch; p_i denotes the predicted probability that anchor i is a target (Object); p_i* is the ground-truth label of the anchor: the label is 1 when the anchor is an Object and 0 otherwise; t_i is a 4-dimensional vector representing the parameterized coordinates of the predicted bounding box; and t_i* represents the ground-truth parameterized coordinates of the bounding box used in bounding-box regression. L_cls denotes the classification loss, L_reg the regression loss, N_cls and N_reg are normalization terms, and λ is a balancing weight.
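A hedged sketch of this two-term loss in PyTorch: the classification term is averaged over all sampled anchors, while the regression term (smooth-L1, as in the original Faster R-CNN formulation) is counted only for positive anchors; the value of λ and the choice of normalizers here are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def faster_rcnn_loss(p, p_star, t, t_star, lam=10.0):
    """p, p_star: (N,) predicted probabilities and ground-truth labels of the
    sampled anchors; t, t_star: (N, 4) predicted and ground-truth parameterized
    bounding-box coordinates."""
    n_cls = p.shape[0]                        # mini-batch size, e.g. 256
    n_reg = max(int(p_star.sum().item()), 1)  # number of positive anchors
    l_cls = F.binary_cross_entropy(p, p_star, reduction="sum") / n_cls
    per_anchor_reg = F.smooth_l1_loss(t, t_star, reduction="none").sum(dim=1)
    l_reg = (p_star * per_anchor_reg).sum() / n_reg  # only positives contribute
    return l_cls + lam * l_reg


# Tiny example with 4 sampled anchors (2 positive, 2 negative).
p = torch.tensor([0.9, 0.2, 0.7, 0.1])
p_star = torch.tensor([1.0, 0.0, 1.0, 0.0])
t = torch.zeros(4, 4)
t_star = torch.ones(4, 4)
print(faster_rcnn_loss(p, p_star, t, t_star))
```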
(8) Testing and result statistics: the artificial intelligence system and gastroenterologists of different seniority are tested with the test data set (comprising 50 pictures of gastric cancer and 50 pictures of non-neoplastic gastric lesions), and indices such as sensitivity, specificity, accuracy and consistency of the artificial intelligence system and of the physicians are compared, evaluated and statistically analyzed. In the test, the classification score threshold of the trained deep learning network for identifying a target lesion in an endoscopic picture is set to 0.85, i.e. a picture is judged positive when the network identifies a lesion with a probability exceeding 85%; conversely, if no suspicious lesion area is detected in a picture, the picture is judged negative.
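The picture-level decision rule can be summarized by the following sketch, where `detections` is assumed to be the list of (box, score) pairs returned by the trained network for one picture.

```python
def classify_picture(detections, threshold=0.85):
    """detections: list of (box, score) pairs returned for one picture."""
    return "positive" if any(score > threshold for _, score in detections) else "negative"


print(classify_picture([((40, 30, 180, 150), 0.91)]))  # positive
print(classify_picture([]))                            # negative (no suspicious region)
```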
The results were as follows:
(1) Gastric cancer
Based on the platform of the National Clinical Research Center for Digestive Diseases, the sensitivity of the 89 participants in the overall gastric cancer endoscopic lesion diagnosis test ranged from 48% to 100% (median 88%, average sensitivity 87%); specificity ranged from 10% to 98% (median 78%, average specificity 74%); and accuracy ranged from 51% to 91% (median 82%, average accuracy 80%). The deep learning network model achieved a diagnostic sensitivity of 90%, a specificity of 50% and an accuracy of 70%. Therefore, for the diagnosis of gastric cancer from gastroscopy pictures, the sensitivity of the artificial intelligence exceeds the general level of the physicians, while its specificity is below the median level and its accuracy is slightly below the physicians' median level. However, considering that the deep learning network model is highly stable in recognition whereas individual physicians show large fluctuations and instability in specificity and accuracy, using artificial intelligence for recognition can still effectively remove the diagnostic deviation caused by individual differences between physicians, and the method therefore has good application prospects.
Here, sensitivity (SEN), also called the true positive rate (TPR), is the percentage of actually diseased subjects that are correctly identified as positive by the diagnostic criterion.
Specificity (SPE), also called the true negative rate (TNR), reflects the ability of the screening test to correctly identify non-diseased subjects.
Accuracy = number of correctly identified subjects / total number of subjects tested.
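The three indices can be computed from a confusion matrix as in the following sketch; the example counts are illustrative values chosen only to reproduce rates of 90% / 50% / 70%.

```python
def metrics(tp, fp, tn, fn):
    sensitivity = tp / (tp + fn)                # true positive rate
    specificity = tn / (tn + fp)                # true negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return sensitivity, specificity, accuracy


# Illustrative counts only: 45 of 50 positive and 25 of 50 negative pictures
# identified correctly give sensitivity 0.90, specificity 0.50, accuracy 0.70.
print(metrics(tp=45, fp=25, tn=25, fn=5))
```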
(2) Atrophic gastritis
Based on the platform of the National Clinical Research Center for Digestive Diseases, 77 gastroenterologists from medical institutions of different regions and different levels took part in the diagnostic test on endoscopic pictures of atrophic gastritis. The sensitivity of the 77 participating physicians ranged from 16% to 100% (median 78%, average sensitivity 74%), specificity ranged from 0% to 94% (median 88%, average specificity 82%), and accuracy ranged from 21% to 87% (median 81%, average accuracy 78%). The deep learning network model achieved a diagnostic sensitivity of 95%, a specificity of 86% and an accuracy of 90%. Thus, for the diagnosis of atrophic gastritis from gastroscopy pictures, the artificial intelligence is clearly superior to the level of the 77 physicians in sensitivity, specificity and accuracy.
Further, the physicians were divided into four subgroups according to their years of endoscopic experience: the first group comprises physicians with less than 5 years of endoscopic experience, the second group 5 to 10 years, the third group 10 to 15 years, and the fourth group 15 years or more. Further analysis of the diagnostic level of the physicians in each subgroup showed that the sensitivities of the first to fourth groups were 61.4%, 72.8%, 82.2% and 79.8%, the specificities were 78.2%, 73.8%, 81.4% and 85.4%, and the accuracies were 69.8%, 73.3%, 81.1% and 82.6%, respectively. It can be seen that, although the specificity of the second group was slightly lower than that of the first group in this test, the sensitivity, specificity and accuracy of the physicians' endoscopic diagnosis of atrophic gastritis generally increase as their endoscopic experience grows. The true positive rate and true negative rate of the deep learning network model are clearly superior to those of the physicians in the fourth group (i.e. the physicians with the longest endoscopic experience); in other words, the sensitivity, specificity and accuracy of the artificial intelligence network model in recognizing endoscopic pictures of atrophic gastritis reach the level of expert digestive endoscopists. Moreover, the sensitivity of the network model differed statistically from that of each physician subgroup (P < 0.05); in terms of accuracy, the algorithm model differed statistically from all groups except the third group of physicians (P = 0.103). For specificity in recognizing endoscopic pictures of atrophic gastritis, the algorithm model differed statistically from the second group of physicians (P = 0.034) but not from the other groups (P > 0.05).
As for diagnostic consistency, the inter-observer consistency results for the diagnosis of atrophic gastritis endoscopic pictures by the physicians in each subgroup are shown in Table 1. As can be seen from the table, the longer the physicians' endoscopic experience, the better the consistency of their diagnoses of atrophic gastritis endoscopic pictures within the group, with the physicians in the fourth group showing the highest diagnostic consistency. However, even the diagnostic consistency among expert physicians (the fourth group) is clearly lower than that of the deep learning network (Kappa = 1).
Table 1 diagnostic consistency results for each group of physicians
* Fleiss' Kappa (applicable when there are at least two observers).
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.

Claims (1)

1. A training method of a lesion image recognition model based on a neural network, wherein the lesion is gastric cancer, the training method comprising:
s1, obtaining lesion data, wherein the lesion data are used for obtaining an image containing a lesion part of a patient diagnosed with gastric cancer, and the image is an endoscope image;
s2, image preprocessing
Accurately framing the lesion site of gastric cancer, defining the part inside the frame as a positive sample and the part outside the frame as a negative sample, and outputting the position coordinate information and lesion type information of the lesion;
the frame selection can generate a rectangular frame or square frame containing a lesion part, and the coordinate information is the coordinate information of points at the upper left corner and the lower right corner of the rectangular frame or square frame;
the framed part is determined as follows: 2n endoscopists perform frame selection in a "back-to-back" manner, i.e. the 2n persons are randomly divided into n groups of 2 persons each, and at the same time all images are randomly divided into n parts and randomly assigned to the groups of physicians for frame selection; after frame selection is completed, the frame selection results of the two physicians in each group are compared and their consistency is evaluated to finally determine the framed part, wherein n is a natural number between 1 and 100;
the criteria for evaluating the consistency of the framing results between two physicians are as follows:
for each lesion picture, the overlapping area of the frame selection results of the two physicians in a group is compared; if the area of the overlapping part of the two physicians' framed regions is greater than 50% of the area covered by their union, the two physicians' frame selection judgments are considered to be in good agreement, and the diagonal coordinates corresponding to the intersection, i.e. the coordinates of the upper-left and lower-right corner points, are stored as the final localization of the target lesion;
if the area of the overlapping part is smaller than 50% of the area covered by the union of the two regions, the two physicians' frame selection judgments are considered to differ substantially; such lesion pictures are singled out, and all 2n physicians participating in the frame selection work jointly discuss and determine the final position of the target lesion;
s3, training an image recognition model
Performing supervised training of a trainable image recognition model using the images obtained in step S2, and adjusting the trainable parameters through a preset algorithm, thereby obtaining an image recognition model capable of recognizing gastric cancer;
the neural network is a convolutional neural network based on a Faster-RCNN architecture, and the convolutional neural network based on the Faster-RCNN architecture is selected from a ZF network or a VGG16 network;
the training comprises:
(1) Initializing the parameters of the region proposal network (RPN) used to generate target candidate regions with a model pre-trained on ImageNet, and fine-tuning the RPN;
(2) Initializing the Fast R-CNN network parameters, likewise with a model pre-trained on ImageNet, and then training it using the region proposals extracted by the RPN in (1);
(3) Re-initializing the RPN with the Fast R-CNN network of (2), fixing the convolutional layers and fine-tuning the RPN network, wherein only the cls and/or reg layers unique to the RPN are fine-tuned; and
(4) Fixing the convolutional layers of the Fast R-CNN in (2), and fine-tuning the Fast R-CNN network using the region proposals extracted by the RPN in (3), wherein only the fully connected layers of the Fast R-CNN are fine-tuned.
CN201811361336.5A 2018-11-15 2018-11-15 Pretreatment method of lesion image Active CN109523535B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811361336.5A CN109523535B (en) 2018-11-15 2018-11-15 Pretreatment method of lesion image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811361336.5A CN109523535B (en) 2018-11-15 2018-11-15 Pretreatment method of lesion image

Publications (2)

Publication Number Publication Date
CN109523535A CN109523535A (en) 2019-03-26
CN109523535B true CN109523535B (en) 2023-11-17

Family

ID=65778000

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811361336.5A Active CN109523535B (en) 2018-11-15 2018-11-15 Pretreatment method of lesion image

Country Status (1)

Country Link
CN (1) CN109523535B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020026341A1 (en) * 2018-07-31 2020-02-06 オリンパス株式会社 Image analysis device and image analysis method
CN110084275A (en) * 2019-03-29 2019-08-02 广州思德医疗科技有限公司 A kind of choosing method and device of training sample
CN111839428A (en) * 2019-04-25 2020-10-30 天津御锦人工智能医疗科技有限公司 Method for improving detection rate of colonoscope adenomatous polyps based on deep learning
WO2020256568A1 (en) * 2019-06-21 2020-12-24 Augere Medical As Method for real-time detection of objects, structures or patterns in a video, an associated system and an associated computer readable medium
CN110363768B (en) * 2019-08-30 2021-08-17 重庆大学附属肿瘤医院 Early cancer focus range prediction auxiliary system based on deep learning
CN111932507B (en) * 2020-07-31 2021-04-09 苏州慧维智能医疗科技有限公司 Method for identifying lesion in real time based on digestive endoscopy
CN113011524A (en) * 2021-04-15 2021-06-22 浙江三一装备有限公司 Method and device for identifying state of hoisting steel wire rope, operation machine and electronic equipment
US20230043601A1 (en) * 2021-08-05 2023-02-09 Argo AI, LLC Methods And System For Predicting Trajectories Of Actors With Respect To A Drivable Area
CN114359157B (en) * 2021-12-08 2023-06-16 南通大学 Colorectal polyp detection method based on deep learning
CN116958018A (en) * 2022-08-31 2023-10-27 腾讯科技(深圳)有限公司 Pathological region determination method for pathological image, model training method and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017101142A1 (en) * 2015-12-17 2017-06-22 安宁 Medical image labelling method and system
CN107368859A (en) * 2017-07-18 2017-11-21 北京华信佳音医疗科技发展有限责任公司 Training method, verification method and the lesion pattern recognition device of lesion identification model
CN107862694A (en) * 2017-12-19 2018-03-30 济南大象信息技术有限公司 A kind of hand-foot-and-mouth disease detecting system based on deep learning
CN108095683A (en) * 2016-11-11 2018-06-01 北京羽医甘蓝信息技术有限公司 The method and apparatus of processing eye fundus image based on deep learning
CN108765392A (en) * 2018-05-20 2018-11-06 复旦大学 A kind of digestive endoscope lesion detection and recognition methods based on sliding window

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9858496B2 (en) * 2016-01-20 2018-01-02 Microsoft Technology Licensing, Llc Object detection and classification in images

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017101142A1 (en) * 2015-12-17 2017-06-22 安宁 Medical image labelling method and system
CN108095683A (en) * 2016-11-11 2018-06-01 北京羽医甘蓝信息技术有限公司 The method and apparatus of processing eye fundus image based on deep learning
CN107368859A (en) * 2017-07-18 2017-11-21 北京华信佳音医疗科技发展有限责任公司 Training method, verification method and the lesion pattern recognition device of lesion identification model
CN107862694A (en) * 2017-12-19 2018-03-30 济南大象信息技术有限公司 A kind of hand-foot-and-mouth disease detecting system based on deep learning
CN108765392A (en) * 2018-05-20 2018-11-06 复旦大学 A kind of digestive endoscope lesion detection and recognition methods based on sliding window

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Lung tumor detection method based on deep learning; Chen Qiangrui; Xie Shipeng; Computer Technology and Development (Issue 04); full text *

Also Published As

Publication number Publication date
CN109523535A (en) 2019-03-26

Similar Documents

Publication Publication Date Title
CN109523535B (en) Pretreatment method of lesion image
CN109544526B (en) Image recognition system, device and method for chronic atrophic gastritis
CN115345819A (en) Gastric cancer image recognition system, device and application thereof
CN110599448B (en) Migratory learning lung lesion tissue detection system based on MaskScoring R-CNN network
US20220309653A1 (en) System and method for attention-based classification of high-resolution microscopy images
CN110600122B (en) Digestive tract image processing method and device and medical system
CN109584218A (en) A kind of construction method of gastric cancer image recognition model and its application
JP2021513435A (en) Systems and methods for diagnosing gastrointestinal tumors
CN108268870A (en) Multi-scale feature fusion ultrasonoscopy semantic segmentation method based on confrontation study
Mira et al. Early Diagnosis of Oral Cancer Using Image Processing and Artificial Intelligence
CN106023151A (en) Traditional Chinese medicine tongue manifestation object detection method in open environment
CN109063643B (en) Facial expression pain degree identification method under condition of partial hiding of facial information
CN110729045A (en) Tongue image segmentation method based on context-aware residual error network
CN109670530A (en) A kind of construction method of atrophic gastritis image recognition model and its application
CN112263217B (en) Improved convolutional neural network-based non-melanoma skin cancer pathological image lesion area detection method
CN112927215A (en) Automatic analysis method for digestive tract biopsy pathological section
CN115880266B (en) Intestinal polyp detection system and method based on deep learning
CN116030303B (en) Video colorectal lesion typing method based on semi-supervised twin network
Ciaccio et al. Recommendations to quantify villous atrophy in video capsule endoscopy images of celiac disease patients
CN115880245A (en) Self-supervision-based breast cancer disease classification method
CN112651400B (en) Stereoscopic endoscope auxiliary detection method, system, device and storage medium
CN112734707B (en) Auxiliary detection method, system and device for 3D endoscope and storage medium
JP2019013461A (en) Probe type confocal laser microscopic endoscope image diagnosis support device
Sun et al. Identifying mouse autoimmune uveitis from fundus photographs using deep learning
WO2019221222A1 (en) Learning device, learning method, program, learned model, and bone metastasis detection device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
EE01 Entry into force of recordation of patent licensing contract
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20190326

Assignee: Beijing Mulin Baihui Technology Co.,Ltd.

Assignor: BEIJING FRIENDSHIP HOSPITAL, CAPITAL MEDICAL University

Contract record no.: X2023990000261

Denomination of invention: A preprocessing method of pathological image

License type: Common License

Record date: 20230217

GR01 Patent grant
GR01 Patent grant