CN111553361A - Pathological section label identification method - Google Patents

Pathological section label identification method

Info

Publication number
CN111553361A
CN111553361A (application CN202010199537.0A)
Authority
CN
China
Prior art keywords
characters
pathological section
identification method
network
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010199537.0A
Other languages
Chinese (zh)
Other versions
CN111553361B (en)
Inventor
王杰 (Wang Jie)
郑众喜 (Zheng Zhongxi)
向旭辉 (Xiang Xuhui)
陈杰 (Chen Jie)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
West China Hospital of Sichuan University
Original Assignee
West China Hospital of Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by West China Hospital of Sichuan University filed Critical West China Hospital of Sichuan University
Priority to CN202010199537.0A priority Critical patent/CN111553361B/en
Publication of CN111553361A publication Critical patent/CN111553361A/en
Application granted granted Critical
Publication of CN111553361B publication Critical patent/CN111553361B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G06V30/153 Segmentation of character regions using recognition of characters or words
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/32 Normalisation of the pattern dimensions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition

Abstract

The invention discloses a pathological section label identification method that identifies pathological section label images with a deep learning method. The base network of the model adopted for the deep learning is a RetinaNet network based on ResNet-50, together with a module that helps the base network recognize direction-sensitive characters. The module comprises a vertical self-attention branch, a horizontal self-attention branch and a middle branch, and the branches are fused as

O = C_v·β + C_h·(1 − β)   (1)

where, in formula (1), O denotes the output, C_v the vertical self-attention branch, C_h the horizontal self-attention branch, and β the output result of the middle branch.

Description

Pathological section label identification method
Technical Field
The invention relates to the field of medical detection, in particular to a pathological section label identification method.
Background
One of the current methods for pathological section label recognition is Optical Character Recognition (OCR). The mainstream OCR algorithms all comprise the following two steps:
1. detecting the characters in the scene;
2. recognizing the detected text.
The output of the first step is usually the position of a word or a line of characters, and the techniques currently used are mostly based on general-purpose object detection algorithms. In the second step, the corresponding text is cropped from the image according to the detection result of the first step, scaled to a fixed-height image, and then recognized with a CTC-based or attention-based method; these methods generally assume that the text is upright and reads left to right. Most current research focuses on the first step, with the main attention on how to handle irregular text.
Applying mainstream OCR algorithms directly to pathological section label recognition runs into the following problems:
1. Current mainstream OCR techniques need large amounts of training data: the first step typically requires 10k-50k annotated samples and the second step more than 1000k training samples. Collecting pathological section data of that order is practically impossible; the number of annotated samples used in this patent is fewer than 2000, far below the data volume used by mainstream OCR;
2. Mainstream OCR research focuses mostly on detecting irregular characters, as shown in FIG. 1, whereas pathological section labels are scanned by a digital slide scanner, as shown in FIG. 2, and exhibit almost no deformation;
3. The characters in a pathological section label can face any direction (different directions may even coexist in the same label), an aspect in which mainstream OCR takes little interest: most OCR methods simply assume that the characters are upright and read left to right;
4. Mainstream OCR mostly targets natural language, where the recognition unit is a word and semantic correlation exists between words; the characters in a pathology label are highly random, with little correlation between them;
5. The few techniques that can directly handle characters in arbitrary directions are limited to specific scenes, for example text generated at a fixed position according to a rule, text requiring an auxiliary locator, or text in a fixed font.
As described above, current mainstream OCR technology and label recognition differ greatly in both data volume and focus, so directly applying OCR technology to label recognition cannot achieve good results.
Disclosure of Invention
The invention aims to provide a pathological section label identification method which can correctly process characters in different directions.
In order to achieve the purpose, the invention is realized by adopting the following technical scheme:
the invention discloses a pathological section label identification method, which adopts a deep learning method to identify pathological section label images, wherein the basic network of a model adopted by the deep learning is a RetinaNet network based on ResNet-50 and a module used for helping the basic network to identify direction-sensitive characters, the module comprises a vertical self-attention mechanism branch, a horizontal self-attention mechanism branch and a middle branch, and the fusion method of the modules is as follows:
O=Cvβ+Ch(1-β) (1)
in formula (1): o represents an output, CvIndicating a vertical self-attentive mechanism branch, ChIndicating a horizontal self-attention mechanism branch, β is the output result of the middle branch.
Preferably, the Anchor box ratios of the topmost layer of the base network are 1:1, 1:7 and 7:1; those of the middle layer are 1:1, 1:5 and 5:1; and those of the bottommost layer are 1:1, 1:2 and 2:1.
Preferably, the topmost output network and the middle output network of the model share weights, and the bottommost network uses separate weights.
Preferably, the loss function of the training network is as follows:
L = L_cls(p, u) + λ[u ≥ 1]·L_loc(t^u, v) + γ·L_dre(p, w)   (2)
in formula (2): L_cls(p, u) = −log p_u, where u is the class of the target box in the output result and the background class is numbered 0; L_loc is the regression loss of the target box; L_dre(p, w) = −log p_w, where w is the direction of the target box in the output result; and λ, γ are the weights of the corresponding losses.
Preferably, λ is 10 and γ is 1.
Preferably, the training-stage processing steps of the deep learning are as follows:
Step 1, preprocessing the input image;
Step 2, performing data enhancement on the preprocessed image by random cropping, left-right flipping, up-down flipping, rotation at an arbitrary angle, color perturbation, random brightness transformation and random noise addition;
Step 3, scaling the image processed in Step 2 to a fixed size;
Step 4, forming a batch from the scaled images;
Step 5, propagating forward with the model;
Step 6, computing the loss with the loss function, backpropagating, and updating the training parameters;
Step 7, training iteratively until the model converges.
Preferably, the prediction-stage processing steps of the deep learning are as follows:
a. preprocessing the input image;
b. scaling the preprocessed image to a fixed size;
c. propagating forward with the model;
d. dividing the results output in step c into two groups: words and characters;
e. aggregating characters into words according to whether a word and a character overlap;
f. counting the directions of all characters in the same word and determining the direction of the current word by voting;
g. arranging the characters in each word in order along the word's direction;
h. determining from the distances between the characters in a word whether spaces exist between them, and adding spaces if so;
i. outputting the result.
Preferably, the preprocessing method is as follows:
img′ = (img − μ) / σ   (3)
in formula (3), img is the input image, μ is the mean of the image, and σ is the variance of the image.
Preferably, the fixed size is 512 × 512, and the number of images per batch is 16.
The invention has the following beneficial effects:
1. The invention requires only a very small number of training samples. Compared with classical OCR, the network architecture of the invention is easier to train; in addition, training techniques such as transfer learning and the addition of simulated data greatly reduce the algorithm's demand for samples. The fewer than 1400 training samples currently used are far below the million-scale requirement of classical OCR.
2. The invention can correctly process characters in different directions. The algorithm uses a custom LineAttention module and adds direction prediction to the output; unlike mainstream OCR algorithms, which generally assume that characters are upright and read left to right, it can correctly process characters in different directions.
Drawings
FIG. 1 is a schematic view of a picture with irregular text;
FIG. 2 is an example of pathological section label data;
FIG. 3 is a model architecture diagram of the present invention;
FIG. 4 is a schematic diagram of a LineAttention module;
FIG. 5 is an exemplary graph of synthetic data samples;
FIG. 6 is a diagram illustrating the detection results.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings.
The invention discloses an algorithm for pathological section label character recognition (hereinafter, label recognition). The algorithm is based on RetinaNet, but RetinaNet is designed for general object detection and cannot correctly identify characters in different directions. To identify characters in different directions, a direction prediction branch is added to the network output; and to correctly handle direction-sensitive characters such as '6' and '9' appearing in different orientations, a dedicated LineAttention module is designed to process them effectively. A further improvement over RetinaNet is a special Anchor box parameter setting that handles the large aspect ratios common in text detection, and the basic architecture of the model is also adjusted. After the individual characters are detected, a post-processing algorithm combines them into lines for output. The specific details are as follows:
model architecture
The basic structure of the model is shown in FIG. 3. The invention uses RetinaNet [2] based on ResNet-50 [3] as its basic network structure. However, RetinaNet is designed for general object detection and does not achieve optimal results when used directly for label character recognition. The invention therefore improves RetinaNet as follows:
The invention designs a module called 'LineAttention' (the orange boxes in the architecture diagram) to help the model correctly recognize direction-sensitive characters. FIG. 4 shows the specific structure of LineAttention; the fusion method in FIG. 4 is:
O = C_v·β + C_h·(1 − β)   (1)
where O denotes the output, C_v the vertical self-attention branch (the third branch in the block diagram), C_h the horizontal self-attention branch (the first branch in the block diagram), and β the output result of the middle sigmoid branch [4].
LineAttention can automatically detect the direction of the current character and, by analyzing adjacent characters lying along the same direction, increase the recognition accuracy of the current character; the improvement is especially pronounced for direction-sensitive characters such as '6', '9', '-' and '_'.
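To make the fusion rule concrete, here is a minimal PyTorch sketch of formula (1). The patent fixes only the three-branch layout and the fusion equation; the axis-wise convolutions standing in for the vertical and horizontal self-attention operations, the kernel sizes, and the 1×1 sigmoid gate are illustrative assumptions, not the patented design:

    import torch
    import torch.nn as nn

    class LineAttentionFusion(nn.Module):
        # Sketch of Eq. (1): O = C_v * beta + C_h * (1 - beta).
        def __init__(self, channels: int):
            super().__init__()
            # Stand-ins for the vertical/horizontal self-attention branches:
            # k x 1 and 1 x k convolutions that mix features along one axis only.
            self.vertical = nn.Conv2d(channels, channels, (7, 1), padding=(3, 0))
            self.horizontal = nn.Conv2d(channels, channels, (1, 7), padding=(0, 3))
            # Middle branch: a per-pixel blending weight beta in [0, 1].
            self.gate = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            c_v = self.vertical(x)     # vertical branch C_v
            c_h = self.horizontal(x)   # horizontal branch C_h
            beta = self.gate(x)        # middle branch; broadcast over channels
            return c_v * beta + c_h * (1.0 - beta)  # Eq. (1)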
The RetinaNet model outputs only the position and size of the target box and the category of the target; the invention adds the direction of the target to the output. Only with this direction information can label data in different orientations be processed accurately.
The invention optimizes the Anchor box parameters of the different output layers: the Anchor box ratios of the topmost layer are 1:1, 1:7 and 7:1; those of the middle layer are 1:1, 1:5 and 5:1; and those of the bottommost layer are 1:1, 1:2 and 2:1. The topmost and middle layers are dedicated to words with large aspect ratios, while the bottommost layer handles words with small aspect ratios and individual characters.
Another difference from RetinaNet is that the topmost output network and the middle output network share weights while the bottommost network uses separate weights. This design rests on the assumption that the topmost and middle output networks mainly detect words whereas the bottommost output network mainly detects characters; since the tasks differ, different weight-sharing rules are designed. RetinaNet has no such requirement, so all of its output layers share weights.
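As an illustration of these two design choices, the sketch below pins the stated Anchor box ratios to three named levels and builds one prediction head shared by the topmost and middle levels and a separate head for the bottommost level. The subnet depth, channel width, anchor count and class count are assumptions in the spirit of RetinaNet, not values fixed by the patent:

    import torch.nn as nn

    # Aspect ratios (width:height) per output level, as stated above; how the
    # three named levels map onto concrete FPN pyramid levels is left open.
    ANCHOR_RATIOS = {
        "top":    [1.0, 1.0 / 7.0, 7.0],  # words with large aspect ratios
        "middle": [1.0, 1.0 / 5.0, 5.0],  # words
        "bottom": [1.0, 1.0 / 2.0, 2.0],  # short words and single characters
    }

    NUM_ANCHORS = 9    # assumed: 3 ratios x 3 scales, as in standard RetinaNet
    NUM_CLASSES = 63   # assumed size of the character/word class set

    def make_head(channels: int = 256) -> nn.Sequential:
        # RetinaNet-style classification subnet; depth and width are assumptions.
        layers = []
        for _ in range(4):
            layers += [nn.Conv2d(channels, channels, 3, padding=1),
                       nn.ReLU(inplace=True)]
        layers.append(nn.Conv2d(channels, NUM_ANCHORS * NUM_CLASSES, 3, padding=1))
        return nn.Sequential(*layers)

    word_head = make_head()  # shared by the topmost and middle levels
    char_head = make_head()  # separate weights for the bottommost level
    heads = {"top": word_head, "middle": word_head, "bottom": char_head}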
Loss function
The loss function used by the training network is defined as follows:
L = L_cls(p, u) + λ[u ≥ 1]·L_loc(t^u, v) + γ·L_dre(p, w)   (2)
where L_cls(p, u) = −log p_u and u is the class of the target box in the output result (the background class is numbered 0); L_loc is the regression loss of the target box (defined as in Fast R-CNN [5]); L_dre(p, w) = −log p_w, where w is the direction of the target box in the output result; and λ, γ are the weights of the corresponding losses. In the experiments, λ = 10 and γ = 1.
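A minimal sketch of formula (2) follows. Plain cross-entropy stands in for L_cls and L_dre (the focal loss of the underlying RetinaNet would slot in for L_cls), smooth L1 stands in for the Fast R-CNN box regression, and a four-way direction set is assumed:

    import torch
    import torch.nn.functional as F

    def total_loss(cls_logits, box_pred, dre_logits, box_target, u, w,
                   lam: float = 10.0, gamma: float = 1.0) -> torch.Tensor:
        # Eq. (2): L = L_cls(p,u) + lam * [u >= 1] * L_loc(t^u,v) + gamma * L_dre(p,w)
        l_cls = F.cross_entropy(cls_logits, u)    # L_cls(p,u) = -log p_u
        fg = u >= 1                               # indicator [u >= 1]: foreground only
        if fg.any():
            l_loc = F.smooth_l1_loss(box_pred[fg], box_target[fg])
        else:
            l_loc = box_pred.sum() * 0.0          # differentiable zero
        l_dre = F.cross_entropy(dre_logits, w)    # L_dre(p,w) = -log p_w
        return l_cls + lam * l_loc + gamma * l_dre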
Detailed processing steps
The invention is an algorithm based on deep learning and comprises a training (learning) stage and a prediction (inference) stage; the corresponding processing steps are described in turn below:
Step 1, preprocessing the input image with the following method:
img′ = (img − μ) / σ   (3)
in formula (3), μ is the mean of the image, σ is the variance of the image, and img is the input image;
Step 2, performing data enhancement on the preprocessed image by random cropping, left-right flipping, up-down flipping, rotation at an arbitrary angle, color perturbation, random brightness transformation and random noise addition;
Step 3, scaling the image processed in Step 2 to a fixed size (512 × 512);
Step 4, forming a batch from several (16) scaled images;
Step 5, propagating forward with the model;
Step 6, computing the loss with the loss function, backpropagating, and updating the training parameters;
Step 7, training iteratively until the model converges. A minimal sketch of this training loop follows.
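The sketch below strings steps 1-7 together. The normalization implements formula (3); the data loader is assumed to deliver the augmented 512 × 512 batches of steps 2-4 together with box/class/direction targets, and loss_fn is assumed to implement formula (2) (see the earlier sketch). A fixed epoch count stands in for 'until convergence':

    import torch

    def preprocess(img: torch.Tensor) -> torch.Tensor:
        # Step 1 / Eq. (3): normalize by the image's own statistics.
        return (img - img.mean()) / (img.std() + 1e-8)

    def train_until_converged(model, loader, optimizer, loss_fn, epochs: int = 100):
        model.train()
        for _ in range(epochs):                      # step 7: iterate
            for images, targets in loader:           # steps 2-4 done by the loader
                images = torch.stack([preprocess(im) for im in images])
                outputs = model(images)              # step 5: forward propagation
                loss = loss_fn(outputs, targets)     # step 6: Eq. (2)
                optimizer.zero_grad()
                loss.backward()                      # step 6: backpropagation
                optimizer.step()                     # step 6: parameter update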
The prediction-stage processing steps of the deep learning are as follows (a sketch of the post-processing in steps d-i follows the list):
a. preprocessing the input image with the following method:
img′ = (img − μ) / σ   (3)
in formula (3), μ is the mean of the image, σ is the variance of the image, and img is the input image;
b. scaling the preprocessed image to a fixed size (512 × 512);
c. propagating forward with the model;
d. dividing the results output in step c into two groups: words and characters;
e. aggregating characters into words according to whether a word and a character overlap;
f. counting the directions of all characters in the same word and determining the direction of the current word by voting;
g. arranging the characters in each word in order along the word's direction;
h. determining from the distances between the characters in a word whether spaces exist between them, and adding spaces if so;
i. outputting the result.
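The following is a sketch of the post-processing in steps d-i. Boxes are (x1, y1, x2, y2) tuples; each character detection carries its glyph and one of four assumed directions. The 0.5 gap threshold used for inserting spaces is an assumption, not a value given in the text:

    from collections import Counter

    def overlaps(a, b) -> bool:
        # Axis-aligned intersection test between two (x1, y1, x2, y2) boxes.
        return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])

    def decode_words(word_boxes, char_dets, space_factor: float = 0.5):
        # word_boxes: boxes from the "word" group (step d).
        # char_dets:  (box, glyph, direction) tuples from the "character" group,
        #             with direction in {"up", "right", "down", "left"}.
        texts = []
        for wbox in word_boxes:
            chars = [c for c in char_dets if overlaps(wbox, c[0])]     # step e
            if not chars:
                continue
            # Step f: the word direction is the majority character direction.
            direction = Counter(c[2] for c in chars).most_common(1)[0][0]
            # Step g: sort along the reading axis implied by that direction.
            axis, reverse = {"up": (0, False), "down": (0, True),
                             "right": (1, False), "left": (1, True)}[direction]

            def centre(c):
                return (c[0][axis] + c[0][axis + 2]) / 2.0

            chars.sort(key=centre, reverse=reverse)
            # Step h: insert a space when the centre-to-centre gap between
            # neighbours clearly exceeds the mean character size.
            sizes = [c[0][axis + 2] - c[0][axis] for c in chars]
            mean_s = sum(sizes) / len(sizes)
            text = chars[0][1]
            for prev, cur in zip(chars, chars[1:]):
                gap = abs(centre(cur) - centre(prev)) - mean_s
                text += (" " if gap > space_factor * mean_s else "") + cur[1]
            texts.append(text)                                         # step i
        return texts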
Results of the experiment
In the experiments we used more than 1900 pathological section samples from more than ten hospitals, 1400 as training data and 500 as test data. For deep learning, 1400 samples is very few, and we alleviated the data-shortage problem as follows:
1. the model is pre-trained on COCO [6] and then transferred to the label character recognition problem;
2. as shown in FIG. 5, we automatically generated about 50000 samples with a program, but during training each automatically generated sample was given 1/30 the weight of a real sample (one way to apply this weighting is sketched after this list);
3. data enhancement methods such as random up-down flipping, random left-right flipping, random rotation, random color perturbation and random brightness perturbation are used.
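One way to realize the 1/30 weighting of the synthetic samples is through the sampling probability, as in the hypothetical sketch below; load_real and load_synthetic are placeholder loaders, not functions from the patent, and down-weighting the per-sample loss instead would serve the same purpose:

    from torch.utils.data import ConcatDataset, DataLoader, WeightedRandomSampler

    real_ds = load_real()        # ~1400 annotated label images (hypothetical loader)
    synth_ds = load_synthetic()  # ~50000 program-generated samples (hypothetical loader)

    # Each synthetic sample is drawn with 1/30 the probability of a real sample.
    weights = [1.0] * len(real_ds) + [1.0 / 30.0] * len(synth_ds)
    dataset = ConcatDataset([real_ds, synth_ds])
    sampler = WeightedRandomSampler(weights, num_samples=len(dataset), replacement=True)
    loader = DataLoader(dataset, batch_size=16, sampler=sampler)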
The final performance of our model is shown in Table 1.
TABLE 1. Model test results

Number of test samples | Accuracy | Recall | Direction accuracy | mAP@0.5
500 | 96.5% | 95.7% | 95.9% | 93.1%
With our post-processing algorithm, the label samples can also be classified, the classes being Her-2, Ki-67, ER, PR and the like. Automatic classification of the labels provides a necessary precondition for the subsequent automatic processing of digital pathological sections. The classification results of the model are shown in Table 2:
TABLE 2. Model classification results

Number of test samples | Accuracy | Recall
925 | 100.0% | 97.5%
FIG. 6 shows an example of the detection results. The colors of the target boxes in FIG. 6 represent different directions, e.g. yellow for right, blue for up and green for left, and the text in a label may face any direction. Simple character-level detection with a general object detector such as RetinaNet cannot correctly distinguish direction-sensitive characters such as '6', '9', '-' and '_'; with the help of the LineAttention module, we can distinguish such characters correctly.
The present invention is capable of other embodiments, and various changes and modifications may be made by one skilled in the art without departing from the spirit and scope of the invention.
The prior art documents to which the present invention relates are as follows:
[1] Yuliang L, Lianwen J, Shuaitao Z, et al. Detecting Curve Text in the Wild: New Dataset and New Solution [J]. 2017.
[2] Lin T Y, Goyal P, Girshick R, et al. Focal Loss for Dense Object Detection [J]. IEEE Transactions on Pattern Analysis & Machine Intelligence, 2017, PP(99): 2999-3007.
[3] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Deep Residual Learning for Image Recognition. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770-778.
[4] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Neural Information Processing Systems (NIPS), 2017.
[5] R. Girshick, "Fast R-CNN," in IEEE International Conference on Computer Vision (ICCV), 2015.
[6] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740-755. Springer, 2014.

Claims (9)

1. A pathological section label identification method, characterized in that: a pathological section label image is identified with a deep learning method; the base network of the model adopted for the deep learning is a RetinaNet network based on ResNet-50, together with a module for helping the base network identify direction-sensitive characters; the module comprises a vertical self-attention branch, a horizontal self-attention branch and a middle branch, and the branches are fused as:
O = C_v·β + C_h·(1 − β)   (1)
in formula (1): O denotes the output, C_v the vertical self-attention branch, C_h the horizontal self-attention branch, and β the output result of the middle branch.
2. The pathological section label identification method according to claim 1, wherein: the Anchor box ratios of the topmost layer of the model are 1:1, 1:7 and 7:1; those of the middle layer are 1:1, 1:5 and 5:1; and those of the bottommost layer are 1:1, 1:2 and 2:1.
3. The pathological section label identification method according to claim 1, wherein: the topmost output network and the middle output network of the base network share weights, and the bottommost network uses separate weights.
4. The pathological section label identification method according to any one of claims 1 to 3, wherein the loss function of the training network is as follows:
L = L_cls(p, u) + λ[u ≥ 1]·L_loc(t^u, v) + γ·L_dre(p, w)   (2)
in formula (2): L_cls(p, u) = −log p_u, where u is the class of the target box in the output result and the background class is numbered 0; L_loc is the regression loss of the target box; L_dre(p, w) = −log p_w, where w is the direction of the target box in the output result; and λ, γ are the weights of the corresponding losses.
5. The pathological section label identification method according to claim 4, wherein: λ is 10 and γ is 1.
6. The pathological section label identification method according to claim 4, wherein the training-stage processing steps of the deep learning are as follows:
step 1, preprocessing the input image;
step 2, performing data enhancement on the preprocessed image by random cropping, left-right flipping, up-down flipping, rotation at an arbitrary angle, color perturbation, random brightness transformation and random noise addition;
step 3, scaling the image processed in step 2 to a fixed size;
step 4, forming a batch from the scaled images;
step 5, propagating forward with the model;
step 6, computing the loss with the loss function, backpropagating, and updating the training parameters;
step 7, training iteratively until the model converges.
7. The pathological section label identification method according to claim 6, wherein the prediction-stage processing steps of the deep learning are as follows:
a. preprocessing the input image;
b. scaling the preprocessed image to a fixed size;
c. propagating forward with the model;
d. dividing the results output in step c into two groups: words and characters;
e. aggregating characters into words according to whether a word and a character overlap;
f. counting the directions of all characters in the same word and determining the direction of the current word by voting;
g. arranging the characters in each word in order along the word's direction;
h. determining from the distances between the characters in a word whether spaces exist between them, and adding spaces if so;
i. outputting the result.
8. The pathological section label identification method according to claim 6 or 7, wherein the preprocessing method is as follows:
img′ = (img − μ) / σ   (3)
in formula (3), μ is the mean of the image and σ is the variance of the image.
9. The pathological section label identification method according to claim 6 or 7, wherein the fixed size is 512 × 512 and the number of images per batch is 16.
CN202010199537.0A 2020-03-19 2020-03-19 Pathological section label identification method Active CN111553361B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010199537.0A CN111553361B (en) 2020-03-19 2020-03-19 Pathological section label identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010199537.0A CN111553361B (en) 2020-03-19 2020-03-19 Pathological section label identification method

Publications (2)

Publication Number Publication Date
CN111553361A (en) 2020-08-18
CN111553361B (en) 2022-11-01

Family

ID=72001858

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010199537.0A Active CN111553361B (en) 2020-03-19 2020-03-19 Pathological section label identification method

Country Status (1)

Country Link
CN (1) CN111553361B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112634279A (en) * 2020-12-02 2021-04-09 四川大学华西医院 Medical image semantic segmentation method based on attention Unet model
CN114648680A (en) * 2022-05-17 2022-06-21 腾讯科技(深圳)有限公司 Training method, device, equipment, medium and program product of image recognition model

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180246873A1 (en) * 2017-02-28 2018-08-30 Cisco Technology, Inc. Deep Learning Bias Detection in Text
CN109447078A (en) * 2018-10-23 2019-03-08 四川大学 A kind of detection recognition method of natural scene image sensitivity text
CN109697414A (en) * 2018-12-13 2019-04-30 北京金山数字娱乐科技有限公司 A kind of text positioning method and device
CN109753954A (en) * 2018-11-14 2019-05-14 安徽艾睿思智能科技有限公司 The real-time positioning identifying method of text based on deep learning attention mechanism
CN109977861A (en) * 2019-03-25 2019-07-05 中国科学技术大学 Offline handwritten form method for identifying mathematical formula
CN110245657A (en) * 2019-05-17 2019-09-17 清华大学 Pathological image similarity detection method and detection device
WO2019192397A1 (en) * 2018-04-04 2019-10-10 华中科技大学 End-to-end recognition method for scene text in any shape
CN110781305A (en) * 2019-10-30 2020-02-11 北京小米智能科技有限公司 Text classification method and device based on classification model and model training method
CN110837835A (en) * 2019-10-29 2020-02-25 华中科技大学 End-to-end scene text identification method based on boundary point detection

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180246873A1 (en) * 2017-02-28 2018-08-30 Cisco Technology, Inc. Deep Learning Bias Detection in Text
WO2019192397A1 (en) * 2018-04-04 2019-10-10 华中科技大学 End-to-end recognition method for scene text in any shape
CN109447078A (en) * 2018-10-23 2019-03-08 四川大学 A kind of detection recognition method of natural scene image sensitivity text
CN109753954A (en) * 2018-11-14 2019-05-14 安徽艾睿思智能科技有限公司 The real-time positioning identifying method of text based on deep learning attention mechanism
CN110569832A (en) * 2018-11-14 2019-12-13 安徽艾睿思智能科技有限公司 text real-time positioning and identifying method based on deep learning attention mechanism
CN109697414A (en) * 2018-12-13 2019-04-30 北京金山数字娱乐科技有限公司 A kind of text positioning method and device
CN109977861A (en) * 2019-03-25 2019-07-05 中国科学技术大学 Offline handwritten form method for identifying mathematical formula
CN110245657A (en) * 2019-05-17 2019-09-17 清华大学 Pathological image similarity detection method and detection device
CN110837835A (en) * 2019-10-29 2020-02-25 华中科技大学 End-to-end scene text identification method based on boundary point detection
CN110781305A (en) * 2019-10-30 2020-02-11 北京小米智能科技有限公司 Text classification method and device based on classification model and model training method

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
HONGTAO XIE et al.: "Convolutional attention networks for scene text recognition", ACM Transactions on Multimedia Computing, Communications, and Applications *
XU QINGQUAN: "Research on Chinese Recognition Algorithms Based on the Attention Mechanism", China Masters' Theses Full-text Database (Information Science and Technology) *
NIU ZUODONG et al.: "Research on Natural Scene Text Detection Algorithms Incorporating the Attention Mechanism", Computer Applications and Software *
ZHENG ZHONGXI: "Embracing the Era of Digital Pathology", Practical Journal of Clinical Medicine *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112634279A (en) * 2020-12-02 2021-04-09 四川大学华西医院 Medical image semantic segmentation method based on attention Unet model
CN114648680A (en) * 2022-05-17 2022-06-21 腾讯科技(深圳)有限公司 Training method, device, equipment, medium and program product of image recognition model

Also Published As

Publication number Publication date
CN111553361B (en) 2022-11-01

Similar Documents

Publication Publication Date Title
US7570816B2 (en) Systems and methods for detecting text
US8494273B2 (en) Adaptive optical character recognition on a document with distorted characters
Meier et al. Fully convolutional neural networks for newspaper article segmentation
Jain et al. Unconstrained scene text and video text recognition for arabic script
CN102385592B (en) Image concept detection method and device
CN113343989B (en) Target detection method and system based on self-adaption of foreground selection domain
CN110008899B (en) Method for extracting and classifying candidate targets of visible light remote sensing image
CN111553361B (en) Pathological section label identification method
CN109213886B (en) Image retrieval method and system based on image segmentation and fuzzy pattern recognition
CN114663904A (en) PDF document layout detection method, device, equipment and medium
CN113361432A (en) Video character end-to-end detection and identification method based on deep learning
Feng et al. Robust shared feature learning for script and handwritten/machine-printed identification
Nguyen TableSegNet: a fully convolutional network for table detection and segmentation in document images
Li et al. Image pattern recognition in identification of financial bills risk management
Xue Optical character recognition
US20230154217A1 (en) Method for Recognizing Text, Apparatus and Terminal Device
Zhang et al. Text extraction for historical Tibetan document images based on connected component analysis and corner point detection
CN113205049A (en) Document identification method and identification system
Rabby et al. A novel deep learning character-level solution to detect language and printing style from a bilingual scanned document
Rani et al. Object Detection in Natural Scene Images Using Thresholding Techniques
Ding et al. Improving GAN-based feature extraction for hyperspectral images classification
Bumbu On classification of 17th century fonts using neural networks
Hotwani et al. Hybrid models for offline handwritten character recognition system without using any prior database images
US11972626B2 (en) Extracting multiple documents from single image
Zhou et al. Region selection model with saliency constraint for fine-grained recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant