CN111553361B - Pathological section label identification method


Info

Publication number
CN111553361B
CN111553361B (application CN202010199537.0A)
Authority
CN
China
Prior art keywords
characters
pathological section
identification method
network
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010199537.0A
Other languages
Chinese (zh)
Other versions
CN111553361A (en)
Inventor
王杰
郑众喜
向旭辉
陈杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
West China Hospital of Sichuan University
Original Assignee
West China Hospital of Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by West China Hospital of Sichuan University filed Critical West China Hospital of Sichuan University
Priority to CN202010199537.0A priority Critical patent/CN111553361B/en
Publication of CN111553361A publication Critical patent/CN111553361A/en
Application granted granted Critical
Publication of CN111553361B publication Critical patent/CN111553361B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06V 30/153 — Segmentation of character regions using recognition of characters or words
    • G06N 3/045 — Combinations of networks
    • G06N 3/084 — Backpropagation, e.g. using gradient descent
    • G06V 10/20 — Image preprocessing
    • G06V 10/32 — Normalisation of the pattern dimensions
    • G06V 30/10 — Character recognition

Abstract

The invention discloses a pathological section label identification method that identifies pathological section label images with a deep learning method. The base network of the model used for the deep learning is a RetinaNet network based on ResNet-50, together with a module for helping the base network identify direction-sensitive characters. The module comprises a vertical self-attention branch, a horizontal self-attention branch and a middle branch, and the branches are fused as follows: O = Cv·β + Ch·(1 − β) (1), where in formula (1) O represents the output, Cv represents the vertical self-attention branch, Ch represents the horizontal self-attention branch, and β is the output of the middle branch. The invention can correctly process characters in different directions.

Description

Pathological section label identification method
Technical Field
The invention relates to the field of medical detection, in particular to a pathological section label identification method.
Background
One of the current methods for pathological section label recognition is Optical Character Recognition (OCR). Mainstream OCR algorithms all comprise the following two steps:
1. detecting characters in a scene;
2. recognizing the detected text.
The output of the first step is usually the position of a word or a line of text, and current techniques are mostly based on general-purpose object detection algorithms. The second step crops the corresponding text out of the image according to the detection result of the first step, scales it to a fixed height, and recognizes it with a CTC- or attention-based method; these methods usually assume that the text is upright and reads from left to right. Most current research focuses on the first step, chiefly on how to recognize irregular text.
Directly applying mainstream OCR algorithms to pathological section label recognition suffers from the following problems:
1. current mainstream OCR technology needs a large amount of training data: the first step usually requires 10k-50k annotated samples and the second step needs more than 1000k training samples; collecting pathological section data of that order is almost impossible, and the number of annotated samples used in this patent is less than 2000, far smaller than the data volume used by mainstream OCR technology;
2. mainstream OCR technology mostly focuses on how to detect irregular characters, as shown in fig. 1, whereas the labels of pathological sections are scanned by a digital section scanner, as shown in fig. 2, and exhibit almost no deformation;
3. the characters in a pathological section label may point in any direction (different directions may coexist in the same label), while mainstream OCR technology pays little attention to this aspect and most OCR methods simply assume that characters are upright and arranged from left to right;
4. mainstream OCR mostly targets natural language, where the recognized unit is a word and semantic correlation exists between words; the characters in a pathological label are highly random and the correlation between characters is small;
5. the few techniques that can directly process characters in any direction are limited to specific usage scenarios, for example the characters are generated at a fixed position according to a rule, an auxiliary locator is required, or a fixed font is used.
As described above, since current mainstream OCR technology and label recognition differ greatly in data volume and focus, directly applying OCR technology to label recognition cannot achieve a good result.
Disclosure of Invention
The invention aims to provide a pathological section label identification method which can correctly process characters in different directions.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention discloses a pathological section label identification method that identifies pathological section label images with a deep learning method; the base network of the model adopted by the deep learning is a RetinaNet network based on ResNet-50, together with a module for helping the base network identify direction-sensitive characters; the module comprises a vertical self-attention branch, a horizontal self-attention branch and a middle branch, and the fusion method of the module is as follows:
O = Cv·β + Ch·(1 − β) (1)
in formula (1): O represents the output, Cv represents the vertical self-attention branch, Ch represents the horizontal self-attention branch, and β is the output of the middle branch.
Preferably, the topmost Anchor box ratios of the base network are 1, 1; the bottommost Anchor box ratios are 1, 2 and 2.
Preferably, the topmost output network and the middle output network of the model share weights, and the bottommost network uses separate weights.
Preferably, the loss function of the training network is as follows:
L = Lcls(p, u) + λ[u ≥ 1]Lloc(t^u, v) + γLdre(p, w) (2)
in formula (2): Lcls(p, u) = −log p_u, where u is the class of the target box in the output result and the background class number is 0; Lloc is the regression loss of the target box; Ldre(p, w) = −log p_w, where w is the direction of the target box in the output result; λ and γ are the weights of the corresponding losses.
Preferably, λ is 10 and γ is 1.
Preferably, the deep learning training phase processing steps are as follows:
step 1, preprocessing an input image;
step 2, performing data enhancement on the preprocessed image by random cropping, left-right flipping, up-down flipping, rotation at an arbitrary angle, color perturbation, random brightness transformation and random noise addition;
step 3, scaling the image processed in step 2 to a fixed size;
step 4, forming a batch from several scaled images;
step 5, performing forward propagation with the model;
step 6, calculating the loss with the loss function, back-propagating, and updating the training parameters;
step 7, performing iterative training until the model converges.
Preferably, the prediction stage processing steps of the deep learning are as follows:
a. preprocessing an input image;
b. scaling the preprocessed image to a fixed size;
c. performing forward propagation with the model;
d. dividing the results output in step c into two groups: words and characters;
e. aggregating the characters into words according to whether the words and the characters overlap;
f. counting the directions of all characters in the same word and determining the direction of the current word by voting;
g. arranging the characters within a word in order along the word's direction;
h. determining whether spaces exist between characters according to the distances between the characters in the word, and adding spaces where needed;
i. outputting the result.
Preferably, the preprocessing method is as follows:
img = (img − μ) / σ (3)
in formula (3), μ is the mean of the image and σ is the variance of the image.
Preferably, the fixed size is 512 × 512 and the number of images per batch is 16.
The invention has the following beneficial effects:
1. The present invention requires only a very small number of training samples. Compared with classical OCR, the network architecture of the invention is easier to train; in addition, training methods such as transfer learning and the addition of simulated data greatly reduce the algorithm's demand for samples. The fewer than 1400 training samples currently used are far below the million-level requirement of classical OCR.
2. The invention can correctly process characters in different directions. The algorithm uses the custom LineAttention module and adds direction prediction to the output; compared with mainstream OCR algorithms (which generally assume that characters are upright and arranged from left to right), the algorithm of the invention can correctly process characters in different directions.
Drawings
FIG. 1 is a schematic view of a picture with irregular text;
FIG. 2 is an example of pathological section label data;
FIG. 3 is a diagram of the model architecture of the present invention;
FIG. 4 is a schematic diagram of a LineAttention module;
FIG. 5 is an exemplary graph of synthetic data samples;
FIG. 6 is a diagram illustrating the detection results.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings.
The invention discloses an algorithm for pathological section label character recognition (hereinafter referred to as label recognition). The algorithm is based on RetinaNet, but RetinaNet is designed for general object detection and cannot correctly recognize characters in different directions. To recognize characters in different directions, a direction prediction branch is added to the network output; and, to correctly handle characters such as '6' and '9' that are direction-sensitive, a dedicated LineAttention module is designed to process direction-sensitive characters effectively. Another improvement to RetinaNet lies in special Anchor box parameter settings, which handle the large aspect ratios common in character detection; the basic architecture of the model is also adjusted. After the individual characters are detected, a corresponding post-processing algorithm combines the characters into lines and outputs them. The details are as follows:
model architecture
The basic architecture of the model is shown in FIG. 3. The invention uses RetinaNet [2] based on ResNet-50 [3] as the basic network architecture. However, RetinaNet is designed for general-purpose object detection and does not achieve the best result when used directly for label character recognition. Therefore, the invention makes the following improvements to RetinaNet:
the invention designs a module called 'LineAttention' (orange boxes in an architecture diagram) to help the model correctly recognize the direction-sensitive characters. FIG. 4 shows a specific structure of LineAttention, and the fusion (fusion) method in FIG. 4 is:
O = Cv·β + Ch·(1 − β) (1)
where O represents the output, Cv represents the vertical self-attention branch (the third branch in the structure diagram), Ch represents the horizontal self-attention branch (the first branch in the structure diagram), and β is the output of the intermediate sigmoid branch. For the detailed implementation of the self-attention mechanism, see [4].
LineAttention can automatically detect the direction of the current character and, by associating and analyzing adjacent characters with the same direction, increase its recognition accuracy; the improvement is especially notable for direction-sensitive characters such as '6', '9', '-' and '_'.
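For illustration only, a minimal sketch of the fusion in formula (1) is given below, assuming PyTorch and assuming that the horizontal and vertical branches are realized as row-wise and column-wise self-attention; the module name, tensor layout and branch internals are assumptions, not the patented implementation.

```python
# Hypothetical sketch of the LineAttention fusion O = Cv*beta + Ch*(1 - beta).
# Only the fusion rule comes from formula (1); the branch internals are assumptions.
import torch
import torch.nn as nn

class LineAttentionSketch(nn.Module):
    def __init__(self, channels, heads=4):          # channels must be divisible by heads
        super().__init__()
        self.row_attn = nn.MultiheadAttention(channels, heads, batch_first=True)  # horizontal branch Ch
        self.col_attn = nn.MultiheadAttention(channels, heads, batch_first=True)  # vertical branch Cv
        self.gate = nn.Sequential(nn.Conv2d(channels, 1, kernel_size=1), nn.Sigmoid())  # middle branch -> beta

    def forward(self, x):                            # x: (B, C, H, W) feature map
        b, c, h, w = x.shape
        rows = x.permute(0, 2, 3, 1).reshape(b * h, w, c)      # attend along each row
        ch, _ = self.row_attn(rows, rows, rows)
        ch = ch.reshape(b, h, w, c).permute(0, 3, 1, 2)

        cols = x.permute(0, 3, 2, 1).reshape(b * w, h, c)      # attend along each column
        cv, _ = self.col_attn(cols, cols, cols)
        cv = cv.reshape(b, w, h, c).permute(0, 3, 2, 1)

        beta = self.gate(x)                          # (B, 1, H, W), broadcast over channels
        return cv * beta + ch * (1 - beta)           # formula (1)
```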
The RetinaNet model outputs only the position and size of the target box and the category of the target; the invention additionally outputs the direction of the target. Only with this direction information can label data in different directions be processed accurately.
The method optimizes the Anchor box parameters of the different output layers: the topmost Anchor box ratios are 1, 1; the middle-layer Anchor box ratios are 1, 1; the bottommost Anchor box ratios are 1, 2 and 2. The topmost and middle layers are devoted to handling words with large aspect ratios, and the bottommost layer is devoted to handling words with small aspect ratios and characters;
another difference from RetinaNet is that the topmost output network and the middle output network share the weight, and the bottommost network uses a single weight, so that the design is based on the assumption that the topmost and middle output networks are mainly used for detecting words, and the bottommost output network is mainly used for detecting characters, and the tasks are different, so that different weight sharing rules are designed, and RetinaNet does not have the requirement, so that all output layers of RetinaNet share the weight.
Loss function
The loss function used by the training network is defined as follows:
L = Lcls(p, u) + λ[u ≥ 1]Lloc(t^u, v) + γLdre(p, w) (2)
where Lcls(p, u) = −log p_u, u is the class of the target box in the output result (the background class number is 0); Lloc is the regression loss of the target box (defined as in Fast R-CNN [5]); Ldre(p, w) = −log p_w, where w is the direction of the target box in the output result; λ and γ are the weights of the corresponding losses, and in the experiments λ is 10 and γ is 1.
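A hedged sketch of formula (2) follows, assuming per-box class and direction logits with cross-entropy losses and a smooth-L1 regression loss; the tensor layout and helper names are illustrative assumptions rather than the exact patented implementation.

```python
# Illustrative sketch of formula (2): L = Lcls + lambda*[u>=1]*Lloc + gamma*Ldre.
# Assumes one matched ground-truth box per row; layout and names are not from the patent.
import torch
import torch.nn.functional as F

def detection_loss(cls_logits, box_pred, dir_logits, u, v, w, lam=10.0, gamma=1.0):
    """cls_logits: (N, num_classes) with background class 0
       box_pred:   (N, 4) predicted offsets t^u
       dir_logits: (N, num_directions)
       u: (N,) ground-truth class, v: (N, 4) target offsets, w: (N,) ground-truth direction."""
    l_cls = F.cross_entropy(cls_logits, u)                      # Lcls(p, u) = -log p_u
    fg = (u >= 1).float()                                       # Iverson bracket [u >= 1]
    l_loc = (F.smooth_l1_loss(box_pred, v, reduction="none").sum(dim=1) * fg).sum() / fg.sum().clamp(min=1)
    l_dre = F.cross_entropy(dir_logits, w)                      # Ldre(p, w) = -log p_w
    return l_cls + lam * l_loc + gamma * l_dre
```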
Detailed processing steps
The invention is an algorithm based on deep learning and is divided into a training (learning) stage and a prediction (use) stage; the corresponding processing steps are described below.
step 1, preprocessing the input image, wherein the preprocessing method is:
img = (img − μ) / σ (3)
in formula (3), μ is the mean of the image, σ is the variance of the image, and img is the image;
step 2, performing data enhancement on the preprocessed image by random cropping, left-right flipping, up-down flipping, rotation at an arbitrary angle, color perturbation, random brightness transformation and random noise addition;
step 3, scaling the image processed in step 2 to a fixed size (512 × 512);
step 4, forming a batch from several (16) scaled images;
step 5, performing forward propagation with the model;
step 6, calculating the loss with the loss function, back-propagating, and updating the training parameters;
step 7, performing iterative training until the model converges.
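Steps 1-7 could look roughly like the following loop; the dataset, model, augmentations and optimizer are placeholders, and only the normalization of formula (3), the 512 × 512 resize and the batch size of 16 come from the text above.

```python
# Rough sketch of training steps 1-7. The loader is assumed to yield lists of
# already-augmented, variable-size image tensors plus targets (custom collate assumed).
import torch
import torch.nn.functional as F

def preprocess(img):                                  # step 1, formula (3)
    return (img - img.mean()) / (img.std() + 1e-6)

def train(model, loader, loss_fn, epochs=100, lr=1e-4, device="cuda"):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    model.to(device).train()
    for _ in range(epochs):                                        # step 7: iterate to convergence
        for images, targets in loader:                             # step 2 done inside the loader
            images = torch.stack([                                 # step 3: resize to 512 x 512
                F.interpolate(preprocess(im).unsqueeze(0), size=(512, 512)).squeeze(0)
                for im in images]).to(device)                      # step 4: batch (e.g. 16 images)
            preds = model(images)                                  # step 5: forward pass
            loss = loss_fn(preds, targets)                         # step 6: loss, backprop, update
            opt.zero_grad()
            loss.backward()
            opt.step()
```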
The prediction-stage processing steps of the deep learning are as follows:
a. preprocessing the input image, wherein the preprocessing method is:
img = (img − μ) / σ (3)
in formula (3), μ is the mean of the image, σ is the variance of the image, and img is the image;
b. scaling the preprocessed image to a fixed size (512 × 512);
c. forward propagation using the model;
d. dividing the results output in step c into two groups: words and characters;
e. aggregating the characters into words according to whether the words and the characters overlap;
f. counting the direction of each character in the same word and determining the direction of the current word by voting;
g. arranging the characters within a word in order along the word's direction;
h. determining whether spaces exist between characters according to the distances between the characters in the word, and adding spaces where needed;
i. outputting the result.
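Steps d-i could be implemented roughly as sketched below; the box format, the overlap test, the direction encoding and the space heuristic are assumptions made for illustration.

```python
# Sketch of prediction post-processing (steps d-i): split detections into words and characters,
# attach characters to overlapping words, vote on direction, order characters, insert spaces.
# Box format (x1, y1, x2, y2), direction codes and the space heuristic are assumptions.
from collections import Counter

def overlaps(a, b):
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def decode(detections):
    words = [d for d in detections if d["is_word"]]               # step d: split words / characters
    chars = [d for d in detections if not d["is_word"]]
    lines = []
    for word in words:
        members = [c for c in chars if overlaps(c["box"], word["box"])]  # step e: group by overlap
        if not members:
            continue
        direction = Counter(c["dir"] for c in members).most_common(1)[0][0]  # step f: vote
        axis = 0 if direction in ("left", "right") else 1
        members.sort(key=lambda c: c["box"][axis])                # geometric order along the axis
        pieces, prev = [], None
        for c in members:                                         # step h: add spaces on large gaps
            if prev is not None:
                gap = c["box"][axis] - prev["box"][axis + 2]
                if gap > 0.5 * (prev["box"][axis + 2] - prev["box"][axis]):
                    pieces.append(" ")
            pieces.append(c["label"])
            prev = c
        if direction in ("left", "up"):                           # step g: reading order follows direction
            pieces.reverse()
        lines.append("".join(pieces))                             # step i: output
    return lines
```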
Results of the experiment
In the experiment we used more than 1900 pathological section label samples from more than ten hospitals, 1400 as training data and 500 as test data. For deep learning, 1400 samples are very few, and we use the following methods to alleviate the data-shortage problem:
1. the model is pre-trained on COCO [6] and then transferred to the label character recognition problem;
2. as shown in FIG. 5, we automatically generated about 50000 samples with a program, but during training the weight of an automatically generated sample is 1/30 that of a real sample;
3. data enhancement methods such as random up-down flipping, random left-right flipping, random rotation, random color perturbation and random brightness perturbation are used.
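As an illustration of point 2, weighting a synthetic sample at 1/30 of a real sample could, for example, be done through a per-sample loss weight as sketched below; whether the weighting is applied to the loss or to the sampling probability is an assumption here, not stated by the text.

```python
# Illustrative way to give program-generated samples 1/30 the training weight of real ones.
# This sketch assumes per-sample loss weighting; the actual mechanism is not specified above.
import torch

def weighted_batch_loss(per_sample_losses, is_synthetic, synthetic_weight=1.0 / 30.0):
    """per_sample_losses: (B,) tensor; is_synthetic: (B,) bool tensor."""
    weights = torch.where(is_synthetic,
                          torch.full_like(per_sample_losses, synthetic_weight),
                          torch.ones_like(per_sample_losses))
    return (per_sample_losses * weights).sum() / weights.sum()
```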
The final performance of our model is shown in Table 1.
TABLE 1 Model metrics and test results

Number of test samples | Accuracy | Recall | Direction accuracy | mAP@0.5
500 | 96.5% | 95.7% | 95.9% | 93.1%
With our post-processing algorithm, the label samples can also simply be classified into categories such as Her-2, Ki-67, ER and PR. Automatic classification of the labels provides the necessary prerequisite for the subsequent automatic processing of digital pathological sections. The classification results of the model are shown in Table 2:
TABLE 2 Model classification results

Number of test samples | Accuracy | Recall
925 | 100.0% | 97.5%
FIG. 6 shows an example of the detection results. The colors of the target boxes in FIG. 6 represent different directions, e.g., yellow for right, blue for up and green for left, and the text in a label may be in any direction. Simple character-level detection with a general object detector such as RetinaNet cannot correctly distinguish direction-sensitive characters such as "6", "9", "-" and "_"; with the help of the LineAttention module, we can correctly distinguish these direction-sensitive characters.
The present invention is capable of other embodiments, and various changes and modifications may be made by one skilled in the art without departing from the spirit and scope of the invention.
The prior art documents to which the present invention relates are as follows:
[1]. Yuliang L, Lianwen J, Shuaitao Z, et al. Detecting Curve Text in the Wild: New Dataset and New Solution [J]. 2017.
[2]. Lin T Y, Goyal P, Girshick R, et al. Focal Loss for Dense Object Detection [J]. IEEE Transactions on Pattern Analysis & Machine Intelligence, 2017, PP(99): 2999-3007.
[3]. Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Deep Residual Learning for Image Recognition. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770-778.
[4]. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Neural Information Processing Systems (NIPS), 2017.
[5]. R. Girshick, "Fast R-CNN," in IEEE International Conference on Computer Vision (ICCV), 2015.
[6]. T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740-755. Springer, 2014.

Claims (7)

1. A pathological section label identification method, characterized in that: the pathological section label image is identified by a deep learning method; the base network of the model adopted by the deep learning is a RetinaNet network based on ResNet-50, together with a module for helping the base network identify direction-sensitive characters; the topmost output network and the middle output network of the base network share weights, and the bottommost network uses separate weights; the module comprises a vertical self-attention branch, a horizontal self-attention branch and a middle branch, and the fusion method of the module is as follows:
O = Cv·β + Ch·(1 − β) (1)
in formula (1): O represents the output, Cv represents the vertical self-attention branch, Ch represents the horizontal self-attention branch, and β is the output of the middle branch;
the prediction stage processing steps of the deep learning are as follows:
a. preprocessing an input image;
b. scaling the preprocessed image to a fixed size;
c. performing forward propagation with the model;
d. dividing the results output in step c into two groups: words and characters;
e. aggregating the characters into words according to whether the words and the characters overlap;
f. counting the directions of all characters in the same word and determining the direction of the current word by voting;
g. arranging the characters within a word in order along the word's direction;
h. determining whether spaces exist between characters according to the distances between the characters in the word, and adding spaces where needed;
i. outputting the result.
2. The pathological section label identification method according to claim 1, wherein: the topmost Anchor box ratios of the model are 1, 7 and 7; the bottommost Anchor box ratios are 1, 2 and 2.
3. The pathological section label identification method according to any one of claims 1-2, wherein: the loss function of the training network is as follows:
L = Lcls(p, u) + λ[u ≥ 1]Lloc(t^u, v) + γLdre(p, w) (2)
in formula (2): Lcls(p, u) = −log p_u, where u is the class of the target box in the output result and the background class number is 0; Lloc is the regression loss of the target box; Ldre(p, w) = −log p_w, where w is the direction of the target box in the output result; λ and γ are the weights of the corresponding losses.
4. The pathological section label identification method according to claim 3, wherein: λ is 10 and γ is 1.
5. The pathological section label identification method according to claim 3, wherein: the deep learning training stage comprises the following processing steps:
step 1, preprocessing an input image;
step 2, performing data enhancement on the preprocessed image by random cropping, left-right flipping, up-down flipping, rotation at an arbitrary angle, color perturbation, random brightness transformation and random noise addition;
step 3, scaling the image processed in step 2 to a fixed size;
step 4, forming a batch from the scaled images;
step 5, performing forward propagation with the model;
step 6, calculating the loss with the loss function, back-propagating, and updating the training parameters;
step 7, performing iterative training until the model converges.
6. The pathological section label identification method according to claim 5, wherein: the preprocessing method is as follows:
img = (img − μ) / σ (3)
in formula (3), μ is the mean of the image and σ is the variance of the image.
7. The pathological section label identification method according to claim 5, wherein: the fixed size is 512 × 512 and the number of images per batch is 16.
CN202010199537.0A 2020-03-19 2020-03-19 Pathological section label identification method Active CN111553361B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010199537.0A CN111553361B (en) 2020-03-19 2020-03-19 Pathological section label identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010199537.0A CN111553361B (en) 2020-03-19 2020-03-19 Pathological section label identification method

Publications (2)

Publication Number Publication Date
CN111553361A CN111553361A (en) 2020-08-18
CN111553361B (en) 2022-11-01

Family

ID=72001858

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010199537.0A Active CN111553361B (en) 2020-03-19 2020-03-19 Pathological section label identification method

Country Status (1)

Country Link
CN (1) CN111553361B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112634279B (en) * 2020-12-02 2023-04-07 四川大学华西医院 Medical image semantic segmentation method based on attention Unet model
CN114648680B (en) * 2022-05-17 2022-08-16 腾讯科技(深圳)有限公司 Training method, device, equipment and medium of image recognition model

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110245657A (en) * 2019-05-17 2019-09-17 清华大学 Pathological image similarity detection method and detection device
WO2019192397A1 (en) * 2018-04-04 2019-10-10 华中科技大学 End-to-end recognition method for scene text in any shape
CN110781305A (en) * 2019-10-30 2020-02-11 北京小米智能科技有限公司 Text classification method and device based on classification model and model training method

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10282414B2 (en) * 2017-02-28 2019-05-07 Cisco Technology, Inc. Deep learning bias detection in text
CN109447078B (en) * 2018-10-23 2020-11-06 四川大学 Detection and identification method for natural scene image sensitive characters
CN109753954A (en) * 2018-11-14 2019-05-14 安徽艾睿思智能科技有限公司 The real-time positioning identifying method of text based on deep learning attention mechanism
CN109697414B (en) * 2018-12-13 2021-06-18 北京金山数字娱乐科技有限公司 Text positioning method and device
CN109977861B (en) * 2019-03-25 2023-06-20 中国科学技术大学 Off-line handwriting mathematical formula recognition method
CN110837835B (en) * 2019-10-29 2022-11-08 华中科技大学 End-to-end scene text identification method based on boundary point detection

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019192397A1 (en) * 2018-04-04 2019-10-10 华中科技大学 End-to-end recognition method for scene text in any shape
CN110245657A (en) * 2019-05-17 2019-09-17 清华大学 Pathological image similarity detection method and detection device
CN110781305A (en) * 2019-10-30 2020-02-11 北京小米智能科技有限公司 Text classification method and device based on classification model and model training method

Also Published As

Publication number Publication date
CN111553361A (en) 2020-08-18

Similar Documents

Publication Publication Date Title
US7570816B2 (en) Systems and methods for detecting text
US8494273B2 (en) Adaptive optical character recognition on a document with distorted characters
Jain et al. Unconstrained scene text and video text recognition for arabic script
CN102385592B (en) Image concept detection method and device
CN112613502A (en) Character recognition method and device, storage medium and computer equipment
CN113591866B (en) Special operation certificate detection method and system based on DB and CRNN
CN113361432B (en) Video character end-to-end detection and identification method based on deep learning
CN110008899B (en) Method for extracting and classifying candidate targets of visible light remote sensing image
CN113343989B (en) Target detection method and system based on self-adaption of foreground selection domain
CN111553361B (en) Pathological section label identification method
CN109213886B (en) Image retrieval method and system based on image segmentation and fuzzy pattern recognition
CN114663904A (en) PDF document layout detection method, device, equipment and medium
CN113158895A (en) Bill identification method and device, electronic equipment and storage medium
CN111178196A (en) Method, device and equipment for cell classification
Nguyen TableSegNet: a fully convolutional network for table detection and segmentation in document images
Li et al. Image pattern recognition in identification of financial bills risk management
CN112508000B (en) Method and equipment for generating OCR image recognition model training data
CN111832497B (en) Text detection post-processing method based on geometric features
CN111767919A (en) Target detection method for multi-layer bidirectional feature extraction and fusion
US20230154217A1 (en) Method for Recognizing Text, Apparatus and Terminal Device
CN111414917A (en) Identification method of low-pixel-density text
CN116030469A (en) Processing method, processing device, processing equipment and computer readable storage medium
CN113205049A (en) Document identification method and identification system
Rani et al. Object Detection in Natural Scene Images Using Thresholding Techniques
Bumbu On classification of 17th century fonts using neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant