CN111709293A - Chemical structural formula segmentation method based on Resunet neural network - Google Patents


Info

Publication number
CN111709293A
CN111709293A
Authority
CN
China
Prior art keywords
size
multiplied
chemical structural
neural network
resunet
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010419502.3A
Other languages
Chinese (zh)
Other versions
CN111709293B (en)
Inventor
王毅刚
邵锦涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202010419502.3A priority Critical patent/CN111709293B/en
Publication of CN111709293A publication Critical patent/CN111709293A/en
Application granted granted Critical
Publication of CN111709293B publication Critical patent/CN111709293B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40Document-oriented image-based pattern recognition
    • G06V30/41Analysis of document content
    • G06V30/413Classification of content, e.g. text, photographs or tables
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/62Text, e.g. of license plates, overlay texts or captions on TV images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02PCLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a chemical structural formula segmentation method based on a ResUNet neural network, comprising the following steps. Step (1): construct a training set T consisting of a manually labeled training set T-1 and an automatically generated training set T-2. Step (2): feed the training set T into the ResUNet neural network and train until a specified number of iterations is reached, or the loss curve no longer decreases and the accuracy no longer improves; save the trained ResUNet neural network model. Step (3): segment chemical structural formulas with the ResUNet model trained in step (2). The invention proposes an improved ResUNet neural network, together with a method for automatically generating a large training set of chemical structural formulas, so that the network can segment chemical structural formulas and a large amount of data can be used to improve its recognition accuracy.

Description

Chemical structural formula segmentation method based on Resunet neural network
Technical Field
The invention belongs to the technical field of computer detection, and particularly relates to a chemical structural formula segmentation method based on a Resunet neural network.
Background
A critical part of scientific work is the rapid processing and absorption of newly acquired data. In addition, new research often needs to collect, analyze, and utilize previously published experimental data. This is particularly true for small-molecule drug discovery, where pools of experimentally tested molecules are used for virtual screening programs, quantitative structure-activity/property relationship (QSAR/QSPR) analysis, or validation of physics-based modeling methods. Because generating large amounts of experimental data is difficult and expensive, many drug discovery programs are forced to rely on relatively small internal experimental databases. One promising way to address the general lack of suitable training data in drug discovery is to exploit data that has already been published. Medline records more than 2,000 new life-science papers every day; given that new experimental data enters the public literature at such a high rate, it is increasingly important to address the problems of data extraction and management and to automate these processes as much as possible. Yet extracting chemical structures from published sources such as life-science journal articles and patent documents remains difficult and very time-consuming.
At present, a large number of books and other publications are still available only in paper or scanned form, which makes reuse difficult. On the one hand, paper and scanned materials are hard to search, so information scattered across a large body of documents is hard to find and therefore underused. On the other hand, further processing of these materials requires tedious and error-prone re-entry of their content.
Research on the recognition of chemical structural formulas has progressed slowly, mainly for two reasons. First, in a document a formula is surrounded by natural language and is difficult to locate. Second, chemical structural formulas have complex structures: their symbols are numerous, appear in many fonts and sizes, and their notation is irregular and intricate.
Existing methods for recognizing chemical structural formulas work in two steps: first, locate and segment the chemical structural formula out of the surrounding natural language; second, feed the segmented formula into a recognition engine. Current segmentation methods are essentially based on traditional image processing; their accuracy is low, and they cannot handle special cases such as natural language lying very close to the chemical formula.
Disclosure of Invention
Based on the above, in order to improve the accuracy of locating and segmenting chemical structural formulas, the invention proposes an improved ResUNet neural network, together with a method for automatically generating a large training set of chemical structural formulas, so that the network can segment chemical structural formulas and a large amount of data can be used to improve its recognition accuracy.
A method for segmenting a chemical structural formula based on a ResUNet neural network comprises the following steps:
Step (1): construct a training set T consisting of a manually labeled training set T-1 and an automatically generated training set T-2. Chemical formulas manually labeled in publications form training set T-1, and the method for automatically generating a chemical structural formula training set produces training set T-2; the capacity ratio of T-1 to T-2 is 1:50.
Step (2): feed the training set T into the improved ResUNet neural network and train until a specified number of iterations is reached, or the loss curve no longer decreases and the accuracy no longer improves; save the trained ResUNet neural network model.
Step (3): segment chemical structural formulas with the ResUNet model trained in step (2).
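The stopping criterion in step (2), stop after a fixed number of epochs or once the loss no longer decreases and the accuracy no longer improves, can be sketched as a simple early-stopping check. This is an illustrative sketch, not the patent's code; the `patience` window and the metric histories are assumptions:

```python
def should_stop(loss_history, acc_history, max_epochs, patience=5):
    """Return True when training should stop: the epoch budget is exhausted,
    or neither the loss nor the accuracy has improved within the last
    `patience` epochs compared with all earlier epochs."""
    epoch = len(loss_history)
    if epoch >= max_epochs:
        return True
    if epoch <= patience:
        return False  # not enough history to judge stagnation
    loss_stalled = min(loss_history[-patience:]) >= min(loss_history[:-patience])
    acc_stalled = max(acc_history[-patience:]) <= max(acc_history[:-patience])
    return loss_stalled and acc_stalled
```

Called once per epoch with the validation loss and accuracy recorded so far, this stops training exactly when the loss curve is no longer decreasing and the precision is no longer improving.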
Further, the method for automatically generating the chemical structural formula training set produces training data by randomly filling images into typesetting templates; it is constructed as follows:
a. and constructing a typesetting template, and randomly generating text data in the character area.
b. A large number of chemical structural images are generated.
c. And searching blank positions in the typesetting template, randomly filling the chemical structural formula image formula and marking.
Further, the method for constructing the typesetting template comprises the following steps:
a-1. Manually annotate the character areas in 200 pages of publications, then rotate and flip the pages vertically and horizontally to augment the data, producing 1,000 pages of typesetting templates; a manually annotated template is shown in FIG. 2.
a-2. Use text collected from the internet and text produced by a random text generator as text data, and randomly fill it into the character areas of the typesetting templates; a generated result is shown in FIG. 3.
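The random text data of step a-2 can be produced with a trivial generator; the word-length range below is an arbitrary illustrative choice, and a real implementation would mix in internet-sourced text as the patent describes:

```python
import random
import string

def random_text(rng, n_words):
    """Generate n_words of lowercase filler text for a template's character area."""
    words = []
    for _ in range(n_words):
        length = rng.randint(2, 10)  # arbitrary word-length range
        words.append("".join(rng.choice(string.ascii_lowercase)
                             for _ in range(length)))
    return " ".join(words)
```

For example, `random_text(random.Random(0), 40)` fills one text block; repeating this per character area populates a full template page.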
Further, the method for generating a plurality of chemical structural formula images comprises the following steps:
and b-1, rendering 5700 ten thousand molecule data available in a PubChem database into a 3-channel PNG format image of 256x256 pixels of various types (key width, character size and the like) at random by using Indigo software.
b-2. Apply angle rotations and vertical and horizontal flips to these images for data augmentation, generating 100,000 small-molecule chemical structural formula images.
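The augmentation in b-2 (angle rotations plus vertical and horizontal flips) can be sketched with plain nested lists standing in for images; generating all eight dihedral variants of a square image is an illustrative interpretation of the rotation-and-flip expansion:

```python
def rot90(img):
    """Rotate a 2D list 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def flip_h(img):
    """Flip left-right."""
    return [row[::-1] for row in img]

def flip_v(img):
    """Flip up-down."""
    return img[::-1]

def augment(img):
    """Return the distinct variants reachable by 90-degree rotations and flips
    (the dihedral group of the square, at most 8 variants)."""
    variants = []
    cur = img
    for _ in range(4):
        for candidate in (cur, flip_h(cur), flip_v(cur)):
            if candidate not in variants:
                variants.append(candidate)
        cur = rot90(cur)
    return variants
```

A fully asymmetric image yields eight variants, a fully symmetric one only itself, which is why augmenting rendered formulas multiplies the dataset size.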
Further, the method for searching the blank position in the typesetting template to randomly fill the chemical structural formula image and mark comprises the following steps:
and c-1, randomly taking out the generated chemical structural formula image, and placing the chemical structural formula image at a blank position outside the text area after random scaling to obtain a data part in the training set T-2, as shown in the attached figure 4.
c-2. Label, pixel by pixel, the positions occupied by the chemical structural formula image to obtain the label part of training set T-2, as shown in FIG. 5.
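Steps c-1 and c-2 can be sketched on a coarse occupancy grid: search for a blank patch outside the text areas, paste the formula there, and write class-1 labels pixel by pixel. The brute-force search and the 0/1 grid representation are illustrative assumptions, not the patent's actual algorithm:

```python
import random

def find_blank_position(occupied, h, w, rng=None):
    """Return a random (row, col) at which an h x w patch fits entirely on
    blank cells of the 0/1 occupancy grid, or None if no position fits."""
    rng = rng or random.Random(0)
    rows, cols = len(occupied), len(occupied[0])
    candidates = [(r, c)
                  for r in range(rows - h + 1)
                  for c in range(cols - w + 1)
                  if all(occupied[r + i][c + j] == 0
                         for i in range(h) for j in range(w))]
    return rng.choice(candidates) if candidates else None

def paste_formula(occupied, label, r, c, h, w):
    """Mark the patch as occupied and label its pixels as formula (class 1)."""
    for i in range(h):
        for j in range(w):
            occupied[r + i][c + j] = 1
            label[r + i][c + j] = 1
```

In a real pipeline the occupancy grid would come from binarizing the filled template page, and the randomly scaled formula image of step c-1 would be composited at (r, c) rather than merely marked.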
Further, the improved ResUNet neural network is implemented as:
taking the training set T as an input image of the improved ResUNet neural network, wherein the input image is 512 multiplied by 3, and outputting a feature map res-1 with the size of 256 multiplied by 64 after 7 multiplied by 7 convolution of a first layer; then, pooling by using the maximum value of 3 × 3, repeating the convolution for three times with the size of 1 × 1, the size of 3 × 3 and the size of 1 × 1 for 9 times, and outputting a feature map res-2 with the size of 128 × 128 × 256; then, after repeating convolution for 12 times with the size of 1 × 1, the size of 3 × 3 and the size of 1 × 1 four times, outputting a feature map res-3 with the size of 64 × 64 × 512, and then repeating convolution for 18 times with the size of 1 × 1, the size of 3 × 3 and the size of 1 × 1 six times, outputting a feature map res-4 with the size of 32 × 32 × 1024; then, after repeating convolution for three times with the size of 1 multiplied by 1, the size of 3 multiplied by 3 and the size of 1 multiplied by 1 for 9 times, outputting a characteristic diagram res-5 with the size of 16 multiplied by 2048; then carrying out convolution with the size of 1 multiplied by 1, and outputting a characteristic diagram conv-1 with the size of 16 multiplied by 1024; then 2 x2 upsampling is carried out, and the output characteristic diagram up-1 and the characteristic diagram res-4 are spliced to obtain a 32 x 2048 size characteristic diagram concat-1; then, carrying out convolution with the size of 3 multiplied by 3, and outputting a feature map conv-2 with the size of 32 multiplied by 512; then 2 x2 upsampling is carried out, and the output characteristic graph up-2 and the characteristic graph res-3 are spliced to obtain a characteristic graph concat-2 with the size of 64 x 1024; then, carrying out convolution with the size of 3 multiplied by 3 to output a feature map conv-3 with the size of 64 multiplied by 256; then 2 x2 upsampling is carried out, and the output characteristic graph 
up-3 and the characteristic graph res-2 are spliced to obtain a 128 x 512 size characteristic graph concat-3; then carrying out convolution with the size of 3 multiplied by 3 to output a feature map conv-4 with the size of 128 multiplied by 64; then 2 x2 upsampling is carried out, and the output characteristic diagram up-4 and the characteristic diagram res-1 are spliced to obtain a 256x 128 size characteristic diagram concat-4; then carrying out convolution with the size of 3 multiplied by 3 to output a characteristic diagram conv-5 with the size of 256 multiplied by 64; finally, after 2 × 2 upsampling and 1 × 1 size convolution, a 512 × 512 × 2 result graph corresponding to the size of the original input image is output.
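The feature-map sizes quoted above follow mechanically from one stride-2 stage per encoder level and one 2 × 2 upsampling per decoder level; a small bookkeeping sketch (channel counts taken from the text, everything else arithmetic) confirms they are mutually consistent:

```python
def encoder_shapes(size=512):
    """(spatial size, channels) of res-1 .. res-5: the stride-2 7x7 conv halves
    the input once, and each later stage (max pooling or a stride-2 bottleneck
    stage) halves it again."""
    channels = [64, 256, 512, 1024, 2048]
    shapes = [(size // 2, channels[0])]           # res-1
    for ch in channels[1:]:
        shapes.append((shapes[-1][0] // 2, ch))   # res-2 .. res-5
    return shapes

def concat_shapes(enc):
    """(spatial size, channels) of concat-1 .. concat-4: 2x2 upsampling restores
    the skip level's spatial size, and since each conv-i halves the channels,
    concatenation with the skip map doubles that level's channel count."""
    return [(s, ch * 2) for s, ch in reversed(enc[:-1])]
```

This reproduces res-1 through res-5 and concat-1 through concat-4 exactly as listed in the paragraph, under the stated assumption that each conv-1..conv-4 halves the channel count before upsampling.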
The invention has the following beneficial effects:
The invention proposes an improved ResUNet neural network, together with a method for automatically generating a large training set of chemical structural formulas, so that the network can segment chemical structural formulas and a large amount of data can be used to improve its recognition accuracy.
Drawings
FIG. 1 is a schematic flow chart of the improved ResUNet neural network of the present invention;
FIG. 2 is a schematic diagram of a manually annotated template sample of the present invention;
FIG. 3 is a schematic diagram of a template sample after random text filling according to the present invention;
FIG. 4 is a schematic diagram of a template sample after random filling of chemical structural formulas according to the present invention;
FIG. 5 is a schematic diagram of an example label corresponding to the template of the present invention.
Detailed Description
In order to describe the present invention more specifically, the method for segmenting a chemical structural formula based on a ResUNet neural network is described in detail below with reference to the accompanying drawings and specific embodiments.
A method for segmenting a chemical structural formula based on a ResUNet neural network comprises the following steps:
Step (1): construct a training set T consisting of a manually labeled training set T-1 and an automatically generated training set T-2. Chemical formulas manually labeled in publications form training set T-1, and the method for automatically generating a chemical structural formula training set produces training set T-2; the capacity ratio of T-1 to T-2 is 1:50.
the method for automatically generating the chemical structural formula training set is used for generating the training set based on the random filling of the images of the typesetting template, and the construction method comprises the following steps:
a. Construct typesetting templates and randomly generate text data in the character areas: manually annotate the character areas in 200 pages of publications, then rotate and flip the pages vertically and horizontally to augment the data, producing 1,000 pages of templates (a manually annotated template is shown in FIG. 2); use text collected from the internet and text produced by a random text generator as text data, and randomly fill it into the character areas of the templates (a generated result is shown in FIG. 3).
b. Generate a large number of chemical structural formula images: drawing on the 57 million molecules available in the PubChem database, randomly render part of the molecular data with the Indigo software into 3-channel, 256 × 256-pixel PNG images of various styles (bond line width, font size, etc.), then apply angle rotations and vertical and horizontal flips for data augmentation, generating 100,000 small-molecule chemical structural formula images.
c. Search for blank positions in the typesetting templates, randomly fill in chemical structural formula images, and label them: randomly take a generated chemical structural formula image, scale it randomly, and place it at a blank position outside the text areas, as shown in FIG. 4; label, pixel by pixel, the positions occupied by the chemical structural formula image, as shown in FIG. 5.
Step (2): construct the improved ResUNet neural network shown in FIG. 1, feed the training data set into it, and train until a specified number of iterations is reached or the loss curve no longer decreases and the accuracy no longer improves; save the trained model.
further, the improved ResUNet neural network comprises the following steps: taking the training set T as an input image of the improved ResUNet neural network, wherein the input image is 512 multiplied by 3, and outputting a feature map res-1 with the size of 256 multiplied by 64 after 7 multiplied by 7 convolution of a first layer; then, pooling by using the maximum value of 3 × 3, repeating the convolution for three times with the size of 1 × 1, the size of 3 × 3 and the size of 1 × 1 for 9 times, and outputting a feature map res-2 with the size of 128 × 128 × 256; then, after repeating convolution for 12 times with the size of 1 × 1, the size of 3 × 3 and the size of 1 × 1 four times, outputting a feature map res-3 with the size of 64 × 64 × 512, and then repeating convolution for 18 times with the size of 1 × 1, the size of 3 × 3 and the size of 1 × 1 six times, outputting a feature map res-4 with the size of 32 × 32 × 1024; then, after repeating convolution for 9 times with the size of 1 multiplied by 1, the size of 3 multiplied by 3 and the size of 1 multiplied by 1 for three times, outputting a size characteristic diagram res-5 of 16 multiplied by 2048; then carrying out convolution with the size of 1 multiplied by 1, and outputting a characteristic diagram conv-1 with the size of 16 multiplied by 1024; then 2 x2 upsampling is carried out, and the output characteristic diagram up-1 and the characteristic diagram res-4 are spliced to obtain a 32 x 2048 size characteristic diagram concat-1; then, carrying out convolution with the size of 3 multiplied by 3, and outputting a feature map conv-2 with the size of 32 multiplied by 512; then 2 x2 upsampling is carried out, and the output characteristic graph up-2 and the characteristic graph res-3 are spliced to obtain a characteristic graph concat-2 with the size of 64 x 1024; then, carrying out convolution with the size of 3 multiplied by 3 to output a feature map conv-3 with the size of 64 multiplied by 256; then 
2 x2 upsampling is carried out, and the output characteristic graph up-3 and the characteristic graph res-2 are spliced to obtain a 128 x 512 size characteristic graph concat-3; then carrying out convolution with the size of 3 multiplied by 3 to output a feature map conv-4 with the size of 128 multiplied by 64; then 2 x2 upsampling is carried out, and the output characteristic diagram up-4 and the characteristic diagram res-1 are spliced to obtain a 256x 128 size characteristic diagram concat-4; then carrying out convolution with the size of 3 multiplied by 3 to output a characteristic diagram conv-5 with the size of 256 multiplied by 64; finally, after 2 × 2 upsampling and 1 × 1 size convolution, a 512 × 512 × 2 result graph corresponding to the size of the original input image is output.
The improved ResUNet neural network was constructed according to the layer table of the original filing (the table appears there as image figures BDA0002496323150000071 and BDA0002496323150000081, not reproduced here).
Step (3): perform segmentation with the neural network trained in step (2) to obtain the segmentation result.

Claims (6)

1. A method for segmenting a chemical structural formula based on a ResUNet neural network is characterized by comprising the following steps:
constructing a training set T, wherein the training set T comprises a manual labeling training set T-1 and an automatic generation training set T-2;
step (2): feeding the training set T into the ResUNet neural network and training until a specified number of iterations is reached or the loss curve no longer decreases and the accuracy no longer improves, and saving the trained ResUNet neural network model;
step (3) segmenting the chemical structural formula by using the ResUNet neural network model trained in the step (2);
the training set T-2 is generated by image random filling based on typesetting template through a method for automatically generating a chemical structural formula training set, and the construction method comprises the following steps:
a. constructing typesetting templates and randomly generating text data in their character areas;
b. generating a plurality of chemical structural formula images;
c. searching for blank positions in the typesetting templates, randomly filling in chemical structural formula images, and labeling them.
2. The method for segmenting a chemical structural formula based on a ResUNet neural network as claimed in claim 1, wherein chemical formulas manually labeled in publications are used as training set T-1, and the capacity ratio of training set T-1 to training set T-2 is 1:50.
3. the method for segmenting chemical structural formulas based on ResUNet neural network as claimed in claim 1 or 2, wherein the method for constructing typeset templates comprises the following steps:
a-1. manually annotating the character areas in 200 pages of publications, then rotating and flipping the pages vertically and horizontally to augment the data, producing 1,000 pages of typesetting templates;
a-2. using text collected from the internet and text produced by a random text generator as text data, and randomly filling it into the character areas of the typesetting templates.
4. The method for segmenting chemical structural formulas based on ResUNet neural network as claimed in claim 3, wherein the method for generating a plurality of chemical structural formula images comprises the following steps:
b-1. drawing on the 57 million molecules available in the PubChem database, randomly rendering molecules with the Indigo software into various styles of 3-channel, 256 × 256-pixel PNG images;
b-2. applying angle rotations and vertical and horizontal flips to these images for data augmentation, generating 100,000 small-molecule chemical structural formula images.
5. The method of claim 4, wherein the method of searching for the blank position in the typesetting template to randomly fill and mark the chemical structural formula image comprises the following steps:
c-1. randomly taking a generated chemical structural formula image, scaling it randomly, and placing it at a blank position outside the text areas to obtain the data part of training set T-2;
c-2. labeling, pixel by pixel, the positions occupied by the chemical structural formula image to obtain the label part of training set T-2.
6. The method for segmenting a chemical structural formula based on a ResUNet neural network as claimed in claim 5, wherein the ResUNet neural network is an improved ResUNet neural network, implemented as follows:
taking the training set T as an input image of the improved ResUNet neural network, wherein the input image is 512 multiplied by 3, and outputting a feature map res-1 with the size of 256 multiplied by 64 after 7 multiplied by 7 convolution of a first layer; then, pooling by using the maximum value of 3 × 3, repeating the convolution for three times with the size of 1 × 1, the size of 3 × 3 and the size of 1 × 1 for 9 times, and outputting a feature map res-2 with the size of 128 × 128 × 256; then, after repeating convolution for 12 times with the size of 1 × 1, the size of 3 × 3 and the size of 1 × 1 four times, outputting a feature map res-3 with the size of 64 × 64 × 512, and then repeating convolution for 18 times with the size of 1 × 1, the size of 3 × 3 and the size of 1 × 1 six times, outputting a feature map res-4 with the size of 32 × 32 × 1024; then, after repeating convolution for three times with the size of 1 multiplied by 1, the size of 3 multiplied by 3 and the size of 1 multiplied by 1 for 9 times, outputting a size characteristic diagram res-5 of 16 multiplied by 2048; then carrying out convolution with the size of 1 multiplied by 1, and outputting a characteristic diagram conv-1 with the size of 16 multiplied by 1024; then 2 x2 upsampling is carried out, and the output characteristic diagram up-1 and the characteristic diagram res-4 are spliced to obtain a 32 x 2048 size characteristic diagram concat-1; then, carrying out convolution with the size of 3 multiplied by 3, and outputting a feature map conv-2 with the size of 32 multiplied by 512; then 2 x2 upsampling is carried out, and the output characteristic graph up-2 and the characteristic graph res-3 are spliced to obtain a characteristic graph concat-2 with the size of 64 x 1024; then, carrying out convolution with the size of 3 multiplied by 3 to output a feature map conv-3 with the size of 64 multiplied by 256; then 2 x2 upsampling is carried out, and the output characteristic graph up-3 and 
the characteristic graph res-2 are spliced to obtain a 128 x 512 size characteristic graph concat-3; then carrying out convolution with the size of 3 multiplied by 3 to output a feature map conv-4 with the size of 128 multiplied by 64; then 2 x2 upsampling is carried out, and the output characteristic diagram up-4 and the characteristic diagram res-1 are spliced to obtain a 256x 128 size characteristic diagram concat-4; then carrying out convolution with the size of 3 multiplied by 3 to output a characteristic diagram conv-5 with the size of 256 multiplied by 64; finally, after 2 × 2 upsampling and 1 × 1 size convolution, a 512 × 512 × 2 result graph corresponding to the size of the original input image is output.
CN202010419502.3A 2020-05-18 2020-05-18 Chemical structural formula segmentation method based on Resunet neural network Active CN111709293B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010419502.3A CN111709293B (en) 2020-05-18 2020-05-18 Chemical structural formula segmentation method based on Resunet neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010419502.3A CN111709293B (en) 2020-05-18 2020-05-18 Chemical structural formula segmentation method based on Resunet neural network

Publications (2)

Publication Number Publication Date
CN111709293A true CN111709293A (en) 2020-09-25
CN111709293B CN111709293B (en) 2023-10-03

Family

ID=72538017

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010419502.3A Active CN111709293B (en) 2020-05-18 2020-05-18 Chemical structural formula segmentation method based on Resunet neural network

Country Status (1)

Country Link
CN (1) CN111709293B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114241505A (en) * 2021-12-20 2022-03-25 苏州阿尔脉生物科技有限公司 Method and device for extracting chemical structure image, storage medium and electronic equipment
CN114842486A (en) * 2022-07-04 2022-08-02 南昌大学 Handwritten chemical structural formula recognition method, system, storage medium and equipment

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1381795A (en) * 2001-04-18 2002-11-27 无敌科技(西安)有限公司 Automatic format setting method for palm-type browser
CN101561805A (en) * 2008-04-18 2009-10-21 日电(中国)有限公司 Document classifier generation method and system
WO2018138104A1 (en) * 2017-01-27 2018-08-02 Agfa Healthcare Multi-class image segmentation method
CN109087306A (en) * 2018-06-28 2018-12-25 众安信息技术服务有限公司 Arteries iconic model training method, dividing method, device and electronic equipment
CN109118491A (en) * 2018-07-30 2019-01-01 深圳先进技术研究院 A kind of image partition method based on deep learning, system and electronic equipment
CN109191476A (en) * 2018-09-10 2019-01-11 重庆邮电大学 The automatic segmentation of Biomedical Image based on U-net network structure
WO2019015785A1 (en) * 2017-07-21 2019-01-24 Toyota Motor Europe Method and system for training a neural network to be used for semantic instance segmentation
CN109658422A (en) * 2018-12-04 2019-04-19 大连理工大学 A kind of retinal images blood vessel segmentation method based on multiple dimensioned deep supervision network
US10304193B1 (en) * 2018-08-17 2019-05-28 12 Sigma Technologies Image segmentation and object detection using fully convolutional neural network
CN109903292A (en) * 2019-01-24 2019-06-18 西安交通大学 A kind of three-dimensional image segmentation method and system based on full convolutional neural networks
US20190205606A1 (en) * 2016-07-21 2019-07-04 Siemens Healthcare Gmbh Method and system for artificial intelligence based medical image segmentation
CN110210362A (en) * 2019-05-27 2019-09-06 中国科学技术大学 A kind of method for traffic sign detection based on convolutional neural networks
CN110705459A (en) * 2019-09-29 2020-01-17 北京爱学习博乐教育科技有限公司 Automatic identification method and device for mathematical and chemical formulas and model training method and device
US20200057917A1 (en) * 2018-08-17 2020-02-20 Shenzhen Dorabot Inc. Object Location Method, Device and Storage Medium Based on Image Segmentation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YU YUHAI: "Research on Key Technologies of Image Pattern Recognition for Medical Literature", no. 2 *

Also Published As

Publication number Publication date
CN111709293B (en) 2023-10-03

Similar Documents

Publication Publication Date Title
JP2022541199A (en) A system and method for inserting data into a structured database based on image representations of data tables.
Li et al. Tablebank: A benchmark dataset for table detection and recognition
CN114005123A (en) System and method for digitally reconstructing layout of print form text
CN108090400A (en) A kind of method and apparatus of image text identification
Clausner et al. Efficient and effective OCR engine training
CN113869017B (en) Table image reconstruction method, device, equipment and medium based on artificial intelligence
CN111709293B (en) Chemical structural formula segmentation method based on Resunet neural network
US9159147B2 (en) Method and apparatus for personalized handwriting avatar
CN114005126A (en) Table reconstruction method and device, computer equipment and readable storage medium
CN112560849A (en) Neural network algorithm-based grammar segmentation method and system
CN109685061A (en) The recognition methods of mathematical formulae suitable for structuring
Godfrey et al. An adaptable approach for generating vector features from scanned historical thematic maps using image enhancement and remote sensing techniques in a geographic information system
Vafaie et al. Handwritten and printed text identification in historical archival documents
CN116610304B (en) Page code generation method, device, equipment and storage medium
CN116721713B (en) Data set construction method and device oriented to chemical structural formula identification
Aswatha et al. A method for extracting text from stone inscriptions using character spotting
CN109410662B (en) Method and device for manufacturing Chinese character multimedia card
CN115019310B (en) Image-text identification method and equipment
CN111026899A (en) Product generation method based on deep learning
CN116341489A (en) Text information reading method, device and terminal
CN114913382A (en) Aerial photography scene classification method based on CBAM-AlexNet convolutional neural network
CN114565749A (en) Method and system for identifying key content of visa document of power construction site
CN114861595A (en) Vector line transformation-based individual font generation method
Scius-Bertrand et al. Annotation-free keyword spotting in historical Vietnamese manuscripts using graph matching
Hamplová et al. Character Segmentation in the Development of Palmyrene Aramaic OCR

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant