CN111709293B - Chemical structural formula segmentation method based on Resunet neural network - Google Patents

Chemical structural formula segmentation method based on Resunet neural network

Info

Publication number
CN111709293B
CN111709293B (application CN202010419502.3A)
Authority
CN
China
Prior art keywords
size
structural formula
training set
neural network
chemical structural
Prior art date
Legal status
Active
Application number
CN202010419502.3A
Other languages
Chinese (zh)
Other versions
CN111709293A (en)
Inventor
王毅刚
邵锦涛
Current Assignee
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202010419502.3A
Publication of CN111709293A
Application granted
Publication of CN111709293B

Classifications

    • G06V30/413: Classification of content, e.g. text, photographs or tables (character recognition; document-oriented image-based pattern recognition; analysis of document content)
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches (pattern recognition)
    • G06N3/045: Combinations of networks (computing arrangements based on biological models; neural networks; architecture)
    • G06N3/08: Learning methods (neural networks)
    • G06V10/267: Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds (image preprocessing)
    • G06V20/62: Text, e.g. of license plates, overlay texts or captions on TV images (scene-specific elements)
    • G06V30/10: Character recognition
    • Y02P90/30: Computing systems specially adapted for manufacturing (climate change mitigation technologies in the production or processing of goods)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a chemical structural formula segmentation method based on a Resunet neural network. The method comprises the following steps: step (1), constructing a training set T, wherein the training set T comprises a manually labeled training set T-1 and an automatically generated training set T-2; step (2), feeding the training set T into the Resunet neural network for training, and saving the trained Resunet neural network model once the specified number of training iterations is reached or the loss curve no longer decreases and the accuracy no longer improves; step (3), segmenting chemical structural formulas using the Resunet neural network model trained in step (2). Building on the Resunet neural network, the invention provides an improved Resunet neural network together with a method for automatically generating a large number of chemical structural formula training samples, so that the Resunet neural network can segment chemical structural formulas and a large amount of data can improve the recognition accuracy of the neural network.

Description

Chemical structural formula segmentation method based on Resunet neural network
Technical Field
The invention belongs to the technical field of computer detection, and particularly relates to a chemical structural formula segmentation method based on a Resunet neural network.
Background
A critical part of scientific experimentation is the rapid processing and absorption of newly acquired data. Moreover, new research can also draw on the collection, analysis and utilization of previously published experimental data. This is particularly true for small-molecule drug discovery, where experimentally tested molecule sets are used for virtual screening programs, quantitative structure-activity/property relationship (QSAR/QSPR) analysis, or the validation of physics-based modeling methods. Because generating large amounts of experimental data is difficult and expensive, many drug discovery projects are forced to rely on relatively small internal experimental databases. One promising way to address the general lack of adequate training data in drug discovery is to exploit the data that has already been published. Medline reports that more than 2,000 new life-science papers are published every day; given that new experimental data enters the public literature at such a pace, it is increasingly important to solve the problems of data extraction and management and to automate these processes as much as possible. Extracting chemical structures from published sources in the life sciences, such as journal articles and patent documents, remains difficult and time-consuming.
Currently, a large number of books and other publications are available only in paper or scanned form, which makes reuse difficult. On the one hand, paper or scanned materials are hard to search, so information dispersed across a large number of documents is not easily found and is under-utilized. On the other hand, further processing of these materials involves tedious and error-prone re-entry work.
Research on chemical structural formula recognition has progressed slowly, mainly for two reasons: first, formulas are surrounded by natural language in documents, which makes them difficult to locate; second, chemical structural formulas have complex structures, with characters of many kinds, fonts and sizes, and exhibit irregular, logical and complex characteristics.
Existing chemical structural formula recognition methods consist of two steps: (1) locating and segmenting the chemical structural formula out of the natural language; (2) feeding the segmented chemical structural formulas into a recognition engine. Current chemical structural formula segmentation methods are mostly based on traditional image processing, have low segmentation accuracy, and cannot handle special cases such as natural language lying very close to the chemical molecular formula.
Disclosure of Invention
Accordingly, to improve the accuracy of locating and segmenting chemical structural formulas, the invention provides an improved Resunet neural network based on the Resunet neural network, together with a method for automatically generating a large number of chemical structural formula training samples, so that the Resunet neural network can segment chemical structural formulas and a large amount of data can improve the recognition accuracy of the neural network.
A chemical structural formula segmentation method based on a Resunet neural network comprises the following steps:
Step (1): construct a training set T, wherein the training set T comprises a manually labeled training set T-1 and an automatically generated training set T-2. Chemical formulas in publications are manually labeled to form the training set T-1, and the training set T-2 is generated using the method for automatically generating a chemical structural formula training set; the capacity ratio of the training set T-1 to the training set T-2 is 1:50;
Step (2): feed the training set T into the improved Resunet neural network for training, and save the trained Resunet neural network model once the specified number of training iterations is reached or the loss curve no longer decreases and the accuracy no longer improves;
Step (3): segment chemical structural formulas using the Resunet neural network model trained in step (2).
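The stopping rule in step (2) (stop after a specified number of training iterations, or once the loss curve no longer decreases and the accuracy no longer improves) can be sketched as a small helper. This is an illustrative reading, not code from the patent; `patience` and `eps` are assumed hyperparameters:

```python
def should_stop(losses, accuracies, max_iters, patience=5, eps=1e-4):
    """Decide whether training should stop: either the iteration budget is
    reached, or the loss has not decreased AND the accuracy has not improved
    over the last `patience` epochs (within tolerance `eps`)."""
    if len(losses) >= max_iters:
        return True
    if len(losses) <= patience:
        return False
    # Compare the best recent values against the best earlier values.
    loss_stalled = min(losses[-patience:]) > min(losses[:-patience]) - eps
    acc_stalled = max(accuracies[-patience:]) < max(accuracies[:-patience]) + eps
    return loss_stalled and acc_stalled
```

Training would call this once per epoch with the loss and accuracy histories and break out of the loop when it returns `True`.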
Further, the method for automatically generating the chemical structural formula training set produces the training set by randomly filling images into typesetting templates; the construction method comprises the following steps:
a. and constructing a typesetting template, and randomly generating text data in the text area.
b. A large number of chemical structural images are generated.
c. And searching blank positions in the typesetting template, randomly filling the chemical structural formula image formula, and marking.
Further, the method for constructing the typesetting template comprises the following steps:
a-1. Manually calibrate the text areas in 200 pages of publications and expand the data by rotation and up-down and left-right flipping, generating a total of 1000 pages of typesetting templates; a manually labeled typesetting template is shown in figure 2.
a-2. Use Internet text and text produced by a random text generator as text data, and randomly fill it into the text areas of the typesetting template; the generated result is shown in figure 3.
Further, the method for generating a large number of chemical structural formula images comprises the following steps:
b-1. From the 57 million molecular entries available in the PubChem database, part of the molecular data is randomly rendered with the Indigo software into 256×256-pixel 3-channel PNG format images of various styles (bond width, character size, etc.).
b-2. Expand the data by rotating the images and flipping them up-down and left-right, generating 100,000 small-molecule chemical structural formula images.
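The rotation and flip expansion in b-2 can be sketched with NumPy array operations. This is a minimal sketch assuming the rendered 256×256 PNG images have already been loaded as arrays; the exact set of rotation angles kept is not specified in the patent:

```python
import numpy as np

def augment(image):
    """Expand one chemical-structure image into rotated and flipped variants,
    mirroring the rotation and up-down/left-right flipping described in b-2."""
    variants = [image]
    for k in (1, 2, 3):                  # 90/180/270 degree rotations
        variants.append(np.rot90(image, k))
    variants.append(np.flipud(image))    # up-down flip
    variants.append(np.fliplr(image))    # left-right flip
    return variants
```

Applied to each rendered PubChem image, this multiplies the dataset size several-fold without re-rendering.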
Further, the method for finding blank positions in the typesetting template, randomly filling in the chemical structural formula images and labeling them comprises the following steps:
c-1. Randomly take out a generated chemical structural formula image and, after random scaling, place it at a blank position outside the text areas, obtaining the data part of the training set T-2, as shown in figure 4.
c-2. Label, pixel by pixel, the positions occupied by the chemical structural formula image, obtaining the label part of the training set T-2, as shown in figure 5.
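Steps c-1 and c-2 can be sketched together: paste a scaled formula image into a blank region of the template page and record, pixel by pixel, which positions it occupies. A minimal sketch, assuming images are NumPy arrays and the blank position (`top`, `left`) has already been chosen:

```python
import numpy as np

def paste_and_label(page, formula, top, left):
    """Paste a formula image into a blank page region and build the
    pixel-wise label mask described in c-1/c-2 (1 = formula pixel)."""
    h, w = formula.shape[:2]
    data = page.copy()
    data[top:top + h, left:left + w] = formula       # data part of T-2
    label = np.zeros(page.shape[:2], dtype=np.uint8)
    label[top:top + h, left:left + w] = 1            # label part of T-2
    return data, label
```

Repeating this for several formulas per page yields the image/mask pairs that the network is trained on.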
Further, the improved ResUNet neural network is implemented as:
The training set T serves as the input image of the improved Resunet neural network, with input size 512×512×3. The first layer applies a 7×7 convolution and outputs a feature map res-1 of size 256×256×64. A 3×3 max pooling follows, then three repetitions of [1×1, 3×3, 1×1] convolutions (9 convolutions in total) output a feature map res-2 of size 128×128×256. Four repetitions of [1×1, 3×3, 1×1] convolutions (12 in total) output a feature map res-3 of size 64×64×512, and six repetitions (18 in total) output a feature map res-4 of size 32×32×1024. Three further repetitions (9 in total) output a feature map res-5 of size 16×16×2048. A 1×1 convolution then outputs a feature map conv-1 of size 16×16×1024. Next, 2×2 up-sampling produces a feature map up-1, which is concatenated with res-4 to obtain a feature map concat-1 of size 32×32×2048; a 3×3 convolution outputs a feature map conv-2 of size 32×32×512. Another 2×2 up-sampling produces up-2, which is concatenated with res-3 to obtain concat-2 of size 64×64×1024; a 3×3 convolution outputs conv-3 of size 64×64×256. Another 2×2 up-sampling produces up-3, which is concatenated with res-2 to obtain concat-3 of size 128×128×512; a 3×3 convolution outputs conv-4 of size 128×128×64. A final 2×2 up-sampling produces up-4, which is concatenated with res-1 to obtain concat-4 of size 256×256×128; a 3×3 convolution outputs conv-5 of size 256×256×64. Finally, 2×2 up-sampling and a 1×1 convolution output a 512×512×2 result map matching the original input image size.
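The sizes above follow a ResNet-50-style encoder and a U-Net-style decoder. The size bookkeeping can be checked with a small function; this traces spatial sizes and channel counts only and is not an implementation of the network:

```python
def trace_shapes(size=512):
    """Trace the feature-map sizes of the improved Resunet described above:
    a ResNet-50-style encoder halving the spatial size at each stage, then a
    decoder doubling it back up via 2x2 up-sampling and skip concatenations."""
    shapes = {}
    s = size // 2                 # first-layer 7x7 stride-2 convolution
    shapes["res-1"] = (s, s, 64)
    s //= 2                       # 3x3 max pooling
    shapes["res-2"] = (s, s, 256)
    for name, ch in (("res-3", 512), ("res-4", 1024), ("res-5", 2048)):
        s //= 2                   # each further bottleneck stage halves size
        shapes[name] = (s, s, ch)
    shapes["conv-1"] = (s, s, 1024)   # 1x1 convolution at the bottleneck
    for name, ch in (("conv-2", 512), ("conv-3", 256),
                     ("conv-4", 64), ("conv-5", 64)):
        s *= 2                    # 2x2 up-sampling + skip concat + 3x3 conv
        shapes[name] = (s, s, ch)
    return shapes
```

Running `trace_shapes(512)` reproduces every size quoted in the description, e.g. res-5 at 16×16×2048 and conv-5 at 256×256×64, before the final up-sampling restores the 512×512 input resolution.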
The invention has the following beneficial effects:
Building on the Resunet neural network, the invention provides an improved Resunet neural network together with a method for automatically generating a large number of chemical structural formula training samples, so that the Resunet neural network can segment chemical structural formulas and a large amount of data can improve the recognition accuracy of the neural network.
Drawings
FIG. 1 is a schematic flow diagram of a modified Resunet neural network of the present invention;
FIG. 2 is a schematic illustration of a manually labeled template sample according to the present invention;
FIG. 3 is a schematic representation of a template sample after random text population in accordance with the present invention;
FIG. 4 is a schematic representation of a template sample after random filling of chemical formulas in accordance with the present invention;
FIG. 5 is a schematic diagram of a corresponding label sample of the template of the present invention.
Detailed Description
In order to more specifically describe the present invention, a chemical structural formula segmentation method based on a Resunet neural network according to the present invention is described in detail below with reference to the accompanying drawings and the detailed description.
A chemical structural formula segmentation method based on a Resunet neural network comprises the following steps:
Step (1): construct a training set T, wherein the training set T comprises a manually labeled training set T-1 and an automatically generated training set T-2. Chemical formulas in publications are manually labeled to form the training set T-1, and the training set T-2 is generated using the method for automatically generating a chemical structural formula training set; the capacity ratio of the training set T-1 to the training set T-2 is 1:50;
the method for automatically generating the chemical structural training set is used for generating the training set based on the random filling of the images of the typesetting templates, and the construction method comprises the following steps:
a. Construct a typesetting template and randomly generate text data in its text areas: manually calibrate the text areas in 200 pages of publications and expand the data by rotation and up-down and left-right flipping, generating a total of 1000 pages of templates (a manually labeled template is shown in figure 2); then use Internet text and text produced by a random text generator as text data and randomly fill it into the text areas of the typesetting template (the generated result is shown in figure 3).
b. Generate a large number of chemical structural formula images: from the 57 million molecular entries available in the PubChem database, randomly render part of the molecular data with the Indigo software into 256×256-pixel 3-channel PNG format images of various styles (bond width, character size, etc.), then expand the data by rotating the images and flipping them up-down and left-right, generating 100,000 small-molecule chemical structural formula images.
c. Find blank positions in the typesetting template, randomly fill in chemical structural formulas, and label them: randomly take out a generated chemical structural formula image and, after random scaling, place it at a blank position outside the text areas, as shown in figure 4. Label, pixel by pixel, the positions occupied by the chemical structural formula image, as shown in figure 5.
Step (2): construct the improved Resunet neural network shown in fig. 1, feed the training data set into it for training, and save the trained model once the specified number of training iterations is reached or the loss curve no longer decreases and the accuracy no longer improves;
Further, the improved Resunet neural network is implemented as follows: the training set T serves as the input image of the improved Resunet neural network, with input size 512×512×3. The first layer applies a 7×7 convolution and outputs a feature map res-1 of size 256×256×64. A 3×3 max pooling follows, then three repetitions of [1×1, 3×3, 1×1] convolutions (9 convolutions in total) output a feature map res-2 of size 128×128×256. Four repetitions of [1×1, 3×3, 1×1] convolutions (12 in total) output a feature map res-3 of size 64×64×512, and six repetitions (18 in total) output a feature map res-4 of size 32×32×1024. Three further repetitions (9 in total) output a feature map res-5 of size 16×16×2048. A 1×1 convolution then outputs a feature map conv-1 of size 16×16×1024. Next, 2×2 up-sampling produces a feature map up-1, which is concatenated with res-4 to obtain a feature map concat-1 of size 32×32×2048; a 3×3 convolution outputs a feature map conv-2 of size 32×32×512. Another 2×2 up-sampling produces up-2, which is concatenated with res-3 to obtain concat-2 of size 64×64×1024; a 3×3 convolution outputs conv-3 of size 64×64×256. Another 2×2 up-sampling produces up-3, which is concatenated with res-2 to obtain concat-3 of size 128×128×512; a 3×3 convolution outputs conv-4 of size 128×128×64. A final 2×2 up-sampling produces up-4, which is concatenated with res-1 to obtain concat-4 of size 256×256×128; a 3×3 convolution outputs conv-5 of size 256×256×64. Finally, 2×2 up-sampling and a 1×1 convolution output a 512×512×2 result map matching the original input image size.
An improved ResUNet neural network was constructed according to the following table:
Step (3): segment using the neural network trained in step (2) to obtain the segmentation results.
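The 512×512×2 result map of step (3) can be reduced to a per-pixel segmentation mask by taking an argmax over the two channels. A minimal sketch, assuming (the patent does not state this) that channel 1 scores "chemical structural formula" pixels and channel 0 scores background:

```python
import numpy as np

def result_to_mask(result):
    """Turn an H x W x 2 result map into a binary segmentation mask by
    picking, for each pixel, the channel with the higher score."""
    return np.argmax(result, axis=-1).astype(np.uint8)
```

The resulting mask marks the regions that would then be cropped out and passed to a chemical structure recognition engine.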

Claims (1)

1. A chemical structural formula segmentation method based on the Resunet neural network, characterized by comprising the following steps:
constructing a training set T, wherein the training set T comprises a manual labeling training set T-1 and an automatic generating training set T-2;
step (2), feeding the training set T into the Resunet neural network for training, and saving the trained Resunet neural network model once the specified number of training iterations is reached or the loss curve no longer decreases and the accuracy no longer improves;
step (3), segmenting the chemical structural formula by using the Resunet neural network model trained in the step (2);
the training set T-2 is generated by a method for automatically generating a chemical structural formula training set and randomly filling images based on typesetting templates, and the construction method comprises the following steps:
a. constructing a typesetting template, and randomly generating text data in a text area;
b. generating a plurality of chemical structural formula images;
c. searching blank positions in the typesetting template, randomly filling chemical structural formula images and marking;
taking the manually labeled chemical structural formulas in publications as the training set T-1, wherein the capacity ratio of the training set T-1 to the training set T-2 is 1:50;
the method for constructing the typesetting template comprises the following steps:
a-1, manually calibrating the text areas in 200 pages of publications and expanding the data by rotation and up-down and left-right flipping, generating a total of 1000 pages of typesetting templates;
a-2, taking Internet text and text produced by a random text generator as text data, and randomly filling the text data into the text areas of the typesetting template;
the method for generating a large number of chemical structural formula images comprises the following steps:
b-1, randomly rendering, using the Indigo software, part of the molecular data among the 57 million molecular entries available in the PubChem database into various styles of 256×256-pixel 3-channel PNG format images;
b-2, expanding the data by rotating the images and flipping them up-down and left-right, generating 100,000 small-molecule chemical structural formula images;
the method for searching blank positions in the typesetting template to randomly fill chemical structural formula images and mark comprises the following steps:
c-1, randomly taking out a generated chemical structural formula image and, after random scaling, placing it at a blank position outside the text areas to obtain the data part of the training set T-2;
c-2, labeling, pixel by pixel, the positions occupied by the chemical structural formula image to obtain the label part of the training set T-2;
the Resunet neural network is an improved Resunet neural network, which is realized by the following steps:
the training set T serves as the input image of the improved Resunet neural network, with input size 512×512×3. The first layer applies a 7×7 convolution and outputs a feature map res-1 of size 256×256×64. A 3×3 max pooling follows, then three repetitions of [1×1, 3×3, 1×1] convolutions (9 convolutions in total) output a feature map res-2 of size 128×128×256. Four repetitions of [1×1, 3×3, 1×1] convolutions (12 in total) output a feature map res-3 of size 64×64×512, and six repetitions (18 in total) output a feature map res-4 of size 32×32×1024. Three further repetitions (9 in total) output a feature map res-5 of size 16×16×2048. A 1×1 convolution then outputs a feature map conv-1 of size 16×16×1024. Next, 2×2 up-sampling produces a feature map up-1, which is concatenated with res-4 to obtain a feature map concat-1 of size 32×32×2048; a 3×3 convolution outputs a feature map conv-2 of size 32×32×512. Another 2×2 up-sampling produces up-2, which is concatenated with res-3 to obtain concat-2 of size 64×64×1024; a 3×3 convolution outputs conv-3 of size 64×64×256. Another 2×2 up-sampling produces up-3, which is concatenated with res-2 to obtain concat-3 of size 128×128×512; a 3×3 convolution outputs conv-4 of size 128×128×64. A final 2×2 up-sampling produces up-4, which is concatenated with res-1 to obtain concat-4 of size 256×256×128; a 3×3 convolution outputs conv-5 of size 256×256×64. Finally, 2×2 up-sampling and a 1×1 convolution output a 512×512×2 result map matching the original input image size.
CN202010419502.3A 2020-05-18 2020-05-18 Chemical structural formula segmentation method based on Resunet neural network Active CN111709293B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010419502.3A CN111709293B (en) 2020-05-18 2020-05-18 Chemical structural formula segmentation method based on Resunet neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010419502.3A CN111709293B (en) 2020-05-18 2020-05-18 Chemical structural formula segmentation method based on Resunet neural network

Publications (2)

Publication Number Publication Date
CN111709293A CN111709293A (en) 2020-09-25
CN111709293B (en) 2023-10-03

Family

ID=72538017

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010419502.3A Active CN111709293B (en) 2020-05-18 2020-05-18 Chemical structural formula segmentation method based on Resunet neural network

Country Status (1)

Country Link
CN (1) CN111709293B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114241505B (en) * 2021-12-20 2023-04-07 苏州阿尔脉生物科技有限公司 Method and device for extracting chemical structure image, storage medium and electronic equipment
CN114842486A (en) * 2022-07-04 2022-08-02 南昌大学 Handwritten chemical structural formula recognition method, system, storage medium and equipment

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1381795A (en) * 2001-04-18 2002-11-27 无敌科技(西安)有限公司 Automatic format setting method for palm-type browser
CN101561805A (en) * 2008-04-18 2009-10-21 日电(中国)有限公司 Document classifier generation method and system
WO2018138104A1 (en) * 2017-01-27 2018-08-02 Agfa Healthcare Multi-class image segmentation method
CN109087306A (en) * 2018-06-28 2018-12-25 众安信息技术服务有限公司 Arteries iconic model training method, dividing method, device and electronic equipment
CN109118491A (en) * 2018-07-30 2019-01-01 深圳先进技术研究院 A kind of image partition method based on deep learning, system and electronic equipment
CN109191476A (en) * 2018-09-10 2019-01-11 重庆邮电大学 The automatic segmentation of Biomedical Image based on U-net network structure
WO2019015785A1 (en) * 2017-07-21 2019-01-24 Toyota Motor Europe Method and system for training a neural network to be used for semantic instance segmentation
CN109658422A (en) * 2018-12-04 2019-04-19 大连理工大学 A kind of retinal images blood vessel segmentation method based on multiple dimensioned deep supervision network
US10304193B1 (en) * 2018-08-17 2019-05-28 12 Sigma Technologies Image segmentation and object detection using fully convolutional neural network
CN109903292A (en) * 2019-01-24 2019-06-18 西安交通大学 A kind of three-dimensional image segmentation method and system based on full convolutional neural networks
CN110210362A (en) * 2019-05-27 2019-09-06 中国科学技术大学 A kind of method for traffic sign detection based on convolutional neural networks
CN110705459A (en) * 2019-09-29 2020-01-17 北京爱学习博乐教育科技有限公司 Automatic identification method and device for mathematical and chemical formulas and model training method and device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109690554B (en) * 2016-07-21 2023-12-05 西门子保健有限责任公司 Method and system for artificial intelligence based medical image segmentation
CN109102543B (en) * 2018-08-17 2021-04-02 深圳蓝胖子机器智能有限公司 Object positioning method, device and storage medium based on image segmentation

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1381795A (en) * 2001-04-18 2002-11-27 无敌科技(西安)有限公司 Automatic format setting method for palm-type browser
CN101561805A (en) * 2008-04-18 2009-10-21 日电(中国)有限公司 Document classifier generation method and system
WO2018138104A1 (en) * 2017-01-27 2018-08-02 Agfa Healthcare Multi-class image segmentation method
WO2019015785A1 (en) * 2017-07-21 2019-01-24 Toyota Motor Europe Method and system for training a neural network to be used for semantic instance segmentation
CN109087306A (en) * 2018-06-28 2018-12-25 众安信息技术服务有限公司 Arterial vessel image model training method, segmentation method, device and electronic device
CN109118491A (en) * 2018-07-30 2019-01-01 深圳先进技术研究院 Image segmentation method, system and electronic device based on deep learning
US10304193B1 (en) * 2018-08-17 2019-05-28 12 Sigma Technologies Image segmentation and object detection using fully convolutional neural network
CN109191476A (en) * 2018-09-10 2019-01-11 重庆邮电大学 Automatic segmentation of biomedical images based on the U-net network structure
CN109658422A (en) * 2018-12-04 2019-04-19 大连理工大学 Retinal image blood vessel segmentation method based on a multi-scale deeply supervised network
CN109903292A (en) * 2019-01-24 2019-06-18 西安交通大学 Three-dimensional image segmentation method and system based on fully convolutional neural networks
CN110210362A (en) * 2019-05-27 2019-09-06 中国科学技术大学 Traffic sign detection method based on convolutional neural networks
CN110705459A (en) * 2019-09-29 2020-01-17 北京爱学习博乐教育科技有限公司 Automatic identification method and device for mathematical and chemical formulas and model training method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Yu Yuhai. Research on key technologies of image pattern recognition for medical literature. Information Science and Technology. 2019, (2), full text. *

Also Published As

Publication number Publication date
CN111709293A (en) 2020-09-25

Similar Documents

Publication Publication Date Title
Li et al. Tablebank: Table benchmark for image-based table detection and recognition
CN101253514B (en) Grammatical parsing of document visual structures
Li et al. Tablebank: A benchmark dataset for table detection and recognition
Rebelo et al. Optical music recognition: state-of-the-art and open issues
Shahab et al. An open approach towards the benchmarking of table structure recognition systems
JP3086702B2 (en) Method for identifying text or line figure and digital processing system
JP3822277B2 (en) Character template set learning machine operation method
CN108090400B (en) Image text recognition method and device
CN111709293B (en) Chemical structural formula segmentation method based on Resunet neural network
Clausner et al. Efficient and effective OCR engine training
EP1457917A2 (en) Apparatus and methods for converting network drawings from raster format to vector format
CN114005123A (en) System and method for digitally reconstructing layout of print form text
Karasneh et al. Img2uml: A system for extracting uml models from images
Chiang et al. Automatic and accurate extraction of road intersections from raster maps
Baluja Learning typographic style: from discrimination to synthesis
JP2022541199A (en) A system and method for inserting data into a structured database based on image representations of data tables.
CN109685061A (en) Recognition method for mathematical formulae suitable for structuring
WO2021068364A1 (en) Stroke skeleton information extracting method, apparatus, electronic device and storage medium
Lopresti et al. Issues in ground-truthing graphic documents
CN115019310B (en) Image-text identification method and equipment
CN111026899A (en) Product generation method based on deep learning
CN114861595B (en) Vector line transformation-based individual font generation method
CN116341489A (en) Text information reading method, device and terminal
CN114821222A (en) Test paper image generation method and device, storage medium and electronic equipment
Drapeau et al. Extraction of ancient map contents using trees of connected components

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant