CN114842486A - Handwritten chemical structural formula recognition method, system, storage medium and equipment - Google Patents
Handwritten chemical structural formula recognition method, system, storage medium and equipment
- Publication number
- CN114842486A (application number CN202210776419.0A)
- Authority
- CN
- China
- Prior art keywords
- image
- structural formula
- chemical structural
- chemical
- chemical structure
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/32—Digital ink
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/28—Quantising the image, e.g. histogram thresholding for discrimination between background and foreground patterns
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/30—Noise filtering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/19—Recognition using electronic means
- G06V30/191—Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
- G06V30/19147—Obtaining sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/32—Digital ink
- G06V30/333—Preprocessing; Feature extraction
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/32—Digital ink
- G06V30/36—Matching; Classification
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- Computing Systems (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Software Systems (AREA)
- General Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- General Engineering & Computer Science (AREA)
- Molecular Biology (AREA)
- Mathematical Physics (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Image Analysis (AREA)
Abstract
The invention provides a handwritten chemical structural formula recognition method, system, storage medium and equipment. The method comprises: obtaining a first image with a chemical structural formula and preprocessing the first image to obtain a second image, which helps improve the accuracy of recognizing the original handwritten chemical structural formula; inputting the second image into a target recognition model, recognizing the chemical structural formula and obtaining the corresponding chemical structure name label, which increases recognition speed and greatly improves recognition accuracy; and matching the corresponding chemical table file according to the chemical structure name label and generating standard information. This effectively solves the problems that handwritten chemical structural formula recognition is inefficient and inaccurate and that standard information cannot be generated.
Description
Technical Field
The invention belongs to the technical field of chemical structural formula recognition, and particularly relates to a handwritten chemical structural formula recognition method, a handwritten chemical structural formula recognition system, a storage medium and handwritten chemical structural formula recognition equipment.
Background
Chemistry is the science that studies the composition, properties, structure and transformation of substances, mainly at the molecular and atomic level. The natural laws governing molecules and atoms are closely related to the physical and chemical properties of substances in the macroscopic world in which people live, and the chemical structural formula is one way of expressing those laws, which makes them easier to understand and study.
Remarkable achievements have been made in the field of chemistry, to which the efforts of chemists have contributed greatly. To make their work more convenient, several methods for recognizing handwritten chemical structural formulas have been proposed, which can greatly assist in standardizing chemical structural formulas.
Although traditional recognition methods can recognize and standardize chemical structural formulas, differences in writing habits, lighting conditions during photographing and other factors make the recognition process slow and the recognition accuracy low, which seriously affects the user experience. In addition, the recognized chemical structural formula cannot directly provide the user with the standardized simplified molecular-input line-entry system (SMILES) representation, and the corresponding attribute data cannot be obtained.
Disclosure of Invention
Based on this, embodiments of the present invention provide a method, a system, a storage medium, and an apparatus for identifying a handwritten chemical structural formula, which are used to solve the problems in the prior art that the efficiency of identifying a handwritten chemical structural formula is low, the accuracy is poor, and standard information cannot be generated.
The first aspect of the embodiments of the present invention provides a method for identifying a handwritten chemical structural formula, where the method includes:
acquiring a first image with a chemical structural formula, and preprocessing the first image to obtain a second image;
inputting the second image into a target recognition model, recognizing the chemical structural formula and obtaining a corresponding chemical structure name label;
and matching the corresponding chemical table file according to the chemical structure name label, and generating standard information.
Further, before the step of acquiring a first image with a chemical structural formula and preprocessing the first image to obtain a second image, the method further comprises:
using a plurality of images with chemical structural formulas as an original training set, wherein the original training set is used for machine learning, and establishing an original training model;
and carrying out generalization processing on the original training model through a data enhancement technology to obtain a target recognition model.
Further, the step of acquiring a first image with a chemical structural formula and preprocessing the first image to obtain a second image includes:
normalizing the first image to obtain a first sub-image;
carrying out binarization processing on the first sub-image to obtain a chemical structure image and a background image;
and denoising the chemical structure image to obtain the second image.
Further, the step of denoising the chemical structure image to obtain the second image includes:
acquiring any pixel point in the chemical structure image and a first pixel value corresponding to the pixel point;
according to the pixel point and the first pixel value, carrying out weighted average on the first pixel value and other pixel values in the pixel point neighborhood corresponding to the first pixel value to obtain a second pixel value;
and changing the first pixel value into the second pixel value, and acting on the corresponding pixel point to obtain the second image.
Further, the step of inputting the second image into a target recognition model, recognizing the chemical structural formula, and obtaining a corresponding chemical structure name label includes:
traversing the second image to obtain a first characteristic parameter, wherein the traversal formula is as follows:
wherein Q(x, y) represents the first characteristic parameter of the pixel point with coordinates (x, y), I(x + i, y + j) represents the second pixel value of the pixel point with coordinates (x + i, y + j), and K(x + i, y + j) represents the weight of the pixel point with coordinates (x + i, y + j);
performing down-sampling processing on the first characteristic parameter to obtain a second characteristic parameter, wherein the formula of the down-sampling processing is as follows:
and performing full-connection processing on the second characteristic parameters, and respectively calculating score values of the second characteristic parameters, wherein the full-connection processing formula is as follows:
where Y_z represents the score value of the z-th second characteristic parameter, W_z represents the weight coefficient of the z-th second characteristic parameter, Q_z represents the z-th second characteristic parameter, b_z represents the bias value of the z-th second characteristic parameter, and n represents the total number of second characteristic parameters;
and acquiring the chemical structural formula according to the second characteristic parameter and the corresponding score value.
Further, the step of matching the corresponding chemical table file according to the chemical structure name label and generating standard information further includes:
establishing a chemical table file library, and at the same time establishing a mapping relation between the chemical structure name label and the chemical table file.
Further, the step of performing neural network convolution processing on the second image to recognize the chemical structural formula and obtain the corresponding chemical structure name label further includes:
acquiring the chemical structural formula, and acquiring the element types of the chemical structural formula and the quantity corresponding to the element types according to the chemical structural formula;
judging whether the element types contain target element types or not;
if yes, judging whether the element type belongs to a preset element type;
if yes, judging whether the chemical structural formula is correct or not according to the number of the target element types in the element types;
when the chemical structural formula is judged to be correct, matching the chemical structure name label corresponding to the chemical structural formula;
and when the chemical structural formula is judged to be incorrect, matching a target chemical structural formula in a database according to the element types in the chemical structural formula and the quantity corresponding to the element types.
A second aspect of an embodiment of the present invention provides a handwritten chemical structural formula recognition system, including:
a preprocessing module, used for acquiring a first image with a chemical structural formula and preprocessing the first image to obtain a second image;
a recognition module, used for performing neural network convolution processing on the second image, recognizing the chemical structural formula and obtaining a corresponding chemical structure name label;
and a standard information generating module, used for matching the corresponding chemical table file according to the chemical structure name label and generating standard information.
A third aspect of embodiments of the present invention provides a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the handwritten chemical structural formula recognition method provided in the first aspect.
A fourth aspect of embodiments of the present invention provides an apparatus comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the handwritten chemical structural formula recognition method provided in the first aspect when executing the computer program.
The beneficial effects of the invention are as follows: a first image with a chemical structural formula is obtained and preprocessed to obtain a second image, which helps improve the accuracy of recognizing the original handwritten chemical structural formula; the second image is then input into a target recognition model, which recognizes the chemical structural formula and obtains the corresponding chemical structure name label.
Drawings
FIG. 1 is a flowchart of a handwritten chemical structural formula recognition method according to a first embodiment of the present invention;
FIG. 2 is a block diagram of a handwritten chemical structural formula recognition system according to a third embodiment of the present invention;
FIG. 3 is a block diagram of a handwritten chemical structural formula recognition apparatus according to a fourth embodiment of the present invention.
The following detailed description further explains the invention with reference to the above drawings.
Detailed Description
To facilitate an understanding of the invention, the invention will now be described more fully with reference to the accompanying drawings. Several embodiments of the invention are presented in the drawings. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.
It will be understood that when an element is referred to as being "secured to" another element, it can be directly on the other element or intervening elements may also be present. When an element is referred to as being "connected" to another element, it can be directly connected to the other element or intervening elements may also be present. The terms "vertical," "horizontal," "left," "right," and the like as used herein are for illustrative purposes only.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
Example one
Referring to fig. 1, fig. 1 is a flowchart illustrating an implementation of a handwritten chemical structural formula recognition method according to a first embodiment of the present invention, where the method specifically includes steps S01 to S03.
Step S01, acquiring a first image with a chemical structural formula, and preprocessing the first image to obtain a second image.
It should be noted that a plurality of images with chemical structural formulas are used as an original training set for machine learning, an original training model is established, and the original training model is generalized through data enhancement to obtain the target recognition model. Each image contains only one hand-drawn chemical structural formula. In this embodiment, 24136 handwritten pictures are used as the original training set. To increase the number of training samples and to add an appropriate amount of interference that improves the generalization ability of the trained model, data enhancement techniques including rotation, flipping and noise addition are applied, and the training set is finally expanded to 49560 images. These 49560 handwritten chemical structural formulas obtained through the generalization processing are used for deep learning to establish the target recognition model. The generalization processing not only increases the number of training samples but also adds more features, so that in the subsequent recognition process a handwritten chemical structural formula can still be judged accurately when it is sloppy or flipped, which enhances robustness.
After the target recognition model is established, a first image with a chemical structural formula, namely the original image, is first acquired and preprocessed. During preprocessing, the first image is cropped to a uniform size of 32 × 32, and normalization limits the preprocessed data to the range [0, 1], which eliminates the adverse effects of singular sample data. Binarization and denoising are then applied in turn. Binarization yields a clearer picture in which only the chemical structure image is pure black and the remaining background image is pure white. The chemical structure image is then denoised by Gaussian filtering, implemented as a discretized sliding-window convolution: for every pixel point in the chemical structure image, the pixel value is replaced by the weighted average of itself and the other pixel values in its neighborhood. Applying the updated pixel values to the corresponding pixel points gives the second image, which completes the preprocessing.
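The three augmentation operations named above (rotation, flipping, noise addition) can be sketched in plain Python on an image stored as a list of rows. This is only an illustration: the function names, the 90-degree rotation angle, and the 5% noise probability are assumptions, since the text does not state which angles or noise levels were used.

```python
import random

def rotate90(img):
    """Rotate a 2D image (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def flip_horizontal(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]

def add_noise(img, prob, rng):
    """Invert each binary pixel (0 or 1) with probability `prob`."""
    return [[1 - p if rng.random() < prob else p for p in row] for row in img]

def augment(img, rng):
    """Return augmented variants of one training image (assumed settings)."""
    return [rotate90(img), flip_horizontal(img), add_noise(img, 0.05, rng)]

sample = [[1, 0], [0, 0]]
variants = augment(sample, random.Random(0))
```

Each original picture yields several variants, which is how a training set can grow without new handwriting being collected.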
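A minimal sketch of the preprocessing chain described above (normalization to [0, 1], binarization, Gaussian denoising), written in plain Python. The 0.5 threshold, the 3 × 3 kernel with weights 1-2-1 / 2-4-2 / 1-2-1, and the edge-replication border handling are assumptions, and the input is assumed to be an 8-bit grayscale image already cropped to 32 × 32.

```python
def normalize(gray):
    """Scale 8-bit gray values into the range [0, 1]."""
    return [[p / 255.0 for p in row] for row in gray]

def binarize(img, threshold=0.5):
    """Dark strokes become 0.0 (pure black), background becomes 1.0 (pure white)."""
    return [[0.0 if p < threshold else 1.0 for p in row] for row in img]

# Assumed discrete Gaussian window; the weights sum to 16.
GAUSS_3X3 = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]

def gaussian_filter(img):
    """Replace each pixel by the weighted average of its 3x3 neighborhood."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for j in (-1, 0, 1):
                for i in (-1, 0, 1):
                    yy = min(max(y + j, 0), h - 1)  # replicate edge pixels
                    xx = min(max(x + i, 0), w - 1)
                    acc += GAUSS_3X3[j + 1][i + 1] * img[yy][xx]
            out[y][x] = acc / 16.0
    return out

def preprocess(gray):
    """First image (gray levels) -> second image (denoised, values in [0, 1])."""
    return gaussian_filter(binarize(normalize(gray)))

demo = [[255, 255, 255], [255, 0, 255], [255, 255, 255]]
second = preprocess(demo)
```

In the small demo, the single black pixel is smoothed toward its white neighbors, which is the weighted-average update the paragraph describes.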
Step S02, inputting the second image into a target recognition model, recognizing the chemical structural formula and obtaining a corresponding chemical structure name label.
It should be noted that, in the identification process, the second image is traversed to obtain the first characteristic parameter, where the traversal formula is:
wherein Q(x, y) represents the first characteristic parameter of the pixel point with coordinates (x, y), I(x + i, y + j) represents the second pixel value of the pixel point with coordinates (x + i, y + j), and K(x + i, y + j) represents the weight of the pixel point with coordinates (x + i, y + j). In this embodiment, the weight K(x, y) at the center of the convolution kernel is set to 1, and the first characteristic parameter Q(x, y) is calculated by traversing the pixel points in the image with i = -1, 0, 1 and j = -1, 0, 1. Specifically, in order to improve the performance of the overall chemical structure recognition, reduce the dependence on real-time data, and support larger-scale chemical structure recognition, 5 convolution layers are used with a convolution kernel size of 3 × 3; a fixed 3 × 3 window slides one unit to the right each time to identify the first characteristic parameter. After the convolution is finished, the image size is calculated by the following formula:
wherein the formula gives the output image size in terms of the input image size, the convolution kernel size F × F, the stride S of the convolution kernel movement, and the padding value P, which denotes the number of padding pixels. In this embodiment, the input image is the second image.
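The traversal and the size calculation can be sketched as follows. Two assumptions are made here because the formulas themselves are not reproduced in the text: the first characteristic parameter is taken to be the weighted sum of the 3 × 3 neighborhood, Q(x, y) = Σ I(x + i, y + j) · K over i, j in {-1, 0, 1}, and the output size is taken to be the standard relation (N - F + 2P) / S + 1, where N is a name introduced here for the input side length.

```python
def conv_output_size(n_in, f, s, p):
    """Standard output side length for kernel F, stride S, padding P (assumed)."""
    return (n_in - f + 2 * p) // s + 1

def conv3x3(img, kernel, stride=1):
    """Weighted sum over each 3x3 window, without padding.

    kernel[j + 1][i + 1] is the weight applied to pixel (x + i, y + j).
    """
    h, w = len(img), len(img[0])
    out = []
    for y in range(1, h - 1, stride):
        row = []
        for x in range(1, w - 1, stride):
            row.append(sum(img[y + j][x + i] * kernel[j + 1][i + 1]
                           for j in (-1, 0, 1) for i in (-1, 0, 1)))
        out.append(row)
    return out

image = [[1.0] * 32 for _ in range(32)]           # stand-in for a second image
box_kernel = [[1.0] * 3 for _ in range(3)]        # hypothetical weights
center_only = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]   # center weight 1, as in the text
features = conv3x3(image, box_kernel)
```

With a 32 × 32 input, a 3 × 3 kernel, stride 1 and no padding, the sketch produces a 30 × 30 feature map, matching `conv_output_size(32, 3, 1, 0)`.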
Further, the first characteristic parameter is subjected to down-sampling processing to obtain a second characteristic parameter, and the formula of the down-sampling processing is as follows:
wherein the result of the down-sampling is the second characteristic parameter. Max pooling is used, that is, the point with the maximum value in the local receptive field is taken. Specifically, the picture is compressed and the number of parameters is reduced without affecting the image quality: the pooling layer uses 3 × 3 pooling, so that every 3 × 3 sub-matrix of the first characteristic parameters is reduced to a single element, namely the maximum value within the 3 × 3 target region, which reduces the dimension of the matrix. After the pooling is finished, the image size is calculated by the following formula:
wherein the formula gives the output image size in terms of the input image size, the kernel size F × F and the stride S of the kernel movement. In this embodiment, the input image is the convolved image of the second image.
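A minimal sketch of the 3 × 3 max pooling step, assuming non-overlapping windows (stride 3), which is one reading of "every 3 × 3 sub-matrix is changed into one element". The size helper uses the standard relation (N - F) / S + 1; N is a name introduced here for the input side length.

```python
def pool_output_size(n_in, f, s):
    """Standard output side length for a pooling window F and stride S (assumed)."""
    return (n_in - f) // s + 1

def max_pool(feature_map, f=3, s=3):
    """Keep the maximum of every f x f window, moving s pixels at a time."""
    h, w = len(feature_map), len(feature_map[0])
    return [[max(feature_map[y + j][x + i] for j in range(f) for i in range(f))
             for x in range(0, w - f + 1, s)]
            for y in range(0, h - f + 1, s)]

# A 30 x 30 map whose entries increase left to right and top to bottom.
fmap = [[y * 30 + x for x in range(30)] for y in range(30)]
pooled = max_pool(fmap)
```

For the 30 × 30 map used here, the pooled result is 10 × 10, and each entry is the bottom-right value of its window because the test map is increasing.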
Further, the second characteristic parameters are subjected to full-connection processing, and the score values of the second characteristic parameters are respectively calculated, wherein the formula of the full-connection processing is as follows:
where Y_z represents the score value of the z-th second characteristic parameter, W_z represents the weight coefficient of the z-th second characteristic parameter, Q_z represents the z-th second characteristic parameter, b_z represents the bias value of the z-th second characteristic parameter, and n represents the total number of second characteristic parameters. The chemical structural formula corresponding to the second characteristic parameter with the highest score value is then obtained according to the second characteristic parameters and the corresponding score values, and the chemical structure name label corresponding to that chemical structural formula is obtained. In addition, in some other embodiments, after the full-connection processing, the full-connection result is input to an activation layer, and a Sigmoid function is used to map the input value into [0, 1]. The specific calculation formula is as follows:
f(Y) = 1/(1 + exp(-Y));
where Y is the output result of the fully connected layer, namely the score value of the second characteristic parameter.
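The scoring step can be sketched as below. Since the full-connection formula is not reproduced in the text, the sketch assumes the per-parameter form Y_z = W_z · Q_z + b_z for z = 1..n, which is only one reading of the symbol definitions; the Sigmoid function is the one stated in the text. All numeric values are made up for illustration.

```python
import math

def fc_scores(q, w, b):
    """Assumed reading: Y_z = W_z * Q_z + b_z for each second characteristic parameter."""
    return [wz * qz + bz for wz, qz, bz in zip(w, q, b)]

def sigmoid(y):
    """f(Y) = 1 / (1 + exp(-Y)), mapping a score into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-y))

def best_index(scores):
    """Index z of the highest score value."""
    return max(range(len(scores)), key=lambda z: scores[z])

q = [1.0, 2.0, 0.5]      # hypothetical second characteristic parameters
w = [0.5, 0.5, 2.0]      # hypothetical weight coefficients
b = [0.0, 1.0, -1.0]     # hypothetical bias values
ys = fc_scores(q, w, b)
probs = [sigmoid(y) for y in ys]
winner = best_index(ys)
```

Because the Sigmoid function is monotonic, picking the highest raw score and picking the highest mapped value select the same entry.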
Step S03, matching the corresponding chemical table file according to the chemical structure name label, and generating the standard information.
In this embodiment, a chemical table file library needs to be established first, together with a mapping relationship between chemical structure name labels and chemical table files. That is, the chemical structure name label serves as a key, and once the chemical structure name label is obtained, the corresponding chemical table file can be output.
Specifically, after the handwritten chemical structural formula is recognized, its label is matched with the corresponding sdf file. Because the sdf file of each chemical structural formula has a corresponding SMILES, the RDKit tool in Python is used to convert the sdf file corresponding to the chemical structural formula into SMILES standard information. In addition, chemical information corresponding to the chemical structural formula, such as the molecular weight, SMARTS and a 3D display, can be obtained from the sdf file. In some other optional embodiments, after the handwritten chemical structural formula is recognized, a standardized chemical structure can be generated from the SMILES and stored as a picture, which can then be inserted into a document.
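The label lookup and the sdf-to-SMILES conversion might look like the sketch below. The library contents, file paths and function names are hypothetical; the RDKit calls used (`Chem.SDMolSupplier`, `Chem.MolToSmiles`, `Descriptors.MolWt`) are existing RDKit functions, and RDKit is imported inside the function so that the lookup part works without it installed.

```python
# Hypothetical chemical table file library: name label -> sdf file path.
SDF_LIBRARY = {
    "ethanol": "sdf_library/ethanol.sdf",
    "methanol": "sdf_library/methanol.sdf",
}

def sdf_path_for(label):
    """Return the sdf path mapped to a chemical structure name label, or None."""
    return SDF_LIBRARY.get(label)

def standard_info(sdf_path):
    """Read the first molecule of an sdf file and derive standard information."""
    from rdkit import Chem                  # requires RDKit to be installed
    from rdkit.Chem import Descriptors

    supplier = Chem.SDMolSupplier(sdf_path)
    mol = supplier[0]
    if mol is None:
        raise ValueError("could not parse molecule from " + sdf_path)
    return {
        "smiles": Chem.MolToSmiles(mol),
        "molecular_weight": Descriptors.MolWt(mol),
    }

path = sdf_path_for("ethanol")
```

Calling `standard_info(path)` on a real sdf file would return the SMILES string and the molecular weight for the recognized structure.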
In summary, in the handwritten chemical structural formula recognition method of the above embodiment of the present invention, a first image with a chemical structural formula is obtained and preprocessed to obtain a second image, which helps improve the accuracy of recognizing the original handwritten chemical structural formula. The second image is input into the target recognition model, and the chemical structural formula and its corresponding chemical structure name label are recognized, which increases the recognition speed and greatly improves the recognition precision. The corresponding chemical table file is matched according to the chemical structure name label and the standard information is generated, which effectively solves the problems that handwritten chemical structural formula recognition is inefficient and imprecise and that standard information cannot be generated.
Example two
The second embodiment of the invention provides a handwritten chemical structural formula recognition method, which specifically comprises a step S20 to a step S22.
Step S20, acquiring a first image with a chemical structural formula, and preprocessing the first image to obtain a second image.
It can be understood that the first image can be captured with a mobile phone and stored in the phone's album. Specifically, by calling the camera and album-access functions of the Android system, the handwritten chemical structural formula image to be recognized can be photographed with the phone, or a previously photographed image can be selected from the album. After the chemical structural formula is recognized, several possible recognition results are offered to the user. After the user selects the required chemical formula, the chemical structure is converted into a standard format such as SMILES or InChI, and the molecular weight, the molecular complexity and the like can be calculated at the same time.
Step S21, inputting the second image into a target recognition model, recognizing the chemical structural formula and obtaining a corresponding chemical structure name label.
Specifically, after the chemical structural formula is identified, the element types of the chemical structural formula and the corresponding quantity of the element types are obtained according to the chemical structural formula, for example, when the user wants to write the chemical structural formula of ethanol, the user will input C for the reasons of writing errors and the like 2 H 5 OH is written as C 2 H 6 OH, identified chemical formula as C 2 H 6 OH, wherein there are 3 element types, i.e. C, H and O, the number of C elements is 2, the number of H elements is 7, the number of O elements is 1, and then it is determined whether the element type includes the target element type, in this embodiment, if the target element type is H element, then it is determined whether the element type belongs to the predetermined element type, specifically, C, H and O, C, the predetermined element type 2 H 6 OH meets the requirements, further, depending on the element speciesThe number of target element species of (1) is judged to be correct, wherein the number of C element is 2, the number of H element is 7, and the number of O element is 1, it should be noted that, when the compound consists of only C, H and O, the number of H should be even since C and O are tetravalent and divalent, respectively, and are even valences, and H is monovalent, the number of H should be even, so C 2 H 6 OH is a wrong chemical structure and cannot match C 2 H 6 The chemical structure name label corresponding to the OH chemical structure formula is used for matching a target chemical structure formula in a database according to the types of elements in the chemical structure formula and the corresponding quantity of the element types in order to obtain the chemical structure formula required by a user, wherein the database comprises all known chemical structure formulas and corresponding related information thereof, and specifically C is used 2 H 6 For example, OH is used, the matching method is 
to know that the number of C elements is 2, the number of H elements is 7, and the number of O elements is 1, match in the database that the types of the elements are consistent with those in the chemical structural formula, that is, C, H and O, and then obtain all correct target chemical structural formulas with the numbers of the element types within a preset range, in this embodiment, the preset range is ± 1, it can be understood that the range of C number is 1 to 3, the range of H number is 6 to 8, and the range of O number is 0 to 2, wherein according to the above conditions, the target chemical structural formula can be C 2 H 5 OH、C 3 H 5 OH, etc., wherein C 2 H 5 The number of C elements in OH is 2, the number of H elements is 6, the number of O elements is 1, C 2 H 5 The number of H elements in OH is larger than that of C 2 H 6 The number of H elements in OH is less than 1; c 3 H 5 The number of C elements in OH is 3, the number of H elements is 6, the number of O elements is 1, C 3 H 5 The number of C elements in OH is greater than that of C 2 H 6 The number of C elements in OH is more than 1, and the chemical structural formula C required by a user is included 2 H 5 OH, completing intelligent error correction of the chemical structural formula.
Similarly, when the element types of the recognized chemical structural formula include an X element in addition to the C, H and O elements, the judgment is made according to the valence of the X element: when the valence of the X element is odd and the number of X elements is also odd, the number of H elements is odd; when the valence of the X element is odd and the number of X elements is even, the number of H elements is even; when the valence of the X element is even, the number of H elements is even.
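The three cases above reduce to one statement: the number of H elements has the same parity as the product of the valence of X and the number of X elements. A minimal sketch of that check follows; the function names are hypothetical, and the valence of X is assumed to be supplied by the caller (here Cl is treated as monovalent and S as divalent).

```python
def required_h_parity(x_valence, x_count):
    """Return 'odd' or 'even' for the H count in a compound of C, H, O and X."""
    return "odd" if (x_valence * x_count) % 2 == 1 else "even"

def h_count_consistent(h_count, x_valence, x_count):
    """Check an observed H count against the parity the rule requires."""
    expected_odd = (x_valence * x_count) % 2 == 1
    return (h_count % 2 == 1) == expected_odd

# Chloroethane C2H5Cl: odd valence, odd count, so H must be odd.
print(h_count_consistent(5, x_valence=1, x_count=1))  # True
# Dichloromethane CH2Cl2: odd valence, even count, so H must be even.
print(h_count_consistent(2, x_valence=1, x_count=2))  # True
# Dimethyl sulfide C2H6S: even valence, so H must be even.
print(h_count_consistent(6, x_valence=2, x_count=1))  # True
# A miswritten C2H6Cl fails the check.
print(h_count_consistent(6, x_valence=1, x_count=1))  # False
```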
In addition, in some other alternative embodiments, when the handwritten chemical structural formula is written irregularly, for example with blurred or skewed strokes, the TOP-K recognition results (that is, the K most probable recognition results) are output at recognition time and error correction is performed on them, so that the user is offered SMILES strings that are as correct as possible and the weakness of the model on adversarial images is mitigated.
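Selecting the TOP-K results can be sketched as below. This is only an illustration: the function name `top_k`, the candidate SMILES strings and their scores are hypothetical, and the source does not state how the model scores its candidates.

```python
import heapq

def top_k(scores, k):
    """Return the k highest-scoring (candidate, score) pairs, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])

# Hypothetical recognition scores for candidate SMILES strings.
candidate_scores = {"CCO": 0.62, "CCC": 0.21, "CO": 0.12, "C": 0.05}

for smiles, score in top_k(candidate_scores, k=2):
    print(smiles, score)
```

Each returned candidate could then be passed through the error-correction check before being shown to the user.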
Step S22: matching the corresponding chemical table file according to the chemical structure name label, and generating the specification information.
In summary, in the handwritten chemical structural formula recognition method of the above embodiment of the present invention, a first image with a chemical structural formula is acquired and preprocessed to obtain a second image, which helps improve the accuracy of recognizing the original handwritten chemical structural formula. The second image is subjected to neural network convolution processing to recognize the chemical structural formula and obtain the corresponding chemical structure name label, which increases the recognition speed and greatly improves the recognition precision. The corresponding chemical table file is then matched according to the chemical structure name label, and the specification information is generated. This effectively addresses the problems of low recognition efficiency, poor precision and the inability to generate specification information for handwritten chemical structural formulas. Specifically, by obtaining the element types in the chemical structural formula and their corresponding numbers, whether the chemical structural formula is written correctly is judged and the chemical structural formula can be corrected, so that recognition is more intelligent and the user experience is better.
EXAMPLE III
Referring to fig. 2, fig. 2 is a block diagram of a handwritten chemical structural formula recognition system according to the third embodiment of the present invention. The handwritten chemical structural formula recognition system 300 includes: a preprocessing module 31, a recognition module 32, and a specification information generating module 33, wherein:
the preprocessing module 31 is configured to obtain a first image with a chemical structural formula, and preprocess the first image to obtain a second image;
the recognition module 32 is configured to input the second image into a target recognition model, recognize the chemical structural formula, and obtain a corresponding chemical structure name label;
and the specification information generating module 33 is configured to match the corresponding chemical table file according to the chemical structure name label, and generate specification information.
Further, in some alternative embodiments of the present invention, the handwritten chemical structural formula recognition system 300 further includes:
the training module is used for taking a plurality of images with chemical structural formulas as an original training set, wherein the original training set is used for machine learning and establishing an original training model;
and the target recognition model establishing module is used for generalizing the original training model through a data enhancement technology to obtain a target recognition model.
Further, in some optional embodiments of the present invention, the preprocessing module 31 includes:
the normalization processing unit is used for performing normalization processing on the first image to obtain a first sub-image;
a binarization processing unit, configured to perform binarization processing on the first sub-image to obtain a chemical structure image and a background image;
and the denoising processing unit is used for denoising the chemical structure image to obtain the second image.
Further, in some optional embodiments of the present invention, the denoising processing unit includes:
the acquisition subunit is used for acquiring any pixel point in the chemical structure image and a first pixel value corresponding to the pixel point;
the calculating subunit is configured to perform weighted average on the first pixel value and other pixel values in the pixel neighborhood corresponding to the first pixel value according to the pixel point and the first pixel value, so as to obtain a second pixel value;
and the changing subunit is configured to change the first pixel value into the second pixel value, and act on the corresponding pixel point to obtain the second image.
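The neighborhood weighted average performed by the denoising processing unit can be sketched in plain Python as follows. This is an illustration under stated assumptions: the function name, the uniform 3×3 weights, and the handling of border pixels (neighbors outside the image are skipped and the weights renormalized) are not specified in the source.

```python
def weighted_average_denoise(image, kernel):
    """Replace each pixel value with the weighted average of its neighborhood."""
    height, width = len(image), len(image[0])
    radius = len(kernel) // 2
    output = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = 0.0
            weight_sum = 0.0
            for j in range(-radius, radius + 1):
                for i in range(-radius, radius + 1):
                    yy, xx = y + j, x + i
                    if 0 <= yy < height and 0 <= xx < width:  # skip outside pixels
                        weight = kernel[j + radius][i + radius]
                        total += weight * image[yy][xx]
                        weight_sum += weight
            output[y][x] = total / weight_sum  # second pixel value
    return output

BOX_KERNEL = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]  # assumed uniform weights

# A single bright noise pixel is spread out and attenuated.
noisy = [[0, 0, 0], [0, 9, 0], [0, 0, 0]]
smoothed = weighted_average_denoise(noisy, BOX_KERNEL)
print(smoothed[1][1])  # 1.0
print(smoothed[0][0])  # 2.25
```

A Gaussian-shaped kernel would weight the center pixel more heavily; the uniform kernel is used here only to keep the arithmetic easy to follow.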
Further, in some optional embodiments of the present invention, the recognition module 32 includes:
a traversing unit, configured to traverse the second image to obtain a first feature parameter, where the traversing formula is:
wherein Q(x, y) represents the first characteristic parameter of the pixel point with coordinates (x, y), I(x + i, y + j) represents the second pixel value of the pixel point with coordinates (x + i, y + j), and K(x + i, y + j) represents the weight of the pixel point with coordinates (x + i, y + j);
the down-sampling processing unit is used for performing down-sampling processing on the first characteristic parameter to obtain a second characteristic parameter, and the formula of the down-sampling processing is as follows:
and the full-connection unit is used for performing full-connection processing on the second characteristic parameters and respectively calculating score values of the second characteristic parameters, and the formula of the full-connection processing is as follows:
where Yz represents a score value of the z-th second feature parameter, Wz represents a weight coefficient of the z-th second feature parameter, Qz represents the z-th second feature parameter, bz represents a bias value of the z-th second feature parameter, and n represents the total number of the second feature parameters;
and the chemical structural formula obtaining unit is used for obtaining the chemical structural formula according to the second characteristic parameter and the corresponding score value.
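The traversing, down-sampling and full-connection units correspond to the usual convolution, pooling and fully connected stages of a convolutional network. The formulas themselves are not reproduced in this text, so the sketch below rests on assumptions: a valid cross-correlation with the kernel indexed by offset (i, j), 2×2 max pooling, and a linear layer producing one score per output. All function names and numbers are hypothetical.

```python
def convolve(image, kernel):
    """Valid cross-correlation: Q(x, y) = sum over i, j of I(x+i, y+j) * K(i, j)."""
    kh, kw = len(kernel), len(kernel[0])
    return [
        [
            sum(image[y + j][x + i] * kernel[j][i]
                for j in range(kh) for i in range(kw))
            for x in range(len(image[0]) - kw + 1)
        ]
        for y in range(len(image) - kh + 1)
    ]

def max_pool(features, size=2):
    """Down-sample by keeping the maximum of each size x size block."""
    return [
        [
            max(features[y + j][x + i]
                for j in range(size) for i in range(size))
            for x in range(0, len(features[0]) - size + 1, size)
        ]
        for y in range(0, len(features) - size + 1, size)
    ]

def fully_connected(inputs, weights, biases):
    """One score per output: Y_z = sum over n of W[z][n] * Q[n] + b[z]."""
    return [sum(w * q for w, q in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]       # toy second image
kernel = [[1, 0], [0, 1]]                       # toy weights
first = convolve(image, kernel)                 # first characteristic parameters
second = max_pool(first)                        # second characteristic parameters
flat = [value for row in second for value in row]
scores = fully_connected(flat, weights=[[0.5], [-1.0]], biases=[1.0, 0.0])
best = max(range(len(scores)), key=scores.__getitem__)
print(first, second, scores, best)
```

In a trained model the kernel, weights and biases would be learned, and the index of the highest score would select the recognized chemical structural formula.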
Further, in some alternative embodiments of the present invention, the handwritten chemical structural formula recognition system 300 further includes:
and the chemical table file library establishing module is used for establishing a chemical table file library and simultaneously establishing a mapping relation between the chemical structure name label and the chemical table file.
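The mapping relation between chemical structure name labels and chemical table files can be sketched as a simple lookup table. The labels, file paths and the `.mol` extension below are hypothetical; the source does not specify the file format or storage layout in this section.

```python
# Hypothetical chemical table file library: name label -> file path.
CHEM_TABLE_LIBRARY = {
    "ethanol": "library/ethanol.mol",
    "methanol": "library/methanol.mol",
}

def find_chemical_table_file(name_label):
    """Return the chemical table file mapped to a name label, or None."""
    return CHEM_TABLE_LIBRARY.get(name_label)

print(find_chemical_table_file("ethanol"))   # library/ethanol.mol
print(find_chemical_table_file("unknown"))   # None
```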
Further, in some optional embodiments of the present invention, the recognition module 32 further includes:
the first obtaining unit is used for obtaining the chemical structural formula and obtaining each element type of the chemical structural formula and the quantity corresponding to each element type according to the chemical structural formula;
a first judgment unit configured to judge whether the element type includes a target element type;
a second judging unit, configured to, when it is judged that the element type includes a target element type, judge whether the element type belongs to a preset element type;
a third judging unit, configured to, when it is judged that the element type belongs to a preset element type, judge whether the chemical structural formula is correct according to the number of the target element types in the element type;
the first matching unit is used for matching the chemical structure name label corresponding to the chemical structural formula when the chemical structural formula is judged to be correct;
and the second matching unit is used for matching the target chemical structural formula in the database according to the types of the elements in the chemical structural formula and the quantity corresponding to the types of the elements when the chemical structural formula is judged to be wrong.
Example four
Referring to fig. 3, a block diagram of a handwritten chemical structural formula recognition apparatus according to a fourth embodiment of the present invention is shown, and includes a memory 20, a processor 10, and a computer program 30 stored in the memory and executable on the processor, where the processor 10 implements the handwritten chemical structural formula recognition method when executing the computer program 30.
The processor 10 may, in some embodiments, be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor or another data processing chip, and is used for executing program code stored in the memory 20 or processing data, such as executing an access restriction program.
The memory 20 includes at least one type of readable storage medium, which includes a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, and the like. The memory 20 may be an internal storage unit of the handwritten chemical formula recognition device, such as a hard disk of the handwritten chemical formula recognition device, in some embodiments. The memory 20 may also be an external storage device of the handwritten chemical structural formula identification device in other embodiments, such as a plug-in hard disk equipped on the handwritten chemical structural formula identification device, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like. Further, the memory 20 may also include both an internal storage unit and an external storage device of the handwritten chemical structural formula recognition apparatus. The memory 20 may be used not only to store application software of the handwritten chemical structural formula recognition apparatus and various types of data, but also to temporarily store data that has been output or will be output.
It should be noted that the configuration shown in fig. 3 does not constitute a limitation of the handwritten chemical formula recognition device, and in other embodiments, the handwritten chemical formula recognition device may include fewer or more components than those shown, or some components may be combined, or a different arrangement of components.
The embodiment of the invention further provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the method for recognizing the handwritten chemical structural formula is implemented.
Those of skill in the art will understand that the logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be viewed as implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The above-mentioned embodiments only express several embodiments of the present invention, and the description thereof is specific and detailed, but not to be understood as limiting the scope of the present invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention.
Claims (10)
1. A method for identifying handwritten chemical structural formulae, the method comprising:
acquiring a first image with a chemical structural formula, and preprocessing the first image to obtain a second image;
inputting the second image into a target recognition model, recognizing the chemical structural formula and obtaining a corresponding chemical structure name label;
and matching the corresponding chemical table file according to the chemical structure name label, and generating specification information.
2. The method of claim 1, wherein, before the step of obtaining a first image of the chemical structure and preprocessing the first image to obtain a second image, the method further comprises:
using a plurality of images with chemical structural formulas as an original training set, wherein the original training set is used for machine learning, and establishing an original training model;
and carrying out generalization processing on the original training model through a data enhancement technology to obtain the target recognition model.
3. The method of claim 1, wherein the step of obtaining a first image of the chemical structure and preprocessing the first image to obtain a second image comprises:
normalizing the first image to obtain a first sub-image;
carrying out binarization processing on the first sub-image to obtain a chemical structure image and a background image;
and denoising the chemical structure image to obtain the second image.
4. The handwritten chemical structural formula recognition method of claim 3, wherein said step of denoising said chemical structure image to obtain said second image comprises:
acquiring any pixel point in the chemical structure image and a first pixel value corresponding to the pixel point;
according to the pixel point and the first pixel value, carrying out weighted average on the first pixel value and other pixel values in the pixel point neighborhood corresponding to the first pixel value to obtain a second pixel value;
and changing the first pixel value into the second pixel value, and acting on the corresponding pixel point to obtain the second image.
5. The method of claim 4, wherein the step of inputting the second image into the target recognition model to recognize the chemical structural formula and obtain the corresponding chemical structure name label comprises:
traversing the second image to obtain a first characteristic parameter, wherein the traversal formula is as follows:
wherein Q(x, y) represents the first characteristic parameter of the pixel point with coordinates (x, y), I(x + i, y + j) represents the second pixel value of the pixel point with coordinates (x + i, y + j), and K(x + i, y + j) represents the weight of the pixel point with coordinates (x + i, y + j);
performing down-sampling processing on the first characteristic parameter to obtain a second characteristic parameter, wherein the formula of the down-sampling processing is as follows:
and performing full-connection processing on the second characteristic parameters, and respectively calculating score values of the second characteristic parameters, wherein the full-connection processing formula is as follows:
where Yz represents a score value of the z-th second feature parameter, Wz represents a weight coefficient of the z-th second feature parameter, Qz represents the z-th second feature parameter, bz represents a bias value of the z-th second feature parameter, and n represents the total number of the second feature parameters;
and acquiring the chemical structural formula according to the second characteristic parameter and the corresponding score value.
6. The method of claim 1, wherein the step of matching the corresponding chemical form file according to the chemical structure name tag and generating specification information further comprises:
and establishing a chemical table file library, and simultaneously establishing a mapping relation between the chemical structure name label and the chemical table file.
7. The method of claim 6, wherein the step of performing neural network convolution processing on the second image to identify the chemical structural formula and obtain the corresponding chemical structural name tag further comprises:
acquiring the chemical structural formula, and acquiring the element types of the chemical structural formula and the quantity corresponding to the element types according to the chemical structural formula;
judging whether the element types contain target element types or not;
if yes, judging whether the element type belongs to a preset element type;
if yes, judging whether the chemical structural formula is correct or not according to the number of the target element types in the element types;
when the chemical structural formula is judged to be correct, matching the chemical structure name label corresponding to the chemical structural formula;
and when the chemical structural formula is judged to be incorrect, matching a target chemical structural formula in a database according to the element types in the chemical structural formula and the quantity corresponding to the element types.
8. A handwritten chemical structural formula recognition system, said system comprising:
a preprocessing module, configured to acquire a first image with a chemical structural formula and preprocess the first image to obtain a second image;
the identification module is used for inputting the second image into a target identification model, identifying the chemical structural formula and obtaining a corresponding chemical structure name label;
and a specification information generating module, configured to match the corresponding chemical table file according to the chemical structure name label and generate specification information.
9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the handwritten chemical structural formula recognition method according to any of claims 1 to 7.
10. A handwritten chemical structure recognition device, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the program to implement the handwritten chemical structure recognition method according to any of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210776419.0A CN114842486A (en) | 2022-07-04 | 2022-07-04 | Handwritten chemical structural formula recognition method, system, storage medium and equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114842486A (en) | 2022-08-02 |
Family
ID=82573725
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210776419.0A Pending CN114842486A (en) | 2022-07-04 | 2022-07-04 | Handwritten chemical structural formula recognition method, system, storage medium and equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114842486A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116721713A (en) * | 2023-08-09 | 2023-09-08 | 北京望石智慧科技有限公司 | Data set construction method and device oriented to chemical structural formula identification |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100163316A1 (en) * | 2008-12-30 | 2010-07-01 | Microsoft Corporation | Handwriting Recognition System Using Multiple Path Recognition Framework |
US20140301608A1 (en) * | 2011-08-26 | 2014-10-09 | Council Of Scientific & Industrial Research | Chemical structure recognition tool |
CN108062529A (en) * | 2017-12-22 | 2018-05-22 | 上海鹰谷信息科技有限公司 | A kind of intelligent identification Method of chemical structural formula |
CN108334839A (en) * | 2018-01-31 | 2018-07-27 | 青岛清原精准农业科技有限公司 | A kind of chemical information recognition methods based on deep learning image recognition technology |
CN110263631A (en) * | 2019-05-10 | 2019-09-20 | 南京大学 | A kind of hand-written chemical formula identification and Calculate Ways |
CN111553423A (en) * | 2020-04-29 | 2020-08-18 | 河北地质大学 | Handwriting recognition method based on deep convolutional neural network image processing technology |
CN111709293A (en) * | 2020-05-18 | 2020-09-25 | 杭州电子科技大学 | Chemical structural formula segmentation method based on Resunet neural network |
CN114529908A (en) * | 2021-12-28 | 2022-05-24 | 天翼电子商务有限公司 | Offline handwritten chemical reaction type image recognition technology |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116721713A (en) * | 2023-08-09 | 2023-09-08 | 北京望石智慧科技有限公司 | Data set construction method and device oriented to chemical structural formula identification |
CN116721713B (en) * | 2023-08-09 | 2023-10-31 | 北京望石智慧科技有限公司 | Data set construction method and device oriented to chemical structural formula identification |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20220802 |