WO2023020210A1

WO2023020210A1 - Chemical structure formula identification method and apparatus, storage medium, and electronic device

Info

Publication number: WO2023020210A1
Application number: PCT/CN2022/107752
Authority: WO
Inventors: 郑明月; 蒋华良; 钟飞盛; 熊嘉诚; 刘小红
Original assignee: 中国科学院上海药物研究所; 苏州阿尔脉生物科技有限公司
Priority date: 2021-08-16
Filing date: 2022-07-26
Publication date: 2023-02-23
Also published as: CN115908775A

Abstract

The present disclosure provides a chemical structural formula identification method and apparatus, a storage medium, and an electronic device. The identification method comprises: acquiring a chemical structure image that contains at least one complete chemical structural formula; and converting the chemical structure image into its corresponding chemical text by using a pre-trained conversion model, wherein the conversion model performs a single conversion on each complete chemical structural formula in the chemical structure image. In contrast to approaches that vectorize the chemical structure image, convert the resulting lines and nodes separately, and then combine them to form the chemical text, the present disclosure performs a single conversion on each complete chemical structural formula in chemical structure images from publications such as journals and patents by means of a pre-trained conversion model, thereby obtaining in one step the complete chemical text corresponding to a complete chemical structural formula. The method has a short development cycle and low development cost, is easy to maintain, and can maintain high recognition accuracy when processing images with relatively heavy blur and noise.

Description

Method, device, storage medium and electronic equipment for identifying chemical structural formulas

Technical field

The present disclosure relates to the technical field of chemical informatics, in particular to a method, device, storage medium and electronic equipment for identifying chemical structural formulas.

Background

In publications such as journals and patents, organic compounds are often represented as chemical structural formulas. However, images of chemical structures are not a chemical representation that computers can read. Automatically extracting the computer-readable chemical text corresponding to each chemical structure (including but not limited to InChI, SMILES and IUPAC names) from such image files therefore lets chemists quickly obtain valuable reference "chemical data".

In the prior art, recognition is carried out by tools such as InDraw and KingDraw. Specifically, after image vectorization, lines and nodes are interpreted as bonds and atoms, which involves image segmentation, image thinning, line enhancement, optical character recognition, and molecular reconstruction. That is, the complete chemical structural formula must be divided, each line converted to obtain the small molecule corresponding to that line, and the small molecules then combined according to preset rules and grammar to obtain the chemical text corresponding to the chemical structural formula. However, these methods require conversion rules to be extracted and grammars to be summarized, which leads to a long development cycle, high development cost, and difficult maintenance; moreover, the accuracy of their recognition results is low on blurry and noisy images.

Summary of the invention

In view of this, the purpose of the embodiments of the present disclosure is to provide a chemical structural formula recognition method, device, storage medium and electronic equipment that solve the problems of the prior art: the need to extract conversion rules and summarize grammar, which brings a long development cycle, high development cost and difficult maintenance, and the low accuracy of recognition results on blurry and noisy images.

In the first aspect, the embodiment of the present disclosure provides a method for identifying a chemical structural formula, which includes:

Acquiring a chemical structure image, wherein the chemical structure image contains at least one complete chemical structural formula;

The chemical structure image is converted to its corresponding chemical text by using a pre-trained conversion model, wherein the conversion model performs a single conversion on the complete chemical structural formula in the chemical structure image.

In a possible implementation manner, before using a pre-trained conversion model to convert the chemical structure image to its corresponding chemical text, it also includes:

identifying the region each complete chemical structure occupies in said chemical structure image;

The chemical structure image is cropped according to the area occupied by the chemical structural formula to obtain multiple chemical structure sub-images.

In a possible implementation manner, the conversion of the chemical structure image into its corresponding chemical text using a pre-trained conversion model includes:

The chemical structure sub-image is used as an input of the conversion model, so that the conversion model performs calculation on the chemical structure sub-image, and outputs a chemical text corresponding to the chemical structure sub-image.

In a possible implementation manner, the conversion model calculates the chemical structure sub-image, and outputs the chemical text corresponding to the chemical structure sub-image, including:

The conversion model calculates the sub-image of the chemical structure to obtain a plurality of candidate texts and a probability value corresponding to each candidate text;

The candidate text with the largest probability value is selected as the chemical text corresponding to the chemical structure sub-image.

In a possible implementation manner, the step of training the conversion model includes:

Obtain a training set, the training set includes a first image sample and its corresponding first text sample;

converting the first image sample into a first input vector, and inputting the first input vector into a conversion model to be trained to obtain a first actual text;

calculating whether a first error between the first actual text and the first text sample is within an allowable range;

If the first error is not within the allowable range, adjusting parameters of the conversion model to be trained until the first error falls within the allowable range.

In a possible implementation manner, the identification method also includes:

When there are multiple conversion models to be trained, the second image sample included in the verification set is converted into a second input vector, and the second input vector is input into each conversion model after parameter adjustment to obtain a second actual text;

calculating a second error between each of said second actual text and a second text sample included in said validation set;

The conversion model after adjusting parameters corresponding to the smallest second error is used as the conversion model.

In the second aspect, the embodiment of the present disclosure also provides a chemical structural formula recognition device, which includes:

An acquisition module configured to acquire a chemical structure image, wherein the chemical structure image contains at least one complete chemical structural formula;

A conversion module configured to convert the chemical structure image to its corresponding chemical text using a pre-trained conversion model, wherein the conversion model performs a single conversion on the complete chemical structural formula in the chemical structure image.

In a possible implementation manner, the identification device further includes a cropping module, which is configured to:

In a third aspect, an embodiment of the present disclosure further provides a storage medium, wherein a computer program is stored on the computer-readable storage medium, and when the computer program is run by a processor, the following steps are performed:

In a fourth aspect, an embodiment of the present disclosure further provides an electronic device, which includes a processor and a memory, wherein the memory stores machine-readable instructions executable by the processor; when the electronic device is running, the processor communicates with the memory through a bus, and when the machine-readable instructions are executed by the processor, the following steps are performed:

Compared with the prior art, which vectorizes the chemical structure image, converts the resulting lines and nodes separately, and then combines them to form the chemical text, the embodiment of the present disclosure uses a pre-trained conversion model to perform a single conversion on each complete chemical structural formula in chemical structure images from publications such as journals and patents, thereby obtaining the complete chemical text corresponding to the complete chemical structural formula in one step. The development cycle is short, the development cost is low, and maintenance is easy. When processing images with heavy blur and noise, a high accuracy of recognition results can still be ensured.

In order to make the above-mentioned objects, features and advantages of the present disclosure more comprehensible, preferred embodiments will be described in detail below together with the accompanying drawings.

Brief description of the drawings

In order to illustrate the technical solutions in the present disclosure or the prior art more clearly, the accompanying drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the accompanying drawings in the following description show only some embodiments described in the present disclosure; those skilled in the art can obtain other drawings based on these drawings without any creative effort.

Fig. 1 shows a flowchart of the method for identifying a chemical structural formula provided by the present disclosure;

Fig. 2 shows a flowchart of training a conversion model in the identification method provided by the present disclosure;

Fig. 3 shows a flowchart of another way of training a conversion model in the identification method provided by the present disclosure;

Fig. 4 shows a schematic structural diagram of a device for identifying chemical structural formulas provided by the present disclosure;

Fig. 5 shows a schematic structural diagram of an electronic device provided by the present disclosure.

Detailed description

Various aspects and features of the present disclosure are described herein with reference to the accompanying drawings.

It should be understood that various modifications may be made to the embodiments disclosed herein. Accordingly, the above description should not be viewed as limiting, but only as exemplifications of embodiments. Those skilled in the art will envision other modifications within the scope and spirit of the disclosure.

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and, together with the general description of the disclosure given above and the detailed description of the embodiments given below, serve to explain the principles of the disclosure.

These and other characteristics of the present disclosure will become apparent from the following description of preferred forms of embodiment given as non-limiting examples with reference to the accompanying drawings.

It should also be understood that, while the disclosure has been described with reference to a few specific examples, those skilled in the art will surely be able to implement many other equivalents of the disclosure which have the features of the claims and which are therefore situated within the scope of protection defined by the claims.

The above and other aspects, features and advantages of the present disclosure will become more apparent in view of the following detailed description when taken in conjunction with the accompanying drawings.

Specific embodiments of the present disclosure are hereinafter described with reference to the accompanying drawings; however, it should be understood that the disclosed embodiments are merely examples of the disclosure, which may be embodied in various ways. Well-known and/or repetitive functions and constructions are not described in detail to avoid obscuring the disclosure with unnecessary or redundant detail. Therefore, specific structural and functional details disclosed herein are not intended to be limiting, but merely serve as a basis for the claims and as a representative basis for teaching one skilled in the art to variously employ the present disclosure in virtually any appropriately detailed structure.

This specification may use the phrases "in one embodiment," "in another embodiment," "in yet another embodiment," or "in other embodiments," which may refer to the same or one or more of the different embodiments.

In the first aspect, to facilitate understanding of the present disclosure, the method for identifying a chemical structural formula provided by the present disclosure is first introduced in detail. As shown in Fig. 1, the identification method provided by the embodiment of the present disclosure includes the following steps:

S101. Acquire a chemical structure image, where the chemical structure image includes at least one complete chemical structural formula.

Here, in publications such as journals and patents, organic compounds are often represented in the form of chemical structural formulas. Furthermore, when users consult periodicals, patents and other publications, any page in the periodicals, patents and other documents can be used as a chemical structure image.

Wherein, the chemical structure image may be in JPG format, PNG format or the like.

S102. Use a pre-trained conversion model to convert the chemical structure image into its corresponding chemical text, wherein the conversion model performs a single conversion on the complete chemical structural formula in the chemical structure image.

In a specific implementation, a chemical structure image may contain multiple complete chemical structural formulas. Therefore, before the pre-trained conversion model is used to convert the chemical structure image into its corresponding chemical text, the area that each complete chemical structural formula occupies in the chemical structure image is first identified, and the chemical structure image is then cropped according to those areas to obtain multiple chemical structure sub-images. Each chemical structure sub-image contains only one complete chemical structural formula; that is, one complete chemical structural formula is converted at a time.
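The cropping step can be illustrated with a minimal sketch. It assumes the regions have already been detected and are given as pixel bounding boxes; the disclosure does not specify how regions are identified or represented, so the `Image` and `Box` types and the `crop_sub_images` function are hypothetical names introduced only for illustration.

```python
from typing import List, Tuple

# Assumption: a grayscale image is a list of pixel rows, and a box is
# (left, top, right, bottom) with right and bottom exclusive.
Image = List[List[int]]
Box = Tuple[int, int, int, int]


def crop_sub_images(image: Image, boxes: List[Box]) -> List[Image]:
    """Cut one sub-image per detected region so each holds a single formula."""
    height = len(image)
    width = len(image[0]) if image else 0
    sub_images = []
    for left, top, right, bottom in boxes:
        # Clamp each box to the image bounds before slicing.
        left, top = max(0, left), max(0, top)
        right, bottom = min(width, right), min(height, bottom)
        if left >= right or top >= bottom:
            continue  # skip empty or fully out-of-range regions
        sub_images.append([row[left:right] for row in image[top:bottom]])
    return sub_images
```

Each returned sub-image can then be passed to the conversion model independently, which matches the one-formula-per-conversion behaviour described above.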

Here, the embodiment of the present disclosure does not need to divide the complete chemical structural formula, convert each line separately to obtain the small molecule corresponding to that line, and then combine the small molecules according to preset rules and grammar to obtain the chemical text corresponding to the chemical structural formula. Instead, it uses a conversion model assisted by a graphics processing unit (GPU), which speeds up recognition and processing of the chemical structure sub-image, and obtains the chemical text from the chemical structure sub-image in a single conversion. Compared with segmenting the chemical structure sub-image, converting it in multiple steps and recombining the results, the development cycle is shorter and the development cost lower, the operation rules are simple, the operation efficiency is high, and the accuracy of recognition results is also improved.

The chemical structure sub-image may have a preset shape, a preset size, and so on, which is not specifically limited in this embodiment of the present disclosure.

In a specific implementation, the chemical structure sub-image is used as the input of the conversion model and is converted into a feature vector according to a preset conversion algorithm, so that the conversion model performs its calculation on the feature vector corresponding to the chemical structure sub-image. The preset conversion algorithm may be a mapping relationship between chemical structure sub-images and feature vectors. The conversion model then outputs the chemical text corresponding to the chemical structure sub-image, which completes the conversion of the chemical structural formula into chemical text.
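The flow from sub-image to feature vector to chemical text can be sketched as follows. This is only a structural outline under stated assumptions: `to_features` stands in for the preset conversion algorithm and `model` for the pre-trained conversion model, and both are hypothetical callables supplied by the caller rather than anything defined in the disclosure.

```python
from typing import Callable, List, Sequence

FeatureVector = List[float]


def recognize(
    sub_images: Sequence[object],
    to_features: Callable[[object], FeatureVector],
    model: Callable[[FeatureVector], str],
) -> List[str]:
    """Run one conversion per sub-image and collect the resulting chemical texts."""
    texts = []
    for sub_image in sub_images:
        features = to_features(sub_image)  # preset image-to-vector mapping
        texts.append(model(features))      # single conversion: vector -> text
    return texts
```

Keeping the two stages as separate callables reflects the description, in which the image-to-vector mapping is fixed in advance while the conversion model is the trained component.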

Optionally, when the conversion model converts the chemical structure sub-image, its calculation on the feature vector corresponding to the sub-image yields a plurality of candidate texts and a probability value corresponding to each candidate text, where each candidate text is a possible text for the chemical structural formula in the chemical structure sub-image. The candidate text with the highest probability value is then selected as the chemical text corresponding to the chemical structure sub-image.
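Selecting the most probable candidate amounts to an argmax over the candidate texts. A minimal sketch, assuming the model's candidates are available as a mapping from text to probability (the function name `pick_chemical_text` and the example values are illustrative only):

```python
from typing import Dict


def pick_chemical_text(candidates: Dict[str, float]) -> str:
    """Return the candidate text with the largest probability value."""
    if not candidates:
        raise ValueError("the model returned no candidate texts")
    return max(candidates, key=candidates.get)


# Illustrative values only: three SMILES candidates for one sub-image.
example = {"CCO": 0.91, "CCN": 0.06, "CO": 0.03}
best = pick_chemical_text(example)  # "CCO"
```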

The embodiment of the present disclosure also provides a method for training a conversion model. Referring to the steps shown in FIG. 2, the method includes S201-S204.

S201. Acquire a training set, where the training set includes a first image sample and a corresponding first text sample.

S202. Convert the first image sample into a first input vector, and input the first input vector into a conversion model to be trained to obtain a first actual text.

S203. Calculate whether the first error between the first actual text and the first text sample is within an allowable range.

S204. If the first error is not within the allowable range, adjust the parameters of the conversion model to be trained until the first error falls within the allowable range.

In a specific implementation, the training set is obtained first. The training set includes the first image sample and its corresponding first text sample; the first text sample is obtained either by manual conversion or by automatic conversion with a preset algorithm followed by manual verification.

Afterwards, the first image sample is converted into a first input vector according to a preset conversion algorithm. For example, the conversion can be based on a pre-established dictionary that includes the mapping relationship between image samples and input vectors and the mapping relationship between candidate texts and output vectors. The first input vector is then input into the conversion model to be trained, which performs its calculation on the first input vector to obtain the first actual text. The conversion model to be trained also calculates multiple candidate texts, and the first actual text is the candidate text with the largest probability value. Specifically, the conversion model to be trained calculates a first output vector from the first input vector and converts the first output vector into a candidate text based on the dictionary.
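The text side of such a dictionary can be illustrated with a small sketch. It assumes a character-level mapping between chemical text and integer indices, which is a simplification chosen here for brevity (multi-character atoms such as "Cl" would be split); the disclosure does not specify how the dictionary is built, and the function names are hypothetical.

```python
from typing import Dict, Iterable, List


def build_dictionary(texts: Iterable[str]) -> Dict[str, int]:
    """Assign an integer index to every character seen in the text samples."""
    symbols = sorted({ch for text in texts for ch in text})
    return {symbol: index for index, symbol in enumerate(symbols)}


def encode(text: str, dictionary: Dict[str, int]) -> List[int]:
    """Map a chemical text to a vector of indices."""
    return [dictionary[ch] for ch in text]


def decode(vector: List[int], dictionary: Dict[str, int]) -> str:
    """Map an output vector of indices back to a candidate text."""
    inverse = {index: symbol for symbol, index in dictionary.items()}
    return "".join(inverse[index] for index in vector)
```

With a dictionary built from the samples "CCO" and "C=O", `decode(encode("CCO", d), d)` round-trips to "CCO", which is the property the output-vector-to-text step relies on.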

The conversion model to be trained in the embodiment of the present disclosure includes, but is not limited to, a random forest, a support vector machine, a neural network, and the like. Optionally, the conversion model to be trained uses a feature extractor-translator architecture in which both the feature extractor and the translator are composed of neural networks. Those skilled in the art will appreciate that the foregoing is only one embodiment of the present disclosure, which is not limited thereto.

After obtaining the first actual text, calculate a first error between the first actual text and the first text sample, and determine whether the first error is within an allowable range. If the first error is not within the allowable range, adjust the parameters of the conversion model to be trained and use the adjusted model for the next round of training, repeating until the first error falls within the allowable range, at which point training of the conversion model is complete.
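The compute-error, check-range, adjust-parameters loop can be shown on a deliberately tiny stand-in model. The sketch below fits a single weight with gradient descent and mean squared error; the error measure, the update rule, and the `max_rounds` safety cap are all assumptions made for illustration, since the disclosure does not fix any of them, and a real conversion model would be far larger.

```python
from typing import List, Tuple


def train_until_allowed(
    samples: List[Tuple[float, float]],
    allowed_error: float = 1e-6,
    learning_rate: float = 0.05,
    max_rounds: int = 10_000,
) -> Tuple[float, float]:
    """Fit y = w * x, adjusting w until the error is within the allowable range."""
    w = 0.0
    error = float("inf")
    for _ in range(max_rounds):
        # Mean squared error between the model outputs and the reference samples.
        error = sum((w * x - y) ** 2 for x, y in samples) / len(samples)
        if error <= allowed_error:
            break  # error is within the allowable range: training is complete
        # Otherwise adjust the parameter and run the next round.
        gradient = sum(2 * (w * x - y) * x for x, y in samples) / len(samples)
        w -= learning_rate * gradient
    return w, error
```

The cap on the number of rounds is a practical addition: without it, a model that can never reach the allowable range would loop forever.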

In a specific implementation, a different number of processing layers in the model, or a different order of processing layers, may lead to different calculation results. Therefore, multiple conversion models to be trained can be established in advance, and after each of them has been trained, a verification set is used to determine the final conversion model. Referring to the method flowchart shown in FIG. 3, the steps include S301-S303.

S301, when there are multiple conversion models to be trained, convert the second image sample included in the verification set into a second input vector, and input the second input vector into each conversion model after adjusting parameters, to obtain Second actual text.

S302. Calculate a second error between each second actual text and the second text samples included in the verification set.

S303. Use the conversion model after adjusting the parameters corresponding to the smallest second error as the conversion model.

Here, in the case of multiple conversion models to be trained, the second image sample included in the verification set is converted into a second input vector, and the second input vector is respectively input into each conversion model after adjusting parameters, The second actual text is obtained, wherein the method of converting the second image sample into the second input vector is the same as the method of converting the first image sample into the first input vector, and will not be repeated here.

After obtaining the second actual text corresponding to each conversion model after parameter adjustment, calculate the second error between that second actual text and the second text sample included in the verification set, that is, the error produced by that conversion model.

Afterwards, the smallest second error is selected from the plurality of second errors, and the conversion model after adjusting parameters corresponding to the smallest second error is used as the conversion model.
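Steps S301-S303 amount to scoring every trained model on the verification set and keeping the one with the lowest error. A minimal sketch, assuming the second error is the fraction of verification samples whose predicted text differs from the reference (the disclosure does not define the error measure) and using hypothetical callables in place of trained models:

```python
from typing import Callable, Dict, List, Tuple

# Assumption: a model is any callable from an input (here a string key that
# stands in for the second input vector) to the predicted chemical text.
Model = Callable[[str], str]


def validation_error(model: Model, validation_set: List[Tuple[str, str]]) -> float:
    """Fraction of samples whose predicted text differs from the reference."""
    wrong = sum(1 for image, text in validation_set if model(image) != text)
    return wrong / len(validation_set)


def select_model(models: Dict[str, Model], validation_set: List[Tuple[str, str]]) -> str:
    """Return the name of the trained model with the smallest second error."""
    return min(models, key=lambda name: validation_error(models[name], validation_set))
```

An exact-match error rate is a strict choice; a softer measure such as a string edit distance would also fit the description and would rank near-misses above complete failures.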

Further, the final conversion model can also be tested by using the test set, so as to further verify the accuracy of the conversion model. In addition, the conversion model can also be updated and trained periodically to ensure the accuracy of the conversion model.

Based on the same inventive concept, the second aspect of the present disclosure also provides a device for identifying chemical structural formulas. Since the problem-solving principle of the device is similar to that of the identification method described above, the implementation of the device can refer to the implementation of the method, and repeated details are not described again here.

Referring to Fig. 4, the recognition device of the chemical structural formula includes:

An acquisition module 401 configured to acquire a chemical structure image, wherein the chemical structure image contains at least one complete chemical structural formula;

A conversion module 402 configured to convert the chemical structure image to its corresponding chemical text by using a pre-trained conversion model, wherein the conversion model performs a single conversion on the complete chemical structural formula in the chemical structure image.

In another embodiment, the device for identifying chemical structural formulas further includes a cropping module 403, which is configured to:

In another embodiment, the conversion module 402 is specifically configured as:

In another embodiment, the conversion model in the conversion module 402 calculates the chemical structure sub-image, and when outputting the chemical text corresponding to the chemical structure sub-image, specifically includes:

In another embodiment, the device for identifying chemical structural formulas further includes a first training module 404 configured to:

In another embodiment, the device for identifying chemical structural formulas further includes a second training module 405 configured to:

calculating a second error between each of said second actual text and a second text sample included in said verification set;

The third aspect of the present disclosure also provides a storage medium, which is a computer-readable medium and stores a computer program. When the computer program is executed by a processor, the method provided by any embodiment of the present disclosure is implemented, including the following steps:

S11, acquiring a chemical structure image, wherein the chemical structure image contains at least one complete chemical structural formula;

S12. Using a pre-trained conversion model to convert the chemical structure image into its corresponding chemical text, wherein the conversion model performs a single conversion on the complete chemical structural formula in the chemical structure image.

Before the computer program is executed by the processor to convert the chemical structure image into its corresponding chemical text using a pre-trained conversion model, the processor also executes the following steps: identifying the area occupied by each complete chemical structural formula in the chemical structure image; and cropping the chemical structure image according to the area occupied by the chemical structural formula to obtain multiple chemical structure sub-images.

When the computer program is executed by the processor to convert the chemical structure image to its corresponding chemical text using a pre-trained conversion model, the processor specifically performs the following steps: using the chemical structure sub-image as the input of the conversion model, so that the conversion model performs calculation on the chemical structure sub-image and outputs the chemical text corresponding to the chemical structure sub-image.

When the computer program is executed by the processor so that the conversion model performs calculation on the chemical structure sub-image and outputs the chemical text corresponding to the chemical structure sub-image, the processor also executes the following steps: the conversion model performs calculation on the chemical structure sub-image to obtain a plurality of candidate texts and a probability value corresponding to each candidate text; and the candidate text with the largest probability value is selected as the chemical text corresponding to the chemical structure sub-image.

When the computer program is executed by the processor to perform the recognition method, the processor also executes the following steps: obtaining a training set, the training set including a first image sample and its corresponding first text sample; converting the first image sample into a first input vector, and inputting the first input vector into the conversion model to be trained to obtain the first actual text; calculating whether the first error between the first actual text and the first text sample is within an allowable range; and, if the first error is not within the allowable range, adjusting the parameters of the conversion model to be trained until the first error falls within the allowable range.

When the computer program is executed by the processor to perform the recognition method, the processor also executes the following steps: when there are multiple conversion models to be trained, converting the second image sample included in the verification set into a second input vector, and inputting the second input vector into each conversion model after parameter adjustment to obtain the second actual text; calculating the second error between each second actual text and the second text sample included in the verification set; and using the conversion model after parameter adjustment that corresponds to the smallest second error as the conversion model.

It should be noted that the storage medium mentioned above in the present disclosure may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of computer-readable storage media may include, but are not limited to, electrical connections with one or more wires, portable computer diskettes, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the above. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. A computer-readable signal medium, by contrast, may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code therein. Such propagated data signals may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can transmit, propagate, or transport a program for use by or in conjunction with an instruction execution system, apparatus, or device. Program code contained on a storage medium may be transmitted using any appropriate medium, including but not limited to wires, optical cables, RF (radio frequency), etc., or any suitable combination of the above.

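The fourth aspect of the present disclosure also provides an electronic device. As shown in FIG. 5, the electronic device includes at least a memory and a processor; a computer program is stored on the memory, and the processor, when executing the computer program, implements the method provided by any embodiment of the present disclosure. Exemplarily, the method executed by the computer program of the electronic device is as follows: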

S21, acquiring a chemical structure image, wherein the chemical structure image contains at least one complete chemical structural formula;

S22. Convert the chemical structure image to its corresponding chemical text by using a pre-trained conversion model, wherein the conversion model performs a single conversion on the complete chemical structural formula in the chemical structure image.

Before the processor converts the chemical structure image into its corresponding chemical text using the pre-trained conversion model stored on the memory, it further executes the following steps: identifying the area occupied by each complete chemical structural formula in the chemical structure image; and cropping the chemical structure image according to the area occupied by the chemical structural formula to obtain multiple chemical structure sub-images.

When the processor uses the pre-trained conversion model stored on the memory to convert the chemical structure image to its corresponding chemical text, it also executes the following steps: using the chemical structure sub-image as the input of the conversion model, so that the conversion model performs calculation on the chemical structure sub-image and outputs the chemical text corresponding to the chemical structure sub-image.

When the processor executes the conversion model stored in the memory to perform calculation on the chemical structure sub-image and output the chemical text corresponding to the chemical structure sub-image, it also executes the following steps: the conversion model performs calculation on the chemical structure sub-image to obtain a plurality of candidate texts and a probability value corresponding to each candidate text; and the candidate text with the largest probability value is selected as the chemical text corresponding to the chemical structure sub-image.

When executing the recognition method stored on the memory, the processor also executes the following computer program: obtaining a training set, the training set including a first image sample and its corresponding first text sample; converting the first image sample into a first input vector, and inputting the first input vector into the conversion model to be trained to obtain a first actual text; determining whether a first error between the first actual text and the first text sample is within an allowable range; and, if the first error is not within the allowable range, adjusting the parameters of the conversion model to be trained until the first error falls within the allowable range.
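The control flow of this training procedure (compute the output, measure the error, stop if it is within the allowable range, otherwise adjust the parameters) can be sketched as follows. This is a deliberately tiny numeric stand-in, not the conversion model itself: the "model" is a single weight fitted by gradient descent, and the error measure, learning rate and tolerance are all assumptions chosen for illustration.

```python
from typing import List, Tuple


def train_until_within_range(
    samples: List[Tuple[float, float]],
    allowable_error: float = 1e-4,   # assumed tolerance
    learning_rate: float = 0.05,     # assumed step size
    max_steps: int = 10_000,
) -> Tuple[float, float, int]:
    """Adjust one weight until the mean squared error is within the allowable range."""
    weight = 0.0  # the only trainable parameter of this stand-in model
    for step in range(max_steps):
        # Forward pass and error between actual output and expected output.
        error = sum((weight * x - y) ** 2 for x, y in samples) / len(samples)
        if error <= allowable_error:
            return weight, error, step  # error is within the allowable range
        # Error too large: adjust the parameter along the negative gradient.
        gradient = sum(2 * (weight * x - y) * x for x, y in samples) / len(samples)
        weight -= learning_rate * gradient
    raise RuntimeError("error did not fall within the allowable range")


# Toy training set in which the expected output is twice the input.
samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
weight, error, steps = train_until_within_range(samples)
```

In a real image-to-text model the error would be computed between the predicted text and the first text sample, but the stopping rule has the same shape.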

When executing the recognition method stored on the memory, the processor also executes the following computer program: when there are multiple conversion models to be trained, converting a second image sample included in a verification set into a second input vector, and inputting the second input vector into each parameter-adjusted conversion model to obtain a second actual text; calculating a second error between each second actual text and a second text sample included in the verification set; and using the parameter-adjusted conversion model corresponding to the smallest second error as the conversion model.
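Selecting among several trained models by their verification-set error can be sketched as follows. The text does not state how the second error is measured; this sketch assumes the mean Levenshtein edit distance between predicted and reference text. The lookup-table "models", the image identifiers and the function names are hypothetical stand-ins.

```python
from typing import Callable, Dict, List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (assumed error measure)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def select_best_model(
    models: Dict[str, Callable[[str], str]],
    verification_set: List[Tuple[str, str]],
) -> Tuple[str, float]:
    """Return the name and mean error of the model with the smallest error."""
    if not models or not verification_set:
        raise ValueError("need at least one model and one verification sample")
    errors = {}
    for name, model in models.items():
        total = sum(edit_distance(model(image), text)
                    for image, text in verification_set)
        errors[name] = total / len(verification_set)
    best = min(errors, key=lambda name: errors[name])
    return best, errors[best]


def make_lookup_model(table: Dict[str, str]) -> Callable[[str], str]:
    """Stand-in for a trained model: maps an image identifier to a fixed text."""
    return lambda image: table[image]


verification_set = [("img_ethanol", "CCO"), ("img_benzene", "c1ccccc1")]
models = {
    "model_a": make_lookup_model({"img_ethanol": "CCO", "img_benzene": "c1ccccc1"}),
    "model_b": make_lookup_model({"img_ethanol": "CCN", "img_benzene": "c1ccccc1"}),
    "model_c": make_lookup_model({"img_ethanol": "CO", "img_benzene": "C1CCCCC1"}),
}
best_name, best_error = select_best_model(models, verification_set)
```

Any other text-level error measure could be substituted for `edit_distance` without changing the selection logic.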

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved. It should also be noted that each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or may be implemented by a combination of dedicated hardware and computer instructions.

The above description is only a preferred embodiment of the present disclosure and an illustration of the applied technical principle. Those skilled in the art should understand that the scope of disclosure involved in this disclosure is not limited to the technical solutions formed by the specific combination of the above technical features, but also covers other technical solutions formed by any combination of the above technical features or their equivalent features without departing from the above disclosed concept, for example, a technical solution formed by replacing the above-mentioned features with (but not limited to) technical features with similar functions disclosed in this disclosure.

In addition, while operations are depicted in a particular order, this should not be understood as requiring that the operations be performed in the particular order shown or performed in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while the above discussion contains several specific implementation details, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.

Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are merely example forms of implementing the claims.

Multiple embodiments of the present disclosure have been described in detail above, but the present disclosure is not limited to these specific embodiments. Those skilled in the art can make various variations and modifications on the basis of the concept of the present disclosure, and such variations and modifications should fall within the scope of protection claimed by the present disclosure.

Claims

A method for identifying a chemical structural formula, characterized in that it comprises:

Acquiring a chemical structure image, wherein the chemical structure image contains at least one complete chemical structural formula;

The chemical structure image is converted to its corresponding chemical text by using a pre-trained conversion model, wherein the conversion model performs a single conversion on the complete chemical structural formula in the chemical structure image.

The recognition method according to claim 1, wherein, before using a pre-trained conversion model to convert the chemical structure image into its corresponding chemical text, the method further comprises:

identifying the region each complete chemical structural formula occupies in said chemical structure image;

cropping the chemical structure image according to the area occupied by each chemical structural formula to obtain multiple chemical structure sub-images.

The recognition method according to claim 2, wherein said converting said chemical structure image into its corresponding chemical text using a pre-trained conversion model comprises:

The chemical structure sub-image is used as an input of the conversion model, so that the conversion model performs calculation on the chemical structure sub-image, and outputs a chemical text corresponding to the chemical structure sub-image.

The recognition method according to claim 3, wherein the conversion model performing calculation on the chemical structure sub-image and outputting the chemical text corresponding to the chemical structure sub-image comprises:

The conversion model performs calculation on the chemical structure sub-image to obtain a plurality of candidate texts and a probability value corresponding to each candidate text;

The candidate text with the largest probability value is selected as the chemical text corresponding to the chemical structure sub-image.

The recognition method according to claim 1, wherein the step of training the conversion model comprises:

obtaining a training set, the training set including a first image sample and its corresponding first text sample;

converting the first image sample into a first input vector, and inputting the first input vector into a conversion model to be trained to obtain a first actual text;

determining whether a first error between the first actual text and the first text sample is within an allowable range;

if the first error is not within the allowable range, adjusting parameters of the conversion model to be trained until the first error falls within the allowable range.

The recognition method according to claim 5, further comprising:

when there are multiple conversion models to be trained, converting a second image sample included in a verification set into a second input vector, and inputting the second input vector into each parameter-adjusted conversion model to obtain a second actual text;

calculating a second error between each said second actual text and a second text sample included in said verification set;

using the parameter-adjusted conversion model corresponding to the smallest second error as the conversion model.

An identification device for a chemical structural formula, characterized in that it comprises:

An acquisition module configured to acquire a chemical structure image, wherein the chemical structure image contains at least one complete chemical structural formula;

A conversion module configured to convert the chemical structure image to its corresponding chemical text using a pre-trained conversion model, wherein the conversion model performs a single conversion on the complete chemical structural formula in the chemical structure image.

The identification device according to claim 7, further comprising a cropping module configured to:

identify the region each complete chemical structural formula occupies in said chemical structure image;

crop the chemical structure image according to the area occupied by each chemical structural formula to obtain multiple chemical structure sub-images.

A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and when the computer program is run by a processor, the following steps are performed:

Acquiring a chemical structure image, wherein the chemical structure image contains at least one complete chemical structural formula;

The chemical structure image is converted to its corresponding chemical text by using a pre-trained conversion model, wherein the conversion model performs a single conversion on the complete chemical structural formula in the chemical structure image.

An electronic device, characterized in that it includes a processor and a memory, wherein the memory stores machine-readable instructions executable by the processor; when the electronic device is running, the processor and the memory communicate via a bus; and when the machine-readable instructions are executed by the processor, the following steps are performed:

Acquiring a chemical structure image, wherein the chemical structure image contains at least one complete chemical structural formula;

The chemical structure image is converted to its corresponding chemical text by using a pre-trained conversion model, wherein the conversion model performs a single conversion on the complete chemical structural formula in the chemical structure image.