CN116187327A - Information processing method and system for chemical entity, computer system and storage medium - Google Patents

Publication number
CN116187327A
Authority
CN
China
Prior art keywords
chemical
document
chemical entity
information
entity
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111455595.6A
Other languages
Chinese (zh)
Inventor
张声德
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qianyan Intelligent Biotechnology Co ltd
Original Assignee
Nanjing Suikun Intelligent Technology Co ltd
Application filed by Nanjing Suikun Intelligent Technology Co ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Chemical & Material Sciences (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computing Systems (AREA)
  • Automatic Analysis And Handling Materials Therefor (AREA)

Abstract

The application provides an information processing method for a chemical entity, an information processing system for a chemical entity, a computer system, and a computer-readable storage medium. The method detects the in-line text content and/or table content containing chemical entities in an acquired document to determine an object to be identified, and identifies that object to obtain the systematic name and corresponding number of each chemical entity. It then converts the systematic name into chemical structure information in a preset data format, associates the structure information with the number, and stores and/or outputs the result. This markedly improves the extraction rate and accuracy of the final valid results and avoids redundant information. Because each chemical entity is extracted together with its number in the document, the results can be arranged into accurate, comprehensive, standardized structured data that supports work such as drug discovery and development.

Description

Information processing method and system for chemical entity, computer system and storage medium
Technical Field
The present application relates to the field of chemical technology, and in particular, to an information processing method of a chemical entity, an information processing system of a chemical entity, a computer system, and a computer readable storage medium.
Background
In drug development, it is necessary to follow the latest progress in the industry on a given target, for example the most recently published patent literature or journal papers, so that molecular design and optimization can be carried out in a way that protects development efficiency and investment. To interpret and analyze patent literature effectively, the molecular structures and their corresponding experimental data need to be extracted and arranged into structured information. Manual extraction can yield high-quality structured data from patents, but the cost and effort are considerable; in practice only a few commercial data providers, such as Elsevier Reaxys, have sufficient resources to support this task.
Existing tools for patent text mining include OSCAR4, ChemicalTagger, ChemSpot, OCMiner, and ChemDataExtractor. The process by which these tools extract compound information from patents can be divided into two main parts. The first is chemical named entity recognition (NER): identifying which spans of the text are chemical entities, extracting all of them, and obtaining the corresponding chemical structures with a conversion tool; some text-mining tools include only this NER part. The second is relation extraction between named entities: using statistical or rule-based methods to identify relations among the recognized chemical entities, or between chemical entities and other entity types, such as coreference resolution across several chemical entities, associations between chemical entities and diseases, proteins, or genes, or the physicochemical properties and biological assay data corresponding to a chemical entity. In the prior art, the NER methods used by text-mining tools fall into three categories: dictionary-based methods, grammar-based methods, and context-based methods.
Dictionary-based methods find named entities by comparing the text against a dictionary or catalog of known names, so designing a high-quality, comprehensive dictionary is critical. Their biggest limitations are the limited coverage of the dictionary and the fact that efficiency drops exponentially once the dictionary grows beyond a certain size. Chemical entities that appear in non-systematic forms are therefore usually identified with this method.
For systematic (IUPAC) names, by contrast, it is impossible to build a dictionary that exhausts all cases, so grammar-based methods are typically used. Systematic nomenclature is essentially a grammar over a finite set of terminal symbols, which generally correspond to chemical name fragments (e.g., "methyl"). These fragments occur far more often in chemical entities than in ordinary text and are quite distinctive, so a dictionary of basic name fragments can be induced from a large number of chemical names; the text can then be segmented and part-of-speech tagged with that dictionary, and complete named entities identified with the help of grammatical rules. Alternatively, instead of building a fragment dictionary, a sliding window of n-grams can be used (an n-gram is a sequence of n consecutive characters in a text; for example, "methyl" has three 4-grams: "meth", "ethy", and "thyl"). The conditional frequencies of n-character sequences in chemical-name and non-chemical-name text, or the character-to-character transition probabilities (a Markov model), are counted. During text mining, the text is split into n-character fragments and a probability is predicted for each; the tag sequence with the highest probability is selected, and the complete named entity is then identified.
In grammar-based methods, a named entity is split into fragments and its boundaries must be determined by additional rules or procedures. NER based on contextual information has no such limitation. It trains a machine learning model on pre-annotated text, and a common annotation scheme labels each token as (B)eginning, (I)nside, or (O)utside: B marks the first token of a chemical entity, I marks any subsequent token within the entity, and O marks every token that does not belong to a chemical entity. With this scheme the boundaries of chemical entities are easy to determine, and machine learning models (e.g., conditional random fields, CRFs) can learn latent patterns from data labeled this way, predict the label of each token in new text, and thereby identify the correct chemical entities.
Most existing tools take HTML, XML, Word, or text PDF as input, that is, documents with selectable text, and cannot accept image PDFs directly. In the Fast Follow new drug development scenario, however, the required documents, such as patent documents, are usually available only as image PDFs. Even where a patent database has run OCR on the original image PDF, the resulting text suffers varying degrees of information loss and error because of the limits of OCR technology. Tables in patents are especially affected: their layout and structure are very likely to be lost entirely after OCR. If extraction relies only on the full-text OCR output, the results of existing tools are inevitably degraded by OCR errors. Most tools have no module for correcting OCR output, and those that do offer only simple error correction, so chemical entity information is identified and extracted incorrectly and cannot serve as useful data for drug development.
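A dictionary-based lookup of this kind can be illustrated with a minimal sketch. The dictionary contents, the tokenization rule, and the function name `find_dictionary_entities` are assumptions made for illustration only and are not taken from any of the tools named above.

```python
import re

# Hypothetical toy dictionary of non-systematic names (illustrative only).
KNOWN_NAMES = {"aspirin", "acetaminophen", "acetic acid"}
# Longest dictionary entry, measured in words, bounds the lookup window.
MAX_WORDS = max(len(name.split()) for name in KNOWN_NAMES)

def find_dictionary_entities(text):
    """Return (word_index, name) pairs found by greedy longest match."""
    words = re.findall(r"[A-Za-z0-9\-]+", text.lower())
    hits = []
    i = 0
    while i < len(words):
        matched = False
        # Try the longest candidate first so "acetic acid" wins over "acetic".
        for n in range(min(MAX_WORDS, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in KNOWN_NAMES:
                hits.append((i, candidate))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return hits

print(find_dictionary_entities("Aspirin was dissolved in acetic acid."))
```

Any name missing from `KNOWN_NAMES` is silently skipped, which is the coverage limitation described above.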
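The n-gram idea can be sketched as follows. This is a simplified illustration, not the method of any particular tool: the training words are toy data, and the add-one-smoothed log-ratio score in `chemical_score` is an assumed stand-in for the conditional-frequency comparison described above.

```python
import math
from collections import Counter

def char_ngrams(text, n):
    """All sequences of n consecutive characters, taken with a sliding window."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def train_counts(words, n):
    """Count n-gram occurrences over a list of training words."""
    counts = Counter()
    for word in words:
        counts.update(char_ngrams(word.lower(), n))
    return counts

def chemical_score(token, chem_counts, plain_counts, n=4):
    """Sum of smoothed log frequency ratios; positive suggests a chemical name."""
    chem_total = sum(chem_counts.values())
    plain_total = sum(plain_counts.values())
    vocab = len(set(chem_counts) | set(plain_counts)) + 1
    score = 0.0
    for gram in char_ngrams(token.lower(), n):
        p_chem = (chem_counts[gram] + 1) / (chem_total + vocab)
        p_plain = (plain_counts[gram] + 1) / (plain_total + vocab)
        score += math.log(p_chem / p_plain)
    return score

# Toy training data (assumed for illustration).
CHEM = train_counts(["methyl", "ethyl", "methanol"], 4)
PLAIN = train_counts(["the", "table", "shows"], 4)

print(char_ngrams("methyl", 4))
print(chemical_score("methyl", CHEM, PLAIN), chemical_score("table", CHEM, PLAIN))
```

A real system would train on large corpora and decode a tag sequence over the whole text rather than scoring isolated tokens.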
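To show how such labels yield entity boundaries, the following minimal sketch decodes a B/I/O tag sequence into entity strings. The tokens and tags are hand-written here; in practice the tags would come from a trained model such as a CRF, which is not shown. The function name `decode_bio` is hypothetical.

```python
def decode_bio(tokens, tags):
    """Group tokens into entities: B starts one, I continues it, O ends it."""
    entities = []
    current = []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            if current:
                entities.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:
            # O, or a stray I with no preceding B, closes any open entity.
            if current:
                entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities

tokens = ["Compound", "12", ":", "2-chlorobenzoic", "acid", "was", "obtained"]
tags = ["O", "O", "O", "B", "I", "O", "O"]
print(decode_bio(tokens, tags))
```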
Disclosure of Invention
In view of the above-described drawbacks of the related art, an object of the present application is to provide an information processing method of a chemical entity, an information processing system of a chemical entity, a computer system, and a computer-readable storage medium, which solve the problem of high error rate occurring when identifying and extracting chemical entity information from documents such as patents in the related art.
To achieve the above and other related objects, a first aspect of the present application discloses an information processing method for a chemical entity, comprising the steps of: detecting the in-line text content and/or table content containing chemical entities in an acquired document to determine an object to be identified, the document comprising a text document and/or a picture document; identifying chemical entities and number entities in the object to be identified to obtain the systematic name and corresponding number of each chemical entity; and converting the systematic name of the chemical entity into chemical structure information in a preset data format, associating it with the number, and storing and/or outputting it.
A second aspect of the present application discloses an information processing system for a chemical entity, comprising: a detection module for detecting the in-line text content and/or table content containing chemical entities in an acquired document to determine an object to be identified, the document comprising a text document and/or a picture document; an identification module for identifying chemical entities and number entities in the object to be identified to obtain the systematic name and corresponding number of each chemical entity; and a conversion module for converting the systematic name of the chemical entity into chemical structure information in a preset data format, associating it with the number, and storing and/or outputting it.
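The identify, convert, and associate steps of the first aspect can be sketched in a few lines. Everything concrete here is an assumption for illustration: the "Example N: name" line pattern, the use of SMILES as the preset data format, and the toy lookup table standing in for a real name-to-structure converter, which the application does not specify at this point.

```python
import json
import re
from dataclasses import asdict, dataclass
from typing import Optional

@dataclass
class Record:
    number: str                # number of the entity in the document
    name: str                  # systematic name as found in the text
    structure: Optional[str]   # structure in the assumed preset format (SMILES)

# Hypothetical line pattern such as "Example 3: <name>" or "Compound 2a. <name>".
LINE_PATTERN = re.compile(r"^(?:Example|Compound)\s+(\w+)\s*[:.]\s*(.+)$")

def identify(lines):
    """Return (number, name) pairs for lines that match the assumed pattern."""
    pairs = []
    for line in lines:
        match = LINE_PATTERN.match(line.strip())
        if match:
            pairs.append((match.group(1), match.group(2).strip()))
    return pairs

def convert_and_associate(pairs, name_to_structure):
    """Convert each name and keep the result tied to its number."""
    return [Record(num, name, name_to_structure(name)) for num, name in pairs]

# Toy stand-in for a name-to-structure converter (illustrative only).
TOY_SMILES = {"ethanol": "CCO", "benzoic acid": "OC(=O)c1ccccc1"}

def toy_converter(name):
    return TOY_SMILES.get(name.lower())

def to_json(records):
    return json.dumps([asdict(r) for r in records], indent=2)

LINES = ["Example 1: ethanol", "Some prose.", "Compound 2a. Benzoic acid"]
print(to_json(convert_and_associate(identify(LINES), toy_converter)))
```

Keeping the number on every record is what later allows a structure to be joined to experimental data that refers to the compound only by its number.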
A third aspect of the present application is to disclose a computer system comprising: at least one memory for storing at least one program; and the at least one processor is connected with the at least one memory and is used for realizing the information processing method of the chemical entity according to the first aspect when the at least one program is called and executed from the at least one memory.
A fourth aspect of the present application discloses a computer-readable storage medium storing a computer program which, when run by a processor of a computer, causes the computer to carry out the information processing method for a chemical entity described in the first aspect above.
In summary, the information processing method, information processing system, computer system, and computer-readable storage medium of the present application detect the in-line text content and/or table content containing chemical entities in an acquired document to determine an object to be identified, identify that object to obtain the systematic name and corresponding number of each chemical entity, convert the systematic name into chemical structure information in a preset data format, associate it with the number, and store and/or output it. This markedly improves the extraction rate and accuracy of the final valid results and avoids redundant information. Because each chemical entity is extracted together with its number in the document, the results can be arranged into accurate, comprehensive, standardized structured data, which facilitates work such as drug discovery and development.
Drawings
The specific features of the invention related to this application are set forth in the appended claims. The features and advantages of the invention that are related to the present application will be better understood by reference to the exemplary embodiments and the drawings that are described in detail below. The brief description of the drawings is as follows:
FIG. 1 is a flow chart of the information processing method for a chemical entity of the present application in one embodiment.
FIG. 2 is a schematic diagram of in-line text content containing a chemical entity in a text document in one embodiment of the present application.
FIG. 3 is a flow chart of identifying text information in in-line text content in one embodiment of the present application.
FIG. 4 is a flow chart of identifying text information in in-line text content in another embodiment of the present application.
FIG. 5 is a diagram showing an example of identifying chemical entities and number entities in text information in one embodiment of the present application.
FIG. 6 is a schematic diagram of table content containing chemical entities in a text document in one embodiment of the present application.
FIG. 7 is a flow chart of identifying text information in table content in another embodiment of the present application.
FIG. 8 is a diagram showing examples of chemical entities and numbers in the table content of a picture to be identified in one embodiment of the present application.
FIG. 9 is a flow chart of identifying table content in a picture to be identified in one embodiment of the present application.
FIG. 10 is a flow chart of the information processing method for a chemical entity of the present application in another embodiment.
FIG. 11 is a schematic diagram of the information processing system of the present application in one embodiment.
FIG. 12 is a schematic diagram of the information processing system of the present application in another embodiment.
FIG. 13 is a system block diagram of the computer system of the present application in one embodiment.
Detailed Description
Further advantages and effects of the present application will be readily apparent to those skilled in the art from the present disclosure, by describing the embodiments of the present application with specific examples.
In the following description, reference is made to the accompanying drawings, which describe several embodiments of the present application. It is to be understood that other embodiments may be utilized and that mechanical, structural, electrical, and operational changes may be made without departing from the spirit and scope of the present disclosure. The following detailed description is not to be taken in a limiting sense, and the scope of embodiments of the present application is defined only by the claims of the issued patent. The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.
Although the terms first, second, etc. may be used herein to describe various elements, information or parameters in some examples, these elements or parameters should not be limited by these terms. These terms are only used to distinguish one element or parameter from another element or parameter. For example, a first image classifier may be referred to as a second image classifier, and similarly, a second image classifier may be referred to as a first image classifier, without departing from the scope of the various described embodiments. The first image classifier and the second image classifier are both described as one image classifier, but they are not the same image classifier unless the context clearly indicates otherwise.
Furthermore, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context indicates otherwise. It will be further understood that the terms "comprises," "comprising," "includes," and/or "including" specify the presence of stated features, steps, operations, elements, components, items, categories, and/or groups, but do not preclude the presence, occurrence, or addition of one or more other features, steps, operations, elements, components, items, categories, and/or groups. The terms "or" and "and/or" as used herein are to be construed as inclusive, meaning any one or any combination. Thus, "A, B or C" or "A, B and/or C" means any of the following: A; B; C; A and B; A and C; B and C; A, B and C. An exception to this definition occurs only when a combination of elements, functions, steps, or operations is in some way inherently mutually exclusive.
Innovative drug development follows two modes: "First in Class" and "Fast Follow". Fast Follow is imitative innovation: on the basis of a known target and mechanism, and without infringing others' patents, the molecular structure of an existing new drug is transformed or modified to obtain a new drug with the same or a similar mechanism of action and a new therapeutic effect. To date, most innovative pharmaceutical companies in the industry work in the Fast Follow mode. Fast Follow requires tracking the latest research progress of peer companies or institutions on a target, particularly their patents, in order to carry out molecular design and optimization. To read and analyze those patents effectively, the molecular structures and their corresponding experimental data need to be extracted and arranged into structured information.
In the Fast Follow new drug development scenario, the most important chemical entities in the patent literature to be processed are the examples and their corresponding biochemical experimental data; in a patent's examples, compounds are generally represented as two-dimensional chemical structures or IUPAC names. Existing tools must recognize many kinds of chemical entities, so their gain in generality inevitably comes at the cost of handling specific situations. This shows in two ways: (1) the recognition results contain many useless redundant items, which hinders later use; (2) recognition of the chemical entities that actually matter (such as the IUPAC names or structural formulas of the examples) is incomplete or erroneous.
Most existing tools extract only the chemical entities in the text and lose the number references. Those references are the bridge between a chemical entity and its biochemical experimental data, so without the number information it is difficult to automatically extract from a patent complete data comprising chemical entity structures and the corresponding experimental records.
In view of this, the present application provides an information processing method for chemical entities that accurately extracts chemical entity information, such as IUPAC names, from documents such as patents and papers, as useful structured data for drug discovery or development work. It will be appreciated that in practice a chemical entity may appear in the literature in many different forms, including for example the systematic name of the compound (following the nomenclature of the International Union of Pure and Applied Chemistry, abbreviated IUPAC), the molecular formula of the compound (e.g. C2H6), the registration ID of the compound in a large database (e.g. CAS number), technical terms (e.g. aspirin), drug common names (e.g. acetaminophen), drug trade names (e.g. blackish), etc. For ease of understanding and for brevity, the following examples use compound names that follow the IUPAC naming convention, but the application is not limited thereto.
In an embodiment, the information processing method for a chemical entity is mainly performed by an information processing system or an information processing device for chemical entities, where the information processing system is software and hardware configured in a computer system. The computer system is an electronic device capable of data computation and logic processing, for example a personal terminal device, a server, a cloud-based server system, or a combination thereof. Taking a personal terminal device as an example, the information processing system retrieves a program stored on the device in response to user operation, and under the control of the program's instructions the hardware of the device cooperates to accurately identify one or more pieces of chemical structure information in the chemical entity information. Taking a server system as an example, the information processing system executes cooperatively, under the control of the instructions of the running program, according to the computing tasks configured on at least one server, so as to accurately identify and extract one or more pieces of chemical structure information in the chemical entity information. Through text recognition and image recognition, these examples systematically address the bottlenecks at several key steps of recognizing and extracting chemical entity information, and all chemical structure information obtained in this way can be restored to chemical entity information, for example the molecular structural formula or IUPAC name of the chemical entity.
In an embodiment, the hardware of the computer system includes at least one memory and at least one processor. The at least one memory stores at least one program. In some examples, the at least one memory may also include memory remote from the one or more processors, such as network-attached storage accessed via RF circuitry or external ports and a communication network, which may be the internet, one or more intranets, a local area network (LAN), a wide area network (WAN), a storage area network (SAN), etc., or a suitable combination thereof. A memory controller may control access to the memory by other components of the device, such as the CPU and peripheral interfaces. The memory optionally includes high-speed random access memory, and optionally also non-volatile memory, such as one or more magnetic disk storage devices, flash memory devices, or other non-volatile solid-state memory devices. The memory may include volatile memory, such as random access memory (RAM); it may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid-state drive (SSD).
The at least one processor includes an integrated circuit chip having signal processing capabilities; or comprises a general purpose processor, which may be, for example, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a discrete gate or transistor logic, a discrete hardware component, a processor (CPU) integrating at least one processor core, or the like. The at least one processor and the at least one memory may be in data communication via one or more communication buses or signal lines, for invoking and executing the at least one program to perform the information processing method.
The computer system further includes a hardware device comprising: at least one of man-machine interaction device, display device, network interface device, etc. The man-machine interaction device is used for being operated by a user (such as a technician) to enable the electric signals generated by operation to be transmitted to at least one processor, so that the corresponding program is called to conduct data processing on the numerical values/instructions/information represented by the received electric signals. Examples of the man-machine interaction device include at least one of the following: a mouse, keyboard, touch screen, etc.
The display device displays the visual content output after processing by the at least one processor. The visual content includes, for example, the text and table content of the identified chemical entities, the systematic names of the chemical entities and their corresponding numbers, recognition error prompts, or the converted chemical structure information in the preset format. The display device is, for example, a monitor.
The network interface device provides data communication among a plurality of network nodes that include the computer system. It includes, for example, a wired network interface such as a fiber-optic interface, or a wireless network interface such as a Wi-Fi interface.
The program in the information processing system comprises a plurality of software modules, which are executed sequentially or concurrently, driven by data, instructions, and the like. Each software module and its execution are described in detail below.
Referring to fig. 1, a flowchart of an information processing method of a chemical entity according to an embodiment of the present application is shown, and the information processing method of the chemical entity includes the following steps:
step S10, detecting the line text content and/or the table content of the chemical entity in the acquired document to determine an object to be identified; in an embodiment, an information processing system of a computer system detects the context content and/or form content of a chemical entity in an acquired document to determine an object to be identified; the document comprises a text document or/and a picture document; in an embodiment, the document is a pharmaceutical product description document, a pharmaceutical paper document, or a pharmaceutical patent document, a clinical trial document, an audit document, or a clinical study document, etc. from different data sources. The document format includes HTML format, XML format, TXT format, word format, PDF format, or the like.
It should be understood that the text document refers to a document in which a computer system directly reads text or character information in a line, for example, in HTML format, XML format, TXT format, word format, or editable PDF format, etc.; the picture document refers to a file in an image format, the format of a computer stored picture/image is common, and the common stored formats include a bmp format, a jpg format and a non-editable pdf format; png format, tif format, gif format, pcx format, tga format, exif format, fpx format, svg format, psd format, cdr format, pcd format, dxf format, ufo format, eps format, ai format, raw format, WMF format, webp format, avif format, apng format, and the like.
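The distinction between text documents and picture documents can be sketched as a simple routing step. The extension sets below cover only a subset of the formats listed above, and the function name `classify_document` and the `pdf_has_text_layer` flag are assumptions for illustration. Because a file extension cannot tell an editable PDF from a non-editable one, the sketch leaves that decision to the caller.

```python
from pathlib import Path

# Illustrative subsets of the formats named in the text.
TEXT_EXTENSIONS = {".html", ".htm", ".xml", ".txt", ".doc", ".docx"}
PICTURE_EXTENSIONS = {".bmp", ".jpg", ".jpeg", ".png", ".tif", ".tiff", ".gif", ".webp"}

def classify_document(path, pdf_has_text_layer=None):
    """Return "text", "picture", or "unknown" for a document path.

    pdf_has_text_layer is supplied by the caller (for example from a separate
    PDF inspection step, not shown); None means it has not been checked.
    """
    ext = Path(path).suffix.lower()
    if ext in TEXT_EXTENSIONS:
        return "text"
    if ext in PICTURE_EXTENSIONS:
        return "picture"
    if ext == ".pdf":
        if pdf_has_text_layer is None:
            return "unknown"
        return "text" if pdf_has_text_layer else "picture"
    return "unknown"

print(classify_document("patent.xml"))
print(classify_document("scan.PDF", pdf_has_text_layer=False))
```

A document routed as "picture" would go to image-based detection and recognition, while a "text" document can be read directly.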
In an embodiment, the data source comprises a paper document or an electronic document, such as: a pharmaceutical product description document, a pharmaceutical paper document, or a pharmaceutical patent document, a clinical trial document, an audit document (e.g., a pharmaceutical audit or approval document, etc.), a clinical study document, or a molecular database, etc., for example, a pharmaceutical product description document, a pharmaceutical paper document, or a pharmaceutical patent document or a web interface (e.g., an html format file, etc.), in a format such as Word, pdf, jpg, TXT, etc., or an electronic file, for example, from a molecular database, etc.
Taking the document as an example, the patent document may be a document containing text data such as claims and specifications and graphic data such as a drawing of the specification and a drawing of the abstract. The patent document may be a chinese patent document, for example, a PDF-format document, or a foreign patent document, for example, a us patent document, a european patent document, a japanese patent document, or the like, and if the patent document is a us patent document, the us patent and trademark office also provides an additional image document in TIFF format, and if it is a chemical image, they also provide corresponding chemical structure documents in MOL and CDX format for the XML patent document.
In some embodiments, the document source may also be a journal, a paper, or the like, for example Chinese journals, magazines, or papers published in general university journals, provincial journals, core journals, and the like, or foreign journals, magazines, or papers such as those in SCI (Science Citation Index), Science, and the like. The journal, magazine, or paper may contain text information such as the abstract, discussion, experiments, and conclusion, and image information such as figures. In specific implementations, the document source may be another type of document containing text and images besides those listed above, depending on the application scenario. The specific types and contents of the document sources are not limited in this application.
In an example, where the document is in a picture format, the document is obtained, for example, by photographing the chemical structure on a pharmaceutical product description. In another example, the picture is obtained from a displayed web interface, for example by a screen capture function of the electronic device. In yet another example, the document is obtained from a paper document, for example by scanning it with a scanner. In yet another example, the document is an image from a molecular database pre-stored in an image database.
In one embodiment, the information processing system of the computer system obtains the document by receiving a local upload or by capturing it from a network with a crawler tool. For example, the computer system obtains the document from a file uploaded by a user, obtains the information in the document from content entered by the user (for example, with a file editor), obtains documents containing chemical entity information by web crawler technology, or obtains documents containing chemical entity information by NER technology.
In one embodiment, the information processing system of the computer system may further obtain the document containing the chemical entity through a network search. For example, a user inputs a search request on a search interface displayed by the computer system, and a search engine loaded in the computer system searches the network to obtain the document containing the chemical entity. In this embodiment, the search element is, for example, a keyword, a search formula, or a document number such as a patent application number (International Application Number), publication number (International Publication Number), announcement number, or patent number (Patent No.).
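As a rough illustration of how a search element might be sorted into a document number or a keyword, the following Python sketch uses two regular expressions. The function name `classify_search_element` and both patterns are hypothetical; the patterns are modeled only on the identifiers quoted in this description and do not cover the number formats of every patent office.

```python
import re

# Hypothetical patterns, modeled only on identifiers quoted in this description
# (e.g. PCT/CN2015/071266 and WO2015110024A1); real formats vary more widely.
PCT_APPLICATION = re.compile(r"PCT/[A-Z]{2}\d{4}/\d{6}")
PUBLICATION = re.compile(r"[A-Z]{2}\s?\d{6,12}\s?[A-Z]\d?")


def classify_search_element(query: str) -> str:
    """Return 'application_number', 'publication_number', or 'keyword'."""
    q = query.strip().upper()
    if PCT_APPLICATION.fullmatch(q):
        return "application_number"
    if PUBLICATION.fullmatch(q):
        return "publication_number"
    return "keyword"
```

Anything that matches neither pattern is treated as a keyword, so a search formula would need its own handling in a fuller implementation.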
To explain the inventive concept and principle of the present application, the following embodiments take a patent document as an example of the document acquired by the information processing system of the computer system. Specifically, the following descriptions of steps S10 to S12 use as examples the patent document with application number (International Application Number) PCT/CN2015/071266 and publication number (International Publication Number) WO2015110024A1, and the patent document with application number PCT/EP2019/069744 and publication number WO2020020858A1. It should be understood that the detailed description of these specific examples is not to be taken in a limiting sense.
Referring to fig. 2, a schematic diagram of the line text content of a chemical entity in a text document according to an embodiment of the present application is shown. Fig. 2 (a) shows the line text content of a chemical entity in a text document in Word format, and fig. 2 (b) shows the line text content of a chemical entity in a text document in TXT format. In the embodiment shown in fig. 2 (a), the document containing the chemical entity is the Word-format document of the patent document with application number PCT/CN2015/071266, published as WO2015110024A1; in the embodiment shown in fig. 2 (b), it is the TXT-format document of the same patent document.
Step S11, identifying chemical entities and numbering entities in the object to be identified to obtain the systematic names and corresponding numbers of the chemical entities. In an embodiment, this identification is performed by the information processing system of the computer system. In some documents, chemical entities are numbered Compound-1, Compound-2, Compound-3, ..., Compound-n; in other documents, they are numbered Example 1, Example 2, Example 3, ..., Example n, and so on. It should be understood that the number of a chemical entity is used in a document to mark different chemical entities and has a fixed written form within the same document, such as the "Compound-1" format or the "Compound 1" format. The numbering of chemical entities is not limited to the examples described above.
In step S10, the line text content in the document is located by text features to determine the object to be identified, and the table content in the document is located by table features to determine the object to be identified. In other words, line text content and table content are extracted with different strategies in step S10.
In an embodiment, the information processing system of the computer system locates the line text content in the document by text features to determine the object to be identified. When the information processing system detects that the acquired document is a text document, for example in HTML, XML, TXT, or Word format, the line text content and the table content of the input text document are extracted according to different strategies. The following embodiments are described using a text document in Word format or TXT format as an example, as shown in fig. 2.
In an embodiment, locating the line text content in the document by text features means determining the sentences or paragraphs to be identified by character matching of identification fields. Specifically, the information processing system instantiates the full text into a data structure similar to a linked list, with each line as an element, and then matches each line against the regular expression for the identification field of the chemical entity to determine the line in which the identification field is located. The object to be identified determined in this way is a sentence or a paragraph.
In step S11 of identifying the chemical entities and numbering entities in the object to be identified, when the acquired document is detected to be a text document, a preset line text recognition model is called to recognize the chemical entities and numbering entities contained in the text information of the line text content, so as to obtain the systematic names and the numbers of the chemical entities respectively, and the systematic name and corresponding number of each chemical entity are determined according to the positional relationship between the systematic names and the numbers in the text information. In an embodiment, the line text recognition model includes a trained deep learning model and/or a regular expression model.
Referring to fig. 3, a flowchart of recognizing the text information of the line text content in an embodiment of the present application is shown. As shown, in an embodiment in which a line text recognition model including a regular expression model is used to recognize the chemical entities and numbering entities contained in the text information of the line text content, the information processing system performs the following steps:
Step 201, full-text instantiation; in this embodiment, the information processing system instantiates the entire text document into a linked-list-like data structure with each line as an element.
Step 202, matching each line using the regular expression for the identification field of the chemical entity; in this embodiment, the information processing system matches each line against the regular expression for the identification field of the chemical entity, determines the line in which the identification field is located, and extracts any number information present in the characters before and after the identification field.
Step 203, performing regular-expression matching of the systematic name field of the chemical entity on the lines before and after the identification field; in this embodiment, the information processing system performs regular-expression matching of the systematic name (IUPAC) field on the lines before and after the identification field, and determines whether each such line belongs to the systematic name (IUPAC) lines of the chemical entity corresponding to the identification field found in step 202.
Step 204, determining whether the systematic name of the chemical entity has ended by regular-expression matching of a termination identifier or a line-break identifier; in this embodiment, if a termination identifier is matched, the content is truncated there and the remaining content is discarded; if a line-break identifier is recognized and the beginning or end of an adjacent line meets the splicing condition (again determined by regular-expression matching), the adjacent line is spliced with the current line, and this continues until a termination identifier is matched.
Step 205, performing regular-expression matching of the number field; in this embodiment, the information processing system performs regular-expression matching of the number field within the systematic name character segment obtained in step 204, so as to extract any number information present and to separate out the complete systematic name of the chemical entity, free of other redundant information.
In this embodiment, the information processing system performs steps 203 to 205 on each identification field identified in step 202 to obtain the systematic names and numbers of all chemical entities in the full text. Taking the line text content shown in fig. 2 (a) or (b) (e.g., the shaded part of the figure) as an example, the processing of steps 201 to 205 yields the systematic (IUPAC) name of the chemical entity: "5-chloro-N-(((3S,3aS)-1-oxo-8-(2-oxopyridin-1(2H)-yl)-3,3a,4,5-tetrahydro-1H-benzo[b]oxazolo[3,4-d][1,4]oxazepin-3-yl)methyl)thiophen-2-carboxamide", and the number of the chemical entity: "Example 23". The number "Example 23" is also regarded as the fixed number of the chemical entity in this document.
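Steps 201 to 205 can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the description does not disclose its regular expressions, so every pattern below (`ID_FIELD`, `NAME_START`, `ENDS_OPEN`, `STARTS_JOINED`) and the function name `extract_entities` are hypothetical, a plain list stands in for the linked-list-like structure, and step 203 is simplified to look only at the rest of the identification line and the line that follows it.

```python
import re

# Assumed patterns; the description does not disclose the actual expressions.
ID_FIELD = re.compile(r"(?P<number>(?:Example|Compound)[ -]?\d+)\s*[:.]?\s*")
NAME_START = re.compile(r"^[\[(]*\d")      # crude: optional brackets, then a locant digit
ENDS_OPEN = re.compile(r"[-,(\[]$")        # current text breaks off in the middle of a name
STARTS_JOINED = re.compile(r"^[-,)\]]")    # next line continues an unfinished name
TERMINATORS = (".", ";")                   # assumed termination identifiers


def extract_entities(text):
    """Return (number, systematic_name) pairs found in the line text."""
    lines = [ln.strip() for ln in text.splitlines()]        # step 201
    results = []
    for i, line in enumerate(lines):
        match = ID_FIELD.search(line)                        # step 202
        if not match:
            continue
        number = match.group("number")
        j, name = i, line[match.end():]
        if not NAME_START.match(name):                       # step 203
            if i + 1 < len(lines) and NAME_START.match(lines[i + 1]):
                j, name = i + 1, lines[i + 1]
            else:
                continue
        while j + 1 < len(lines):                            # step 204
            if name.endswith(TERMINATORS):
                break
            nxt = lines[j + 1]
            if not nxt or not (ENDS_OPEN.search(name) or STARTS_JOINED.match(nxt)):
                break
            name += nxt                                      # splice the adjacent line
            j += 1
        name = name.rstrip("".join(TERMINATORS))
        name = ID_FIELD.sub("", name).strip()                # step 205
        results.append((number, name))
    return results
```

The splice condition checks both the end of the current text and the start of the next line, mirroring the "beginning or end of an adjacent line" wording of step 204; a production system would need far more careful patterns for systematic names.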
Referring to fig. 4, a flowchart of recognizing the text information of the line text content in another embodiment of the present application is shown. As shown, in an embodiment in which a line text recognition model including a deep learning model is used to recognize the chemical entities and numbering entities contained in the text information of the line text content, the information processing system performs the following steps:
Step 401, splitting the full text into sentences; in an embodiment, the information processing system performs sentence splitting on the acquired text document.
Step 402, feeding each sentence obtained by sentence splitting into an NER (Named Entity Recognition) model; in an embodiment, the information processing system processes each sentence by invoking an NER model pre-stored locally or on the network.
Step 403, the NER model classifies each token of the sentence; in an embodiment, the information processing system classifies the tokens in the sentence, for example using the character "0" to denote other content, the character "1" to denote a number, and the character "2" to denote the systematic name of a chemical entity (here illustrated with the systematic name being an IUPAC name). In an embodiment, tokens are units such as keywords, variable names, punctuation marks, and brackets.
Referring to fig. 5, an example of identifying chemical entities and numbering entities in text information according to one embodiment of the present application is shown. As illustrated, assume that sentence A shown in fig. 2 (a), "[00268] Example 23: 5-chloro-N-(((3S,3aS)-1-oxo-8-(2-oxopyridin-1(2H)-yl)-3,3a,4,5-tetrahydro-1H", is fed into the NER model of the information processing system; the information processing system classifies each token of sentence A using the NER model and obtains result A':
“0000000011111111110022222222222222222222222222222222222222222222222222222222222222222222222222222”。
Further, assume that sentence B, "-benzo[b]oxazolo[3,4-d][1,4]oxazepin-3-yl)methyl)thiophen-2-carboxamide", is fed into the NER model of the information processing system; classifying each token in sentence B with the NER model yields result B':
“2222222222222222222222222222222222222222222222222222222222222222222222222”。
Step 403, performing post-processing on the recognition result to obtain the systematic (IUPAC) name of the chemical entity and its corresponding number information. In this example, the systematic name of the chemical entity is the text corresponding to the "2" characters in sentence A and sentence B, and the number is the field "Example 23" corresponding to the "1" characters in sentence A. That is, in this example, the IUPAC name of the chemical entity is "5-chloro-N-(((3S,3aS)-1-oxo-8-(2-oxopyridin-1(2H)-yl)-3,3a,4,5-tetrahydro-1H-benzo[b]oxazolo[3,4-d][1,4]oxazepin-3-yl)methyl)thiophen-2-carboxamide", and the number of the chemical entity is "Example 23".
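The post-processing of label strings such as A' and B' can be sketched as below. The sketch assumes one label per character, which is how the example results above appear to be laid out. The names `labelled_spans` and `decode` are hypothetical, and the label strings are built by length here rather than produced by a real NER model.

```python
from itertools import groupby

KINDS = {"1": "number", "2": "name"}   # label "0" (other content) is ignored


def labelled_spans(sentence, labels):
    """Yield (kind, text) for each run of identical non-zero labels."""
    if len(sentence) != len(labels):
        raise ValueError("one label per character is expected")
    pos = 0
    for label, run in groupby(labels):
        length = sum(1 for _ in run)
        if label in KINDS:
            yield KINDS[label], sentence[pos:pos + length]
        pos += length


def decode(sentences, label_strings):
    """Merge spans from consecutive sentences into (number, systematic_name)."""
    number, name = None, ""
    for sentence, labels in zip(sentences, label_strings):
        for kind, text in labelled_spans(sentence, labels):
            if kind == "number":
                number = text
            else:
                name += text          # a name may continue across sentences
    return number, name


# Sentences A and B from the example; the labels are constructed, not predicted.
a = "[00268] Example 23: 5-chloro-N-(((3S,3aS)-1-oxo-8-(2-oxopyridin-1(2H)-yl)-3,3a,4,5-tetrahydro-1H"
b = "-benzo[b]oxazolo[3,4-d][1,4]oxazepin-3-yl)methyl)thiophen-2-carboxamide"
labels_a = "0" * 8 + "1" * 10 + "0" * 2 + "2" * (len(a) - 20)
labels_b = "2" * len(b)
number, name = decode([a, b], [labels_a, labels_b])
```

Concatenating the "2" spans of both sentences is what lets a systematic name that is split across a line break be recovered as a single string.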
In step S11 of identifying the chemical entities and numbering entities in the object to be identified, when the acquired document is detected to be a text document, a preset table recognition model is called to recognize the chemical entities and numbering entities in the text information of the table content, so as to obtain the systematic names and the numbers of the chemical entities respectively, and the systematic name and corresponding number of each chemical entity are determined according to the table attributes of the table content. In an embodiment, the table recognition model comprises a regular expression model. The table attributes include one or more of table header content, table columns, or table rows. Referring to fig. 6, a schematic diagram of the table content of chemical entities in a text document according to an embodiment of the present application is shown: fig. 6 (a) shows the table content of chemical entities in a text document in Word format, and fig. 6 (b) shows the table content of chemical entities in a text document in TXT format. Specifically, fig. 6 (a) shows, in Word format, the first type of table content of chemical entities, i.e., a compound table or systematic name table, in the patent document with application number PCT/EP2019/069744 and publication number WO2020020858A1, and fig. 6 (b) shows the same table content in TXT format.
Referring to fig. 7, a flowchart of recognizing the text information of table content in another embodiment of the present application is shown. As shown, in an embodiment in which a table recognition model including a regular expression model is used to recognize the chemical entities and numbering entities contained in the text information of the table content, the information processing system performs the following steps:
Step 601, extracting all table elements in the document and converting them into table instances in the program; in an embodiment, the information processing system extracts the table elements in a text document, such as a Word, TXT, or HTML document, and converts them into table instances in the program. Taking a table element in an HTML text document as an example, an HTML table is composed of a table element and one or more tr, th, or td elements, where the tr element defines a table row, the th element defines a header cell, and the td element defines a table cell; more complex HTML tables may also include caption, col, colgroup, thead, tfoot, and tbody elements. For example, "Compound No." and "Name" in the Word text document of fig. 6 (a) are the header content of the table.
Step 602, performing regular-expression matching of identifiers on the table header content to determine the cells that may be the number column header and/or the chemical entity column header; in an embodiment, the information processing system performs regular-expression matching of identifiers on the table header content so as to determine, from the column headers, the header cell of the number column and the header cell of the chemical entity (e.g., compound) column. For example, "Compound No." in the Word text document of fig. 6 (a) is identified as the header cell of the number column, and "Name" is identified as the header cell of the chemical entity (e.g., compound) column.
Step 603, performing regular-expression matching on the table content to determine the number column and the chemical entity column; in an embodiment, the information processing system performs regular-expression matching on all cell contents in the columns corresponding to the header cells determined in step 602, to determine whether each column has the characteristics of a number column or a chemical entity (e.g., IUPAC) column, thereby determining the correct number column and chemical entity column.
Step 604, integrating the results; in an embodiment, the information processing system performs steps 602 and 603 on all tables in the document, and integrates the results according to one or more of the table header content, table columns, or table rows, to obtain the systematic names and corresponding numbers of the chemical entities in all tables.
Taking the table content (the shaded part of the table) shown in fig. 6 (a) or (b) as an example, that is, in the embodiment using the patent document with publication number WO2020020858A1, by recognizing the text information in the table content through steps 601 to 604 of fig. 7, the information processing system identifies the number of one of the chemical entities as "Compound-1" and its IUPAC name as "[(1S)-2-(7-methylimidazofuran-3-yl)-1-[(2-methylsulfanylacetyl)amino]ethyl]boronic acid", thereby obtaining the systematic name and corresponding number of the chemical entity in the table. It should be understood that the table content in the text documents shown in fig. 6 (a) and (b) includes the IUPAC names of 8 chemical entities and their corresponding numbers; in actual implementation, all of them can be obtained by recognizing the text information in the table content through steps 601 to 604 of fig. 7, which is not described in detail here.
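Steps 602 to 604 can be sketched in Python as follows, assuming step 601 has already turned each table into a list of rows whose first row is the header. All four patterns and the function names `find_columns` and `extract_pairs` are hypothetical, since the description does not give its expressions. The second name in the usage data is a placeholder and is not taken from the patent.

```python
import re

# Assumed patterns; the description does not disclose the actual expressions.
NUMBER_HEADER = re.compile(r"\b(?:compound|example)\s*(?:no\.?|number|#)", re.IGNORECASE)
NAME_HEADER = re.compile(r"\b(?:name|iupac)\b", re.IGNORECASE)
NUMBER_CELL = re.compile(r"^(?:(?:Compound|Example)[ -]?)?\d+$")
NAME_CELL = re.compile(r"^[\[(]*\d.*[A-Za-z]")   # crude test for a systematic name


def find_columns(table):
    """Return (number_column, name_column) indexes, or None where not found."""
    header, *rows = table
    number_col = name_col = None
    for idx, title in enumerate(header):                       # step 602: header match
        cells = [row[idx].strip() for row in rows]
        if NUMBER_HEADER.search(title) and all(NUMBER_CELL.match(c) for c in cells):
            number_col = idx                                   # step 603: cells confirm
        elif NAME_HEADER.search(title) and all(NAME_CELL.match(c) for c in cells):
            name_col = idx
    return number_col, name_col


def extract_pairs(tables):
    """Step 604: collect (number, systematic_name) pairs from every table."""
    pairs = []
    for table in tables:
        number_col, name_col = find_columns(table)
        if number_col is None or name_col is None:
            continue                                           # not a compound table
        pairs.extend((row[number_col].strip(), row[name_col].strip()) for row in table[1:])
    return pairs


tables = [
    [["Compound No.", "Name"],
     ["1", "[(1S)-2-(7-methylimidazofuran-3-yl)-1-[(2-methylsulfanylacetyl)amino]ethyl]boronic acid"],
     ["2", "2-acetyloxybenzoic acid"]],            # placeholder row, not from the patent
    [["Step", "Yield"], ["1", "85%"]],             # unrelated table, skipped
]
pairs = extract_pairs(tables)
```

Because the number and the name sit in the same row, pairing them needs no positional reasoning beyond the row index, which is the main difference from the line text case.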
In step S10 of detecting the line text content and table content of chemical entities in the acquired document, when the acquired document is a picture document, for example in bmp, jpg, or non-editable pdf format, the method further comprises: the information processing system divides the picture document into a plurality of pictures according to a preset segmentation rule, performs image enhancement on each picture, and stores the pictures as pictures to be identified.
In this embodiment, when the acquired document is a picture document, the picture document is split into a plurality of pictures using the page of the document as the splitting unit. Taking a patent document in pdf format as an example, the information processing system splits the whole patent document into a plurality of pictures page by page according to its page numbers; in a specific implementation, the page number may be the page number located in the page header or the page number located in the page footer of the document.
In this embodiment, the information processing system performs image segmentation on the received picture document based on at least one algorithm such as color space similarity, image feature extraction and classification, or image masking. Each of these algorithms may be an image segmentation algorithm obtained through machine learning, for example an image segmentation algorithm built on the network layer architecture of the Fast R-CNN algorithm or of the VGGNet algorithm.
To improve the recognition rate of the line text information and table information in the segmented pictures, the information processing system performs image preprocessing on them. In practice, the image preprocessing includes at least one of the following: resizing the segmented picture to a preset image size, and adjusting the sharpness of the picture by image processing. For example, the information processing system scales the line text content or table content proportionally according to the preset image size, and/or pads a blank background around the line text content or table content in the picture according to the preset image size. In another example, the information processing system applies noise reduction, sharpening, and similar processing to the segmented pictures to improve their clarity.
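The proportional scaling and blank-background padding described above reduce to a small geometry calculation, sketched here in Python. The function name `letterbox_geometry` and the choice to center the content are assumptions for illustration; the description does not say where the blank background is placed.

```python
def letterbox_geometry(width, height, target_w, target_h):
    """Return ((new_w, new_h), (left, top, right, bottom)) for a preset image size.

    The content is scaled by a single factor so its aspect ratio is kept, then
    padded with blank background until it fills target_w x target_h.
    """
    if min(width, height, target_w, target_h) <= 0:
        raise ValueError("all dimensions must be positive")
    scale = min(target_w / width, target_h / height)
    new_w = min(target_w, max(1, round(width * scale)))
    new_h = min(target_h, max(1, round(height * scale)))
    left = (target_w - new_w) // 2            # centered padding is an assumption
    top = (target_h - new_h) // 2
    return (new_w, new_h), (left, top, target_w - new_w - left, target_h - new_h - top)
```

The returned size and padding can then be passed to whatever image library performs the actual resize and border fill.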
In an embodiment, the step of identifying the line text content in the picture to be identified includes:
extracting the line text content in the picture to be identified using OCR to obtain text information; in an embodiment, the information processing system uses an OCR algorithm to recognize the line text content in each picture obtained by page-by-page splitting, so as to obtain the text information in the picture, for example as TXT text; in this embodiment, the information processing system recognizes each segmented picture of the picture document, thereby obtaining the full-text TXT (i.e., the text information) of the picture document.
invoking a preset line text recognition model to recognize the chemical entities and numbering entities contained in the text information of the line text content, so as to obtain the systematic names and the numbers of the chemical entities respectively, and determining the systematic name and corresponding number of each chemical entity according to the positional relationship between the systematic names and the numbers in the text information. In an embodiment, the line text recognition model invoked by the information processing system includes a trained deep learning model and/or a regular expression model.
In the embodiment in which a line text recognition model including a regular expression model is used to recognize the chemical entities and numbering entities contained in the text information of the line text content, the operations performed by the information processing system are as described above for steps 201 to 205 and are not repeated here.
In the embodiment in which a line text recognition model including a deep learning model is used to recognize the chemical entities and numbering entities contained in the text information of the line text content, the operations performed by the information processing system are as described above for steps 401 to 403 and are not repeated here.
Referring to fig. 8, an example of chemical entities and numbers in the table content to be identified in an embodiment of the present application is shown. Fig. 8 (a) and (b) are each a picture, cut from a PDF document, containing a table to be identified. Fig. 8 (a) shows table content in the patent document with application number PCT/EP2019/069744 and publication number WO2020020858A1 that includes only the IUPAC names and numbers of chemical entities; fig. 8 (b) shows table content in the same patent document that includes not only the IUPAC names and numbers but also the molecular structure diagrams of the chemical entities.
Referring to fig. 9, a flowchart of identifying table contents in a picture to be identified in an embodiment of the present application is shown, where the step of identifying the table contents in the picture to be identified includes:
Step 801, inputting the plurality of pictures into a preset table detection model to detect table images, and then cutting out and saving the table images as table pictures to be identified; in this embodiment, the table detection model locates the table image in each picture by using a target detection model/algorithm such as YOLO or Fast R-CNN, and cuts out and saves the table image as a table picture to be identified.
Step 802, detecting the table content in the table picture to be identified to locate the chemical entity content and the numbering entity content therein; in an embodiment, the information processing system performs target detection on the chemical entities and on the numbering entities in the table content of the table picture to be identified, so as to locate the chemical entity content and the numbering entity content in the table content. In a specific example, the position of each chemical entity content and numbering entity content is located using a target detection model/algorithm such as YOLO or Fast R-CNN. In this embodiment, the chemical entity content refers to the region of the table image that contains a chemical entity; correspondingly, the numbering entity content refers to the region of the table image that contains a numbering entity.
Step 803, extracting the positional relationship between the chemical entity content and the numbering entity content in the table picture to be identified, and extracting the chemical entity content and the numbering entity content using OCR to obtain text information; in this embodiment, the information processing system extracts the positional relationship according to the table layout: it identifies the image features of the chemical entity content and of the numbering entity content, and obtains the positional relationship between each chemical entity content and its corresponding numbering entity from the coordinate relationship of these image features in the image.
In this embodiment, the information processing system performs OCR on the chemical entity content and the numbering entity content obtained through target detection, that is, it extracts the chemical entity content and the numbering entity content using OCR to obtain text information.
Step 804, calling a preset line text recognition model to recognize the chemical entities and numbering entities contained in the text information, so as to obtain the systematic names and the numbers of the chemical entities respectively, and determining the systematic name and corresponding number of each chemical entity according to the positional relationship. In an embodiment, the information processing system invokes the preset line text recognition model (for example, as in the embodiment described above for steps 201 to 205 or the embodiment described for steps 401 to 403) to recognize the chemical entities and numbering entities contained in the text information, so as to obtain each chemical entity and its corresponding number in all tables.
As shown in fig. 8 (a) or (b), in the embodiment using the patent document with application number PCT/EP2019/069744 and publication number WO2020020858A1, by recognizing the text information in the table content through steps 801 to 804 of fig. 9, the information processing system identifies the number of one chemical entity as "Compound-1" and obtains its IUPAC name: "[(1S)-2-(7-methylimidazofuran-3-yl)-1-[(2-methylsulfanylacetyl)amino]ethyl]boronic acid". It should be understood that the table content of the picture document shown in fig. 8 (a) or (b) includes the IUPAC names of a plurality of chemical entities and their corresponding numbers; in actual implementation, the IUPAC name and corresponding number of each chemical entity can be obtained through the processing of steps 801 to 804 of fig. 9, which is not described in detail here.
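The coordinate-based association of steps 803 and 804 can be illustrated with the following Python sketch. It assumes a layout like fig. 8 (a), where a number and its chemical entity share a table row, so the pairing rule is "largest vertical overlap". The `Box` type, the function names, and the coordinates in the test are all hypothetical; other table layouts would need a different rule.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """A detected region in image coordinates (y grows downward) with its OCR text."""
    x0: float
    y0: float
    x1: float
    y1: float
    text: str


def vertical_overlap(a: Box, b: Box) -> float:
    """Height of the band shared by both boxes, or 0.0 if they do not overlap."""
    return max(0.0, min(a.y1, b.y1) - max(a.y0, b.y0))


def pair_by_row(number_boxes, entity_boxes):
    """Pair each chemical entity box with the number box that shares its row."""
    pairs = []
    for entity in sorted(entity_boxes, key=lambda box: box.y0):
        best = max(number_boxes, key=lambda n: vertical_overlap(n, entity), default=None)
        if best is not None and vertical_overlap(best, entity) > 0:
            pairs.append((best.text, entity.text))
    return pairs
```

Entities with no overlapping number box are left unpaired rather than being attached to the nearest number, which avoids silently creating wrong associations.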
Step S12, converting the systematic name of the chemical entity into chemical structure information in a preset data format, associating the chemical structure information with the number, and then storing and/or outputting it. In an embodiment, the information processing system converts the IUPAC name of the chemical entity into chemical structure information in a preset data format, associates the chemical structure information with the number of the chemical entity, and then stores it in a storage space of the computer system or outputs it through a display interface of a display device for the user to view, capture, copy, and so on. In this embodiment, the data format is a format that can be saved as text, for example the SMILES format (Simplified Molecular Input Line Entry Specification), the InChI format, the MDL Molfile format, the SDF format, or another data format that facilitates text searching.
Taking the patent document shown in fig. 2 (application number PCT/CN2015/071266, publication number WO2015110024A1) as an example, the systematic name (IUPAC name) of the chemical entity is: "5-chloro-N-(((3S,3aS)-1-oxo-8-(2-oxopyridin-1(2H)-yl)-3,3a,4,5-tetrahydro-1H-benzo[b]oxazolo[3,4-d][1,4]oxazepin-3-yl)methyl)thiophen-2-carboxamide".
In an embodiment, step S12 of the information processing method of the present application converts the IUPAC name into the SMILES format, giving "ClC1=CC=C(S1)C(=O)NC[C@H]1OC(N2C3=C(OCC[C@H]21)C=C(C=C3)N3C(C=CC=C3)=O".
In another embodiment, step S12 of the information processing method of the present application converts the IUPAC name into the InChI format, giving "InChI=1S/C22H18ClN3O5S/c23-19-7-6-18(32-19)21(28)24-12-17-15-8-10-30-16-11-13(25-9-2-1-3-20(25)27)4-5-14(16)26(15)22(29)31-17/h1-7,9,11,15,17H,8,10,12H2,(H,24,28)/t15-,17-/m0/s1".
In this application, converting the systematic name of the chemical entity into chemical structure information in a preset data format and associating the chemical structure information with the number means organizing the chemical structure information of each chemical entity and its corresponding number in a table: the numbers are shown in one column of the table, the chemical structure information in another column, and the chemical structure information of each chemical entity and its corresponding number are located in the same row. In an embodiment, the table is, for example, an Excel table (XLS format data table) or a CSV table (CSV format data table).
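The table association described above can be sketched as follows. This is a minimal illustration using Python's standard `csv` module, not the implementation of the present application; the function name `build_table`, the column headers, and the two placeholder SMILES strings (ethanol and benzene) are assumptions introduced only for the example.

```python
import csv
import io

def build_table(records):
    """Write (number, structure) pairs as CSV text, one chemical entity per row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # Column headers are illustrative; the source only requires two columns.
    writer.writerow(["number", "smiles"])
    for number, smiles in records:
        # The number and its structure share a row, which is the association.
        writer.writerow([number, smiles])
    return buffer.getvalue()

# Placeholder data: the SMILES strings are simple stand-ins, not from the patent.
records = [("Compound-1", "CCO"), ("Compound-2", "c1ccccc1")]
table_text = build_table(records)
print(table_text)
```

Writing to a text buffer keeps the sketch free of file paths; in practice the same writer could target a file opened with `newline=""`.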
In another embodiment, when the document acquired by the computer system is a picture document, the line text content and the table content of the picture document may also include molecular structure diagrams of chemical entities, for example as shown in fig. 2 (a) and fig. 8 (b). In that case, the information processing method of the present application may further recognize the molecular structure diagram by an image recognition method to obtain the chemical structure information of the chemical entity, and store and/or output that information after associating it with the corresponding number. That is, the molecular structure diagram is recognized to obtain the corresponding SMILES format, InChI format, MDL Molfile format, SDF format, or the like, preferably the SMILES format. In an embodiment, the computer system uses one or more image classifiers to extract and recognize the chemical element symbols, charge information, chiral information, and chemical bonds in the molecular structure diagram to obtain accurate chemical structure information, for example by one or more of the technical solutions described in Chinese patent applications 202110526390.6, 202110526496.6, and/or 202110526490.9 previously filed by the applicant, the entire contents of which are incorporated herein by reference.
Referring to fig. 10, which shows a flowchart of another embodiment of the information processing method for a chemical entity of the present application, the method further includes a step S111 of performing error correction processing on the systematic name of the chemical entity by using a preset error correction model. This step corrects systematic names that contain OCR recognition errors, reduces the information loss caused by such errors, and improves the extraction rate of the systematic names of chemical entities. In this embodiment, step S111 includes:
when failure information is obtained indicating that the systematic name of the chemical entity could not be converted into chemical structure information in the preset data format, locating the error field in the systematic name of the chemical entity. In this embodiment, the information processing system converts the IUPAC name of the identified chemical entity into the SMILES format; if the conversion fails, this indicates that an error field exists in the IUPAC name, and the error field is located through a pre-stored error correction model. In this embodiment, the error correction model includes a preset field library that stores a plurality of error fields and the correct field corresponding to each error field.
correcting the error field according to the error correction model to obtain the correct systematic name of the chemical entity. In this embodiment, the information processing system uses the error correction model to replace the error field in the IUPAC name with the correct field, obtaining the correct IUPAC name of the chemical entity.
In this embodiment, the step in which the information processing system performs error correction processing on the systematic name of the chemical entity includes: first, the information processing system inputs the systematic name of the chemical entity into the field library for matching, so as to locate the error field in it; then, the information processing system corrects the name according to the correct field corresponding to the error field, obtaining the correct systematic name of the chemical entity.
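The convert, locate, correct, and retry flow of step S111 with a field library can be sketched as follows. This is only an illustrative sketch under stated assumptions: `FIELD_LIBRARY`, `correct_name`, `convert_with_correction`, and `stub_convert` are hypothetical names, the library holds a single entry taken from the "-0xo0-" example in this description, and the converter is a stand-in that returns `None` on failure rather than a real name-to-structure tool.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical field library: error field -> correct field.
FIELD_LIBRARY: Dict[str, str] = {"-0xo0-": "-oxo-"}

def correct_name(name: str, library: Dict[str, str]) -> str:
    """Replace every error field found in the name with its correct field."""
    # Longer fields are tried first so a short entry cannot split a longer match.
    for wrong in sorted(library, key=len, reverse=True):
        name = name.replace(wrong, library[wrong])
    return name

def convert_with_correction(
    name: str,
    convert: Callable[[str], Optional[str]],
    library: Dict[str, str],
) -> Tuple[str, Optional[str]]:
    """Convert a name; if conversion fails, correct the name once and retry."""
    structure = convert(name)
    if structure is not None:
        return name, structure
    corrected = correct_name(name, library)
    if corrected == name:
        return name, None  # no field in the library matched
    return corrected, convert(corrected)

def stub_convert(name: str) -> Optional[str]:
    # Stand-in converter (assumption): fails only while the OCR error remains.
    return None if "0xo0" in name else "OK"

print(convert_with_correction("1-0xo0-8", stub_convert, FIELD_LIBRARY))
```

Correction is attempted only after a failed conversion, which mirrors the description: names that already convert are left untouched.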
In this embodiment, the error correction model includes an error correction model trained by deep learning. In this embodiment, the error correction model includes an error detection model and an error correction model; specifically, the error correction model includes a seq2seq (sequence-to-sequence) model. The step of obtaining the error detection model through deep learning training includes:
collecting in advance a large batch of correct systematic names of compounds; in an embodiment, the correct IUPAC data for a large batch of compounds may be collected by web crawler technology or by NER technology; of course, the data may also be obtained by manual sorting and entry into the computer system.
segmenting all the IUPAC names in turn into 1-grams, 2-grams, 3-grams, ..., n-grams to obtain an IUPAC character fragment library. In an embodiment, the computer system segments the collected large batch of IUPAC names in this way. The n-gram is an algorithm based on a statistical language model, in which an n-gram model corresponds to a Markov chain of order n-1. The basic idea is to slide a window of size N over the text byte by byte, forming a sequence of byte fragments of length N. Each byte fragment is called a gram; the occurrence frequency of all grams is counted and filtered by a preset threshold to form a list of key grams, which is the vector feature space of the text. Each gram in the list is one dimension of the feature vector.
All the fragments obtained in the above step are converted into pictures, and the pictures are then recognized back into characters using different OCR tools. Any result that fails to restore the original fragment is added to the error set corresponding to that fragment, thereby obtaining an error field library.
Then, erroneous IUPAC training data are randomly constructed using the error sets corresponding to the fragment library, and a character-level sequence labeling model is trained on them to obtain the error detection model. When locating the error field in the systematic name of a chemical entity, the systematic name is input into the error detection model and matched against its error field library to locate the error field.
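The fragment segmentation and the random construction of erroneous training data described in the steps above can be sketched as follows. This is a simplified illustration under several assumptions: the sliding window runs over characters rather than bytes, the error sets are supplied by hand instead of being produced by an OCR round trip, and the names `char_ngrams`, `key_grams`, and `make_noisy_example` as well as the default `rate` are hypothetical.

```python
import random
from collections import Counter

def char_ngrams(text, max_n):
    """Slide windows of size 1..max_n over the text and count each fragment."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts

def key_grams(counts, threshold):
    """Keep only the fragments whose frequency reaches the preset threshold."""
    return {gram for gram, count in counts.items() if count >= threshold}

def make_noisy_example(name, error_sets, rng, rate=0.3):
    """Corrupt a correct name using the error sets.

    Returns (noisy_name, labels), where labels[i] is 1 if character i of
    noisy_name came from an injected error and 0 otherwise.
    """
    max_len = max(map(len, error_sets), default=0)
    noisy, labels = [], []
    i = 0
    while i < len(name):
        replaced = False
        # Try the longest fragment starting at position i first.
        for n in range(min(max_len, len(name) - i), 0, -1):
            fragment = name[i:i + n]
            if fragment in error_sets and rng.random() < rate:
                wrong = rng.choice(error_sets[fragment])
                noisy.append(wrong)
                labels.extend([1] * len(wrong))
                i += n
                replaced = True
                break
        if not replaced:
            noisy.append(name[i])
            labels.append(0)
            i += 1
    return "".join(noisy), labels

# Hand-written error set (assumption), modeled on the "-0xo0-" example.
error_sets = {"oxo": ["0xo0"]}
print(make_noisy_example("1-oxo-8", error_sets, random.Random(0), rate=1.0))
```

The per-character labels are the kind of target a character-level sequence labeling model could be trained on; the model itself is outside the scope of this sketch.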
In an embodiment, the step of obtaining the error correction model through deep learning training includes: training with the pre-collected correct IUPAC data, in which some IUPAC characters are randomly masked during training and then restored; then training a seq2seq model that maps incorrect IUPAC fragments to correct IUPAC fragments, thereby obtaining the error correction model. The error correction model is used to correct the error field in the identified IUPAC name into the correct field, obtaining the correct IUPAC name of the chemical entity.
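The random masking of IUPAC characters mentioned above can be sketched as a data preparation step. This is only an assumption about one possible form of that step: the function name `mask_characters`, the mask symbol `_`, and the 15% default masking rate are all hypothetical, and the model that learns to restore the masked characters is not shown.

```python
import random

def mask_characters(name, rng, mask_rate=0.15, mask_char="_"):
    """Randomly mask characters of a correct name.

    Returns (masked_name, targets), where targets maps each masked index
    to the original character the model should restore.
    """
    chars = list(name)
    targets = {}
    for i, ch in enumerate(chars):
        if rng.random() < mask_rate:
            targets[i] = ch
            chars[i] = mask_char
    return "".join(chars), targets

def restore(masked_name, targets):
    """Rebuild the original name from the masked name and its targets."""
    chars = list(masked_name)
    for i, ch in targets.items():
        chars[i] = ch
    return "".join(chars)

masked, targets = mask_characters("thiophen-2-carboxamide", random.Random(42))
print(masked, targets)
```

Each (masked name, targets) pair gives the training process an input with gaps and the characters that fill them.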
In one example, the conversion fails when the information processing system converts the identified IUPAC name of the chemical entity into the SMILES format. For example, from the patent document shown in fig. 2 (application number PCT/CN2015/071266, publication number WO2015110024A1), the information processing system obtains the following initial IUPAC name of the chemical entity:
"5-chloro-N-(((3S,3aS)-1-0xo0-8-(2-oxolidin-1(2H)-yl)-3,3a,4,5-tetrahydro1H-benzol[b]oxazo[3,4-d][1,4]oxazepin-3-yl)methyl)thiophen-2-carboxamide". When this initial IUPAC name cannot be converted into the SMILES format, error detection by the preset error correction model finds that the characters "-0xo0-" in the above name should be "-oxo-", and correcting the name yields the correct IUPAC name: "5-chloro-N-(((3S,3aS)-1-oxo-8-(2-oxopyridin-1(2H)-yl)-3,3a,4,5-tetrahydro-1H-benzo[b]oxazolo[3,4-d][1,4]oxazepin-3-yl)methyl)thiophen-2-carboxamide".
The information processing system then converts the corrected IUPAC name into the SMILES format, obtaining the following SMILES information: "ClC1=CC=C(S1)C(=O)NC[C@H]1OC(N2C3=C(OCC[C@H]21)C=C(C=C3)N3C(C=CC=C3)=O)".
In summary, in the information processing method for chemical entities of the present application, the acquired text document or picture document is detected, the line text content and/or table content of the chemical entities is determined and then identified to obtain the systematic names and corresponding numbers of the chemical entities, and the systematic names are then converted into chemical structure information in a preset data format, which is associated with the numbers and then stored and/or output.
The present application also provides an information processing system for a chemical entity, where in embodiments, the information processing system may be centrally located in a terminal computer or distributed between the terminal computer and a server (or server system). Examples of the terminal computer include a mobile phone, a tablet computer, a personal computer, and the like. The server (or server system) includes, for example, a single server, or a server cluster, etc. Referring to FIG. 11, a schematic diagram of an information processing system according to an embodiment of the present application is shown. The information processing system 20 of the chemical entity includes: a detection module 201, an identification module 202, and a conversion module 203.
The detection module 201 is configured to detect the line text content and/or the table content of the chemical entity in the acquired document to determine an object to be identified; the document comprises a text document and/or a picture document. In this embodiment, the detection module 201 is implemented in the same way as the embodiments described for step S10 above; the description of step S10 is incorporated here in full and is not repeated.
The identification module 202 is configured to identify chemical entities and numbered entities in the object to be identified to obtain the systematic names and corresponding numbers of the chemical entities. In this embodiment, the identification module 202 is implemented in the same way as the embodiments described for step S11 above; the description of step S11 is incorporated here in full and is not repeated.
The conversion module 203 is configured to convert the systematic name of the chemical entity into chemical structure information in a preset data format, and to store and/or output the chemical structure information after associating it with the number. In this embodiment, the conversion module 203 is implemented in the same way as the embodiments described for step S12 above; the description of step S12 is incorporated here in full and is not repeated.
Referring to FIG. 12, a schematic diagram of an information processing system according to another embodiment of the present application is shown. In this embodiment, the information processing system 20 further includes an error correction module 204 configured to perform error correction processing on the systematic name of the chemical entity by using a preset error correction model. In this embodiment, the error correction module 204 is implemented in the same way as the embodiments described for step S111 above; the description of step S111 is incorporated here in full and is not repeated.
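How the four modules might be composed can be sketched as follows. This is a structural illustration only, not the system of the present application: the class `ChemicalEntityPipeline` and its field names are hypothetical, each module is represented by a plain callable, and the toy functions in the usage part (line splitting, a one-entry lookup table, a single character substitution) are stand-ins for the real detection, identification, conversion, and error correction logic.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass
class ChemicalEntityPipeline:
    detect: Callable[[str], List[str]]                 # role of module 201
    identify: Callable[[str], List[Tuple[str, str]]]   # role of module 202: (number, name)
    convert: Callable[[str], Optional[str]]            # role of module 203: None on failure
    correct: Optional[Callable[[str], str]] = None     # role of module 204 (optional)

    def run(self, document: str) -> List[Tuple[str, Optional[str]]]:
        """Return (number, structure) rows for every entity found in the document."""
        rows = []
        for region in self.detect(document):
            for number, name in self.identify(region):
                structure = self.convert(name)
                # Error correction is applied only after a failed conversion.
                if structure is None and self.correct is not None:
                    structure = self.convert(self.correct(name))
                rows.append((number, structure))
        return rows

# Toy stand-ins (assumptions) so the sketch can be executed end to end.
pipeline = ChemicalEntityPipeline(
    detect=lambda doc: doc.splitlines(),
    identify=lambda region: [tuple(region.split(": ", 1))],
    convert={"ethanol": "CCO"}.get,
    correct=lambda name: name.replace("1", "l"),
)
result = pipeline.run("Compound-1: ethanol\nCompound-2: ethano1")
print(result)
```

Passing the modules in as callables keeps the error correction module optional, matching the two embodiments shown in FIG. 11 and FIG. 12.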
Referring to FIG. 13, which is a system block diagram of a computer system according to an embodiment of the present application, the present application provides a computer system 30 whose hardware includes at least: a display 303, at least one memory 301, and at least one processor 302. The at least one memory 301 is used for storing at least one program. The at least one processor 302 is connected to the at least one memory 301 and is configured to invoke and execute the at least one program so as to implement the information processing method of the chemical entity described above for any of the embodiments of fig. 1-10.
In some examples, the at least one memory may also include memory remote from the one or more processors, such as network-attached storage accessed via RF circuitry or external ports and a communication network, which may be the Internet, one or more intranets, a local area network, a wide area network, a storage area network, or a suitable combination thereof. A memory controller may control access to the memory by other components of the device, such as the CPU and peripheral interfaces. The memory optionally includes volatile memory, such as high-speed random access memory, and optionally also non-volatile memory, such as read-only memory, one or more magnetic disk storage devices, flash memory devices, hard disks, solid state disks, or other non-volatile solid-state memory devices.
The at least one processor includes an integrated circuit chip having signal processing capabilities, or includes a general purpose processor, which may be, for example, a digital signal processor, an application specific integrated circuit, discrete gate or transistor logic, a discrete hardware component, or a processor integrated with at least one processor core. The at least one processor and the at least one memory are in data communication via one or more communication buses or signal lines, so that the processor can invoke and execute the at least one program to perform the information processing method of the chemical entity.
The computer system further includes hardware devices comprising at least one of a man-machine interaction device, a display device, a network interface device, and the like. The man-machine interaction device is operated by a user (for example, a person skilled in the art), and the electrical signals generated by the operation are transmitted to the at least one processor, which invokes the corresponding program to process the values, instructions, or information represented by the received signals. Examples of the man-machine interaction device include at least one of the following: a mouse, a keyboard, a touch screen, etc.
The display device is used for displaying the visual content output after processing by the at least one processor. The visual content includes, for example, the line text content and table content of the identified chemical entity, the systematic name of the chemical entity, the corresponding number, recognition error prompts, or the converted chemical structure information in the preset format. The display device is, for example, a display.
The network interface device provides data communication between a plurality of network nodes that include the computer system. The network interface device includes, for example, a wired network interface such as a fiber optic interface, or a wireless network interface such as a Wi-Fi interface.
The application also discloses a computer readable storage medium comprising a stored computer program, wherein the computer program, when being run by a processor of a computer, controls the computer to execute and implement the method for processing information of chemical entities as described above for any of the embodiments of fig. 1-10.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application.
In the embodiments provided herein, the computer-readable storage medium may include read-only memory, random-access memory, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, flash memory, USB flash drive, removable hard disk, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer. In addition, any connection is properly termed a computer-readable medium. For example, if the instructions are transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. It should be understood, however, that computer-readable and data storage media do not include connections, carrier waves, signals, or other transitory media, but are intended to be directed to non-transitory, tangible storage media. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers.
In one or more exemplary aspects, the functions described by the image recognition method of the molecular structure diagram, the data entry method of the database, or the computer program of the retrieval method described in the present application may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. The steps of a method or algorithm disclosed herein may be embodied in a processor-executable software module, which may be located on a tangible, non-transitory computer-readable and writable storage medium. Tangible, non-transitory computer readable and writable storage media may be any available media that can be accessed by a computer.
The present application also provides a computer program product comprising a computer program which, when run by a processor, executes and implements the method described above; for the embodiments, refer to the description of the information processing method of the chemical entity in any of the embodiments of fig. 1 to 10.
The flowcharts and block diagrams in the figures described above illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In summary, according to the information processing method, the information processing system, the computer system, and the computer readable storage medium of the chemical entity of the present application, the line text content and/or the table content of the chemical entity in the obtained document is detected to determine the object to be identified, the object is identified to obtain the systematic name and the corresponding number of the chemical entity, and the systematic name is then converted into chemical structure information in a preset data format, which is associated with the number and then stored and/or output. This significantly improves the extraction rate and accuracy of the final valid results without producing excessive redundant information. Because the number of each chemical entity in the document is extracted at the same time, accurate, comprehensive, and standardized structured data can be obtained after collation, which facilitates work such as drug discovery and research and development.
The foregoing embodiments are merely illustrative of the principles of the present application and their effectiveness, and are not intended to limit the application. Modifications and variations may be made to the above-described embodiments by those of ordinary skill in the art without departing from the spirit and scope of the present application. Accordingly, it is intended that all equivalent modifications and variations which may be accomplished by persons skilled in the art without departing from the spirit and technical spirit of the disclosure be covered by the claims of this application.

Claims (22)

1. A method for processing information of a chemical entity, comprising the steps of:
detecting the line text content and/or the table content of the chemical entity in the acquired document to determine an object to be identified; the document comprises a text document or/and a picture document;
identifying chemical entities and numbered entities in the object to be identified to obtain system naming and corresponding numbers of the chemical entities;
and converting the system name of the chemical entity into chemical structure information in a preset data format, and storing and/or outputting the chemical structure information after being associated with the number.
2. The method of claim 1, further comprising obtaining the document by receiving a local upload or by crawling it from a network using a crawler tool, wherein the format of the document comprises HTML format, XML format, TXT format, Word format, or PDF format.
3. The method of claim 2, wherein the document is a pharmaceutical product description document, a pharmaceutical paper document, or a pharmaceutical patent document, a clinical trial document, an audit document, or a clinical study document.
4. The information processing method of chemical entity according to claim 1, wherein the step of detecting the content of the line text and/or the table of the chemical entity in the acquired document comprises: the method comprises the steps of locating the line text content in the document through text features to determine an object to be identified, and locating the table content in the document through table features to determine the object to be identified.
5. The method according to claim 4, wherein in the step of identifying the chemical entity and the numbering entity in the object to be identified, when the obtained document is detected as a text document, a preset line identification model is called to identify the chemical entity and the numbering entity included in the text information of the line content, so as to obtain the system name and the number of the chemical entity, respectively, and the system name and the corresponding number of each chemical entity are determined according to the positional relationship between the system name and the number of the chemical entity in the text information.
6. The method of claim 5, wherein the line identification model comprises a trained deep learning model and/or a regular expression model.
7. The method according to claim 4, wherein in the step of identifying the chemical entity and the numbering entity in the object to be identified, when the obtained document is detected as a text document, a preset table identification model is called to identify the chemical entity and the numbering entity in the text information of the table content, so as to obtain the system name and the number of the chemical entity, respectively, and the system name and the corresponding number of each chemical entity are determined according to the table attribute of the table content.
8. The method of claim 7, wherein the table identification model comprises a regular expression model.
9. The method for processing information of chemical entities according to claim 4, wherein the step of detecting the line text content and the table content of the chemical entities in the obtained document further comprises dividing the obtained document into a plurality of pictures according to a preset division rule, and storing each picture as a picture to be identified after performing image enhancement processing.
10. The method for processing information of chemical entities according to claim 9, wherein when the acquired document is a picture document, the method further comprises dividing the picture document into a plurality of pictures, with the page of the document as the unit of division.
11. The method for processing information of chemical entities according to claim 9, wherein the step of identifying the line text content in the picture to be identified comprises:
extracting the line text content in the picture to be identified by utilizing OCR to obtain text information;
invoking a preset line text recognition model to recognize chemical entities and numbered entities contained in text information of the line text content so as to respectively obtain system names and numbers of the chemical entities, and determining the system names and corresponding numbers of the chemical entities according to the position relation of the system names and numbers of the chemical entities in the text information.
12. The method for processing information of chemical entities according to claim 11, characterized in that the step of identifying the table content in the picture to be identified comprises:
inputting the pictures into a preset form detection model to detect form pictures, and then cutting and storing the form pictures as form pictures to be identified;
detecting table contents in the table picture to be identified to locate chemical entity contents and numbered entity contents therein;
extracting the position relation of the chemical entity content and the numbering entity content in the form picture to be identified, and extracting the chemical entity content and the numbering entity content by utilizing OCR to obtain text information;
invoking a preset line text recognition model to recognize chemical entities and numbered entities contained in the text information so as to respectively obtain the system names and the numbers of the chemical entities, and determining the system names and the corresponding numbers of the chemical entities according to the position relation.
13. The method for processing information of chemical entities according to claim 1, further comprising the step of performing error correction processing on systematic naming of said chemical entities using a preset error correction model.
14. The information processing method of a chemical entity according to claim 13, wherein the step of performing error correction processing on the systematic naming of the chemical entity comprises:
when failure information is obtained indicating that the system naming of the chemical entity could not be converted into chemical structure information in a preset data format, locating an error field in the system naming of the chemical entity;
and correcting the error field according to the error correction model to obtain the correct systematic naming of the chemical entity.
15. The method for processing information of chemical entity according to claim 14, wherein the error correction model includes a preset field library, a plurality of error fields and correct fields corresponding to the error fields are stored in the field library, and the step of performing error correction processing on the system naming of the chemical entity includes:
inputting the system name of the chemical entity into a field library for matching so as to locate an error field in the field library;
correcting according to the correct field corresponding to the error field to obtain the correct systematic naming of the chemical entity.
16. The method of claim 14, wherein the error correction model comprises a deep learning trained error correction model.
17. The method for processing information of chemical entities according to claim 1, characterized in that the system name of the chemical entity is a compound name named according to IUPAC naming convention.
18. The method for processing information of chemical entities according to claim 1, wherein the chemical structure information in the preset data format is chemical structure information in SMILES format, InChI format, MDL Molfile format, or SDF format.
19. An information processing system for a chemical entity, comprising:
a detection module, configured to detect line text content and/or table content containing chemical entities in an acquired document to determine an object to be identified, wherein the document comprises a text document and/or a picture document;
an identification module, configured to identify the chemical entities and the numbering entities in the object to be identified, so as to obtain the systematic names of the chemical entities and their corresponding numbers;
and a conversion module, configured to convert the systematic name of the chemical entity into chemical structure information in a preset data format, and to store and/or output the chemical structure information in association with the number.
20. The information processing system for a chemical entity according to claim 19, further comprising an error correction module configured to perform error correction processing on the systematic name of the chemical entity using a preset error correction model.
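The module arrangement of claims 19 and 20 can be pictured with a small sketch, offered purely as an illustration under stated assumptions. The line pattern, the two lookup tables, and all function names are hypothetical; a real system would use trained detection and recognition models and a genuine name-to-structure converter rather than dictionaries and naive string replacement.

```python
import re
from typing import Optional

# Hypothetical pattern for lines such as "Compound 1: ethanol".
LINE_PATTERN = re.compile(r"^\s*Compound\s+(\w+)\s*:\s*(.+?)\s*$")

# Hypothetical stand-ins for a converter and an error correction model.
TOY_NAME_TO_SMILES = {"ethanol": "CCO", "propan-2-ol": "CC(C)O"}
TOY_FIELD_LIBRARY = {"o1": "ol"}


def detect(lines):
    """Detection module: keep lines that appear to contain a numbered compound."""
    return [line for line in lines if LINE_PATTERN.match(line)]


def identify(candidates):
    """Identification module: extract (number, systematic name) pairs."""
    return [LINE_PATTERN.match(line).groups() for line in candidates]


def correct(name):
    """Error correction module: naive replacement of known error fields."""
    for wrong, right in TOY_FIELD_LIBRARY.items():
        name = name.replace(wrong, right)
    return name


def convert(name) -> Optional[str]:
    """Conversion module: return SMILES, or None when conversion fails."""
    return TOY_NAME_TO_SMILES.get(name)


def process(lines):
    """Run all modules and store each structure under its compound number."""
    results = {}
    for number, name in identify(detect(lines)):
        smiles = convert(name)
        if smiles is None:
            # Retry once after error correction, as in claim 20.
            smiles = convert(correct(name))
        results[number] = smiles
    return results
```

Calling `process` on a list of document lines yields a mapping from compound number to SMILES string, which corresponds to storing the structure information in association with its number.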
21. A computer system, comprising:
at least one memory for storing at least one program;
and at least one processor, coupled to the at least one memory, configured to implement the information processing method of a chemical entity according to any one of claims 1-18 when the at least one program is called from the at least one memory and executed.
22. A computer-readable storage medium, comprising a stored computer program, wherein the computer program, when run by a processor of a computer, controls the computer to perform the information processing method of a chemical entity according to any one of claims 1-18.
CN202111455595.6A 2021-11-26 2021-12-01 Information processing method and system for chemical entity, computer system and storage medium Pending CN116187327A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111420642 2021-11-26
CN2021114206423 2021-11-26

Publications (1)

Publication Number Publication Date
CN116187327A true CN116187327A (en) 2023-05-30

Family

ID=86446699

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111455595.6A Pending CN116187327A (en) 2021-11-26 2021-12-01 Information processing method and system for chemical entity, computer system and storage medium

Country Status (1)

Country Link
CN (1) CN116187327A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20231009

Address after: 102425 Building 60, No. 69 Yanfu Road, Fangshan District, Beijing

Applicant after: Beijing Qianyan Intelligent Biotechnology Co.,Ltd.

Address before: 210033 Room 317-321, Floor 3, F7, No. 9, Weidi Road, Xianlin University City, Xianlin Street, Qixia District, Nanjing, Jiangsu Province

Applicant before: Nanjing Suikun Intelligent Technology Co.,Ltd.