CN116187326A

CN116187326A - Information processing method and system for chemical entity, computer system and storage medium

Info

Publication number: CN116187326A
Application number: CN202111453809.6A
Authority: CN
Inventors: 张声德
Original assignee: Nanjing Suikun Intelligent Technology Co ltd
Current assignee: Beijing Qianyan Intelligent Biotechnology Co ltd
Priority date: 2021-11-26
Filing date: 2021-12-01
Publication date: 2023-05-30

Abstract

The application provides an information processing method of a chemical entity, an information processing system of the chemical entity, a computer system and a computer readable storage medium, wherein the information processing method is implemented by acquiring a document containing the chemical entity; detecting the content of the line text, the first type table and the second type table in the document to determine an object to be identified; identifying chemical entities and numbered entities in the object to be identified to obtain structural information and experimental information of the chemical entities and numbers corresponding to the structural information and the experimental information; and finally, converting the structural information of the chemical entity into chemical structural information in a preset data format, and storing and/or outputting the chemical structural information and experimental information thereof after the chemical structural information and the experimental information are associated by the serial numbers, so that accurate and comprehensive standardized structured data are obtained, and the work of drug discovery, research and development and the like is facilitated.

Description

Information processing method and system for chemical entity, computer system and storage medium

Technical Field

The present application relates to the field of chemical technology, and in particular, to an information processing method of a chemical entity, an information processing system of a chemical entity, a computer system, and a computer readable storage medium.

Background

In drug development, it is necessary to follow the latest progress on a given target in the industry, for example the most recently published patent literature or journal papers, so that molecular design and optimization can be carried out in a way that protects development efficiency and investment. To interpret and analyze patent literature effectively, the molecular structures and their corresponding experimental data need to be extracted and arranged into structured information. Manual extraction can produce high-quality structured data from a patent, but the cost and effort required are significant. In practice this is a severe limitation, because only a few commercial data providers, such as Elsevier Reaxys, have sufficient resources to support the task.

Currently, existing tools for patent text mining include OSCAR4, ChemicalTagger, ChemSpot, OCMiner, ChemDataExtractor, etc. The process by which these tools extract information about compounds from a patent can be divided into two main parts. The first is chemical named entity recognition (Named Entity Recognition, NER for short): identifying which spans of the text belong to chemical entities, extracting all such entities, and obtaining the chemical structures corresponding to them with a conversion tool. Some text mining tools include only this NER part. The second is relation extraction between named entities: identifying, by statistical or rule-based methods, the relations among the recognized chemical entities or between chemical entities and other types of entities, such as coreference resolution across multiple chemical entities, associations between chemical entities and diseases, proteins and genes, or the physicochemical properties and annotated biological experiment data corresponding to the chemical entities. In the prior art, the NER methods used by text mining tools fall into three categories: dictionary-based methods, grammar-based methods, and context-based methods.

Dictionary-based methods find named entities by comparing the text against a dictionary or catalogue of known names, so designing a high-quality, comprehensive dictionary is critical. Their biggest limitation is the limited coverage of the dictionary, and efficiency drops exponentially once the dictionary grows beyond a certain size. This approach is therefore mostly used to identify chemical entities that appear in non-systematic naming forms.
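
As an illustration of the dictionary-based approach described above, the following is a minimal sketch only, not the implementation of any of the tools named earlier. The function name `dictionary_ner` and the two-entry dictionary are hypothetical; the sketch simply scans the text for known names, trying longer names first.

```python
import re

def dictionary_ner(text, dictionary):
    """Return (start, end, matched_text) for each dictionary name found in text."""
    if not dictionary:
        return []
    # Longest names first: regex alternation takes the first alternative that matches.
    names = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(name) for name in names) + r")\b",
        re.IGNORECASE,
    )
    return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(text)]

# Hypothetical dictionary of non-systematic names.
known_names = {"aspirin", "acetaminophen"}
hits = dictionary_ner("Aspirin and acetaminophen were tested.", known_names)
print(hits)
```

Any name missing from the dictionary is simply not found, which is the coverage limitation noted above.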

For systematic (IUPAC) names, however, it is impossible to build a dictionary that exhausts all cases, so grammar-based methods are typically used. Systematic nomenclature is essentially a grammar over a finite set of terminal symbols, which generally correspond to chemical name fragments (e.g., "methyl"). These fragments appear far more often in chemical entities than in plain text and are highly characteristic, so a dictionary of basic name fragments can be generalized from a large number of chemical names; the text can then be segmented and part-of-speech tagged with this dictionary, and complete named entities identified with the help of grammatical rules. Alternatively, instead of building a fragment dictionary, a sliding window of n-grams can be used (an n-gram is a sequence of n consecutive characters in a string; for example, "methyl" has three 4-grams: "meth", "ethy", "thyl"). The conditional frequencies of n-character sequences in chemical-name and non-chemical-name text, or the character-to-character transition probabilities (a Markov model), are counted in advance. During text mining, the text is split into n-character fragments, a probability is predicted for each fragment, the most probable tagging combination is obtained, and the complete named entity is then identified.
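
The n-gram idea can be sketched as follows. This is a simplified illustration under stated assumptions: the helper names and the two tiny word lists are hypothetical, and the score is a plain add-one-smoothed log-odds sum over character n-grams rather than the Markov model or tagging step mentioned in the text.

```python
import math
from collections import Counter

def char_ngrams(text, n):
    """All substrings of n consecutive characters, taken with a sliding window."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def ngram_counts(words, n):
    """Count character n-grams over a list of words."""
    counts = Counter()
    for word in words:
        counts.update(char_ngrams(word.lower(), n))
    return counts

def chemical_log_odds(token, chem_counts, plain_counts, n=4):
    """Positive when the token's n-grams are more typical of chemical names."""
    vocab = len(set(chem_counts) | set(plain_counts))
    chem_total = sum(chem_counts.values()) + vocab    # add-one smoothing
    plain_total = sum(plain_counts.values()) + vocab
    score = 0.0
    for gram in char_ngrams(token.lower(), n):
        p_chem = (chem_counts[gram] + 1) / chem_total
        p_plain = (plain_counts[gram] + 1) / plain_total
        score += math.log(p_chem / p_plain)
    return score

# Hypothetical toy training data, far too small for real use.
chem = ngram_counts(["methyl", "ethyl", "methoxy", "phenyl"], 4)
plain = ngram_counts(["table", "method", "document", "number"], 4)
print(char_ngrams("methyl", 4))
print(chemical_log_odds("ethyl", chem, plain) > 0)
print(chemical_log_odds("table", chem, plain) < 0)
```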

In grammar-based methods, named entities are split into fragments and their boundaries are determined by additional rules or programs. NER based on context information has no such limitation: it trains a machine learning model on pre-annotated text data. A common annotation scheme labels each token in the text as one of three types, (B)eginning, (I)nner and (O)utside, where B marks the first token of a chemical entity, I marks any other token inside the chemical entity, and O marks all tokens that do not belong to a chemical entity. With this scheme the boundaries of chemical entities are easily determined, and machine learning models (e.g., conditional random fields, CRFs) can learn latent patterns from data labeled in this way, predict the label of each token in new text, and thereby identify the correct chemical entities.
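
To make the B/I/O scheme concrete, the sketch below decodes a tag sequence back into entity strings. It is an illustration only: the function name, the tokenization and the space-joined output are assumptions, and the tags would in practice come from a trained model such as a CRF rather than being written by hand.

```python
def bio_to_entities(tokens, tags):
    """Group tokens tagged B/I into entities; O tokens close any open entity."""
    entities, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            # A new entity starts; flush the one in progress, if any.
            if current:
                entities.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:
            # "O", or a stray "I" with no preceding "B" (treated as outside).
            if current:
                entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities

# Hypothetical tokenization and hand-written tags.
tokens = ["Compound", "1", "is", "2-chlorobenzoic", "acid", "."]
tags = ["O", "O", "O", "B", "I", "O"]
print(bio_to_entities(tokens, tags))
```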

The input of existing tools is mostly HTML, XML, Word or text PDF, that is, documents whose characters can be selected; picture PDF cannot be input directly. In the Fast Follow new drug development scenario, however, the required documents, such as patent documents, are generally available only as picture PDF. Even where a patent database has applied OCR to the original picture PDF, the resulting text suffers information loss and errors to varying degrees because of the limitations of OCR technology. Tables in particular are very likely to lose their typesetting and structure completely after OCR, so if extraction relies only on the full-text OCR output, the results of existing tools are inevitably affected by OCR recognition errors. Most tools have no module for correcting OCR results, and those that do offer only simple error correction, so the identified and extracted chemical entity information contains errors and cannot serve as useful data for drug development. More importantly, the information these prior-art methods extract from the literature is incomplete or unintegrated, and therefore cannot meet the demand for usable data in drug development.

Disclosure of Invention

In view of the above-described drawbacks of the related art, an object of the present application is to provide an information processing method of a chemical entity, an information processing system of a chemical entity, a computer system, and a computer-readable storage medium, which solve the prior-art problem that extracting only part of the information from a document such as a patent cannot meet the demand for usable data in drug development.

To achieve the above and other related objects, a first aspect of the present application discloses an information processing method of a chemical entity, comprising the steps of: acquiring a document containing a chemical entity, the document comprising a text document and/or a picture document; detecting the line text content in the document, the first type of table content containing the structural information of the chemical entity, and the second type of table content containing the experimental information of the chemical entity, to determine an object to be identified, the structural information of the chemical entity comprising a molecular structure diagram and/or a systematic name; identifying chemical entities and number entities in the object to be identified to obtain the structural information and experimental information of the chemical entities and the numbers corresponding to the structural information and the experimental information; and converting the structural information of the chemical entity into chemical structure information in a preset data format, and storing and/or outputting the chemical structure information and its experimental information after they are associated through the numbers.

A second aspect of the present application discloses an information processing system for a chemical entity, including: an acquisition module for acquiring a document containing the chemical entity, the document comprising a text document and/or a picture document; a detection module for detecting the line text content in the document, the first type of table content containing the structural information of the chemical entity, and the second type of table content containing the experimental information of the chemical entity, so as to determine an object to be identified, the structural information of the chemical entity comprising a molecular structure diagram and/or a systematic name; an identification module for identifying the chemical entities and number entities in the object to be identified, so as to obtain the structural information and experimental information of the chemical entities and the numbers corresponding to the structural information and the experimental information; and a conversion module for converting the structural information of the chemical entity into chemical structure information in a preset data format.

A third aspect of the present application is to disclose a computer system comprising: at least one memory for storing at least one program; and the at least one processor is connected with the at least one memory and is used for realizing the information processing method of the chemical entity according to the first aspect when the at least one program is called and executed from the at least one memory.

A fourth aspect of the present application is to disclose a computer readable storage medium comprising a stored computer program, wherein the computer program, when run by a processor of a computer, is controlled to execute and implement the method for information processing of chemical entities as described in the first aspect above.

In summary, the information processing method of a chemical entity, the information processing system of a chemical entity, the computer system and the computer-readable storage medium of the present application detect the acquired text document and picture document to determine the line text content and/or table content containing chemical entities, and identify that content to obtain the systematic name, the experimental information and the corresponding number of each chemical entity. The systematic name of the chemical entity is converted into chemical structure information in a preset data format, and the chemical structure information and the experimental information are associated through the number and then stored and/or output. Because the number information, chemical structure information and experimental information are extracted together for each chemical entity in the document, accurate, comprehensive and standardized structured data are obtained after arrangement, which facilitates work such as drug discovery, research and development.

Drawings

The specific features of the invention related to this application are set forth in the appended claims. The features and advantages of the invention that are related to the present application will be better understood by reference to the exemplary embodiments and the drawings that are described in detail below. The brief description of the drawings is as follows:

FIG. 1 is a flow chart of an information processing method of a chemical entity of the present application in one embodiment.

FIG. 2 shows a schematic diagram of the context of a chemical entity in a text document according to an embodiment of the present application.

FIG. 3 is a flow chart of identifying text information of line text content in one embodiment of the present application.

FIG. 4 is a flow chart of identifying text information of line text content in another embodiment of the present application.

FIG. 5 is a diagram showing an example of identifying chemical entities and numbered entities in text messages in one embodiment of the present application.

FIG. 6 is a schematic diagram of a first type of tabular content of chemical entities in a text document according to one embodiment of the present application.

FIG. 7 is a diagram of the contents of a second type of table of chemical entities in a text document according to one embodiment of the present application.

FIG. 8 is a flow chart of identifying text information of table content in another embodiment of the present application.

FIG. 9 is a diagram showing examples of chemical entities and numbers in the first and second types of table contents of pictures to be identified in one embodiment of the present application.

FIG. 10 is a flow chart of identifying table content in a picture to be identified in one embodiment of the present application.

FIG. 11 is a diagram showing examples of chemical entities and numbers in the second type of table content of a picture to be identified in another embodiment of the present application.

FIG. 12 is a schematic diagram showing the integration of chemical structure information, experimental information and numbers of chemical entities into a table in one embodiment of the present application.

FIG. 13 is a schematic diagram showing the integration of chemical structure information, experimental information and numbers thereof into a table in another embodiment of the present application.

FIG. 14 is a schematic diagram of an information processing system according to an embodiment of the present application.

FIG. 15 is a system block diagram of a computer system of the present application in one embodiment.

Detailed Description

Further advantages and effects of the present application will be readily apparent to those skilled in the art from the present disclosure, by describing the embodiments of the present application with specific examples.

In the following description, reference is made to the accompanying drawings, which describe several embodiments of the present application. It is to be understood that other embodiments may be utilized and that mechanical, structural, electrical, and operational changes may be made without departing from the spirit and scope of the present disclosure. The following detailed description is not to be taken in a limiting sense, and the scope of embodiments of the present application is defined only by the claims of the issued patent. The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.

Although the terms first, second, etc. may be used herein to describe various elements, information or parameters in some examples, these elements or parameters should not be limited by these terms. These terms are only used to distinguish one element or parameter from another. For example, a first type of table content may be referred to as a second type of table content, and similarly, a second type of table content may be referred to as a first type of table content, without departing from the scope of the various described embodiments. The first type of table content and the second type of table content both describe table content, but they are not the same table content unless the context clearly indicates otherwise.

Furthermore, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context indicates otherwise. It will be further understood that the terms "comprises," "comprising," "includes," and/or "including" specify the presence of stated features, steps, operations, elements, components, items, categories, and/or groups, but do not preclude the presence, occurrence or addition of one or more other features, steps, operations, elements, components, items, categories, and/or groups. The terms "or" and "and/or" as used herein are to be construed as inclusive, meaning any one or any combination. Thus, "A, B or C" or "A, B and/or C" means "any of the following: A; B; C; A and B; A and C; B and C; A, B and C". An exception to this definition will occur only when a combination of elements, functions, steps or operations is in some way inherently mutually exclusive.

Innovative drugs are developed in two modes: the "First in Class" mode and the "Fast Follow" mode. "Fast Follow" is imitative innovation: on the basis of an existing target and mechanism, and without infringing the patents of others, the molecular structure of an existing new drug is transformed or modified to obtain a new drug with the same or a similar mechanism of action and a new therapeutic effect. To date, most innovative pharmaceutical companies in the industry operate in the "Fast Follow" mode. "Fast Follow" requires tracking the latest research and development progress of peer companies or institutions on a given target, particularly their published patents, in order to perform molecular design and optimization. To read and analyze a patent effectively, the molecular structures and their corresponding experimental data need to be extracted and arranged into structured information.

In the Fast Follow new drug development scenario, the most important chemical entities in the patent literature to be processed are the example compounds and their corresponding biochemical experimental data, and in the examples of a patent these compounds are generally represented as two-dimensional chemical structures or IUPAC names. Existing tools need to recognize many kinds of chemical entities; improving their generality in this way inevitably reduces their ability to handle specific situations, which shows in two respects: (1) the recognition results contain many useless redundant items, which hinders later use; and (2) the recognition of the chemical entities that truly matter (such as the IUPAC names or structural formulas of the examples) is incomplete or erroneous.

Most existing tools extract only the chemical entities in the text and lose the number reference information. The number reference is the bridge connecting a chemical entity with its biochemical experimental data; without it, complete data comprising the structures of the chemical entities and their corresponding experimental records are difficult to extract automatically from a patent.
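
The role of the number as a bridge can be illustrated with the minimal sketch below. It is not the method claimed in this application; the function name, the record layout, the SMILES strings and the assay values are all placeholders invented for illustration and do not come from any patent.

```python
def associate_by_number(structures, experiments):
    """Join structure records and experiment records on the compound number.

    structures:  dict mapping number -> structure string (assumed format)
    experiments: list of (number, assay_name, value) tuples (assumed format)
    Returns (records, unmatched), where unmatched holds experiment rows
    whose number has no known structure.
    """
    records = {
        number: {"number": number, "structure": structure, "experiments": []}
        for number, structure in structures.items()
    }
    unmatched = []
    for number, assay, value in experiments:
        if number in records:
            records[number]["experiments"].append({"assay": assay, "value": value})
        else:
            unmatched.append((number, assay, value))
    return list(records.values()), unmatched

# Placeholder data only.
structures = {"1": "CCO", "2": "c1ccccc1"}
experiments = [("1", "IC50 (nM)", 12.0), ("2", "IC50 (nM)", 340.0), ("7", "IC50 (nM)", 5.0)]
records, unmatched = associate_by_number(structures, experiments)
print(records)
print(unmatched)
```

The `unmatched` list makes the failure mode visible: an experimental value whose number was never tied to a structure cannot be used.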

In view of this, the present application provides an information processing method of chemical entities for accurately extracting chemical entity information, such as IUPAC names, from documents such as patents and papers, as useful structured data for drug discovery or development work. It will be appreciated that in practice a chemical entity may appear in the literature in a number of different forms, including for example the systematic name of the compound (under the nomenclature of the International Union of Pure and Applied Chemistry, abbreviated IUPAC), the molecular formula of the compound (e.g. C2H6), the registration ID of the compound in a large database (e.g. CAS number), technical terms (e.g. aspirin), the drug common name (e.g. acetaminophen), the drug trade name (e.g. blackish), etc. For ease of understanding and for the sake of brevity, the following examples are described using compound names that follow the IUPAC naming convention, but the application is not limited thereto.

In an embodiment, the information processing method of the chemical entity is mainly performed by an information processing system of the chemical entity or an information processing device of the chemical entity, wherein the information processing system is software and hardware configured in a computer system. The computer system is an electronic device capable of performing data calculation and logic processing, for example at least one or a combination of a personal terminal device, a server, or a cloud-architecture-based server system. Taking the case where the computer system is a personal terminal device as an example, the information processing system retrieves a program stored in the personal terminal device according to user operation, and under the control of the instructions in the program the hardware devices in the personal terminal device cooperate to accurately identify one or more pieces of chemical structure information in the chemical entity information. Taking the case where the computer system is a server system as an example, the information processing system, according to the calculation task configured on at least one server, executes cooperatively under the control of the instructions in the running program, so as to accurately identify and extract one or more pieces of chemical structure information in the chemical entity information. Through text recognition and image recognition, the above examples systematically address the bottlenecks in several key steps of recognizing and extracting chemical entity information, and all the chemical structure information obtained in this way can be restored to chemical entity information, for example the molecular structural formula or IUPAC name of the chemical entity.

In an embodiment, the hardware of the computer system includes at least: at least one memory, and at least one processor. The at least one memory is used for storing at least one program. In some examples, the at least one memory may also include memory remote from the one or more processors, such as network-attached memory accessed via RF circuitry or external ports and a communication network, which may be the internet, one or more intranets, a Local Area Network (LAN), a Wide Area Network (WAN), a Storage Area Network (SAN), etc., or suitable combinations thereof. The memory optionally includes high-speed random access memory, and optionally also non-volatile memory, such as one or more magnetic disk storage devices, flash memory devices, or other non-volatile solid-state memory devices. Access to the memory by other components of the device, such as the CPU and peripheral interfaces, is optionally controlled by a memory controller. The memory may include volatile memory, such as Random Access Memory (RAM); the memory may also include non-volatile memory, such as Read-Only Memory (ROM), flash memory, a Hard Disk Drive (HDD), or a Solid State Drive (SSD).

The at least one processor includes an integrated circuit chip having signal processing capabilities; or comprises a general purpose processor, which may be, for example, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a discrete gate or transistor logic, a discrete hardware component, a processor (CPU) integrating at least one processor core, or the like. The at least one processor and the at least one memory may be in data communication via one or more communication buses or signal lines, for invoking and executing the at least one program to perform the information processing method.

The hardware of the computer system further includes at least one of a man-machine interaction device, a display device, a network interface device, etc. The man-machine interaction device is operated by a user (such as a technician) so that the electric signals generated by the operation are transmitted to the at least one processor, which calls the corresponding program to process the values, instructions or information represented by the received electric signals. Examples of the man-machine interaction device include at least one of the following: a mouse, a keyboard, a touch screen, etc.

The display device is used for displaying the visual content output after processing by the at least one processor. The visual content includes, for example, the text and table content of the identified chemical entity, the systematic name of the chemical entity, the corresponding number, recognition error prompts, or the converted chemical structure information in a preset format. The display device is, for example, a display or the like.

The network interface device is used for providing data communication between a plurality of network nodes that include the computer system. The network interface device includes, for example, a wired network interface such as a fiber optic interface, or a wireless network interface such as a Wi-Fi interface.

The program in the information processing system includes a plurality of software modules, which are executed sequentially or synchronously, driven by data, instructions, and the like. Each software module and its execution will be described in detail later.

Referring to fig. 1, a flowchart of an information processing method of a chemical entity according to an embodiment of the present application is shown, and the information processing method of the chemical entity includes the following steps:

Step S10, acquiring a document containing chemical entities; the document comprises a text document and/or a picture document. In an embodiment, an information processing system of a computer system obtains the document containing a chemical entity. In an embodiment, the document is a pharmaceutical product description document, a pharmaceutical paper document, a pharmaceutical patent document, a clinical trial document, an examination document, or a clinical study document, etc. from different data sources. The document format includes HTML format, XML format, TXT format, Word format, PDF format, or the like.

It should be understood that the text document refers to a document from which a computer system can directly read the text or character information in its lines, for example a document in HTML format, XML format, TXT format, Word format, or editable PDF format, etc. The picture document refers to a file in an image format. Many formats exist for pictures/images stored by a computer; common ones include bmp format, jpg format and non-editable pdf format, as well as png format, tif format, gif format, pcx format, tga format, exif format, fpx format, svg format, psd format, cdr format, pcd format, dxf format, ufo format, eps format, ai format, raw format, WMF format, webp format, avif format, apng format, and the like.
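
The distinction between text documents and picture documents can be sketched as a routing step. This is an assumption-laden illustration, not part of the described system: the function name, the extension sets and the `pdf_has_text_layer` flag are hypothetical, and only a few of the formats listed above are covered. A file extension cannot tell an editable PDF from a non-editable one, so the sketch leaves that decision to the caller.

```python
from pathlib import Path

# Hypothetical, deliberately incomplete extension sets.
TEXT_EXTENSIONS = {".html", ".htm", ".xml", ".txt", ".doc", ".docx"}
PICTURE_EXTENSIONS = {".bmp", ".jpg", ".jpeg", ".png", ".tif", ".tiff", ".gif", ".webp"}

def document_kind(filename, pdf_has_text_layer=None):
    """Return "text", "picture" or "unknown" for a file name.

    For PDF files the caller must say whether selectable text is present,
    because the extension alone does not reveal it.
    """
    extension = Path(filename).suffix.lower()
    if extension in TEXT_EXTENSIONS:
        return "text"
    if extension in PICTURE_EXTENSIONS:
        return "picture"
    if extension == ".pdf":
        if pdf_has_text_layer is None:
            return "unknown"
        return "text" if pdf_has_text_layer else "picture"
    return "unknown"

print(document_kind("patent.xml"))
print(document_kind("scan.TIF"))
print(document_kind("patent.pdf", pdf_has_text_layer=False))
```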

In an embodiment, the data source comprises a paper document or an electronic document, such as a pharmaceutical product description document, a pharmaceutical paper document, a pharmaceutical patent document, a clinical trial document, an examination document (e.g., a pharmaceutical examination or approval document), a clinical research document, or a molecular database. Examples include a pharmaceutical product description document, a pharmaceutical paper document or a pharmaceutical patent document in a format such as Word, pdf, jpg or TXT, a web interface (e.g., an html format file), or an electronic file from a molecular database.

Taking a patent document as an example, the patent document may contain text data such as the claims and the description, and graphic data such as the drawings of the description and the abstract drawing. The patent document may be a Chinese patent document, for example a PDF-format document, or a foreign patent document, such as a US, European or Japanese patent document. For US patent documents, the United States Patent and Trademark Office additionally provides image files in TIFF format and, for chemical images, corresponding chemical structure files in MOL and CDX format alongside the XML patent document.

In some embodiments, the document source may also be a journal, magazine, paper, or the like, for example Chinese journals, magazines or papers published in university journals, provincial journals, core journals, and the like, or foreign journals, magazines or papers such as those indexed in SCI (Science Citation Index) or published in Science. The journal, magazine or paper may contain text information such as the abstract, discussion, experimental and conclusion sections, and image information such as drawings. In specific implementations, depending on the application scenario, the document source may be any other type of file containing text and images besides the types listed above. The specific types and contents of the document sources are not limited in this application.

In an example where the document is in a picture format, the document is obtained, for example, by photographing the chemical structure on a pharmaceutical product description. In another example, the document is obtained from a displayed web interface, for example by a screen capture function of the electronic device. In yet another example, the document is obtained from a paper document, for example by scanning it with a scanner. In yet another example, the document is an image from a molecular database pre-stored in an image database.

In one embodiment, the information processing system of the computer system obtains the document by receiving a local upload or by capturing it from a network with a crawler tool. For example, the computer system obtains the document through a file upload by the user, obtains the information in the document through edits entered by the user (for example, using a file editor), obtains the document containing the chemical entity information by using web crawler technology, or obtains the document containing the chemical entity information by using NER technology.

In one embodiment, the information processing system of the computer system may further obtain the document containing the chemical entity through a network search. For example, a user inputs a search request on a search interface displayed by the computer system, and the search engine loaded in the computer system searches the network to obtain the document containing the chemical entity. In this embodiment, the search element is, for example, a keyword, a search formula, or a document number such as a patent application number (International Application Number), publication number (International Publication Number), announcement number, or patent number (Patent No.).

For convenience in explaining the inventive concept and principle of the present application, the following embodiments take patent documents as an example of the documents obtained by the information processing system of the computer system. For example, in the following descriptions of step S10 to step S13, the specific illustrations refer to the patent document with application number PCT/CN2015/071266, published as WO2015110024A1, and the patent document with application number PCT/EP2019/069744, published as WO2020020858A1. It should be understood that the detailed description of these specific examples should not be taken in a limiting sense.

Step S11, detecting the line text content in the document, the first type of table content containing the structural information of the chemical entity, and the second type of table content containing the experimental information of the chemical entity, to determine an object to be identified; the structural information of the chemical entity comprises a molecular structure diagram and/or a systematic name. In an embodiment, the information processing system detects the line text content, the first type of table content, and the second type of table content in the document.

In an embodiment, the first type of table content contains structural information of the chemical entity, wherein the structural information includes a molecular structure diagram of the chemical entity and/or a systematic name of the chemical entity. The systematic name is the compound name assigned according to the IUPAC nomenclature rules, also referred to herein as the IUPAC of the compound. In this embodiment, the first type of table may also be referred to as a compound structure table.

The second type of table contains experimental information of the chemical entity, which in this application is an activity value of the compound; in an embodiment, the second type of table may also be referred to as an activity table of the compound.

In the present application, the molecular structure diagram refers to the structural formula of a molecule, which presents the topological structure of the molecule in graphic form. The structural formula (Structural Formula) is typically the general structural formula or the bond-line formula of an organic compound, in which carbon-hydrogen bonds are generally omitted, carbon-carbon single bonds are sometimes omitted, and the skeleton is represented by zigzag lines. The molecular structure diagram describes the structure of a molecule using at least patterns having a chemical definition, or using chemical element symbols representing atoms (or groups) together with patterns having a chemical definition. A pattern having a chemical definition is a graphic symbol defined in the chemical discipline to represent atoms (or groups) in a molecule, the state in which the atoms (or groups) exist in the molecule, or the bonding relationship, orientation relationship, or the like between atoms (or groups). Such a pattern may express a chemical definition by itself, or may express a chemical definition related to an atom (or group) through its positional relationship with that atom (or group) or with the pattern representing it.

As exemplified in step S10 above, in practice the acquired document may be a text document or a picture document; thus, in step S11, the document in which the information processing system detects the line text content and the first and second types of table content to determine the object to be identified may be either a text document or a picture document. Steps S12 and S13 are described in detail below.

Step S12, identifying chemical entities and numbered entities in the object to be identified to obtain the structural information and experimental information of the chemical entities and the numbers corresponding to the structural information and the experimental information. In some documents, the chemical entities are numbered Compound-1, Compound-2, Compound-3, ..., Compound-n; in other documents, they are numbered Example 1, Example 2, Example 3, ..., Example n. The number of a chemical entity is used in the document to mark and distinguish different chemical entities, and within the same document it typically has a fixed written form or format, such as the "Compound-1" format or the "Compound 1" format. It should be understood that the numbering of chemical entities is not limited to the above examples.
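Because the number of a chemical entity keeps one fixed written form within a document, a single regular expression can cover the two formats mentioned above. The following is a minimal sketch only; the pattern `NUMBER_RE` and the helper `find_numbers` are hypothetical names chosen for illustration and are not taken from the application.

```python
import re

# Hypothetical pattern: a label word ("Compound" or "Example"), an optional
# hyphen or whitespace separator, then digits with an optional suffix letter.
NUMBER_RE = re.compile(r"\b(Compound|Example)[-\s]*(\d+[a-z]?)\b", re.IGNORECASE)


def find_numbers(text):
    """Return (label, index) pairs for every number-like field found in text."""
    return [(m.group(1).capitalize(), m.group(2)) for m in NUMBER_RE.finditer(text)]
```

Normalizing the label and the index separately lets "Compound-1" and "compound 1" be treated as the same number later on.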

In an embodiment, when the information processing system detects that the acquired document is a text document, for example a text document in HTML, XML, TXT, or Word format, the information processing system takes the text document as input and extracts its content according to different strategies.

In an embodiment, locating the line text content in the document through text features means determining the sentences or paragraphs to be identified by character matching of identification fields. Specifically, in the step in which the information processing system of the computer system locates the line text content in the document through text features to determine the object to be identified, the information processing system instantiates the full text into a linked-list-like data structure with each line as an element, then matches each line against the regular expression of the identification field of the chemical entity to determine the line where the identification field is located, thereby determining the object to be identified, which in this case is a sentence or a paragraph.

In step S12 of identifying the chemical entities and the numbered entities in the object to be identified, when the acquired document is detected to be a text document, a preset line text recognition model is called to identify the chemical entities and numbered entities contained in the text information of the line text content, so as to obtain the systematic names and the numbers of the chemical entities respectively; the systematic name of each chemical entity and its corresponding number are then determined according to the positional relationship between the systematic name and the number in the text information. In an embodiment, the line text recognition model includes a trained deep learning model and/or a regular expression model.

Referring to fig. 2, a schematic diagram of the line text content of a chemical entity in a text document according to an embodiment of the present application is shown. Fig. 2 (a) shows the line text content of the chemical entity in a text document in Word format, and fig. 2 (b) shows the line text content of the chemical entity in a text document in TXT format. In the embodiment shown in fig. 2 (a), the document containing the chemical entity is the Word-format document of the patent document with application number PCT/CN2015/071266, published as WO2015110024A1; in the embodiment shown in fig. 2 (b), it is the TXT-format document of the same patent document.

Referring to fig. 3, a flowchart of identifying the text information of the line text content in an embodiment of the present application is shown. As shown, in an embodiment in which the chemical entities and numbered entities contained in the text information of the line text content are identified using a line text recognition model that includes a regular expression model, the information processing system performs the following operations:

Step 201, full-text instantiation processing; in this embodiment, the information processing system instantiates the entire text document into a linked-list-like data structure with each line as an element.

Step 202, matching each line by using the regular expression of the identification field of the chemical entity; in this embodiment, the information processing system matches each line against the regular expression of the identification field of the chemical entity, determines the line where the identification field is located, and extracts any number information that may be present from the characters before and after the identification field.

Step 203, performing regular matching of the systematic name field of the chemical entity on the lines before and after the identification field; in this embodiment, the information processing system performs regular matching of the systematic name (IUPAC) field on the lines before and after the identification field, and determines whether each such line belongs to the systematic name (IUPAC) lines of the chemical entity corresponding to the identification field of step 202.

Step 204, determining whether the systematic name of the chemical entity has ended by regular matching of a termination identifier or a line feed identifier; in this embodiment, the information processing system determines whether the systematic name of the chemical entity is complete by regular matching of the termination identifier or the line feed identifier of the systematic name. If the name has ended, the content is truncated there and the remaining content is discarded; if a line feed identifier is identified and the beginning or end of an adjacent line meets the splicing condition (again determined by regular matching), the adjacent line is spliced with the current line, and this continues until a termination identifier is matched.

Step 205, performing regular matching of the number field; in this embodiment, the information processing system performs regular matching of the number field within the systematic name character segment obtained in step 204, so as to extract any number information that may be present and to separate out the complete systematic name of the chemical entity free of other redundant information.

In this embodiment, the information processing system performs the processing of step 203 to step 205 on each identification field identified in step 202, to obtain the systematic names and numbers of all chemical entities in the full text. Taking the line text content shown in fig. 2 (a) or (b) (the shaded section of the figure) as an example, the processing of step 201 to step 205 identifies the systematic name (IUPAC) of the chemical entity as "5-chloro-N-(((3S,3aS)-1-oxo-8-(2-oxopyridin-1(2H)-yl)-3,3a,4,5-tetrahydro-1H-benzo[b]oxazolo[3,4-d][1,4]oxazepin-3-yl)methyl)thiophen-2-carboxamide", and the number of the chemical entity as "Example 23".
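The flow of step 201 to step 205 can be sketched as follows. This is an illustrative simplification and not the application's actual expressions: the patterns `ID_RE`, `NAME_RE`, and `WRAP_END_RE` are assumptions, the name is assumed to start on the same line as the identification field, a line ending in a hyphen, comma, or opening bracket is assumed to be a wrapped line, and names containing spaces are not handled.

```python
import re

# Assumed identification field: "Example 23:" or "Compound-1:".
ID_RE = re.compile(r"\b(Example|Compound)[\s-]*(\d+[a-z]?)\s*:\s*", re.IGNORECASE)
# Assumed character set of a systematic-name fragment (no whitespace).
NAME_RE = re.compile(r"^[\w\[\](),'.\-]+$")
# Assumed line feed identifier: the fragment stops in the middle of a name.
WRAP_END_RE = re.compile(r"[-,(\[]$")


def extract_named_compounds(text):
    """Map each number to the systematic name that follows it in the text."""
    lines = text.splitlines()                      # step 201: one element per line
    results = {}
    for i, line in enumerate(lines):
        match = ID_RE.search(line)                 # step 202: locate identification field
        if match is None:
            continue
        number = f"{match.group(1).capitalize()} {match.group(2)}"
        fragment = line[match.end():].strip()
        if not NAME_RE.match(fragment):            # step 203: is this a name line?
            continue
        name, j = fragment, i
        # Step 204: splice following lines while the name is visibly unfinished.
        while WRAP_END_RE.search(name) and j + 1 < len(lines):
            following = lines[j + 1].strip()
            if not NAME_RE.match(following):
                break
            name += following
            j += 1
        results[number] = name                     # step 205: number and clean name
    return results
```

For instance, the text `"[0001] Example 1: 2-chloro-N-\nmethylacetamide\nThe title compound was obtained."` yields the single pair `"Example 1"` and `"2-chloro-N-methylacetamide"`, because the first line ends with a hyphen and the third line fails the name pattern.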

Referring to fig. 4, a flowchart of identifying the text information of the line text content in another embodiment of the present application is shown. As shown, in an embodiment in which the chemical entities and numbered entities contained in the text information of the line text content are identified using a line text recognition model that includes a deep learning model, the information processing system performs the following steps:

Step 401, performing sentence segmentation on the full text; in an embodiment, the information processing system performs sentence segmentation on the acquired text document.

Step 402, feeding each sentence obtained by the segmentation into an NER (Named Entity Recognition, here recognition of chemical named entities) model; in an embodiment, the information processing system processes each sentence by invoking an NER model pre-stored locally or on a network.

Step 403, the NER model classifies each token of the sentence; in an embodiment, the information processing system classifies the tokens in the sentence, for example using the character "0" to denote other content, the character "1" to denote the number, and the character "2" to denote the systematic name of the chemical entity (illustrated here with IUPAC as the systematic name). In an embodiment, a token is, for example, a keyword, a variable name, a punctuation mark, a bracket, or the like.

Referring to fig. 5, an exemplary diagram of identifying chemical entities and numbered entities in text information according to an embodiment of the present application is shown. As illustrated, assume that sentence A "[00268]Example 23:5-chloro-N-(((3S,3aS)-1-oxo-8-(2-oxopyridin-1(2H)-yl)-3,3a,4,5-tetrahydro-1H)" shown in fig. 2 (a) is fed into the NER model of the information processing system; the information processing system classifies each token of sentence A using the NER model and obtains the result A':

“0000000011111111110022222222222222222222222222222222222222222222222222222222222222222222222222222”。

Further, assume that sentence B "-B ] oxazolo [3,4-d ] [1,4] oxazepin-3-yl) methyl) thiophen-2-carboxamide" is fed into the NER model of the information processing system; the NER model classifies each token of sentence B and yields the result B':

“2222222222222222222222222222222222222222222222222222222222222222222222222”。

Step 403, performing post-processing on the recognition result to obtain the systematic name (IUPAC) of the chemical entity and its corresponding number information. In this example, the systematic name of the chemical entity is the field corresponding to the "2" characters in sentence A and sentence B, and the number is the field "Example 23" corresponding to the "1" characters in sentence A. That is, the IUPAC of the chemical entity is "5-chloro-N-(((3S,3aS)-1-oxo-8-(2-oxopyridin-1(2H)-yl)-3,3a,4,5-tetrahydro-1H-benzo[b]oxazolo[3,4-d][1,4]oxazepin-3-yl)methyl)thiophen-2-carboxamide", and the number of the chemical entity is "Example 23".
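The post-processing of the label strings can be sketched as below. The sketch assumes one label character per character of the sentence, as the example strings A' and B' suggest, and it assumes that a name wrapped across sentences is rejoined by plain concatenation; the function names `decode_labels` and `merge_sentences` are hypothetical, and the NER model itself is not shown.

```python
from itertools import groupby


def decode_labels(sentence, labels):
    """Group consecutive characters that share a label into (label, text) spans."""
    if len(sentence) != len(labels):
        raise ValueError("expected exactly one label per character")
    spans, position = [], 0
    for label, run in groupby(labels):
        length = len(list(run))
        if label != "0":                      # "0" marks other content and is dropped
            spans.append((label, sentence[position:position + length]))
        position += length
    return spans


def merge_sentences(tagged_sentences):
    """Join "1" spans into the number and "2" spans into the systematic name."""
    number_parts, name_parts = [], []
    for sentence, labels in tagged_sentences:
        for label, text in decode_labels(sentence, labels):
            if label == "1":
                number_parts.append(text)
            elif label == "2":
                name_parts.append(text)
    return "".join(number_parts).strip(), "".join(name_parts).strip()
```

Applied to a sentence pair labelled in the same way as A and B above, the function returns the number from the "1" run of the first sentence and the name assembled from the "2" runs of both sentences.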

In another embodiment, when the information processing system detects that the acquired document is a text document, such as a text document in HTML, XML, TXT, or Word format, the information processing system locates the first type of table content and the second type of table content in the document through table features to determine the object to be identified.

In step S12 of identifying the chemical entities and the numbered entities in the object to be identified, when the acquired document is detected to be a text document, the information processing system invokes a preset table recognition model to identify the chemical entities and numbered entities in the text information of the first type of table content and the second type of table content, so as to obtain the structural information or experimental information of the chemical entities and the corresponding numbers respectively, and determines the systematic name or experimental information of each chemical entity and its corresponding number according to the table attributes of the table content. In an embodiment, the table recognition model comprises a regular expression model. The table attributes include one or more of the table header content, the table columns, or the table rows.

Referring to fig. 6 and fig. 7, fig. 6 is a schematic diagram of the first type of table content of chemical entities in a text document according to an embodiment of the present application, and fig. 7 is a schematic diagram of the second type of table content of chemical entities in a text document according to an embodiment of the present application. In the embodiment shown in fig. 6, fig. 6 (a) shows the first type of table content of chemical entities in the Word-format text document of the patent document with application number PCT/EP2019/069744, published as WO2020020858A1, and fig. 6 (b) shows the first type of table content, i.e. the compound table, in the TXT-format text document of the same patent document. In the embodiment shown in fig. 7, fig. 7 (a) shows the second type of table content of chemical entities in the Word-format text document of the patent document with application number PCT/CN2015/071266, published as WO2015110024A1, and fig. 7 (b) shows the second type of table content, i.e. the activity table of the compounds, in the TXT-format text document.

Referring to fig. 8, a flowchart of identifying the text information of table content in another embodiment of the present application is shown. As shown, in an embodiment in which the chemical entities and numbered entities contained in the text information of the table content are identified using a table recognition model that includes a regular expression model, the information processing system performs the following operations:

Step 601, extracting all table elements in the document and converting them into table instances in the program; in an embodiment, the information processing system extracts the table elements in a text document, such as a Word, TXT, or HTML document, and converts them into table instances in the program. Taking a table element in an HTML text document as an example, an HTML table consists of a table element and one or more tr, th, or td elements, where the tr element defines a table row, the th element defines a header cell, and the td element defines a data cell; more complex HTML tables may also include caption, col, colgroup, thead, tfoot, and tbody elements. For example, "Compound No." and "Name" in the Word text document of fig. 6 (a) are the header content of the table.

Step 602, performing regular matching of identifiers on the table header content to determine the cells that may be the number column header and/or the chemical entity column header; in an embodiment, the information processing system performs regular matching of identifiers on the table header content, so as to determine the header cell of the number column and the header cell of the chemical entity (e.g., compound) column from the column headers. For example, "Compound No." in the Word text document of fig. 6 (a) is identified as the header cell of the number column, and "Name" is identified as the header cell of the chemical entity (e.g., compound) column.

Step 603, performing regular matching on the table contents to determine the number column and the chemical entity column; in an embodiment, the information processing system performs regular matching on all cell contents in the columns corresponding to the header cells determined in step 602 to determine whether each column meets the characteristics of a number column or a chemical entity (e.g., IUPAC) column, thereby determining the correct number column and chemical entity column.

Step 604, integrating the results; in an embodiment, the information processing system performs the operations of step 602 and step 603 on all tables in the document, and integrates the results according to one or more of the table header content, the table columns, or the table rows, to obtain the chemical entities in all tables and their corresponding numbers.
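For an HTML text document, steps 601 to 604 can be sketched with the standard-library `html.parser` module. The header and cell patterns below are assumptions made for illustration (the application does not disclose its expressions), the first row of each table is assumed to be the header row, and nested tables and merged cells are not handled.

```python
import re
from html.parser import HTMLParser

# Assumed header and cell patterns; "Compound No." and "Name" follow fig. 6 (a).
NUM_HEADER_RE = re.compile(r"(compound|example)\s*(no\.?|number|#)", re.IGNORECASE)
NAME_HEADER_RE = re.compile(r"\b(name|iupac)\b", re.IGNORECASE)
NUM_CELL_RE = re.compile(r"^(compound|example)?[\s-]*\d+[a-z]?$", re.IGNORECASE)


class TableParser(HTMLParser):
    """Step 601: turn every <table> into a list of rows of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables, self._row, self._cell = [], None, None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            if self._row is not None:
                self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self.tables:
                self.tables[-1].append(self._row)
            self._row = None


def _find_column(header, pattern):
    return next((i for i, cell in enumerate(header) if pattern.search(cell)), None)


def pair_numbers_and_names(html):
    """Return {number: name} collected from every table with matching columns."""
    parser = TableParser()
    parser.feed(html)
    pairs = {}
    for table in parser.tables:
        if len(table) < 2:
            continue
        header, body = table[0], table[1:]
        num_col = _find_column(header, NUM_HEADER_RE)      # step 602
        name_col = _find_column(header, NAME_HEADER_RE)
        if num_col is None or name_col is None:
            continue
        rows = [r for r in body if len(r) > max(num_col, name_col)]
        # Step 603: confirm that the candidate number column really holds numbers.
        if not rows or not all(NUM_CELL_RE.match(r[num_col]) for r in rows):
            continue
        for row in rows:                                   # step 604
            pairs[row[num_col]] = row[name_col]
    return pairs
```

Checking the cell contents in step 603 guards against a header that matches by accident, for example a "Compound No." column that actually holds free text.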

Take the first type of table content shown in fig. 6 as an example, that is, the content of the compound table displayed in the Word document and the TXT document of fig. 6 (a) and (b). As shown by the shaded section of the table, in the embodiment exemplified by the patent document with publication number WO2020020858A1, by identifying the text information in the table content through steps 601 to 604 of fig. 8, the information processing system identifies the number of one chemical entity as "Compound-1" and the IUPAC of that chemical entity as "[ (1S) -2- (7-methyl azofuran-3-yl) -1- [ (2-methyl-vinyl acetate) amino ] ethyl ] carbonic acid"; the systematic name of the chemical entity in the table and its corresponding number are thus obtained. It should be understood that, in the text documents shown in fig. 6 (a) and (b), the table content includes the IUPAC of 8 chemical entities and their corresponding numbers; in actual implementation, the IUPAC of all 8 chemical entities and their corresponding numbers can be obtained by identifying the text information in the table content through steps 601 to 604 of fig. 8, which is not described in detail here.

It should be understood that, when the information processing system detects that the acquired document is a text document, the process of calling the preset table recognition model to identify the experimental information of the chemical entities and the numbered entities in the text information of the second type of table content likewise identifies the text information in the table content through steps 601 to 604 of fig. 8, so that the information of the compound activity table in the second type of table content, that is, the activity values of the compounds and their corresponding numbers, can be obtained.

Taking the second type of table content shown in fig. 7 as an example, in the example of the patent document with publication number WO2015110024A1, by identifying the text information in the table content through steps 601 to 604 of fig. 8 as described above, the information processing system identifies the number of one chemical entity as "Example 23" and the activity value IC₅₀ (nM) of that compound as "2.43". The activity values of the chemical entities in the table and their corresponding numbers can thus be obtained, as shown in fig. 7. It should be understood that, in the text documents shown in fig. 7 (a) and (b), the table content includes the activity values of 16 chemical entities and their corresponding numbers; in actual implementation, the activity values of all 16 chemical entities and their corresponding numbers can be obtained by identifying the text information in the table content through steps 601 to 604 of fig. 8, which is not described in detail here.
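Since both table types yield a number for every record, the structural information and the experimental information can later be associated through that shared number, as the abstract describes. The sketch below is one possible way to do this and is not the application's implementation; the normalization rule and the field names `name` and `ic50_nM` are assumptions, and the name string in the usage note is a placeholder.

```python
import re


def normalize(number):
    """Reduce "Example 23", "example-23", and "Example23" to one comparison key."""
    return re.sub(r"[\s\-_.]+", "", number).lower()


def associate(names, activities):
    """Merge {number: name} and {number: activity value} records by number."""
    merged = {}
    for number, name in names.items():
        merged.setdefault(normalize(number), {})["name"] = name
    for number, value in activities.items():
        merged.setdefault(normalize(number), {})["ic50_nM"] = value
    return merged
```

With `names = {"Example 23": "<systematic name>"}` and `activities = {"Example-23": 2.43}`, both records end up under the key `"example23"`, while a number that appears in only one table keeps a partial record.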

In step S11 of detecting the line text content in the document, the first type of table content containing the structural information of the chemical entity, and the second type of table content containing the experimental information of the chemical entity to determine the object to be identified, when the acquired document is a picture document, for example in BMP, JPG, or PDF format, the method further includes: the information processing system divides the picture document into a plurality of pictures according to a preset segmentation rule, performs image enhancement processing on each picture, and stores the results as pictures to be identified.

In this embodiment, when the acquired document is a picture document, the method further includes splitting the picture document into a plurality of pictures with the document page as the splitting unit. For example, where the acquired document is a PDF-format patent document, the information processing system splits the whole patent document into a plurality of pictures page by page according to the page numbers of the PDF document; in a specific implementation, the page number may be one located in the page header or one located in the page footer of the document.

In the present embodiment, the information processing system performs image segmentation on the received picture document based on at least one algorithm such as color space similarity, image feature extraction and classification, or image masking. Each of these algorithms may be an image segmentation algorithm obtained through machine learning, for example an image segmentation algorithm built on the network layer architecture of the Fast R-CNN algorithm or on the network layer architecture of the VGGNet algorithm.

In order to improve the recognition rate of the line text information and the table information in the segmented pictures, the information processing system performs image preprocessing on the segmented pictures. In practice, the image preprocessing includes at least one of the following: adjusting the size of the segmented picture according to a preset image size, and adjusting the definition of the picture by means of image processing. For example, the information processing system scales the line text content or the table content proportionally according to the preset image size, and/or pads a blank background around the line text content or the table content in the picture according to the preset image size. In another example, the information processing system performs noise reduction, sharpening, and similar processing on the segmented pictures to improve their definition.
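The proportional scaling and blank padding described above reduce to a small geometry calculation, sketched here without any imaging library. The function name `fit_to_preset` and the choice to center the content are assumptions; the application only states that the content is scaled in equal proportion and surrounded by a blank background.

```python
def fit_to_preset(width, height, preset_width, preset_height):
    """Return (new_width, new_height, pad_left, pad_top) for a preset canvas size.

    The content is scaled by one common factor so that it fits inside the preset
    size, then centered; the remaining area is the blank background padding.
    """
    if min(width, height, preset_width, preset_height) <= 0:
        raise ValueError("all sizes must be positive")
    scale = min(preset_width / width, preset_height / height)
    new_width = min(preset_width, max(1, round(width * scale)))
    new_height = min(preset_height, max(1, round(height * scale)))
    pad_left = (preset_width - new_width) // 2
    pad_top = (preset_height - new_height) // 2
    return new_width, new_height, pad_left, pad_top
```

Using a single scale factor for both axes keeps characters and bond angles undistorted, which matters for the later OCR and structure recognition.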

In an embodiment, the step of identifying the line text content in the picture to be identified includes:

Extracting the line text content in the picture to be identified by using OCR to obtain text information; in an embodiment, the information processing system uses an OCR algorithm to recognize the line text content in each picture obtained by splitting the document page by page, so as to obtain the text information in the picture, for example a TXT text; in this embodiment, the information processing system recognizes every segmented picture of the picture document, thereby obtaining the full-text TXT (i.e., the text information) of the picture document.

Invoking a preset line text recognition model to identify the chemical entities and numbered entities contained in the text information of the line text content, so as to obtain the systematic names and the numbers of the chemical entities respectively, and determining the systematic name or experimental information of each chemical entity and its corresponding number according to the positional relationship between the systematic name or experimental information and the number in the text information. In an embodiment, the line text recognition model invoked by the information processing system includes a trained deep learning model and/or a regular expression model.

In the embodiment in which a line text recognition model including a regular expression model is used to identify the chemical entities and numbered entities contained in the text information of the line text content, the operations performed by the information processing system are as described above for step 201 to step 205 and are not repeated here.

In the embodiment in which a line text recognition model including a deep learning model is used to identify the chemical entities and numbered entities contained in the text information of the line text content, the operations performed by the information processing system are as described above for step 401 to step 403 and are not repeated here.

Referring to fig. 9, an exemplary diagram of chemical entities and numbers in the first and second types of table content of pictures to be identified in an embodiment of the present application is shown. Fig. 9 (a), (b), and (c) are each a picture, cut from a PDF document, containing a table to be identified. Fig. 9 (a) shows first type table content from the patent document with application number PCT/EP2019/069744, published as WO2020020858A1, which includes only the IUPAC and numbers of the chemical entities; fig. 9 (b) shows first type table content from the same patent document, which includes not only the IUPAC and numbers but also the molecular structure diagrams of the chemical entities; fig. 9 (c) shows second type table content from the same patent document, which includes experimental information.

Referring to fig. 10, a flowchart of identifying table contents in a picture to be identified in an embodiment of the present application is shown, where in an embodiment, the step of identifying first and second types of table contents in the picture to be identified includes:

step 801, inputting the plurality of pictures into a preset table detection model to detect the table pictures, and then cutting and storing the table pictures as the table pictures to be identified; in an embodiment, the information processing system inputs the plurality of pictures to a preset table detection model, and the table detection model cuts and saves the table pictures as the table pictures to be identified after detecting the table pictures. In this embodiment, the form detection model locates the form image in the picture by using a target detection model/algorithm such as yolo/fast-RCNN, and cuts and saves the form image as the form picture to be identified.

Step 802, detecting a first type table and a second type table in the table picture to be identified to locate chemical entity contents and numbered entity contents in the table picture; in an embodiment, the information processing system performs target detection on the first type and/or the second type of table contents in the table picture to be identified, and in this embodiment, the information processing system performs target detection on chemical entities in the first type and/or the second type of table contents in the table picture to be identified and performs target detection on numbering entities, so as to locate chemical entity contents and numbering entity contents in the first type and/or the second type of table contents. In a specific example, the location of each chemical entity content and numbered entity content is located using, for example, a target detection model/algorithm such as yolo/fast-RCNN. In this embodiment, the content of the chemical entity refers to the area of the chemical entity in the form image, and refers to the part of the form image containing the chemical entity; correspondingly, the content of the numbering entity refers to the area of the numbering entity in the form image, and refers to the part of the form image containing the numbering entity.

Step 803, extracting the positional relationship between the chemical entity contents and the numbered entity contents in the table picture to be identified, and extracting the chemical entity contents and the numbered entity contents by OCR to obtain text information. In this embodiment, the information processing system extracts the positional relationship according to the table layout pattern: it identifies the image features of each chemical entity content and of each numbered entity content, and obtains the positional relationship between each chemical entity content and its corresponding numbered entity from the coordinates of these image features in the image.

In this embodiment, the information processing system performs OCR recognition on the chemical entity content and the numbered entity content obtained through the target detection, that is, extracts the chemical entity content and the numbered entity content by using OCR to obtain text information.

Step 804, calling a preset text recognition model to recognize the chemical entities and the numbered entities contained in the text information, so as to obtain the systematic naming or experimental information of the chemical entities and their numbers respectively, and determining the systematic naming or experimental information of each chemical entity and its corresponding number according to the positional relationship. In an example, the information processing system invokes the preset text recognition model (for example, as in the embodiment described above with respect to steps S201 to S205 or the embodiment described with respect to steps S401 to S403) to recognize the chemical entities, the experimental information and the numbered entities included in the text information, so as to obtain each chemical entity, its experimental information and its corresponding number in all tables.
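As an illustration of how the positional relationship of steps 803 and 804 might be used, the following minimal sketch pairs each detected chemical entity region with the numbered entity region in the same table row. It assumes a layout in which a number and its entity share a row; the `Box` type, the `pair_by_row` function and the nearest-vertical-centre rule are hypothetical choices for illustration and are not specified in the present application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned region in the table picture plus its OCR text (hypothetical type)."""
    x: float
    y: float
    w: float
    h: float
    text: str

    @property
    def cy(self) -> float:
        # Vertical centre of the region, used as a proxy for the table row.
        return self.y + self.h / 2


def pair_by_row(numbers: list[Box], entities: list[Box]) -> dict[str, str]:
    """Map each number's text to the text of the entity whose row is closest.

    Assumption: one number and one entity per row; a real table may need
    column-aware or multi-line handling.
    """
    pairs: dict[str, str] = {}
    for entity in entities:
        nearest = min(numbers, key=lambda n: abs(n.cy - entity.cy))
        pairs[nearest.text] = entity.text
    return pairs


if __name__ == "__main__":
    numbers = [Box(0, 0, 50, 20, "Compound-1"), Box(0, 30, 50, 20, "Compound-2")]
    entities = [Box(60, 2, 200, 18, "name A"), Box(60, 31, 200, 20, "name B")]
    print(pair_by_row(numbers, entities))
```

Matching on the vertical centre alone is the simplest rule that follows from the coordinate relationship; tables whose numbers sit above or below the entity would need a different distance measure.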

As shown in fig. 9 (a) or (b), taking the patent document with application number PCT/EP2019/069744 and publication number WO2020020858A1 as an example, the information processing system identifies the text information in the table contents through steps 801 to 804 in fig. 10 and obtains, for the chemical entity numbered "Compound-1", the IUPAC name "[(1S)-2-(7-methylimidazofuran-3-yl)-1-[(2-methylsulfanylacetyl)amino]ethyl]boronic acid". It should be understood that the table contents in the picture document shown in fig. 9 (a) or (b) include the IUPAC names of a plurality of chemical entities and their corresponding numbers; in actual implementation, the IUPAC name and corresponding number of each chemical entity can be obtained through the processing of steps 801 to 804 in fig. 10, which is not described in detail here.

As shown in fig. 9 (c), taking the patent document with application number PCT/EP2019/069744 and publication number WO2020020858A1 as an example, the information processing system identifies the text information in the table contents through steps 801 to 804 in fig. 10 and obtains the chemical entity number "Compound-1" and the experimental information LM7 IC50 (M) of the chemical entity numbered Compound-1. It should be understood that a character in the table represents a piece of experimental information; in this embodiment, according to the information described in the patent document, the character "x" represents the numerical range "0.5 μM < IC₅₀ ≤ 5 μM".

Taking the patent document with publication number WO2015110024A1 as an example, please refer to fig. 11, which shows an exemplary diagram of chemical entities and numbers in the second type of table contents of the picture to be identified in another embodiment of the present application. As shown, the information processing system identifies the text information in the second type of table contents through steps 801 to 804 in fig. 10 and obtains the chemical entity number "Example 23" together with the experimental information of the compound numbered Example 23, i.e. the activity value: IC₅₀ (nM) is "2.43". The activity values of the other chemical entities in the table and their corresponding numbers can be obtained in the same way. It should be understood that the table contents shown in fig. 11 include the activity values of 16 chemical entities and their corresponding numbers; in actual implementation, the activity value and corresponding number of each of the 16 chemical entities can be obtained by identifying the text information in the second type of table contents through steps 801 to 804 in fig. 10, which is not described in detail here.

Step S13, converting the structural information of the chemical entity into chemical structure information in a preset data format, and storing and/or outputting the chemical structure information and its experimental information after associating them through the numbers. In an embodiment, the information processing system converts the IUPAC name of the chemical entity into chemical structure information in a preset data format, associates the chemical structure information with the experimental information through the number, and then stores them in a storage space of the computer system, or outputs them through a display interface of a display device so that a user can view, capture or copy them. In this embodiment, the data format is one that can be saved as text, for example the SMILES format (Simplified Molecular Input Line Entry System), the InChI format, the MDL Molfile format, the SDF format, or another data format that facilitates text searching.

Taking the patent document shown in fig. 2, with application number PCT/CN2015/071266 and publication number WO2015110024A1, as an example, the systematic naming of the chemical entity is the IUPAC name, whose field is: "5-chloro-N-(((3S,3aS)-1-oxo-8-(2-oxopyridin-1(2H)-yl)-3,3a,4,5-tetrahydro-1H-benzo[b]oxazolo[3,4-d][1,4]oxazepin-3-yl)methyl)thiophene-2-carboxamide".

In an embodiment, the IUPAC name is converted into the SMILES format by step S13 of the information processing method of the present application, giving "ClC1=CC=C(S1)C(=O)NC[C@H]1OC(N2C3=C(OCC[C@H]21)C=C(C=C3)N3C(C=CC=C3)=O)=O".

In another embodiment, the IUPAC name is converted into the InChI format by step S13 of the information processing method of the present application, i.e. the information obtained in the InChI format is "InChI=1S/C22H18ClN3O5S/c23-19-7-6-18(32-19)21(28)24-12-17-15-8-10-30-16-11-13(25-9-2-1-3-20(25)27)4-5-14(16)26(15)22(29)31-17/h1-7,9,11,15,17H,8,10,12H2,(H,24,28)/t15-,17-/m0/s1".

In this application, after the systematic naming of the chemical entity is converted into chemical structure information in the preset data format, the chemical structure information is associated with the experimental information and the number. For example, the chemical structure information, the corresponding number and the corresponding experimental information of each chemical entity are integrated in a table, so that the numbers are displayed in one column of the table, the chemical structure information in another column and the experimental information in a further column, with the chemical structure information, experimental information and number of each chemical entity located in the same row. In an embodiment, the table is, for example, an Excel table (xls format data table) or a CSV table (CSV format data table).
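The association by number described above can be sketched as a simple join on the shared number followed by a CSV export. This is a minimal sketch only: the function names `build_rows` and `to_csv`, the column names, and the use of Python's standard `csv` module are illustrative assumptions rather than the implementation of the present application.

```python
import csv
import io

COLUMNS = ["number", "structure", "experiment"]  # hypothetical column names


def build_rows(structures: dict[str, str], experiments: dict[str, str]) -> list[dict[str, str]]:
    """Join structure records and experimental records on their shared number."""
    rows = []
    for number in sorted(structures.keys() | experiments.keys()):
        rows.append({
            "number": number,
            # A missing value is left empty rather than guessed.
            "structure": structures.get(number, ""),
            "experiment": experiments.get(number, ""),
        })
    return rows


def to_csv(rows: list[dict[str, str]]) -> str:
    """Serialise the joined rows as CSV text, one chemical entity per row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    structures = {
        "Example 23": "ClC1=CC=C(S1)C(=O)NC[C@H]1OC(N2C3=C(OCC[C@H]21)"
                      "C=C(C=C3)N3C(C=CC=C3)=O)=O",
    }
    experiments = {"Example 23": "IC50 (nM) = 2.43"}
    print(to_csv(build_rows(structures, experiments)))
```

Keying both inputs by the number keeps the structure and the experimental value of one chemical entity in the same row, even when they were extracted from different tables.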

Please refer to fig. 12, which is a schematic diagram showing the chemical structure information, experimental information and numbers of chemical entities integrated into a table in an embodiment of the present application. Taking the patent document with application number PCT/CN2015/071266 and publication number WO2015110024A1 as an example, the first column of the table is the chemical entity number "Example 23"; the second column is the chemical structure information of the chemical entity, i.e. the SMILES "ClC1=CC=C(S1)C(=O)NC[C@H]1OC(N2C3=C(OCC[C@H]21)C=C(C=C3)N3C(C=CC=C3)=O)=O"; and the third column is the experimental information of the chemical entity, i.e. the activity value: IC₅₀ (nM) is "2.43". Integrated structured data can thus be obtained, which in turn facilitates drug discovery, research and development and similar work.

Under the inventive concept of the present application, structured data integrating various kinds of information can be obtained by the method of the present application. Please refer to fig. 13, which is a schematic diagram showing the chemical structure information, experimental information and numbers of chemical entities integrated into a table in another embodiment of the present application. Still taking the patent document with application number PCT/CN2015/071266 and publication number WO2015110024A1 as an example, the first column of the table is the chemical entity number "Example 23"; the second column is the systematic naming of the chemical entity, i.e. the IUPAC name "5-chloro-N-(((3S,3aS)-1-oxo-8-(2-oxopyridin-1(2H)-yl)-3,3a,4,5-tetrahydro-1H-benzo[b]oxazolo[3,4-d][1,4]oxazepin-3-yl)methyl)thiophene-2-carboxamide"; the third column is the chemical structure information of the chemical entity, i.e. the SMILES "ClC1=CC=C(S1)C(=O)NC[C@H]1OC(N2C3=C(OCC[C@H]21)C=C(C=C3)N3C(C=CC=C3)=O)=O"; and the fourth column is the experimental information of the chemical entity, i.e. the activity value: IC₅₀ (nM) is "2.43".

In another embodiment, when the document acquired by the computer system is a picture document, the line text content and the table content of the picture document may further include a molecular structure diagram of the chemical entity, for example as shown in fig. 2 (a) and fig. 9 (b). The information processing method of the present application may therefore further recognize the molecular structure diagram by an image recognition method to obtain the chemical structure information of the chemical entity, and store and/or output the chemical structure information after associating it with its corresponding number. That is, the molecular structure diagram is recognized by an image recognition method to obtain the corresponding SMILES format, InChI format, MDL Molfile format, SDF format or the like, preferably the SMILES format. In an embodiment, the computer system extracts and recognizes the chemical element symbols, charge information, chiral information and chemical bonds in the molecular structure diagram by one or more image classifiers to obtain accurate chemical structure information, for example by one or more of the technical schemes described in Chinese patent applications 202110526390.6, 202110526496.6 and/or 202110526490.9 filed earlier by the applicant, the entire contents of which are incorporated herein by reference.

In an embodiment, the information processing method of a chemical entity further includes a step of performing error correction processing on the systematic naming of the chemical entity by using a preset error correction model. This step corrects systematic namings that contain OCR recognition errors, reducing the information loss caused by such errors and improving the extraction rate of the systematic naming of chemical entities. In this embodiment, the step of performing error correction processing on the systematic naming of the chemical entity includes:

Locating an error field in the systematic naming of the chemical entity when failure information is obtained for converting the systematic naming into chemical structure information in the preset data format. In this embodiment, the information processing system converts the IUPAC name of the identified chemical entity into the SMILES format; if the conversion fails, this indicates that an error field exists in the IUPAC name, and the error field is located through a pre-stored error correction model. In this embodiment, the error correction model includes a preset field library in which a plurality of error fields and the correct field corresponding to each error field are stored.

Correcting the error field according to the error correction model to obtain the correct systematic naming of the chemical entity. In this embodiment, the information processing system corrects the error field in the IUPAC name into the correct field through the error correction model to obtain the correct IUPAC name of the chemical entity.

In this embodiment, the step in which the information processing system performs error correction processing on the systematic naming of the chemical entity includes: first, the information processing system inputs the systematic naming of the chemical entity into the field library for matching, so as to locate the error field therein; then, the information processing system corrects the systematic naming according to the correct field corresponding to the error field to obtain the correct systematic naming of the chemical entity.
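The field library lookup described above can be sketched as a mapping from error fields to correct fields. In this minimal sketch only the entry "-0xo0-" to "-oxo-" comes from the example in this document; the second entry, the names `FIELD_LIBRARY`, `locate_error_fields` and `correct_name`, and the plain substring matching are hypothetical.

```python
# Error field -> correct field. Only the first entry is taken from the example
# in this document; the second is a hypothetical illustration.
FIELD_LIBRARY = {
    "-0xo0-": "-oxo-",
    "tetrahydro1H": "tetrahydro-1H",
}


def locate_error_fields(name: str, library: dict[str, str]) -> list[tuple[int, str]]:
    """Return (start index, error field) for every library entry found in the name."""
    hits = []
    for wrong in library:
        start = name.find(wrong)
        while start != -1:
            hits.append((start, wrong))
            start = name.find(wrong, start + 1)
    return sorted(hits)


def correct_name(name: str, library: dict[str, str]) -> str:
    """Replace each located error field with its stored correct field."""
    for wrong, right in library.items():
        name = name.replace(wrong, right)
    return name


if __name__ == "__main__":
    raw = "1-0xo0-8-(2-oxopyridin-1(2H)-yl)"
    print(locate_error_fields(raw, FIELD_LIBRARY))
    print(correct_name(raw, FIELD_LIBRARY))
```

A lookup of this kind only repairs errors already present in the library, which is consistent with the document also describing trained detection and correction models for unseen errors.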

In this embodiment, the preset error correction model is obtained through deep learning training. Specifically, the preset error correction model comprises two parts, an error detection model and an error correction model, the latter including a seq2seq (sequence-to-sequence) model. The step of obtaining the error detection model through deep learning training includes:

Collecting in advance the correct systematic naming data of chemical entities for a large batch of compounds. In an embodiment, the correct IUPAC data of a large batch of compounds may be collected by web crawler technology or by NER technology; of course, the data may also be obtained by manual sorting and entry into the computer system.

Segmenting all IUPAC names successively into 1-grams, 2-grams, 3-grams, … n-grams to obtain an IUPAC character fragment library. In an embodiment, the computer system successively segments the large quantity of IUPAC names obtained above into 1-grams, 2-grams, 3-grams, … n-grams to obtain the IUPAC character fragment library. The n-gram method is based on a statistical language model in which each item depends only on the preceding n-1 items, i.e. a Markov chain of order n-1. Its basic idea is to slide a window of size N over the text byte by byte, forming a sequence of byte fragments of length N. Each byte fragment is called a gram; the occurrence frequency of all grams is counted and filtered against a preset threshold to form a list of key grams, which is the vector feature space of the text. Each gram in the list is one feature vector dimension.
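The sliding-window segmentation and frequency filtering described above can be sketched as follows. The sketch works on characters rather than bytes, and the names `char_ngrams` and `build_fragment_library` as well as the parameters `max_n` and `min_count` are assumptions made for illustration.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> list[str]:
    """Slide a window of size n over the text and return every fragment."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_fragment_library(names: list[str], max_n: int, min_count: int) -> dict[str, int]:
    """Count all 1-grams to max_n-grams and keep those at or above the threshold."""
    counts: Counter[str] = Counter()
    for name in names:
        for n in range(1, max_n + 1):
            counts.update(char_ngrams(name, n))
    return {gram: count for gram, count in counts.items() if count >= min_count}


if __name__ == "__main__":
    library = build_fragment_library(["oxo", "oxa"], max_n=2, min_count=2)
    print(library)
```

The threshold `min_count` plays the role of the preset threshold in the text: fragments seen too rarely are dropped, so the remaining grams form the key gram list.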

All the fragments obtained in the above step are converted into pictures, the pictures are then recognized back into characters by different OCR tools, and every result that fails to restore the original fragment is taken as the error set corresponding to that fragment, thereby obtaining an error field library.

Then, erroneous IUPAC training data are constructed at random by using the error sets corresponding to the fragment library, and a character-level sequence labeling model is trained on them to obtain the error detection model. In the process of locating the error field in the systematic naming of a chemical entity, the systematic naming is input into the error field library of the error detection model for matching so as to locate the error field therein.
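One way to construct such erroneous training data is sketched below: fragments of a correct name are randomly swapped for entries from their error sets, and each output character receives a label marking whether it belongs to an injected error. The function name `corrupt`, the `rate` parameter, the 0/1 label scheme and the sample error set are hypothetical; the document does not specify them.

```python
import random


def corrupt(name: str, error_sets: dict[str, list[str]],
            rng: random.Random, rate: float = 0.3) -> tuple[str, list[int]]:
    """Return (corrupted name, per-character labels); label 1 marks injected errors.

    error_sets maps a correct fragment to the OCR outputs that failed to
    restore it. Every character of a substituted fragment is labelled 1.
    """
    # Prefer longer fragments so that "oxo" is tried before "o".
    fragments = sorted((f for f in error_sets if f), key=len, reverse=True)
    out: list[str] = []
    labels: list[int] = []
    i = 0
    while i < len(name):
        match = next((f for f in fragments if name.startswith(f, i)), None)
        if match is not None and rng.random() < rate:
            wrong = rng.choice(error_sets[match])
            out.append(wrong)
            labels.extend([1] * len(wrong))
            i += len(match)
        else:
            out.append(name[i])
            labels.append(0)
            i += 1
    return "".join(out), labels


if __name__ == "__main__":
    sets = {"oxo": ["0xo0", "ox0"]}  # hypothetical error set for one fragment
    noisy, tags = corrupt("1-oxo-8", sets, random.Random(0), rate=1.0)
    print(noisy, tags)
```

The (corrupted name, labels) pairs have the shape a character-level sequence labeling model expects: one label per input character.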

In an embodiment, the step of obtaining the error correction model through deep learning training includes: training with the pre-collected correct IUPAC data, randomly masking some of the IUPAC characters during training and then restoring them; then training a seq2seq model that maps incorrect IUPAC fragments to correct IUPAC fragments, thereby obtaining the error correction model. The error correction model is used to correct the error field in an identified IUPAC name into the correct field, so as to obtain the correct IUPAC name of the chemical entity.

In one example, the conversion fails when the information processing system converts the IUPAC name identified for the chemical entity into the SMILES format. For example, taking the patent document shown in fig. 2, with application number PCT/CN2015/071266 and publication number WO2015110024A1, the information processing system obtains the following initial IUPAC name as the systematic naming of the chemical entity:

"5-chloro-N-(((3S,3aS)-1-0xo0-8-(2-oxolidin-1(2H)-yl)-3,3a,4,5-tetrahydro1H-benzo[b]oxazo[3,4-d][1,4]oxazepin-3-yl)methyl)thiophene-2-carboxamide". When the conversion of this initial IUPAC name into the SMILES format does not succeed, the IUPAC name is checked by the preset error correction model described above, which finds, for example, that the characters "-0xo0-" in the above field should be "-oxo-". Correcting the error fields yields the correct IUPAC name: "5-chloro-N-(((3S,3aS)-1-oxo-8-(2-oxopyridin-1(2H)-yl)-3,3a,4,5-tetrahydro-1H-benzo[b]oxazolo[3,4-d][1,4]oxazepin-3-yl)methyl)thiophene-2-carboxamide".

The information processing system then converts the corrected IUPAC name into the SMILES format and obtains the following SMILES information: "ClC1=CC=C(S1)C(=O)NC[C@H]1OC(N2C3=C(OCC[C@H]21)C=C(C=C3)N3C(C=CC=C3)=O)=O".
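The control flow of this example (attempt the conversion, correct the name on failure, then retry) can be sketched as below. The converter and corrector are passed in as functions because the document does not name a specific conversion tool; `ConversionError`, `convert_with_correction` and the stand-in converter are hypothetical.

```python
from typing import Callable, Optional


class ConversionError(Exception):
    """Raised by a converter when a name cannot be parsed (hypothetical)."""


def convert_with_correction(
    name: str,
    convert: Callable[[str], str],
    correct: Callable[[str], str],
) -> Optional[str]:
    """Convert a name; on failure, correct it once and retry.

    Returns None when the name cannot be corrected or still fails to convert.
    """
    try:
        return convert(name)
    except ConversionError:
        fixed = correct(name)
        if fixed == name:
            return None  # nothing to correct, so a retry would fail again
        try:
            return convert(fixed)
        except ConversionError:
            return None


def stand_in_convert(name: str) -> str:
    """Stand-in for a real IUPAC-to-SMILES converter; rejects one known error."""
    if "0xo0" in name:
        raise ConversionError(name)
    return "SMILES<" + name + ">"


if __name__ == "__main__":
    fix = lambda n: n.replace("-0xo0-", "-oxo-")
    print(convert_with_correction("1-0xo0-8", stand_in_convert, fix))
```

Triggering correction only after a failed conversion matches the text above, where the conversion failure is the signal that an error field exists.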

In summary, the information processing method of the chemical entity of the present application detects the acquired text document and/or picture document to determine the line text content and/or table content of the chemical entities, identifies this content to obtain the systematic naming, experimental information and corresponding numbers of the chemical entities, converts the systematic naming into chemical structure information in a preset data format, and stores and/or outputs the chemical structure information after associating it with the experimental information and numbers. In this way the number information, chemical structure information and experimental information of each chemical entity in the literature are extracted together, and accurate, comprehensive and standardized structured data are obtained after collation, which facilitates drug discovery, research and development and similar work.

The present application also provides an information processing system for a chemical entity. In embodiments, the information processing system may be centrally located in a terminal computer or distributed between the terminal computer and a server (or server system). Examples of the terminal computer include a mobile phone, a tablet computer, a personal computer, and the like. The server (or server system) includes, for example, a single server or a server cluster. Referring to fig. 14, a schematic diagram of an information processing system according to an embodiment of the present application is shown. The information processing system 20 of the chemical entity includes: an acquisition module 201, a detection module 202, an identification module 203 and a conversion module 204.

The acquisition module 201 is configured to acquire a document containing a chemical entity; the document includes a text document and/or a picture document. In an embodiment, the acquisition module 201 is implemented in the same way as the embodiment described for step S10 above; the description of step S10 is incorporated here in its entirety and is not repeated.

The detection module 202 is configured to detect the line text content in the document, the first type of table content containing structural information of the chemical entity, and the second type of table content containing experimental information of the chemical entity, so as to determine an object to be identified; the structural information of the chemical entity includes a molecular structure diagram and/or a systematic naming. In an embodiment, the detection module 202 is implemented in the same way as the embodiment described for step S11 above; the description of step S11 is incorporated here in its entirety and is not repeated.

The identification module 203 is configured to identify the chemical entities and numbered entities in the object to be identified to obtain the structural information and experimental information of the chemical entities and the numbers corresponding to the structural information and the experimental information. In an embodiment, the identification module 203 is implemented in the same way as the embodiment described for step S12 above; the description of step S12 is incorporated here in its entirety and is not repeated.

The conversion module 204 is configured to convert the structural information of the chemical entity into chemical structure information in a preset data format, and to store and/or output the chemical structure information and its experimental information after associating them through the numbers. In an embodiment, the conversion module 204 is implemented in the same way as the embodiment described for step S13 above; the description of step S13 is incorporated here in its entirety and is not repeated.
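The way the four modules hand data to one another can be sketched as a simple pipeline. This is only an architectural illustration: the class name `ChemicalEntityPipeline`, the field names and the use of plain callables are assumptions, and each stage would in practice wrap the corresponding step S10 to S13.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ChemicalEntityPipeline:
    """Chains the four modules; each field is a stand-in for one module."""
    acquire: Callable[[str], Any]     # acquisition module 201 (step S10)
    detect: Callable[[Any], list]     # detection module 202 (step S11)
    identify: Callable[[list], list]  # identification module 203 (step S12)
    convert: Callable[[list], list]   # conversion module 204 (step S13)

    def run(self, source: str) -> list:
        document = self.acquire(source)
        objects = self.detect(document)
        records = self.identify(objects)
        return self.convert(records)


if __name__ == "__main__":
    # Stub stages that only show the data flow; none performs real processing.
    pipeline = ChemicalEntityPipeline(
        acquire=lambda src: f"doc:{src}",
        detect=lambda doc: [doc + "/table"],
        identify=lambda objs: [{"number": "Example 23", "source": o} for o in objs],
        convert=lambda recs: [dict(r, structure="<converted>") for r in recs],
    )
    print(pipeline.run("a.pdf"))
```

Keeping each module behind a single callable makes it possible to place some stages on a terminal computer and others on a server, as the description of the system allows.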

Referring to fig. 15, which is a block diagram of a computer system in an embodiment of the present application, the present application provides a computer system 30 whose hardware includes at least: a display 303, at least one memory 301 and at least one processor 302. The at least one memory 301 is used for storing at least one program. The at least one processor 302 is connected to the at least one memory 301 and is configured to invoke and execute the at least one program so as to implement the information processing method of the chemical entity described above for any of the embodiments of fig. 1 to 13.

In some examples, the at least one memory may also include memory remote from the one or more processors, such as network-attached memory accessed via RF circuitry or external ports and a communication network, which may be the Internet, one or more intranets, a local area network, a wide area network, a storage area network, etc., or a suitable combination thereof. Access to the memory by other components of the device, such as the CPU and peripheral interfaces, is optionally controlled by a memory controller. The memory optionally includes volatile memory, such as high-speed random access memory, and optionally also non-volatile memory, such as read-only memory, one or more magnetic disk storage devices, flash memory devices, hard disks, solid state disks, or other non-volatile solid-state memory devices.

The at least one processor includes an integrated circuit chip having signal processing capabilities, or includes a general-purpose processor, which may be, for example, a digital signal processor, an application-specific integrated circuit, discrete gate or transistor logic, a discrete hardware component, a processor integrated with at least one processor core, or the like. The at least one processor and the at least one memory are in data communication via one or more communication buses or signal lines, so that the processor can invoke and execute the at least one program to perform the information processing method of the chemical entity.

The computer system further includes hardware devices comprising at least one of a human-machine interaction device, a display device, a network interface device and the like. The human-machine interaction device is operated by a user (such as a person skilled in the art) so that the electrical signals generated by the operation are transmitted to the at least one processor, which then invokes the corresponding program to process the values/instructions/information represented by the received electrical signals. Examples of the human-machine interaction device include at least one of a mouse, a keyboard, a touch screen and the like.

The application also discloses a computer-readable storage medium comprising a stored computer program, wherein the computer program, when run by a processor of a computer, controls the computer to execute the information processing method of the chemical entity described above for any of the embodiments of fig. 1 to 13.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application.

In the embodiments provided herein, the computer-readable storage medium may include read-only memory, random-access memory, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, flash memory, a USB flash drive, a removable hard disk, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer. In addition, any connection is properly termed a computer-readable medium. For example, if the instructions are transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. It should be understood, however, that computer-readable and data storage media do not include connections, carrier waves, signals, or other transitory media, but are intended to be directed to non-transitory, tangible storage media. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers.

In one or more exemplary aspects, the functions described by the image recognition method of the molecular structure diagram, the data entry method of the database, or the computer program of the retrieval method described in the present application may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. The steps of a method or algorithm disclosed herein may be embodied in a processor-executable software module, which may be located on a tangible, non-transitory computer-readable and writable storage medium. Tangible, non-transitory computer readable and writable storage media may be any available media that can be accessed by a computer.

The present application also provides a computer program product comprising a computer program which, when run by a processor, implements the method described above; for details, please refer to the description of the information processing method of the chemical entity in any of the embodiments of fig. 1 to 13.

The flowcharts and block diagrams in the figures described above illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The foregoing embodiments are merely illustrative of the principles of the present application and their effectiveness, and are not intended to limit the application. Modifications and variations may be made to the above-described embodiments by those of ordinary skill in the art without departing from the spirit and scope of the present application. Accordingly, it is intended that all equivalent modifications and variations which may be accomplished by persons skilled in the art without departing from the spirit and technical spirit of the disclosure be covered by the claims of this application.

Claims

1. A method for processing information of a chemical entity, comprising the steps of:

acquiring a document containing a chemical entity, wherein the document comprises a text document and/or a picture document;

detecting, in the document, line text content, first-type table content containing structural information of the chemical entity, and second-type table content containing experimental information of the chemical entity, so as to determine an object to be identified, wherein the structural information of the chemical entity comprises a molecular structure diagram and/or a systematic name;

identifying chemical entities and numbered entities in the object to be identified to obtain the structural information and the experimental information of the chemical entities, together with the numbers corresponding to the structural information and the experimental information;

and converting the structural information of the chemical entity into chemical structure information in a preset data format, associating the chemical structure information with the corresponding experimental information through the numbers, and then storing and/or outputting the associated chemical structure information and experimental information.
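The final step of claim 1 amounts to a join on the compound number. The following is a minimal sketch of that association only, assuming the structures have already been converted to a preset format (SMILES strings here) and the experimental information is kept as raw text. The names `CompoundRecord` and `associate_by_number` and all sample values are hypothetical and do not come from the application.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class CompoundRecord:
    number: str                      # compound number as it appears in the document
    structure: Optional[str] = None  # structure in the preset format (assumed: SMILES)
    activity: Optional[str] = None   # experimental information, kept as raw text


def associate_by_number(structures: Dict[str, str],
                        experiments: Dict[str, str]) -> List[CompoundRecord]:
    """Join structure and experimental information on the shared compound number."""
    records = []
    # Iterate over the union of numbers so that a compound missing
    # either its structure or its experimental data is still reported.
    for number in sorted(set(structures) | set(experiments)):
        records.append(
            CompoundRecord(number, structures.get(number), experiments.get(number))
        )
    return records


# Illustrative values only (hypothetical, not taken from the application).
structures = {"1": "CCO", "2": "c1ccccc1"}
experiments = {"1": "IC50 = 12 nM", "3": "IC50 = 450 nM"}
records = associate_by_number(structures, experiments)
```

Keeping records whose structure or activity is `None` makes incomplete extractions visible for later review instead of silently dropping them; a stricter variant could keep only the intersection of the two key sets.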

2. The method for processing information of a chemical entity according to claim 1, wherein in the step of acquiring a document containing a chemical entity, the document is acquired by receiving a local upload, by a web search, or by crawling the web with a crawler tool, and the format of the document includes HTML format, XML format, TXT format, Word format, or PDF format.

3. The method of claim 2, wherein the document is a pharmaceutical product description document, a pharmaceutical paper document, a pharmaceutical patent document, a clinical trial document, an audit document, or a clinical study document.

4. The method for processing information of a chemical entity according to claim 1, wherein the step of detecting the line text content, the first type of table content and the second type of table content in the document comprises: locating the line text content in the document by text features to determine the object to be identified, and locating the first type of table content and the second type of table content in the document by table features to determine the object to be identified.

5. The method according to claim 4, wherein in the step of identifying the chemical entities and numbered entities in the object to be identified, when the acquired document is detected to be a text document, a preset line text recognition model is called to identify the chemical entities and numbered entities contained in the text information of the line text content, so as to obtain the systematic names and the numbers of the chemical entities respectively, and the systematic name of each chemical entity and its corresponding number are determined according to the positional relationship between the systematic names and the numbers in the text information.

6. The method of claim 5, wherein the line text recognition model comprises a trained deep learning model and/or a regular expression model.
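As an illustration of the regular expression variant named in claim 6, the sketch below pairs each candidate name with the nearest preceding compound number, which is one possible reading of the "positional relationship" in claim 5. Both patterns are deliberately simplistic assumptions: real systematic names need far richer patterns or a trained model, and `NUMBER_RE`, `NAME_RE` and `pair_names_with_numbers` are hypothetical names, not part of the application.

```python
import re
from typing import Dict

# Assumed numbering style: "Compound 3" or "Example 12a".
NUMBER_RE = re.compile(r"\b(?:Compound|Example)\s+(\d+[a-z]?)\b")
# Toy name pattern: a whitespace-free token ending in a common suffix.
NAME_RE = re.compile(r"\S*(?:amide|amine|acid|anol|one)\b")


def pair_names_with_numbers(text: str) -> Dict[str, str]:
    """Map each compound number to the first candidate name that follows it."""
    numbers = [(m.start(), m.group(1)) for m in NUMBER_RE.finditer(text)]
    pairs: Dict[str, str] = {}
    for m in NAME_RE.finditer(text):
        # Positional relationship: the closest number located before the name.
        preceding = [num for pos, num in numbers if pos < m.start()]
        if preceding:
            # Keep only the first name found after each number.
            pairs.setdefault(preceding[-1], m.group(0))
    return pairs


# Hypothetical sample text, not taken from the application.
text = (
    "Example 1: N-methylbenzamide was obtained as a white solid. "
    "Example 2: 2-(4-chlorophenyl)acetamide was prepared analogously."
)
pairs = pair_names_with_numbers(text)
```

Documents that place the number after the name, for example "N-methylbenzamide (1)", would need the opposite search direction, so a practical rule set would probably choose the direction per document style.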

7. The method according to claim 4, wherein in the step of identifying the chemical entities and numbered entities in the object to be identified, when the acquired document is detected to be a text document, a preset table recognition model is called to identify the chemical entities and numbered entities in the text information of the first type of table content and the second type of table content, so as to obtain the structural information or the experimental information of the chemical entities and the corresponding numbers respectively, and the systematic name or the experimental information of each chemical entity and its corresponding number are determined according to the table attributes of the table content.

8. The method of claim 7, wherein the table recognition model comprises a regular expression model.
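A regular expression based table recognition step of the kind named in claim 8 could look roughly like the following sketch for a second-type (experimental) table whose cells have already been extracted as strings. It assumes that the header row acts as the "table attribute" that tells the number column from the activity column. The header keywords, the value pattern and the function name `parse_activity_table` are all assumptions made for illustration.

```python
import re
from typing import Dict, List

# Assumed header conventions for the number column and the activity column.
NUMBER_HEADER_RE = re.compile(r"compound|example", re.IGNORECASE)
ACTIVITY_HEADER_RE = re.compile(r"\b(?:IC50|EC50|Ki)\b")
# A numeric cell with an optional "<" or ">" qualifier, e.g. "12.5" or "> 1000".
VALUE_RE = re.compile(r"^\s*([<>]?)\s*(\d+(?:\.\d+)?)\s*$")


def parse_activity_table(rows: List[List[str]]) -> Dict[str, dict]:
    """Map each compound number to its parsed activity value."""
    header, *body = rows
    # Table attributes: locate the relevant columns from the header row.
    # next() raises StopIteration if a required column is missing.
    num_col = next(i for i, h in enumerate(header) if NUMBER_HEADER_RE.search(h))
    act_col = next(i for i, h in enumerate(header) if ACTIVITY_HEADER_RE.search(h))
    unit_match = re.search(r"\(([^)]+)\)", header[act_col])
    unit = unit_match.group(1) if unit_match else None

    result: Dict[str, dict] = {}
    for row in body:
        m = VALUE_RE.match(row[act_col])
        if m is None:
            continue  # skip cells such as "n.d." that hold no numeric value
        result[row[num_col].strip()] = {
            "qualifier": m.group(1) or "=",
            "value": float(m.group(2)),
            "unit": unit,
        }
    return result


# Hypothetical table content, not taken from the application.
rows = [["Compound", "IC50 (nM)"], ["1", "12.5"], ["2", "> 1000"], ["3", "n.d."]]
table = parse_activity_table(rows)
```

Reading the unit from the header rather than from each cell reflects a common table layout; tables that state units per cell would need the value pattern extended accordingly.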

9. The method for processing information of a chemical entity according to claim 4, wherein the step of detecting the line text content, the first type of table content and the second type of table content in the acquired document further comprises: dividing the acquired document into a plurality of pictures according to a preset dividing rule, performing image enhancement processing on each picture, and storing each enhanced picture as a picture to be identified.

10. The method for processing information of a chemical entity according to claim 9, wherein when the acquired document is a picture document, the method further comprises dividing the picture document into a plurality of pictures, with each page of the document as a unit of division.

11. The method for processing information of a chemical entity according to claim 10, further comprising the step of identifying a molecular structure diagram in the picture document so as to convert the molecular structure diagram into chemical structure information in a preset data format.

12. The method for processing information of a chemical entity according to claim 10, wherein the step of identifying the line text content in the picture to be identified comprises:

extracting the line text content in the picture to be identified by utilizing OCR to obtain text information;

invoking a preset line text recognition model to recognize the chemical entities and numbered entities contained in the text information of the line text content so as to obtain the systematic names and the numbers of the chemical entities respectively, and determining the systematic name or experimental information of each chemical entity and its corresponding number according to the positional relationship, in the text information, between the systematic names or experimental information of the chemical entities and the numbers.

13. The method for processing information of chemical entities according to claim 12, wherein the step of identifying the first and second types of table contents in the picture to be identified comprises:

inputting the pictures into a preset table detection model to detect table pictures, and then cropping and storing the table pictures as table pictures to be identified;

detecting a first type of table and a second type of table in the table picture to be identified so as to locate chemical entity contents and numbered entity contents in the table picture to be identified;

extracting the positional relationship between the chemical entity content and the numbered entity content in the table picture to be identified, and extracting the chemical entity content and the numbered entity content by OCR to obtain text information;

invoking a preset line text recognition model to recognize the chemical entities and numbered entities contained in the text information so as to obtain the systematic names or experimental information and the numbers of the chemical entities respectively, and determining the systematic name or experimental information of each chemical entity and its corresponding number according to the positional relationship.

14. The method for processing information of chemical entities according to claim 1, wherein the experimental information of the chemical entities includes activity values of the compounds.

15. The method for processing information of a chemical entity according to claim 1, wherein the systematic name of the chemical entity is a compound name assigned according to IUPAC nomenclature.

16. The method for processing information of a chemical entity according to claim 1, wherein the chemical structure information in the preset data format is chemical structure information in SMILES format, InChI format, MDL Molfile format, or SDF format.

17. An information processing system for a chemical entity, comprising:

an acquisition module, configured to acquire a document containing a chemical entity, wherein the document comprises a text document and/or a picture document;

a detection module, configured to detect, in the document, line text content, first-type table content containing structural information of the chemical entity, and second-type table content containing experimental information of the chemical entity, so as to determine an object to be identified, wherein the structural information of the chemical entity comprises a molecular structure diagram and/or a systematic name;

an identification module, configured to identify chemical entities and numbered entities in the object to be identified so as to obtain the structural information and the experimental information of the chemical entities, together with the numbers corresponding to the structural information and the experimental information;

a conversion module, configured to convert the structural information of the chemical entity into chemical structure information in a preset data format, to associate the chemical structure information with the corresponding experimental information through the numbers, and then to store and/or output the associated chemical structure information and experimental information.

18. A computer system, comprising:

at least one memory for storing at least one program;

at least one processor, coupled to the at least one memory, for implementing the information processing method of the chemical entity of any one of claims 1-16 when the at least one program is called and executed from the at least one memory.

19. A computer readable storage medium, comprising a stored computer program, wherein the computer program, when run by a processor of a computer, controls the computer to perform the information processing method of a chemical entity according to any one of claims 1-16.