CN112183076A - Substance name extraction method and device and storage medium - Google Patents

Publication number
CN112183076A
Authority
CN
China
Prior art keywords
text file
extracted
preset
module
name
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010892360.2A
Other languages
Chinese (zh)
Inventor
白芳
杨宇星
周杰龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Wangshi Intelligent Technology Co ltd
Original Assignee
Beijing Wangshi Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Wangshi Intelligent Technology Co ltd
Priority to CN202010892360.2A
Publication of CN112183076A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/258 Heading extraction; Automatic titling; Numbering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs

Abstract

The invention provides a method, a device, and a storage medium for extracting a substance name. The method comprises: acquiring a text file to be extracted; determining whether the text file to be extracted contains a preset identifier; and, when the text file to be extracted contains the preset identifier, extracting the target substance name from the text content corresponding to the preset identifier by using a pre-trained substance name extraction model. Because the search for substance names is confined to the text content corresponding to the preset identifier, and the preset identifier can be set in advance as needed, substance name extraction becomes more flexible. When the preset identifier is a keyword such as "embodiment" or "step", whose corresponding text content in a patent carries the detailed technical description, extraction is better targeted; and because names need not be extracted from the whole patent, extraction is also more efficient.

Description

Substance name extraction method and device and storage medium
Technical Field
The invention relates to the field of natural language processing, in particular to a method and a device for extracting a substance name and a storage medium.
Background
Patent data is important intellectual property data on the internet. Studies show that patents account for only 10 percent of all documents yet provide 90 to 95 percent of the world's new technical information. This is particularly evident in drug research and development, where a large amount of small-molecule data on potential drugs is scattered across papers, patents, and similar documents. Chemical patents are an important starting point for understanding the use, properties, and novelty of compounds. New compounds are generally first disclosed in patent documents, and it may take 1 to 3 years before they are mentioned in the chemical literature, which indicates that patents are a valuable but underutilized resource. As the number of new chemical patent applications grows rapidly each year, how to extract molecular names effectively so that the molecules can be found and used has become a primary concern in industry and academia. In the related art, substance names can only be obtained by traversing and recognizing the whole patent document; molecular names cannot be screened according to different requirements, so the flexibility of molecular name extraction is poor.
Disclosure of Invention
In view of the above, embodiments of the present invention provide a method, an apparatus, and a storage medium for extracting a substance name, so as to overcome the defect of poor flexibility in extracting a molecular name in the prior art.
According to a first aspect, an embodiment of the present invention provides a substance name extraction method, including the following steps: acquiring a text file to be extracted; determining whether the text file to be extracted contains a preset identifier; and when the text file to be extracted contains the preset identification, extracting the name of the target substance from the text content corresponding to the preset identification by using a pre-trained substance name extraction model.
Optionally, the pre-trained substance name extraction model includes a CNN convolution module, an LSTM neural network module, a linear module, and a CRF conditional random field module. When the text file to be extracted contains the preset identifier, extracting the target substance name from the text content corresponding to the preset identifier by using the substance name extraction model includes: performing convolution and pooling, by the convolution module, on the character vectors of each word in each sentence of the text content to obtain a character-level feature vector of each word; splicing the character-level feature vector of each word with a first preset word vector and a second preset word vector, and inputting the splicing result into the LSTM neural network module to obtain feature information of the sentence, where the first preset word vector is extracted from text files of a type different from that of the text file to be extracted, and the second preset word vector is extracted from historical text files of the same type as the text file to be extracted; and outputting the feature information of the sentence to the linear module for calculation, and outputting the calculation result to the CRF conditional random field module to extract the target substance name.
Optionally, there are multiple preset identifiers, and when the text file to be extracted contains the preset identifiers, extracting the target substance name from the text content corresponding to the preset identifiers by using the pre-trained substance name extraction model includes: sequentially extracting the target substance name from the text content at the target position corresponding to each identified preset identifier until all the text content has been traversed.
Optionally, acquiring the text file to be extracted includes: when the text file to be extracted is a non-editable text file, converting the non-editable text file into an editable one by using a target algorithm.
Optionally, the text file to be extracted is a patent text file, and the name of the target substance is a chemical substance name.
Optionally, the method further comprises: and converting the name of the extracted chemical substance into a target form.
According to a second aspect, an embodiment of the present invention provides a substance name extraction device, including: a text file acquisition module, configured to acquire a text file to be extracted; a preset identifier determining module, configured to determine whether the text file to be extracted contains a preset identifier; and a substance name extraction module, configured to, when the text file to be extracted contains the preset identifier, extract the target substance name from the text content corresponding to the preset identifier by using a pre-trained substance name extraction model.
Optionally, the pre-trained substance name extraction model includes a CNN convolution module, an LSTM neural network module, a linear module, and a CRF conditional random field module. The substance name extraction module includes: a character-level feature vector acquisition module, configured to perform convolution and pooling, by the convolution module, on the character vectors of each word in each sentence of the text content to obtain a character-level feature vector of each word; a sentence feature information acquisition module, configured to splice the character-level feature vector of each word with a first preset word vector and a second preset word vector, and input the splicing result into the LSTM neural network module to obtain feature information of the sentence, where the first preset word vector is extracted from text files of a type different from that of the text file to be extracted, and the second preset word vector is extracted from historical text files of the same type as the text file to be extracted; and a substance name extraction submodule, configured to output the feature information of the sentence to the linear module for calculation and output the calculation result to the CRF conditional random field module to extract the target substance name.
According to a third aspect, an embodiment of the present invention provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the steps of the substance name extraction method according to the first aspect or any one of the embodiments of the first aspect when executing the program.
According to a fourth aspect, an embodiment of the present invention provides a storage medium, on which computer instructions are stored, and the instructions, when executed by a processor, implement the steps of the substance name extraction method according to the first aspect or any one of the embodiments of the first aspect.
The technical scheme of the invention has the following advantages:
According to the substance name extraction method and device, the search for substance name information is confined to the text content corresponding to the preset identifier, and the preset identifier can be set in advance as needed, which improves the flexibility of substance name extraction. When the preset identifier is a keyword such as "embodiment" or "step", whose corresponding text content in a patent carries the detailed technical description, extraction is better targeted; and because names need not be extracted from the whole patent, extraction is also more efficient.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a flowchart of a specific example of a substance name extraction method in the embodiment of the present invention;
Fig. 2 is a diagram illustrating an exemplary method for extracting names of substances according to an embodiment of the present invention;
Fig. 3 is a schematic block diagram of a specific example of a substance name extraction device according to an embodiment of the present invention;
Fig. 4 is a schematic block diagram of a specific example of an electronic device in the embodiment of the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the description of the present invention, it should be noted that the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc., indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of description and simplicity of description, but do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the description of the present invention, it should be noted that, unless otherwise explicitly specified or limited, the terms "mounted," "coupled," and "connected" are to be construed broadly: a connection may be fixed, removable, or integral; it may be mechanical or electrical; it may be direct or indirect through an intermediate medium; it may be an internal communication between two elements; and it may be wireless or wired. Those skilled in the art can understand the specific meanings of these terms in the present invention according to the specific case.
In addition, the technical features involved in the different embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The embodiment provides a method for extracting a substance name, as shown in fig. 1, including the following steps:
S101, acquiring a text file to be extracted.
Illustratively, the text file to be extracted may be a paper, a patent, or the like. The text file to be extracted may be obtained from a database or input by a user, and the type and the obtaining mode of the text file to be extracted are not limited, and can be determined by those skilled in the art as needed.
S102, determining whether the text file to be extracted contains a preset identification.
For example, the preset identifier may be a preset keyword. When the text file to be extracted is a Chinese patent file, the preset identifier may take the form of a keyword optionally followed by a number, such as "embodiment", "embodiment 1", "embodiment 2", "compound 1", "compound 2", "step 1", or "step 2". When the text file to be extracted is an English patent file, the preset identifier may take the form Example, Example + number, Compound + number, Reference + number, or Step + number. This embodiment does not limit the specific content of the preset identifier, which those skilled in the art can determine as needed.
Whether the text file to be extracted contains the preset identifier may be determined by inputting the text file into a pre-trained rule-based marking model that can recognize the various preset identifiers. Alternatively, all words in the text file may be converted into word vectors and compared in sequence with the word vectors of the preset identifiers; when an identical word vector appears, the text file is determined to contain the preset identifier.
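The rule-based variant of step S102 can be sketched with a regular expression. This is only an illustration for the English identifier forms listed above; the pattern and the function names `find_preset_identifiers` and `contains_preset_identifier` are assumptions, not part of the described method, and a Chinese-language file would need an analogous pattern.

```python
import re

# Assumed pattern: one of the English keywords named in the text,
# optionally followed by a number ("Example", "Example 1", "Step 2", ...).
PRESET_IDENTIFIER = re.compile(
    r"\b(?:Example|Compound|Reference|Step)(?:\s+\d+)?\b"
)


def find_preset_identifiers(text):
    """Return (offset, identifier) pairs in document order."""
    return [(m.start(), m.group(0)) for m in PRESET_IDENTIFIER.finditer(text)]


def contains_preset_identifier(text):
    """Step S102: does the text contain at least one preset identifier?"""
    return PRESET_IDENTIFIER.search(text) is not None
```

Because the pattern is ordinary data, the set of identifiers can be changed without touching the rest of the pipeline, which is the flexibility the method relies on.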
S103, when the text file to be extracted contains the preset identification, extracting the name of the target substance from the text content corresponding to the preset identification by using a pre-trained substance name extraction model.
Illustratively, the target substance name may be a chemical substance name. The text content corresponding to the preset identifier may be the text content that follows the preset identifier. When the preset identifier is "embodiment" or "Example" and a character string equal to "embodiment" or "Example" is encountered, the next character string is checked to determine whether it is part of a chemical substance name. If it is, its label is set to 1 and the following character string is checked, and so on until the chemical substance name ends. If it is not, its label is set to 0 and the next character string is checked in the same way. The character strings labeled 1 are then merged, and the IUPAC name of the chemical substance is output.
The extraction of target substance names from the text content corresponding to the preset identifier may continue in a loop until the end of the whole text file to be extracted, or the loop may end after the first target substance name is extracted, in which case, when there are multiple preset identifiers, the next preset identifier is then identified. The chemical substance names may be extracted by using a pre-trained substance name extraction model; this embodiment does not limit the extraction method, which those skilled in the art can determine as needed.
Taking the case in which extraction loops until the end of the whole text file to be extracted, for a text file whose embodiment contains three substance names, the target substance name extraction result may be:
Embodiment: chemical name A; chemical name B; chemical name C; or
Example: molecular name A; molecular name B; molecular name C.
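The 0/1 labeling and merging described above can be sketched as follows. The predicate `is_chemical_token` is a hypothetical stand-in for the trained model's per-string decision, and whitespace-joined tokens are an assumption made only to keep the example small.

```python
def extract_names_after_identifier(tokens, identifiers, is_chemical_token):
    """Label each string after a preset identifier with 1 or 0, then
    merge consecutive strings labeled 1 into one substance name."""
    labels = [0] * len(tokens)
    names, current = [], []
    active = False  # becomes True once a preset identifier has been seen
    for i, tok in enumerate(tokens):
        if tok in identifiers:
            active = True
            if current:  # an identifier also ends a running name
                names.append(" ".join(current))
                current = []
            continue
        if active and is_chemical_token(tok):
            labels[i] = 1
            current.append(tok)
        elif current:  # a string labeled 0 ends the running name
            names.append(" ".join(current))
            current = []
    if current:
        names.append(" ".join(current))
    return labels, names
```

With a toy predicate that accepts "acetic", "acid", and "ethanol", the tokens of "Example acetic acid was mixed with ethanol" yield the two names "acetic acid" and "ethanol".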
In the substance name extraction method provided by this embodiment, the search for substance name information is confined to the text content corresponding to the preset identifier, and the preset identifier can be set in advance as needed, which improves the flexibility of substance name extraction. When the preset identifier is a keyword such as "embodiment" or "step", whose corresponding text content in a patent carries the detailed technical description, extraction is better targeted; and because names need not be extracted from the whole patent, extraction is also more efficient.
As an optional implementation manner of this embodiment, the pre-trained substance name extraction model includes a CNN convolution module, an LSTM neural network module, a linear module, and a CRF conditional random field module. When the text file to be extracted contains the preset identifier, extracting the target substance name from the text content corresponding to the preset identifier by using the substance name extraction model includes the following steps:
performing convolution and pooling, by the convolution module, on the character vectors of each word in each sentence of the text content to obtain a character-level feature vector of each word;
splicing the character-level feature vector of each word with a first preset word vector and a second preset word vector, and inputting the splicing result into the LSTM neural network module to obtain feature information of the sentence, where the first preset word vector is extracted from text files of a type different from that of the text file to be extracted, and the second preset word vector is extracted from historical text files of the same type as the text file to be extracted;
and outputting the feature information of the sentence to the linear module for calculation, and outputting the calculation result to the CRF conditional random field module to extract the target substance name.
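The first two steps can be illustrated with a dependency-free sketch: a one-dimensional convolution with max-pooling over a word's character vectors, followed by concatenation with the two preset word vectors. The kernel layout, the zero-padding of short words, and the function names are assumptions; a real model would learn the kernels and would normally use a deep-learning framework.

```python
def conv1d_max_pool(char_matrix, filters):
    """char_matrix: one vector per character of a word (length L, dim d).
    filters: kernels, each a list of `width` weight vectors of dim d.
    Returns one max-pooled activation per kernel."""
    dim = len(char_matrix[0])
    features = []
    for kernel in filters:
        width = len(kernel)
        # Assumed handling of short words: pad with zero vectors.
        padded = char_matrix + [[0.0] * dim] * max(0, width - len(char_matrix))
        activations = []
        for start in range(len(padded) - width + 1):
            window = padded[start:start + width]
            activations.append(sum(
                w * x
                for k_vec, c_vec in zip(kernel, window)
                for w, x in zip(k_vec, c_vec)
            ))
        features.append(max(activations))  # max-pooling over positions
    return features


def build_lstm_input(char_features, first_word_vec, second_word_vec):
    """Splice (concatenate) the three per-word vectors into one LSTM input."""
    return char_features + first_word_vec + second_word_vec
```

The pooled output has one entry per kernel regardless of word length, which is what lets words of different lengths share the same LSTM input size.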
Illustratively, the CNN convolution module takes into account the characteristics of chemical names themselves, including the letters, numbers, punctuation marks, and special characters that appear in them. The CNN convolution module is therefore trained on a library of 400,000 chemical substance names and on the frequency of occurrence of letters, numbers, punctuation marks, and special characters in thousands of plain-text data samples.
When the text file to be extracted is a patent document, it contains both patent language and natural language because of the particular nature of patent documents. A word2vec word vector model is therefore trained on a large amount of non-patent literature, such as Baidu Encyclopedia, to obtain a pre-established word vector table, and the first preset word vector is obtained by querying this table built from non-patent literature. In addition, a 300-dimensional word embedding model is constructed from hundreds of thousands of patent texts, for example from the United States patent office, the European patent office, and the world patent organization, to form a pre-established ELMo-based patent word vector table, and the second preset word vector is obtained by querying this table built from patent literature. Meaningless stop words, such as "of", "the", and "a", may be removed when training the word vector models to further improve training efficiency.
When the pre-trained substance name extraction model is used to extract the target substance name from the text content corresponding to the preset identifier, the input sentence is first converted into the corresponding word vector sequence by querying the word vector table. Then, for each word in the sentence, the character vector of each character is obtained by querying a character vector table, and these character vectors form the character vector matrix of the word. Because the character vectors are too sparse, the character vector matrix is input into the CNN convolution module for convolution and pooling to obtain the character-level feature vector of the word. The character-level feature vector of each word, the word vector queried for the word, and the ELMo vector are concatenated to obtain a word vector carrying more comprehensive information. The concatenated word vector is input into the LSTM neural network module for recognition, the output of the LSTM neural network module is passed to the linear module for calculation, and the output of the linear module is decoded into the optimal label sequence by the CRF conditional random field module to obtain the target text, namely the name of the chemical substance.
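The final decoding step, turning the linear module's per-word scores into an optimal label sequence, is commonly done with the Viterbi algorithm in linear-chain CRFs. The sketch below assumes that approach; the source does not name the decoding algorithm, and the score layout and function name are illustrative.

```python
def viterbi_decode(emissions, transitions):
    """emissions[t][j]: score of tag j at position t (linear module output).
    transitions[i][j]: score of moving from tag i to tag j.
    Returns the highest-scoring sequence of tag indices."""
    n_tags = len(emissions[0])
    score = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        step, new_score = [], []
        for j in range(n_tags):
            # Best previous tag for ending in tag j at this position.
            best_i = max(range(n_tags), key=lambda i: score[i] + transitions[i][j])
            step.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
        backpointers.append(step)
        score = new_score
    tag = max(range(n_tags), key=lambda j: score[j])
    path = [tag]
    for step in reversed(backpointers):  # follow backpointers to the start
        tag = step[tag]
        path.append(tag)
    return path[::-1]
```

With two tags (0 for "outside a name", 1 for "inside a name"), a strongly negative transition score from 0 to 1 can make the decoder prefer a consistent sequence over the per-position best tag, which is the benefit of decoding the sequence jointly.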
According to the substance name extraction method provided by the embodiment, for different text files, a plurality of word vectors are fused and spliced, the obtained word vector with more comprehensive information is used as the input of the LSTM neural network module, and the substance name extraction accuracy is improved.
As an optional implementation manner of this embodiment, there are multiple preset identifiers, and when the text file to be extracted contains the preset identifiers, extracting the target substance name from the text content corresponding to the preset identifiers by using the pre-trained substance name extraction model includes: sequentially extracting the target substance name from the text content at the target position corresponding to each identified preset identifier until all the text content has been traversed.
For example, the target position may be the position of the first complete target substance name after the preset identifier. As shown in Fig. 2, "Embodiment 1" and "Embodiment 2" are the preset identifiers, and the position of the target substance name that follows each of them is the target position. Each time a preset identifier is identified, the target substance name is extracted from the text content corresponding to it, until all the text content has been traversed. How the next preset identifier is identified and how the target substance name is extracted are described in the above embodiment and are not repeated here.
Taking the preset identifiers "Embodiment 1", "Embodiment 2", …, "Embodiment n" as an example, the target substance name extraction result may be:
Embodiment 1: chemical name A; Example 1: molecular name A;
Embodiment 2: chemical name B; Example 2: molecular name B;
Embodiment n: chemical name N; Example n: molecular name N.
In many patent documents, the name of the chemical substance to be protected is stated directly after the embodiment heading, while details such as the synthesis process and the mode of use are written in the body of the embodiment. Extracting only the first complete name after each preset identifier therefore targets the substances the patent is concerned with.
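The traversal over multiple preset identifiers can be sketched as below: the text is cut into one segment per identifier, and only the first name found in each segment is kept. The regular expression and the `extract_names` callable, which stands in for the trained extraction model, are assumptions for illustration.

```python
import re

# Assumed identifier form: "Example" followed by a number.
NUMBERED_IDENTIFIER = re.compile(r"\bExample\s+\d+\b")


def first_name_per_identifier(text, extract_names):
    """Map each preset identifier to the first substance name after it.

    extract_names(segment) is a hypothetical callable returning the
    substance names found in a text segment, in order of appearance."""
    matches = list(NUMBERED_IDENTIFIER.finditer(text))
    results = {}
    for idx, match in enumerate(matches):
        # A segment runs from this identifier to the next one (or the end).
        end = matches[idx + 1].start() if idx + 1 < len(matches) else len(text)
        names = extract_names(text[match.end():end])
        if names:
            results[match.group(0)] = names[0]
    return results
```

Stopping at the first name in each segment mirrors the "first complete target substance name" rule and avoids collecting reagents mentioned later in the synthesis description.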
As an optional implementation manner of this embodiment, acquiring the text file to be extracted includes: when the text file to be extracted is a non-editable text file, converting the non-editable text file into an editable one by using a target algorithm.
The target algorithm may be an OCR technology; this embodiment does not limit the target algorithm, which those skilled in the art can determine as needed. The acquired text file to be extracted may be in pdf, txt, rtf, or xml format. Editable text files need no processing, whereas non-editable PDFs and image content can be converted into editable text by OCR.
As an optional implementation manner of this embodiment, the text file to be extracted is a patent text file, and the target substance name is a chemical substance name.
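A minimal sketch of the editable/non-editable dispatch follows. The suffix sets, the `pdf_has_text_layer` flag, and the `read_text` and `run_ocr` callables are all assumptions; in particular, no specific OCR library is implied.

```python
from pathlib import Path

TEXT_SUFFIXES = {".txt", ".rtf", ".xml"}                      # editable as-is
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}   # assumed list


def needs_ocr(path, pdf_has_text_layer=False):
    """Decide whether a file must go through OCR before extraction."""
    suffix = Path(path).suffix.lower()
    if suffix in TEXT_SUFFIXES:
        return False
    if suffix == ".pdf":
        # A scanned PDF has no text layer and is treated as non-editable.
        return not pdf_has_text_layer
    if suffix in IMAGE_SUFFIXES:
        return True
    raise ValueError(f"unsupported file type: {suffix}")


def acquire_text(path, read_text, run_ocr, pdf_has_text_layer=False):
    """read_text and run_ocr are caller-supplied (hypothetical) callables."""
    if needs_ocr(path, pdf_has_text_layer):
        return run_ocr(path)
    return read_text(path)
```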
As an optional implementation manner of this embodiment, the method further includes: and converting the name of the extracted chemical substance into a target form.
Illustratively, the target form may be SMILES, StdInChI, StdInChIKey, CML, a chemical structure picture, or the like. The extracted chemical name may be converted into the target form by calling the open-source OPSIN package. This embodiment does not limit the specific target form or the conversion method, which those skilled in the art can determine as needed.
An embodiment of the present invention provides a substance name extraction device, as shown in fig. 3, including:
a text file obtaining module 201, configured to obtain a text file to be extracted; for details, reference is made to the above method embodiments, which are not described herein again.
A preset identifier determining module 202, configured to determine whether the text file to be extracted includes a preset identifier; for details, reference is made to the above method embodiments, which are not described herein again.
And the substance name extraction module 203 is configured to, when the text file to be extracted includes the preset identifier, extract a target substance name from text content corresponding to the preset identifier by using a pre-trained substance name extraction model. For details, reference is made to the above method embodiments, which are not described herein again.
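The three modules can be viewed as a simple composition. The sketch below is one possible arrangement, with each module supplied as a callable; the class name and field names are assumptions, and the numbers in the comments refer to modules 201 to 203 above.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SubstanceNameExtractionDevice:
    acquire_text: Callable[[str], str]             # module 201
    has_preset_identifier: Callable[[str], bool]   # module 202
    extract_names: Callable[[str], List[str]]      # module 203

    def run(self, source: str) -> List[str]:
        text = self.acquire_text(source)
        # The extraction model is only invoked when an identifier is present.
        if not self.has_preset_identifier(text):
            return []
        return self.extract_names(text)
```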
In the substance name extraction device provided by this embodiment, the search for substance name information is confined to the text content corresponding to the preset identifier, and the preset identifier can be set in advance as needed, which improves the flexibility of substance name extraction. When the preset identifier is a keyword such as "embodiment" or "step", whose corresponding text content in a patent carries the detailed technical description, extraction is better targeted; and because names need not be extracted from the whole patent, extraction is also more efficient.
As an optional implementation manner of this embodiment, the pre-trained substance name extraction model includes a CNN convolution module, an LSTM neural network module, a linear module, and a CRF conditional random field module; the substance name extraction module comprises:
a character-level feature vector acquisition module, configured to perform convolution and pooling, by the convolution module, on the character vectors of each word in each sentence of the text content to obtain a character-level feature vector of each word; for details, reference is made to the above method embodiments, which are not described herein again.
A sentence feature information acquisition module, configured to splice the character-level feature vector of each word with a first preset word vector and a second preset word vector, and input the splicing result into the LSTM neural network module to obtain feature information of the sentence, where the first preset word vector is extracted from text files of a type different from that of the text file to be extracted, and the second preset word vector is extracted from historical text files of the same type as the text file to be extracted; for details, reference is made to the above method embodiments, which are not described herein again.
And a substance name extraction submodule, configured to output the feature information of the sentence to the linear module for calculation and output the calculation result to the CRF conditional random field module to extract the target substance name. For details, reference is made to the above method embodiments, which are not described herein again.
As an optional implementation manner of this embodiment, the preset identifier is multiple, and the substance name extraction module 203 further includes:
the traversal module is used for sequentially extracting the target substance name of the text content of the target position corresponding to each identified preset identification until all the text content is traversed; for details, reference is made to the above method embodiments, which are not described herein again.
As an optional implementation of this embodiment, the text file acquisition module includes:
a file conversion module, configured to convert the text file to be extracted into editable form using a target algorithm when that file is a non-editable text file. For details, refer to the method embodiments above, which are not repeated here.
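As a rough illustration of this dispatch, the sketch below passes editable files through unchanged and hands non-editable ones to a recognizer. The `Document` class, its fields, and the `recognize` callable are hypothetical names introduced here; `recognize` stands in for the "target algorithm" (for example, an OCR engine), which this passage does not name.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Document:
    payload: bytes    # raw file content
    editable: bool    # assumed flag: True if payload is already UTF-8 text


def to_editable_text(doc: Document, recognize: Callable[[bytes], str]) -> str:
    """Return editable text, calling the recognizer only when it is needed."""
    if doc.editable:
        return doc.payload.decode("utf-8")
    # Non-editable input (e.g. a scanned page): delegate to the target algorithm.
    return recognize(doc.payload)


# Placeholder recognizer used only to make the sketch runnable.
def fake_recognizer(raw: bytes) -> str:
    return "recognized text ({} bytes)".format(len(raw))


print(to_editable_text(Document(b"Example 1: ethanol", True), fake_recognizer))
print(to_editable_text(Document(b"\x89PNG", False), fake_recognizer))
```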
As an optional implementation of this embodiment, the text file to be extracted is a patent text file, and the target substance name is a chemical substance name. For details, refer to the method embodiments above, which are not repeated here.
As an optional implementation of this embodiment, the substance name extraction device further includes a target form conversion module, configured to convert the extracted chemical substance name into a target form. For details, refer to the method embodiments above, which are not repeated here.
An embodiment of the present application further provides an electronic device which, as shown in fig. 4, includes a processor 310 and a memory 320; the processor 310 and the memory 320 may be connected by a bus or in another manner.
The processor 310 may be a central processing unit (CPU). The processor 310 may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or any combination thereof.
The memory 320, as a non-transitory computer-readable storage medium, may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the substance name extraction method in the embodiments of the present invention. By running the non-transitory software programs, instructions, and modules stored in the memory, the processor performs its various functional applications and data processing.
The memory 320 may include a program storage area and a data storage area, where the program storage area may store an operating system and an application program required for at least one function, and the data storage area may store data created by the processor, and the like. Further, the memory may include high-speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 320 may optionally include memory located remotely from the processor, and such remote memory may be connected to the processor via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The one or more modules are stored in the memory 320 and, when executed by the processor 310, perform a substance name extraction method as in the embodiment shown in fig. 1.
The details of the electronic device may be understood with reference to the corresponding related description and effects in the embodiment shown in fig. 1, and are not described herein again.
This embodiment also provides a computer storage medium storing computer-executable instructions that can execute the substance name extraction method of any of the method embodiments described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), a flash memory, a hard disk drive (HDD), a solid-state drive (SSD), or the like; the storage medium may also comprise a combination of memories of the kinds described above.
It should be understood that the above examples are given only for clarity of illustration and are not intended to limit the embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to list all embodiments exhaustively here, and obvious variations or modifications derived therefrom remain within the scope of the invention.

Claims (10)

1. A substance name extraction method is characterized by comprising the following steps:
acquiring a text file to be extracted;
determining whether the text file to be extracted contains a preset identifier;
and when the text file to be extracted contains the preset identifier, extracting the target substance name from the text content corresponding to the preset identifier by using a pre-trained substance name extraction model.
2. The method of claim 1, wherein the pre-trained substance name extraction model comprises a CNN convolution module, an LSTM neural network module, a linear module, and a CRF (conditional random field) module; and wherein, when the text file to be extracted contains the preset identifier, extracting the target substance name from the text content corresponding to the preset identifier by using the substance name extraction model comprises:
performing convolution and pooling on the character vectors of each word in each sentence of the text content by using the convolution module to obtain a character-level feature vector for each word;
concatenating the character-level feature vector of each word with a first preset word vector and a second preset word vector, and inputting the concatenated result to the LSTM neural network module to obtain feature information of the sentence, wherein the first preset word vector is extracted from a text file of a different type from the text file to be extracted, and the second preset word vector is extracted from a historical text file of the same type as the text file to be extracted;
and passing the feature information of the sentence to the linear module for calculation, and passing the calculation result to the CRF module for target substance name extraction.
3. The method according to claim 1, wherein there are multiple preset identifiers, and wherein, when the text file to be extracted contains the preset identifier, extracting the target substance name from the text content corresponding to the preset identifier by using the pre-trained substance name extraction model comprises:
performing target substance name extraction, in sequence, on the text content at the target position corresponding to each identified preset identifier, until all of the text content has been traversed.
4. The method according to claim 1, wherein acquiring the text file to be extracted comprises: when the text file to be extracted is a non-editable text file, converting the non-editable text file into editable form by using a target algorithm.
5. The method according to claim 1, wherein the text file to be extracted is a patent text file, and the target substance name is a chemical substance name.
6. The method of claim 4, further comprising: converting the extracted chemical substance name into a target form.
7. A substance name extraction device, characterized by comprising:
a text file acquisition module, configured to acquire a text file to be extracted;
a preset identifier determination module, configured to determine whether the text file to be extracted contains a preset identifier;
and a substance name extraction module, configured to extract, when the text file to be extracted contains the preset identifier, the target substance name from the text content corresponding to the preset identifier by using a pre-trained substance name extraction model.
8. The substance name extraction device according to claim 7, wherein the pre-trained substance name extraction model comprises a CNN convolution module, an LSTM neural network module, a linear module, and a CRF (conditional random field) module; and the substance name extraction module comprises:
a character-level feature vector acquisition module, configured to perform convolution and pooling on the character vectors of each word in each sentence of the text content by using the convolution module to obtain a character-level feature vector for each word;
a sentence feature information acquisition module, configured to concatenate the character-level feature vector of each word with a first preset word vector and a second preset word vector, and to input the concatenated result to the LSTM neural network module to obtain feature information of the sentence, wherein the first preset word vector is extracted from a text file of a different type from the text file to be extracted, and the second preset word vector is extracted from a historical text file of the same type as the text file to be extracted;
and a substance name extraction submodule, configured to pass the feature information of the sentence to the linear module for calculation and to pass the calculation result to the CRF module for target substance name extraction.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the substance name extraction method according to any one of claims 1 to 6 are implemented when the program is executed by the processor.
10. A storage medium having stored thereon computer instructions, which when executed by a processor, carry out the steps of the substance name extraction method of any one of claims 1 to 6.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010892360.2A CN112183076A (en) 2020-08-28 2020-08-28 Substance name extraction method and device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010892360.2A CN112183076A (en) 2020-08-28 2020-08-28 Substance name extraction method and device and storage medium

Publications (1)

Publication Number Publication Date
CN112183076A true CN112183076A (en) 2021-01-05

Family

ID=73925317

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010892360.2A Pending CN112183076A (en) 2020-08-28 2020-08-28 Substance name extraction method and device and storage medium

Country Status (1)

Country Link
CN (1) CN112183076A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102955773A (en) * 2011-08-31 2013-03-06 国际商业机器公司 Method and system for identifying chemical names in Chinese document
CN108536679A (en) * 2018-04-13 2018-09-14 腾讯科技(成都)有限公司 Name entity recognition method, device, equipment and computer readable storage medium
US20190005020A1 (en) * 2017-06-30 2019-01-03 Elsevier, Inc. Systems and methods for extracting funder information from text
CN109933795A (en) * 2019-03-19 2019-06-25 上海交通大学 Based on context-emotion term vector text emotion analysis system
CN110781299A (en) * 2019-09-18 2020-02-11 平安科技(深圳)有限公司 Asset information identification method and device, computer equipment and storage medium
CN111008523A (en) * 2019-11-21 2020-04-14 中科鼎富(北京)科技发展有限公司 Information extraction method and device and server

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
朱靖波 等: ""中文信息自动抽取"", 《东北大学学报(自然科学版)》, vol. 19, no. 01, 15 February 1998 (1998-02-15), pages 52 - 54 *
杨露菁 等: "《智能图像处理及应用》", 31 March 2019, 中国铁道出版社, pages: 272 - 276 *
邓擘 等: ""汉语信息抽取中事件的定位与分类"", 《情报理论与实践》, vol. 32, no. 10, 30 October 2009 (2009-10-30), pages 104 - 107 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination