CN111382571B - Information extraction method, system, server and storage medium - Google Patents
Information extraction method, system, server and storage medium
- Publication number: CN111382571B
- Application number: CN201911088563.XA
- Authority: CN (China)
- Prior art keywords: entities, relationship, entity, name, preset
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an information extraction method comprising the following steps: S110, extracting one or more entities from a text based on a preset entity recognition rule; S120, performing keyword association on the one or more entities based on a preset first relationship rule, and taking the keyword association result of the one or more entities as a first relationship between entities; S130, performing syntactic analysis on the text based on a preset second relationship rule, and taking the result of the syntactic analysis as a second relationship between entities; S140, generating a third relationship between entities from the union of the first relationship between entities and the second relationship between entities; and S150, storing the third relationship between entities as an information extraction result. The invention also discloses an information extraction system, a server, and a terminal-readable storage medium. Because the relationships between entities are extracted under different recognition rules and the resulting relationships are merged by union, recognition errors are reduced and efficiency is improved when the input text contains grammatical or textual errors.
Description
Technical Field
The embodiment of the invention relates to the field of natural language processing, in particular to an information extraction method, an information extraction system, a server and a storage medium.
Background
Natural language processing (NLP) is a field at the intersection of computer science, artificial intelligence, and linguistics that focuses on interactions between computers and human (natural) languages. It studies theories and methods that enable effective communication between people and computers in natural language. Its subject matter covers both natural language, that is, the spoken and written language people use in daily life, and the computer systems and methods that recognize it. In natural language processing, it is often necessary to recognize entities with specific meanings in natural language text and to recognize the relationships between those entities.
Entity extraction generally uses either regular expressions or a neural network model alone. Regular expressions are rules written by hand in advance; constructing them is time-consuming and labor-intensive, and they port poorly. Neural-network-based methods depend on a corpus and can currently recognize only person names, place names, and organization names; they are susceptible to interference, and a poor corpus leads to low accuracy. Relation extraction typically uses semi-supervised sentence classification: a classifier is trained in advance on existing labeled samples and is then used to predict the labels of unlabeled samples. This approach easily misses required key entities, which causes errors in information extraction.
Disclosure of Invention
The invention provides an information extraction method, an information extraction system, a server and a storage medium, so as to improve the accuracy of relationship identification between entities.
In a first aspect, the present invention provides an information extraction method, including:
extracting one or more entities from a text based on a preset entity recognition rule;
performing keyword association on the one or more entities based on a preset first relationship rule, and taking the keyword association result of the one or more entities as a first relationship between entities;
performing syntactic analysis on the text based on a preset second relationship rule, and taking the result of the syntactic analysis as a second relationship between entities;
generating a third relationship between entities from the union of the first relationship between entities and the second relationship between entities;
and storing the third relationship between entities as an information extraction result.
In a second aspect, the present invention also discloses an information extraction system, including:
an entity recognition module, configured to extract one or more entities from a text based on a preset entity recognition rule;
a first relationship module, configured to perform keyword association on the one or more entities based on a preset first relationship rule, and to take the keyword association result of the one or more entities as a first relationship between entities;
a second relationship module, configured to perform syntactic analysis on the text based on a preset second relationship rule, and to take the result of the syntactic analysis as a second relationship between entities;
a third relationship module, configured to generate a third relationship between entities from the union of the first relationship between entities and the second relationship between entities;
and a storage module, configured to store the third relationship between entities as an information extraction result.
In a third aspect, the present invention also discloses a server, including a memory, a processor, and a program stored in the memory and capable of running on the processor, where the processor implements the information extraction method according to any one of the above when executing the program.
In a fourth aspect, the present invention also discloses a terminal readable storage medium, on which a program is stored, the program being capable of implementing an information extraction method as described in any one of the above when executed by a processor.
The invention adopts different recognition rules to extract the relationship of the entities and calculates the union of the obtained entity relationships, thereby realizing the recognition of the entities and the entity relationships in the text information with high efficiency and low error rate.
Drawings
Fig. 1 is a flowchart of an information extraction method in the first embodiment of the present invention.
Fig. 2 is a flowchart of an alternative implementation in the first embodiment of the present invention.
Fig. 3 is a flowchart of an information extraction method in the second embodiment of the present invention.
Fig. 4 is a flowchart of an information extraction method in the third embodiment of the present invention.
Fig. 5 is a schematic diagram of a syntax tree in an example of the third embodiment of the present invention.
Fig. 6 is a schematic diagram of a syntax tree in another example of the third embodiment of the present invention.
Fig. 7 is a block diagram of an information extraction system according to a fourth embodiment of the present invention.
Fig. 8 is a block diagram of an alternative embodiment in a fourth embodiment of the present invention.
Fig. 9 is a schematic diagram of a server in a fifth embodiment of the present invention.
Description of the embodiments
The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present invention are shown in the drawings.
Before discussing exemplary embodiments in more detail, it should be mentioned that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart depicts steps as a sequential process, many of the steps may be implemented in parallel, concurrently, or with other steps. Furthermore, the order of the steps may be rearranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figures. The processes may correspond to methods, functions, procedures, subroutines, and the like.
Furthermore, the terms "first," "second," and the like, may be used herein to describe various directions, acts, steps, or elements, etc., but these directions, acts, steps, or elements are not limited by these terms. These terms are only used to distinguish one direction, action, step or element from another direction, action, step or element. For example, a first speed difference may be referred to as a second speed difference, and similarly, a second speed difference may be referred to as a first speed difference, without departing from the scope of the present application. Both the first speed difference and the second speed difference are speed differences, but they are not the same speed difference. The terms "first," "second," and the like, are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. In the description of the present invention, the meaning of "plurality" means at least two, for example, two, three, etc., unless specifically defined otherwise.
The English abbreviations and proper nouns used in the following embodiments have the following meanings:
Entity: also called a "named entity" or "proper name"; a word in a text that has a specific meaning, typically a person name, place name, organization name, or other proper noun.
Entity recognition: also called named entity recognition (NER); the task of recognizing and classifying the proper names and meaningful numerical phrases that appear in a text, generally including person names, place names, organization names, proper nouns, time expressions (dates, times), and numerical expressions (amounts, percentages, etc.). Within the overall process of language analysis, named entity recognition belongs to unknown-word recognition in lexical analysis. It is essentially a pattern recognition task: given a sentence, identify the boundaries and the types of the entities in it. It is an important and fundamental task in natural language processing.
Relationship between entities: also called an entity relationship; in natural language recognition, a logically associated semantic relationship that holds between entities. A relationship involving two entities is a binary relationship, and one involving more than two entities is a multi-way relationship. Relationships may be symmetric or asymmetric. The order of the entities does not matter in a symmetric relationship, whereas it must be considered in an asymmetric relationship, where different orders of the entities express different relationships.
BiLSTM+CRF model: BiLSTM (bidirectional LSTM) and CRF (conditional random field) are two layers of a named entity recognition model that solve the sequence labeling problem in natural language recognition. LSTM (long short-term memory) is a type of recurrent neural network (RNN) designed mainly to address vanishing and exploding gradients when training on long sequences. LSTM-based systems can learn tasks such as language translation, robot control, image analysis, document summarization, speech recognition, image recognition, handwriting recognition, chatbot control, prediction of illness, click-through rates, and stock prices, and music synthesis.
Syntax tree: in computer science, an abstract syntax tree (AST), or simply syntax tree, is a tree representation of the abstract syntactic structure of source code. Each node of the tree represents a construct in the source code. The syntax is "abstract" in that it does not represent every detail that appears in the real syntax. For these three simple languages, the syntax tree is simply traversed under different rules: the code of the three looks very different, but the tree structure used is identical. Syntactic analysis is one of the key underlying technologies in natural language processing (NLP); its basic task is to determine the syntactic structure of a sentence or the dependencies between the words in a sentence. It is divided into syntactic structure analysis and dependency analysis. Analysis that aims to obtain the syntactic structure, or complete phrase structure, of a whole sentence is called constituent structure analysis or phrase structure analysis; analysis that aims to obtain the dependencies between words is called dependency analysis.
Jieba word segmentation: in natural language processing, sentences often need to be split into words so that their features can be analyzed more effectively; this process is called word segmentation. The jieba segmenter supports three segmentation modes: accurate mode, which tries to cut the sentence as precisely as possible and is suited to text analysis; full mode, which scans out every string in the sentence that can form a word and is very fast but cannot resolve ambiguity; and search engine mode, which further splits long words on top of accurate mode to improve recall and is suited to search engine word segmentation.
The foregoing terms are used merely to describe the general meaning of the terms mentioned in the embodiments, and in the embodiments, the terms may be further limited or expanded within the general scope of the terms so that they can be adapted to the technical solutions described in the embodiments.
Embodiment 1
Fig. 1 is a flowchart of an information extraction method according to the first embodiment of the present invention, which is applicable to extracting information from various kinds of text. This embodiment and the following embodiments take the extraction of entities and entity relationships from a judgment document as an example. The method specifically comprises the following steps:
S110, extracting one or more entities from the text based on a preset entity recognition rule;
In this step, taking a judgment document as an example, the entities to be extracted are one or more of: time, person name, industry, organization name, job title, reason, and amount.
After the text is obtained, the recognition system traverses the whole text using the preset entity recognition rule and extracts the entities from it. The preset entity recognition rules in this step include, but are not limited to: keyword matching against a surname library, an industry classification library, an organization name lexicon, and a job title lexicon; and/or rules built with regular expressions; and/or a BiLSTM+CRF model. In practice, a different rule can be selected for each entity, which improves the accuracy of entity recognition.
In this step, a preset entity recognition rule is a rule used when matching text, including but not limited to regular expressions and neural network models. The preset entity recognition rule may comprise one or more matching rules, and the rules and their number can be adjusted to suit the number of entities to be extracted and the required recognition precision.
In an alternative implementation, before the entities are extracted, the text may also be split by sentence, paragraph, and/or chapter, or located by keyword. These ways of splitting the text yield feature sentences that contain the entities to be recognized, which improves recognition efficiency. For example, when the input text is a judgment document, as shown in Fig. 2, S110 can be implemented as follows:
S111, acquiring the text and extracting from it the feature sentences that contain a time;
S112, extracting one or more entities from the time-containing feature sentences based on the preset entity recognition rule.
In this implementation, the paragraphs of a judgment document in which person names, place names, organization names, and similar information are concentrated usually begin with time information. For example, sentence patterns such as the following often appear in a judgment document: "In year XXXX, month X, day XX, XXX, XX, and others from XX village were detained as suspects by the XX department of XX. Their family members entrusted XXX to find someone to handle the matter and gave XXX X ten thousand yuan in cash." Therefore, when recognizing the text of a judgment document, the sentences that contain a time are first split out as feature sentences, and entity recognition and extraction are then performed on those feature sentences, which improves efficiency and shortens entity extraction time.
In natural language recognition, a regular expression is a logical formula for operating on character strings: a rule string is composed from predefined special characters and their combinations to express the matching logic for the text to be recognized. Matching the rule string against the text identifies the words or characters that satisfy that logic.
S120, performing keyword association on the one or more entities based on a preset first relationship rule, and taking the keyword association result of the one or more entities as a first relationship between entities.
Different extracted entities are associated with one another. For example, an organization name and a job title can be concatenated, and the defendant's name, an involved person's name, and an amount are associated with one another. The purpose of extracting entities and entity relationships is to select the entities relevant to a case and generate structured data, such as a table, that presents the case intuitively, so that court archiving and adjudication become more efficient. The relationship between entities in this step may be any form of association that allows the case to be reconstructed. Preferably, the first, second, and third relationships between entities mentioned below all refer to the industry, organization name, job title, reason, and/or amount that correspond to a person name at a given time.
In this step, the preset first relationship rule is keyword matching with regular expressions. For example, starting from the one or more entities obtained in step S112, regular expressions are used to associate the corresponding entities: a regular pattern of the form "person serves as position" extracts the association between the defendant's name and the corresponding job title, and a pattern of the form "person in position" extracts the association between an involved person's name and the corresponding job title.
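A keyword association of this kind might look like the sketch below. It assumes the entities have already been extracted as plain strings and that the connective between a name and a job title is the Chinese verb "担任" ("serves as"); the connective, the function name `associate_name_with_job`, and the sample sentence are illustrative assumptions, not taken from the patent.

```python
import re

def associate_name_with_job(sentence, names, job_titles, connective="担任"):
    """Return (name, job_title) pairs matching '<name><connective><job_title>'."""
    if not names or not job_titles:
        return []
    # Longest alternatives first so a longer entity wins over its own prefix.
    name_alt = "|".join(map(re.escape, sorted(names, key=len, reverse=True)))
    job_alt = "|".join(map(re.escape, sorted(job_titles, key=len, reverse=True)))
    pattern = re.compile(
        rf"(?P<name>{name_alt}){re.escape(connective)}(?P<job>{job_alt})"
    )
    return [(m.group("name"), m.group("job")) for m in pattern.finditer(sentence)]

# Hypothetical entities and sentence, for illustration only.
pairs = associate_name_with_job(
    "王某担任局长期间，收受李某现金。", ["王某", "李某"], ["局长", "院长"]
)
print(pairs)
```

Building the pattern from the extracted entities keeps the rule small: only strings already recognized as names or job titles can be linked.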
S130, performing syntactic analysis on the feature sentences based on a preset second relationship rule, and taking the result of the syntactic analysis as a second relationship between entities;
In this step, syntactic analysis means analyzing the structure of the text: the complex natural language text is described with a syntax tree or another semi-structured representation, and the entities obtained in step S112 are keyword-matched against that semi-structured text to obtain the entities under the second relationship rule, which are some or all of the entities from step S112.
The second relationship between entities is then obtained from the entities under the second relationship rule and their structural relationships in the semi-structured text. By using syntactic analysis, this step expresses the content of a sentence clearly as semi-structured text and then obtains the relationships between entities in combination with regular expression matching, which improves both how well entities are matched to one another and the accuracy of information extraction.
S140, generating a third relationship between entities from the union of the first relationship between entities and the second relationship between entities;
When the input text contains textual errors, matching keywords with regular expressions alone may fail to find the relationships between entities; when the input text contains grammatical errors, syntactic analysis may fail to find them. Therefore, to improve the recognition rate of entity relationships, this step uses hybrid relationship extraction: it obtains both the first relationship between entities, based on regular expressions, and the second relationship between entities, based on syntactic analysis, and takes the union of the two as the result.
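The union in S140 can be illustrated with ordinary set operations. The sketch below assumes each relationship is represented as a hashable tuple `(person, relation_type, value)`; that representation and the sample values are assumptions made for illustration.

```python
# Assumption: each relationship is a hashable tuple (person, relation_type, value).
first_relations = {
    ("王某", "job_title", "局长"),
    ("王某", "amount", "2万元"),
}
second_relations = {
    ("王某", "job_title", "局长"),
    ("王某", "organization", "公安局"),
}

def merge_relations(first, second):
    """Union of the regex-based and syntax-based relationship sets."""
    return set(first) | set(second)

third_relations = merge_relations(first_relations, second_relations)
print(sorted(third_relations))
```

A relationship found by both rules appears once in the result, and a relationship found by only one rule is still kept.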
S150, storing the third relationship between entities as an information extraction result.
The information extraction result is output in a format suitable for storage, such as a text data file or a table. The method can extract and store information from multiple texts over a period of time. Extracting a judgment document can produce a table listing the industry, organization name, job title, reason, and/or amount corresponding to each person name, and an administrator can adjust which entities are collected as needed. Such a table presents the case information intuitively and makes cases easier to archive, organize, and adjudicate.
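One possible way to store the result as a table is a CSV file with one row per person name. The sketch below writes to an in-memory buffer; the column names in `FIELDS`, the function `relations_to_csv`, and the sample row are hypothetical choices, not specified by the source.

```python
import csv
import io

# Hypothetical column set mirroring the entity types named in the text.
FIELDS = ["time", "name", "industry", "organization", "job_title", "reason", "amount"]

def relations_to_csv(rows):
    """Serialize per-person records to CSV; missing entities become empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

table = relations_to_csv([
    {"time": "2015-03-12", "name": "Wang Mou",
     "job_title": "bureau director", "amount": "20000"},
])
print(table)
```

Leaving unrecognized entities as empty cells keeps the table rectangular even when only some entity types are found for a person.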
In this embodiment, the relationships between entities are obtained in two ways at once and the two results are merged, which avoids relationships being missed because of textual or grammatical errors in the text and improves recognition efficiency and accuracy.
Embodiment 2
As shown in Fig. 3, this embodiment builds on the above embodiment and refines the step of extracting one or more entities from the text based on a preset entity recognition rule, so that each entity is recognized with a rule specific to it, which improves accuracy.
The specific steps are as follows:
S210, acquiring the time, amount, and/or reason using preset regular expressions; and/or
acquiring the person name based on a preset regular expression, a surname library, and/or a BiLSTM+CRF model; and/or
acquiring the industry by keyword matching against a preset industry classification library; and/or
acquiring the organization name based on a BiLSTM+CRF model; and/or
acquiring the job title by keyword matching against a preset job title lexicon.
The entities include, but are not limited to, one or more of: time, person name, industry, organization name, job title, reason, and amount. Depending on user requirements, other content such as motive, field, and manner can be added as entities to be extracted.
In a judgment document the time format is the standard "X year, X month, X day", so in this step "year", "month", and "day" are used as keywords and matched with a regular expression to obtain the time entity;
Person name entities usually comprise the defendant's name and the names of involved persons. In a judgment document, a keyword such as "defendant" or "accused" usually appears immediately before the first occurrence of the defendant's name, so that first occurrence can be obtained by matching the keyword with a regular expression. Because later occurrences may omit the keyword, the extracted defendant name is stored in temporary storage (usually a cache); where a keyword such as "defendant" or "accused" does not appear in the text, the name held in temporary storage is matched with a regular expression instead. The names of involved persons are usually anonymized with the character "Mou" (meaning "a certain person", as in "Wang Mou"), and are generally followed on the right by a verb such as "gave". A regular expression therefore narrows the search for involved persons' names to text segments that contain the character "Mou", and the surname library is used to extract the character before "Mou" as the involved person's surname. If the character "Mou" does not appear in the text, the names of involved persons have not been anonymized, and the person name entities in the text are recognized with the BiLSTM+CRF model.
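The surname-library lookup for anonymized names can be sketched as follows. It assumes the anonymizing character is "某" (the character transliterated as "Mou" in "Wang Mou") and uses a tiny, hypothetical surname library; the check for a following verb and the BiLSTM+CRF fallback described in the text are not implemented here.

```python
import re

# Hypothetical, tiny surname library; a real one would be far larger.
SURNAMES = {"王", "李", "张", "欧阳"}

def extract_anonymized_names(text, surnames=SURNAMES):
    """Find anonymized names '<surname>某' or '<surname>某某' using the surname library."""
    # Longer (compound) surnames are tried first, e.g. "欧阳" before any single character.
    alternation = "|".join(
        re.escape(s) for s in sorted(surnames, key=len, reverse=True)
    )
    pattern = re.compile(rf"(?:{alternation})某{{1,2}}")
    found = []
    for match in pattern.finditer(text):
        if match.group(0) not in found:  # keep first occurrence, preserve order
            found.append(match.group(0))
    return found

# Made-up sample sentence for illustration only.
names = extract_anonymized_names("欧阳某送给王某现金2万元，王某被某部门留置。")
print(names)
```

Requiring a library surname before "某" filters out phrases such as "某部门" ("a certain department"), where the character does not follow a surname.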
The industry classification library stores all the industry categories that may appear in a judgment document. In this embodiment, the industries are preferably divided into 22 major categories, each containing several subcategories, following the classification standard set out in the current industry definitions and industry evolution rules of "wealth china"; all subcategories are stored in the industry classification library. Based on this library, any subcategory that appears in the text can be taken directly as an acquired industry entity.
Existing BiLSTM+CRF models achieve a recognition rate above 87% for organization names and place names, so an existing BiLSTM+CRF model trained with organization name labels can be used directly to extract the organization names in the text.
In this step, a job title is a noun such as "president", "bureau director", or "senior lecturer". Because job titles have a simple structure and fixed content, this embodiment builds a job title lexicon and obtains job title entities by keyword matching.
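Lexicon-based keyword matching, as used for industries and job titles, can be sketched with a longest-match-first scan. The lexicon contents, the function name `match_keywords`, and the sample sentence below are hypothetical; the longest-match preference is a design choice of this sketch rather than something the source specifies.

```python
# Hypothetical job title lexicon for illustration.
JOB_TITLES = ["局长", "副局长", "院长", "高级讲师"]

def match_keywords(text, lexicon):
    """Scan text left to right, preferring the longest lexicon entry at each position."""
    ordered = sorted(lexicon, key=len, reverse=True)
    hits = []
    i = 0
    while i < len(text):
        for word in ordered:
            if text.startswith(word, i):
                hits.append((i, word))
                i += len(word)  # skip past the matched entry
                break
        else:
            i += 1  # no entry starts here; advance one character
    return hits

hits = match_keywords("公安局副局长王某与法院院长李某", JOB_TITLES)
print(hits)
```

Preferring the longest entry means "副局长" (deputy director) is reported as one job title instead of being cut down to "局长".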
S220, performing keyword association on the one or more entities based on a preset first relationship rule, and taking the keyword association result of the one or more entities as a first relationship between entities;
S230, performing syntactic analysis on the text based on a preset second relationship rule, and taking the result of the syntactic analysis as a second relationship between entities;
S240, generating a third relationship between entities from the union of the first relationship between entities and the second relationship between entities;
S250, storing the third relationship between entities as an information extraction result.
In an alternative implementation, after the person names are acquired based on the preset surname library and the BiLSTM+CRF model, the method further comprises:
comparing the acquired person names and determining whether one name contains another;
if so, taking the contained name as the acquired person name.
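The containment check can be sketched as below: any extracted string that strictly contains another extracted name is dropped, and the contained (shorter) name is kept. The function name `resolve_contained_names` and the sample inputs are illustrative.

```python
def resolve_contained_names(names):
    """Drop names that strictly contain another extracted name; keep the contained one."""
    unique = list(dict.fromkeys(names))  # de-duplicate while preserving order
    return [
        name
        for name in unique
        if not any(other != name and other in name for other in unique)
    ]

# "王某送" is a hypothetical over-long extraction that swallowed the verb after the name.
resolved = resolve_contained_names(["王某", "王某送", "李某", "王某"])
print(resolved)
```

In the sample, the over-long candidate is discarded because it contains the shorter name "王某".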
In this embodiment, all information extraction runs as a pipeline: the entities are obtained first, and the relationships between entities are then extracted from those entities, so an error in entity extraction can cause errors in the relationships between entities. Further checking and processing of the person name entities reduces the accumulation of errors in subsequent steps.
Building on the above embodiment, this embodiment refines the step of extracting one or more entities from the text based on a preset entity recognition rule, so that each entity is recognized with a rule specific to it, which improves the accuracy of entity recognition.
Example 3
As shown in fig. 4, based on the first embodiment, the present embodiment refines "performing syntactic analysis on the text based on the preset second relationship rule" in step S130 by constructing a syntax tree, specifically as follows:
S310, extracting one or more entities in a text based on a preset entity identification rule;
S320, carrying out keyword association on the one or more entities based on a preset first relationship rule, and taking the keyword association result of the one or more entities as a relationship between the first entities.
S330, carrying out syntactic analysis on the text to obtain a core verb and one or more arguments, wherein the one or more arguments correspond to those of the one or more entities that have a logical connection relationship with the core verb.
In this embodiment, the text is decomposed by constructing a syntax tree. In this step, syntactic analysis first generates a simplified "predicate + argument" structure, where an argument is a word in the text with a nominal part of speech. The predicate is generally a verb; the core verb serves as the core of the original text's logic, and the arguments form semantic relationships with it, so this structure preserves the logical information of the original text.
S340, using the core verb as a root node, and using the argument as a child node to generate a syntax tree.
The syntax tree, also called a syntactic dependency tree, consists of a root node and child nodes. The argument at each child node has a semantic association with the core verb at the root node, and the arguments at the child nodes on the same branch are associated with one another.
For example, suppose a sentence in a unit's document is "Public security bureau director Wang Mou issues a notice". As shown in fig. 5, the resulting syntax tree takes "issue" as the core verb. One of its child nodes is the person name "Wang Mou", whose child nodes are the job name "director" and the organization name "public security bureau", so a triplet of person name + organization name + job name can be constructed. "Director" modifies "Wang Mou" and "public security bureau" modifies "director", so the job name, the organization name and the person name at the child nodes on this branch are logically associated, which avoids association errors when person names, organization names and/or job names also occur at other child nodes.
Illustratively, a sentence in a judgment reads: "The interviewee Li Mou used the hospital's computer system information, which he was responsible for managing, to provide the medical representatives Wu Mou A, Lan Mou and Wu Mou B with usage data for the medicines they sold in the hospital as agents." Four person names occur in this text, three of which are associated with the job title "medical representative". As shown in fig. 6, a syntax tree is constructed with "use" as the core verb. It can be seen directly that "Wu Mou A", "Lan Mou", "Wu Mou B" and "medical representative" are on the same branch, which indicates that these three name entities are associated with the job title, whereas the other name, "Li Mou", is not on that branch and has no association with the job title.
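The branch-based association in the two examples above can be sketched with a small tree structure. This is an illustrative sketch only: the tree is written out by hand rather than produced by a parser, and the names `Node`, `subtree_words` and `branches` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    word: str
    children: list = field(default_factory=list)


def subtree_words(node):
    """All words in the subtree rooted at node, in pre-order."""
    words = [node.word]
    for child in node.children:
        words.extend(subtree_words(child))
    return words


def branches(root):
    """One word list per direct child of the core-verb root node."""
    return [subtree_words(child) for child in root.children]


# Hand-built tree for "Public security bureau director Wang Mou issues a notice".
tree = Node("issue", [
    Node("Wang Mou", [Node("director"), Node("public security bureau")]),
    Node("notice"),
])
groups = branches(tree)
```

Entities that fall in the same list are candidates for association, while an entity in another list, such as a name on a different branch, is kept apart.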
S350, using the one or more entities as keywords, performing keyword matching on the argument in the syntax tree, and using the argument corresponding to the entity as an entity acquired under a second relation rule.
The one or more entities acquired in step S310 are used as matching keywords, and the parts of the arguments that correspond to those entities are extracted as the entities acquired under the second relationship rule. It should be noted that the entities acquired under the second relationship rule are contained in the set of entities acquired in step S310.
S360, acquiring the association relation between the entities acquired under the second relation rule based on the syntax tree.
S370, when the association relationship is a preset grammar relationship, recording the association relationship between the entities acquired under the second relationship rule as the relationship between the second entities.
The preset grammar relationships comprise: parallel (coordinate) relationships, additional (adjunct) relationships, attributive-head relationships, and/or subject-object relationships. When one of these grammar relationships appears, it indicates that the corresponding entities are semantically and/or logically related; the corresponding entities are obtained, and the association relationship between them is recorded as the relationship between the second entities.
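Steps S350 to S370 can be sketched as a filter over labeled edges of the syntax tree. This is a sketch under assumptions: the edge format `(head, label, dependent)`, the function name `second_relations` and the label strings are placeholders standing in for the four preset grammar relationship types, not the labels of any particular parser.

```python
# Placeholder labels for the preset grammar relationships (hypothetical names).
PRESET_RELATIONS = {"parallel", "additional", "attributive", "subject-object"}


def second_relations(edges, entities, allowed=PRESET_RELATIONS):
    """Keep edges whose label is preset and whose two ends are recognised entities."""
    entity_set = set(entities)
    return [
        (head, label, dep)
        for head, label, dep in edges
        if label in allowed and head in entity_set and dep in entity_set
    ]


edges = [
    ("Wang Mou", "attributive", "director"),
    ("director", "attributive", "public security bureau"),
    ("issue", "object", "notice"),  # label not preset, ends not entities
]
entities = ["Wang Mou", "director", "public security bureau"]
relations = second_relations(edges, entities)
```

Requiring both ends of an edge to be in the entity set mirrors the note above that the entities acquired under the second relationship rule are contained in the set acquired in step S310.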
S380, generating a third entity relationship according to the union of the first entity relationship and the second entity relationship;
Preferably, the relationship between the first entities, the relationship between the second entities and the relationship between the third entities mentioned herein each refer to the industry, organization name, job name, reason and/or amount corresponding to the name at the given time.
S390, the relation among the third entities is stored as an information extraction result.
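Steps S380 and S390 can be sketched as a set union followed by serialization. This is a minimal sketch: representing each relationship as a `(name, relation, value)` tuple and storing the result as JSON are assumptions made for the example, and `merge_relations` and `to_json` are hypothetical names.

```python
import json


def merge_relations(first, second):
    """Union of two relationship sets; a relationship found by both rules appears once."""
    return set(first) | set(second)


def to_json(relations):
    """Serialize relationships in a stable (sorted) order."""
    rows = [{"name": n, "relation": r, "value": v} for n, r, v in sorted(relations)]
    return json.dumps(rows, ensure_ascii=False)


first = {("Wang Mou", "job name", "director")}
second = {
    ("Wang Mou", "job name", "director"),
    ("Wang Mou", "organization name", "public security bureau"),
}
third = merge_relations(first, second)
stored = to_json(third)
```

Taking the union means a relationship missed by one rule is still kept if the other rule finds it.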
In this embodiment, the association relationship between the entities is obtained by constructing the syntax tree and using the syntactic associations it exhibits, so that the relationships between the entities are identified more accurately.
Example 4
As shown in fig. 7, the present embodiment provides an information extraction system, which can execute the information extraction method provided in the above embodiment of the present invention, and has the corresponding functional modules and beneficial effects of the execution method.
The system specifically comprises:
an entity recognition module 410, configured to extract one or more entities in the text based on a preset entity recognition rule;
a first relationship module 420, configured to perform keyword association on the one or more entities based on a preset first relationship rule, and take a keyword association result of the one or more entities as a relationship between the first entities;
a second relationship module 430, configured to perform a syntactic analysis on the text based on a preset second relationship rule, and use a result of the syntactic analysis as a relationship between second entities;
a third relationship module 440, configured to generate a relationship between third entities according to the union of the relationship between the first entities and the relationship between the second entities;
and a storage module 450, configured to store the relationship between the third entities as an information extraction result.
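The way the five modules hand data to one another can be sketched as follows. This is only a wiring sketch: the class name `InformationExtractionSystem` and the callables passed in are hypothetical stand-ins for modules 410 to 450, whose internals are not shown.

```python
class InformationExtractionSystem:
    """Chains stand-ins for modules 410 to 450 in the order described above."""

    def __init__(self, recognise, first_rule, second_rule, store):
        self.recognise = recognise      # stand-in for entity recognition module 410
        self.first_rule = first_rule    # stand-in for first relationship module 420
        self.second_rule = second_rule  # stand-in for second relationship module 430
        self.store = store              # stand-in for storage module 450

    def run(self, text):
        entities = self.recognise(text)
        first = self.first_rule(text, entities)
        second = self.second_rule(text, entities)
        third = set(first) | set(second)  # role of third relationship module 440
        self.store(third)
        return third


saved = []
system = InformationExtractionSystem(
    recognise=lambda text: ["Wang Mou", "director"],
    first_rule=lambda text, ents: [("Wang Mou", "job name", "director")],
    second_rule=lambda text, ents: [("Wang Mou", "job name", "director")],
    store=saved.append,
)
result = system.run("Public security bureau director Wang Mou issues a notice")
```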
As shown in fig. 8, in an alternative embodiment, the entity identification module 410 further includes:
a first identifying unit 411 for acquiring time, amount and/or reason using a preset regular expression, and/or
A second identifying unit 412, configured to obtain the name based on a preset regular expression, surname library and/or BiLSTM+CRF model, and/or
A third identifying unit 413 for acquiring the industry using keyword matching based on a preset industry classification library, and/or
A fourth identifying unit 414, configured to obtain the organization name based on the BiLSTM+CRF model, and/or
And a fifth identifying unit 415 configured to obtain the job name using keyword matching based on a job name word stock set in advance.
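The regular-expression extraction performed by the first identifying unit 411 can be sketched as follows. The two patterns are assumptions chosen for illustration; they cover only a few date and amount formats and are not the preset expressions of the invention.

```python
import re

# Hypothetical patterns: dates such as "2018年3月" or "2019-11-08",
# amounts such as "5000 yuan" or "3.5万元".
DATE_RE = re.compile(r"\d{4}年\d{1,2}月(?:\d{1,2}日)?|\d{4}-\d{2}-\d{2}")
AMOUNT_RE = re.compile(r"\d+(?:\.\d+)?\s*(?:万元|元|yuan)")


def extract_time_and_amount(text):
    """Return every time and amount expression found in the text."""
    return {"time": DATE_RE.findall(text), "amount": AMOUNT_RE.findall(text)}


sample = "On 2019-11-08 Li Mou accepted 5000 yuan; 2018年3月 he accepted 3.5万元."
found = extract_time_and_amount(sample)
```

Non-capturing groups are used so that `findall` returns the whole matched expression rather than a fragment.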
As shown in fig. 8, in another alternative embodiment, the entity identification module 410 further includes:
a name comparing unit 416, configured to compare the names after the second identifying unit 412 identifies the name entities, and determine whether any two names stand in a containing/contained relationship; if yes, the contained name is taken as the acquired name.
As shown in fig. 8, in another alternative embodiment, the second relationship module 430 further includes:
a syntax analysis unit 431, configured to perform syntactic analysis on the text, and obtain a core verb and one or more arguments, where the one or more arguments correspond to those of the one or more entities that have a logical connection relationship with the core verb;
a syntax tree generating unit 432, configured to generate a syntax tree by using the core verb as a root node and the argument as a child node;
a second entity obtaining unit 433, configured to perform keyword matching on the argument in the syntax tree by using the one or more entities as keywords, and use the argument corresponding to the entity as an entity obtained under a second relationship rule;
an entity relationship acquiring unit 434, configured to acquire an association relationship between the entities acquired under the second relationship rule based on the syntax tree, and, when the association relationship is a preset grammar relationship, to record the association relationship between the entities acquired under the second relationship rule as the relationship between the second entities.
The information extraction system provided by the fourth embodiment of the invention can execute the information extraction method provided by any embodiment of the invention, and has the functional modules and beneficial effects corresponding to the executed method.
Example 5
Fig. 9 is a schematic structural diagram of an information extraction apparatus according to a fifth embodiment of the present invention. As shown in fig. 9, the apparatus includes a processor 501, a memory 502, an input device 503, and an output device 504. The number of processors 501 in the apparatus may be one or more, and fig. 9 takes one processor 501 as an example. The processor 501, the memory 502, the input device 503 and the output device 504 in the apparatus may be connected by a bus or by other means; fig. 9 takes connection by a bus as an example.
The memory 502 is used as a computer readable storage medium for storing a software program, a computer executable program, and modules, such as modules corresponding to an information extraction method in the first embodiment of the present invention (e.g., the entity identification module 410, the first relationship module 420, the second relationship module 430, etc. in the fourth embodiment). The processor 501 executes various functional applications of the device and data processing by running software programs, instructions and modules stored in the memory 502, i.e., implements one of the information extraction methods described above.
Example 6
A sixth embodiment provides a storage medium containing computer-executable instructions, which when executed by a computer processor, are for performing an information extraction method comprising:
extracting one or more entities in the text based on a preset entity identification rule;
carrying out keyword association on the one or more entities based on a preset first relationship rule, and taking the keyword association result of the one or more entities as a relationship between the first entities;
carrying out syntactic analysis on the text based on a preset second relationship rule, and taking the syntactic analysis result as a relationship between second entities;
generating a third entity relationship according to the union of the first entity relationship and the second entity relationship;
and storing the relation among the third entities as an information extraction result.
Of course, the storage medium containing the computer executable instructions provided in the embodiments of the present invention is not limited to the method operations described above, and may also perform the related operations of the information extraction method provided in any embodiment of the present invention.
From the above description of embodiments, it will be clear to a person skilled in the art that the present invention may be implemented by means of software and necessary general purpose hardware, but of course also by means of hardware, although in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, etc., and include several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments of the present invention.
It should be noted that, in the embodiment of the information extraction system, the modules included are divided only according to functional logic; the division is not limited to the one described above, so long as the corresponding functions can be implemented. In addition, the specific names of the functional modules are only for distinguishing them from each other, and are not used to limit the protection scope of the present invention.
Note that the above is only a preferred embodiment of the present invention and the technical principle applied. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, while the invention has been described in connection with the above embodiments, the invention is not limited to the above embodiments, but may include many other equivalent embodiments without departing from the spirit of the invention, the scope of which is determined by the scope of the appended claims.
Claims (7)
1. An information extraction method, comprising:
extracting one or more entities in the text based on a preset entity identification rule;
performing keyword association on the one or more entities based on a preset first relationship rule, and taking the keyword association result of the one or more entities as a relationship between the first entities;
carrying out syntactic analysis on the text based on a preset second relation rule, and taking the syntactic analysis result as a relation between second entities;
generating a third entity relationship according to the union of the first entity relationship and the second entity relationship;
the relation among the third entities is used as an information extraction result to be stored;
wherein the extracting one or more entities in the text based on the preset entity recognition rule comprises:
acquiring time, amount and/or reason using a preset regular expression, and/or
Acquiring the name based on a preset regular expression, surname library and/or BiLSTM+CRF model, and/or
Acquiring the industry using keyword matching based on a preset industry classification library, and/or
Obtaining the organization name based on the BiLSTM+CRF model, and/or
Acquiring the name of the job by using keyword matching based on a preset job name word library;
after the name is obtained based on the preset regular expression, surname library and/or BiLSTM+CRF model, the method further comprises the following steps:
comparing the names, and determining whether any two names stand in a containing/contained relationship;
if yes, taking the contained name as the acquired name;
the syntactic analysis of the text based on the preset second relation rule comprises the following steps:
carrying out syntactic analysis on the text to obtain a core verb and one or more arguments, wherein the one or more arguments correspond to those of the one or more entities that have a logical connection relationship with the core verb;
the core verb is taken as a root node, the argument is taken as a child node, and a syntax tree is generated;
performing keyword matching on the argument in the syntax tree by taking the one or more entities as keywords, and taking the argument corresponding to the entity as an entity acquired under a second relation rule;
based on the syntax tree, acquiring an association relationship between the entities acquired under the second relationship rule;
when the association relationship is a preset grammar relationship, recording the association relationship between the entities acquired under the second relationship rule of the preset grammar relationship as a second entity relationship.
2. The information extraction method according to claim 1, wherein the entity comprises:
one or more of time, name, industry, organization name, job name, reason, and/or amount.
3. The information extraction method according to claim 2, wherein the first entity-to-entity relationship, the second entity-to-entity relationship, and the third entity-to-entity relationship include:
the industry, the organization name, the job name, the reason and/or the amount corresponding to the name at the given time.
4. The information extraction method according to claim 2, wherein the preset entity identification rule includes:
keyword matching is carried out by using a surname library, an industry classification library, an organization name word library and a position name word library; and/or
Constructing rules by using regular expressions; and/or
The BiLSTM+CRF model is used for identification.
5. An information extraction system, comprising:
the entity recognition module is used for extracting one or more entities in the text based on a preset entity recognition rule;
the first relation module is used for carrying out keyword association on the one or more entities based on a preset first relation rule, and taking the keyword association result of the one or more entities as a relation between the first entities;
the second relation module is used for carrying out syntactic analysis on the text based on a preset second relation rule, and taking a result of the syntactic analysis as a relation between second entities;
a third relationship module, configured to generate a third relationship between entities according to a union of the relationship between the first entities and the relationship between the second entities;
the storage module is used for storing the relation among the third entities as an information extraction result;
wherein, the entity identification module further comprises:
a first identification unit for acquiring time, amount and/or reason by using preset regular expression, and/or
A second recognition unit for acquiring the name based on a preset regular expression, surname library and/or BiLSTM+CRF model, and/or
A third recognition unit, configured to obtain the industry using keyword matching based on a preset industry classification library, and/or
A fourth identification unit for acquiring the organization name based on the BiLSTM+CRF model, and/or
A fifth recognition unit, configured to obtain the job name using keyword matching based on a preset job name word library;
wherein, the entity identification module further comprises:
a name comparing unit, configured to compare the names after the second identifying unit identifies the name entities, and determine whether any two names stand in a containing/contained relationship; if yes, taking the contained name as the acquired name;
wherein the second relationship module further comprises:
a syntactic analysis unit, configured to perform syntactic analysis on the text, and obtain a core verb and one or more arguments, where the one or more arguments correspond to those of the one or more entities that have a logical connection relationship with the core verb;
a syntax tree generating unit, configured to generate a syntax tree by using the core verb as a root node and the argument as a child node;
the second entity obtaining unit is used for carrying out keyword matching on the argument in the syntax tree by taking the one or more entities as keywords, and taking the argument corresponding to the entity as an entity obtained under a second relation rule;
an entity relation acquisition unit, which acquires the association relation between the entities acquired under the second relation rule based on the syntax tree; when the association relationship is a preset grammar relationship, recording the association relationship between the entities acquired under the second relationship rule of the preset grammar relationship as a second entity relationship.
6. A server comprising a memory, a processor and a program stored on the memory and executable on the processor, wherein the processor implements the information extraction method according to any one of claims 1-4 when executing the program.
7. A terminal-readable storage medium, on which a program is stored, characterized in that the program, when executed by a processor, is capable of implementing the information extraction method according to any one of claims 1-4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911088563.XA CN111382571B (en) | 2019-11-08 | 2019-11-08 | Information extraction method, system, server and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111382571A (en) | 2020-07-07 |
CN111382571B (en) | 2023-06-06 |
Family
ID=71218524
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911088563.XA Active CN111382571B (en) | 2019-11-08 | 2019-11-08 | Information extraction method, system, server and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111382571B (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111897883B (en) * | 2020-07-15 | 2023-09-05 | 中国工商银行股份有限公司 | Entity model construction method, device, electronic equipment and medium |
CN112015876A (en) * | 2020-08-27 | 2020-12-01 | 北京智通云联科技有限公司 | Time analysis method and device, electronic equipment and storage medium |
CN112612907A (en) * | 2021-01-04 | 2021-04-06 | 上海明略人工智能(集团)有限公司 | Knowledge graph generation method and device, electronic equipment and computer readable medium |
CN112784574B (en) * | 2021-02-02 | 2023-09-15 | 网易(杭州)网络有限公司 | Text segmentation method and device, electronic equipment and medium |
CN114490756A (en) * | 2022-01-12 | 2022-05-13 | 中广核工程有限公司 | Generation method and device of association checking model, computer equipment and storage medium |
CN114997398B (en) * | 2022-03-09 | 2023-05-26 | 哈尔滨工业大学 | Knowledge base fusion method based on relation extraction |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108763195A (en) * | 2018-05-02 | 2018-11-06 | 武汉烽火普天信息技术有限公司 | A kind of non-limiting type relation excavation method based on interdependent syntax and pattern rules |
CN109241538A (en) * | 2018-09-26 | 2019-01-18 | 上海德拓信息技术股份有限公司 | Based on the interdependent Chinese entity relation extraction method of keyword and verb |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7689411B2 (en) * | 2005-07-01 | 2010-03-30 | Xerox Corporation | Concept matching |
Also Published As
Publication number | Publication date |
---|---|
CN111382571A (en) | 2020-07-07 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||