CN112507108A - Knowledge extraction method and system based on json rule file and rule analysis engine - Google Patents

Info

Publication number
CN112507108A
Authority
CN
China
Prior art keywords
entity
rule
json
original text
matching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011341041.9A
Other languages
Chinese (zh)
Inventor
刘伟利
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Mininglamp Software System Co ltd
Original Assignee
Beijing Mininglamp Software System Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Mininglamp Software System Co ltd
Priority to CN202011341041.9A
Publication of CN112507108A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to a knowledge extraction method, a knowledge extraction system and a rule analysis engine based on a json rule file. The method comprises the following steps: a json rule file compiling step of compiling the json rule file according to the entity rule and the relation rule; a named entity extraction step of traversing the original text and the text obtained by processing the original text according to the json rule file, and outputting a named entity list; a relation extraction step of receiving the named entity list, traversing the named entity list and the original text according to the json rule file, and outputting an entity relation; and a knowledge integration step of obtaining structured data comprising the named entities and the entity relations according to the named entity list and the entity relations. By parsing the rule file in json format with the rule analysis engine, the application addresses the problem that knowledge extraction currently cannot be carried out quickly and accurately, and quickly extracts the named entities, entity relations and entity attributes in unstructured text data.

Description

Knowledge extraction method and system based on json rule file and rule analysis engine
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a knowledge extraction method, system and rule parsing engine based on json rule files.
Background
In the era of big data, a large amount of valuable fragmentary knowledge is contained in unstructured data. When artificial intelligence technology is applied in industry, this fragmentary knowledge needs to be extracted quickly and accurately to form a knowledge graph, on which data-driven decision analysis is then based. Knowledge extraction includes named entity recognition (NER), entity relationship extraction (RE), and attribute extraction.
However, in the related art, the irregularity of unstructured data and the differences in data between industries make knowledge extraction difficult. Especially in application scenarios that lack a large amount of manually labeled data, extracting knowledge quickly and accurately is the cornerstone of downstream tasks.
At present, no effective solution has been proposed for the technical problem in the related art that knowledge extraction cannot be carried out quickly and accurately.
Disclosure of Invention
The embodiments of the application provide a knowledge extraction method and system based on json rule files. In an application scenario lacking a large amount of data, rule files in json format are parsed by a rule analysis engine to complete knowledge extraction from unstructured text data, at least solving the problem in the related art that knowledge extraction cannot be carried out quickly and accurately.
In a first aspect, an embodiment of the present application provides a knowledge extraction method based on json rule files, including the following steps:
a json rule file compiling step, namely compiling the json rule file according to the entity rule and the relation rule;
a named entity extraction step, namely traversing the original text and the text obtained by processing the original text according to the json rule file, and outputting a named entity list;
a relation extraction step, namely traversing the named entity list and the original text according to the json rule file and outputting an entity relation;
and a knowledge integration step, namely obtaining structured data comprising the named entities and the entity relations according to the named entity list and the entity relations.
In some of these embodiments, the json rule file includes the entity rule and the relationship rule, wherein:
the entity rule comprises an entity name, an entity Chinese name, a word segmentation label and a regular expression;
the relationship rule comprises a relationship name, a relationship Chinese name, a subject name, an object name and a rule matching sequence.
In some embodiments, the named entity extracting step specifically includes:
a preliminary processing step of segmenting the original text with a word segmenter, outputting the word segments and their word segmentation labels, and performing part-of-speech tagging on the word segments;
and a traversal step, namely detecting the entities in the original text according to the word segmentation labels and the regular expression, and outputting the named entity list.
In some embodiments, the traversing step specifically includes:
when the word segmentation label is empty, detecting an entity in the original text according to the regular expression, and outputting an entity result;
and when the word segmentation label is not empty, checking the regular expression: when the regular expression is empty, the matching word segment itself is taken as the entity; when the regular expression is not empty, the regular expression is matched against the text following the word segment, and an entity result is output.
In some embodiments, the elements of the rule matching sequence include, but are not limited to, a subject element, an object element, and a keyword string, and the relationship extracting step specifically includes the following steps:
a subject-object matching step, wherein a subject and an object in the named entity list are detected according to the subject element and the object element in the rule matching sequence, and the matching of the subject and the object is completed;
and an entity relationship matching step, namely matching the original text with the elements and outputting the entity relationship.
In some embodiments, the entity relationship matching step specifically includes the following steps:
selecting and matching the original text according to the elements, and acquiring the corresponding termination positions of the subject elements, the object elements or the keyword character strings in the original text;
receiving and judging a matching result according to the termination position;
and when the matching is successful, recursively proceeding to the subsequent elements until the whole rule matching sequence has been traversed.
In a second aspect, an embodiment of the present application provides a knowledge extraction system based on a json rule file, including:
the json rule file compiling module is used for compiling a json rule file according to the entity rule and the relation rule;
the named entity extraction module is used for traversing the original text and the text obtained by processing the original text according to the json rule file and outputting a named entity list;
the relation extraction module is used for traversing the named entity list and the original text according to the json rule file and outputting an entity relation;
and the knowledge integration module is used for obtaining structured data comprising the named entities and the entity relations according to the named entity list and the entity relations.
In some of these embodiments, the entity rule comprises an entity name, an entity Chinese name, a word segmentation label and a regular expression; the relationship rule comprises a relationship name, a relationship Chinese name, a subject name, an object name and a rule matching sequence; and the named entity extraction module comprises:
a preliminary processing unit, which segments the original text with a word segmenter, outputs the word segments and their word segmentation labels, and performs part-of-speech tagging on the word segments;
and a traversal unit, which detects the entities in the original text according to the word segmentation labels and the regular expression and outputs the named entity list.
In some embodiments, the traversal unit checks the word segmentation label:
when the word segmentation label is empty, detecting an entity in the original text according to the regular expression, and outputting an entity result;
and when the word segmentation label is not empty, checking the regular expression: when the regular expression is empty, the matching word segment itself is taken as the entity; when the regular expression is not empty, the regular expression is matched against the text following the word segment, and an entity result is output.
In some embodiments, the elements of the rule matching sequence include, but are not limited to, a subject element, an object element, and a keyword string, and the relationship extraction module specifically includes:
the subject-object matching unit is used for detecting the subject and the object in the named entity list according to the subject element and the object element in the rule matching sequence to complete the matching of the subject and the object;
and the entity relationship matching unit is used for matching the original text with the elements and outputting the entity relationship.
In a third aspect, an embodiment of the present application provides a rule parsing engine for implementing the knowledge extraction method based on a json rule file of the first aspect, the engine including:
a word segmentation annotator, which segments the original text with a word segmenter, outputs the word segments and their word segmentation labels, and performs part-of-speech tagging on the word segments;
a named entity resolver, which receives the word segments, the word segmentation labels and the part-of-speech tags, and outputs a named entity list based on the original text and a json rule file with entity rules stored in the named entity resolver;
and a relationship extraction rule parser, which receives the named entity list and outputs entity relationships based on the original text and a json rule file with relationship rules stored in the relationship extraction rule parser.
Compared with the related art, the knowledge extraction method and system based on json rule files provided by the embodiments of the application parse the rule file in json format with the rule analysis engine to complete knowledge extraction from unstructured text data. This solves the problem that knowledge extraction currently cannot be performed quickly and accurately, and quickly extracts the named entities, entity relations and entity attributes in unstructured text data.
The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below to provide a more thorough understanding of the application.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a flowchart of a knowledge extraction method based on json rule files according to an embodiment of the present application;
FIG. 2 is a flowchart of the named entity extraction step according to an embodiment of the present application;
FIG. 3 is a flowchart of the relationship extraction step according to an embodiment of the present application;
FIG. 4 is a block diagram of a knowledge extraction system based on json rule files according to an embodiment of the present application;
FIG. 5 is a block diagram of a rule parsing engine according to an embodiment of the present application;
FIG. 6 is a flowchart of a practical application according to an embodiment of the present application.
Description of the reference numerals:
1. a json rule file compiling module; 2. a named entity extraction module;
3. a relationship extraction module; 4. a knowledge integration module;
21. a preliminary processing unit; 22. a traversal unit;
31. a subject-object matching unit; 32. an entity relationship matching unit;
5. a word segmentation annotator; 6. a named entity resolver;
7. a relationship extraction rule parser.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments provided in the present application without any inventive step are within the scope of protection of the present application.
It is obvious that the drawings in the following description are only examples or embodiments of the present application, and that it is also possible for a person skilled in the art to apply the present application to other similar contexts on the basis of these drawings without inventive effort. Moreover, it should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of ordinary skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments without conflict.
Unless defined otherwise, technical or scientific terms used herein have the ordinary meaning understood by those of ordinary skill in the art to which this application belongs. The words "a," "an," "the," and similar words in this application do not limit number and may refer to the singular or the plural. The terms "including," "comprising," "having," and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to the listed steps or elements, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. The words "connected," "coupled," and the like in this application are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as used herein means two or more. "And/or" describes an association between associated objects and covers three cases; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. The terms "first," "second," "third," and the like herein merely distinguish similar objects and do not denote a particular ordering of the objects.
JSON files are files used to store simple data structures and objects that can be exchanged in web applications.
JSON stands for JavaScript Object Notation and is a way of storing information in an organized, easily accessible form. It gives us a human-readable collection of data that can be accessed in a logical way. JSON files can store simple data structures and objects, and JSON is supported by many different programming APIs. JSON is now used for data exchange in many web applications; these usually do not save ".json" files on disk, because the data is exchanged directly between Internet-connected computers. Some applications do allow the user to save the data in a ".json" file.
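As a small illustration of the data exchange described above, the following sketch serializes a simple object to a JSON string and parses it back using Python's standard json module. The record and its field values are hypothetical and are not taken from the application.

```python
import json

# Hypothetical record, used only to show a JSON round trip.
record = {"label": "Name", "pos": "nr,PER", "reg": "", "enabled": True}

# Serialize to a JSON string, as would be exchanged between programs.
encoded = json.dumps(record)

# Parse the string back into a Python dictionary.
decoded = json.loads(encoded)
```

The same string could equally be written to a ".json" file and read back later.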
The rule engine evolved from the inference engine. It is a component embedded in an application that separates business decisions from application code and allows those decisions to be written using predefined semantic modules. It accepts data input, interprets business rules, and makes business decisions according to those rules.
The VisualRules rule engine obtains the compiled rsc file corresponding to a rule package according to the name of the rule package. The rsc file is then loaded into memory, generating a rule package execution context. The rule engine then passes the supplied parameters into the rule package execution context and starts executing the rule package. After execution finishes, the data in the rule package execution context is returned to the application that called the rule package. The whole execution principle is very simple, which ensures the stability and performance of the rule execution platform to the greatest extent.
A regular expression (often abbreviated in code as regex, regexp or RE) is a concept in computer science. Regular expressions are typically used to retrieve or replace text that conforms to a certain pattern (rule).
Many programming languages support string operations using regular expressions. For example, a powerful regular expression engine is built into Perl. The concept of regular expressions was originally popularized by Unix tools such as sed and grep. Regular expressions are often abbreviated as "regex" or "regexp" in the singular, and "regexps", "regexes" or "regexen" in the plural.
Given a regular expression and another string, the following objectives can be achieved:
1. determining whether the given string conforms to the filtering logic of the regular expression (referred to as "matching");
2. obtaining the specific part that we want from the string by means of the regular expression.
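The two objectives can be illustrated with Python's standard re module. This is only a minimal sketch; the pattern and the sample string are hypothetical and chosen for illustration.

```python
import re

# Hypothetical pattern and sample text, used only for illustration.
pattern = re.compile(r"(\d{4})-(\d{2})-(\d{2})")
text = "Order 2021-03-15 has shipped."

# Objective 1: does the string contain text that conforms to the pattern?
found = pattern.search(text)
is_match = found is not None

# Objective 2: obtain the specific parts we want from the string.
year, month, day = found.groups() if found else (None, None, None)
```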
The embodiment provides a knowledge extraction method based on json rule files. Fig. 1 is a flowchart of a knowledge extraction method based on json rule files according to an embodiment of the present application, and as shown in fig. 1, the flowchart includes the following steps:
a json rule file compiling step S1, compiling json rule files according to the entity rules and the relationship rules;
a named entity extraction step S2, traversing the original text and the text obtained by processing the original text according to the json rule file, and outputting a named entity list;
a relation extraction step S3, receiving the named entity list, traversing the named entity list and the original text according to the json rule file, and outputting an entity relation;
and a knowledge integration step S4, obtaining structured data including the named entities and the entity relations according to the named entity list and the entity relations.
In practical applications, the structured data includes named entities, relationships, and attributes, where: the named entity comprises entity content, an entity category and a starting and ending position; the relationship comprises a relationship category, subject content, a subject category and a subject starting and ending position; the attributes comprise an attribute category, entity attribute content, an entity attribute category and an entity attribute starting and ending position.
Through the above steps, the json rule file is compiled according to the entity rule and the relation rule, so that the original text can be matched against the entity rule and the relation rule in the json rule file to obtain the named entities and entity relations; structured data about the named entities and entity relations is then output, completing the knowledge extraction.
In some of these embodiments, the json rule file includes entity rules and relationship rules, wherein:
the entity rule comprises an entity name, an entity Chinese name, a word segmentation label and a regular expression;
the relationship rule includes a relationship name, a relationship Chinese name, a subject name, an object name, and a rule matching sequence.
In practical application, an entity rule written for the json rule file includes label, label_cn, pos and reg, where: label denotes the entity name, for example Name, Sex, etc.; label_cn denotes the Chinese name of the entity, for example name, gender, etc.; pos denotes the label that the entity receives in word segmentation; reg denotes a regular expression used for entity identification.
It should be noted that pos and reg can be used simultaneously, with pos matched first. When multiple pos values are needed to identify an entity, they are simply separated by commas, such as 'nr, PER'.
A relation rule written for the json rule file includes label, label_cn, sub, obj and match, where: label denotes the relationship name, such as person_sex_relationship; label_cn denotes the Chinese name of the relationship, such as person gender; sub denotes the label_bs of the subject, such as Name; obj denotes the label_bs of the object, such as Sex; match denotes a rule matching sequence, such as [0, -1, 20, -2, 0].
It should be noted that match is written according to the following specification:
(1) each match is written in the form of a list;
(2) list[0] and list[-1] must be 0;
(3) the subject is identified by -1 and the object is identified by -2;
(4) keywords are identified in the form of character strings;
(5) numbers other than 0, -1 and -2 may be adjusted according to the specific data, and represent the maximum separation distance (in words) between the fields on either side of the number;
(6) each match may represent one language pattern.
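To make the rule format concrete, the following sketch shows what a json rule file with one relation rule and two entity rules might look like, together with a check of the match specification. The top-level keys "entities" and "relations", the function name validate_match, and the requirement that elements and distances strictly alternate with exactly one subject and one object are assumptions made for illustration; the application only fixes the per-rule fields and items (1) to (5) above.

```python
import json

# Hypothetical rule file; the "entities"/"relations" layout is assumed.
RULES_JSON = """
{
  "entities": [
    {"label": "Name", "label_cn": "name", "pos": "nr, PER", "reg": ""},
    {"label": "Sex", "label_cn": "gender", "pos": "", "reg": "male|female"}
  ],
  "relations": [
    {"label": "person_sex_relationship", "label_cn": "person gender",
     "sub": "Name", "obj": "Sex", "match": [0, -1, 20, -2, 0]}
  ]
}
"""

def validate_match(match):
    """Check a match sequence against the writing specification (sketch)."""
    # (1) a list; assumed shape: 0, element, distance, element, ..., 0
    if not isinstance(match, list) or len(match) < 5 or len(match) % 2 == 0:
        return False
    # (2) the first and last items must be 0
    if match[0] != 0 or match[-1] != 0:
        return False
    elements = match[1::2]   # subject (-1), object (-2) or keyword strings
    gaps = match[2:-1:2]     # maximum separation distances
    # (3) assumed: exactly one subject marker and one object marker
    if elements.count(-1) != 1 or elements.count(-2) != 1:
        return False
    # (4) every other element must be a non-empty keyword string
    for element in elements:
        if element not in (-1, -2) and not (isinstance(element, str) and element):
            return False
    # (5) distances are positive integers
    return all(isinstance(gap, int) and gap > 0 for gap in gaps)

rules = json.loads(RULES_JSON)
relation_ok = validate_match(rules["relations"][0]["match"])
```

Validating rules when the file is loaded, rather than during matching, keeps malformed sequences from producing silent mismatches later.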
Fig. 2 is a flowchart of the named entity extraction step according to an embodiment of the present application. As shown in fig. 2, in some embodiments, the named entity extraction step S2 specifically includes:
a preliminary processing step S21 of segmenting the original text with a word segmenter, outputting the word segments and word segmentation labels, and performing part-of-speech tagging on the word segments;
and a traversal step S22 of detecting entities in the original text according to the word segmentation labels and the regular expressions, and outputting a named entity list.
In practical application, the original text is subjected to preliminary word segmentation and part-of-speech tagging through a word segmentation device, and the processing result and the original text are input into a named entity parser together, wherein an entity rule is stored in the named entity parser.
Fig. 3 is a flowchart of the relationship extraction step according to an embodiment of the present application. Returning to the named entity extraction step, in some embodiments, the traversing step S22 specifically includes:
when the word segmentation label is empty, detecting an entity in the original text according to the regular expression, and outputting an entity result;
and when the word segmentation label is not empty, checking the regular expression: when the regular expression is empty, the matching word segment itself is taken as an entity; when the regular expression is not empty, the regular expression is matched against the text following the word segment, and an entity result is output.
In practice, the start position of each named entity in the original text is returned along with the entity itself.
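The traversal step can be sketched as follows. This is an illustrative reading rather than the implementation of the application: the Token and Entity types and the function name extract_entities are hypothetical, the word segmenter output is written by hand instead of being produced by a real segmenter, and "the text after the word segmentation" is interpreted as the part of the original text that follows the matching word segment.

```python
import re
from typing import NamedTuple

class Token(NamedTuple):
    word: str   # word segment
    tag: str    # word segmentation (part-of-speech) label
    start: int  # start offset in the original text

class Entity(NamedTuple):
    label: str
    content: str
    start: int
    end: int

def extract_entities(text, tokens, entity_rules):
    found = []
    for rule in entity_rules:
        tags = {t.strip() for t in rule.get("pos", "").split(",") if t.strip()}
        reg = rule.get("reg", "")
        if not tags:
            # Label is empty: detect entities in the original text by regex.
            if reg:
                for m in re.finditer(reg, text):
                    found.append(Entity(rule["label"], m.group(), m.start(), m.end()))
            continue
        for tok in tokens:
            if tok.tag not in tags:
                continue
            tok_end = tok.start + len(tok.word)
            if not reg:
                # Label set, regex empty: the word segment is the entity.
                found.append(Entity(rule["label"], tok.word, tok.start, tok_end))
            else:
                # Label and regex set: search the text after the word segment.
                m = re.compile(reg).search(text, tok_end)
                if m:
                    found.append(Entity(rule["label"], m.group(), m.start(), m.end()))
    # Remove duplicates and order by position in the original text.
    return sorted(set(found), key=lambda e: (e.start, e.label))

# Hand-written sample input (hypothetical segmenter output).
text = "Zhang San is male"
tokens = [Token("Zhang San", "nr", 0), Token("is", "v", 10), Token("male", "n", 13)]
entity_rules = [
    {"label": "Name", "pos": "nr, PER", "reg": ""},
    {"label": "Sex", "pos": "", "reg": "male|female"},
]
entities = extract_entities(text, tokens, entity_rules)
```

Each returned entity carries its start and end offsets, which the relation extraction step needs later.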
In some embodiments, the elements of the rule matching sequence include, but are not limited to, a subject element, an object element, and a keyword string, and the relationship extracting step S3 specifically includes the following steps:
a subject-object matching step S31, detecting a subject and an object in the named entity list according to the subject element and the object element in the rule matching sequence, and completing matching of the subject and the object;
and an entity relationship matching step S32, matching the original text with the elements, and outputting the entity relationship.
It should be noted that a keyword string is a string summarized from the original text and stored in the match list. Keyword strings may differ between applications, and the other elements of the rule matching sequence may likewise be changed according to the application scenario.
In some embodiments, the entity relationship matching step specifically includes the following steps:
selecting and matching an original text according to the elements, and acquiring the corresponding termination positions of the subject elements, the object elements or the keyword character strings in the original text;
receiving and judging a matching result according to the termination position;
and when the matching is successful, recursively proceeding to the subsequent elements until the whole rule matching sequence has been traversed.
In practical applications, the original text and the parsed named entity list are used as the input of a relationship parser, in which the relationship rules are stored.
First, the subject (marked -1) and the object (marked -2) in the match list of the relationship rule are detected;
whether the subject and the object exist in the named entity list is then checked: if either of them does not exist in the named entity list, rule matching fails and false is returned;
if the subject and the object are the same, rule matching fails and false is returned;
the elements of the match list in the relationship rule are then matched in the original text.
The matching progress in the original text is marked by text_start, whose initial value is 0; keyword strings are matched in the original text starting from position text_start;
the position of the element under examination in the match list is marked by key, whose initial value is 2 and which is increased by 2 at each step of the recursion;
matching starts from the key-th element of the match list (the first and last elements are 0 by default): if the element equals -1 (subject) or -2 (object), the end position end of that entity is recorded; if the element is a character string, the string is used as a regular expression and matched in the original text, and if the match succeeds, the end position end of the match in the original text is recorded; otherwise, rule matching fails and false is returned;
if end is smaller than text_start, or end is greater than text_start plus the value of the previous element in match (which represents the spacing between entities or keywords), rule matching fails and false is returned;
key is increased by 2, text_start is set to end, and the whole sequence is traversed recursively.
If false has not been returned in the above steps and the value of the next element in the match list is 0, all rules in the match list have been matched, true is returned, and the matching succeeds.
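The recursive matching described above can be sketched in Python as follows. Several points are assumptions made for illustration and are not stated in the application: the function name match_relation is hypothetical; entities are passed as (content, start, end) tuples; key is a 0-based index, so it starts at 1 rather than 2; distances are measured in character offsets rather than words; and the leading 0 is treated as imposing no distance limit on the first element.

```python
import re

def match_relation(text, match, subject, obj):
    """Return True if `match` is satisfied by `subject` and `obj` in `text`.

    `subject` and `obj` are (content, start, end) tuples taken from the
    named entity list, or None if the entity was not found.
    """
    # Either entity missing, or subject and object identical: rule fails.
    if subject is None or obj is None or subject == obj:
        return False

    def step(key, text_start):
        element = match[key]
        if element == -1:
            end = subject[2]          # end position of the subject entity
        elif element == -2:
            end = obj[2]              # end position of the object entity
        else:
            # Keyword string: use it as a regular expression from text_start.
            found = re.compile(element).search(text, text_start)
            if found is None:
                return False
            end = found.end()
        gap = match[key - 1]          # allowed distance before this element
        if end < text_start:
            return False              # element lies before the current progress
        if gap > 0 and end > text_start + gap:
            return False              # element lies too far away
        if match[key + 1] == 0:
            return True               # terminating 0 reached: all rules matched
        return step(key + 2, end)

    return step(1, 0)

text = "Zhang San is male"
subject = ("Zhang San", 0, 9)
obj = ("male", 13, 17)
matched = match_relation(text, [0, -1, 20, -2, 0], subject, obj)
```

Because text_start only moves forward, the order of elements in match fixes the order in which subject, keywords and object must appear in the text, which is why each match can stand for one language pattern.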
It should be noted that the steps illustrated in the above-described flow diagrams or in the flow diagrams of the figures may be performed in a computer system, such as a set of computer-executable instructions, and that, although a logical order is illustrated in the flow diagrams, in some cases, the steps illustrated or described may be performed in an order different than here.
This embodiment also provides a knowledge extraction system based on a json rule file. The system is used to implement the foregoing embodiments and preferred embodiments, and what has already been described is not repeated. As used below, the terms "module," "unit," "subunit," and the like may refer to a combination of software and/or hardware that implements a predetermined function. While the system described in the embodiments below is preferably implemented in software, implementations in hardware, or in a combination of software and hardware, are also possible and contemplated.
Fig. 4 is a block diagram of a knowledge extraction system based on json rule file according to an embodiment of the present application, and as shown in fig. 4, the system includes:
a json rule file compiling module 1 for compiling json rule files according to the entity rules and the relation rules;
the named entity extraction module 2 is used for traversing the original text and the text processed by the original text according to the json rule file and outputting a named entity list;
the relation extraction module 3 receives the named entity list, traverses the named entity list and the original text according to the json rule file and outputs the entity relation;
and the knowledge integration module 4 is used for obtaining structural data comprising the named entities and the entity relations according to the named entity list and the entity relations.
In some embodiments, the entity rules include an entity name, an entity Chinese name, a word segmentation label, and a regular expression; the relationship rule comprises a relationship name, a relationship Chinese name, a subject name, an object name and a rule matching sequence, and the named entity extraction module 2 comprises:
the initial processing unit 21 is used for performing word segmentation by using a word segmentation device according to the original text, outputting word segmentation and word segmentation labels, and performing part-of-speech tagging on the words;
and the traversal unit 22 detects the entities in the original text according to the word segmentation labels and the regular expressions, and outputs a named entity list.
In some embodiments, the traversal unit 22 checks the word segmentation label:
when the word segmentation label is empty, detecting an entity in the original text according to the regular expression, and outputting an entity result;
and when the word segmentation label is not empty, checking the regular expression: when the regular expression is empty, the corresponding segmented word is taken as an entity; when the regular expression is not empty, the regular expression is detected in the text after word segmentation, and an entity result is output.
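The branching of the traversal unit can be illustrated with a short Python sketch. The function name `extract_entities`, the token layout `(word, pos_tag, start, end)` and the sample data are hypothetical; the keys `pos` and `reg` follow the names used in the worked example of this description. How a non-empty regular expression is combined with a non-empty label is not fully specified, so the sketch assumes the expression is applied to each segmented word carrying the label.

```python
import re


def extract_entities(rule, text, tokens):
    """Apply one entity rule; tokens are (word, pos_tag, start, end)."""
    pos, reg = rule.get("pos"), rule.get("reg")
    found = []
    if not pos:
        # Empty label: run the regular expression on the original text.
        if reg:
            for m in re.finditer(reg, text):
                found.append((m.group(), m.start(), m.end(), rule["name"]))
        return found
    for word, tag, start, end in tokens:
        if tag != pos:
            continue
        # Empty regex: the tagged word itself is the entity.
        # Non-empty regex (assumption): the tagged word must also match it.
        if not reg or re.search(reg, word):
            found.append((word, start, end, rule["name"]))
    return found


text = "Tom, Han, age 20, unmarried"
tokens = [("Tom", "nr", 0, 3), ("Han", "nz", 5, 8),
          ("age", "n", 10, 13), ("20", "m", 14, 16)]
names = extract_entities({"name": "Name", "pos": "nr", "reg": ""}, text, tokens)
ages = extract_entities({"name": "Age", "pos": "", "reg": r"\d+"}, text, tokens)
```

The first call takes the label-only branch and the second the regex-only branch, mirroring the Name and Age rules of the worked example.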
In some embodiments, the elements of the rule matching sequence include, but are not limited to, a subject element, an object element, and a keyword string, and the relationship extraction module 3 specifically includes:
the subject-object matching unit 31 detects the subject and the object in the named entity list according to the subject element and the object element in the rule matching sequence, and completes the matching of the subject and the object;
and an entity relationship matching unit 32 for matching the original text with the elements and outputting the entity relationship.
In practical application, the entity relationship matching unit 32 matches the elements against the original text one by one and obtains the termination position of the subject element, the object element or the keyword string in the original text; judges the matching result according to the termination position; and, when the matching succeeds, recursively proceeds to the subsequent elements until the whole rule matching sequence has been traversed.
The above modules may be functional modules or program modules, and may be implemented by software or hardware. For a module implemented by hardware, the modules may be located in the same processor; or the modules can be respectively positioned in different processors in any combination.
This embodiment also provides a rule parsing engine, which is used to implement the foregoing embodiments and preferred embodiments; what has already been described is not repeated. Fig. 5 is a block diagram of a rule parsing engine according to an embodiment of the present application, and as shown in fig. 5, the rule parsing engine includes:
the word segmentation annotator 5 is used for segmenting the original text with a word segmenter, outputting the segmented words and the word segmentation labels, and performing part-of-speech tagging on the segmented words;
the named entity resolver 6 is used for receiving the segmented words, the word segmentation labels and the part-of-speech tags and, in combination with the original text, outputting a named entity list based on the json rule file with entity rules stored in the named entity resolver;
and the relation extraction rule parser 7 receives the named entity list and, in combination with the original text, outputs the entity relation based on the json rule file with relation rules stored in the relation extraction rule parser.
Fig. 6 is a flow chart of practical application proposed according to an embodiment of the present application, and as shown in fig. 6, the flow chart of practical application includes the following steps:
assuming that there is an unstructured text sequence text_0:
"Zhong Ming, Han nationality, age 20 years old, unmarried, she performed excellently, did research during university, and entered Peking University to study."
A json rule file is first written.
[Figure: the json rule file]
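The content of the rule file figure is not available in this text. Purely as an illustration, a rule file with the fields listed for entity rules and relationship rules might look like the sketch below. The keys `pos`, `reg` and `match` are the names used in the walkthrough of this example; every other key name (`entity`, `relation`, `name`, `cn_name`, `subject`, `object`), the Chinese display names and the `validate` helper are assumptions made for this sketch.

```python
import json

# Hypothetical rule file; only "pos", "reg" and "match" are names that
# appear in the description, the remaining keys are assumed.
RULES_JSON = r"""
{
  "entity": [
    {"name": "Name", "cn_name": "姓名", "pos": "nr", "reg": ""},
    {"name": "Age",  "cn_name": "年龄", "pos": "",   "reg": "\\d+"}
  ],
  "relation": [
    {"name": "people age", "cn_name": "人物年龄",
     "subject": "Name", "object": "Age",
     "match": [0, -1, 20, "age|this year", 20, -2, 0]}
  ]
}
"""


def validate(rules):
    """Check that each relation refers to declared entities and that its
    match list alternates distances with elements."""
    names = {entity["name"] for entity in rules["entity"]}
    for rel in rules["relation"]:
        m = rel["match"]
        if rel["subject"] not in names or rel["object"] not in names:
            return False
        # Distances sit at the 1st, 3rd, 5th, ... positions.
        if len(m) % 2 == 0 or not all(
            isinstance(gap, int) and gap >= 0 for gap in m[0::2]
        ):
            return False
        # Exactly one subject marker (-1) and one object marker (-2).
        if m.count(-1) != 1 or m.count(-2) != 1:
            return False
    return True


rules = json.loads(RULES_JSON)
is_valid = validate(rules)
```

Validating the file before extraction catches a relation that names an undeclared entity or a match list that does not begin and end with a distance.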
The text sequence text_0 is then input into the rule parsing engine.
First, the named entity resolver parses the named entity recognition rules:
(1) the text sequence text_0 is preliminarily segmented and part-of-speech tagged by the word segmenter, and the result text_1 is as follows:
[Figure: word segmentation and part-of-speech tagging result text_1]
(2) the result text_1 is input into the named entity resolver together with the original text text_0:
1. loading the entity rules in the json file, and traversing the rules and text_1/text_0 in sequence;
2. for the 1st rule, the Name entity rule, pos is not empty and reg is empty; the fields in text_1 are traversed and the fields carrying the pos tag are returned, e.g., ('Zhong Ming', 'nr', 0, 3), which is a Name entity;
3. for the 2nd rule, the Age entity rule, pos is empty and reg is a regular expression; the regular expression in reg is matched directly against the original text text_0, and the successfully matched fields are returned, e.g., ('20', 'm', 9, 11), ('years old', 'm', 11, 12).
(3) The named entity recognition result and the original text text_0 are input into the relation extraction rule parser:
1. loading the relation rules in the json file, and traversing the relation rules and text_0 in sequence;
2. detecting the subject Name (marked -1) and the object Age (marked -2) in the match list of the relation rule;
3. detecting whether the subject and the object exist in the named entity list; if so, continuing with the following steps, otherwise returning false;
4. detecting whether the subject and the object are the same; if not, continuing with the following steps, otherwise returning false;
5. matching the elements of the match list [0, -1, 20, "age|this year", 20, -2, 0] of the relation rule in the original text text_0.
text_start = 0, and the original text text_0 to be matched for keywords is "Zhong Ming, Han nationality, age 20 years old, unmarried, she performed excellently, did research during university, and entered Peking University to study.";
key = 2; the 2nd element of [0, -1, 20, "age|this year", 20, -2, 0] is -1, i.e., the subject (Name);
the Name entity is matched in the original text text_0: ('Zhong Ming', 'nr', 0, 3) matches successfully, and the end of the element entity is recorded as 3; if the matching fails, false is returned;
whether end (3) is smaller than text_start (0), or whether end (3) is larger than 0 (text_start (0) plus the value (0) of the previous element in match), is compared; if not, the following steps are continued; if so, the matching fails and false is returned;
the above steps have not returned false and the value of the next element in the match list is not 0, so the following steps are continued;
key + 2 = 4, text_start = 3, and the original text text_0 to be matched is ", Han nationality, age 20 years old, unmarried, she performed excellently, did research during university, and entered Peking University to study."; the above steps are continued;
key = 4; the 4th element of [0, -1, 20, "age|this year", 20, -2, 0] is "age|this year", i.e., the keyword;
the keyword is matched in the original text text_0: ('age', 'n', 7, 9) matches successfully, and the end of the element is recorded as 9;
whether end (9) is smaller than text_start (3), or whether end (9) is larger than 23 (text_start (3) plus the value (20) of the previous element in match), is compared; if not, the following steps are continued;
key + 2 = 6, text_start = 9, and the original text text_0 to be matched is "20 years old, unmarried, she performed excellently, did research during university, and entered Peking University to study."; the above steps are continued;
key = 6; the 6th element of [0, -1, 20, "age|this year", 20, -2, 0] is -2, i.e., the object (Age);
the Age entity is matched in the original text text_0: ('20', 'm', 9, 11) and ('years old', 'm', 11, 12) match successfully, and the end of the element entity is recorded as 12;
whether end (12) is smaller than text_start (9), or whether end (12) is larger than 29 (text_start (9) plus the value (20) of the previous element in match), is compared; if not, the following steps are continued;
none of the above steps has returned false and the value of the next element in match is 0, so the match rule is completely matched, and the relationship {'people age': [('Zhong Ming', 'nr', 0, 3, 'Name'), ('20 years old', 9, 12, 'Age')]} is returned.
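The knowledge integration step, which turns the named entity list and the returned relationship into structured data, could be sketched as follows. The output layout, the function name `integrate` and the sample tuples are assumptions for illustration; the description does not fix the exact shape of the structured data. Entities are assumed to be tuples `(text, pos_tag, start, end, type)` and relations a mapping from relation name to a list of (subject, object) pairs.

```python
def integrate(entities, relations):
    """Merge the entity list and the relations into one structured record."""
    return {
        "entities": [
            {"text": text, "type": etype, "start": start, "end": end}
            for text, _pos, start, end, etype in entities
        ],
        "relations": [
            {"relation": name, "subject": subj[0], "object": obj[0]}
            for name, pairs in relations.items()
            for subj, obj in pairs
        ],
    }


# Hypothetical English stand-ins for the entities of the worked example.
name = ("Tom", "nr", 0, 3, "Name")
age = ("20", "m", 14, 16, "Age")
record = integrate([name, age], {"people age": [(name, age)]})
```

Keeping the character offsets in the entity records lets a consumer trace each extracted value back to its position in the original text.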
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A knowledge extraction method based on json rule files is characterized by comprising the following steps:
a json rule file compiling step, namely compiling the json rule file according to the entity rule and the relation rule;
a named entity extraction step, namely traversing the original text and the text processed by the original text according to the json rule file, and outputting a named entity list;
a relation extraction step, namely traversing the named entity list and the original text according to the json rule file and outputting an entity relation;
and a knowledge integration step, namely obtaining structural data comprising the named entities and the entity relations according to the named entity list and the entity relations.
2. The json rule file-based knowledge extraction method of claim 1, wherein the json rule file comprises the entity rule and the relationship rule, wherein:
the entity rule comprises an entity name, an entity Chinese name, a word segmentation label and a regular expression;
the relationship rule comprises a relationship name, a relationship Chinese name, a subject name, an object name and a rule matching sequence.
3. The json rule file-based knowledge extraction method of claim 2, wherein the named entity extraction step specifically comprises:
a preliminary processing step, performing word segmentation by using a word segmentation device according to the original text, outputting word segmentation and the word segmentation label, and performing part-of-speech tagging on the word segmentation;
and a traversal step, namely detecting the entities in the original text according to the word segmentation labels and the regular expression, and outputting the named entity list.
4. The json rule file-based knowledge extraction method of claim 3, wherein the traversing step specifically comprises:
when the word segmentation label is empty, detecting an entity in the original text according to the regular expression, and outputting an entity result;
and when the word segmentation label is not empty, checking the regular expression: when the regular expression is empty, taking the corresponding segmented word as the entity; and when the regular expression is not empty, detecting the regular expression in the text after word segmentation and outputting an entity result.
5. The json rule file-based knowledge extraction method of claim 2, wherein the elements of the rule matching sequence include, but are not limited to, a subject element, an object element and a keyword string, and the relationship extraction step specifically includes the steps of:
a subject-object matching step, wherein a subject and an object in the named entity list are detected according to the subject element and the object element in the rule matching sequence, and the matching of the subject and the object is completed;
and an entity relationship matching step, namely matching the original text with the elements and outputting the entity relationship.
6. The json rule file-based knowledge extraction method of claim 5, wherein the entity relationship matching step specifically comprises the steps of:
selecting and matching the original text according to the elements, and acquiring the corresponding termination positions of the subject elements, the object elements or the keyword character strings in the original text;
receiving and judging a matching result according to the termination position;
and when the matching is successful, traversing the whole rule matching sequence step by step according to the subsequent element recursion.
7. A knowledge extraction system based on json rule files, comprising:
the json rule file compiling module is used for compiling a json rule file according to the entity rule and the relation rule;
the named entity extraction module is used for traversing the original text and the text processed by the original text according to the json rule file and outputting a named entity list;
the relation extraction module is used for traversing the named entity list and the original text according to the json rule file and outputting an entity relation;
and the knowledge integration module is used for obtaining structural data comprising the named entities and the entity relations according to the named entity list and the entity relations.
8. The json rule file-based knowledge extraction system of claim 7, wherein the entity rules include entity names, entity Chinese names, word segmentation labels, and regular expressions; the relationship rule comprises a relationship name, a relationship Chinese name, a subject name, an object name and a rule matching sequence, and the named entity extraction module comprises:
the initial processing unit is used for segmenting words by using a word segmentation device according to the original text, outputting the segmented words and the segmented word labels, and labeling the part of speech of the segmented words;
and the traversal unit detects the entities in the original text according to the word segmentation labels and the regular expression and outputs the named entity list.
9. The json rule file-based knowledge extraction system of claim 7, wherein the elements of the rule matching sequence include, but are not limited to, subject elements, object elements and keyword strings, and the relationship extraction module specifically comprises:
the subject-object matching unit is used for detecting the subject and the object in the named entity list according to the subject element and the object element in the rule matching sequence to complete the matching of the subject and the object;
and the entity relationship matching unit is used for matching the original text with the elements and outputting the entity relationship.
10. A rule parsing engine for implementing the json rule file-based knowledge extraction method of any one of claims 1-6, comprising:
the word segmentation annotator is used for segmenting the original text with a word segmenter, outputting the segmented words and the word segmentation labels, and performing part-of-speech tagging on the segmented words;
the named entity resolver receives the word segmentation, the word segmentation tag and the part of speech tagging, and outputs a named entity list based on a json rule file with entity rules, which is stored in the named entity resolver, in combination with the original text;
and the relationship extraction rule parser receives the named entity list, and outputs entity relationships based on a json rule file with relationship rules, which is stored in the relationship extraction rule parser, in combination with the original text.
CN202011341041.9A 2020-11-25 2020-11-25 Knowledge extraction method and system based on json rule file and rule analysis engine Pending CN112507108A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011341041.9A CN112507108A (en) 2020-11-25 2020-11-25 Knowledge extraction method and system based on json rule file and rule analysis engine

Publications (1)

Publication Number Publication Date
CN112507108A true CN112507108A (en) 2021-03-16

Family

ID=74959861

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011341041.9A Pending CN112507108A (en) 2020-11-25 2020-11-25 Knowledge extraction method and system based on json rule file and rule analysis engine

Country Status (1)

Country Link
CN (1) CN112507108A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060047500A1 (en) * 2004-08-31 2006-03-02 Microsoft Corporation Named entity recognition using compiler methods
CN110046351A (en) * 2019-04-19 2019-07-23 福州大学 Text Relation extraction method under regular drive based on feature
CN111104524A (en) * 2019-12-25 2020-05-05 航天云网科技发展有限责任公司 Method for identifying television end user set
CN111401058A (en) * 2020-03-12 2020-07-10 广州大学 Attribute value extraction method and device based on named entity recognition tool

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114758749A (en) * 2022-03-23 2022-07-15 清华大学 Nutritional diet management map creation method and device based on gestation period
CN114758749B (en) * 2022-03-23 2023-08-25 清华大学 Nutritional diet management map creation method and device based on gestation period

Similar Documents

Publication Publication Date Title
US10169354B2 (en) Indexing and search query processing
US7958444B2 (en) Visualizing document annotations in the context of the source document
US8504553B2 (en) Unstructured and semistructured document processing and searching
US8005819B2 (en) Indexing and searching product identifiers
Khusro et al. On methods and tools of table detection, extraction and annotation in PDF documents
JP6849741B2 (en) How and systems to perform model-driven domain-specific searches
WO2005124599A2 (en) Content search in complex language, such as japanese
CN113569050B (en) Method and device for automatically constructing government affair field knowledge map based on deep learning
US20130159346A1 (en) Combinatorial document matching
US12013903B2 (en) System and method for search discovery
Beheshti et al. Big data and cross-document coreference resolution: Current state and future opportunities
CN112860879A (en) Code recommendation method based on joint embedding model
JP2010262577A (en) System, method and program for creation of extraction rule
Neysiani et al. Automatic interconnected lexical typo correction in bug reports of software triage systems
Guo et al. Reference metadata extraction from scientific papers
CN112507108A (en) Knowledge extraction method and system based on json rule file and rule analysis engine
CN109902162B (en) Text similarity identification method based on digital fingerprints, storage medium and device
KR101983477B1 (en) Method and System for zero subject resolution in Korean using a paragraph-based pivotal entity identification
Souza et al. ARCTIC: metadata extraction from scientific papers in pdf using two-layer CRF
Chou et al. On the Construction of Web NER Model Training Tool based on Distant Supervision
Zhong et al. TOMN: constituent-based tagging scheme
Tarawneh et al. a hybrid approach for indexing and searching the holy Quran
JP2001101184A (en) Method and device for generating structurized document and storage medium with structurized document generation program stored therein
JPWO2020157887A1 (en) Sentence structure vectorization device, sentence structure vectorization method, and sentence structure vectorization program
US20240046039A1 (en) Method for News Mapping and Apparatus for Performing the Method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination