CN117273137A - Knowledge graph construction method and device based on dependency syntax rules - Google Patents

Knowledge graph construction method and device based on dependency syntax rules

Info

Publication number
CN117273137A
Authority
CN
China
Prior art keywords
entity
word
candidate
dependency syntax
rule
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311331571.9A
Other languages
Chinese (zh)
Inventor
王涛
林木
王维平
李小波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202311331571.9A priority Critical patent/CN117273137A/en
Publication of CN117273137A publication Critical patent/CN117273137A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G06N5/022 Knowledge engineering; Knowledge acquisition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to a knowledge graph construction method and device based on dependency syntax rules. First, the original text is segmented to obtain word elements, candidate center words are screened out according to the part of speech of the word elements, and the word elements are used as nodes to construct a dependency syntax structure. Next, whether and how a candidate center word needs to be expanded is determined by matching the expansion rule set against the dependency syntax structure, yielding candidate entities. Then, since the expanded entities may overlap, they are fused and reorganized to obtain an entity list. Finally, the dependency syntax structure is reconstructed according to the entity list to obtain the upper and lower relationships among the entities, triples are generated from these relationships, and an intelligent question-answering system is constructed. The method is suitable for extracting professional knowledge in a specific field and building an intelligent question-answering system without a pre-trained model, which greatly reduces the construction cost of the system while maintaining both the accuracy and the responsiveness of its answers.

Description

Knowledge graph construction method and device based on dependency syntax rules
Technical Field
The application relates to the technical field of knowledge graph construction, in particular to a knowledge graph construction method and device based on dependency syntax rules.
Background
With the development of artificial intelligence and big data technology, the knowledge graph has become a mainstream way of storing data thanks to its good retrieval performance and high storage quality. According to the categories of knowledge they contain, knowledge graphs are subdivided into general-domain knowledge graphs and vertical-domain knowledge graphs.
An intelligent question-answering system is a software system, implemented in a programming language, that can converse with humans and solve problems based on a knowledge graph built from a large amount of corpus data. Such a system is required to have high retrieval precision so that it genuinely answers the questions asked.
In the construction of vertical-domain knowledge graphs, automated construction still faces many challenges. First, the amount of data in a vertical domain is limited; compared with the general domain, the prior knowledge required for vertical-domain knowledge extraction cannot be obtained from general-domain data, and models trained on the general domain are not suitable for building vertical-domain knowledge graphs. Second, such domains are highly specialized: different vertical domains have their own specialized vocabularies, concepts and terms, which require dedicated processing and extraction. These problems keep the level of automated construction of vertical-domain intelligent question-answering systems low, which in turn makes their construction cost prohibitive, and the manpower and time required do not match the rapid development of information technology. Meanwhile, if knowledge is not extracted adequately, the search results of vertical-domain question answering are not accurate enough: a large amount of similar but inaccurate content is returned and must be further screened by the searcher, so questions cannot be answered well, which affects the accuracy and responsiveness of the question-answering results to a certain extent.
Disclosure of Invention
Based on the above, it is necessary to provide a knowledge graph construction method and device based on dependency syntax rules that can raise the level of automated construction of intelligent question-answering systems while maintaining both the accuracy and the responsiveness of the question-answering results.
A knowledge graph construction method based on dependency syntax rules, the method comprising:
obtaining a dependency syntax structure of an original text; the dependency syntax structure is constructed by taking a word element obtained after word segmentation processing is carried out on an original text as a node; the dependency syntax structure comprises a plurality of candidate center words; the candidate center word is obtained by selecting according to the part of speech of the word element;
acquiring a pre-constructed expansion rule set, and expanding the current candidate center word to obtain a corresponding candidate entity if the current candidate center word meets any expansion rule in the expansion rule set;
if the candidate entities have cross or adjacency, fusing and recombining the corresponding candidate entities to obtain an entity list;
reconstructing the dependency syntax structure of the original text according to the entity list to obtain the upper and lower relation of the entity, generating a plurality of triples according to the upper and lower relation, further constructing to obtain a corresponding knowledge graph, and constructing an intelligent question-answering system according to the knowledge graph.
A knowledge graph construction apparatus based on a dependency syntax rule, the apparatus comprising:
the dependency syntax structure acquisition module is used for acquiring the dependency syntax structure of the original text; the dependency syntax structure is constructed by taking a word element obtained after word segmentation processing is carried out on an original text as a node; the dependency syntax structure comprises a plurality of candidate center words; the candidate center word is obtained by selecting according to the part of speech of the word element;
the entity expansion module is used for acquiring a pre-constructed expansion rule set, and expanding the current candidate center word to obtain a corresponding candidate entity if the current candidate center word meets any expansion rule in the expansion rule set;
the entity reorganization module is used for fusing and reorganizing the corresponding candidate entities if the candidate entities are crossed or adjacent to each other, so as to obtain an entity list;
the triplet generation module is used for reconstructing the dependency syntax structure of the original text according to the entity list to obtain the upper and lower relation of the entity, generating a plurality of triples according to the upper and lower relation, further constructing to obtain a corresponding knowledge graph, and constructing an intelligent question-answering system according to the knowledge graph.
According to the knowledge graph construction method and device based on dependency syntax rules, the original text is first segmented to obtain word elements, candidate center words are screened out according to the part of speech of the word elements, and the word elements are used as nodes to construct the dependency syntax structure. Next, whether and how a candidate center word needs to be expanded is determined by matching the expansion rule set against the dependency syntax structure, yielding candidate entities. Then, since the expanded entities may overlap, they are fused and reorganized to obtain an entity list. Finally, the dependency syntax structure of the original text is reconstructed according to the entity list to obtain the upper and lower relationships among the entities, triples are generated from these relationships, and an intelligent question-answering system is constructed. The method does not require training a model on a large amount of data and is particularly suitable for extracting professional knowledge in a specific field to build an intelligent question-answering system when no pre-trained model is available, which greatly reduces the construction cost of the system while maintaining both its accuracy and its responsiveness.
Drawings
FIG. 1 is a flow diagram of a knowledge graph construction method based on dependency syntax rules;
FIG. 2 is a schematic illustration of entity extensions based on dependency syntax structure rules;
FIG. 3 is a schematic diagram of an entity extraction and relationship extraction framework based on dependency syntax structure rules;
FIG. 4 is a diagram of an example of dependency analysis based upper and lower relationship extraction;
FIG. 5 is a schematic diagram of a knowledge association ontology structure in one embodiment;
FIG. 6 is an exemplary diagram of an RDFS file in ttl format in one embodiment;
FIG. 7 is a schematic diagram of a SPARQL query statement with hyperedge constraints in one embodiment;
FIG. 8 is a diagram of SPARQL query results, under one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
In one embodiment, as shown in fig. 1, a knowledge graph construction method based on a dependency syntax rule is provided, which includes the following steps:
step 102, obtaining the dependency syntax structure of the original text.
The dependency syntax structure is constructed by taking a word element obtained by word segmentation of the original text as a node, and comprises a plurality of candidate center words which are obtained by selecting according to the part of speech of the word element.
Dependency analysis, also called dependency syntactic parsing, is a method commonly used in entity extraction and relation extraction; it finds related words and their relation types by analyzing the grammatical structure of a sentence. Dependency analysis assumes that a sentence is composed of binary, asymmetric relations between words, referred to as dependencies. Each relation has a head and a dependent that modifies the head, and is labeled according to the nature of the dependency between the two.
Segmenting a sentence yields its tokens. Here spaCy can be used for segmentation: spaCy is an open-source natural language processing library that uses neural networks trained on annotated data to analyze the relations between the words of a sentence and build a dependency tree over them. The dependency parser of spaCy computes probability distributions over partial parse trees and predicts the dependency of each word. The model predicts dependencies using a transition-based ("shift-reduce") algorithm: it compares the next word with the word at the top of the stack in the current state and decides whether that word forms a dependency with the word at the top of the stack or is pushed onto the stack as an independent component; the algorithm then reduces the elements in the stack according to the dependency type and builds the dependency tree.
Candidate center words are selected according to part of speech. For example, for information technology project text, which contains a large number of technical nouns and noun phrases, nouns and noun phrases can mainly be selected as candidate center words.
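As a minimal illustration of these two steps, the sketch below parses a sentence with spaCy and keeps noun tokens as candidate center words; the en_core_web_sm model name and the noun-only filter are assumptions based on the heuristic just described, not the patented rule set.
import spacy

nlp = spacy.load("en_core_web_sm")   # small English pipeline; model choice is an assumption

text = ("This project provides the analyst with the ability to rapidly find and "
        "fuse multiple intelligence sources of battlespace information.")
doc = nlp(text)

# Dependency syntax structure: every token is a node with a head and a dependency label.
for token in doc:
    print(token.i, token.text, token.pos_, token.dep_, "head =", token.head.text)

# Candidate center words: nouns and proper nouns, following the part-of-speech principle above.
candidate_heads = [t for t in doc if t.pos_ in ("NOUN", "PROPN")]
print([t.text for t in candidate_heads])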
Step 104, obtaining a pre-constructed expansion rule set, and if the current candidate center word meets any expansion rule in the expansion rule set, expanding the current candidate center word to obtain a corresponding candidate entity.
For each candidate center word, whether it can be expanded is judged by matching the dependency syntax structure against the expansion rule set. When expansion is needed, the candidate center word is taken as the center and expanded forward or backward as the matched rule dictates (some rules require expansion in both directions at once); the boundary of the expanded phrase is then located, and the expanded phrase is taken as a candidate entity.
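The concrete expansion rules live in the rule set of Tables 1 and 2 and FIG. 2, which are not reproduced here, so the sketch below (continuing the spaCy example above) simply takes the left and right edges of a candidate center word's subtree as the expansion boundaries; this is an illustrative simplification, not the patented rules.
def expand_candidate(token):
    """Expand a candidate center word into a candidate entity span.

    Simplification: instead of matching a concrete rule combination, the
    boundaries of the head word's syntactic subtree are used as the phrase
    boundaries, which expands in both directions at once.
    """
    left = token.left_edge.i
    right = token.right_edge.i + 1   # exclusive end
    return token.doc[left:right]

candidate_entities = [expand_candidate(t) for t in candidate_heads]
print([span.text for span in candidate_entities])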
And 106, if the candidate entities have cross or adjacency, fusing and reorganizing the corresponding candidate entities to obtain an entity list.
Since the expanded entities may cross or be adjacent to one another, they need to be fused and recombined to obtain entity information that is free of overlap and duplication.
And step 108, reconstructing the dependency syntax structure of the original text according to the entity list to obtain the upper and lower relation of the entity, generating a plurality of triples according to the upper and lower relation, further constructing to obtain a corresponding knowledge graph, and constructing an intelligent question-answering system according to the knowledge graph.
The knowledge graph of the intelligent question-answering system is constructed according to a large amount of corpus data in the intelligent question-answering system. The intelligent question-answering system may be, but is not limited to, a tourist attraction question-answering system, an online medical question-answering system, a knowledge question-answering system, etc.
The knowledge graph construction method based on dependency syntax rules comprises the following steps. First, the original text is segmented to obtain word elements, candidate center words are screened out according to the part of speech of the word elements, and the word elements are used as nodes to construct a dependency syntax structure. Next, whether and how a candidate center word needs to be expanded is determined by matching the expansion rule set against the dependency syntax structure, yielding candidate entities. Then, since the expanded entities may overlap, they are fused and reorganized to obtain an entity list. Finally, the dependency syntax structure of the original text is reconstructed according to the entity list to obtain the upper and lower relationships among the entities, triples are generated from these relationships, and an intelligent question-answering system is constructed. The method does not require training a model on a large amount of data and is particularly suitable for extracting professional knowledge in a specific field to build an intelligent question-answering system when no pre-trained model is available, which greatly reduces the construction cost of the system while maintaining both its accuracy and its responsiveness.
It should be understood that, although the steps in the flowchart of fig. 1 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the order of execution is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in fig. 1 may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and the order in which these sub-steps or stages are performed is not necessarily sequential; they may be performed in turn or alternately with at least some of the other steps or with sub-steps or stages of other steps.
In one embodiment, each extension rule in the set of extension rules is provided with its corresponding priority;
and if the current candidate center word meets a plurality of expansion rules in the expansion rule set, selecting the expansion rule with the highest priority to expand the current candidate center word.
To resolve conflicts between rules, each rule combination is assigned a priority starting from 0; the larger the value, the lower the priority. When an entity satisfies several rule combinations at once, it is expanded according to the rule combination with the highest priority. Rule priorities provide flexibility for further expansion of the rule set: when a new rule combination is created, only the priorities need to be adjusted.
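A minimal sketch of this conflict resolution follows; the RuleCombo record is hypothetical, and the real rule combinations are built from the simple rules defined below.
from collections import namedtuple

# Hypothetical rule-combination record: a name, a priority (0 = highest) and a matcher.
RuleCombo = namedtuple("RuleCombo", ["name", "priority", "matches"])

def select_rule(token, rule_set):
    """Return the matching rule combination with the highest priority, or None."""
    matched = [rule for rule in rule_set if rule.matches(token)]
    if not matched:
        return None
    return min(matched, key=lambda rule: rule.priority)   # smaller value = higher priority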
In one embodiment, the method further comprises:
predefining dependency syntax structure parameters based on dependency analysis; the dependency syntax structure parameters include:
id parameters; the id parameters include: the id of the current word, the id of its superior word, and the id of the super-superior word (the superior word of the superior word);
part-of-speech parameters; the part-of-speech parameters include: the part of speech of the current word, of its superior word, and of the super-superior word;
dependency arc parameters; the dependency arc parameters include: the dependency arc from the current word to its superior word, and the dependency arc from the superior word to the super-superior word;
as shown in Table 1, a dependency syntax structure parameter definition is provided.
TABLE 1 dependency syntax structure parameter definition
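The body of Table 1 is not reproduced above; the sketch below records one plausible reading of the parameter definitions (the field names i, j, k, ii, jj, kk, ij, jk follow the references to Table 1 in the surrounding text, and mapping them onto spaCy attributes is an assumption).
from dataclasses import dataclass

@dataclass
class DependencyParams:
    """Dependency syntax structure parameters of one word element (cf. Table 1)."""
    i: int    # id of the current word
    j: int    # id of its superior word
    k: int    # id of the super-superior word
    ii: str   # part of speech of the current word
    jj: str   # part of speech of the superior word
    kk: str   # part of speech of the super-superior word
    ij: str   # dependency arc from the current word to the superior word
    jk: str   # dependency arc from the superior word to the super-superior word

def params_from_token(token):
    """Fill the parameters from a spaCy token (a sketch, not the patented implementation)."""
    head, grand = token.head, token.head.head
    return DependencyParams(token.i, head.i, grand.i,
                            token.pos_, head.pos_, grand.pos_,
                            token.dep_, head.dep_)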
In one embodiment, constructing the expansion rule set specifically includes:
defining simple rules; the simple rules include:
relationship rules; a relation rule describes the grammatical dependency between the current word and its corresponding superior word and super-superior word; the grammatical relation is defined by the dependency arc parameters among the dependency syntax parameters, i.e. the ij and jk parameters in Table 1. The grammatical role of a word element in the sentence can be judged through the relation rules.
part-of-speech rules; a part-of-speech rule describes the part of speech of the current word and of its corresponding superior word and super-superior word; part-of-speech rules are defined by the part-of-speech parameters among the dependency syntax parameters, i.e. the ii, jj and kk parameters in Table 1; as shown in Table 2, the dependency definitions in spaCy are provided, which also include the commonly used part-of-speech definitions.
TABLE 2 Dependency definitions in spaCy
position rules; a position rule records the positions of the current word and its corresponding superior word and super-superior word in the original text; position rules are defined by the id parameters among the dependency syntax parameters, i.e. the i, j and k parameters in Table 1. The word elements of each sentence are numbered starting from 0, so the position of a word element in the sentence and the distance between word elements can be judged through the position rules.
special rules; special rules are used to define specific phrases or fixed collocations; for example, an "is_of" rule is used to extract phrases containing the preposition "of". This class of rules is designed because the entities to be extracted sometimes have specific requirements that call for more specific rules.
And combining the simple rules according to actual requirements to obtain a plurality of expansion rules so as to form an expansion rule set.
Rule combinations, i.e. combinations of the four simple rules described above, are used to determine whether and how an entity needs to be expanded. Each rule combination is used to extract entities with a specific syntactic structure.
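Building on the DependencyParams sketch above, the following illustrative rule combination shows how the relation, part-of-speech and position rules can be checked together; the concrete labels and the distance threshold are assumptions, not the rules of Table 1.
def noun_of_noun_rule(p: DependencyParams) -> bool:
    """Illustrative rule combination: a noun attached to another noun through "of".

    Relation rule      : ij is "pobj" and jk is "prep" (dependency arcs)
    Part-of-speech rule: ii and kk are nouns
    Position rule      : the current word and the super-superior word are close together
    """
    return (p.ij == "pobj" and p.jk == "prep"
            and p.ii == "NOUN" and p.kk == "NOUN"
            and abs(p.i - p.k) <= 4)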
As shown in FIG. 2, an entity extension schematic based on dependency syntax structure rules is provided.
In one embodiment, the method further comprises:
Whether a new expansion rule needs to be added is decided by manually checking the generated entities and their upper and lower relationships;
if no new rule needs to be added, triples in RDF format are generated from the entities that meet the requirements through the entity-relation parser;
an entity that does not meet the requirements is one that does not match any of the rules currently in use; such entities are recorded and manually checked to determine whether a new rule needs to be established.
As shown in FIG. 3, an entity extraction and relationship extraction framework schematic based on dependency syntax structure rules is provided.
In FIG. 3, "with" points to "provides" through the "prep" relation, indicating that it is a preposition governed by its superior word, while "the project" points to "provides" through the "nsubj" relation, indicating that it is the noun subject of its superior word.
In one embodiment, as shown in fig. 3, if there is a crossover or adjacency between candidate entities, fusion and reorganization are performed on the corresponding candidate entities to obtain an entity list, including:
acquiring a candidate entity list; the candidate entity list is obtained by sequencing the starting point position of each candidate entity in the original text;
acquiring a current entity list; in the entity list, the entities are sequenced according to the starting point positions; initializing an entity list by adopting a first candidate entity in the candidate entity list;
if the starting point position of the candidate entity of the current round is not greater than the final point position of the last entity in the current entity list and the final point position of the candidate entity of the current round is greater than the final point position of the last entity, updating the final point position of the last entity by using the final point position of the candidate entity of the current round to realize the fusion and recombination of the candidate entities, otherwise, adding the candidate entity of the current round to the end of the current entity list.
A filtering algorithm based on segment ordering is used here. The algorithm treats the sentence S as a line segment and each extracted entity si as a sub-segment of S, defined by its start and end positions and denoted (start_i, end_i). First, all sub-segments in S are sorted by start position and stored in a list Ssorted. Then a list Smerged is initialized with the first segment s0 of Ssorted. The algorithm traverses each segment si in Ssorted and compares it with the last segment st in Smerged: if the start point start_i of si is less than or equal to the end point end_t of st, and the end point end_i of si is greater than end_t, then si intersects st and the end point of st is updated to end_i; otherwise, si is appended to the end of Smerged as the new st. Finally, the text of each entity is recovered from the start and end points of the elements of Smerged and saved in the entity list E.
As shown in table 3, a candidate entity fusion algorithm is provided.
Table 3 candidate entity fusion algorithm
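The body of Table 3 is not reproduced above, so the sketch below restates the segment-ordering fusion just described; treating a span fully contained in an already merged span as droppable is an assumption the text leaves open.
def fuse_candidates(candidates):
    """Fuse crossing or adjacent candidate entity spans.

    candidates: iterable of (start, end) index pairs within one sentence.
    Returns the fused spans ordered by start position.
    """
    s_sorted = sorted(candidates)            # sort by start point
    merged = [list(s_sorted[0])]             # initialise Smerged with s0
    for start_i, end_i in s_sorted[1:]:
        last = merged[-1]
        if start_i <= last[1] and end_i > last[1]:
            last[1] = end_i                  # si crosses or touches st: extend st
        elif start_i > last[1]:
            merged.append([start_i, end_i])  # disjoint: append as the new st
        # otherwise si lies inside st and is dropped
    return [tuple(span) for span in merged]

print(fuse_candidates([(0, 3), (2, 6), (8, 10)]))   # -> [(0, 6), (8, 10)]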
In one embodiment, reconstructing the dependency syntax structure of the original text according to the entity list to obtain the context of the entity, including:
reconstructing the dependency syntax structure of the original text according to the entity list, and acquiring the root word of the original text from the reconstructed dependency syntax structure. Calling root on a whole sentence returns the center word of the sentence; a sentence has exactly one such center word, which is its root word. Each entity also has its own center word, likewise obtained through root, but it is not the root word of the sentence.
Defining the hop count of the root word of the original text as 0, wherein the hop count of the subject of the original text as-1, and the hop count of other entities is obtained from the distance from the corresponding center word to the root word; wherein, the jump value of 1 indicates that the center word of the entity can reach the root word only through one-time dependency relationship; the grammatical upper and lower relation of the word elements which are in parallel relation in grammar belongs to the same layer;
and obtaining the upper and lower relation of the entity according to the hop count value of the entity.
The upper and lower relationships of the fused entities within a sentence are determined according to the grammatical dependencies between them. According to the definition of dependency analysis, any word element can be connected to the root word (ROOT) of a sentence through a finite number of dependencies. For a fused entity, its center word (root) can be obtained through the interface provided by spaCy, and the upper and lower relationship can be reproduced by treating this center word as a word element. Thus, whether for individual word elements or for fused entities, their distance relations to the root word can be established through dependencies.
As shown in FIG. 4, an example of upper and lower relationship extraction based on dependency analysis is provided. A single-step dependency between two word elements is counted as one hop, the number of hops needed to connect each word element to the root word is recorded as hops, and hops is taken as the basis for judging the upper and lower relationship. A hops value of 1 means the center word reaches the root word through exactly one dependency. Meanwhile, the hops value of the root word of each sentence is defined as 0, the hops value of the subject of the sentence is defined as -1, and the hops values of other entities are determined by the distance from their center words to the root word. The smaller the hops value of an entity, the closer the entity is to the root word grammatically and the higher its grammatical level. To compute the hops value of an entity's center word, it is first initialized to 0 and incremented by 1 per hop. Note that if two word elements are grammatically in a parallel (coordinate) relation, their grammatical upper and lower relations should belong to the same layer; therefore, the algorithm additionally checks whether the dependency of each hop is a coordinate relation (e.g., conj in Table 2), and if so the hops value is not incremented.
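A sketch of the hops computation following the rules just described; the spaCy attribute names are real, while treating the coordinate relation simply as the "conj" label is an assumption.
def hops_to_root(center_word):
    """Count the dependency hops from an entity's center word to the sentence root.

    Hops over coordinate ("conj") arcs are not counted, so coordinated word
    elements stay on the same layer. The conventions hops(root) = 0 and
    hops(subject) = -1 are applied outside this helper.
    """
    hops = 0
    token = center_word
    while token.dep_ != "ROOT":
        if token.dep_ != "conj":   # parallel relation: same layer, do not count
            hops += 1
        token = token.head
    return hops

# For a fused entity span, spaCy exposes the center word as span.root:
# hops = hops_to_root(entity_span.root)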
In one embodiment, generating a plurality of triples from a context includes:
if the jump value of the entity is 1, taking the item name corresponding to the original text as the head entity of the entity; an entity with a jump value of 1 is taken as a tail entity in the triplet, and predicates in the triplet are the head of the root word of the entity, namely the upper level word of the root of the entity.
If the hop count value of the entity is greater than 1, sequentially constructing triples according to the upper and lower relationship of the entity; the upper entity of the entity is a head entity, and the head of the root word of the entity is a predicate of a triplet.
A triple of the form <subject, predicate, object> is generated based on the resulting upper and lower relationships between the entities.
First, the upper and lower relation of the entity is determined according to hops and expressed by a parameter level. The level of the subject is 0, other entities are numbered in sequence according to the grammar distance, and the levels of the entities at the same level are the same.
Then, for elements in entity list E, a triplet is created starting from level 1. Each entity ei in the extended entity list E is given a unique identifier uri. If the level of the entity is 1, taking the project name as a head entity; for the entities with the level greater than 1, the triples are sequentially constructed according to the upper-lower relation, each entity takes the upper entity eupper as a head entity, and the head of the root word of the entity is taken as the predicate of the triples.
As shown in table 4, a triplet generation algorithm is provided.
Table 4 triplet generation algorithm
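The body of Table 4 is likewise not reproduced, so the sketch below illustrates the level-based triple construction described above; the entity record fields, the resolution of the superior entity as the most recent entity one level up, and the sample data are assumptions for illustration.
def build_triples(project_name, entities):
    """Build <head, predicate, tail> triples from fused entities ordered by level.

    entities: list of dicts with keys
      'text'  - the fused entity text,
      'level' - grammatical level (1 = directly under the sentence root),
      'pred'  - the head of the entity's root word, used as the predicate.
    """
    triples = []
    last_at_level = {}
    for ent in sorted(entities, key=lambda e: e["level"]):
        if ent["level"] == 1:
            head = project_name   # level-1 entities attach to the project name
        else:
            head = last_at_level.get(ent["level"] - 1, project_name)
        triples.append((head, ent["pred"], ent["text"]))
        last_at_level[ent["level"]] = ent["text"]
    return triples

# Illustrative data following the example in the experimental section:
ents = [
    {"text": "the analyst", "level": 1, "pred": "provide"},
    {"text": "the ability", "level": 2, "pred": "with"},
    {"text": "better detect and find anomalies", "level": 3, "pred": "to"},
]
print(build_triples("This project", ents))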
In one embodiment, generating a plurality of triples according to the context relation, and further constructing to obtain a corresponding knowledge graph, including:
judging the corresponding predicates according to the entities and their upper and lower relationships, and, if a predicate meets the corresponding rule, marking it as a hyperedge and storing the related entities and the predicates connected to it; the judgment is made according to relation rules describing the grammatical relations between entities in the text, e.g. when a predicate v2 depends on another predicate v1, the connection between v2 and v1 needs to be supplemented with a hyperedge.
After generating a plurality of initial triples according to the upper and lower relationships, the predicates marked as hyperedges are instantiated to obtain instantiated edges, the related initial triples are connected through the instantiated edges, and new triples are constructed.
In information technology project text, complex multi-element relations, rather than simple binary relations, typically exist between the extracted entities. A binary relation is a relation involving only two entities. The conventional triple form can model binary relations in a compact language form, but for multi-element relations this approach faces various challenges. For example, project A studies a key technology B that can be used to solve problem C, where the three entities form a multi-element group. The conventional triple form can only describe the pairwise relations between A, B and C, and it is difficult to describe a multi-element relation of the form <A→B>→C.
The problem of modeling the multi-element relations of knowledge graphs has received increasing attention in recent years. The following processing approaches are generally adopted at present: 1) Modeling with an attribute graph model. 2) Introducing a hypergraph model: the hypergraph model is used to handle multi-element relations. In the hypergraph model, entities and relations are treated as nodes, and the relations between them are treated as hyperedges, so a multi-element relation can be represented directly as a hyperedge. 3) Using a Named Graph: a named graph is a named graph structure based on RDF, also called a quad, which extends a triple into a quad by adding assertion syntax and semantics to RDF, attaching to each RDF triple an assertion attribute that can describe its context or topic. These methods all focus on knowledge graph construction and application, and the appropriate method needs to be selected according to the specific problem.
Taking the mining of information technology project text and knowledge graph construction as an example, the extracted knowledge entities are modeled with hyperedges based on a hypergraph. In a hypergraph, a hyperedge is an edge that connects multiple vertices; unlike a traditional binary edge, a hyperedge can connect any number of vertices (including two). Hyperedges can represent a large number of non-binary relations and capture complex structures and relationships. Meanwhile, in order to clearly express the various entities extracted from information technology project text, a multi-architecture meta-model is adopted to establish the ontology of the knowledge graph.
In information technology project text, there are a number of relationship types between entities. To better model these relationships, a well-defined, well-classified ontology needs to be designed. Based on the domain knowledge characteristics of the information technology project itself, enterprise architecture is generally adopted internationally for conceptual modeling and description. Commonly used enterprise architecture frameworks such as FEA architecture, TOGAF architecture and DoDAF architecture are all based on metamodel to provide abstract descriptions of things in the real world. These metamodels define abstract concepts of things and relationships between things by using ontology-like structures.
An integrated information technology project ontology structure MI-CRM (Meta-model Integration Conceptual Reference Model) is built by referencing Meta-models of various architectures. The abstractions in the ontology are taken from a variety of enterprise architecture metamodels and are divided into four conceptual groups, a method/means domain, an organization domain, a capability domain, and a mission domain, respectively. Meanwhile, in order to reduce the complexity of the ontology structure as much as possible, concepts are merged by strictly defining relationships between the concepts. The final results are shown in FIG. 5. There are seven kinds of connection relations (or predicates) in the figure, and these connection relations are strictly distinguished by a head entity and a tail entity. In addition, MI-CRM defines a "define" relationship to extend the relationship of parent and child classes to handle abstract description issues inside the homogeneous set of concepts. The goal of designing MI-CRM is to minimize the need to extend the ontology. When MI-CRM cannot provide a concept body conforming to the corresponding concept, first, it is checked whether a new concept can be connected to the corresponding concept type by using the above seven predicates, and a merging concept is added on the basis of the original concept. Custom concept types related to entities are introduced through the has_type predicate instead of generating new concepts to ensure that the knowledge graph is maximally backward compatible.
After the MI-CRM ontology is built, simple binary relations between entities can be modeled. Here, the RDFS specification is used to build the multi-element relations between entities, with hyperedges as the carrier. Under the RDFS specification, the predicate of one triple may appear as the subject of another triple. By instantiating predicates and introducing hyperedges to further connect them to other entities, the multi-element relation modeling problem can be solved effectively. A knowledge hypergraph is established based on the upper and lower relationships provided by the dependency analysis:
1) Constructing a hyperedge-based knowledge hypergraph. First, the hypergraph is further defined on the basis of the basic graph structure of the existing MI-CRM ontology. In a conventional graph structure, each edge can only connect two entity nodes, expressed as a triple <subject, edge, object>; in the hypergraph, a new hyperedge is introduced that can connect an entity node and an edge, expressed as a triple <edge, hyperedge, object>. In the relation extraction step, the identified predicates are judged according to the entities and their upper and lower relationships; if a predicate meets the corresponding rule, it is marked as a hyperedge during extraction, and the related entities and the predicates connected to it are stored.
2) Generating RDFS graph data based on hyperedges. This stage converts the triples from the same sentence into an RDFS graph containing hyperedges, in which RDFS triples are automatically generated by a hyperedge parser for the triples that need to be rewritten. The hyperedge parser first determines the type of the head entity of an edge; if an edge is connected, the edge is instantiated first and the triple is then built. For example, assuming a triple is <edge, hyperedge, object>, the edge is first instantiated as edge1, i.e. <edge1, typeOf, edge>, and a new triple <edge1, hyperedge, object> is added to the data to replace <edge, hyperedge, object>.
For example, for a syntactic structure like "A support B by providing C", the upper and lower relationships among the three entities obtained by dependency analysis are (level_A = 0, level_B = 1, level_C = 2). Thus, the following triples can be established: <A, support, B>, <support, byProviding, C>. Obviously, the head entity of the triple <support, byProviding, C> is the edge of <A, support, B>, which needs to be instantiated. Since the RDF format allows any resource to be represented by a URI, instantiation can be accomplished simply by giving support a unique URI, so the triples can be rewritten as: <A, support1, B>, <support1, byProviding, C>, <support1, typeOf, support>.
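A hedged sketch of this rewriting step with rdflib; the namespace URI is hypothetical and mapping typeOf to rdf:type is an assumption, since the patent's own RDFS vocabulary is not reproduced here.
from rdflib import Graph, Namespace, RDF

ONTO = Namespace("http://example.org/micrm#")   # hypothetical namespace
g = Graph()
g.bind("onto", ONTO)

# "A support B by providing C": the edge "support" is instantiated as "support1"
# so that the hyperedge "byProviding" can attach C to it.
g.add((ONTO.A, ONTO.support1, ONTO.B))
g.add((ONTO.support1, ONTO.byProviding, ONTO.C))
g.add((ONTO.support1, RDF.type, ONTO.support))

print(g.serialize(format="turtle"))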
The following experimental setup and performance evaluations were performed:
in the experimental section, a case study is carried out using project text as an example. The experiment collects abstract texts of information technology project initiation documents as the data source for knowledge extraction and association. To illustrate the proposed method, the knowledge extraction process is first demonstrated with a specific example. Then, the knowledge hypergraph of this example is constructed using the extracted upper and lower relationships. Finally, the generated knowledge graph is queried with SPARQL query statements, and the correctness of graph generation is judged from the returned results.
1. Knowledge extraction experiment and analysis
This section gives experiments and analyses of the knowledge extraction section, such as the following text taken from a certain aviation information system project: "This project provides the analyst with the ability to rapidly find and fuse multiple intelligence sources of battlespace information for improved situational awareness, and to better detect and find anomalies". According to the dependency structure defined in table 1, its dependency syntax structure is obtained by text processing as shown in table 5.
TABLE 5 dependency syntax structure
The last column in Table 5 lists the rule numbers that each entity matches; entities labeled -1 will not become candidate words. For an entity that matches several rules at once, the specific expansion is determined by rule priority. For example, "analyst" matches both rule 6 and rule 20, and rule 6 has a higher priority than rule 20, so the entity is expanded according to rule 6.
According to the rule design, the algorithm expands entities in different directions. For example, the rule matched by "battlespace information" in Table 5 has priority 2; this rule searches forward to "multiple intelligence sources" as the expansion boundary, thereby extracting "multiple intelligence sources of battlespace information" as a new entity. The initial entities extracted in this step are shown in Table 6.
Table 6 shows the candidate entity extraction results, including the position parameters of the 7 extracted entities in the sentence; using these position parameters, it can be determined whether the extracted entities are adjacent or intersecting. According to Algorithm 1, entities with adjacency or intersection are fused, and the resulting entities are shown in Table 7. Table 7 also lists the center word (root) of each entity and the distance from that center word to the ROOT of the sentence, from which the upper and lower relationship of the entity in the sentence is determined.
Table 6 candidate entity extraction results
TABLE 7 candidate entity fusion results
2. Knowledge graph construction and query
In the knowledge graph construction stage, a hypergraph with hyperedges is constructed according to the upper and lower relationships. Taking the entity "better detect and find anomalies" as an example, according to Algorithm 3 the following triples are generated:
<This project,provide,the analyst>
<the analyst,with,the ability>
<the ability,to,better detect and find anomalies>
the levels of the head entities of these three triples are 0, 1 and 2 in sequence. A "provideWith" hyperedge is introduced, and the triples are connected through an instantiated edge "provide1", constructing the following new triples:
<This project,provide1,the analyst>
<provide1,provideWith,the ability>
<the ability,to,better detect and find anomalies>
finally, an RDFS file in ttl format is generated, as shown in FIG. 6. The hyperedge "provideWith" is defined as an RDF property, while the edge "provide1" is defined as an instance of an entity class labeled as a relation. In this way, the instantiated edges can be distinguished from other entities in queries by using different ontology concepts (defined with the onto prefix).
At query time, the desired results can be obtained by specifying the grammatical structure between the entities. Here the relation provideWith is used to constrain the predicate ?provide of a triple, indicating that the predicate provides a capability ?Cap through a hyperedge to meet the requirement ?Req. The specific query statement and results are shown in FIG. 7 and FIG. 8. It can be seen that, by specifying the grammatical structure in the query, the query statement correctly returns the required entity information.
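The query of FIG. 7 is not reproduced above; the sketch below shows one way such a hyperedge-constrained query might look under the assumptions of the earlier rdflib sketch, with graph contents, namespace and variable names purely illustrative.
from rdflib import Graph, Namespace, RDF

ONTO = Namespace("http://example.org/micrm#")   # hypothetical namespace, as before
g = Graph()
g.add((ONTO["This_project"], ONTO.provide1, ONTO["the_analyst"]))
g.add((ONTO.provide1, ONTO.provideWith, ONTO["the_ability"]))
g.add((ONTO.provide1, RDF.type, ONTO.provide))

query = """
PREFIX onto: <http://example.org/micrm#>
SELECT ?proj ?cap
WHERE {
  ?proj ?provide ?analyst .          # the project provides something to someone
  ?provide a onto:provide .          # ... through an instantiated "provide" edge
  ?provide onto:provideWith ?cap .   # the hyperedge attaches the provided capability
}
"""
for row in g.query(query):
    print(row.proj, row.cap)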
In summary, the invention provides a knowledge extraction and knowledge graph construction method based on dependency syntax rules for English information technology project text. The knowledge extraction method based on syntactic structure and rule templates does not require training a model on a large amount of data and is particularly suitable for extracting professional knowledge in a specific field without a pre-trained model. On this basis, after a certain number of entities and relations have been acquired, supervised and semi-supervised learning methods can be combined to train an extraction model suited to the characteristics of the domain, further improving the level of automation of knowledge graph construction. In addition, an integrated information technology project knowledge ontology structure, MI-CRM, is constructed for the domain characteristics of information technology projects; this ontology structure has good extensibility and backward compatibility, and the case experiments show that using the ontology together with RDF-based multi-element relation modeling can enrich the semantics of query results.
In one embodiment, there is provided a knowledge graph construction apparatus based on dependency syntax rules, the apparatus including:
the dependency syntax structure acquisition module is used for acquiring the dependency syntax structure of the original text; the dependency syntax structure is constructed by taking a word element obtained after word segmentation processing is carried out on an original text as a node; the dependency syntax structure comprises a plurality of candidate center words; the candidate center word is obtained by selecting according to the part of speech of the word element;
the entity expansion module is used for acquiring a pre-constructed expansion rule set, and expanding the current candidate center word to obtain a corresponding candidate entity if the current candidate center word meets any expansion rule in the expansion rule set;
the entity reorganization module is used for fusing and reorganizing the corresponding candidate entities if the candidate entities are crossed or adjacent to each other, so as to obtain an entity list;
the triplet generation module is used for reconstructing the dependency syntax structure of the original text according to the entity list to obtain the upper and lower relation of the entity, generating a plurality of triples according to the upper and lower relation, further constructing to obtain a corresponding knowledge graph, and constructing an intelligent question-answering system according to the knowledge graph.
For the specific definition of the knowledge graph construction apparatus based on dependency syntax rules, reference may be made to the definition of the knowledge graph construction method based on dependency syntax rules above, and the description is not repeated here. Each module in the knowledge graph construction apparatus described above may be implemented wholly or partly by software, hardware, or a combination thereof. The above modules may be embedded in, or independent of, a processor of a computer device in the form of hardware, or stored in a memory of the computer device in the form of software, so that the processor can invoke and execute the operations corresponding to the above modules.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as the combinations of technical features are not contradictory, they should be considered to fall within the scope of this description.
The above examples merely represent a few embodiments of the present application, which are described in more detail and are not to be construed as limiting the scope of the invention. It should be noted that it would be apparent to those skilled in the art that various modifications and improvements could be made without departing from the spirit of the present application, which would be within the scope of the present application. Accordingly, the scope of protection of the present application is to be determined by the claims appended hereto.

Claims (10)

1. A knowledge graph construction method based on a dependency syntax rule, the method comprising:
obtaining a dependency syntax structure of an original text; the dependency syntax structure is constructed by taking a word element obtained after word segmentation processing is carried out on the original text as a node; the dependency syntax structure comprises a plurality of candidate center words; the candidate center word is obtained by selecting according to the part of speech of the word element;
acquiring a pre-constructed expansion rule set, and expanding the current candidate center word to obtain a corresponding candidate entity if the current candidate center word meets any expansion rule in the expansion rule set;
if the candidate entities are crossed or adjacent, fusing and recombining the corresponding candidate entities to obtain an entity list;
reconstructing the dependency syntax structure of the original text according to the entity list to obtain the upper and lower relation of the entity, generating a plurality of triples according to the upper and lower relation, further constructing to obtain a corresponding knowledge graph, and constructing an intelligent question-answering system according to the knowledge graph.
2. The method of claim 1, wherein each expansion rule in the set of expansion rules is provided with its corresponding priority;
and if the current candidate center word meets a plurality of expansion rules in the expansion rule set, selecting the expansion rule with the highest priority to expand the current candidate center word.
3. The method according to claim 1, wherein the method further comprises:
predefining dependency syntax structure parameters based on dependency analysis; the dependency syntax structure parameters include:
id parameters; the id parameters include: the id of the current word, the id of its superior word, and the id of the super-superior word;
part-of-speech parameters; the part-of-speech parameters include: the part of speech of the current word, of its superior word, and of the super-superior word;
dependency arc parameters; the dependency arc parameters include: the dependency arc from the current word to its superior word, and the dependency arc from the superior word to the super-superior word.
4. The method of claim 3, wherein constructing the expansion rule set specifically comprises:
defining simple rules; the simple rules include:
relationship rules; the relation rule describes the grammatical dependency between the current word and its corresponding superior word and super-superior word; the grammatical relation is defined by the dependency arc parameters among the dependency syntax parameters;
part-of-speech rules; the part-of-speech rule describes the part of speech of the current word and of its corresponding superior word and super-superior word; the part-of-speech rule is defined by the part-of-speech parameters among the dependency syntax parameters;
a position rule; the position rule records the positions of the current word and its corresponding superior word and super-superior word in the original text; the position rule is defined by the id parameters among the dependency syntax parameters;
special rules; the special rules are used for defining specific phrases or fixed collocations;
and combining the simple rules according to actual requirements to obtain a plurality of expansion rules, thereby forming an expansion rule set.
5. The method of claim 1, wherein if there is a crossover or adjacency between the candidate entities, fusing and reorganizing the corresponding candidate entities to obtain an entity list, including:
acquiring a candidate entity list; the candidate entity list is obtained by sequencing the starting point positions of each candidate entity in the original text;
acquiring a current entity list; in the entity list, the entities are sequenced according to the starting point positions; the entity list is initialized by adopting a first candidate entity in the candidate entity list;
if the starting point position of the candidate entity of the current round is not greater than the final point position of the last entity in the current entity list and the final point position of the candidate entity of the current round is greater than the final point position of the last entity, updating the final point position of the last entity by using the final point position of the candidate entity of the current round to realize the fusion and recombination of the candidate entities, otherwise, adding the candidate entity of the current round to the end of the current entity list.
6. The method of claim 1, wherein reconstructing the dependency syntax structure of the original text from the entity list to obtain the context of the entity comprises:
reconstructing the dependency syntax structure of the original text according to the entity list, and acquiring the root word of the original text according to the obtained reconstructed dependency syntax structure;
defining the hop count of the root word of the original text as 0, wherein the hop count of the subject of the original text as-1, and the hop count of other entities is obtained from the distance from the corresponding center word to the root word; wherein, the jump value of 1 indicates that the center word of the entity can reach the root word only through one-time dependency relationship; the grammatical upper and lower relation of the word elements which are in parallel relation in grammar belongs to the same layer;
and obtaining the upper and lower relation of the entity according to the hop count value of the entity.
7. The method of claim 6, wherein generating a plurality of triples from the context comprises:
if the jump value of the entity is 1, taking the item name corresponding to the original text as the head entity of the entity;
if the hop count value of the entity is greater than 1, sequentially constructing triples according to the upper and lower relationship of the entity; the upper entity of the entity is a head entity, and the head of the root word of the entity is a predicate of a triplet.
8. The method of claim 7, wherein the method further comprises:
selecting whether a new expansion rule needs to be added or not by manually checking the generated entity and the upper and lower relationship thereof;
if no new expansion rule needs to be added, generating triples in RDF format from the entities that meet the requirements through the entity-relation parser.
9. The method of claim 1, wherein generating a plurality of triples according to the context, and further constructing to obtain a corresponding knowledge graph, comprises:
judging the corresponding predicates according to the entities and their upper and lower relationships, and, if a predicate meets the corresponding rule, marking the predicate as a hyperedge and storing the related entities and predicates connected to it;
after generating a plurality of initial triples according to the upper and lower relationships, instantiating the predicates marked as hyperedges to obtain instantiated edges, connecting the relevant initial triples through the instantiated edges, and constructing new triples.
10. A knowledge graph construction apparatus based on a dependency syntax rule, the apparatus comprising:
the dependency syntax structure acquisition module is used for acquiring the dependency syntax structure of the original text; the dependency syntax structure is constructed by taking a word element obtained after word segmentation processing is carried out on the original text as a node; the dependency syntax structure comprises a plurality of candidate center words; the candidate center word is obtained by selecting according to the part of speech of the word element;
the entity expansion module is used for acquiring a pre-constructed expansion rule set, and expanding the current candidate center word to obtain a corresponding candidate entity if the current candidate center word meets any expansion rule in the expansion rule set;
the entity reorganization module is used for fusing and reorganizing the corresponding candidate entities if the candidate entities are crossed or adjacent to each other, so as to obtain an entity list;
the triplet generation module is used for reconstructing the dependency syntax structure of the original text according to the entity list to obtain the upper and lower relation of the entity, generating a plurality of triples according to the upper and lower relation, further constructing and obtaining a corresponding knowledge graph, and constructing an intelligent question-answering system according to the knowledge graph.
CN202311331571.9A 2023-10-13 2023-10-13 Knowledge graph construction method and device based on dependency syntax rules Pending CN117273137A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311331571.9A CN117273137A (en) 2023-10-13 2023-10-13 Knowledge graph construction method and device based on dependency syntax rules

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311331571.9A CN117273137A (en) 2023-10-13 2023-10-13 Knowledge graph construction method and device based on dependency syntax rules

Publications (1)

Publication Number Publication Date
CN117273137A true CN117273137A (en) 2023-12-22

Family

ID=89215850

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311331571.9A Pending CN117273137A (en) 2023-10-13 2023-10-13 Knowledge graph construction method and device based on dependency syntax rules

Country Status (1)

Country Link
CN (1) CN117273137A (en)

Similar Documents

Publication Publication Date Title
EP1522930B1 (en) Method and apparatus for identifying semantic structures from text
US9323747B2 (en) Deep model statistics method for machine translation
JP4625178B2 (en) Automatic recognition of discourse structure of text body
US5966686A (en) Method and system for computing semantic logical forms from syntax trees
US20140250047A1 (en) Authoring system for bayesian networks automatically extracted from text
US20060253275A1 (en) Method and apparatus for determining unbounded dependencies during syntactic parsing
US20080086300A1 (en) Method and system for translating sentences between languages
US20090070099A1 (en) Method for translating documents from one language into another using a database of translations, a terminology dictionary, a translation dictionary, and a machine translation system
EP0473864A1 (en) Method and apparatus for paraphrasing information contained in logical forms
US20080086298A1 (en) Method and system for translating sentences between langauges
RU2679988C1 (en) Extracting information objects with the help of a classifier combination
JP3781561B2 (en) Natural language analysis device, system and recording medium
CN112306497A (en) Method and system for converting natural language into program code
Kamalabalan et al. Tool support for traceability of software artefacts
Abbas et al. A review of nlidb with deep learning: findings, challenges and open issues
CN112818092A (en) Knowledge graph query statement generation method, device, equipment and storage medium
CN109857458B (en) ANTLR-based AltaRica3.0 flattening transformation method
CN111291573A (en) Phrase semantic mining method driven by directed graph meaning guide model
CN113868382A (en) Method and device for extracting structured knowledge from Chinese natural language
US6879950B1 (en) System and method of decoding a packed representation of multiple parses
CN117273137A (en) Knowledge graph construction method and device based on dependency syntax rules
CN115935943A (en) Analysis framework supporting natural language structure calculation
US7143027B2 (en) Sentence realization system for use with unification grammars
Bohn et al. Constructing Deterministic Parity Automata from Positive and Negative Examples
Hirakawa Semantic dependency analysis method for Japanese based on optimum tree search algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination