CN116150407A - Method and system for constructing domain knowledge graph based on seed set expansion

Publication number
CN116150407A
Authority
CN
China
Prior art keywords
knowledge
entity
seed set
data
file
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310443200.3A
Other languages
Chinese (zh)
Inventor
刘淇
冯彬
阮书岚
陈恩红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Application filed by University of Science and Technology of China USTC
Priority to CN202310443200.3A
Publication of CN116150407A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36: Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367: Ontology
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a domain knowledge graph construction method and system based on seed set expansion. The method comprises the following steps: acquiring at least one initial document related to a target professional field, and performing format arrangement and document merging on the at least one initial document to obtain an initial seed set file, wherein the initial document comprises knowledge point data; de-duplicating the initial seed set file, and performing data preprocessing on the knowledge point data in the de-duplicated initial seed set file to obtain a knowledge pair format file; according to the knowledge pair format file, extracting triples containing knowledge point entities from a universal knowledge graph by traversal search to obtain a plurality of subgraphs having knowledge-entity-knowledge; and performing data cleaning and data screening on the plurality of subgraphs, and merging the processed subgraphs to generate a knowledge graph of the target professional field.

Description

Method and system for constructing domain knowledge graph based on seed set expansion
Technical Field
The invention relates to the fields of natural language processing, data mining, information retrieval and the like, and in particular to a seed set expansion-based domain knowledge graph construction method and system, an electronic device and a storage medium.
Background
A knowledge graph is a form of representing and organizing knowledge in computing: a large-scale semantic network with a directed graph structure that presents concepts and the semantic links between them. If two nodes are related, they are connected by an edge; the nodes are called entities and the edges between them are called relationships. The basic unit of a knowledge graph, and its core, is the entity-relation-entity triple. Knowledge graphs offer large scale, high knowledge quality and a refined structure. By organizing massive knowledge in structured form and assisting downstream tasks such as reasoning and mining, they currently have broad application prospects in many fields, such as smart education, smart justice, financial risk control and intelligent recommendation.
In the prior art, methods for knowledge graph construction mainly comprise knowledge extraction-based methods, ontology-based methods and statistics-based methods. However, these methods face certain difficulties and limitations when extracting a knowledge graph for a professional domain: they do not consider the domain specificity and complexity of professional domain knowledge, so the degree of expertise and automation of the domain knowledge graphs constructed with them is very limited.
Disclosure of Invention
In view of the above problems, the present invention provides a method, a system, an electronic device, and a storage medium for constructing a domain knowledge graph based on seed set expansion, so as to solve at least one of the above problems.
According to a first aspect of the present invention, there is provided a method for constructing a domain knowledge graph based on seed set expansion, including:
acquiring at least one initial document related to the target professional field, and performing format arrangement and document merging processing on the at least one initial document to obtain an initial seed set file, wherein the initial document comprises knowledge point data;
performing de-duplication treatment on the initial seed set file, and performing data preprocessing on knowledge point data in the de-duplication treated initial seed set file to obtain a knowledge pair format file;
according to the knowledge pair format file, extracting triples containing knowledge point entities in the universal knowledge graph in a traversing and searching mode to obtain a plurality of subgraphs with knowledge-entity-knowledge;
and performing data cleaning and data screening on the plurality of subgraphs, and performing graph merging on the plurality of processed subgraphs to generate a knowledge graph of the target professional field.
According to an embodiment of the present invention, the foregoing performing a deduplication process on an initial seed set file, and performing a data preprocessing on knowledge point data in the initial seed set file after the deduplication process, where obtaining a file in a knowledge pair format includes:
performing entry filtering on the initial seed set file, screening out repeated knowledge points in the initial seed set file, and obtaining a de-duplicated initial seed set file;
extracting stems of the knowledge point names in the de-duplicated initial seed set file by using the Porter stemming method to obtain stem extraction results;
performing secondary duplicate removal processing on the same knowledge points in the duplicate-removed initial seed set file according to the stem extraction result to obtain a secondary duplicate removal result;
and formatting the secondary de-duplication result to obtain a knowledge pair format file, wherein the knowledge pair format comprises knowledge point name-knowledge point interpretation.
According to an embodiment of the present invention, extracting, according to the knowledge pair format file, triples containing knowledge point entities from the universal knowledge graph by traversal search to obtain a plurality of subgraphs having knowledge-entity-knowledge includes:
obtaining a knowledge point-entity correspondence according to the knowledge pair format file;
according to the knowledge point-entity correspondence, extracting knowledge point-entity-knowledge point triples from the universal knowledge graph by traversal search;
obtaining, from the triples, a plurality of subgraphs having knowledge-entity-knowledge.
According to an embodiment of the present invention, the performing data cleaning and data screening on the plurality of sub-graphs, and performing graph merging processing on the plurality of processed sub-graphs, generating a knowledge graph in the target professional field includes:
calculating the frequency of each entity in the multiple subgraphs by using a pre-trained word frequency calculation model according to a preset word frequency threshold;
and under the condition that the frequency of the entity in the subgraph meets a preset word frequency threshold, performing data deletion operation on the triples corresponding to the entity in the subgraph, and completing data cleaning and screening to obtain a plurality of preprocessed subgraphs.
According to an embodiment of the present invention, the above-mentioned performing data cleaning and data screening on the plurality of sub-graphs, and performing a graph merging process on the plurality of processed sub-graphs, generating a knowledge graph in the target professional field further includes:
calculating cosine similarity between any two entities in the preprocessed multiple subgraphs by using a synonym comparison algorithm based on a word network model;
under the condition that the cosine similarity meets a preset similarity threshold, determining two entities corresponding to the cosine similarity as the same entity;
during graph merging, screening the two entities determined to be the same entity according to a preset screening target, so as to obtain the knowledge graph of the target professional field.
According to an embodiment of the present invention, the cosine similarity is calculated according to formula (1):
$$\mathrm{sim}(e_i, e_j) = \frac{e_i \cdot e_j}{\lVert e_i \rVert \, \lVert e_j \rVert} \qquad (1),$$
wherein $e_i$ denotes the $i$-th entity node, $e_j$ denotes the $j$-th entity node, and $\cdot$ denotes the inner product operation on vectors.
According to an embodiment of the present invention, the preset screening target is calculated by formula (2):
$$e^{*} = \arg\max_{e \in \{e_i, e_j\}} f(e) \qquad (2),$$
wherein $e_i$ denotes the $i$-th entity node, $e_j$ denotes the $j$-th entity node, and $f(\cdot)$ denotes the word frequency of an entity.
According to a second aspect of the present invention, there is provided a domain knowledge graph construction system based on seed set expansion, comprising:
the seed set file acquisition module is used for acquiring at least one initial document related to the target professional field, and carrying out format arrangement and document merging processing on the at least one initial document to obtain an initial seed set file, wherein the initial document comprises knowledge point data;
the knowledge pair format file acquisition module is used for carrying out de-duplication treatment on the initial seed set file, and carrying out data preprocessing on knowledge point data in the initial seed set file after the de-duplication treatment to obtain a knowledge pair format file;
the sub-graph acquisition module is used for extracting triples containing knowledge point entities in the universal knowledge graph in a traversing search mode according to the knowledge pair format file to obtain a plurality of sub-graphs with knowledge-entity-knowledge;
and the target knowledge graph generation module is used for carrying out data cleaning and data screening on the plurality of subgraphs, carrying out graph merging processing on the plurality of processed subgraphs, and generating a knowledge graph in the target professional field.
According to a third aspect of the present invention, there is provided an electronic device comprising:
one or more processors;
storage means for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the seed set expansion-based domain knowledge graph construction method.
According to a fourth aspect of the present invention, there is provided a computer-readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to perform the seed set expansion-based domain knowledge graph construction method.
Drawings
FIG. 1 is a flow chart of a method of domain knowledge graph construction based on seed set expansion, in accordance with an embodiment of the invention;
FIG. 2 is a flow chart of acquiring a knowledge pair formatted file, according to an embodiment of the invention;
FIG. 3 is a flow diagram of obtaining multiple subgraphs with knowledge-entity-knowledge in accordance with an embodiment of the invention;
FIG. 4 is a flow chart of generating a knowledge-graph of a target area of expertise, according to an embodiment of the invention;
FIG. 5 is a flow chart for acquiring knowledge points and preprocessing the knowledge points in accordance with another embodiment of the invention;
FIG. 6 is a flow chart of a method of constructing a domain knowledge graph from a seed set in accordance with another embodiment of the invention;
FIG. 7 is a schematic structural diagram of a seed set expansion-based domain knowledge graph construction system in accordance with an embodiment of the invention;
FIG. 8 schematically shows a block diagram of an electronic device adapted to implement the seed set expansion-based domain knowledge graph construction method, according to an embodiment of the invention.
Detailed Description
The present invention will be further described in detail below with reference to specific embodiments and with reference to the accompanying drawings, in order to make the objects, technical solutions and advantages of the present invention more apparent.
Regarding the construction of knowledge graph, the following three methods mainly exist in the prior art: knowledge extraction-based methods, ontology-based methods, and statistical-based methods.
The knowledge extraction-based method mainly uses natural language processing technology to extract information such as entities and relations from text and converts it into a structured knowledge representation so as to construct a knowledge graph. It applies techniques such as word segmentation, part-of-speech tagging, named entity recognition and syntactic analysis to process and analyze text and to extract entities, attributes, relations and similar information. This method can process large amounts of unstructured data automatically, so that the knowledge graph is constructed more efficiently and accurately.
The ontology-based method constructs a knowledge graph through ontology modeling. An ontology is a model that describes the essential attributes, relationships and concepts of a domain or a specific problem. When the knowledge graph is constructed, relevant information is mapped onto the ontology model to generate the knowledge graph. Ontology-based methods can improve the accuracy and consistency of the knowledge graph, because the ontology model is carefully designed and defined and follows strict logical and semantic specifications.
The statistical-based method is a method for constructing a knowledge graph by using a statistical learning method. This approach first requires a large amount of data to be collected and then trained using machine learning algorithms to learn the relationships and rules between the data. Finally, a knowledge graph is constructed using these relationships and rules. The statistical-based method can efficiently process a large amount of data and is suitable for constructing a large-scale knowledge graph. However, it may suffer from errors and ambiguity because it is based solely on the data itself, without considering the semantic and logical structure of knowledge.
However, in practical application scenarios and data sets, the above methods face certain difficulties and limitations when extracting a knowledge graph for a professional domain. On the one hand, domain knowledge is highly specialized, and deep domain knowledge and experience are needed to extract and model it accurately. On the other hand, professional domain knowledge generally lacks large-scale structured data and is usually available only as unstructured data written by experts, such as books, regulations and reference manuals, which makes it difficult to support statistics-based methods. In fields such as programming education, judicial education and basic discipline education, constructing a domain knowledge graph is valuable for helping students learn professional domain knowledge and clarify the connections and distinctions between different knowledge concepts. Existing general methods do not consider the domain specificity and complexity of professional domain knowledge, and offer limited expertise and automation when constructing domain knowledge graphs. How to construct a domain knowledge graph automatically from knowledge point data, such as the professional terms and language concepts of a domain, is therefore both challenging and significant.
In order to solve the technical problems in the prior art, the invention provides a domain knowledge graph construction method for professional disciplines, including programming education, judicial education, basic discipline education and the like. Taking programming education as an example, knowledge point data related to programming scenarios is first obtained from massive web data (including books, regulations and reference manuals of the professional field) to obtain an initial document set. The knowledge point data may be terminology of specific languages, such as Python, Java or C++, or general programming language concepts, such as constants, variables, identifiers, keywords and floating-point numbers. Next, the initial document set is preprocessed: repeated data and data irrelevant to the professional field are filtered out, which reduces the candidate document set and improves the efficiency of the method, and the knowledge point data is preprocessed uniformly, including lemmatization, de-duplication and synonym merging, so that the data takes a unified format for subsequent calculation and a knowledge point seed set is obtained. Finally, according to the predefined seed set, entities of the programming data are retrieved from the universal knowledge graph and the triples containing knowledge points are extracted, yielding the domain knowledge graph.
It should be specifically noted that, in the technical scheme disclosed by the invention, the authorization of the relevant data document owners is obtained when data documents of the target professional field are acquired; the data documents are processed, applied and stored with the permission of their owners; the relevant processes comply with laws and regulations; necessary and reliable confidentiality measures are adopted; and the requirements of public order and good customs are met.
Fig. 1 is a flowchart of a method for constructing a domain knowledge graph based on seed set expansion, according to an embodiment of the present invention.
As shown in FIG. 1, the method for constructing the domain knowledge graph based on seed set expansion comprises operations S110-S140.
At operation S110, at least one initial document related to the target professional field is acquired, and format arrangement and document merging processing are performed on the at least one initial document to obtain an initial seed set file, wherein knowledge point data is included in the initial document.
In the embodiment of the invention, knowledge point data related to the professional field and of high authority or recognition is acquired from massive Web documents and books as the initial document set. Taking programming education as an example, the initial document set includes two types: general programming grammar rules (e.g., identifiers, abstract classes) and terminology of multiple languages (e.g., Python, Java). The sources are correspondingly of two types. The former comes from the Oxford English-Chinese bilingual computer dictionary, a Chinese-English software engineering glossary, a book on software design refactoring, and open source data on GitHub; the latter comes from the official development documentation of various languages, including Python, Java, C++, HTML, JavaScript, C# and PHP.
In the embodiment of the invention, the initial documents, which come in different file formats (such as txt, Word and csv), are format-arranged, merged into the unified txt format and placed into a unified seed set file.
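The format arrangement and merging step can be sketched as follows. This is a minimal illustration, not the implementation of the invention: the function names `read_entries` and `merge_to_seed_file` are hypothetical, and the file layouts (tab-separated txt lines, two-column csv rows) are assumptions, since the embodiment does not specify them. Word documents would need an additional converter, which is omitted here.

```python
import csv
from pathlib import Path


def read_entries(path):
    """Yield (name, explanation) pairs from one source file.

    Assumed layouts (not specified in the source document):
    .csv -> two columns: name, explanation
    other -> one entry per line, name and explanation separated by a tab
    """
    path = Path(path)
    if path.suffix.lower() == ".csv":
        with path.open(newline="", encoding="utf-8") as fh:
            for row in csv.reader(fh):
                if row and row[0].strip():
                    explanation = row[1].strip() if len(row) > 1 else ""
                    yield row[0].strip(), explanation
    else:
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if line:
                    name, _, explanation = line.partition("\t")
                    yield name.strip(), explanation.strip()


def merge_to_seed_file(sources, out_path):
    """Write every entry from `sources` into one tab-separated txt file.

    Returns the number of entries written. Duplicates are kept on purpose;
    de-duplication is a separate, later step.
    """
    count = 0
    with Path(out_path).open("w", encoding="utf-8") as out:
        for src in sources:
            for name, explanation in read_entries(src):
                out.write(f"{name}\t{explanation}\n")
                count += 1
    return count
```

Keeping the merge separate from de-duplication mirrors the order of operations S110 and S120.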
In operation S120, the initial seed set file is subjected to deduplication processing, and knowledge point data in the initial seed set file after deduplication processing is subjected to data preprocessing, so as to obtain a knowledge pair format file.
In the embodiment of the invention, entry filtering is carried out on the initial seed set file, and the repeated knowledge in the file is screened out. Because the initial knowledge sources differ in content and coverage, the knowledge points repeat to a certain extent; removing the repeats shrinks the seed set and improves the efficiency of the system.
In the embodiment of the invention, the data preprocessing of the de-duplicated seed set file comprises stem extraction and further de-duplication. Knowledge from different sources may express the same knowledge in different word forms, so merging is needed. The stems of the knowledge point names in the seed set file are extracted with the Porter stemming method from the Natural Language Toolkit (NLTK) library, and the results are de-duplicated so that identical knowledge points are removed a second time. The further de-duplicated results are then unified into the knowledge pair format "knowledge point name-knowledge point interpretation" to facilitate subsequent calculation.
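The two de-duplication passes and the knowledge pair format can be sketched as below. This is an illustrative sketch under stated assumptions: the function names are hypothetical, a knowledge pair is represented as a `(name, explanation)` tuple, and the first spelling encountered is the one kept (the embodiment does not say which variant survives). `crude_stem` is only a stand-in for environments without NLTK and is not the Porter algorithm.

```python
def crude_stem(word):
    """Stand-in stemmer: strips a plural 's'. NOT the Porter algorithm."""
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


try:
    # Porter stemming from NLTK, as named in the text.
    from nltk.stem import PorterStemmer
    default_stem = PorterStemmer().stem
except ImportError:
    default_stem = crude_stem


def stem_key(name, stem=default_stem):
    """Normalize a knowledge point name to a key of lower-cased word stems."""
    return " ".join(stem(tok) for tok in name.lower().split())


def build_knowledge_pairs(entries, stem=default_stem):
    """Deduplicate (name, explanation) entries and return knowledge pairs.

    Pass 1 drops exact repeats of a name; pass 2 drops names whose stems
    coincide with those of an earlier name.
    """
    seen_names, seen_keys, pairs = set(), set(), []
    for name, explanation in entries:
        name = name.strip()
        if not name or name in seen_names:
            continue  # pass 1: exact duplicate (or empty name)
        seen_names.add(name)
        key = stem_key(name, stem)
        if key in seen_keys:
            continue  # pass 2: same stems as an earlier knowledge point
        seen_keys.add(key)
        pairs.append((name, explanation))
    return pairs
```

For example, with either stemmer "variable" and "Variables" share one stem key, so only the first of the two is kept.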
In operation S130, according to the knowledge pair format file, triples containing knowledge point entities are extracted from the universal knowledge graph by traversal search, so as to obtain a plurality of subgraphs having knowledge-entity-knowledge.
Given the professional-domain seed set $S$ obtained in operation S120 and several public general knowledge graphs (ConceptNet, DBpedia and Wikidata) whose facts are stored as triples $(h, r, t)$, each knowledge point $k \in S$ is matched against the head entity $h$ or the tail entity $t$ of the triples. The triples containing a knowledge point entity are extracted by traversal search, which yields the relations between that knowledge point entity and other knowledge point entities or other entities. One subgraph is obtained from each general knowledge graph: from the seed set $S$ and the general knowledge graph $G_i$, the subgraph $G_i'$ is extracted, i.e.
$$G_i' = \{(h, r, t) \in G_i \mid h \in S \ \text{or} \ t \in S\}.$$
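The traversal search can be illustrated with a short sketch. It assumes that each general knowledge graph is already available in memory as a list of `(head, relation, tail)` string triples and that matching is by case-insensitive exact name; both are assumptions made for illustration, and the function names are hypothetical.

```python
def extract_subgraph(triples, seed_names):
    """Return the triples whose head or tail matches a seed knowledge point.

    Matching is by case-insensitive exact name (an assumption of this sketch).
    """
    seeds = {name.lower() for name in seed_names}
    return [
        (h, r, t)
        for h, r, t in triples
        if h.lower() in seeds or t.lower() in seeds
    ]


def extract_subgraphs(graphs, seed_names):
    """Map each general knowledge graph name to its extracted subgraph."""
    return {
        name: extract_subgraph(triples, seed_names)
        for name, triples in graphs.items()
    }
```

Each graph is scanned once, so the cost is linear in the total number of triples, with constant-time seed lookups through the set.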
In operation S140, data cleaning and data screening are performed on the plurality of subgraphs, and the processed subgraphs are merged to generate a knowledge graph of the target professional field.
In the embodiment of the invention, low-frequency knowledge is cleaned first, so that rare and obscure knowledge is filtered out. Taking programming education as an example, the specific practice is as follows. Word frequencies of the knowledge points in the seed set are obtained with the Python package wordfreq, whose frequencies are precomputed on independent corpora. An empirical threshold of 1e-06 is set, and if the frequency of an entity is below the threshold, the triples containing it are filtered out. Attention then turns to the $N$ most frequent relations, from which knowledge specific to other domains is removed manually; here $N$ is set to 500. These top $N$ relations are labeled as follows: a label of 0 indicates a relation in the programming field, which is not deleted; a label of 1 indicates a professional relation of another field (the triples containing the relation are removed and a blacklist of nodes is created, since triples containing such nodes are not common-sense knowledge); a label of 2 indicates that the relation is not a relation in the programming field, but its nodes may still be programming knowledge points (only the triples containing the relation are removed). For a relation labeled 1, the nodes involved in that relation are assumed to be nodes of the other professional field and are added to the blacklist; the triples are then iterated over to remove those that contain blacklisted nodes.
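The cleaning rules above can be sketched as follows. This is an illustrative sketch, not the implementation of the invention. The frequency lookup is passed in as a callable `freq`, so that wordfreq can be plugged in as `lambda w: word_frequency(w, "en")`; the function names, the choice to test both head and tail against the threshold, and the treatment of unlabeled relations as label 0 are all assumptions. The blacklist is built in a single pass and is not expanded further.

```python
from collections import Counter


def filter_rare(triples, freq, threshold=1e-6):
    """Drop triples whose head or tail entity is rarer than `threshold`."""
    return [
        (h, r, t)
        for h, r, t in triples
        if freq(h) >= threshold and freq(t) >= threshold
    ]


def top_relations(triples, n=500):
    """Return the n most frequent relations, the ones to be labeled by hand."""
    counts = Counter(r for _, r, _ in triples)
    return [rel for rel, _ in counts.most_common(n)]


def apply_relation_labels(triples, labels):
    """Apply manual labels: 0 keep, 1 drop and blacklist nodes, 2 drop only.

    Relations missing from `labels` are treated as 0 (an assumption).
    Returns the kept triples and the node blacklist.
    """
    blacklist = set()
    for h, r, t in triples:
        if labels.get(r, 0) == 1:
            blacklist.update((h, t))
    kept = [
        (h, r, t)
        for h, r, t in triples
        if labels.get(r, 0) == 0 and h not in blacklist and t not in blacklist
    ]
    return kept, blacklist
```

Separating the blacklist construction from the final filter means a triple with a harmless relation is still removed when one of its nodes was blacklisted through a label-1 relation.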
In the embodiment of the invention, entity matching similarity is first calculated with a WordNet-based synonym comparison algorithm: for every pair of entities $e_i$, $e_j$ that may be aligned, the similarity between the two entities is calculated with the cosine similarity formula. If the similarity is higher than the set threshold $\theta$, the two are regarded as the same entity; here $\theta$ is set to 0.9. When entities are merged across graphs, the occurrence frequency of each entity is calculated with wordfreq, and the entity with the higher frequency is retained as the merged entity node $e^{*}$.
According to an embodiment of the present invention, the cosine similarity is calculated according to formula (1):
$$\mathrm{sim}(e_i, e_j) = \frac{e_i \cdot e_j}{\lVert e_i \rVert \, \lVert e_j \rVert} \qquad (1),$$
wherein $e_i$ denotes the $i$-th entity node, $e_j$ denotes the $j$-th entity node, and $\cdot$ denotes the inner product operation on vectors.
According to an embodiment of the present invention, the preset screening target is calculated by formula (2):
$$e^{*} = \arg\max_{e \in \{e_i, e_j\}} f(e) \qquad (2),$$
wherein $e_i$ denotes the $i$-th entity node, $e_j$ denotes the $j$-th entity node, and $f(\cdot)$ denotes the word frequency of an entity.
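Formulas (1) and (2) can be combined into a small merging routine, sketched below under stated assumptions. The embodiment does not say how the WordNet-based algorithm turns an entity into a vector, so the sketch simply assumes that each entity already has a numeric vector; the function names and the `freq` callable (for instance a wrapper around wordfreq) are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Formula (1): inner product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def merge_entities(vectors, freq, threshold=0.9):
    """Map every entity to the representative node it is merged into.

    Pairs whose similarity is higher than `threshold` are treated as the same
    entity; per formula (2), the more frequent of the two is kept.
    """
    rep = {e: e for e in vectors}

    def find(e):
        # Follow merge links until the current representative is reached.
        while rep[e] != e:
            e = rep[e]
        return e

    names = sorted(vectors)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            ra, rb = find(a), find(b)
            if ra != rb and cosine_similarity(vectors[a], vectors[b]) > threshold:
                keep, drop = (ra, rb) if freq(ra) >= freq(rb) else (rb, ra)
                rep[drop] = keep
    return {e: find(e) for e in vectors}
```

The pairwise comparison is quadratic in the number of entities, which is acceptable for a sketch; a larger graph would likely need candidate blocking before comparison.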
The method for constructing a domain knowledge graph based on seed set expansion provided by the embodiment of the invention applies existing graph construction technology to professional domain scenarios. Knowledge points are acquired and the knowledge point data is preprocessed uniformly; entities of the professional domain data are then retrieved from the universal knowledge graph according to the predefined professional domain seed set, which realizes knowledge point extraction and relation extraction for the domain data. The multiple subgraphs obtained are merged, which compensates for the limited knowledge sources of any single graph and enlarges the coverage of knowledge, and the domain knowledge graph is finally obtained. The method has a simple procedure, low computational complexity and high interpretability, and can achieve good results.
FIG. 2 is a flow chart of acquiring a knowledge pair format file, according to an embodiment of the invention.
As shown in FIG. 2, de-duplicating the initial seed set file and performing data preprocessing on the knowledge point data in the de-duplicated initial seed set file to obtain the knowledge pair format file includes operations S210 to S240.
In operation S210, entry filtering is performed on the initial seed set file, and repeated knowledge points in the initial seed set file are screened out, so as to obtain a de-duplicated initial seed set file.
In operation S220, the stems of the knowledge point names in the de-duplicated initial seed set file are extracted by using the Porter stemming method, so as to obtain a stem extraction result.
In operation S230, according to the stem extraction result, the same knowledge points in the initial seed set file after the duplication removal are subjected to secondary duplication removal processing, so as to obtain a secondary duplication removal result.
In operation S240, the secondary deduplication result is formatted to obtain a knowledge pair format file, where the knowledge pair format includes knowledge point name-knowledge point interpretation.
FIG. 3 is a flow diagram of obtaining multiple subgraphs with knowledge-entity-knowledge in accordance with an embodiment of the invention.
As shown in FIG. 3, extracting, according to the knowledge pair format file, triples containing knowledge point entities from the universal knowledge graph by traversal search to obtain a plurality of subgraphs having knowledge-entity-knowledge includes operations S310 to S330.
In operation S310, a knowledge point-entity correspondence is obtained according to the knowledge pair format file.
In operation S320, according to the knowledge point-entity correspondence, knowledge point-entity-knowledge point triples are extracted from the universal knowledge graph by traversal search.
In operation S330, a plurality of sub-graphs having knowledge-entity-knowledge are obtained from the triples.
Fig. 4 is a flowchart of generating a knowledge-graph of a target area of expertise, according to an embodiment of the invention.
As shown in FIG. 4, performing data cleaning and data screening on the plurality of subgraphs and merging the processed subgraphs to generate the knowledge graph of the target professional field includes operations S410 to S450.
In operation S410, the frequency of each entity in the plurality of sub-graphs is calculated using a pre-trained word frequency calculation model according to a preset word frequency threshold.
In operation S420, under the condition that the frequency of the entity in the subgraph meets the preset word frequency threshold, performing data deletion operation on the triples corresponding to the entity in the subgraph, and completing data cleaning and screening to obtain a plurality of preprocessed subgraphs.
In operation S430, the cosine similarity between any two entities in the preprocessed plurality of sub-graphs is calculated using a WordNet-based synonym comparison algorithm.
In operation S440, in case the cosine similarity satisfies a preset similarity threshold, two entities corresponding to the cosine similarity are determined as the same entity.
In operation S450, during the graph merging process, two entities determined to be the same entity are screened according to a preset screening target, so as to obtain the knowledge graph of the target professional field.
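Operations S430 to S450 can be illustrated with the following sketch. It assumes that each entity already has a numeric vector (how the WordNet-based algorithm produces such vectors is not specified here), that the similarity threshold is 0.9, and that the screening target keeps the more frequent of two matched entities; the vectors and frequencies are invented for illustration.

```python
import math


def cosine_similarity(u, v):
    """Inner product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def merge_map(vectors, freq, threshold=0.9):
    """Map each entity to the representative entity it is merged into.

    Entities are visited in descending frequency order, so a representative
    always has a frequency at least as high as the entities merged into it.
    """
    names = sorted(vectors, key=lambda n: (-freq.get(n, 0), n))
    representative = {}
    kept = []
    for name in names:
        for rep in kept:
            if cosine_similarity(vectors[name], vectors[rep]) >= threshold:
                representative[name] = rep
                break
        else:
            # No sufficiently similar representative yet: keep as its own node.
            representative[name] = name
            kept.append(name)
    return representative


# Hypothetical entity vectors and word frequencies.
vectors = {"array": [1.0, 0.0], "arrays": [0.99, 0.05], "loop": [0.0, 1.0]}
freq = {"array": 10, "arrays": 3, "loop": 7}
mapping = merge_map(vectors, freq)
```

With these values "arrays" is merged into "array", while "loop" remains a separate node.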
The method for constructing the domain knowledge graph based on the seed set expansion is further described in detail below by means of another embodiment of the present invention and with reference to fig. 5 and 6.
FIG. 5 is a flow chart for acquiring knowledge points and preprocessing the knowledge points in accordance with another embodiment of the invention.
FIG. 6 is a flowchart of constructing a domain knowledge graph from a seed set, in accordance with another embodiment of the invention.
As shown in FIG. 5, knowledge point data related to the professional field, such as books, regulations, and reference manuals in that field, is first obtained from massive Web resources to form an initial document set.
The initial document set is then unified in format and merged to obtain an initial seed set; the initial seed set is de-duplicated to delete duplicate knowledge from different sources.
The de-duplicated seed set then undergoes data preprocessing: stems are extracted and a further round of de-duplication is performed, and the fully de-duplicated result is formatted into the unified knowledge format of "knowledge point name-knowledge point interpretation".
In summary, as shown in FIG. 5, the initial document set undergoes format unification, a first de-duplication, stem extraction, a second de-duplication, and conversion to the unified knowledge pair format to obtain the final seed set file.
Based on the knowledge point seed set obtained in the previous step and on multiple public large-scale universal knowledge graphs such as ConceptNet, DBpedia, and Wikidata, triples containing knowledge point entities are extracted by traversal search, so that a domain knowledge subgraph is extracted from each universal knowledge graph.
The extracted subgraphs are then further cleaned and screened: rarely used, obscure knowledge is filtered out and knowledge from other fields is removed, so that high-quality domain knowledge is obtained.
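The cleaning step can be sketched as a frequency filter. This sketch swaps the pre-trained word frequency calculation model for plain occurrence counts within the extracted subgraphs, and assumes that a triple is dropped when either of its entities occurs fewer than `min_count` times; both choices, and the sample triples, are illustrative assumptions.

```python
from collections import Counter


def entity_frequencies(subgraphs):
    """Count how often each entity appears as a head or tail across subgraphs."""
    counts = Counter()
    for triples in subgraphs:
        for head, _relation, tail in triples:
            counts[head] += 1
            counts[tail] += 1
    return counts


def drop_rare(subgraphs, min_count=2):
    """Remove triples in which either entity occurs fewer than min_count times."""
    counts = entity_frequencies(subgraphs)
    return [
        [
            (h, r, t)
            for h, r, t in triples
            if counts[h] >= min_count and counts[t] >= min_count
        ]
        for triples in subgraphs
    ]


# Hypothetical subgraphs extracted from two universal knowledge graphs.
subgraphs = [
    [("array", "IsA", "data structure"), ("array", "UsedFor", "indexing")],
    [("array", "RelatedTo", "data structure"), ("quokka", "IsA", "marsupial")],
]
cleaned = drop_rare(subgraphs)
```

In this toy data the triples involving "indexing", "quokka", and "marsupial" are removed because those entities appear only once.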
The resulting subgraphs are then fused: entity matching similarity is calculated with a WordNet-based synonym comparison algorithm, relations are semantically matched against the ConceptNet relation classes using a pre-trained language model, and the entity with the higher occurrence frequency is retained as the merged entity node.
Taking the acquisition of a knowledge graph in the programming field as an example, as shown in fig. 6, firstly, acquiring a universal knowledge graph, extracting knowledge points in the universal knowledge graph, which are associated with entities in a seed set file, according to the seed set file, and generating a subgraph of the universal knowledge graph; then cleaning and filtering the generated subgraph to obtain a preprocessed subgraph; and merging the preprocessed subgraphs again to obtain a knowledge graph in the programming field.
According to the technical scheme provided by the invention, in constructing the domain graph, a large amount of knowledge point data related to the professional field is acquired from massive web data, books, regulations, reference manuals, and the like in that field, and the initial document set is preprocessed to obtain a knowledge point seed set in a standard form. In addition, according to the predefined seed set, entities of the professional data are retrieved from large-scale universal knowledge graphs, triples containing knowledge points are extracted, and the resulting subgraphs are merged to accurately construct a domain knowledge graph.
Fig. 7 is a schematic structural diagram of a domain knowledge graph construction system based on seed set expansion according to an embodiment of the invention.
As shown in fig. 7, the system 700 includes a seed set file acquisition module 710, a knowledge pair format file acquisition module 720, a sub-graph acquisition module 730, and a target knowledge graph generation module 740.
The seed set file obtaining module 710 is configured to obtain at least one initial document related to the target professional field, and perform format arrangement and document merging processing on the at least one initial document to obtain an initial seed set file, where the initial document includes knowledge point data.
The knowledge pair format file obtaining module 720 is configured to perform deduplication processing on the initial seed set file, and perform data preprocessing on knowledge point data in the initial seed set file after the deduplication processing, so as to obtain a knowledge pair format file.
The sub-graph obtaining module 730 is configured to extract the triples containing knowledge point entities from the universal knowledge graph by traversal search according to the knowledge pair format file, so as to obtain a plurality of sub-graphs having knowledge-entity-knowledge.
The target knowledge graph generating module 740 is configured to perform data cleaning and data screening on the multiple subgraphs, and perform graph merging processing on the multiple processed subgraphs to generate a knowledge graph in the target professional field.
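The way modules 710 to 740 chain together can be sketched as a small pipeline object. The class name, field names, and the trivial stage functions below are hypothetical placeholders, intended only to show the data flow from initial documents to the merged graph.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class KnowledgeGraphPipeline:
    acquire_seed_set: Callable[[Any], Any]       # role of module 710
    build_knowledge_pairs: Callable[[Any], Any]  # role of module 720
    extract_subgraphs: Callable[[Any], Any]      # role of module 730
    merge_subgraphs: Callable[[Any], Any]        # role of module 740

    def run(self, initial_documents):
        """Pass the output of each stage to the next, in module order."""
        seed_set = self.acquire_seed_set(initial_documents)
        pairs = self.build_knowledge_pairs(seed_set)
        subgraphs = self.extract_subgraphs(pairs)
        return self.merge_subgraphs(subgraphs)


# Placeholder stages: flatten documents, de-duplicate, attach one toy triple
# per knowledge point, then concatenate the subgraphs.
pipeline = KnowledgeGraphPipeline(
    acquire_seed_set=lambda docs: [line for doc in docs for line in doc],
    build_knowledge_pairs=lambda seeds: sorted(set(seeds)),
    extract_subgraphs=lambda pairs: [[(p, "RelatedTo", "programming")] for p in pairs],
    merge_subgraphs=lambda subgraphs: [t for g in subgraphs for t in g],
)
graph = pipeline.run([["loop", "array"], ["array"]])
```

Keeping each stage behind a callable lets any one module be replaced without touching the others.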
Fig. 8 schematically shows a block diagram of an electronic device adapted to implement the domain knowledge graph construction method based on seed set expansion, according to an embodiment of the invention.
As shown in fig. 8, an electronic device 800 according to an embodiment of the present invention includes a processor 801 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 802 or a program loaded from a storage section 808 into a Random Access Memory (RAM) 803. The processor 801 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or an associated chipset and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), or the like. The processor 801 may also include on-board memory for caching purposes. The processor 801 may comprise a single processing unit or multiple processing units for performing the different actions of the method flows according to embodiments of the invention.
In the RAM 803, various programs and data required for the operation of the electronic device 800 are stored. The processor 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. The processor 801 performs various operations of the method flow according to the embodiment of the present invention by executing programs in the ROM 802 and/or the RAM 803. Note that the program may be stored in one or more memories other than the ROM 802 and the RAM 803. The processor 801 may also perform various operations of the method flow according to embodiments of the present invention by executing programs stored in one or more memories.
According to an embodiment of the invention, the electronic device 800 may further comprise an input/output (I/O) interface 805, the input/output (I/O) interface 805 also being connected to the bus 804. The electronic device 800 may also include one or more of the following components connected to an input/output (I/O) interface 805: an input portion 806 including a keyboard, mouse, etc.; an output portion 807 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and a speaker; a storage section 808 including a hard disk or the like; and a communication section 809 including a network interface card such as a LAN card, a modem, or the like. The communication section 809 performs communication processing via a network such as the internet. The drive 810 is also connected to an input/output (I/O) interface 805 as needed. A removable medium 811 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 810 as needed so that a computer program read out therefrom is mounted into the storage section 808 as needed.
The present invention also provides a computer-readable storage medium that may be embodied in the apparatus/device/system described in the above embodiments; or may exist alone without being assembled into the apparatus/device/system. The computer-readable storage medium carries one or more programs which, when executed, implement methods in accordance with embodiments of the present invention.
According to embodiments of the present invention, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example, but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the invention, the computer-readable storage medium may include ROM 802 and/or RAM 803 and/or one or more memories other than ROM 802 and RAM 803 described above.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The foregoing embodiments have been provided to illustrate the general principles of the present invention and are not intended to limit the scope of the invention.

Claims (10)

1. A domain knowledge graph construction method based on seed set expansion, characterized by comprising the following steps:
acquiring at least one initial document related to a target professional field, and performing format arrangement and document merging processing on at least one initial document to obtain an initial seed set file, wherein the initial document comprises knowledge point data;
performing de-duplication treatment on the initial seed set file, and performing data preprocessing on knowledge point data in the de-duplication treated initial seed set file to obtain a knowledge pair format file;
extracting triples containing knowledge point entities in the universal knowledge graph by traversing and searching according to the knowledge pair format file to obtain a plurality of subgraphs with knowledge-entity-knowledge;
and performing data cleaning and data screening on the plurality of subgraphs, and performing graph merging processing on the plurality of processed subgraphs to generate a knowledge graph of the target professional field.
2. The method of claim 1, wherein performing deduplication processing on the initial seed set file, and performing data preprocessing on knowledge point data in the deduplicated initial seed set file, to obtain a knowledge pair format file comprises:
performing entry filtering on the initial seed set file, screening out repeated knowledge points in the initial seed set file, and obtaining a de-duplicated initial seed set file;
extracting stems of knowledge point names in the de-duplicated initial seed set file by using the Porter stemming algorithm to obtain stem extraction results;
performing secondary duplicate removal processing on the same knowledge points in the duplicate-removed initial seed set file according to the stem extraction result to obtain a secondary duplicate removal result;
and formatting the secondary de-duplication result to obtain a knowledge pair format file, wherein the knowledge pair format comprises knowledge point name-knowledge point interpretation.
3. The method of claim 1, wherein extracting the triples containing the knowledge point entities in the universal knowledge graph by way of traversal search according to the knowledge pair format file, to obtain a plurality of sub-graphs having knowledge-entity-knowledge, comprises:
obtaining a knowledge point-entity correspondence according to the knowledge pair format file;
extracting knowledge point-entity-knowledge point triples from the universal knowledge graph by traversal search according to the knowledge point-entity correspondence;
and obtaining a plurality of subgraphs with knowledge-entity-knowledge according to the triples.
4. The method of claim 1, wherein performing data cleaning and data screening on the plurality of subgraphs, and performing graph merging processing on the plurality of processed subgraphs to generate the knowledge graph of the target professional field comprises:
calculating the frequency of each entity in the plurality of subgraphs by using a pre-trained word frequency calculation model according to a preset word frequency threshold;
and under the condition that the frequency of the entity in the subgraph meets the preset word frequency threshold, performing data deletion operation on the triples corresponding to the entity in the subgraph, and completing data cleaning and screening to obtain a plurality of preprocessed subgraphs.
5. The method as recited in claim 4, further comprising:
calculating cosine similarity between any two entities in the preprocessed multiple subgraphs by using a synonym comparison algorithm based on a word network model;
under the condition that the cosine similarity meets a preset similarity threshold, determining two entities corresponding to the cosine similarity as the same entity;
in the process of graph merging, screening two entities determined to be the same entity according to a preset screening target, so as to obtain the knowledge graph of the target professional field.
6. The method of claim 5, wherein the cosine similarity is calculated according to formula (1):
$$\cos(e_i, e_j) = \frac{e_i \cdot e_j}{\lVert e_i \rVert \, \lVert e_j \rVert} \quad (1),$$
wherein $e_i$ denotes the $i$-th entity node, $e_j$ denotes the $j$-th entity node, and $\cdot$ denotes the inner product of vectors.
7. The method of claim 5, wherein the preset screening target is calculated by formula (2):
Figure QLYQS_7 (2),
wherein the symbols of formula (2) denote the two entity nodes being compared and the word frequency of an entity.
8. A domain knowledge graph construction system based on seed set expansion, characterized by comprising:
the seed set file acquisition module is used for acquiring at least one initial document related to the target professional field, and carrying out format arrangement and document merging processing on at least one initial document to obtain an initial seed set file, wherein the initial document comprises knowledge point data;
the knowledge pair format file acquisition module is used for carrying out duplication removal processing on the initial seed set files and carrying out data preprocessing on knowledge point data in the initial seed set files subjected to the duplication removal processing to obtain knowledge pair format files;
the sub-graph acquisition module is used for extracting the triples containing the knowledge point entities in the universal knowledge graph in a traversing and searching mode according to the knowledge pair format file to obtain a plurality of sub-graphs with knowledge-entity-knowledge;
and the target knowledge graph generation module is used for carrying out data cleaning and data screening on the plurality of subgraphs, carrying out graph merging processing on the plurality of processed subgraphs, and generating the knowledge graph of the target professional field.
9. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of any of claims 1-7.
10. A computer readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform the method according to any of claims 1-7.
CN202310443200.3A 2023-04-24 2023-04-24 Method and system for constructing domain knowledge graph based on seed subset expansion Pending CN116150407A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310443200.3A CN116150407A (en) 2023-04-24 2023-04-24 Method and system for constructing domain knowledge graph based on seed subset expansion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310443200.3A CN116150407A (en) 2023-04-24 2023-04-24 Method and system for constructing domain knowledge graph based on seed subset expansion

Publications (1)

Publication Number Publication Date
CN116150407A (en) 2023-05-23

Family

ID=86354749

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310443200.3A Pending CN116150407A (en) 2023-04-24 2023-04-24 Method and system for constructing domain knowledge graph based on seed subset expansion

Country Status (1)

Country Link
CN (1) CN116150407A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110472061A (en) * 2019-07-08 2019-11-19 郑州大学 A kind of knowledge mapping fusion method based on short text similarity calculation
CN112100396A (en) * 2020-08-28 2020-12-18 泰康保险集团股份有限公司 Data processing method and device
CN112434169A (en) * 2020-11-13 2021-03-02 北京创业光荣信息科技有限责任公司 Knowledge graph construction method and system and computer equipment
CN113177124A (en) * 2021-05-11 2021-07-27 北京邮电大学 Vertical domain knowledge graph construction method and system
CN114595344A (en) * 2022-05-09 2022-06-07 北京市农林科学院信息技术研究中心 Crop variety management-oriented knowledge graph construction method and device
CN114860916A (en) * 2022-06-09 2022-08-05 国网冀北电力有限公司计量中心 Knowledge retrieval method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20230523