CN116150407A

CN116150407A - Method and system for constructing domain knowledge graph based on seed subset expansion

Info

Publication number: CN116150407A
Application number: CN202310443200.3A
Authority: CN
Inventors: 刘淇; 冯彬; 阮书岚; 陈恩红
Original assignee: University of Science and Technology of China USTC
Current assignee: University of Science and Technology of China USTC
Priority date: 2023-04-24
Filing date: 2023-04-24
Publication date: 2023-05-23

Abstract

The invention discloses a method and system for constructing a domain knowledge graph based on seed set expansion. The method comprises: acquiring at least one initial document related to a target professional field, and performing format unification and document merging on the at least one initial document to obtain an initial seed set file, wherein the initial document comprises knowledge point data; de-duplicating the initial seed set file, and preprocessing the knowledge point data in the de-duplicated initial seed set file to obtain a knowledge pair format file; according to the knowledge pair format file, extracting triples containing knowledge point entities from the universal knowledge graph by traversal search to obtain a plurality of knowledge-entity-knowledge subgraphs; and performing data cleaning and data screening on the plurality of subgraphs, and merging the processed subgraphs to generate a knowledge graph of the target professional field.

Description

Method and system for constructing domain knowledge graph based on seed subset expansion

Technical Field

The invention relates to the fields of natural language processing, data mining, information retrieval and the like, and in particular to a seed set expansion-based domain knowledge graph construction method and system, an electronic device and a storage medium.

Background

A knowledge graph is a way of representing and organizing knowledge in computing: a large-scale semantic network with a directed graph structure that presents concepts and the semantic links between them. Nodes are called entities, and if two nodes are related they are connected by an edge, which is called a relationship. The basic unit of a knowledge graph, and its core, is the entity-relationship-entity triple. Knowledge graphs are large in scale, high in knowledge quality and well structured. By organizing massive amounts of knowledge in structured form and supporting downstream tasks such as reasoning and mining, they have broad application prospects in fields such as intelligent education, smart justice, financial risk control and intelligent recommendation.

In the prior art, knowledge graph construction methods mainly comprise knowledge extraction-based methods, ontology-based methods and statistics-based methods. However, these methods face difficulties and limitations when extracting a knowledge graph for a professional domain: they do not take the domain specificity and complexity of professional knowledge into account, so the degree of specialization and automation of the domain knowledge graphs they produce is very limited.

Disclosure of Invention

In view of the above problems, the present invention provides a method, a system, an electronic device, and a storage medium for constructing a domain knowledge graph based on seed set expansion, so as to solve at least one of the above problems.

According to a first aspect of the present invention, there is provided a method for constructing a domain knowledge graph based on seed set expansion, including:

acquiring at least one initial document related to the target professional field, and performing format arrangement and document merging processing on the at least one initial document to obtain an initial seed set file, wherein the initial document comprises knowledge point data;

performing de-duplication treatment on the initial seed set file, and performing data preprocessing on knowledge point data in the de-duplication treated initial seed set file to obtain a knowledge pair format file;

according to the knowledge pair format file, extracting triples containing knowledge point entities in the universal knowledge graph in a traversing and searching mode to obtain a plurality of subgraphs with knowledge-entity-knowledge;

and performing data cleaning and data screening on the plurality of subgraphs, and performing graph merging on the processed subgraphs to generate a knowledge graph of the target professional field.

According to an embodiment of the present invention, de-duplicating the initial seed set file and preprocessing the knowledge point data in the de-duplicated initial seed set file to obtain the knowledge pair format file includes:

performing entry filtering on the initial seed set file to screen out repeated knowledge points, obtaining a de-duplicated initial seed set file;

extracting the stems of the knowledge point names in the de-duplicated initial seed set file using the Porter stemming method to obtain a stem extraction result;

performing secondary de-duplication on identical knowledge points in the de-duplicated initial seed set file according to the stem extraction result to obtain a secondary de-duplication result;

and formatting the secondary de-duplication result to obtain the knowledge pair format file, wherein the knowledge pair format comprises knowledge point name-knowledge point interpretation.

According to an embodiment of the present invention, extracting triples containing knowledge point entities from the universal knowledge graph by traversal search according to the knowledge pair format file to obtain the plurality of knowledge-entity-knowledge subgraphs includes:

obtaining a knowledge point-entity correspondence according to the knowledge pair format file;

according to the knowledge point-entity correspondence, extracting knowledge point-entity-knowledge point triples from the universal knowledge graph by traversal search;

and obtaining the plurality of knowledge-entity-knowledge subgraphs from the triples.

According to an embodiment of the present invention, performing data cleaning and data screening on the plurality of subgraphs and performing graph merging on the processed subgraphs to generate the knowledge graph of the target professional field includes:

calculating the frequency of each entity in the multiple subgraphs by using a pre-trained word frequency calculation model according to a preset word frequency threshold;

and under the condition that the frequency of the entity in the subgraph meets a preset word frequency threshold, performing data deletion operation on the triples corresponding to the entity in the subgraph, and completing data cleaning and screening to obtain a plurality of preprocessed subgraphs.

According to an embodiment of the present invention, performing data cleaning and data screening on the plurality of subgraphs and performing graph merging on the processed subgraphs to generate the knowledge graph of the target professional field further includes:

calculating the cosine similarity between any two entities in the preprocessed subgraphs using a WordNet-based synonym comparison algorithm;

when the cosine similarity meets a preset similarity threshold, determining the two corresponding entities to be the same entity;

and, during graph merging, screening the two entities determined to be the same entity according to a preset screening target, so as to obtain the knowledge graph of the target professional field.

According to an embodiment of the present invention, the cosine similarity is calculated according to formula (1):

$$\mathrm{sim}(e_i, e_j) = \frac{e_i \cdot e_j}{\lVert e_i \rVert \, \lVert e_j \rVert} \qquad (1)$$

wherein $e_i$ denotes the $i$-th entity node, $e_j$ denotes the $j$-th entity node, and $\cdot$ denotes the vector inner product.

According to an embodiment of the present invention, the preset screening target is calculated by formula (2):

$$e = \underset{e' \in \{e_i,\, e_j\}}{\arg\max}\ \mathrm{freq}(e') \qquad (2)$$

wherein $e_i$ denotes the $i$-th entity node, $e_j$ denotes the $j$-th entity node, and $\mathrm{freq}(\cdot)$ denotes the word frequency of an entity.

According to a second aspect of the present invention, there is provided a domain knowledge graph construction system based on seed set expansion, comprising:

the seed set file acquisition module is used for acquiring at least one initial document related to the target professional field, and carrying out format arrangement and document merging processing on the at least one initial document to obtain an initial seed set file, wherein the initial document comprises knowledge point data;

the knowledge pair format file acquisition module is used for carrying out de-duplication treatment on the initial seed set file, and carrying out data preprocessing on knowledge point data in the initial seed set file after the de-duplication treatment to obtain a knowledge pair format file;

the sub-graph acquisition module is used for extracting triples containing knowledge point entities in the universal knowledge graph in a traversing search mode according to the knowledge pair format file to obtain a plurality of sub-graphs with knowledge-entity-knowledge;

and the target knowledge graph generation module is used for carrying out data cleaning and data screening on the plurality of subgraphs, carrying out graph merging processing on the plurality of processed subgraphs, and generating a knowledge graph in the target professional field.

According to a third aspect of the present invention, there is provided an electronic device comprising:

one or more processors;

storage means for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform a seed set extension-based domain knowledge graph construction method.

According to a fourth aspect of the present invention, there is provided a computer-readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to perform a method of domain knowledge graph construction based on seed set expansion.

Drawings

FIG. 1 is a flow chart of a method of domain knowledge graph construction based on seed set expansion, in accordance with an embodiment of the invention;

FIG. 2 is a flow chart of acquiring a knowledge pair formatted file, according to an embodiment of the invention;

FIG. 3 is a flow diagram of obtaining multiple subgraphs with knowledge-entity-knowledge in accordance with an embodiment of the invention;

FIG. 4 is a flow chart of generating a knowledge-graph of a target area of expertise, according to an embodiment of the invention;

FIG. 5 is a flow chart for acquiring knowledge points and preprocessing the knowledge points in accordance with another embodiment of the invention;

FIG. 6 is a flow chart of a method of constructing a domain knowledge graph from a seed set, in accordance with another embodiment of the invention;

FIG. 7 is a schematic structural diagram of a seed set expansion-based domain knowledge graph construction system in accordance with an embodiment of the invention;

FIG. 8 schematically shows a block diagram of an electronic device adapted to implement a seed set expansion-based domain knowledge graph construction method, according to an embodiment of the invention.

Detailed Description

The present invention will be further described in detail below with reference to specific embodiments and with reference to the accompanying drawings, in order to make the objects, technical solutions and advantages of the present invention more apparent.

Regarding the construction of knowledge graphs, the prior art mainly offers three kinds of methods: knowledge extraction-based methods, ontology-based methods, and statistics-based methods.

Knowledge extraction-based methods use natural language processing techniques to extract information such as entities and relations from text and convert it into a structured knowledge representation, from which the knowledge graph is built. They process and analyze text with techniques such as word segmentation, part-of-speech tagging, named entity recognition and syntactic analysis, and extract entities, attributes and relations. Because they can process large amounts of unstructured data automatically, they make knowledge graph construction more efficient and accurate.

Ontology-based methods build the knowledge graph through ontology modeling. An ontology is a model that describes the essential attributes, relationships and concepts of a domain or a specific problem. When the knowledge graph is constructed, the relevant information is mapped onto the ontology model to generate the graph. Ontology-based methods can improve the accuracy and consistency of the knowledge graph because the ontology model is carefully designed and defined and follows strict logical and semantic specifications.

Statistics-based methods construct the knowledge graph with statistical learning. They first collect a large amount of data, then train machine learning algorithms on it to learn the relationships and rules within the data, and finally construct the knowledge graph from those relationships and rules. They can process large amounts of data efficiently and are suited to building large-scale knowledge graphs. However, because they rely solely on the data itself and ignore the semantic and logical structure of knowledge, they may suffer from errors and ambiguity.

However, practical applications and data sets show that these methods face difficulties and limitations when extracting a knowledge graph for a professional domain. On the one hand, domain knowledge is highly specialized, and accurate extraction and modeling require deep domain knowledge and experience. On the other hand, professional domains generally lack large-scale structured data and usually offer only unstructured data written by experts, such as books, regulations and reference manuals, which is hard to use with statistics-based methods. In fields such as programming education, judicial education and basic discipline education, a domain knowledge graph helps students learn professional knowledge and clarify the connections and distinctions between knowledge concepts. Existing general-purpose methods do not consider the domain specificity and complexity of professional knowledge, so the degree of specialization and automation they achieve is limited. How to automatically construct a domain knowledge graph from knowledge point data, such as the professional terms and language concepts of the field, is therefore both challenging and significant.

To solve these technical problems, the invention provides a domain knowledge graph construction method for professional disciplines, including programming education, judicial education and basic discipline education. Taking programming education as an example, knowledge point data related to programming is first obtained from massive web data (including books, regulations and reference manuals in the professional field) to form an initial document set. The knowledge point data may be terminology of specific languages, such as Python, Java or C++, or general programming language concepts, such as constants, variables, identifiers, keywords and floating-point numbers. The initial document set is then preprocessed: repeated data and data irrelevant to the professional field are filtered out, which shrinks the candidate document set and improves efficiency, and the knowledge point data is uniformly preprocessed, including word-form normalization, de-duplication and synonym merging, into a unified format for subsequent computation, yielding a knowledge point seed set. Finally, according to the predefined seed set, entities of the programming data are retrieved from the universal knowledge graph and triples containing the knowledge points are extracted, yielding the domain knowledge graph.

It should be noted that, in the technical scheme disclosed by the invention, authorization is obtained from the owners of the relevant data documents when acquiring data documents in the target professional field; the data documents are processed, applied and stored with the permission of their owners; the relevant processes comply with laws and regulations; necessary and reliable confidentiality measures are adopted; and the requirements of public order and good morals are met.

Fig. 1 is a flowchart of a method for constructing a domain knowledge graph based on seed set expansion, according to an embodiment of the present invention.

As shown in FIG. 1, the method for constructing the domain knowledge graph based on seed set expansion comprises operations S110-S140.

At operation S110, at least one initial document related to the target professional field is acquired, and format arrangement and document merging processing are performed on the at least one initial document to obtain an initial seed set file, wherein knowledge point data is included in the initial document.

In the embodiment of the invention, knowledge point data related to the professional field is acquired from massive Web documents and books that are highly authoritative or widely accepted, and serves as the initial document set. Taking programming education as an example, the initial document set includes two types: general programming grammar rules (e.g., identifiers, abstract classes) and multi-language terminology (e.g., Python, Java). The sources are correspondingly of two kinds. The former comes from the Oxford English-Chinese bilingual computer dictionary, a Chinese-English software engineering glossary, software design reconstruction, and open-source data on GitHub; the latter comes from the official development documentation of various languages, including Python, Java, C++, HTML, JavaScript, C# and PHP.

In the embodiment of the invention, the initial document sets with different file formats (such as txt, word, csv and the like) are subjected to format arrangement, combined into a unified format txt and then placed into a unified seed set file.
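The format unification described above could look roughly like the following sketch. It assumes, purely for illustration, that each source has already been read into memory either as plain text with one `name: explanation` entry per line or as two-column CSV. The function names and the tab-separated output layout are hypothetical, and Word documents are left out because reading them would need a third-party library.

```python
import csv
import io

def parse_txt(text):
    """Parse 'name: explanation' lines (an assumed layout) into pairs."""
    pairs = []
    for line in text.splitlines():
        if ":" in line:
            name, explanation = line.split(":", 1)
            pairs.append((name.strip(), explanation.strip()))
    return pairs

def parse_csv(text):
    """Parse two-column CSV rows (an assumed layout) into pairs."""
    return [(row[0].strip(), row[1].strip())
            for row in csv.reader(io.StringIO(text)) if len(row) >= 2]

def build_seed_file(sources):
    """Merge (format, content) sources into one tab-separated txt body."""
    parsers = {"txt": parse_txt, "csv": parse_csv}
    lines = []
    for fmt, content in sources:
        for name, explanation in parsers[fmt](content):
            lines.append(f"{name}\t{explanation}")
    return "\n".join(lines)

# Toy inputs standing in for real source documents.
sources = [
    ("txt", "identifier: a name given to a program element\n"),
    ("csv", "abstract class,a class that cannot be instantiated\n"),
]
seed_text = build_seed_file(sources)
print(seed_text)
```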

In operation S120, the initial seed set file is subjected to deduplication processing, and knowledge point data in the initial seed set file after deduplication processing is subjected to data preprocessing, so as to obtain a knowledge pair format file.

In the embodiment of the invention, entry filtering is performed on the initial seed set file to screen out repeated knowledge in the file. Because the initial knowledge sources differ in content and scope, their knowledge points overlap to some extent; removing the repeats shrinks the seed set and improves the efficiency of the system.

In the embodiment of the invention, data preprocessing of the de-duplicated seed set file comprises stem extraction and further de-duplication. Knowledge from different sources may express the same knowledge point in different word forms, so these need to be merged. The stems of the knowledge point names in the seed set file are extracted with the Porter stemming method using the Natural Language Toolkit (NLTK) library, and the results are de-duplicated so that identical knowledge points are removed. The further de-duplicated results are then unified into the knowledge pair format 'knowledge point name-knowledge point interpretation' to facilitate subsequent computation.
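A minimal sketch of the stem-based second de-duplication follows. To stay self-contained it uses `naive_stem`, a hypothetical stand-in that only strips a trailing "s"; it is not the Porter algorithm. With NLTK installed, `nltk.stem.PorterStemmer().stem` could be passed as the `stem` argument instead. Keeping the first pair seen for each stem is also an assumption, since the text does not say which duplicate survives.

```python
def naive_stem(word):
    """Strip a trailing 's'; a rough stand-in for a real Porter stemmer."""
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

def stem_key(name, stem=naive_stem):
    """Normalise a knowledge point name to a lower-case stemmed key."""
    return " ".join(stem(token) for token in name.lower().split())

def dedupe_by_stem(pairs, stem=naive_stem):
    """Keep the first (name, explanation) pair seen for each stemmed key."""
    seen = {}
    for name, explanation in pairs:
        key = stem_key(name, stem)
        if key not in seen:
            seen[key] = (name, explanation)
    return list(seen.values())

# Toy knowledge pairs; "variable" and "Variables" share one stem.
pairs = [
    ("variable", "a named storage location"),
    ("Variables", "named storage locations"),
    ("keyword", "a reserved word"),
]
knowledge_pairs = dedupe_by_stem(pairs)
print(knowledge_pairs)
```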

In operation S130, according to the knowledge pair format file, triples containing knowledge point entities are extracted from the universal knowledge graph by traversal search, so as to obtain a plurality of knowledge-entity-knowledge subgraphs.

Given the professional-field seed set $S$ obtained in operation S120 and the triples $(h, r, t)$ of several public universal knowledge graphs (ConceptNet, DBpedia, Wikidata), each knowledge point $k$ in the seed set $S$ is matched against the head entity $h$ or the tail entity $t$ of the triples. Triples containing a knowledge point entity are extracted by traversal search, giving the relations between that knowledge point entity and other knowledge point entities or other entities, and one subgraph is obtained from each universal knowledge graph. That is, according to the seed set $S$, a subgraph $G_S$ is extracted from the universal knowledge graph $G$, i.e.

$$G_S = \{(h, r, t) \in G \mid h \in S \ \text{or}\ t \in S\}.$$
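The traversal search of operation S130 can be sketched as a single pass over each graph's triples. The graphs below are small hypothetical lists of (head, relation, tail) tuples rather than real ConceptNet, DBpedia or Wikidata data, and case-insensitive exact matching is an assumption; the text does not specify how knowledge points are matched to entity labels.

```python
def extract_subgraph(seed_set, triples):
    """Return the triples whose head or tail matches a seed knowledge point."""
    seeds = {s.lower() for s in seed_set}
    return [
        (h, r, t) for h, r, t in triples
        if h.lower() in seeds or t.lower() in seeds
    ]

seed_set = {"variable", "constant"}

# Hypothetical stand-ins for several universal knowledge graphs.
general_graphs = {
    "graph_a": [
        ("variable", "IsA", "identifier"),
        ("cat", "IsA", "animal"),
    ],
    "graph_b": [
        ("pi", "InstanceOf", "constant"),
        ("Paris", "CapitalOf", "France"),
    ],
}

# One subgraph per universal knowledge graph, as in the description.
subgraphs = {name: extract_subgraph(seed_set, triples)
             for name, triples in general_graphs.items()}
print(subgraphs)
```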

In operation S140, data cleaning and data screening are performed on the plurality of sub-graphs, and the plurality of processed sub-graphs are subjected to graph merging processing to generate a knowledge graph in the target professional field.

In the embodiment of the invention, knowledge with a low frequency of occurrence is cleaned first, filtering out rarely used, obscure knowledge. Taking programming education as an example, the word frequencies of the knowledge points in the seed set are obtained with the Python package wordfreq, whose frequencies are computed in advance on independent corpora. An empirical threshold of 1e-06 is set, and if the frequency of occurrence of an entity is less than the threshold, the triple is filtered out. Then, focusing on the top $K$ most frequent relations, knowledge specific to other domains is removed manually, where $K$ is set to 500. The top $K$ relations are labeled as follows: 0 indicates a relation in the programming field, which is not deleted; 1 indicates a professional relation of another field (triples containing the relation are removed and a node blacklist is created, since triples containing those nodes are not common knowledge); 2 indicates that the relation is not a programming-field relation, but its nodes may still be programming knowledge points (only triples containing the relation are removed). Because a relation labeled 1 is a professional relation of another field, the nodes it connects are assumed to belong to that field and are added to the blacklist; triples containing blacklisted nodes are then removed iteratively.
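The cleaning rules above can be sketched as two filters. This is an illustration under several assumptions: the frequency table `freq` holds made-up values in place of real lookups (with wordfreq installed, `wordfreq.word_frequency(word, "en")` could supply them), a triple is dropped when either of its entities is below the threshold, relations outside the labeled top K default to label 0, and the blacklist is applied in a single pass.

```python
from collections import Counter

def filter_rare(triples, freq, threshold=1e-6):
    """Drop triples in which either entity is rarer than the threshold."""
    return [(h, r, t) for h, r, t in triples
            if freq.get(h, 0.0) >= threshold and freq.get(t, 0.0) >= threshold]

def top_relations(triples, k=500):
    """Return the k most frequent relations, ready for manual labeling."""
    return [r for r, _ in Counter(r for _, r, _ in triples).most_common(k)]

def apply_relation_labels(triples, labels):
    """Apply the 0/1/2 relation labels: keep only label-0 triples whose
    nodes are not blacklisted through a label-1 relation."""
    blacklist = set()
    for h, r, t in triples:
        if labels.get(r, 0) == 1:
            blacklist.update((h, t))
    return [(h, r, t) for h, r, t in triples
            if labels.get(r, 0) == 0
            and h not in blacklist and t not in blacklist]

# Hypothetical frequencies, triples and manual labels.
freq = {"variable": 2e-5, "identifier": 4e-6, "enzyme": 3e-6,
        "protein": 8e-6, "quine": 1e-8, "program": 9e-5, "value": 1e-4}
triples = [
    ("variable", "IsA", "identifier"),
    ("quine", "IsA", "program"),
    ("enzyme", "CatalyzedBy", "protein"),
    ("protein", "IsA", "value"),
    ("variable", "LocatedNear", "value"),
]
labels = {"IsA": 0, "CatalyzedBy": 1, "LocatedNear": 2}

cleaned = apply_relation_labels(filter_rare(triples, freq), labels)
print(cleaned)
```

In this toy run the "quine" triple falls below the frequency threshold, the "CatalyzedBy" triple is removed as an other-field relation and blacklists "enzyme" and "protein", and the "LocatedNear" triple is removed without blacklisting its nodes.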

In the embodiment of the invention, entity matching similarity is first calculated with a WordNet-based synonym comparison algorithm: the similarity between every pair of possibly aligned entities $e_i$, $e_j$ is calculated with the cosine similarity formula (1). If the similarity is higher than a set threshold $\theta$, the two are considered the same entity; here $\theta$ is set to 0.9. When entities are merged across graphs, their frequencies of occurrence are calculated with wordfreq, and the entity with the higher frequency is retained as the merged entity node $e$, as in formula (2).

$$\mathrm{sim}(e_i, e_j) = \frac{e_i \cdot e_j}{\lVert e_i \rVert \, \lVert e_j \rVert} \qquad (1)$$

wherein $e_i$ denotes the $i$-th entity node, $e_j$ denotes the $j$-th entity node, and $\cdot$ denotes the vector inner product.

$$e = \underset{e' \in \{e_i,\, e_j\}}{\arg\max}\ \mathrm{freq}(e') \qquad (2)$$

wherein $e_i$ denotes the $i$-th entity node, $e_j$ denotes the $j$-th entity node, and $\mathrm{freq}(\cdot)$ denotes the word frequency of an entity.
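The merge rule can be illustrated with a short sketch: compute cosine similarity for each entity pair, treat pairs above 0.9 as the same entity, and keep the more frequent one. The entity vectors and frequencies below are invented for the example; the text does not say how the WordNet-based algorithm derives vectors, so that step is simply assumed to have happened. The pairwise pass is deliberately simple and does not resolve chains of merges.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def merge_entities(vectors, freq, threshold=0.9):
    """Map each entity to the representative kept after pairwise merging."""
    names = sorted(vectors)
    representative = {name: name for name in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if cosine_similarity(vectors[a], vectors[b]) > threshold:
                ra, rb = representative[a], representative[b]
                # Keep whichever entity has the higher word frequency.
                keep = ra if freq.get(ra, 0.0) >= freq.get(rb, 0.0) else rb
                representative[a] = representative[b] = keep
    return representative

# Hypothetical entity vectors and word frequencies.
vectors = {
    "function": [1.0, 0.0, 1.0],
    "subroutine": [0.9, 0.1, 1.0],
    "loop": [0.0, 1.0, 0.0],
}
freq = {"function": 1e-4, "subroutine": 2e-6, "loop": 3e-5}

representative = merge_entities(vectors, freq)
print(representative)
```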

The method for constructing a domain knowledge graph based on seed set expansion provided by the embodiment of the invention applies existing graph construction techniques to a professional-domain scenario. Knowledge points are acquired and their data uniformly preprocessed; entities of the professional-domain data are then retrieved from the universal knowledge graph according to the predefined professional-domain seed set, which achieves knowledge point extraction and relation extraction for the domain data. Merging the resulting subgraphs compensates for the limited knowledge sources of any single graph and widens knowledge coverage, finally yielding the domain knowledge graph. The method is procedurally simple, has low computational complexity and high interpretability, and achieves good results.

FIG. 2 is a flow chart of acquiring a knowledge pair format file, according to an embodiment of the invention.

As shown in FIG. 2, de-duplicating the initial seed set file and preprocessing the knowledge point data in the de-duplicated initial seed set file to obtain the knowledge pair format file includes operations S210 to S240.

In operation S210, entry filtering is performed on the initial seed set file, and repeated knowledge points in the initial seed set file are screened out, so as to obtain a de-duplicated initial seed set file.

In operation S220, the stems of the knowledge point names in the de-duplicated initial seed set file are extracted using the Porter stemming method, so as to obtain a stem extraction result.

In operation S230, according to the stem extraction result, the same knowledge points in the initial seed set file after the duplication removal are subjected to secondary duplication removal processing, so as to obtain a secondary duplication removal result.

In operation S240, the secondary deduplication result is formatted to obtain a knowledge pair format file, where the knowledge pair format includes knowledge point name-knowledge point interpretation.

FIG. 3 is a flow diagram of obtaining multiple subgraphs with knowledge-entity-knowledge in accordance with an embodiment of the invention.

As shown in FIG. 3, extracting triples containing knowledge point entities from the universal knowledge graph by traversal search according to the knowledge pair format file to obtain the plurality of knowledge-entity-knowledge subgraphs includes operations S310 to S330.

In operation S310, a knowledge point-entity correspondence is obtained according to the knowledge pair format file.

In operation S320, according to the knowledge point-entity correspondence, knowledge point-entity-knowledge point triples are extracted from the universal knowledge graph by traversal search.

In operation S330, the plurality of knowledge-entity-knowledge subgraphs are obtained from the triples.

Fig. 4 is a flowchart of generating a knowledge-graph of a target area of expertise, according to an embodiment of the invention.

As shown in FIG. 4, performing data cleaning and data screening on the plurality of subgraphs and performing graph merging on the processed subgraphs to generate the knowledge graph of the target professional field includes operations S410 to S450.

In operation S410, the frequency of each entity in the plurality of sub-graphs is calculated using a pre-trained word frequency calculation model according to a preset word frequency threshold.

In operation S420, under the condition that the frequency of the entity in the subgraph meets the preset word frequency threshold, performing data deletion operation on the triples corresponding to the entity in the subgraph, and completing data cleaning and screening to obtain a plurality of preprocessed subgraphs.

In operation S430, the cosine similarity between any two entities in the preprocessed subgraphs is calculated using a WordNet-based synonym comparison algorithm.

In operation S440, when the cosine similarity meets a preset similarity threshold, the two corresponding entities are determined to be the same entity.

In operation S450, during graph merging, the two entities determined to be the same entity are screened according to a preset screening target, so as to obtain the knowledge graph of the target professional field.

The method for constructing the domain knowledge graph based on the seed set expansion is further described in detail below by means of another embodiment of the present invention and with reference to fig. 5 and 6.

FIG. 5 is a flow chart for acquiring knowledge points and preprocessing the knowledge points in accordance with another embodiment of the invention.

FIG. 6 is a flowchart of a method for constructing a domain knowledge graph from a seed set, in accordance with another embodiment of the invention.

As shown in FIG. 5, knowledge point data related to the professional field, such as books, regulations and reference manuals in the field, is first obtained from massive Web resources to form an initial document set.

The initial document set is then unified in format and merged to obtain an initial seed set, and the initial seed set is de-duplicated to delete knowledge repeated across different sources.

The de-duplicated seed set is then preprocessed by stem extraction and further de-duplication, and the fully de-duplicated result is formatted into the unified knowledge pair format 'knowledge point name-knowledge point interpretation'.

In summary, as shown in FIG. 5, the initial document set undergoes format unification, a first de-duplication, stem extraction, a second de-duplication and conversion to the unified knowledge pair format, yielding the final seed set file.

Next, according to the knowledge point seed set obtained in the previous step and several public large-scale universal knowledge graphs such as ConceptNet, DBpedia and Wikidata, triples containing knowledge point entities are extracted by traversal search, and a domain knowledge subgraph is extracted from each universal knowledge graph.

The extracted subgraphs are further cleaned and screened: rarely used, obscure knowledge is filtered out and knowledge from other fields is removed, leaving high-quality domain knowledge.

The resulting subgraphs are then fused: entity matching similarity is calculated with a WordNet-based synonym comparison algorithm, relations are semantically matched to ConceptNet relation classes using a pre-trained language model, and the entity with the higher frequency of occurrence is retained as the merged entity node.

Taking the construction of a knowledge graph for the programming field as an example, as shown in fig. 6, a general knowledge graph is first acquired, and the knowledge points in it that are associated with entities in the seed set file are extracted to generate a subgraph of the general knowledge graph; the generated subgraph is then cleaned and filtered to obtain a preprocessed subgraph; finally, the preprocessed subgraphs are merged to obtain the knowledge graph of the programming field.

With the technical solution provided by the invention, a large amount of knowledge point data related to the professional field is acquired during construction of the domain graph from massive web data, such as books, regulations, and reference manuals in that field, and the initial document set is preprocessed to obtain a knowledge point seed set in a standard form. Based on this predefined seed set, entities of the professional data are then retrieved from large-scale general knowledge graphs, triples containing the knowledge points are extracted, and the resulting subgraphs are merged, so that the domain knowledge graph is constructed accurately.

Fig. 7 is a schematic structural diagram of a domain knowledge graph construction system based on seed set expansion according to an embodiment of the invention.

As shown in fig. 7, the system 700 includes a seed set file acquisition module 710, a knowledge pair format file acquisition module 720, a sub-graph acquisition module 730, and a target knowledge graph generation module 740.

The seed set file acquisition module 710 is configured to obtain at least one initial document related to the target professional field, and perform format arrangement and document merging processing on the at least one initial document to obtain an initial seed set file, where the initial document includes knowledge point data.

The knowledge pair format file acquisition module 720 is configured to perform deduplication processing on the initial seed set file, and perform data preprocessing on knowledge point data in the deduplicated initial seed set file, so as to obtain a knowledge pair format file.

The sub-graph acquisition module 730 is configured to extract, by traversal search and according to the knowledge pair format file, triples containing knowledge point entities from the general knowledge graph, so as to obtain a plurality of sub-graphs having knowledge-entity-knowledge.

The target knowledge graph generating module 740 is configured to perform data cleaning and data screening on the multiple subgraphs, and perform graph merging processing on the multiple processed subgraphs to generate a knowledge graph in the target professional field.
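
The way the four modules hand data to one another can be sketched as a simple pipeline. The class `KnowledgeGraphSystem`, its field names, and the trivial stub functions below are hypothetical and only mirror the roles of modules 710 to 740; they are not taken from the embodiment.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]
Pair = Tuple[str, str]

@dataclass
class KnowledgeGraphSystem:
    acquire_seed_file: Callable[[List[str]], List[Pair]]            # role of module 710
    to_knowledge_pairs: Callable[[List[Pair]], Dict[str, str]]      # role of module 720
    extract_subgraphs: Callable[[Dict[str, str]], List[List[Triple]]]  # role of module 730
    build_target_graph: Callable[[List[List[Triple]]], List[Triple]]   # role of module 740

    def run(self, document_paths: List[str]) -> List[Triple]:
        """Pass the output of each module to the next one in order."""
        seed_file = self.acquire_seed_file(document_paths)
        pairs = self.to_knowledge_pairs(seed_file)
        subgraphs = self.extract_subgraphs(pairs)
        return self.build_target_graph(subgraphs)

# Trivial stubs so the pipeline can be run end to end.
system = KnowledgeGraphSystem(
    acquire_seed_file=lambda paths: [("Stack", "LIFO container"), ("Stack", "LIFO container")],
    to_knowledge_pairs=lambda rows: dict(rows),
    extract_subgraphs=lambda pairs: [[(name.lower(), "IsA", "data structure") for name in pairs]],
    build_target_graph=lambda graphs: sorted({t for g in graphs for t in g}),
)
graph = system.run(["doc1.txt"])
```

Injecting each module as a callable keeps the stages independently replaceable, which matches the modular decomposition shown in fig. 7.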

As shown in fig. 8, an electronic device 800 according to an embodiment of the present invention includes a processor 801 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 802 or a program loaded from a storage section 808 into a Random Access Memory (RAM) 803. The processor 801 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or an associated chipset and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), or the like. The processor 801 may also include on-board memory for caching purposes. The processor 801 may comprise a single processing unit or multiple processing units for performing the different actions of the method flows according to embodiments of the invention.

In the RAM 803, various programs and data required for the operation of the electronic device 800 are stored. The processor 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. The processor 801 performs various operations of the method flow according to the embodiment of the present invention by executing programs in the ROM 802 and/or the RAM 803. Note that the program may be stored in one or more memories other than the ROM 802 and the RAM 803. The processor 801 may also perform various operations of the method flow according to embodiments of the present invention by executing programs stored in one or more memories.

According to an embodiment of the invention, the electronic device 800 may further comprise an input/output (I/O) interface 805, which is also connected to the bus 804. The electronic device 800 may also include one or more of the following components connected to the I/O interface 805: an input section 806 including a keyboard, a mouse, and the like; an output section 807 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 808 including a hard disk or the like; and a communication section 809 including a network interface card such as a LAN card, a modem, or the like. The communication section 809 performs communication processing via a network such as the Internet. A drive 810 is also connected to the I/O interface 805 as needed. A removable medium 811, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 810 as needed, so that a computer program read from it can be installed into the storage section 808 as needed.

The present invention also provides a computer-readable storage medium that may be embodied in the apparatus/device/system described in the above embodiments; or may exist alone without being assembled into the apparatus/device/system. The computer-readable storage medium carries one or more programs which, when executed, implement methods in accordance with embodiments of the present invention.

According to embodiments of the present invention, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example, but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the invention, the computer-readable storage medium may include ROM 802 and/or RAM 803 and/or one or more memories other than ROM 802 and RAM 803 described above.

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The foregoing embodiments are provided to illustrate the general principles of the present invention and are not intended to limit the scope of the invention.

Claims

1. A domain knowledge graph construction method based on seed set expansion, characterized by comprising:

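acquiring at least one initial document related to a target professional field, and performing format arrangement and document merging processing on the at least one initial document to obtain an initial seed set file, wherein the initial document comprises knowledge point data;

performing deduplication processing on the initial seed set file, and performing data preprocessing on knowledge point data in the deduplicated initial seed set file, to obtain a knowledge pair format file;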

extracting, by traversal search and according to the knowledge pair format file, triples containing knowledge point entities from the general knowledge graph, to obtain a plurality of subgraphs having knowledge-entity-knowledge;

and performing data cleaning and data screening on the plurality of subgraphs, and performing graph merging processing on the plurality of processed subgraphs, to generate a knowledge graph of the target professional field.

2. The method of claim 1, wherein performing deduplication processing on the initial seed set file, and performing data preprocessing on knowledge point data in the deduplicated initial seed set file, to obtain a knowledge pair format file comprises:

3. The method of claim 1, wherein extracting, by traversal search and according to the knowledge pair format file, triples containing knowledge point entities from the general knowledge graph, to obtain a plurality of subgraphs having knowledge-entity-knowledge, comprises:

extracting knowledge point-entity-knowledge point triples from the general knowledge graph by traversal search according to the correspondence between knowledge points and entities;

and obtaining a plurality of subgraphs with knowledge-entity-knowledge according to the triples.

4. The method of claim 1, wherein performing data cleaning and data screening on the plurality of subgraphs, and performing graph merging processing on the plurality of processed subgraphs, to generate the knowledge graph of the target professional field, comprises:

calculating the frequency of each entity in the plurality of subgraphs by using a pre-trained word frequency calculation model according to a preset word frequency threshold;

and under the condition that the frequency of the entity in the subgraph meets the preset word frequency threshold, performing data deletion operation on the triples corresponding to the entity in the subgraph, and completing data cleaning and screening to obtain a plurality of preprocessed subgraphs.

5. The method as recited in claim 4, further comprising:

in the graph merging processing, screening two entities that are determined to be the same entity according to a preset screening target, so as to obtain the knowledge graph of the target professional field.

6. The method of claim 5, wherein the cosine similarity is calculated according to formula (1):

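$$\cos(e_i, e_j) = \frac{e_i \cdot e_j}{\lVert e_i \rVert \, \lVert e_j \rVert} \qquad (1),$$

wherein $e_i$ represents the $i$-th entity node, $e_j$ represents the $j$-th entity node, and $\cdot$ represents the inner product operation of the vectors.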

7. The method of claim 5, wherein the preset screening target is calculated by formula (2):

$$e^{*} = \underset{e \in \{e_i,\, e_j\}}{\arg\max}\; f(e) \qquad (2),$$

wherein $e_i$ indicates the $i$-th entity node, $e_j$ indicates the $j$-th entity node, and $f(\cdot)$ represents the word frequency of the entity.

8. A domain knowledge graph construction system based on seed set expansion, characterized by comprising:

the seed set file acquisition module is used for acquiring at least one initial document related to the target professional field, and carrying out format arrangement and document merging processing on at least one initial document to obtain an initial seed set file, wherein the initial document comprises knowledge point data;

the knowledge pair format file acquisition module is used for carrying out duplication removal processing on the initial seed set files and carrying out data preprocessing on knowledge point data in the initial seed set files subjected to the duplication removal processing to obtain knowledge pair format files;

the sub-graph acquisition module is used for extracting the triples containing the knowledge point entities in the universal knowledge graph in a traversing and searching mode according to the knowledge pair format file to obtain a plurality of sub-graphs with knowledge-entity-knowledge;

and the target knowledge graph generation module is used for carrying out data cleaning and data screening on the plurality of subgraphs, carrying out graph merging processing on the plurality of processed subgraphs, and generating the knowledge graph of the target professional field.

9. An electronic device, comprising:

one or more processors;

storage means for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of any of claims 1-7.

10. A computer readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform the method according to any of claims 1-7.