CN116401375B - Knowledge graph construction method and system

Info

Publication number: CN116401375B
Application number: CN202310356339.4A
Authority: CN
Inventors: 朱楠
Original assignee: Shenzhen Hongpeng Digital Supply Chain Management Co ltd
Current assignee: Shenzhen Hongpeng Digital Supply Chain Management Co ltd
Priority date: 2023-03-23
Filing date: 2023-03-23
Publication date: 2024-02-20
Anticipated expiration: 2043-03-23
Also published as: CN116401375A

Abstract

The invention provides a knowledge graph construction method, which comprises the following steps: extracting entity phrases and keywords in web page links, and filtering the entity phrases according to the keywords to obtain filtering entities; acquiring noun phrases in the filtering entities, and calculating the unigram occurrence probability and the bigram occurrence probability of the noun phrases; obtaining the pairing probability of the noun phrases according to the unigram occurrence probability and the bigram occurrence probability; screening the noun phrases in the filtering entities according to the pairing probability to obtain key entities; calculating the feature word vector of each noun phrase in the key entities, performing cluster analysis on the noun phrases according to the feature word vectors to obtain the attributes of the noun phrases, and setting a regular expression for each attribute; and iterating, according to a bootstrap method, with a single entity of the key entities as the center, and extracting the relation content of each entity to form a knowledge graph. The invention can improve the extraction accuracy when extracting content that describes a plurality of entities and acquiring the attributes among the entities.

Description

Knowledge graph construction method and system

Technical Field

The invention belongs to the technical field of knowledge graphs, and particularly relates to a knowledge graph construction method and system.

Background

A knowledge graph is a family of graphs that display the development process and structural relationships of knowledge: visualization technology is used to describe knowledge resources and their carriers, and to mine, analyze, construct, draw and display knowledge and the interrelations among them. A knowledge graph is typically composed of nodes and edges; each node represents an entity, and the edges connecting the nodes represent relationships between entities, i.e., attributes of the entities.

In the prior art, when a knowledge graph is established, related content needs to be collected from the Internet to form the knowledge graph. Internet content is updated continually, so to keep the knowledge graph up to date, the updated content has to be screened out manually and supplied to the knowledge graph. In addition, some Internet content describes several entities at once, so the attributes among the entities cannot be obtained accurately when that content is extracted, and the extraction accuracy is low.

Disclosure of Invention

The invention aims to solve the technical problems in the prior art and provides an industrial knowledge graph construction method which improves the extraction accuracy when extracting contents comprising a plurality of entities and acquiring attributes among the entities.

In order to achieve the above object of the present invention, according to a first aspect of the present invention, there is provided a knowledge graph construction method comprising the steps of: extracting entity phrases and keywords in web page links, and filtering the entity phrases according to the keywords to obtain filtering entities; acquiring noun phrases in the filtering entities, and calculating the unigram occurrence probability and the bigram occurrence probability of the noun phrases; obtaining the pairing probability of the noun phrases according to the unigram occurrence probability and the bigram occurrence probability; screening the noun phrases in the filtering entities according to the pairing probability to obtain key entities; calculating the feature word vector of each noun phrase in the key entities, performing cluster analysis on the noun phrases according to the feature word vectors to obtain the attributes of the noun phrases, and setting a regular expression for each attribute; and iterating, according to a bootstrap method, with a single entity of the key entities as the center, and extracting the relation content of each entity to form a knowledge graph.

Further, the step of calculating the unigram occurrence probability and the bigram occurrence probability of the noun phrases is specifically as follows: the unigram occurrence probability of a noun phrase is calculated according to the unigram formula, and the bigram occurrence probability of the noun phrase is calculated according to the bigram formula (both formula images are missing from the source text); wherein $P_a(w_i)$ represents the unigram occurrence probability, $P_b(w_i)$ represents the bigram occurrence probability, $w_i$ represents the i-th noun phrase in the filtering entity, $N$ represents the number of times the noun phrase appears in the web page link, $n$ represents the total number of noun phrases in the filtering entity, and $m \in n$.

Further, the pairing probability of the noun phrases is calculated as follows: $P = \lambda_1 P_a + \lambda_2 P_b$; wherein $P$ represents the pairing probability, $\lambda_1$ represents the weight of the unigram, $P_a$ represents the unigram occurrence probability, $\lambda_2$ represents the weight of the bigram, and $P_b$ represents the bigram occurrence probability.

Further, the step of extracting the entity phrase and the keyword in the web page link is preceded by the step of generating a web page library; the step of extracting the entity phrase and the keyword in the webpage link comprises the following steps: acquiring webpage links in a webpage library, and extracting entity phrases and keywords of the webpage links; the generation step of the webpage library comprises the following steps: step S1: designating industrial information needing to create a knowledge graph, and creating a webpage library of the industrial information; step S2: acquiring a new link site, acquiring element information of the link site according to a crawling program, calculating the correlation between the element information and industry information, and discarding the link site if the correlation of the link site is smaller than a set threshold; if the correlation of the link site is greater than the set threshold, executing step S3; step S3: and adding the link site into a webpage library.

Further, step S3 further includes: and crawling and storing the article content in the link site, analyzing whether the article content has an external link, and if so, putting the external link as a new link site into the step S2.

Further, before the step of extracting the entity phrases and keywords in the web page links, a web page link update judging step is further included; the step of extracting the entity phrases and keywords in the web page links comprises: acquiring the updated web page links in the web page library, and extracting the entity phrases and keywords of those web page links; the web page link update judging step is specifically: polling the web page links in the web page library, and judging whether each web page link has been updated according to its stored article content, its web page type and the corresponding judgment basis.

Further, the web page types include news websites and content websites; the judgment basis of a news website comprises whether the article content has a new link; the judgment basis of a content website comprises the latest posting time, the comment time and the data fingerprint.

In order to achieve the above object of the present invention, according to a second aspect of the present invention, there is provided a knowledge graph construction system, which uses the knowledge graph construction method of the first aspect during execution; the system comprises an input module, an extraction module, a calculation module and an iteration module; the extraction module is used for extracting entity phrases and keywords of web page links in the web page library and filtering to obtain filtering entities; the computing module is used for computing the unigram occurrence probability, the bigram occurrence probability and the pairing probability; the extraction module is also used for extracting key entities from the filtering entities according to the pairing probability; the computing module is also used for computing the feature word vector of each noun phrase in the key entities and analyzing to obtain the attributes of the noun phrases; the input module is used for setting a regular expression according to the attributes of the noun phrases; the iteration module is used for iterating, according to the bootstrap method, with a single entity of the key entities as the center and extracting the relation content of the single entity to form a knowledge graph.

Further, the input module is also used for inputting industry information and a set threshold value which need to create a knowledge graph; the system also comprises a storage module, an analysis module and a judgment module; the storage module is used for storing a webpage library corresponding to each piece of industry information; the analysis module is used for acquiring a new link site and acquiring element information; the computing module is also used for computing the correlation between the element information of the link site and the industry information; the judging module is used for judging that if the correlation of the link site is larger than a set threshold value, the link site is added into the webpage library; if the correlation of the link site is less than the set threshold, discarding the link site.

Further, the storage module is also used for crawling article contents of web page links stored in the web page library for the first time and storing the article contents; the judging module is also used for acquiring article contents of web page links of the web page library, and if the article contents have external links, the judging module sends the external links to the calculating module for correlation calculation; the system also comprises an updating module; the updating module is used for carrying out polling on the webpage links in the webpage library to judge whether the webpage links are updated or not; the extraction module is used for extracting the entity phrase and the keyword of the updated webpage link in the webpage library.

The technical principle and beneficial effects of the invention are as follows: the scheme obtains a plurality of entity phrases from the web page link and calculates the pairing probability of the entity phrases; noun phrases with low pairing probability, i.e., noun phrases that do not describe attributes, are discarded, which avoids the low extraction accuracy caused by extracting too many entity words. In the prior art, unigrams and bigrams are generally used for semantic recognition, whereas this scheme combines the unigram and the bigram to obtain the attribute pairing probability of the noun phrases; noun phrases with high attribute pairing probability are screened out, which improves the accuracy of content extraction from web page links.

Drawings

FIG. 1 is a schematic flow chart of a knowledge graph construction method according to the present invention;

FIG. 2 is a schematic diagram of a process for creating and updating a web page library according to the present invention;

FIG. 3 is a flow chart of the present invention for determining whether a web page library is updated;

FIG. 4 is a schematic structural diagram of a knowledge graph construction system according to the present invention.

Detailed Description

Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention.

The invention provides a knowledge graph construction method, which is used for identifying and crawling articles related to industrial information on the Internet, filtering the content in the articles to obtain the properties related to the industrial information in the articles, including registered capital, financing conditions, occupied area, production scale, employee number, pollution conditions and upstream and downstream relations, and establishing the knowledge graph of the upstream and downstream of the industrial information through the properties; the industrial information can be fully understood in detail through the knowledge graph, and the financial institution is assisted in enterprise risk control.

As shown in the attached figure 1, the knowledge graph construction method provided by the invention comprises the following steps:

extracting entity phrases $E(e_1, e_2, e_3, \ldots, e_n)$ and keywords $K(k_1, k_2, k_3, \ldots, k_x)$ in the web page link, and filtering the entity phrases according to the keywords to obtain the filtering entity $R$, wherein $R = K \cap E$; in this embodiment, the preferred number of keywords is 10, i.e., $x = 10$.

acquiring the noun phrases $W(w_1, w_2, w_3, \ldots, w_n)$ in the filtering entity, and calculating the unigram occurrence probability and the bigram occurrence probability of the noun phrases; obtaining the pairing probability of the noun phrases according to the unigram occurrence probability and the bigram occurrence probability;

screening the noun phrases in the filtering entity according to the pairing probability to obtain the key entities; calculating the feature word vector of each noun phrase in the key entities, and performing cluster analysis on the noun phrases according to the feature word vectors to obtain the attributes $Attr(a_1, a_2, a_3, \ldots, a_n)$ of the noun phrases, and setting a regular expression for each attribute; specifically, in this embodiment, the word2vector algorithm is used to calculate the feature word vector of each noun phrase in the key entities, that is, to find the phrases similar to each noun phrase.
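The clustering and regular-expression step can be illustrated with a small sketch. It is only an illustration under stated assumptions: the three-dimensional vectors are made-up stand-ins for word2vector output, the greedy cosine-threshold grouping is a simple substitute for whatever cluster analysis the embodiment uses, and the attribute name and regular expression are hypothetical.

```python
import math
import re


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster(vectors, threshold=0.9):
    """Greedy clustering: a phrase joins the first cluster whose seed vector
    is at least `threshold` similar, otherwise it starts a new cluster."""
    clusters = []  # list of (seed_vector, [phrases])
    for phrase, vec in vectors.items():
        for seed, members in clusters:
            if cosine(seed, vec) >= threshold:
                members.append(phrase)
                break
        else:
            clusters.append((vec, [phrase]))
    return [members for _, members in clusters]


# Toy feature word vectors (assumed values, not real embeddings).
vectors = {
    "registered capital": [0.9, 0.1, 0.0],
    "paid-in capital": [0.85, 0.15, 0.05],
    "employee number": [0.1, 0.9, 0.1],
    "staff size": [0.05, 0.95, 0.0],
}
groups = cluster(vectors)

# One hypothetical regular expression per attribute cluster.
ATTRIBUTE_PATTERNS = {
    "registered_capital": re.compile(
        r"registered capital of (?P<value>[\d.,]+\s*(?:million|billion)?\s*(?:yuan|RMB))",
        re.IGNORECASE,
    ),
}
text = "Founded in 2015 with a registered capital of 50 million yuan."
match = ATTRIBUTE_PATTERNS["registered_capital"].search(text)
value = match.group("value") if match else None
```

Each resulting group corresponds to one attribute, and the regular expression attached to that attribute pulls the attribute value out of article text.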

iterating, according to a bootstrap method, with the i-th entity $R_i$ of the key entities as the center, and extracting the relation content of each entity to form the knowledge graph; in this embodiment, the bootstrap method is preferably the Bootstrapping algorithm.
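The source does not detail the Bootstrapping iteration, so the following is a hedged sketch of one common pattern-based variant, centered on a single entity: known patterns yield relation triples, and sentences mentioning the discovered entity pairs yield new patterns. The function name, the example company names and the plain substring matching are all assumptions made for illustration.

```python
def bootstrap(center, entities, sentences, seed_patterns, max_rounds=3):
    """Alternate between applying patterns and learning new ones,
    always with `center` as the left-hand entity."""
    patterns = set(seed_patterns)
    triples = set()
    for _ in range(max_rounds):
        # Step 1: apply every known pattern around the center entity.
        new_triples = set()
        for s in sentences:
            for p in patterns:
                for other in entities:
                    if other != center and f"{center} {p} {other}" in s:
                        new_triples.add((center, p, other))
        # Step 2: learn patterns from sentences that mention a known pair.
        known_others = {o for _, _, o in triples | new_triples}
        new_patterns = set()
        for s in sentences:
            for other in known_others:
                start, end = s.find(center), s.find(other)
                if start != -1 and end > start:
                    middle = s[start + len(center):end].strip()
                    if middle:
                        new_patterns.add(middle)
        # Stop once neither triples nor patterns grow any further.
        if new_triples <= triples and new_patterns <= patterns:
            break
        triples |= new_triples
        patterns |= new_patterns
    return triples, patterns


sentences = [
    "Acme Steel supplies Bolt Motors with sheet metal.",
    "Acme Steel is a supplier of Bolt Motors according to filings.",
    "Acme Steel is a supplier of Crane Works.",
]
entities = ["Acme Steel", "Bolt Motors", "Crane Works"]
triples, patterns = bootstrap("Acme Steel", entities, sentences, {"supplies"})
```

Starting from the single seed pattern "supplies", the second round picks up "is a supplier of" and with it the relation to the third entity. A production system would also score patterns to limit semantic drift, which this sketch omits.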

Preferably, the step of calculating the unigram occurrence probability and the bigram occurrence probability of the noun phrases is specifically: calculating the unigram occurrence probability of a noun phrase according to the unigram formula, and calculating the bigram occurrence probability of the noun phrase according to the bigram formula;

the unigram formula is as follows: (formula image missing from the source text)

the bigram formula is as follows: (formula image missing from the source text)

wherein $P_a(w_i)$ represents the unigram occurrence probability, $P_b(w_i)$ represents the bigram occurrence probability, $w_i$ represents the i-th noun phrase in the filtering entity, $N$ represents the number of times the noun phrase appears in the web page link, $n$ represents the total number of noun phrases in the filtering entity, and $m \in n$.
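Because the unigram and bigram formulas themselves do not survive in this text, the sketch below falls back on the standard maximum-likelihood estimates (relative frequency for the unigram, conditional relative frequency for the bigram). These estimates are an assumption and may differ from the patent's formulas; the function names and the sample phrase sequence are hypothetical.

```python
from collections import Counter


def unigram_probs(phrases):
    """Assumed estimate: count(w) / total number of phrase occurrences."""
    counts = Counter(phrases)
    total = len(phrases)
    return {w: c / total for w, c in counts.items()}


def bigram_probs(phrases):
    """Assumed estimate: count(prev, w) / count(prev as a left context)."""
    pair_counts = Counter(zip(phrases, phrases[1:]))
    left_counts = Counter(phrases[:-1])
    return {
        (prev, w): c / left_counts[prev]
        for (prev, w), c in pair_counts.items()
    }


# Noun phrases in the order they occur in a page (made-up data).
sequence = [
    "company", "registered capital",
    "company", "financing",
    "company", "registered capital",
]
p_a = unigram_probs(sequence)
p_b = bigram_probs(sequence)
```

Here "company" accounts for half of all occurrences, and "registered capital" follows "company" in two of its three appearances as a left context.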

The pairing probability of the noun phrases is calculated as follows:

$P = \lambda_1 P_a + \lambda_2 P_b$

wherein $P$ represents the pairing probability, $\lambda_1$ represents the weight of the unigram, $P_a$ represents the unigram occurrence probability, $\lambda_2$ represents the weight of the bigram, and $P_b$ represents the bigram occurrence probability. In actual implementation, the weight $\lambda_1$ of the unigram and the weight $\lambda_2$ of the bigram are obtained by iterative interpolation over existing samples using the EM (expectation-maximization) algorithm.
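As a hedged illustration of the weighted combination and of how EM can fit the two weights, the sketch below uses the textbook EM update for linear interpolation: each sample's responsibility for the unigram component is computed, and the new weight is the mean responsibility. The sample values, the function names and the screening threshold are assumptions; the source only says that screening is done "according to the pairing probability".

```python
def estimate_weights(samples, iterations=50):
    """EM for the interpolation weights.
    `samples` is a list of (p_a, p_b) pairs taken from existing data."""
    lam1, lam2 = 0.5, 0.5
    for _ in range(iterations):
        responsibilities = []
        for p_a, p_b in samples:
            mix = lam1 * p_a + lam2 * p_b
            # E-step: share of the mixture explained by the unigram term.
            responsibilities.append(lam1 * p_a / mix if mix > 0 else 0.5)
        # M-step: new weight is the average responsibility.
        lam1 = sum(responsibilities) / len(responsibilities)
        lam2 = 1.0 - lam1
    return lam1, lam2


def pairing_probability(p_a, p_b, lam1, lam2):
    """P = lam1 * P_a + lam2 * P_b."""
    return lam1 * p_a + lam2 * p_b


def screen(candidates, lam1, lam2, threshold):
    """Keep noun phrases whose pairing probability reaches the threshold
    (the threshold rule is an assumption)."""
    return [
        phrase
        for phrase, (p_a, p_b) in candidates.items()
        if pairing_probability(p_a, p_b, lam1, lam2) >= threshold
    ]


lam1, lam2 = estimate_weights([(0.1, 0.6), (0.2, 0.7), (0.05, 0.4)])
candidates = {"registered capital": (0.2, 0.8), "weather": (0.05, 0.0)}
kept = screen(candidates, 0.4, 0.6, threshold=0.1)
```

With fixed weights 0.4 and 0.6, "registered capital" scores 0.56 and is kept, while "weather" scores 0.02 and is discarded.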

In this embodiment, in order to increase the speed of identifying and crawling articles, the step of extracting the entity phrases and keywords in the web page links is preceded by a step of generating a web page library. The generation step screens out the web page links related to the industry information and discards those unrelated to it, so that no article identification or crawling is performed on unrelated links, which improves the efficiency of article identification and crawling.

the step of extracting the entity phrase and the keyword in the webpage link comprises the following steps: acquiring webpage links in a webpage library, and extracting entity phrases and keywords of the webpage links;

as shown in fig. 2, in this embodiment, the generating step of the web page library includes:

step S1: designating industrial information needing to create a knowledge graph, and creating a webpage library of the industrial information;

step S2: acquiring a new link site, traversing and searching according to the depth or breadth of the crawling program to acquire element information of the link site, judging whether the element information and the industry information have correlation, and discarding the link site if the element information and the industry information do not have correlation; if the element information and the industry information have correlation, executing step S3;

step S3: and adding the link site into a webpage library.

The element information in the embodiment comprises labels, abstracts and webpage contents of the link sites; specifically, whether the element information has correlation with the industry information is analyzed through a correlation analysis algorithm, the correlation analysis algorithm comprises a cosine similarity algorithm based on a keyword vector and other similarity algorithms, and the keyword vector is generated according to a column data model.
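The keyword-vector cosine similarity check can be sketched as follows. This is a minimal illustration: the vocabulary, the term-count vectors, the whitespace tokenization and the 0.5 threshold are all assumptions, and the source's "column data model" for building keyword vectors is not modelled here.

```python
import math
from collections import Counter


def keyword_vector(text, vocabulary):
    """Term-count vector of `text` over a fixed keyword vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def is_relevant(element_text, industry_text, vocabulary, threshold=0.5):
    """True when the element information is similar enough to the industry information."""
    u = keyword_vector(element_text, vocabulary)
    v = keyword_vector(industry_text, vocabulary)
    return cosine_similarity(u, v) > threshold


vocabulary = ["steel", "supply", "chain", "football"]
industry = "steel supply chain"
page_a = "steel supply chain news steel prices"
page_b = "football match results"
relevant_a = is_relevant(page_a, industry, vocabulary)
relevant_b = is_relevant(page_b, industry, vocabulary)
```

The first page shares most of its keywords with the industry description and passes; the second shares none and would be discarded in step S2.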

In this embodiment, the step of generating the web page library further includes: setting a data requirement, and collecting web page links on a given topic in a directed manner according to the data requirement, where the data requirement may be keywords of an industry or a description of a field. The web page links of the topic are taken as new link sites and are acquired using a depth-first or breadth-first algorithm. A new link site is a target page for which it must be determined, before or after batch downloading, whether it meets the data requirement; the link sites are stored on disk, and the depth-first or breadth-first traversal returns the selected link sites together with the root node of the monitored site. The selected link sites form the web page library corresponding to the data requirement.

In this embodiment, since some link sites contain strongly relevant external links, it is further necessary, in order to extract the link sites in depth, to determine whether an external link exists; step S3 therefore further includes: crawling and storing the article content in the link site, analyzing whether the article content has an external link, and if so, feeding the external link into step S2 as a new link site.
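Steps S1 to S3, including the feedback of external links into step S2, amount to a breadth-first crawl with a relevance gate. The sketch below runs against an in-memory stand-in for the web so that it is self-contained; the site names, the `fetch` and `relevance` callables and the keyword-overlap score are hypothetical, and the behaviour when the correlation equals the threshold is a choice the source leaves open.

```python
from collections import deque

# In-memory stand-in for the web: site -> (article text, external links).
FAKE_WEB = {
    "a.example": ("steel supply chain report", ["b.example", "c.example"]),
    "b.example": ("football scores", ["d.example"]),
    "c.example": ("steel mill financing", ["a.example"]),
    "d.example": ("steel prices", []),
}
INDUSTRY_KEYWORDS = {"steel", "supply", "chain", "financing"}


def fetch(site):
    return FAKE_WEB.get(site)


def relevance(text):
    """Fraction of words that are industry keywords (assumed scoring)."""
    words = text.split()
    return sum(w in INDUSTRY_KEYWORDS for w in words) / len(words)


def build_page_library(seed_sites, threshold):
    library = {}                    # S1: web page library for the industry
    queue = deque(seed_sites)
    visited = set()
    while queue:
        site = queue.popleft()      # S2: take a new link site
        if site in visited:
            continue
        visited.add(site)
        page = fetch(site)
        if page is None:
            continue
        text, external_links = page
        if relevance(text) < threshold:
            continue                # S2: discard unrelated link sites
        library[site] = text        # S3: add the site and store its content
        queue.extend(external_links)  # S3: external links go back to S2
    return library


library = build_page_library(["a.example"], threshold=0.5)
```

Note that `d.example` is never visited: it is only reachable through `b.example`, which is discarded as unrelated, so its links are never followed.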

As shown in FIG. 3, the step of extracting the entity phrases and keywords in the web page links is further preceded by a web page link update judging step;

the step of extracting the entity phrase and the keyword in the webpage link comprises the following steps: acquiring updated webpage links in a webpage library, and extracting entity phrases and keywords of the webpage links;

the web page link update judging step is specifically: polling the web page links in the web page library, and judging whether each web page link has been updated according to its stored article content, its web page type and the corresponding judgment basis; specifically, an update means that a web page link has published new content; once polling determines that a web page link in the web page library has been updated, the extraction and crawling steps are started for that web page link.

If no update judgment were performed, all web page links in the web page library would have to be crawled periodically, which consumes a large amount of time and network resources, and the crawling program would fetch the web page links blindly, without knowing whether they contain new information. Because periodic polling to judge updates is simpler and faster than crawling, the update judging step lets the crawling program obtain updated information from the web page links sooner, which improves the crawling rate and saves time and network resources.

The web page types comprise news websites and content websites; in this embodiment, the judgment basis depends on the website type as follows: for a news website, whether it has been updated is determined by whether a new link exists; for a content website, whether it has been updated is determined by its latest posting time, comment time and data fingerprint, where the data fingerprint is a unique identifier of the web page link or the article content and is generally obtained by applying a hash algorithm to the web page link or the article content. Specifically, news websites mainly refer to websites to which web page links are added frequently and dynamically, such as corporate websites and news portals; content websites mainly refer to highly interactive websites such as question-and-answer sites and forums on topics such as learning and careers.
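The two judgment bases can be sketched as follows. This is an illustration under assumptions: SHA-256 stands in for the unspecified hash algorithm, timestamps are ISO 8601 strings (which compare correctly as plain strings), and the record layout and function names are hypothetical.

```python
import hashlib


def data_fingerprint(content):
    """Fingerprint of article content; SHA-256 is an assumed choice of hash."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def news_site_updated(stored_links, current_links):
    """A news website counts as updated when a link appears that was not stored."""
    return bool(set(current_links) - set(stored_links))


def content_site_updated(stored, current):
    """A content website counts as updated when the latest posting time or
    comment time is newer, or when the data fingerprint has changed."""
    return (
        current["posted_at"] > stored["posted_at"]
        or current["commented_at"] > stored["commented_at"]
        or data_fingerprint(current["content"]) != data_fingerprint(stored["content"])
    )


stored = {
    "posted_at": "2023-03-01T08:00:00",
    "commented_at": "2023-03-02T09:30:00",
    "content": "Original answer text.",
}
unchanged = dict(stored)
edited = dict(stored, content="Original answer text, now revised.")
commented = dict(stored, commented_at="2023-03-05T12:00:00")

news_changed = news_site_updated(["/a", "/b"], ["/a", "/b", "/c"])
news_same = news_site_updated(["/a", "/b"], ["/b", "/a"])
```

Comparing fingerprints avoids storing or diffing full article text on every poll: only the stored digest and the freshly computed one need to be compared.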

As shown in fig. 4, the invention also discloses a knowledge graph construction system, and the knowledge graph construction method is used in the execution process; the system comprises an input module, an extraction module, a calculation module and an iteration module;

the extraction module is used for extracting entity phrases and keywords of web page links in the web page library and filtering to obtain filtering entities; the extraction module is also used for extracting key entities from the filtering entities according to the pairing probability;

the computing module is used for computing the unigram occurrence probability, the bigram occurrence probability and the pairing probability;

the computing module is also used for computing the feature word vector of each noun phrase in the key entity and analyzing and obtaining the attribute of the noun phrase;

the input module is used for setting a regular expression according to the attribute of the noun phrase;

the iteration module is used for iterating by taking a single entity in the key entities as a center according to the bootstrap method and extracting the relation content of the single entity to form a knowledge graph.

The input module is also used for inputting industry information and a set threshold value which need to create a knowledge graph;

the system also comprises a storage module, an analysis module and a judgment module;

the storage module is used for storing a webpage library corresponding to each piece of industry information;

the analysis module is used for acquiring a new link site and element information and analyzing the correlation between the element information and industry information;

the judging module is used for judging that if the correlation of the link site is larger than a set threshold value, the link site is added into the webpage library; if the correlation of the link site is less than the set threshold, discarding the link site.

The storage module is also used for crawling article contents of web page links stored in the web page library for the first time and storing the article contents;

in this embodiment, the analysis module includes a correlation analysis module and a directional acquisition module, the correlation analysis module is used for acquiring new link sites and element information, analyzing the correlation between the element information and industry information, the directional acquisition module is used for specifying data requirements, directionally acquiring link sites of a certain topic according to the data requirements, and screening out link sites meeting the data requirements through a depth and breadth first algorithm.

The judging module is also used for acquiring article contents of web page links of the web page library, and if the article contents have external links, the judging module sends the external links to the calculating module for correlation calculation;

the system also comprises an updating module; the updating module is used for polling the web page links in the web page library to judge whether the web page links have been updated; in this embodiment, when the updating module determines that a web page link in the web page library has been updated, it notifies the extraction module to perform the extraction, and the notification may use methods such as socket communication or passing the message through a shared data structure.

The extraction module is used for extracting the entity phrase and the keyword of the updated webpage link in the webpage library.
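The notification between the updating module and the extraction module via a shared data structure could look like the sketch below, which uses a thread-safe queue. The function names, the sentinel convention and the placeholder extraction are assumptions; a socket-based variant would replace the queue with a network connection.

```python
import queue
import threading

update_queue = queue.Queue()   # shared data structure between the modules
extracted = []                 # links handled by the extraction module


def update_module(links, is_updated):
    """Poll every link and notify the extraction module about updated ones."""
    for link in links:
        if is_updated(link):
            update_queue.put(link)
    update_queue.put(None)     # sentinel: polling round finished


def extraction_module():
    """Consume notifications until the sentinel arrives."""
    while True:
        link = update_queue.get()
        if link is None:
            break
        # Placeholder for entity phrase and keyword extraction.
        extracted.append(link)


worker = threading.Thread(target=extraction_module)
worker.start()
update_module(["site-a", "site-b", "site-c"], lambda link: link != "site-b")
worker.join()
```

Because the extraction module blocks on the queue, it does no work until the updating module reports a change, which matches the goal of avoiding blind crawling.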

In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

While embodiments of the present invention have been shown and described, it will be understood by those of ordinary skill in the art that: many changes, modifications, substitutions and variations may be made to the embodiments without departing from the spirit and principles of the invention, the scope of which is defined by the claims and their equivalents.

Claims

1. The knowledge graph construction method is characterized by comprising the following steps of:

extracting entity phrases and keywords in the webpage links, and filtering the entity phrases according to the keywords to obtain filtering entities;

acquiring noun phrases in the filtering entities, and calculating the unigram occurrence probability and the bigram occurrence probability of the noun phrases; obtaining the pairing probability of the noun phrases according to the unigram occurrence probability and the bigram occurrence probability;

screening noun phrases in the filtering entity according to the pairing probability to obtain a key entity; calculating the feature word vector of each noun phrase in the key entity, carrying out cluster analysis on the noun phrase according to the feature word vector to obtain the attribute of the noun phrase, and setting the regular expression of each attribute;

and iterating by taking a single entity in the key entities as a center according to a bootstrap method, and respectively extracting the relation content of each entity to form a knowledge graph.

2. The knowledge graph construction method as claimed in claim 1, wherein the step of calculating the unigram occurrence probability and the bigram occurrence probability of the noun phrases is specifically: calculating the unigram occurrence probability of a noun phrase according to the unigram formula, and calculating the bigram occurrence probability of the noun phrase according to the bigram formula;

the unigram formula is as follows: (formula image missing from the source text)

the bigram formula is as follows: (formula image missing from the source text)

wherein $P_a(w_i)$ represents the unigram occurrence probability, $P_b(w_i)$ represents the bigram occurrence probability, $w_i$ represents the i-th noun phrase in the filtering entity, $N$ represents the number of times the noun phrase appears in the web page link, $n$ represents the total number of noun phrases in the filtering entity, and $m \in n$.

3. The knowledge graph construction method according to claim 1 or 2, characterized in that the calculation process of the pairing probability of the noun phrases is as follows:

$P = \lambda_1 P_a + \lambda_2 P_b$

wherein $P$ represents the pairing probability, $\lambda_1$ represents the weight of the unigram, $P_a$ represents the unigram occurrence probability, $\lambda_2$ represents the weight of the bigram, and $P_b$ represents the bigram occurrence probability.

4. The knowledge graph construction method as claimed in claim 1 or 2, wherein the step of extracting the entity phrase and the keyword in the web page link further comprises a step of generating a web page library;

the generation step of the webpage library comprises the following steps:

step S2: acquiring a new link site, acquiring element information of the link site according to a crawling program, calculating the correlation between the element information and industry information, and discarding the link site if the correlation of the link site is smaller than a set threshold; if the correlation of the link site is greater than the set threshold, executing step S3;

step S3: and adding the link site into a webpage library.

5. The knowledge graph construction method as claimed in claim 4, wherein the step S3 further comprises: and crawling and storing the article content in the link site, analyzing whether the article content has an external link, and if so, putting the external link as a new link site into the step S2.

6. The knowledge graph construction method as claimed in claim 5, wherein the step of extracting the entity phrase and the keyword from the web page link further comprises a web page link update judging step;

the web page link update judging step is specifically: polling the web page links in the web page library, and judging whether each web page link has been updated according to its stored article content, its web page type and the corresponding judgment basis.

7. The knowledge graph construction method as claimed in claim 6, wherein the web page types include news websites and content websites; the judgment basis of a news website comprises whether the article content has a new link; the judgment basis of a content website comprises the latest posting time, the comment time and the data fingerprint.

8. A knowledge graph construction system characterized in that a knowledge graph construction method according to claim 1 is used in the execution process; the system comprises an input module, an extraction module, a calculation module and an iteration module;

the extraction module is used for extracting entity phrases and keywords of web page links in the web page library and filtering to obtain filtering entities;

the computing module is used for computing the unigram occurrence probability, the bigram occurrence probability and the pairing probability; the extraction module is also used for extracting key entities from the filtering entities according to the pairing probability;

9. The knowledge graph construction system of claim 8, wherein the input module is further configured to input industry information and set thresholds for creating a knowledge graph;

the analysis module is used for acquiring a new link site and acquiring element information;

the computing module is also used for computing the correlation between the element information of the link site and the industry information;

10. The knowledge graph construction system according to claim 9, wherein the storage module is further configured to crawl and store article contents linked to web pages stored in the web page library for the first time;

the system also comprises an updating module; the updating module is used for carrying out polling on the webpage links in the webpage library to judge whether the webpage links are updated or not;