CN112800305A

CN112800305A - Knowledge graph data extraction method and device based on web crawler

Info

Publication number: CN112800305A
Application number: CN202110034207.0A
Authority: CN
Inventors: 洪万福; 钱智毅; 吴文杰
Original assignee: Xiamen Yuanting Information Technology Co ltd
Current assignee: Xiamen Yuanting Information Technology Co ltd
Priority date: 2021-01-12
Filing date: 2021-01-12
Publication date: 2021-05-14

Abstract

The embodiment of the invention provides a knowledge graph data extraction method and device based on a web crawler, a readable storage medium and computing equipment, which are used for realizing crawler code multiplexing, deeply and automatically crawling webpage data in batches and avoiding the need of modifying a large number of webpage analysis codes caused by page changes. The method comprises the following steps: acquiring a target webpage for crawling data; configuring crawling rules and parsing rules of the target webpage; crawling the target webpage and a webpage linked with the target webpage according to the crawling rule; acquiring entity information and relationship information contained in the target webpage and a webpage linked with the target webpage according to the analysis rule; and generating a knowledge graph according to the entity information and the relation information.

Description

Knowledge graph data extraction method and device based on web crawler

Technical Field

The invention relates to the technical field of artificial intelligence and automatic machine learning, in particular to a knowledge graph data extraction method and device based on a web crawler, a readable storage medium and computing equipment.

Background

With the rapid development of networks, the World Wide Web has become a carrier of a large amount of information. When a knowledge graph is built, the data provided by an enterprise may not meet the needs of existing services: first, the data is not comprehensive enough; second, the data is time-sensitive and goes out of date. Enriching the database by crawling data from open-source websites is a good option. However, web page formats are not uniform, and even a single web page may contain different types of entities and relationships. Writing dedicated crawler code to extract each type of data has the following drawbacks: first, the workload is large, because parsing logic must be written for every entity and relationship on every page; second, the code is hard to maintain, because web pages are adjusted over time, and whenever the page structure changes the code must be adjusted as well.

Disclosure of Invention

To this end, the present invention provides a web crawler-based knowledge-graph data extraction method, apparatus, readable storage medium and computing device in an effort to solve or at least alleviate at least one of the problems identified above.

According to an aspect of the embodiment of the invention, a knowledge graph data extraction method based on a web crawler is provided, which comprises the following steps:

acquiring a target webpage for crawling data;

configuring crawling rules and parsing rules of the target webpage;

crawling the target webpage and a webpage linked with the target webpage according to the crawling rule; acquiring entity information and relationship information contained in the target webpage and a webpage linked with the target webpage according to the analysis rule;

and generating a knowledge graph according to the entity information and the relation information.

Optionally, the parsing rule includes an entity parsing rule and a relationship parsing rule. The entity parsing rule is used for parsing entity information of the target webpage and of a webpage linked with the target webpage, wherein the entity information comprises the attributes, labels and primary key of an entity. The relationship parsing rule is used for parsing relationship information of the target webpage and of the webpage linked with the target webpage, wherein the relationship information comprises the type of the relationship, the attributes of the relationship, the starting node and the ending node of the relationship, and the primary key of the relationship.

Optionally, configuring a parsing rule of the target webpage includes:

dividing the target webpage and the webpage linked with the target webpage into classes of different levels according to element information contained in the target webpage and the webpage linked with the target webpage;

and configuring the analysis rule corresponding to the webpage of each category for the webpage of each category.

Optionally, for each category of web pages, configuring a parsing rule corresponding to the category of web pages includes:

for each category of web pages, configuring field information and a selector corresponding to the category of web pages;

the field information is used for mapping entity information, or relationship information, or page attribute information, and the selector is used for extracting content corresponding to the field name according to the page language and content coding characteristics of the webpage.

Optionally, configuring the crawling rule of the target webpage includes:

for each category of web pages, configuring crawling breadth rules and crawling depth rules corresponding to the category of web pages;

the crawling breadth rule is configured to determine a crawling entry of a next webpage at the same level according to the page attribute information and the page content distribution characteristics, and the crawling depth rule is configured to determine a crawling entry of a next webpage at a next level according to the page attribute information and the page content distribution characteristics.

Optionally, the method further comprises:

and adopting a distributed crawler network to perform continuous data crawling on the target webpage and the webpage linked with the target webpage regularly.

Optionally, the configuring a selector corresponding to the category of web pages includes:

and configuring a regular expression-based selection rule which is made according to CSS language or XPATH language and corresponds to the webpage of the category.

According to another aspect of the embodiments of the present invention, there is provided a web crawler-based knowledge-graph data extraction apparatus, including:

the data source acquisition unit is used for acquiring a target webpage for crawling data;

the analysis rule configuration unit is used for configuring the crawling rule and the analysis rule of the target webpage;

the data crawling and analyzing unit is used for crawling the target webpage and the webpage linked with the target webpage according to the crawling rule; acquiring entity information and relationship information contained in the target webpage and a webpage linked with the target webpage according to the analysis rule;

and the map generation unit is used for generating a knowledge map according to the entity information and the relation information.

According to yet another aspect of the present invention, there is provided a readable storage medium having executable instructions thereon, which when executed, cause a computer to perform a web crawler-based knowledge-graph data extraction method as described above.

According to yet another aspect of the present invention, there is provided a computing device comprising: one or more processors; a memory; and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors to perform a web crawler-based knowledge-graph data extraction method as described above.

According to the technical scheme provided by the invention, data are captured for the knowledge graph based on a web crawler technology, and automatic crawling and analysis of web page data are realized by configuring a rule template, so that extraction of knowledge graph data is completed; because the analysis logic of the knowledge graph does not need to be independently written for each page, and the analysis code does not need to be adjusted when the structure of the webpage changes, the efficiency of extracting data from the knowledge graph in the webpage is greatly improved.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the invention and together with the description serve to explain the principles of the invention.

FIG. 1 is a block diagram of an exemplary computing device;

FIG. 2 is a flow chart diagram of a web crawler-based knowledge-graph data extraction method according to an embodiment of the present invention;

FIG. 3 is a schematic flow chart diagram of a web crawler-based knowledge-graph data extraction method according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of a web crawler-based knowledge-graph extraction process according to an embodiment of the present invention;

FIG. 5 is a computer presentation interface screen shot of a web crawler-based knowledge-graph extraction result, in accordance with a specific embodiment of the present invention;

FIG. 6 is a schematic structural diagram of a web crawler-based knowledge-graph data extraction device according to an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.

FIG. 1 is a block diagram of an example computing device 100 arranged to implement a web crawler-based knowledge-graph data extraction method in accordance with the present invention. In a basic configuration 102, computing device 100 typically includes system memory 106 and one or more processors 104. A memory bus 108 may be used for communication between the processor 104 and the system memory 106.

Depending on the desired configuration, the processor 104 may be any type of processor, including but not limited to: a microprocessor (μP), a microcontroller (μC), a Digital Signal Processor (DSP), or any combination thereof. The processor 104 may include one or more levels of cache, such as a level one cache 110 and a level two cache 112, a processor core 114, and registers 116. The example processor core 114 may include an Arithmetic Logic Unit (ALU), a Floating Point Unit (FPU), a digital signal processing core (DSP core), or any combination thereof. The example memory controller 118 may be used with the processor 104, or in some implementations the memory controller 118 may be an internal part of the processor 104.

Depending on the desired configuration, system memory 106 may be any type of memory, including but not limited to: volatile memory (such as RAM), non-volatile memory (such as ROM, flash memory, etc.), or any combination thereof. System memory 106 may include an operating system 120, one or more programs 122, and program data 124. In some implementations, the program 122 can be configured to execute instructions on an operating system by one or more processors 104 using program data 124.

Computing device 100 may also include an interface bus 140 that facilitates communication from various interface devices (e.g., output devices 142, peripheral interfaces 144, and communication devices 146) to the basic configuration 102 via the bus/interface controller 130. The example output device 142 includes a graphics processing unit 148 and an audio processing unit 150. They may be configured to facilitate communication with various external devices, such as a display terminal or speakers, via one or more A/V ports 152. Example peripheral interfaces 144 may include a serial interface controller 154 and a parallel interface controller 156, which may be configured to facilitate communication with external devices such as input devices (e.g., keyboard, mouse, pen, voice input device, touch input device) or other peripherals (e.g., printer, scanner, etc.) via one or more I/O ports 158. An example communication device 146 may include a network controller 160, which may be arranged to facilitate communications with one or more other computing devices 162 over a network communication link via one or more communication ports 164.

A network communication link may be one example of a communication medium. Communication media may typically be embodied by computer readable instructions, data structures, program modules, and may include any information delivery media, such as carrier waves or other transport mechanisms, in a modulated data signal. A "modulated data signal" may be a signal that has one or more of its data set or its changes made in such a manner as to encode information in the signal. By way of non-limiting example, communication media may include wired media such as a wired network or private-wired network, and various wireless media such as acoustic, Radio Frequency (RF), microwave, Infrared (IR), or other wireless media. The term computer readable media as used herein may include both storage media and communication media.

Computing device 100 may be implemented as part of a small-form factor portable (or mobile) electronic device such as a cellular telephone, a Personal Digital Assistant (PDA), a personal media player device, a wireless web-watch device, a personal headset device, an application specific device, or a hybrid device that includes any of the above functions. Computing device 100 may also be implemented as a personal computer including desktop and notebook configurations, a server, or a virtual computing device in a cluster of multiple computers.

Among other things, one or more programs 122 of computing device 100 include instructions for performing a web crawler-based method of knowledge-graph data extraction in accordance with the present invention.

FIG. 2 illustrates a flow chart of a web crawler-based knowledge-graph data extraction method according to the present invention, the method beginning at step S210.

In step S210, a target web page for crawling data is acquired.

Specifically, the URLs of open-source web pages can be collected in advance. The graph data comes from the open-source network, so the URLs of open-source web pages containing the required graph data are collected in preparation for the subsequent configuration of web page parsing rules and for data storage. The collected target web pages may include multiple portal pages of the same website and may also include portal pages of different websites.

Subsequently, in step S220, the crawling rules and parsing rules of the target web page are configured.

Step S220 specifically includes: configuring a universal, adjustable entity parsing rule, a universal, adjustable relationship parsing rule and a universal, adjustable depth crawling rule, and parsing the HTML document according to the configured rules in process order. The entity parsing rule is configured to parse, in batches, the entity information present in a page, including parsing the attributes of an entity, setting the labels of the entity and setting the primary key of the entity. The relationship parsing rule is configured to parse, in batches, the relationships present in the page, including parsing the type of a relationship, the attributes of the relationship, the starting node and the ending node of the relationship, and setting the primary key of the relationship. The depth crawling rule is configured to obtain the URL of the next page to be crawled by parsing specific information in the HTML document; that URL serves as the entry for crawling the next page.
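The entity and relationship information named above can be pictured as two small record types. The following is a minimal sketch only; the class names, field names and sample values (Entity, Relationship, "BELONGS_TO" and so on) are illustrative assumptions and do not appear in the original disclosure.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """Entity information: label, primary key and free-form attributes."""
    label: str
    primary_key: str
    attributes: dict = field(default_factory=dict)


@dataclass
class Relationship:
    """Relationship information: type, primary key, start/end nodes, attributes."""
    rel_type: str
    primary_key: str
    start_key: str  # primary key of the starting node
    end_key: str    # primary key of the ending node
    attributes: dict = field(default_factory=dict)


# Hypothetical sample data in the spirit of the weapon-encyclopedia example.
category = Entity(label="WeaponCategory", primary_key="rifles",
                  attributes={"name": "Rifles"})
weapon = Entity(label="Weapon", primary_key="ak-47",
                attributes={"name": "AK-47"})
belongs = Relationship(rel_type="BELONGS_TO", primary_key="ak-47->rifles",
                       start_key=weapon.primary_key, end_key=category.primary_key)
```

Keeping the primary key separate from the attributes makes it possible to recognize the same entity or relationship when it is parsed again from another page.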

The data resides in HTML web pages, and the data format differs from page to page. By configuring universal web page parsing rules, the entities and relationships present in the pages are parsed in batches, and a new URL parsed from the current page serves as the entry to the next page, so that data crawling and knowledge graph data extraction for the target webpage proceed automatically.

Specifically, through a visual interface, a user can set a web page entry on a template interface, configure the entity and relationship mapping rules of the web page, and configure the entry rule for the next page of the current page together with the entity and relationship mapping rules of that next page.

Subsequently, in step S230, the target webpage and the webpage linked to the target webpage are crawled according to the crawling rule; and acquiring entity and relationship information contained in the target webpage and the webpage linked with the target webpage according to the analysis rule.

Preferably, a distributed crawler network is used to crawl data from the target webpage and the webpages linked with it periodically and continuously. Combining distributed crawler technology with the modeling and construction of the knowledge graph ontology enables continuous construction of the knowledge graph: the distributed crawler crawls web page data periodically and continuously, and the mapping rules configured in the parsing template are used to keep extracting and building an industry knowledge graph, so that the knowledge graph continuously extracts and learns knowledge.
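The periodic, parallel crawling described above can be approximated in a few lines. This is a rough sketch under strong assumptions: worker threads in one process stand in for the distributed crawler network, the fetch function is a placeholder that performs no network access, and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(url):
    """Placeholder for a real HTTP fetch; returns fake page content."""
    return f"<html>{url}</html>"


def crawl_round(urls, workers=4):
    """Fetch one batch of URLs in parallel and return {url: content}."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as the input URLs.
        return dict(zip(urls, pool.map(fetch, urls)))


def crawl_periodically(urls, rounds, on_round):
    """Run a fixed number of crawl rounds.

    A real deployment would wait between rounds (timer or scheduler);
    the wait is omitted here so the sketch finishes immediately.
    """
    for index in range(rounds):
        on_round(index, crawl_round(urls))


log = []
crawl_periodically(["/a", "/b"], rounds=2,
                   on_round=lambda i, pages: log.append((i, sorted(pages))))
```

Each round hands its pages to a callback, which is where the configured parsing rules would be applied before the results are written to the graph.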

Subsequently, in step S240, a knowledge graph is generated from the entity information and the relationship information.

Specifically, the entity and relationship objects parsed from the web pages are written to the database by calling the data insertion interface of the graph; when an exception occurs during extraction, the knowledge graph can roll back the data.
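The rollback behavior can be illustrated with a transactional insert. The sketch below uses an in-memory SQLite database purely as a stand-in for the graph store, which the original text does not name; the table layout and the insert_graph function are assumptions made for illustration.

```python
import sqlite3


def insert_graph(conn, entities, relationships):
    """Insert all rows in one transaction; undo everything on any error."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.executemany(
                "INSERT INTO entity (pk, label) VALUES (?, ?)", entities)
            conn.executemany(
                "INSERT INTO relationship (pk, rel_type, start_pk, end_pk) "
                "VALUES (?, ?, ?, ?)", relationships)
        return True
    except sqlite3.Error:
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entity (pk TEXT PRIMARY KEY, label TEXT NOT NULL)")
conn.execute("CREATE TABLE relationship ("
             "pk TEXT PRIMARY KEY, rel_type TEXT, start_pk TEXT, end_pk TEXT)")

ok = insert_graph(
    conn,
    [("rifles", "WeaponCategory"), ("ak-47", "Weapon")],
    [("ak-47->rifles", "BELONGS_TO", "ak-47", "rifles")],
)
# The duplicate primary key "ak-47" makes this batch fail part-way through,
# so the row for "m16" inserted just before it is rolled back as well.
failed = insert_graph(conn, [("m16", "Weapon"), ("ak-47", "Weapon")], [])
entity_count = conn.execute("SELECT COUNT(*) FROM entity").fetchone()[0]
```

Grouping one extraction run into a single transaction means a failed run leaves the stored graph exactly as it was before the run started.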

Further, in step S220, configuring a parsing rule of the target webpage, including: dividing the target webpage and the webpage linked with the target webpage into different levels of categories according to element information contained in the target webpage and the webpage linked with the target webpage; and configuring the analysis rule corresponding to the webpage of each category for the webpage of each category. According to the embodiment of the invention, the webpages are classified and the analysis rules are respectively configured according to the element information characteristics contained in the webpages, so that the analysis codes are prevented from being independently written for each webpage, and the workload is reduced.
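One way to picture the classification step is to test each page for a marker element that is typical of its category. The sketch below is an assumption-laden illustration: the category names, the marker paths other than the dl element with class f14, and the sample pages are hypothetical, and the standard-library XML parser used here only accepts well-formed markup, so real HTML would need a more tolerant parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical markers: a page belongs to the first category whose
# marker element is present in the document.
CATEGORY_MARKERS = [
    ("category_list", ".//dl[@class='f14']"),
    ("weapon_list", ".//ul[@class='weapons']"),
    ("weapon_detail", ".//table[@class='infobox']"),
]


def classify(html):
    """Return the category name of a page, or None if no marker matches."""
    root = ET.fromstring(html)
    for category, marker in CATEGORY_MARKERS:
        if root.find(marker) is not None:
            return category
    return None


LIST_PAGE = ('<html><body><dl class="f14">'
             '<dt><a href="/c/1">Rifles</a></dt></dl></body></html>')
DETAIL_PAGE = ('<html><body><table class="infobox">'
               '<tr><td>AK-47</td></tr></table></body></html>')
OTHER_PAGE = '<html><body><p>About</p></body></html>'

kinds = [classify(page) for page in (LIST_PAGE, DETAIL_PAGE, OTHER_PAGE)]
```

Once a page has a category, the parsing rules configured for that category can be looked up by name instead of being written per page.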

Further, for each category of web pages, configuring a parsing rule corresponding to the category of web pages, including: for each category of web pages, configuring field information and a selector corresponding to the category of web pages; the field information is used for mapping entity information, or relationship information, or page attribute information, and the selector is used for extracting content corresponding to the field name according to the page language and content coding characteristics of the webpage. Wherein the configuration of the selector comprises: and configuring a regular expression-based selection rule which is made according to CSS language or XPATH language and corresponds to the webpage of the category.
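A selector that combines a path expression with a regular expression might work as sketched below. The sample page, the class name "range" and the select function are hypothetical; the path uses the limited XPath subset supported by Python's standard library, which is an assumption about tooling rather than something stated in the original.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical page fragment; real pages would differ.
PAGE = """<html><body>
<div class="info"><span class="range">Effective range: 400 m</span></div>
</body></html>"""


def select(root, path, pattern=None):
    """Return the text of every element matching `path`.

    If `pattern` is given, keep only its first capture group and skip
    elements whose text does not match.
    """
    results = []
    for elem in root.findall(path):
        text = "".join(elem.itertext()).strip()
        if pattern is not None:
            match = re.search(pattern, text)
            if match is None:
                continue
            text = match.group(1)
        results.append(text)
    return results


root = ET.fromstring(PAGE)
values = select(root, ".//span[@class='range']", r"(\d+)\s*m")
raw = select(root, ".//span[@class='range']")
```

The path locates the element, and the regular expression trims the element text down to the value that the configured field should hold.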

Further, in step S220, the crawling rule of the target webpage is configured, including: for each category of web pages, configuring crawling breadth rules and crawling depth rules corresponding to the category of web pages; the crawling breadth rule is configured to determine a crawling entry of the next webpage at the same level according to the page attribute information and the page content distribution characteristics, and the crawling depth rule is configured to determine a crawling entry of the next webpage at the next level according to the page attribute information and the page content distribution characteristics.
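The interplay of the breadth rule and the depth rule can be sketched with an in-memory site. Everything in this example is hypothetical: the SITE dictionary stands in for fetched pages, and the "next" and "children" keys stand in for whatever page attributes and content layout the configured rules would actually inspect.

```python
from collections import deque

# Hypothetical site: each page names its same-level successor ("next",
# e.g. a pagination link) and its next-level pages ("children").
SITE = {
    "/list?page=1": {"next": "/list?page=2", "children": ["/item/a", "/item/b"]},
    "/list?page=2": {"next": None, "children": ["/item/c"]},
    "/item/a": {"next": None, "children": []},
    "/item/b": {"next": None, "children": []},
    "/item/c": {"next": None, "children": []},
}


def breadth_rule(page):
    """Crawl entries for the next page at the same level."""
    return [page["next"]] if page["next"] else []


def depth_rule(page):
    """Crawl entries for pages one level down."""
    return list(page["children"])


def crawl(entry, fetch=SITE.get):
    """Visit every page reachable from `entry`, each page once."""
    seen, order = {entry}, []
    queue = deque([entry])
    while queue:
        url = queue.popleft()
        page = fetch(url)
        if page is None:  # unknown URL: nothing to parse
            continue
        order.append(url)
        for target in breadth_rule(page) + depth_rule(page):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order


visited = crawl("/list?page=1")
```

The seen set prevents a page from being crawled twice when several pages link to it.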

The inventive concept is described in more detail below with reference to specific embodiments.

Referring to FIG. 3 and FIG. 4, a method for extracting knowledge-graph data based on web crawlers according to an embodiment of the present invention includes:

Step 1: Collect web page URLs as the data source for data extraction. For example, to crawl weapon encyclopedia data, the crawl address is: http://www.wuqibaike.com/index.

Step 2: Configure the web page parsing template. For each web page, configure the entity and relationship mapping rules according to the element information in the different web pages, including the mapping rules for entity attributes, entity labels, relationship attributes and relationship types.

Category list data mapping:

No.	Field name	Selector type	Selector	Attribute or text
1	Weapon category	CSS	dl.f14 dt>a	Text
2	Category URL	CSS	dl.f14 dt>a	Attribute
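The two rows of the category list mapping can be applied to a page roughly as follows. This is a sketch under stated assumptions: the sample markup is invented, the attribute read for the category URL is assumed to be href, and because Python's standard library has no CSS selector engine, the CSS selector dl.f14 dt>a is rewritten by hand as an approximately equivalent ElementTree path.

```python
import xml.etree.ElementTree as ET

# Invented sample markup shaped to match the selector in the table.
SAMPLE = """<html><body>
<dl class="f14">
  <dt><a href="/category/rifles">Rifles</a></dt>
  <dt><a href="/category/pistols">Pistols</a></dt>
</dl>
</body></html>"""

# (field name, element path, attribute to read or None for element text)
FIELDS = [
    ("weapon_category", ".//dl[@class='f14']//dt/a", None),
    ("category_url", ".//dl[@class='f14']//dt/a", "href"),  # href is assumed
]


def extract(html, fields):
    """Apply every field rule to the page and collect the values per field."""
    root = ET.fromstring(html)
    record = {}
    for name, path, attribute in fields:
        elements = root.findall(path)
        if attribute is None:
            record[name] = [(e.text or "").strip() for e in elements]
        else:
            record[name] = [e.get(attribute) for e in elements]
    return record


rows = extract(SAMPLE, FIELDS)
```

Both fields use the same selector; the last column of the table decides whether the element text or one of its attributes is taken as the value.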

Weapon list data mapping:

weapon detail data mapping:

Step 3: Configure the crawling breadth rule and the crawling depth rule of the web page.

Step 4: Run the data extraction. According to the configuration of the web page parsing template, the program implementing the web page parsing rules is run to parse the entity and relationship information of each web page, and the extraction result is output.

Step 5: Import the entity and relationship objects generated in step 4 into the graph through the knowledge extraction function of the graph, and render the visualized graph; the visual effect is shown in the screenshot of FIG. 5.
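Steps 4 and 5 turn extracted rows into entity and relationship objects ready for import. The sketch below shows one possible way to do that with primary keys used for deduplication; the row layout, the key format and the "BELONGS_TO" relationship type are illustrative assumptions, not details given in the embodiment.

```python
def build_graph(rows):
    """Build entity and relationship records from extracted rows.

    Each row is a dict with hypothetical "category" and "weapon" names.
    Records are keyed by primary key, so repeated rows add nothing new.
    """
    entities, relationships = {}, {}
    for row in rows:
        category_key = f"category:{row['category']}"
        weapon_key = f"weapon:{row['weapon']}"
        entities.setdefault(
            category_key, {"label": "WeaponCategory", "name": row["category"]})
        entities.setdefault(
            weapon_key, {"label": "Weapon", "name": row["weapon"]})
        relationships.setdefault(
            f"{weapon_key}->{category_key}",
            {"type": "BELONGS_TO", "start": weapon_key, "end": category_key})
    return entities, relationships


extracted = [
    {"category": "Rifles", "weapon": "AK-47"},
    {"category": "Rifles", "weapon": "M16"},
    {"category": "Rifles", "weapon": "AK-47"},  # duplicate row from a re-crawl
]
entities, relationships = build_graph(extracted)
```

Because the same page may be crawled repeatedly, keying every record by its primary key keeps the imported graph free of duplicate nodes and edges.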

Compared with the background technology, the invention has the following advantages:

First, data can be extracted through visual configuration even by users who cannot write a crawler: knowledge of the HTML page is enough to convert the HTML into graph data, and the user does not need to consider how the back end performs the extraction. Second, the method is suitable for extracting data from HTML documents of various forms, and the extraction logic is reused: for different web pages, the back-end extraction code does not need to be adjusted, and only the parsing template needs to be modified in the visual interface. Third, the depth crawling and breadth crawling functions of the crawler are integrated, so that data crawling is more comprehensive.

Referring to FIG. 6, an embodiment of the present invention further provides a web crawler-based knowledge graph data extraction apparatus, including:

a data source obtaining unit 610, configured to obtain a target web page for crawling data;

a parsing rule configuration unit 620, configured to configure crawling rules and parsing rules of the target web page;

the data crawling analysis unit 630 is configured to crawl the target webpage and a webpage linked with the target webpage according to the crawling rule; acquiring entity information and relationship information contained in the target webpage and a webpage linked with the target webpage according to the analysis rule;

and the map generating unit 640 is configured to generate a knowledge map according to the entity information and the relationship information.

Optionally, when the parsing rule configuring unit 620 is configured to configure the parsing rule of the target webpage, it is specifically configured to: dividing the target webpage and the webpage linked with the target webpage into classes of different levels according to element information contained in the target webpage and the webpage linked with the target webpage; and configuring the analysis rule corresponding to the webpage of each category for the webpage of each category.

Optionally, the parsing rule configuring unit 620 is configured to, when configuring, for each category of web pages, a parsing rule corresponding to the category of web pages, specifically: for each category of web pages, configuring field information and a selector corresponding to the category of web pages; the field information is used for mapping entity information, or relationship information, or page attribute information, and the selector is used for extracting content corresponding to the field name according to the page language and content coding characteristics of the webpage.

Optionally, when the parsing rule configuring unit 620 is configured to configure the crawling rule of the target webpage, specifically configured to: for each category of web pages, configuring crawling breadth rules and crawling depth rules corresponding to the category of web pages; the crawling breadth rule is configured to determine a crawling entry of a next webpage at the same level according to the page attribute information and the page content distribution characteristics, and the crawling depth rule is configured to determine a crawling entry of a next webpage at a next level according to the page attribute information and the page content distribution characteristics.

Optionally, when the parsing rule configuring unit 620 is configured to configure the selector corresponding to the category of web pages, it is specifically configured to: and configuring a regular expression-based selection rule which is made according to CSS language or XPATH language and corresponds to the webpage of the category.

It should be understood that the various techniques described herein may be implemented in connection with hardware or software or, alternatively, with a combination of both. Thus, the methods and apparatus of the present invention, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in tangible media, such as floppy diskettes, CD-ROMs, hard drives, or any other machine-readable storage medium, wherein, when the program is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the invention.

In the case of program code execution on programmable computers, the computing device will generally include a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. Wherein the memory is configured to store program code; the processor is configured to perform the various methods of the present invention according to instructions in the program code stored in the memory.

By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media store information such as computer readable instructions, data structures, program modules or other data. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. Combinations of any of the above are also included within the scope of computer readable media.

It should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, this method of disclosure should not be construed as reflecting an intention that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.

Those skilled in the art will appreciate that the modules or units or components of the apparatus in the examples disclosed herein may be arranged in an apparatus as described in this embodiment or alternatively may be located in one or more apparatuses different from the apparatus in this example. The modules in the foregoing examples may be combined into one module or may be further divided into multiple sub-modules.

Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.

Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.

Furthermore, some of the described embodiments are described herein as a method or combination of method elements that can be performed by a processor of a computer system or by other means of performing the described functions. A processor having the necessary instructions for carrying out the method or method elements thus forms a means for carrying out the method or method elements. Further, the elements of the apparatus embodiments described herein are examples of the following apparatus: the apparatus is used to implement the functions performed by the elements for the purpose of carrying out the invention.

As used herein, unless otherwise specified the use of the ordinal adjectives "first", "second", "third", etc., to describe a common object, merely indicate that different instances of like objects are being referred to, and are not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking, or in any other manner.

While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of this description, will appreciate that other embodiments can be devised which do not depart from the scope of the invention as described herein. Furthermore, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The present invention is to be considered as illustrative and not restrictive in character, with the scope of the invention being indicated by the appended claims.

Claims

1. A knowledge graph data extraction method based on web crawlers is characterized by comprising the following steps:

acquiring a target webpage for crawling data;

configuring crawling rules and parsing rules of the target webpage;

2. The method of claim 1, wherein the parsing rule comprises: an entity parsing rule and a relationship parsing rule; the entity parsing rule is used for parsing entity information of the target webpage and of a webpage linked with the target webpage, wherein the entity information comprises the attributes, labels and primary key of an entity; the relationship parsing rule is used for parsing relationship information of the target webpage and of the webpage linked with the target webpage, wherein the relationship information comprises the type of the relationship, the attributes of the relationship, the starting node and the ending node of the relationship, and the primary key of the relationship.

3. The method of claim 1, wherein configuring the parsing rules for the target web page comprises:

4. The method of claim 3, wherein configuring, for each category of web pages, a parsing rule corresponding to the category of web pages comprises:

5. The method of claim 3, wherein configuring crawling rules for the target web page comprises:

6. The method of claim 5, further comprising:

7. The method of claim 4, wherein configuring the selector corresponding to the category of web pages comprises:

8. A knowledge-graph data extraction device based on web crawlers is characterized by comprising:

9. A readable storage medium having executable instructions thereon that, when executed, cause a computer to perform the method of any one of claims 1-7.

10. A computing device, comprising:

one or more processors;

a memory; and

one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors to perform the method recited in any of claims 1-7.