WO2017101591A1

WO2017101591A1 - Method for constructing knowledge base, and controller

Info

Publication number: WO2017101591A1
Application number: PCT/CN2016/103419
Authority: WO
Inventors: 卢剑锋
Original assignee: 华为技术有限公司
Priority date: 2015-12-17
Filing date: 2016-10-26
Publication date: 2017-06-22
Also published as: CN105589945A

Abstract

Provided are a method for constructing a knowledge base, and a controller, relating to the technical field of the Internet. The method addresses the existing problem that a constructed domain knowledge base is incomplete because it is limited by the richness of the information on WEB detail pages. The method provided in the present invention comprises: receiving a knowledge base construction task, wherein the knowledge base construction task contains a task name identifying a knowledge base to be constructed; querying a task configuration corresponding to the task name, wherein the task configuration comprises at least two sub-tasks; sending the at least two sub-tasks to a crawler, and triggering the crawler to execute the at least two sub-tasks to obtain at least two items of structured data; receiving the at least two items of structured data returned by the crawler; and merging the at least two items of structured data, and saving the merged structured data in the knowledge base corresponding to the task name.

Description

Knowledge base construction method and controller

This application claims priority to Chinese Patent Application No. 201510953365.0, filed on Dec. 17, 2015, entitled "A Knowledge Base Construction Method, Controller", the entire contents of which are incorporated herein by reference.

Technical field

The present invention relates to the field of Internet technologies, and in particular, to a knowledge base construction method and a controller.

Background

With the development of the Internet, the information on the Internet is growing rapidly. To ensure that computer applications can understand and intelligently handle target objects, it is necessary to build and use a domain knowledge base that is as rich, accurate and timely as possible. At present, domain knowledge bases are constructed with automatic or semi-automatic knowledge extraction methods, for example: crawling encyclopedia sites and vertical websites with custom crawlers, and obtaining semi-structured information, such as object attributes and tables, from web detail pages to build the domain knowledge base.

However, in the process of implementing the present invention, the inventors found that, when a domain knowledge base is constructed by extracting data from WEB detail pages, the completeness of the knowledge attributes filled in for a domain object is limited by the richness of the WEB detail page information. When the WEB detail page information is not rich enough, the knowledge attributes extracted from it are insufficient, and the domain object cannot be completely described. For example, a specific music detail page often includes only a small amount of information related to the piece of music, such as the singer, the album, and a few tags; the style, classification, scene, and other information to which the music belongs cannot be obtained from the detail page, which affects the completeness of the music knowledge base.

Summary of the invention

The main object of the present invention is to provide a knowledge base construction method and a controller, to solve the existing problem that insufficiently rich WEB detail page information results in an incomplete domain knowledge base.

In order to achieve the above object, embodiments of the present invention adopt the following technical solutions:

In a first aspect, an embodiment of the present invention provides a method for constructing a knowledge base, which is applied to a controller, and the method may include:

Receiving a knowledge base construction task; the knowledge base construction task includes a task name identifying the knowledge base to be built;

Querying a task configuration corresponding to the task name; the task configuration includes: at least two subtasks, each subtask corresponding to: a target website, an extraction template, and a webpage type; each subtask is used to: instruct the crawler to perform, according to the extraction template, structured data extraction on a page corresponding to the webpage type in the target website;

Sending the at least two subtasks to the crawler, triggering the crawler to execute the at least two subtasks to obtain at least two structured data;

Receiving the at least two structured data returned by the crawler;

The at least two structured data are merged, and the merged structured data is stored in a knowledge base corresponding to the task name.

Optionally, according to the currently known webpage types, the at least two subtasks may include: a first subtask and a second subtask, wherein the webpage type of the first subtask is a detail page, and the webpage type of the second subtask is an index navigation page.

In order to enable the controller to conveniently query the task configuration corresponding to the task name, in an implementation manner of the first aspect, before receiving the knowledge base construction task, the method may further include:

Receiving a creation request; the creation request includes: the task name and a task attribute; and storing a correspondence between the task name and the task attribute.

Further, in another implementation manner of the first aspect, the method may further include:

Receiving a query request sent by the user; the query request includes: the task name;

Querying a knowledge base corresponding to the task name, and feeding back structured data in the knowledge base to the user.

Further, since domain knowledge information is constantly being updated, in order to keep the knowledge information in the constructed knowledge base up to date, in another implementation manner of the first aspect, the receiving the knowledge base construction task may include:

Receiving the knowledge base construction task periodically;

The storing the merged structured data in the knowledge base corresponding to the task name may include:

The structured data existing in the knowledge base is deleted, and the currently merged structured data is stored in the knowledge base.

In this way, the knowledge base is constructed by extracting knowledge from multiple types of web pages. Since different types of web pages contain knowledge information of different attributes, the knowledge information extracted from the different web pages can be merged and summarized, which greatly enriches the types of knowledge information and achieves the purpose of enriching and perfecting the domain knowledge base. This avoids the existing problem that extracting only the content of a single type of page (such as the detail page) yields insufficient knowledge information, so that the constructed domain knowledge base is not rich enough.

In a second aspect, an embodiment of the present invention provides a controller, which may include:

An interface unit, configured to receive a knowledge base construction task; the knowledge base construction task includes a task name that identifies the knowledge base to be built;

a task scheduling unit, configured to query a task configuration corresponding to the task name received by the interface unit; the task configuration includes: at least two subtasks, each subtask corresponding to: a target website, an extraction template, and a webpage type; each subtask is used to: instruct the crawler to perform structured data extraction on the page corresponding to the webpage type in the target website according to the extraction template;

send the at least two subtasks to the crawler, triggering the crawler to execute the at least two subtasks to obtain at least two structured data;

and receive the at least two structured data returned by the crawler, and merge the at least two structured data;

a task storage unit, configured to store the structured data merged by the task scheduling unit into a knowledge base corresponding to the task name.

In an implementation manner of the second aspect, the interface unit may be further configured to:

Receiving a creation request before receiving the knowledge base construction task; the creation request includes: the task name and a task attribute;

The controller may further include: a task management unit;

The task management unit is configured to store, after the interface unit receives the creation request, a correspondence between the task name and the task attribute.

Further, in another implementation manner of the second aspect, the interface unit may be further configured to:

Receive a query request sent by the user; the query request includes: the task name;

The task scheduling unit may be further configured to query a knowledge base corresponding to the task name, and feed back structured data in the knowledge base to the user.

Further, since domain information is continuously updated, in order to keep the knowledge information in the constructed knowledge base up to date, in another implementation manner of the second aspect, the interface unit is specifically configured to:

Receive the knowledge base construction task periodically;

The task storage unit is specifically configured to delete the structured data existing in the knowledge base, and store the currently merged structured data into the knowledge base.

In a third aspect, an embodiment of the present invention provides a controller, which may include:

a communication unit, configured to receive a knowledge base construction task; the knowledge base construction task includes a task name that identifies the knowledge base to be built;

a processor, configured to query a task configuration corresponding to the task name received by the communication unit; the task configuration includes: at least two subtasks, each subtask corresponding to: a target website, an extraction template, and a webpage type; each subtask is used to: instruct the crawler to perform, according to the extraction template, structured data extraction on a page corresponding to the webpage type in the target website;

The memory is configured to store the structured data merged by the processor into a knowledge base corresponding to the task name.

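In an implementation manner of the third aspect, the communication unit may be further configured to:

Receive a creation request before receiving the knowledge base construction task; the creation request includes: the task name and a task attribute;

The processor may be further configured to store, after the communication unit receives the creation request, a correspondence between the task name and the task attribute.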

Further, in another implementation manner of the third aspect, the communication unit may be further configured to:

Receive a query request sent by the user; the query request includes: the task name;

The processor may be further configured to query a knowledge base corresponding to the task name, and feed back structured data in the knowledge base to the user.

Further, since domain information is continuously updated, in order to keep the knowledge information in the constructed knowledge base up to date, in another implementation manner of the third aspect, the communication unit is specifically configured to:

Receive the knowledge base construction task periodically;

The memory is specifically configured to delete the structured data existing in the knowledge base, and store the currently merged structured data into the knowledge base.

As can be seen from the above, embodiments of the present invention provide a knowledge base construction method and a controller. A knowledge base construction task is received, and a task configuration corresponding to the task name is queried; the task configuration includes at least two subtasks, each corresponding to one type of webpage. The at least two subtasks are then sent to the crawler, triggering the crawler to execute them and traverse different types of webpages to obtain at least two structured data. The at least two structured data are merged, and the merged structured data is stored in the knowledge base corresponding to the task name. In this way, the knowledge base is constructed by extracting knowledge from multiple types of web pages. Since different types of web pages contain knowledge information of different attributes, the knowledge information extracted from the different web pages can be merged and summarized, which greatly enriches the types of knowledge information and achieves the purpose of enriching and perfecting the domain knowledge base. This avoids the existing problem that extracting only the content of a single type of page (such as the detail page) yields insufficient knowledge information, so that the constructed domain knowledge base is not rich enough.

Brief description of the drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic diagram of a system architecture according to an embodiment of the present invention;

FIG. 2 is a structural diagram of a controller 10 according to an embodiment of the present invention;

FIG. 3 is a structural diagram of a crawler 20 according to an embodiment of the present invention;

FIG. 4 is a flowchart of a method for constructing a knowledge base according to an embodiment of the present invention;

FIG. 5 is a structural diagram of a controller according to an embodiment of the present invention.

Detailed description

The technical solutions in the embodiments of the present invention are clearly and completely described in the following with reference to the drawings in the embodiments of the present invention. It is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative efforts are within the scope of the present invention.

Figure 1 shows a simplified schematic of a system architecture to which the present invention can be applied. Referring to Figure 1, the system architecture may include: a controller 10, a crawler 20, and a web page (WEB) server 30; the controller 10, the crawler 20, and the WEB server 30 establish communication links through a network, and the network may use any connection type, such as a wired link, a wireless communication link, or a fiber optic cable;

The controller 10 is mainly configured to: receive a knowledge base construction task, query the task configuration corresponding to the current task, acquire at least two subtasks according to the task configuration, and schedule the crawler 20 to execute the at least two subtasks, so that different types of web pages of the target website are traversed along multiple paths and at least two structured data are obtained to construct the domain knowledge base;

The crawler 20 is mainly configured to: extract the content of the pages in the target website that correspond to the webpage type of a subtask, and obtain structured data corresponding to the extraction template.

The WEB server 30 hosts a plurality of vertical-domain WEB websites and serves as the entry through which the crawler 20 accesses web resources. After receiving the subtasks, the crawler 20 can access the target website on the WEB server through a Uniform Resource Locator (URL) address.

Specifically, as shown in FIG. 2 and FIG. 3, the controller 10 may include: an interface unit 101, a task scheduling unit 102, a task storage unit 103, and a task management unit 104. The crawler 20 may include: a receiving unit 201, a WEB content download unit 202, and a WEB content extraction unit 203. The units complete the construction of the domain knowledge base through the following process:

After the interface unit 101 receives the knowledge base construction task including the task name, the task scheduling unit 102 acquires the at least two subtasks from the task configuration corresponding to the task name in the task storage unit 103, sends the at least two subtasks to the crawler 20, and schedules the crawler 20 to execute the respective subtasks, traversing different webpages of the target website to obtain at least two structured data; the task configuration stored in the task storage unit 103 is stored there by the task management unit 104 after the interface unit 101 receives the creation request.

After the receiving unit 201 of the crawler 20 receives the scheduling task of the plurality of subtasks issued by the controller 10, the WEB content download unit 202 downloads the WEB pages of the webpage type corresponding to each subtask from the target website. The WEB content extraction unit 203 then extracts the content of the downloaded WEB pages according to the extraction template corresponding to the subtask, obtains structured data, and sends the acquired structured data through the receiving unit 201 to the task scheduling unit 102 of the controller 10. The task scheduling unit 102 merges the structured data corresponding to the plurality of subtasks and stores the merged structured data in the knowledge base in the task storage unit 103. After the interface unit 101 receives a query request sent by the user, the corresponding structured data is read from the knowledge base of the task storage unit 103 and fed back to the user.

For convenience of description, the knowledge base construction method in the present invention is shown and described in detail below in the form of steps. The steps shown may also be executed in a computer system, for example as a set of executable instructions, other than the devices of the system architecture described above. In addition, although a logical order is shown in the figures, in some cases the steps shown or described may be performed in a different order than the one described herein.

FIG. 4 is a flowchart of a method for constructing a knowledge base according to an embodiment of the present invention. The method is applied to the system architecture shown in FIG. 1. As shown in FIG. 4, the method may include:

S101: The controller receives the knowledge base construction task, and the knowledge base construction task includes a task name that identifies the knowledge base to be built.

Optionally, the controller may receive a knowledge base construction task sent by the user through the terminal held by the user, or receive a knowledge base construction task sent by the user through the user interaction interface of the controller.

For example, the user can input "Baidu Music Knowledge Base" in the input box on the controller display screen and click the corresponding button to trigger the Baidu music knowledge base construction task and send the task to the controller, where "Baidu Music Knowledge Base" is the name of the knowledge base to be built.

S102: The controller queries a task configuration corresponding to the task name. The task configuration includes: at least two subtasks, and each subtask corresponds to: a target website, an extraction template, and a web page type.

Each subtask is used to: instruct the crawler to perform structured data extraction on the page corresponding to the webpage type in the target website according to the extraction template. The target website is the website from which structured data is to be extracted; the extraction template includes at least one attribute related to the knowledge in the knowledge base; the webpage type may be a detail page, an index navigation page, or another type of webpage. To enrich the knowledge base as much as possible, in the embodiment of the present invention, the extraction template corresponding to each subtask is different, and the webpage type corresponding to each subtask is also different. In addition, when configuring the task, as many subtasks as possible should be configured, so that knowledge information of more different attributes is extracted from more types of webpages.

Optionally, according to the currently known webpage types, the at least two subtasks may include: a first subtask and a second subtask, wherein the webpage type of the first subtask is a detail page, and the webpage type of the second subtask is an index navigation page. Understandably, with the development of computer technology, if other types of web pages appear in the future, a subtask can be set for each such type of web page, and structured data can be extracted from it to enrich the domain knowledge base.

It should be noted that, in the embodiment of the present invention, the detail page may be: a page on which the details of an object in a certain domain can be queried. The index navigation page may be: a page that provides the user with an index of a set of domain objects and guides the user to browse the detail page of a certain domain object; it is usually the home page of the target website. The structured data may be: the knowledge data extracted according to the extraction template and combined in the form of a list.

For example, if you build a music knowledge base, you can configure two subtasks: subtask 1 and subtask 2. Subtask 1 corresponds to the detail page in Baidu website, and the corresponding extraction template includes: singer, album, scene and other attributes; subtask 2 Corresponding to the index navigation page in Baidu website, and the corresponding extraction template includes: song style, song age and other attributes.
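The task configuration in this example can be pictured as a small data structure. The following is only an illustrative sketch: the class names `PageType`, `SubTask` and `TaskConfig`, the field names, and the attribute spellings such as `song_style` are assumptions made for the illustration and do not appear in the application.

```python
from dataclasses import dataclass
from enum import Enum


class PageType(Enum):
    DETAIL = "detail"
    INDEX_NAVIGATION = "index_navigation"


@dataclass(frozen=True)
class SubTask:
    target_website: str          # website to extract structured data from
    page_type: PageType          # type of page the crawler should visit
    extraction_template: tuple   # names of the attributes to extract


@dataclass(frozen=True)
class TaskConfig:
    task_name: str
    subtasks: tuple

    def __post_init__(self):
        # The method requires at least two subtasks per task configuration.
        if len(self.subtasks) < 2:
            raise ValueError("a task configuration needs at least two subtasks")
        # In the described embodiment each subtask targets a different page type.
        page_types = [s.page_type for s in self.subtasks]
        if len(set(page_types)) != len(page_types):
            raise ValueError("each subtask should target a different page type")


# Hypothetical configuration mirroring subtask 1 and subtask 2 of the example.
music_config = TaskConfig(
    task_name="Baidu Music Knowledge Base",
    subtasks=(
        SubTask("Baidu website", PageType.DETAIL, ("singer", "album", "scene")),
        SubTask("Baidu website", PageType.INDEX_NAVIGATION, ("song_style", "song_age")),
    ),
)
```

Rejecting a configuration with fewer than two subtasks at construction time is one possible design choice; the application itself only states that the configuration includes at least two subtasks.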

In the embodiment of the present invention, in order to enable the controller to conveniently query the task configuration corresponding to the task name, before receiving the knowledge base construction task, the method may further include:

Receiving a create request; the create request includes: a task name and a task attribute;

Storing the correspondence between the task name and the task attribute;

Correspondingly, the controller querying the task configuration corresponding to the task name may specifically include:

The controller queries the correspondence between the task name pre-stored in the controller and the task attribute, and acquires the task configuration corresponding to the task name.
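The create-then-query behaviour described above amounts to a lookup table keyed by task name. A minimal sketch follows; the class name `TaskManager`, its method names, and the dictionary layout of the task attributes are hypothetical and chosen only for illustration.

```python
class TaskManager:
    """Stores the correspondence between task names and task attributes."""

    def __init__(self):
        self._configs = {}

    def create(self, task_name, task_attributes):
        # Handle a creation request: remember the attributes under the task name.
        self._configs[task_name] = task_attributes

    def query(self, task_name):
        # Look up the task configuration for a construction task.
        # Treating an unknown task name as an error is an assumption of this sketch.
        try:
            return self._configs[task_name]
        except KeyError:
            raise LookupError(f"no task configuration for {task_name!r}") from None


manager = TaskManager()
manager.create(
    "Baidu Music Knowledge Base",
    {"subtasks": [{"page_type": "detail"}, {"page_type": "index_navigation"}]},
)
config = manager.query("Baidu Music Knowledge Base")
```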

S103: The controller sends the at least two subtasks to the crawler.

Optionally, the controller may send at least two subtasks to the crawler in turn, or may send at least two subtasks to the crawler at the same time, which is not limited in the embodiment of the present invention.
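The two dispatch options can be compared in a short sketch. The function names and the stand-in `fake_crawler` are assumptions for illustration; the application does not prescribe how the subtasks are transmitted.

```python
from concurrent.futures import ThreadPoolExecutor


def dispatch_in_turn(subtasks, run_subtask):
    """Send the subtasks one after another and collect the results in order."""
    return [run_subtask(s) for s in subtasks]


def dispatch_concurrently(subtasks, run_subtask, max_workers=4):
    """Send all subtasks at once; results keep the order of `subtasks`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_subtask, subtasks))


def fake_crawler(subtask):
    # Stand-in for the real crawler: echoes which page type it was asked for.
    return {"page_type": subtask["page_type"], "rows": []}


subtasks = [{"page_type": "detail"}, {"page_type": "index_navigation"}]
in_turn = dispatch_in_turn(subtasks, fake_crawler)
concurrent = dispatch_concurrently(subtasks, fake_crawler)
```

Both variants return one result per subtask in the same order, so the later merging step does not need to know which option was used.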

S104: The crawler executes the at least two subtasks respectively, acquires at least two structured data, and returns the at least two structured data to the controller.

The crawler executes each subtask in the same way as an existing crawler extracts webpage content: it first downloads the WEB pages of the webpage type corresponding to the subtask from the target website, then extracts data from the downloaded WEB pages according to the attributes contained in the extraction template corresponding to the subtask, and organizes the extracted data in the form of a list to generate structured data.

For example, if subtask 1 corresponds to the detail page in the Baidu website, and the corresponding extraction template includes attributes such as singer, album, and scene, then when executing subtask 1 the crawler can obtain song-related knowledge information, such as singers, albums, and scenes, from the detail pages of the Baidu website.
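Template-driven extraction can be sketched as follows. The application does not specify how a template attribute is located in a page, so this sketch assumes, purely for illustration, that the template maps each attribute name to a regular expression with one capture group; the page markup and the function name `extract_structured_data` are likewise hypothetical.

```python
import re


def extract_structured_data(page_html, template):
    """Return one row (a dict) holding the attributes named in the template."""
    row = {}
    for attribute, pattern in template.items():
        match = re.search(pattern, page_html)
        # Attributes missing from the page are recorded as None rather than dropped.
        row[attribute] = match.group(1).strip() if match else None
    return row


# Hypothetical detail page; real pages would be downloaded from the target website.
detail_page = """
<div class="song" data-id="120125029">
  <h1 class="name">Little Apple</h1>
  <span class="singer">Chopsticks Brothers</span>
</div>
"""

detail_template = {
    "id": r'data-id="(\d+)"',
    "name": r'<h1 class="name">(.*?)</h1>',
    "singer": r'<span class="singer">(.*?)</span>',
    "album": r'<span class="album">(.*?)</span>',
}

# Structured data is organized as a list of rows, one row per domain object.
structured_data = [extract_structured_data(detail_page, detail_template)]
```

The `album` attribute is absent from this sample page, which illustrates the situation described in the background: a detail page alone may not supply every attribute of a domain object.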

S105: The controller merges the at least two structured data returned by the crawler, and stores the merged structured data in a knowledge base corresponding to the task name.

The merging may refer to de-duplicating at least two structured data of the same domain object; for example, when constructing the music knowledge base, at least two structured data of each song of the plurality of songs may be acquired. At this time, at least two structured data of a certain song can be deduplicated and merged together.

Since, for any domain object, there is unique identification information in the target website, in the embodiment of the present invention, at least two structured data with the same identification information are deduplicated and combined.

For example, if the user needs to establish a knowledge base related to "Little Apple", and the task configuration of the construction task includes a detail subtask and an index navigation subtask, the following two structured data can be obtained according to the extraction template of each subtask:

1) Detail subtask: structured data extraction is performed on the detail page of the target website for the domain knowledge base, and the subtask outputs the structured data shown in Table 1, which contains detailed attribute information related to "Little Apple":

Table 1

Unique identifier	Name	Singer	Album
120125029	小苹果 (Little Apple)	筷子兄弟 (Chopsticks Brothers)	《老男孩之猛龙过江》 film soundtrack

2) Navigation subtask: structured data extraction is performed on the index navigation page of the target website for the domain knowledge base, and the subtask outputs the structured data shown in Table 2, which contains classification information related to "Little Apple":

Table 2

Scene	Unique identifier
Square dance	120125029

Then, the structured data of Table 1 and Table 2 are merged. When merging, the structured data obtained by the navigation subtask is merged into the corresponding detail subtask result, yielding the knowledge base related to "Little Apple" shown in Table 3. In this way, richer knowledge information related to "Little Apple" is built.

Table 3
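The merge of Table 1 and Table 2 on the shared unique identifier can be sketched as below. The function name `merge_structured_data`, the key name `id`, and the rule of keeping the first non-empty value for each attribute are assumptions of this sketch; the application only states that structured data with the same identification information is de-duplicated and merged.

```python
def merge_structured_data(*tables, key="id"):
    """Merge rows from several tables that share the same unique identifier."""
    merged = {}
    for table in tables:
        for row in table:
            target = merged.setdefault(row[key], {})
            for attribute, value in row.items():
                # Keep the first non-empty value seen for each attribute, so that
                # repeated attributes are de-duplicated instead of overwritten.
                if value is not None and attribute not in target:
                    target[attribute] = value
    return list(merged.values())


table_1 = [{
    "id": "120125029",
    "name": "Little Apple",
    "singer": "Chopsticks Brothers",
    "album": "《老男孩之猛龙过江》 film soundtrack",
}]
table_2 = [{"scene": "Square dance", "id": "120125029"}]

merged_rows = merge_structured_data(table_1, table_2)
```

With these inputs, `merged_rows` holds a single row that carries the name, singer and album from the detail subtask together with the scene from the navigation subtask.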

Further, in order to facilitate the user to query the knowledge information, the method may further include:

Receiving a query request sent by the user, where the query request includes: a task name;

Query the knowledge base corresponding to the task name and feed back the structured data in the knowledge base to the user.

Further, since domain knowledge information is continuously updated, in order to keep the knowledge information in the constructed knowledge base up to date, the receiving the knowledge base construction task may specifically include:

Receiving the knowledge base construction task periodically;

The storing the merged structured data in the knowledge base corresponding to the task name may specifically include:

Delete the existing structured data in the knowledge base and store the currently merged structured data in the knowledge base.

At the same time, it can be understood that the task configuration stored in the controller can be updated periodically, by adding new subtasks or by adding new attributes to the extraction templates of existing subtasks, to obtain the richest and most up-to-date knowledge information.

It should be noted that, in the embodiment of the present invention, periodically receiving the knowledge base construction task may be: receiving the knowledge base construction task at a preset time interval, wherein the preset time may be set as required, which is not limited in the embodiment of the present invention.
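Periodic rebuilding with full replacement, together with the query path, can be sketched as follows. The names `KnowledgeBaseStore` and `rebuild_periodically`, and the injectable `wait` callback used in place of a real timer, are hypothetical and serve only to illustrate the behaviour.

```python
class KnowledgeBaseStore:
    """Maps each task name to the structured data of its knowledge base."""

    def __init__(self):
        self._bases = {}

    def replace(self, task_name, merged_rows):
        # Delete whatever was stored before and keep only the fresh merge result.
        self._bases[task_name] = list(merged_rows)

    def query(self, task_name):
        # Return a copy of the list so callers cannot alter the stored list.
        return list(self._bases.get(task_name, []))


def rebuild_periodically(store, task_name, build_once, rounds, wait=lambda: None):
    """Run `build_once` a fixed number of times, replacing old data each time."""
    for _ in range(rounds):
        store.replace(task_name, build_once())
        wait()  # a deployment might sleep for the preset interval here


# Two successive build results standing in for two scheduled construction tasks.
snapshots = iter([
    [{"id": "1", "scene": "old"}],
    [{"id": "1", "scene": "new"}],
])
store = KnowledgeBaseStore()
rebuild_periodically(store, "music", lambda: next(snapshots), rounds=2)
latest = store.query("music")
```

After the second round only the newer snapshot remains, which matches the delete-then-store behaviour described for periodic construction tasks.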

As can be seen from the above, an embodiment of the present invention provides a knowledge base construction method. A knowledge base construction task is received, and a task configuration corresponding to the task name is queried; the task configuration includes at least two subtasks, each corresponding to one type of webpage. The at least two subtasks are then sent to the crawler, triggering the crawler to execute them and traverse different types of webpages to obtain at least two structured data. The at least two structured data are merged, and the merged structured data is stored in the knowledge base corresponding to the task name. In this way, the knowledge base is constructed by extracting knowledge from multiple types of web pages. Since different types of web pages contain knowledge information of different attributes, the knowledge information extracted from the different web pages can be merged and summarized, which greatly enriches the types of knowledge information and achieves the purpose of enriching and perfecting the domain knowledge base. This avoids the existing problem that extracting only the content of a single type of page (such as the detail page) yields insufficient knowledge information, so that the constructed domain knowledge base is not rich enough.
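Steps S101 to S105 can be tied together in one compact sketch. The `Controller` class below, its method names, the `id` key, and the `fake_crawler` stand-in are all assumptions made for illustration; in particular, this sketch lets later subtask results overwrite earlier values for the same attribute, which is only one possible merging rule.

```python
class Controller:
    def __init__(self, crawler):
        self._crawler = crawler        # callable: subtask -> list of rows
        self._task_configs = {}        # task name -> list of subtasks
        self._knowledge_bases = {}     # task name -> merged rows

    def create_task(self, task_name, subtasks):
        if len(subtasks) < 2:
            raise ValueError("at least two subtasks are required")
        self._task_configs[task_name] = list(subtasks)

    def build(self, task_name):
        subtasks = self._task_configs[task_name]            # S102: query configuration
        results = [self._crawler(s) for s in subtasks]      # S103, S104: dispatch, collect
        merged = {}
        for rows in results:                                # S105: merge by identifier
            for row in rows:
                merged.setdefault(row["id"], {}).update(row)
        self._knowledge_bases[task_name] = list(merged.values())
        return self._knowledge_bases[task_name]

    def query(self, task_name):
        return list(self._knowledge_bases.get(task_name, []))


def fake_crawler(subtask):
    # Stand-in crawler returning canned rows for each page type.
    pages = {
        "detail": [{"id": "120125029", "name": "Little Apple"}],
        "index_navigation": [{"id": "120125029", "scene": "Square dance"}],
    }
    return pages[subtask["page_type"]]


controller = Controller(fake_crawler)
controller.create_task("music", [{"page_type": "detail"}, {"page_type": "index_navigation"}])
built = controller.build("music")
```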

It should be noted that the above process may be performed by the units in the controller shown in FIG. 2, and details are not described herein again. In addition, the interface unit in the controller shown in FIG. 2 may be a communication unit of the controller; the task scheduling unit and the task management unit may be separately provided processors, or may be integrated into one processor of the controller. They may also be stored in the memory of the controller in the form of program code, with a processor of the controller invoking the code and executing the above knowledge base construction functions. The task storage unit may be a memory in the controller. The processor described here can be a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention. Specifically, the present invention also provides a controller, preferably for implementing the above method.

FIG. 5 is a structural diagram of a controller 10 according to an embodiment of the present invention, for performing the foregoing method. As shown in FIG. 5, the controller 10 may include: a communication interface 1001, a processor 1002, a memory 1003, and at least one communication bus 1004 for implementing connections and mutual communication between these devices;

The communication interface 1001 can be used for data communication with an external network element.

The processor 1002 may be a central processing unit (CPU), an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention, for example: one or more digital signal processors (DSPs), or one or more Field Programmable Gate Arrays (FPGAs).

The memory 1003 may be a volatile memory, such as a random-access memory (RAM); or a non-volatile memory, such as a read-only memory (ROM), flash memory, hard disk drive (HDD) or solid-state drive (SSD); or a combination of the above types of memory. It is used to store the application programs, task configurations, and knowledge bases related to implementing the knowledge base construction of the present invention.

The communication bus 1004 can be divided into an address bus, a data bus, a control bus, etc., and can be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, or an Extended Industry Standard Architecture (EISA) bus, etc. For ease of representation, only one thick line is shown in Figure 5, but it does not mean that there is only one bus or one type of bus.

The communication unit 1001 is configured to receive a knowledge base construction task, where the knowledge base construction task includes a task name that identifies the knowledge base to be built.

The processor 1002 is configured to query a task configuration corresponding to the task name received by the communication unit 1001. The task configuration includes: at least two subtasks, each of which is configured with: a target website, an extraction template, and a webpage type;

and to send the at least two subtasks to the crawler, acquire the at least two structured data returned by the crawler after it executes the at least two subtasks, and merge the at least two structured data;

The memory 1003 is configured to store the structured data merged by the processor 1002 into a knowledge base corresponding to the task name.

Each subtask is used to instruct the crawler to perform structured data extraction, according to the extraction template, on pages of the given webpage type in the target website. The target website is the website from which structured data is to be extracted; the extraction template includes at least one attribute related to the knowledge in the knowledge base to be built; and the webpage type may be a detail page, an index navigation page, or another type of web page. To make the knowledge base as complete as possible, in the embodiment of the present invention the extraction templates of the subtasks differ from one another, and so do their webpage types. Likewise, when performing task configuration, as many subtasks as possible should be configured, so that knowledge information with many different attributes is extracted from a wider variety of web pages.
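The task configuration described above can be pictured as a small data model. The following is only an illustrative sketch: the names PageType, SubTask, TaskConfig, and music_config, the attribute names, and the example URL are assumptions made for this example and do not appear in the embodiment.

```python
from dataclasses import dataclass
from enum import Enum


class PageType(Enum):
    """Webpage types named in the description (hypothetical identifiers)."""
    DETAIL = "detail"
    INDEX_NAVIGATION = "index_navigation"


@dataclass(frozen=True)
class SubTask:
    target_website: str          # website from which structured data is extracted
    extraction_template: tuple   # attribute names the crawler should extract
    page_type: PageType          # type of page the template applies to


@dataclass(frozen=True)
class TaskConfig:
    task_name: str
    subtasks: tuple

    def __post_init__(self):
        # The description requires at least two subtasks per task configuration.
        if len(self.subtasks) < 2:
            raise ValueError("a task configuration needs at least two subtasks")


# Hypothetical configuration for a music knowledge base.
music_config = TaskConfig(
    task_name="music",
    subtasks=(
        SubTask("https://music.example.com", ("title", "artist", "album"), PageType.DETAIL),
        SubTask("https://music.example.com", ("title", "chart_rank"), PageType.INDEX_NAVIGATION),
    ),
)
```

In this sketch the two subtasks use different extraction templates and different webpage types, matching the stated preference that subtasks differ in both respects.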

Optionally, according to the currently known webpage types, the at least two subtasks may include a first subtask and a second subtask, where the webpage type of the first subtask is a detail page and the webpage type of the second subtask is an index navigation page. Understandably, if other types of webpages appear in the future as computer technology develops, a subtask may be configured for each such type, and structured data may be extracted from those webpages to enrich the domain knowledge base.

Further, the communication unit 1001 can be specifically configured to:

Receiving a knowledge base construction task sent by the user through the terminal held by the user, or receiving a knowledge base construction task sent by the user through the user interaction interface of the controller.

Further, in the embodiment of the present invention, in order to enable the controller to conveniently query the task configuration corresponding to the task name, the communication unit 1001 may further be configured to:

Receive a creation request before receiving the knowledge base construction task, the creation request including a task name and a task attribute; and store a correspondence between the task name and the task attribute.
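As a rough illustration of storing and later querying the correspondence between a task name and its task attribute, a minimal in-memory sketch might look as follows. The class TaskRegistry and its method names are hypothetical, and the task attribute is treated as an opaque value because the embodiment does not fix its format.

```python
class TaskRegistry:
    """Hypothetical in-memory store mapping task names to task attributes."""

    def __init__(self):
        self._attributes = {}

    def create(self, task_name, task_attribute):
        # Handle a creation request: remember the name-to-attribute correspondence.
        self._attributes[task_name] = task_attribute

    def lookup(self, task_name):
        # Query the stored attribute when a construction task names this task.
        try:
            return self._attributes[task_name]
        except KeyError:
            raise LookupError(f"no task configured under name {task_name!r}") from None
```

A creation request for "music" would call create("music", ...) once; each later construction task carrying the name "music" would then call lookup("music").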

Further, the processor 1002 is specifically configured to:

The at least two subtasks are sent to the crawler in turn, or the at least two subtasks are sent to the crawler at the same time, which is not limited in the embodiment of the present invention.
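The two dispatch options can be sketched as follows. This is an assumption-laden illustration: the crawler is modeled as a plain callable, fake_crawl is a stand-in that returns no real data, and a thread pool is only one possible way to send subtasks at the same time.

```python
from concurrent.futures import ThreadPoolExecutor


def dispatch_in_turn(subtasks, crawl):
    # Send the subtasks one after another and collect each result.
    return [crawl(subtask) for subtask in subtasks]


def dispatch_simultaneously(subtasks, crawl):
    # Send all subtasks at once; map() returns results in input order.
    with ThreadPoolExecutor(max_workers=len(subtasks)) as pool:
        return list(pool.map(crawl, subtasks))


def fake_crawl(subtask):
    # Stand-in for the crawler (hypothetical): echoes the subtask it was given.
    return {"page_type": subtask, "records": []}
```

Both functions return one result per subtask in the same order, so the merging step does not need to know which dispatch option was used.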

Further, in combining the at least two structured data, the processor 1002 may specifically be used to:

Deduplicate and merge the at least two structured data of the same domain object. For example, when a music knowledge base is constructed, at least two structured data may be acquired for each of a plurality of songs; in that case, the at least two structured data of the same song are deduplicated and merged together.
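One possible reading of deduplicating and merging can be sketched as below. The embodiment does not specify how records of the same domain object are matched or how conflicting values are resolved, so this sketch assumes that records share a key attribute (here "title") and that the first value seen for an attribute is kept. The function name and the sample records are hypothetical.

```python
def merge_structured_data(batches, key="title"):
    """Merge records describing the same object, dropping repeated attributes."""
    merged = {}
    for batch in batches:
        for record in batch:
            target = merged.setdefault(record[key], {})
            for attribute, value in record.items():
                # Keep the first value seen for an attribute; later duplicates are dropped.
                target.setdefault(attribute, value)
    return list(merged.values())


# Hypothetical data returned for a detail-page subtask and an index-page subtask.
detail_page_data = [{"title": "Song A", "artist": "Artist X", "album": "Album 1"}]
index_page_data = [
    {"title": "Song A", "artist": "Artist X", "chart_rank": 3},
    {"title": "Song B", "chart_rank": 7},
]
knowledge = merge_structured_data([detail_page_data, index_page_data])
```

Here the two records for "Song A" collapse into one entry that carries the album from the detail page and the chart rank from the index page, while the repeated artist value is stored only once.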

Further, to make it easy for a user to query knowledge information, the communication unit 1001 may also be configured to:

Receive a query request sent by the user, the query request including the task name.

The processor 1002 is further configured to: after the communication unit 1001 receives the query request, query a knowledge base corresponding to the task name, and feed back the structured data in the knowledge base to the user.

Further, the domain information is continuously updated. In order to make the knowledge information in the built knowledge base the current latest knowledge information, the communication unit 1001 may be specifically configured to:

Receive knowledge base build tasks on a regular basis;

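The processor 1002 is specifically configured to:

Delete the structured data existing in the knowledge base, and store the currently merged structured data in the knowledge base.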

It can be understood that the task configuration stored in the controller can be updated periodically: new subtasks may be added, or new attributes may be added to the extraction templates of existing subtasks, so as to obtain the richest and most up-to-date knowledge information.

It should be noted that periodically receiving the knowledge base construction task in the embodiment of the present invention may mean receiving the knowledge base construction task at intervals of a preset time, where the preset time may be set as required and is not limited in the embodiment of the present invention.
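Periodic construction can be sketched as a loop that rebuilds the knowledge base at a preset interval. In this sketch each rebuild replaces the previously stored structured data with the newly merged data, as the claims describe for the periodic case. The function names, the dictionary used as storage, and the build callable are assumptions made for illustration.

```python
import time


def rebuild(knowledge_bases, task_name, build):
    # Discard the existing structured data and store the newly merged data.
    knowledge_bases[task_name] = build(task_name)


def run_periodically(knowledge_bases, task_name, build, interval_seconds, rounds):
    # Rebuild a fixed number of times, waiting the preset interval between rounds.
    for round_index in range(rounds):
        rebuild(knowledge_bases, task_name, build)
        if round_index < rounds - 1:
            time.sleep(interval_seconds)
```

The build callable stands for the whole query, dispatch, and merge sequence; a real deployment would more likely use a scheduler than a blocking loop.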

As can be seen from the above, an embodiment of the present invention provides a controller that receives a knowledge base construction task, queries the task configuration corresponding to the task name, which includes at least two subtasks each corresponding to one type of webpage, and then sends the at least two subtasks to the crawler, triggering the crawler to execute them and traverse different types of webpages to obtain at least two structured data. The controller merges the at least two structured data and stores the merged structured data in the knowledge base corresponding to the task name. In this way, the knowledge base is constructed by extracting knowledge from multiple types of web pages. Because different types of web pages contain knowledge information with different attributes, merging the knowledge information extracted from them greatly enriches the types of knowledge information and makes the domain knowledge base richer and more complete. This avoids the problem of existing approaches, which extract the content of only a single type of page (such as the detail page), acquire insufficient knowledge information, and therefore build a domain knowledge base that is not rich enough.

It should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and are not intended to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and that such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

A knowledge base construction method, applied to a controller, characterized by comprising:

Receiving a knowledge base construction task; the knowledge base construction task includes a task name identifying the knowledge base to be built;

Querying a task configuration corresponding to the task name; the task configuration includes: at least two subtasks, each subtask corresponding to: a target website, an extraction template, and a webpage type; each subtask is used to: instruct the crawler to perform structured data extraction, according to the extraction template, on a page corresponding to the webpage type in the target website;

Sending the at least two subtasks to the crawler, triggering the crawler to execute the at least two subtasks to obtain at least two structured data;

Receiving the at least two structured data returned by the crawler;

The at least two structured data are merged, and the merged structured data is stored in a knowledge base corresponding to the task name.
The method according to claim 1, wherein the at least two subtasks comprise: a first subtask and a second subtask; wherein the webpage type of the first subtask is a detail page, and the webpage type of the second subtask is an index navigation page.
The method according to claim 1 or 2, wherein before receiving the knowledge base construction task, the method further comprises:

Receiving a creation request; the creation request includes: the task name and a task attribute;

And storing a correspondence between the task name and the task attribute.
The method according to any one of claims 1-3, wherein the sending the at least two subtasks to the crawler comprises:

Sending the at least two subtasks to the crawler in turn;

Or sending the at least two subtasks to the crawler simultaneously.
The method of claim 1 further comprising:

Receiving a query request sent by the user; the query request includes: the task name;

Querying a knowledge base corresponding to the task name, and feeding back structured data in the knowledge base to the user.
The method according to claim 1, wherein the receiving the knowledge base construction task comprises:

Receive knowledge base build tasks on a regular basis;

The storing the merged structured data into the knowledge base corresponding to the task name specifically includes:

The structured data existing in the knowledge base is deleted, and the currently merged structured data is stored in the knowledge base.
A controller, comprising:

An interface unit, configured to receive a knowledge base construction task; the knowledge base construction task includes a task name that identifies the knowledge base to be built;

a task scheduling unit, configured to query a task configuration corresponding to the task name received by the interface unit; the task configuration includes: at least two subtasks, each subtask corresponding to: a target website, an extraction template, and a webpage type; Each sub-task is used to: instruct the crawler to perform structured data extraction on the page corresponding to the webpage type in the target website according to the extraction template;

And sending the at least two subtasks to the crawler, triggering the crawler to execute the at least two subtasks to obtain at least two structured data;

Receiving the at least two structured data returned by the crawler, and merging the at least two structured data;

The task storage unit is configured to store the structured data merged by the task scheduling unit into a knowledge base corresponding to the task name.
The controller according to claim 7, wherein the at least two subtasks comprise: a first subtask and a second subtask; wherein the webpage type of the first subtask is a detail page, and the webpage type of the second subtask is an index navigation page.
The controller according to claim 7 or 8, wherein the interface unit is further configured to:

Receiving a creation request before the interface unit receives the knowledge base construction task; the creation request includes: the task name and a task attribute;

The controller further includes:

The task management unit stores, in the task storage unit, a correspondence between the task name received by the interface unit and the task attribute.
The controller according to any one of claims 7-9, wherein the task scheduling unit is specifically configured to:

Send the at least two subtasks to the crawler in turn;

Or send the at least two subtasks to the crawler simultaneously.
The controller according to claim 7, wherein the interface unit is further configured to:

Receiving a query request sent by the user; the query request includes: the task name;

The task scheduling unit is further configured to: after the interface unit receives the query request sent by the user, query a knowledge base corresponding to the task name, and feed back the structured data in the knowledge base to the user.
The controller according to claim 7, wherein the interface unit is specifically configured to:

Receive knowledge base build tasks on a regular basis;

The task storage unit is specifically configured to:

The structured data existing in the knowledge base is deleted, and the currently merged structured data is stored in the knowledge base.
A controller, comprising:

a communication unit, configured to receive a knowledge base construction task; the knowledge base construction task includes a task name that identifies the knowledge base to be built;

a processor, configured to query a task configuration corresponding to the task name received by the communication unit; the task configuration includes: at least two subtasks, each subtask corresponding to: a target website, an extraction template, and a webpage type; each subtask is used to: instruct the crawler to perform structured data extraction, according to the extraction template, on a page corresponding to the webpage type in the target website;

And sending the at least two subtasks to the crawler, triggering the crawler to execute the at least two subtasks to obtain at least two structured data;

Receiving the at least two structured data returned by the crawler, and merging the at least two structured data;

The memory is configured to store the structured data merged by the processor into a knowledge base corresponding to the task name.