WO2017101591A1 - Method for constructing knowledge base, and controller - Google Patents

Method for constructing knowledge base, and controller

Info

Publication number
WO2017101591A1
Authority
WO
WIPO (PCT)
Prior art keywords
task
knowledge base
structured data
subtasks
subtask
Prior art date
Application number
PCT/CN2016/103419
Other languages
French (fr)
Chinese (zh)
Inventor
卢剑锋
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2017101591A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Definitions

  • the present invention relates to the field of Internet technologies, and in particular, to a knowledge base construction method and a controller.
  • the inventors of the present invention have found that, in a domain knowledge base constructed by extracting the data information of WEB detail pages, the completeness of the domain object knowledge attributes that are filled in is often limited by the richness of the WEB detail page information.
  • the main object of the present invention is to provide a knowledge base construction method and a controller, to solve the existing problem that the constructed domain knowledge is incomplete because the information on WEB detail pages is not rich enough.
  • an embodiment of the present invention provides a method for constructing a knowledge base, which is applied to a controller, and the method may include:
  • the knowledge base construction task includes a task name identifying the knowledge base to be built;
  • the task configuration includes: at least two subtasks, each subtask corresponding to: a target website, an extraction template, and a webpage type; each subtask is used to instruct the crawler to perform structured data extraction, according to the extraction template, on the page corresponding to the webpage type in the target website;
  • the at least two structured data are merged, and the merged structured data is stored in a knowledge base corresponding to the task name.
  • the at least two subtasks may include: a first subtask and a second subtask, wherein the webpage type of the first subtask is a detail page, and the webpage type of the second subtask is an index navigation page.
  • the method may further include:
  • the creation request includes: the task name and a task attribute; and storing a correspondence between the task name and the task attribute.
  • the method may further include:
  • the query request includes: the task name;
  • the receiving the knowledge base construction task may include:
  • the storing the merged structured data in the knowledge base corresponding to the task name may include:
  • the structured data existing in the knowledge base is deleted, and the currently merged structured data is stored in the knowledge base.
  • the knowledge base is constructed by extracting knowledge from multiple types of web pages. Since different types of web pages contain knowledge information with different attributes, the knowledge information extracted from the different web pages can be merged and aggregated, which greatly enriches the types of knowledge information and thus enriches and completes the domain knowledge base. This avoids the existing problem that extracting the content of only a single type of page (such as the detail page) yields insufficient knowledge information, leaving the constructed domain knowledge base not rich enough.
  • an embodiment of the present invention provides a controller, which may include:
  • An interface unit configured to receive a knowledge base construction task; the knowledge base construction task includes a task name that identifies the knowledge base to be built;
  • a task scheduling unit configured to query a task configuration corresponding to the task name received by the interface unit;
  • the task configuration includes: at least two subtasks, each subtask corresponding to: a target website, an extraction template, and a webpage type;
  • Each sub-task is used to: instruct the crawler to perform structured data extraction on the page corresponding to the webpage type in the target website according to the extraction template;
  • the task storage unit is configured to store the structured data merged by the task scheduling unit into a knowledge base corresponding to the task name.
  • the at least two subtasks may include: a first subtask and a second subtask, wherein the webpage type of the first subtask is a detail page, and the webpage type of the second subtask is an index navigation page.
  • the interface unit may be further configured to:
  • the creation request includes: the task name and a task attribute;
  • the controller may further include: a task management unit;
  • the task management unit is configured to store, after the interface unit receives the creation request, a correspondence between the task name and the task attribute.
  • the interface unit may be further configured to:
  • the query request includes: the task name;
  • the task scheduling unit may be further configured to query a knowledge base corresponding to the task name, and feed back structured data in the knowledge base to the user.
  • the domain information is continuously updated; to keep the knowledge information in the constructed knowledge base up to date, in another implementation manner of the second aspect, the interface unit is specifically configured to:
  • the task storage unit is specifically configured to delete the structured data existing in the knowledge base, and store the currently merged structured data into the knowledge base.
  • an embodiment of the present invention provides a controller, which may include:
  • a communication unit configured to receive a knowledge base construction task;
  • the knowledge base construction task includes a task name that identifies the knowledge base to be built;
  • a processor configured to query a task configuration corresponding to the task name received by the communication unit;
  • the task configuration includes: at least two subtasks, each subtask corresponding to: a target website, an extraction template, and a webpage type;
  • Each subtask is used to: instruct the crawler to perform structured data extraction, according to the extraction template, on the page corresponding to the webpage type in the target website;
  • the memory is configured to store the structured data merged by the processor into a knowledge base corresponding to the task name.
  • the at least two subtasks may include: a first subtask and a second subtask, wherein the webpage type of the first subtask is a detail page, and the webpage type of the second subtask is an index navigation page.
  • the communication unit may be further configured to:
  • the creation request includes: the task name and a task attribute;
  • the processor may be further configured to store, after the communication unit receives the creation request, a correspondence between the task name and the task attribute.
  • the communication unit may be further configured to:
  • the query request includes: the task name;
  • the processor may be further configured to query a knowledge base corresponding to the task name, and feed back structured data in the knowledge base to the user.
  • the domain information is continuously updated; to keep the knowledge information in the constructed knowledge base up to date, in another implementation manner of the third aspect, the communication unit is specifically configured to:
  • the memory is specifically configured to delete the structured data existing in the knowledge base, and store the currently merged structured data into the knowledge base.
  • an embodiment of the present invention provides a knowledge base construction method and a controller, which receive a knowledge base construction task and query the task configuration corresponding to the task name, the task configuration including at least two subtasks, each subtask corresponding to one type of webpage; then send the at least two subtasks to the crawler, triggering the crawler to execute the at least two subtasks and traverse different kinds of webpages to obtain at least two pieces of structured data; and merge the at least two pieces of structured data and store the merged structured data in a knowledge base corresponding to the task name.
  • the knowledge base is constructed by extracting knowledge of various types of web pages.
  • the knowledge information extracted from different web pages can be merged and aggregated, which greatly enriches the types of knowledge information and thus enriches and completes the domain knowledge base. This avoids the existing problem that extracting the content of only a single type of page (such as the detail page) yields insufficient knowledge information, leaving the constructed domain knowledge base not rich enough.
  • FIG. 1 is a schematic diagram of a system architecture according to an embodiment of the present invention
  • FIG. 2 is a structural diagram of a controller 10 according to an embodiment of the present invention.
  • FIG. 3 is a structural diagram of a crawler 20 according to an embodiment of the present invention.
  • FIG. 4 is a flowchart of a method for constructing a knowledge base according to an embodiment of the present invention.
  • FIG. 5 is a structural diagram of a controller according to an embodiment of the present invention.
  • FIG. 1 shows a simplified schematic of a system architecture to which the present invention can be applied. Referring to FIG. 1,
  • the system architecture may include: a controller 10, a crawler 20, and a web page (WEB) server 30; wherein the controller 10, the crawler 20, and the WEB server 30 establish communication links through a network, and the network may use any type of connection, such as a wired link, a wireless communication link, or a fiber optic cable;
  • WEB web page
  • the controller 10 is mainly configured to: receive a knowledge base construction task, query the task configuration corresponding to the current task, acquire at least two subtasks according to the task configuration, and schedule the crawler 20 to execute the at least two subtasks, so as to traverse different types of web pages of the target website along multiple paths and obtain at least two pieces of structured data with which to construct the domain knowledge base;
  • the crawler 20 is mainly configured to: extract the page content corresponding to the webpage type of the subtask in the target website, and obtain structured data corresponding to the extraction template.
  • the WEB server 30 includes a plurality of vertical-domain WEB websites, and serves as the entry point through which the crawler 20 accesses web resources.
  • after receiving the subtasks, the crawler 20 can access the target website in the WEB server through a Uniform Resource Locator (URL) address.
  • URL Uniform Resource Locator
  • the controller 10 may include: an interface unit 101, a task scheduling unit 102, a task storage unit 103, and a task management unit 104.
  • the crawler 20 may include: a receiving unit 201, a WEB content download unit 202, and a WEB content extraction unit 203; the units complete the construction of the domain knowledge base through the following process:
  • the task scheduling unit 102 acquires at least two subtasks included in the task configuration from the task configuration corresponding to the task name in the task storage unit 103.
  • the at least two subtasks are sent to the crawler 20, and the crawler 20 is scheduled to execute the respective subtasks, traversing different webpages of the target website to obtain at least two pieces of structured data; the task configuration stored in the task storage unit 103 is stored there by the task management unit 104 after the interface unit 101 receives the creation request.
  • the WEB content downloading unit 202 downloads the WEB pages of the webpage type corresponding to the subtask from the target website; the WEB content extraction unit 203 then extracts the content of the downloaded WEB pages according to the extraction template corresponding to the subtask, obtains structured data, and sends the acquired structured data through the receiving unit 201 to the task scheduling unit 102 of the controller 10; the task scheduling unit 102 merges the structured data corresponding to the plurality of subtasks, and stores the merged structured data in the task storage unit 103.
  • when the interface unit 101 receives a query request sent by the user, the corresponding structured data is read from the knowledge base of the task storage unit 103 and fed back to the user.
  • the knowledge base is constructed by extracting knowledge from multiple types of web pages. Since different types of web pages contain knowledge information with different attributes, the knowledge information extracted from the different web pages can be merged and aggregated, which greatly enriches the types of knowledge information and thus enriches and completes the domain knowledge base. This avoids the existing problem that extracting the content of only a single type of page (such as the detail page) yields insufficient knowledge information, leaving the constructed domain knowledge base not rich enough.
  • the knowledge base construction method in the present invention is shown and described in detail below in the form of steps. The steps shown may also be performed in a computer system that executes a set of instructions, other than the devices in the system architecture shown in the figure. In addition, although a logical order is shown in the figures, in some cases the steps shown or described may be performed in a different order than the one described herein.
  • FIG. 4 is a flowchart of a method for constructing a knowledge base according to an embodiment of the present invention. The method is applied to the system architecture shown in FIG. 1. As shown in FIG. 4, the method may include:
  • the controller receives the knowledge base construction task, and the knowledge base construction task includes a task name that identifies the knowledge base to be built.
  • the controller may receive a knowledge base construction task sent by the user through the terminal held by the user, or receive a knowledge base construction task sent by the user through the user interaction interface of the controller.
  • the user can enter "Baidu Music Knowledge Base" in the input box on the controller display screen and click the corresponding button to trigger a Baidu music knowledge base construction task and send the task to the controller, where "Baidu Music Knowledge Base" identifies the knowledge base to be built.
  • the controller queries a task configuration corresponding to the task name.
  • the task configuration includes: at least two subtasks, and each subtask corresponds to: a target website, an extraction template, and a web page type.
  • Each subtask is used to: instruct the crawler to perform structured data extraction, according to the extraction template, on the page corresponding to the webpage type in the target website;
  • the target website is the website on which structured data extraction is to be performed;
  • the extraction template includes: at least one attribute related to the knowledge in the knowledge base;
  • the webpage type may be a detail page, an index navigation page, or another type of webpage; to enrich the knowledge base as much as possible, in the embodiment of the present invention, the extraction template corresponding to each subtask is different, and the webpage type corresponding to each subtask is also different.
  • when configuring the task, as many subtasks as possible should be configured, so as to extract knowledge information with many different attributes from more kinds of webpages.
  • the at least two subtasks may include: a first subtask and a second subtask, wherein the webpage type of the first subtask is a detail page, and the webpage type of the second subtask is an index navigation page;
  • the detail page may be: a page capable of querying the details of an object in a certain domain
  • the index navigation page may be: a page that provides the user with an index of a set of domain objects and guides the user to browse the detail page of a given domain object; it is usually the home page of the target website
  • the structured data may be: the knowledge data extracted according to the extraction template, combined in the form of a list; the combined data is called structured data.
  • subtask 1 corresponds to the detail page of the Baidu website, and its extraction template includes attributes such as singer, album, and scene; subtask 2 corresponds to the index navigation page of the Baidu website, and its extraction template includes attributes such as song style and song age.
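The task configuration described above can be sketched as a small data structure. This is only an illustration under stated assumptions: the class names `Subtask`, `TaskConfig`, and `PageType`, the `validate` helper, and the placeholder URL are hypothetical and do not come from the patent text; only the attribute names (singer, album, scene, song style, song age) and the rule of at least two subtasks with different webpage types are taken from the description.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Tuple


class PageType(Enum):
    DETAIL = "detail"
    INDEX_NAVIGATION = "index_navigation"


@dataclass(frozen=True)
class Subtask:
    target_website: str          # site to crawl (placeholder value below)
    page_type: PageType          # kind of page the crawler should visit
    template: Tuple[str, ...]    # attributes to extract from that page type


@dataclass
class TaskConfig:
    task_name: str
    subtasks: List[Subtask] = field(default_factory=list)

    def validate(self) -> None:
        # The description requires at least two subtasks per task configuration.
        if len(self.subtasks) < 2:
            raise ValueError("a task configuration needs at least two subtasks")
        # In the described embodiment each subtask targets a different page type.
        page_types = [s.page_type for s in self.subtasks]
        if len(set(page_types)) != len(page_types):
            raise ValueError("each subtask should use a different webpage type")


# Hypothetical configuration mirroring the music example in the text.
music_config = TaskConfig(
    task_name="Baidu Music Knowledge Base",
    subtasks=[
        Subtask("https://music.example", PageType.DETAIL,
                ("singer", "album", "scene")),
        Subtask("https://music.example", PageType.INDEX_NAVIGATION,
                ("song_style", "song_age")),
    ],
)
music_config.validate()
```

Keeping the template as a plain tuple of attribute names makes it easy to add attributes to an existing subtask later, which is how the description suggests task configurations evolve.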
  • the method may further include:
  • the create request includes: a task name and a task attribute;
  • controller querying the task configuration corresponding to the task name may specifically include:
  • the controller queries the correspondence between the task name pre-stored in the controller and the task attribute, and acquires the task configuration corresponding to the task name.
  • S103: The controller sends the at least two subtasks to the crawler.
  • the controller may send at least two subtasks to the crawler in turn, or may send at least two subtasks to the crawler at the same time, which is not limited in the embodiment of the present invention.
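The two dispatch options mentioned above (sending subtasks in turn or at the same time) could look roughly like the sketch below. It assumes the crawler is reachable as a plain Python callable that takes a subtask and returns a list of records; that interface and the function names are hypothetical, and a thread pool is just one possible way to send subtasks simultaneously.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Record = Dict[str, str]
Crawler = Callable[[dict], List[Record]]


def dispatch_in_turn(crawler: Crawler, subtasks: List[dict]) -> List[List[Record]]:
    # Send one subtask at a time and wait for each result before the next.
    return [crawler(subtask) for subtask in subtasks]


def dispatch_concurrently(crawler: Crawler, subtasks: List[dict]) -> List[List[Record]]:
    # Send all subtasks at once; Executor.map returns results in input order.
    with ThreadPoolExecutor(max_workers=len(subtasks)) as pool:
        return list(pool.map(crawler, subtasks))


def echo_crawler(subtask: dict) -> List[Record]:
    # Stand-in crawler that only reports which page type it was asked to visit.
    return [{"page_type": subtask["page_type"]}]


subtasks = [{"page_type": "detail"}, {"page_type": "index_navigation"}]
print(dispatch_in_turn(echo_crawler, subtasks))
print(dispatch_concurrently(echo_crawler, subtasks))
```

Either variant hands the controller one batch of structured data per subtask, so the later merge step does not need to know which dispatch mode was used.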
  • the crawler executes the at least two subtasks respectively, acquires at least two pieces of structured data, and returns the at least two pieces of structured data to the controller.
  • the process by which the crawler executes each subtask is the same as webpage content extraction by an existing crawler: it first downloads the WEB page of the webpage type corresponding to the subtask in the target website, then, according to the extraction template corresponding to the subtask, extracts the data of the downloaded WEB page for the attributes contained in the extraction template, and arranges the extracted data in the form of a list to generate structured data.
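The extraction step can be illustrated with a minimal sketch using Python's standard `html.parser`. It rests on an assumption that is not in the patent: that each attribute of interest appears in the page as the text of an element whose `class` equals the attribute name. The download step is left out, so the function takes the page HTML as a string; `TemplateExtractor` and `extract` are hypothetical names.

```python
from html.parser import HTMLParser
from typing import Dict, Iterable, List, Optional


class TemplateExtractor(HTMLParser):
    """Collects the text of elements whose class is listed in the template."""

    def __init__(self, template: Iterable[str]) -> None:
        super().__init__()
        self.template = set(template)
        self.current: Optional[str] = None   # attribute currently being read
        self.record: Dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class in self.template:
            self.current = css_class

    def handle_data(self, data):
        text = data.strip()
        if self.current and text:
            # Keep the first value found for each attribute.
            self.record.setdefault(self.current, text)

    def handle_endtag(self, tag):
        self.current = None


def extract(page_html: str, template: Iterable[str]) -> List[Dict[str, str]]:
    parser = TemplateExtractor(template)
    parser.feed(page_html)
    parser.close()
    # Structured data is returned as a list, here holding one record per page.
    return [parser.record]


# Made-up page fragment; "price" is not in the template and is ignored.
page = ('<div><span class="singer">Singer X</span>'
        '<span class="album">Album Y</span>'
        '<span class="price">9</span></div>')
print(extract(page, ["singer", "album", "scene"]))
```

Attributes listed in the template but absent from the page (here `scene`) are simply missing from the record, which is the gap the multi-subtask design is meant to fill from other page types.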
  • when executing subtask 1, the crawler can obtain song-related knowledge information such as singers, albums, and scenes from the detail page of the Baidu website.
  • the controller merges the at least two pieces of structured data returned by the crawler, and stores the merged structured data in a knowledge base corresponding to the task name.
  • the merging may refer to de-duplicating at least two structured data of the same domain object; for example, when constructing the music knowledge base, at least two structured data of each song of the plurality of songs may be acquired. At this time, at least two structured data of a certain song can be deduplicated and merged together.
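The merge step can be sketched as follows. The sketch assumes every record carries a shared key (here `song`) that identifies the domain object, and that when two sources disagree on an attribute the first value seen is kept; neither the key name nor the conflict policy is specified in the patent, and the sample values are made up.

```python
from typing import Dict, List

Record = Dict[str, str]


def merge_structured_data(*batches: List[Record], key: str = "song") -> List[Record]:
    merged: Dict[str, Record] = {}
    for batch in batches:
        for record in batch:
            # All records that share the same key describe one domain object.
            target = merged.setdefault(record[key], {})
            for attribute, value in record.items():
                # Deduplicate: an attribute already present is not overwritten.
                target.setdefault(attribute, value)
    return list(merged.values())


# Hypothetical outputs of a detail-page subtask and an index-navigation subtask.
detail_data = [{"song": "Song A", "singer": "Singer X", "album": "Album Y"}]
index_data = [
    {"song": "Song A", "song_style": "pop"},
    {"song": "Song B", "song_style": "rock"},
]
print(merge_structured_data(detail_data, index_data))
```

A different conflict policy (for example, preferring the most recently crawled value) would only change the `setdefault` line.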
  • the task configuration of the construction task includes: a detailed sub-task and an index navigation sub-task
  • the following two structured data can be obtained according to the extraction template of each sub-task. :
  • the method may further include:
  • the receiving knowledge base construction task may include:
  • the storing the merged structured data in the knowledge base corresponding to the task name may specifically include:
  • the task configuration stored in the controller can be updated periodically: new subtasks can be added, or new attributes can be added to the extraction templates of existing subtasks, to obtain the richest and most up-to-date knowledge information.
  • periodically receiving the knowledge base construction task may mean receiving the knowledge base construction task at preset time intervals, where the preset time may be set as required and is not limited in the embodiment of the present invention.
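When the construction task is received periodically, the description has the stored structured data deleted and replaced by the newly merged data. A minimal sketch of that replace-on-rebuild behavior, assuming an SQLite table as the knowledge base store (the patent does not name a storage engine, and the table layout and function names are hypothetical):

```python
import json
import sqlite3
from typing import Dict, List

Record = Dict[str, str]


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS knowledge (task_name TEXT, record TEXT)"
    )
    return conn


def replace_knowledge(conn: sqlite3.Connection, task_name: str,
                      records: List[Record]) -> None:
    # One transaction: the old data is only gone if the new data is stored.
    with conn:
        conn.execute("DELETE FROM knowledge WHERE task_name = ?", (task_name,))
        conn.executemany(
            "INSERT INTO knowledge (task_name, record) VALUES (?, ?)",
            [(task_name, json.dumps(r, sort_keys=True)) for r in records],
        )


def read_knowledge(conn: sqlite3.Connection, task_name: str) -> List[Record]:
    rows = conn.execute(
        "SELECT record FROM knowledge WHERE task_name = ? ORDER BY rowid",
        (task_name,),
    )
    return [json.loads(row[0]) for row in rows]


store = open_store()
replace_knowledge(store, "demo", [{"name": "Song A"}])              # first build
replace_knowledge(store, "demo", [{"name": "Song B"}, {"name": "Song C"}])  # rebuild
print(read_knowledge(store, "demo"))
```

The same `read_knowledge` lookup by task name also matches the query request described earlier, where the controller returns the structured data of the knowledge base that corresponds to a task name.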
  • an embodiment of the present invention provides a knowledge base construction method, which receives a knowledge base construction task and queries the task configuration corresponding to the task name, the task configuration including at least two subtasks, each subtask corresponding to one type of web page; then sends the at least two subtasks to the crawler, triggering the crawler to execute the at least two subtasks and traverse different kinds of webpages to obtain at least two pieces of structured data; and merges the at least two pieces of structured data and stores the merged structured data in a knowledge base corresponding to the task name.
  • the knowledge base is constructed by extracting knowledge from multiple types of web pages.
  • the knowledge information extracted from different web pages can be merged and aggregated, which greatly enriches the types of knowledge information and thus enriches and completes the domain knowledge base. This avoids the existing problem that extracting the content of only a single type of page (such as the detail page) yields insufficient knowledge information, leaving the constructed domain knowledge base not rich enough.
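The overall controller flow summarized above (receive a task name, look up its configuration, send the subtasks to the crawler, merge the returned data, store it under the task name) can be sketched end to end. Everything here is a simplified illustration: the crawler is a stub returning made-up records, the configuration and knowledge bases are plain dictionaries, and the `name` key and first-value-wins merge policy are assumptions rather than details from the patent.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]


def build_knowledge_base(
    task_name: str,
    task_configs: Dict[str, List[dict]],
    crawler: Callable[[dict], List[Record]],
    knowledge_bases: Dict[str, List[Record]],
    key: str = "name",
) -> List[Record]:
    # Query the task configuration that corresponds to the task name.
    subtasks = task_configs[task_name]
    if len(subtasks) < 2:
        raise ValueError("expected at least two subtasks")

    # Send each subtask to the crawler and collect the structured data returned.
    batches = [crawler(subtask) for subtask in subtasks]

    # Merge records that describe the same domain object (first value wins).
    merged: Dict[str, Record] = {}
    for batch in batches:
        for record in batch:
            target = merged.setdefault(record[key], {})
            for attribute, value in record.items():
                target.setdefault(attribute, value)

    # Store the merged data in the knowledge base identified by the task name.
    knowledge_bases[task_name] = list(merged.values())
    return knowledge_bases[task_name]


def stub_crawler(subtask: dict) -> List[Record]:
    # Stand-in for a real crawler: returns canned, made-up records per page type.
    canned = {
        "detail": [{"name": "Song A", "singer": "Singer X"}],
        "index_navigation": [{"name": "Song A", "song_style": "pop"}],
    }
    return canned[subtask["page_type"]]


configs = {"demo": [{"page_type": "detail"}, {"page_type": "index_navigation"}]}
stores: Dict[str, List[Record]] = {}
print(build_knowledge_base("demo", configs, stub_crawler, stores))
```

The merged record contains attributes from both page types, which is the enrichment effect the description attributes to combining several subtasks.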
  • the interface unit in the controller shown in FIG. 2 of the present invention may be a communication unit of the controller; the task scheduling unit and the task management unit may be separately provided processors, or may be integrated into one processor of the controller.
  • they may also be stored in the memory of the controller in the form of program code, with a processor of the controller invoking and executing the knowledge base construction functions described above; the task storage unit may be a memory in the controller.
  • the processor described here can be a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention.
  • the present invention also provides a controller, preferably for implementing the above method.
  • FIG. 5 is a structural diagram of a controller 10 according to an embodiment of the present invention, for performing the foregoing method.
  • the controller 10 may include: a communication interface 1001, a processor 1002, a memory 1003, and at least one communication bus 1004 for implementing connections and mutual communication between these devices;
  • the communication interface 1001 can be used for data communication with an external network element.
  • the processor 1002 may be a central processing unit (CPU), an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention, for example: one or more digital signal processors (DSPs), or one or more Field Programmable Gate Arrays (FPGAs).
  • DSPs digital signal processors
  • FPGAs Field Programmable Gate Arrays
  • the memory 1003 may be a volatile memory such as a random-access memory (RAM); a non-volatile memory such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); or a combination of the above types of memory, for storage to implement the knowledge base of the present invention
  • RAM random-access memory
  • ROM read-only memory
  • HDD hard disk drive
  • SSD solid-state drive
  • the communication bus 1004 can be divided into an address bus, a data bus, a control bus, etc., and can be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, or an Extended Industry Standard Architecture (EISA) bus, etc.
  • ISA Industry Standard Architecture
  • PCI Peripheral Component Interconnect
  • EISA Extended Industry Standard Architecture
  • the communication unit 1001 is configured to receive a knowledge base construction task, where the knowledge base construction task includes a task name that identifies the knowledge base to be built.
  • the processor 1002 is configured to query a task configuration corresponding to the task name received by the communication unit 1001.
  • the task configuration includes: at least two subtasks, each of which is configured with: a target website, an extraction template, and a webpage type;
  • the memory 1003 is configured to store the structured data merged by the processor 1002 into a knowledge base corresponding to the task name.
  • Each subtask is used to: instruct the crawler to perform structured data extraction, according to the extraction template, on the page corresponding to the webpage type in the target website; the target website is the website on which structured data extraction is to be performed;
  • the extraction template includes: at least one attribute related to knowledge in the knowledge base to be built; the webpage type may be a detail page, an index navigation page, or another type of web page; to enrich the constructed knowledge base as much as possible, the extraction template corresponding to each subtask is different, and the webpage type corresponding to each subtask is also different.
  • as many subtasks as possible should be configured, so as to extract knowledge information with many different attributes from a wider variety of web pages.
  • the at least two subtasks may include: a first subtask and a second subtask, wherein the webpage type of the first subtask is a detail page, and the webpage type of the second subtask is an index navigation page; understandably, with the development of computer technology, if other types of webpages appear in the future, such a webpage type may also be assigned to a subtask, and structured data may be extracted from those webpages to enrich the domain knowledge base.
  • the communication unit 1001 can be specifically configured to:
  • the communication unit 1001 may further be configured to:
  • the creation request includes: a task name and a task attribute; storing a correspondence between the task name and the task attribute.
  • processor 1002 is specifically configured to:
  • the at least two subtasks are sent to the crawler in turn, or the at least two subtasks are sent to the crawler at the same time, which is not limited in the embodiment of the present invention.
  • the processor 1002 may specifically be used to:
  • At least two structured data of the same domain object are deduplicated and combined; for example, when constructing the music knowledge base, at least two structured data of each song of the plurality of songs can be acquired, and at this time, At least two structured data of a song are deduplicated and merged together.
  • the communication unit 1001 can also be used to:
  • the query request includes: the task name;
  • the processor 1002 is further configured to: after the communication unit 1001 receives the query request, query a knowledge base corresponding to the task name, and feed back the structured data in the knowledge base to the user.
  • the communication unit 1001 may be specifically configured to:
  • the processor 1002 is specifically configured to:
  • the structured data existing in the knowledge base is deleted, and the currently merged structured data is stored in the knowledge base.
  • the task configuration stored in the controller can be updated periodically: new subtasks can be added, or new attributes can be added to the extraction templates of existing subtasks, to obtain the richest and most up-to-date knowledge information.
  • periodically receiving the knowledge base construction task in the embodiment of the present invention may mean receiving the knowledge base construction task at preset time intervals, where the preset time may be set as required and is not limited in the embodiment of the present invention.
  • an embodiment of the present invention provides a controller, which receives a knowledge base construction task and queries the task configuration corresponding to the task name, the task configuration including at least two subtasks, each subtask corresponding to one type of webpage; then sends the at least two subtasks to the crawler, triggering the crawler to execute the at least two subtasks and traverse different kinds of webpages to obtain at least two pieces of structured data; and merges the at least two pieces of structured data and stores the merged structured data in a knowledge base corresponding to the task name.
  • the knowledge base is constructed by extracting knowledge from multiple types of web pages.
  • the knowledge information extracted from different web pages can be merged and aggregated, which greatly enriches the types of knowledge information and thus enriches and completes the domain knowledge base. This avoids the existing problem that extracting the content of only a single type of page (such as the detail page) yields insufficient knowledge information, leaving the constructed domain knowledge base not rich enough.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

Provided are a method for constructing a knowledge base, and a controller, relating to the technical field of the Internet. They solve the existing problem that the constructed domain knowledge is not complete enough because it is limited by the richness of WEB detail page information. The method provided in the present invention comprises: receiving a knowledge base construction task, wherein the knowledge base construction task contains a task name identifying a knowledge base to be constructed; querying a task configuration corresponding to the task name, wherein the task configuration comprises at least two sub-tasks; sending the at least two sub-tasks to a crawler, and triggering the crawler to execute the at least two sub-tasks to obtain at least two items of structured data; receiving the at least two items of structured data returned by the crawler; and merging the at least two items of structured data, and saving the merged structured data in the knowledge base corresponding to the task name.

Description

Knowledge base construction method and controller
This application claims priority to Chinese Patent Application No. 201510953365.0, filed with the Chinese Patent Office on December 17, 2015 and entitled "Knowledge base construction method and controller", which is incorporated herein by reference in its entirety.
Technical Field
The present invention relates to the field of Internet technologies, and in particular, to a knowledge base construction method and a controller.
Background
With the development of the Internet, the amount of information on the Internet is growing rapidly. To ensure that computer applications can keep pace in understanding and intelligently processing target objects, it is necessary to build and use a domain knowledge base that is as rich, accurate, and timely as possible. At present, domain knowledge bases are mostly built with automatic or semi-automatic knowledge extraction methods. For example, a customized crawler crawls encyclopedia sites and vertical websites and obtains semi-structured information, such as object attributes and tables, from web (WEB) detail pages to build the domain knowledge base.
However, in the process of implementing the present invention, the inventor found that, for a domain knowledge base built by extracting data from WEB detail pages, the completeness of the populated domain object knowledge attributes is often limited by how rich the WEB detail page information is. When the WEB detail page information is not rich enough, the domain object knowledge attributes extracted from it tend to be insufficient, and the domain object cannot be described completely. For example, the detail page of a specific piece of music often includes only a small amount of information related to that piece, such as the singer, the album, and a few tags, whereas information such as the style, category, and scene to which the music belongs cannot be obtained from the detail page, which affects the completeness of the music knowledge base.
Summary
The main objective of the present invention is to provide a knowledge base construction method and a controller, to solve the existing problem that the constructed domain knowledge is incomplete because it is limited by the richness of WEB detail page information.
To achieve the foregoing objective, the embodiments of the present invention adopt the following technical solutions:
According to a first aspect, an embodiment of the present invention provides a knowledge base construction method, applied to a controller. The method may include:
receiving a knowledge base construction task, where the knowledge base construction task includes a task name that identifies the knowledge base to be constructed;
querying a task configuration corresponding to the task name, where the task configuration includes at least two subtasks, each subtask is configured with a target website, an extraction template, and a web page type, and each subtask is used to instruct a grabber to extract structured data, according to the extraction template, from pages of the target website that correspond to the web page type;
sending the at least two subtasks to the grabber, and triggering the grabber to execute the at least two subtasks to obtain at least two items of structured data;
receiving the at least two items of structured data returned by the grabber; and
merging the at least two items of structured data, and storing the merged structured data in the knowledge base corresponding to the task name.
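As a minimal illustrative sketch that is not part of the embodiments, the five steps above can be written as one controller function. The names `build_knowledge_base`, `run_subtask`, and `merge`, and the use of in-memory dictionaries for the task configurations and knowledge bases, are assumptions made only for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins: the method does not prescribe these types.
SubTask = dict   # assumed to carry a target website, extraction template, page type
Record = dict    # one row of structured data

def build_knowledge_base(
    task_name: str,
    task_configs: Dict[str, List[SubTask]],
    run_subtask: Callable[[SubTask], List[Record]],
    knowledge_bases: Dict[str, List[Record]],
    merge: Callable[[List[List[Record]]], List[Record]],
) -> List[Record]:
    # Step 2: query the task configuration that corresponds to the task name.
    subtasks = task_configs[task_name]
    if len(subtasks) < 2:
        raise ValueError("a task configuration needs at least two subtasks")
    # Steps 3 and 4: hand each subtask to the grabber and collect what it returns.
    results = [run_subtask(subtask) for subtask in subtasks]
    # Step 5: merge the results and store them under the task name.
    merged = merge(results)
    knowledge_bases[task_name] = merged
    return merged
```

Here `run_subtask` stands in for the grabber and `merge` for whatever merging rule the controller applies; both are passed in so that the sketch stays independent of how they are realized.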
Optionally, based on the currently known web page types, the at least two subtasks may include a first subtask and a second subtask, where the web page type of the first subtask is a detail page and the web page type of the second subtask is an index navigation page.
To enable the controller to conveniently find the task configuration corresponding to the task name, in one implementation of the first aspect, before the receiving a knowledge base construction task, the method may further include:
receiving a creation request, where the creation request includes the task name and task attributes; and storing the correspondence between the task name and the task attributes.
Further, to make it convenient for a user to query knowledge information, in another implementation of the first aspect, the method may further include:
receiving a query request sent by a user, where the query request includes the task name; and
querying the knowledge base corresponding to the task name, and returning the structured data in the knowledge base to the user.
Further, because domain knowledge information is continuously updated, to keep the knowledge information in the constructed knowledge base up to date, in yet another implementation of the first aspect, the receiving a knowledge base construction task may include:
periodically receiving the knowledge base construction task.
The storing the merged structured data in the knowledge base corresponding to the task name may specifically include:
deleting the structured data already in the knowledge base, and storing the currently merged structured data in the knowledge base.
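The delete-then-store behaviour described above can be sketched as follows. This is an illustration only: the function name `refresh_knowledge_base` and the dictionary used as the knowledge base store are assumptions, and the periodic trigger (for example a scheduler that re-issues the construction task) is left outside the sketch.

```python
def refresh_knowledge_base(knowledge_bases, task_name, merged_records):
    """Replace, rather than append to, the stored data for one knowledge base."""
    # Delete the structured data that is already stored for this task name.
    knowledge_bases.pop(task_name, None)
    # Store the structured data produced by the current construction round.
    knowledge_bases[task_name] = list(merged_records)
```

Replacing the whole entry means that objects which have disappeared from the target website do not linger in the knowledge base after the next periodic round.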
In this way, the knowledge base is constructed by extracting knowledge from multiple types of web pages. Because different types of web pages contain knowledge information with different attributes, merging and summarizing the knowledge information extracted from the different web pages greatly enriches the types of knowledge information and enriches and improves the domain knowledge base. This avoids the existing problem that extracting content from only a single type of page (for example, the detail page) yields insufficient knowledge information and therefore a domain knowledge base that is not rich enough.
According to a second aspect, an embodiment of the present invention provides a controller, which may include:
an interface unit, configured to receive a knowledge base construction task, where the knowledge base construction task includes a task name that identifies the knowledge base to be constructed;
a task scheduling unit, configured to: query a task configuration corresponding to the task name received by the interface unit, where the task configuration includes at least two subtasks, each subtask is configured with a target website, an extraction template, and a web page type, and each subtask is used to instruct a grabber to extract structured data, according to the extraction template, from pages of the target website that correspond to the web page type;
send the at least two subtasks to the grabber, and trigger the grabber to execute the at least two subtasks to obtain at least two items of structured data; and
receive the at least two items of structured data returned by the grabber, and merge the at least two items of structured data; and
a task storage unit, configured to store the structured data merged by the task scheduling unit in the knowledge base corresponding to the task name.
Optionally, based on the currently known web page types, the at least two subtasks may include a first subtask and a second subtask, where the web page type of the first subtask is a detail page and the web page type of the second subtask is an index navigation page.
To enable the controller to conveniently find the task configuration corresponding to the task name, in one implementation of the second aspect, the interface unit may be further configured to:
receive a creation request before receiving the knowledge base construction task, where the creation request includes the task name and task attributes.
The controller may further include a task management unit.
The task management unit is configured to store the correspondence between the task name and the task attributes after the interface unit receives the creation request.
Further, to make it convenient for a user to query knowledge information, in another implementation of the second aspect, the interface unit may be further configured to:
receive a query request sent by a user, where the query request includes the task name.
The task scheduling unit may be further configured to query the knowledge base corresponding to the task name and return the structured data in the knowledge base to the user.
Further, because domain knowledge information is continuously updated, to keep the knowledge information in the constructed knowledge base up to date, in yet another implementation of the second aspect, the interface unit is specifically configured to:
periodically receive the knowledge base construction task.
The task storage unit is specifically configured to delete the structured data already in the knowledge base and store the currently merged structured data in the knowledge base.
According to a third aspect, an embodiment of the present invention provides a controller, which may include:
a communication unit, configured to receive a knowledge base construction task, where the knowledge base construction task includes a task name that identifies the knowledge base to be constructed;
a processor, configured to: query a task configuration corresponding to the task name received by the communication unit, where the task configuration includes at least two subtasks, each subtask is configured with a target website, an extraction template, and a web page type, and each subtask is used to instruct a grabber to extract structured data, according to the extraction template, from pages of the target website that correspond to the web page type;
send the at least two subtasks to the grabber, and trigger the grabber to execute the at least two subtasks to obtain at least two items of structured data; and
receive the at least two items of structured data returned by the grabber, and merge the at least two items of structured data; and
a memory, configured to store the structured data merged by the processor in the knowledge base corresponding to the task name.
Optionally, based on the currently known web page types, the at least two subtasks may include a first subtask and a second subtask, where the web page type of the first subtask is a detail page and the web page type of the second subtask is an index navigation page.
To enable the controller to conveniently find the task configuration corresponding to the task name, in one implementation of the third aspect, the communication unit may be further configured to:
receive a creation request before receiving the knowledge base construction task, where the creation request includes the task name and task attributes.
The processor may be further configured to store the correspondence between the task name and the task attributes after the communication unit receives the creation request.
Further, to make it convenient for a user to query knowledge information, in another implementation of the third aspect, the communication unit may be further configured to:
receive a query request sent by a user, where the query request includes the task name.
The processor may be further configured to query the knowledge base corresponding to the task name and return the structured data in the knowledge base to the user.
Further, because domain knowledge information is continuously updated, to keep the knowledge information in the constructed knowledge base up to date, in yet another implementation of the third aspect, the communication unit is specifically configured to:
periodically receive the knowledge base construction task.
The memory is specifically configured to delete the structured data already in the knowledge base and store the currently merged structured data in the knowledge base.
It can be seen from the foregoing that the embodiments of the present invention provide a knowledge base construction method and a controller. A knowledge base construction task is received, and a task configuration that corresponds to the task name and includes at least two subtasks is queried, where each subtask corresponds to one type of web page. The at least two subtasks are then sent to the grabber, which is triggered to execute them and traverse different types of web pages to obtain at least two items of structured data. The at least two items of structured data are merged, and the merged structured data is stored in the knowledge base corresponding to the task name. In this way, the knowledge base is constructed by extracting knowledge from multiple types of web pages. Because different types of web pages contain knowledge information with different attributes, merging and summarizing the knowledge information extracted from the different web pages greatly enriches the types of knowledge information and enriches and improves the domain knowledge base. This avoids the existing problem that extracting content from only a single type of page (for example, the detail page) yields insufficient knowledge information and therefore a domain knowledge base that is not rich enough.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the following briefly introduces the accompanying drawings required for describing the embodiments or the prior art. Apparently, the accompanying drawings in the following description show merely some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of a system architecture according to an embodiment of the present invention;
FIG. 2 is a structural diagram of a controller 10 according to an embodiment of the present invention;
FIG. 3 is a structural diagram of a grabber 20 according to an embodiment of the present invention;
FIG. 4 is a flowchart of a knowledge base construction method according to an embodiment of the present invention; and
FIG. 5 is a structural diagram of a controller according to an embodiment of the present invention.
Detailed Description
The following clearly and completely describes the technical solutions in the embodiments of the present invention with reference to the accompanying drawings in the embodiments of the present invention. Apparently, the described embodiments are merely some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
FIG. 1 is a simplified schematic diagram of a system architecture to which the present invention can be applied. Referring to FIG. 1, the system architecture may include a controller 10, a grabber 20, and a web (WEB) server 30. The controller 10, the grabber 20, and the WEB server 30 establish communication links with one another over a network, and the network may use any type of connection, such as a wired link, a wireless communication link, or a fiber-optic cable.
The controller 10 is mainly configured to: receive a knowledge base construction task, query the task configuration corresponding to the task, obtain at least two subtasks from the task configuration, and schedule the grabber 20 to execute the at least two subtasks, so as to iterate over different types of web pages of the target website along multiple paths and obtain at least two items of structured data for building the domain knowledge base.
The grabber 20 is mainly configured to extract the content of the pages of the target website that correspond to the web page type of a subtask, and obtain the structured data corresponding to the extraction template.
The WEB server 30 hosts multiple vertical-domain WEB sites and operates as the entry point through which the grabber 20 accesses web page resources. After receiving a subtask, the grabber 20 can access the target website on the WEB server through a uniform resource locator (URL) address.
Specifically, as shown in FIG. 2 and FIG. 3, the controller 10 may include an interface unit 101, a task scheduling unit 102, a task storage unit 103, and a task management unit 104, and the grabber 20 may include a receiving unit 201, a WEB content download unit 202, and a WEB content extraction unit 203. The units complete the construction of the domain knowledge base through the following process:
After the interface unit 101 receives a knowledge base construction task that includes a task name, the task scheduling unit 102 obtains, from the task storage unit 103, the task configuration corresponding to the task name, obtains the at least two subtasks included in the task configuration, sends the at least two subtasks to the grabber 20, and schedules the grabber 20 to execute each subtask and traverse different web pages of the target website to obtain at least two items of structured data. The task configuration stored in the task storage unit 103 is stored there by the task management unit 104 after the interface unit 101 receives a creation request.
After the receiving unit 201 of the grabber 20 receives the scheduling task, issued by the controller 10, for executing multiple subtasks, the WEB content download unit 202 downloads the WEB pages of the target website whose web page type corresponds to the subtask. The WEB content extraction unit 203 then extracts the content of the downloaded WEB pages according to the extraction template corresponding to the subtask to obtain structured data, and sends the obtained structured data to the task scheduling unit 102 of the controller 10 through the receiving unit 201. The task scheduling unit 102 merges the structured data corresponding to the multiple subtasks and stores the merged structured data in the knowledge base in the task storage unit 103, so that after the interface unit 101 receives a query request sent by a user, the corresponding structured data can be read from the knowledge base in the task storage unit 103 and returned to the user.
In this way, the knowledge base is constructed by extracting knowledge from multiple types of web pages. Because different types of web pages contain knowledge information with different attributes, merging and summarizing the knowledge information extracted from the different web pages greatly enriches the types of knowledge information and enriches and improves the domain knowledge base. This avoids the existing problem that extracting content from only a single type of page (for example, the detail page) yields insufficient knowledge information and therefore a domain knowledge base that is not rich enough.
For ease of description, the knowledge base construction method of the present invention is shown and described in detail below in the form of steps. The steps shown may also be performed in a computer system other than the devices in the system architecture shown in FIG. 1, such as one that executes a set of executable instructions. In addition, although a logical order is shown in the figure, in some cases the steps shown or described may be performed in a different order.
FIG. 4 is a flowchart of a knowledge base construction method according to an embodiment of the present invention, applied to the system architecture shown in FIG. 1. As shown in FIG. 4, the method may include:
S101: The controller receives a knowledge base construction task, where the knowledge base construction task includes a task name that identifies the knowledge base to be constructed.
Optionally, the controller may receive a knowledge base construction task that the user sends from a handheld terminal, or one that the user sends through the user interaction interface of the controller.
For example, the user may enter "Baidu Music Knowledge Base" in an input box on the controller's display and click the corresponding button to trigger the Baidu music knowledge base construction task and send it to the controller, where "Baidu Music Knowledge Base" is the knowledge base to be constructed.
S102: The controller queries the task configuration corresponding to the task name, where the task configuration includes at least two subtasks, and each subtask is configured with a target website, an extraction template, and a web page type.
Each subtask is used to instruct the grabber to extract structured data, according to the extraction template, from pages of the target website that correspond to the web page type. The target website is the website from which structured data is to be extracted. The extraction template contains at least one attribute related to the knowledge in the knowledge base to be constructed. The web page type may be a detail page, an index navigation page, or another type of web page. To enrich the constructed knowledge base as much as possible, in this embodiment of the present invention, the subtasks have different extraction templates and different web page types. In addition, as many subtasks as possible should be configured during task configuration, so that knowledge information with many different attributes is extracted from more types of web pages.
Optionally, based on the currently known web page types, the at least two subtasks may include a first subtask and a second subtask, where the web page type of the first subtask is a detail page and the web page type of the second subtask is an index navigation page. It can be understood that, as computer technology develops, if other types of web pages appear in the future, a subtask can be configured for each such type, and structured data can be extracted from those pages to enrich the domain knowledge base.
It should be noted that, in this embodiment of the present invention, a detail page may be a page on which the details of a domain object can be viewed. An index navigation page may be a page that provides the user with an index of a group of domain objects and guides the user to the detail page of a domain object; it is usually the home page of the target website. Structured data may be the knowledge data extracted according to the extraction template and combined in the form of a list.
For example, to build a music knowledge base, two subtasks may be configured: subtask 1 and subtask 2. Subtask 1 corresponds to the detail pages of the Baidu website, and its extraction template includes attributes such as singer, album, and scene. Subtask 2 corresponds to the index navigation pages of the Baidu website, and its extraction template includes attributes such as song style and song era.
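One possible in-memory representation of such a task configuration is sketched below. The class and field names (`SubTask`, `TaskConfig`, `target_site`, `page_type`, `template`) and the string values used for the page types are assumptions for illustration; the embodiments only require that each subtask carries a target website, an extraction template, and a web page type.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class SubTask:
    target_site: str             # website to extract structured data from
    page_type: str               # e.g. "detail" or "index_navigation" (assumed labels)
    template: Tuple[str, ...]    # attribute names the grabber should extract

@dataclass(frozen=True)
class TaskConfig:
    task_name: str
    subtasks: Tuple[SubTask, ...]

    def __post_init__(self):
        # The method requires at least two subtasks per task configuration.
        if len(self.subtasks) < 2:
            raise ValueError("a task configuration needs at least two subtasks")

# The music example from the text, with placeholder identifiers.
music_config = TaskConfig(
    task_name="music knowledge base",
    subtasks=(
        SubTask("Baidu website", "detail", ("singer", "album", "scene")),
        SubTask("Baidu website", "index_navigation", ("song_style", "song_era")),
    ),
)
```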
In this embodiment of the present invention, to enable the controller to conveniently find the task configuration corresponding to the task name, before the knowledge base construction task is received, the method may further include:
receiving a creation request, where the creation request includes a task name and task attributes; and
storing the correspondence between the task name and the task attributes.
Correspondingly, the querying, by the controller, the task configuration corresponding to the task name may specifically include:
querying, by the controller, the correspondence between task names and task attributes that is stored in advance in the controller, to obtain the task configuration corresponding to the task name.
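A minimal sketch of this store-then-look-up behaviour follows. The class name `TaskRegistry` and its method names are hypothetical, and a plain dictionary stands in for whatever storage the controller actually uses.

```python
class TaskRegistry:
    """Keeps the correspondence between task names and task attributes."""

    def __init__(self):
        self._attributes_by_name = {}

    def handle_create_request(self, task_name, task_attributes):
        # Store the correspondence carried by the creation request.
        self._attributes_by_name[task_name] = task_attributes

    def query_configuration(self, task_name):
        # Look up the stored correspondence when a construction task arrives.
        if task_name not in self._attributes_by_name:
            raise LookupError(f"no task configuration stored for {task_name!r}")
        return self._attributes_by_name[task_name]
```

Raising an error for an unknown task name is a design choice of the sketch; the text does not say how the controller treats a construction task whose name was never created.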
S103: The controller sends the at least two subtasks to the grabber.
Optionally, the controller may send the at least two subtasks to the grabber one after another, or may send them to the grabber at the same time. This is not limited in this embodiment of the present invention.
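The two dispatch options can be sketched as follows, assuming the grabber is reachable through a callable `run_subtask` (a hypothetical name) that returns the structured data for one subtask. A thread pool is only one of several ways to send subtasks at the same time.

```python
from concurrent.futures import ThreadPoolExecutor

def dispatch_sequentially(subtasks, run_subtask):
    # Send the subtasks one after another and wait for each result in turn.
    return [run_subtask(subtask) for subtask in subtasks]

def dispatch_concurrently(subtasks, run_subtask):
    # Send all subtasks at once; map() yields results in the order of the inputs.
    with ThreadPoolExecutor(max_workers=max(1, len(subtasks))) as pool:
        return list(pool.map(run_subtask, subtasks))
```

Both functions return the results in subtask order, so the later merge step does not need to know which option was used.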
S104: The grabber executes each of the at least two subtasks, obtains at least two items of structured data, and returns the at least two items of structured data to the controller.
The process by which the grabber executes each subtask is the same as web page content extraction by an existing grabber: it first downloads the WEB pages of the target website whose web page type corresponds to the subtask, then extracts data from the downloaded WEB page content according to the attributes contained in the extraction template corresponding to the subtask, and assembles the extracted data in the form of a list to generate structured data.
For example, if subtask 1 corresponds to the detail pages of the Baidu website and its extraction template includes attributes such as singer, album, and scene, then when executing subtask 1 the grabber can obtain knowledge information related to a song, such as its singer, album, and scene, from the detail pages of the Baidu website.
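The extraction step can be sketched as below, under a simplifying assumption: each downloaded page has already been parsed into a dictionary of fields, so the sketch only shows how the extraction template selects attributes and how the rows are assembled into a list. The function name, the `id` field, and the `play_count` field in the usage example are hypothetical.

```python
def extract_structured_data(page_objects, template, id_field="id"):
    """Keep the identifier plus the template's attributes for each parsed page."""
    rows = []
    for page in page_objects:
        row = {id_field: page[id_field]}
        for attribute in template:
            # Attributes missing from a page are skipped rather than invented.
            if attribute in page:
                row[attribute] = page[attribute]
        rows.append(row)
    return rows

detail_pages = [
    {"id": "120125029", "name": "小苹果", "singer": "筷子兄弟", "play_count": 1},
]
detail_rows = extract_structured_data(detail_pages, ("name", "singer", "album"))
# detail_rows == [{"id": "120125029", "name": "小苹果", "singer": "筷子兄弟"}]
```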
S105:控制器对接收到的抓取器返回的至少两个结构化数据进行合并,将合并后的结构化数据存入与任务名称对应的知识库。S105: The controller merges the at least two structured data returned by the received grabber, and stores the merged structured data in a knowledge base corresponding to the task name.
其中,合并可以指将同一领域对象的至少两个结构化数据进行去重后组合在一起;例如,构建音乐知识库的时候,可以获取到多个歌曲中每个歌曲的至少两个结构化数据,此时,可以将某首歌曲的至少两个结构化数据去重后合并在一起。The merging may refer to de-duplicating at least two structured data of the same domain object; for example, when constructing the music knowledge base, at least two structured data of each song of the plurality of songs may be acquired. At this time, at least two structured data of a certain song can be deduplicated and merged together.
由于,对于任一领域对象而言,在目标网站中都具有唯一的标识信息,因此,在本发明实施例中,标识信息相同的至少两个结构化数据进行去重后组合在一起。Since, for any domain object, there is unique identification information in the target website, in the embodiment of the present invention, at least two structured data with the same identification information are deduplicated and combined.
For example, suppose a user needs to build a knowledge base related to "Little Apple" (《小苹果》), and the task configuration of the build task contains a detail subtask and an index navigation subtask. Applying the extraction template of each subtask yields the following two pieces of structured data:
1) Detail subtask: structured data is extracted from the detail pages of the target website for the domain knowledge base. The subtask outputs the structured data shown in Table 1, which contains the detail attributes related to "Little Apple":
Table 1
Unique ID | Name | Singer | Album
120125029 | Little Apple (小苹果) | Chopstick Brothers (筷子兄弟) | "Old Boys: The Way of the Dragon" film soundtrack (《老男孩之猛龙过江》电影原声)
2) Navigation subtask: structured data is extracted from the index navigation pages of the target website for the domain knowledge base. The subtask outputs the structured data shown in Table 2, which contains the classification information related to "Little Apple":
Table 2
Scene | Unique ID
Square dance (广场舞) | 120125029
The structured data of Table 1 and Table 2 are then merged: the structured data obtained by the navigation subtask is merged into the corresponding result of the detail subtask, giving the knowledge base related to "Little Apple" shown in Table 3. In this way, richer knowledge related to "Little Apple" is built up.
Table 3
Figure PCTCN2016103419-appb-000001
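The merge of Table 1 and Table 2 can be sketched as below. This is an illustration under stated assumptions, not the controller's actual code: the function name `merge_structured_data` and the key name `id` are hypothetical, and "deduplication" is interpreted here as keeping the first value seen for each attribute of an object.

```python
def merge_structured_data(*result_sets, key="id"):
    """Combine records that share the same unique ID into one record each."""
    merged = {}
    for records in result_sets:
        for record in records:
            target = merged.setdefault(record[key], {})
            for attribute, value in record.items():
                # Keep the first value seen; repeated attributes are dropped.
                target.setdefault(attribute, value)
    return list(merged.values())


# Table 1: output of the detail subtask.
detail_results = [{
    "id": "120125029",
    "name": "Little Apple",
    "singer": "Chopstick Brothers",
    "album": "Old Boys: The Way of the Dragon film soundtrack",
}]
# Table 2: output of the index navigation subtask.
navigation_results = [{"scene": "Square dance", "id": "120125029"}]

knowledge_base = merge_structured_data(detail_results, navigation_results)
```

Because both records carry the unique ID 120125029, the result is a single record that holds the name, singer, and album from the detail page together with the scene from the index navigation page.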
Further, to make it convenient for users to query knowledge, the method may also include:
receiving a query request sent by a user, where the query request contains the task name; and
querying the knowledge base corresponding to the task name and returning the structured data in that knowledge base to the user.
Further, because domain knowledge is updated continually, and so that the knowledge in the constructed knowledge base is the latest available, receiving the knowledge base construction task may specifically include:
receiving the knowledge base construction task periodically.
Storing the merged structured data in the knowledge base corresponding to the task name may specifically include:
deleting the structured data already in the knowledge base and storing the newly merged structured data in the knowledge base.
It can also be understood that the task configuration stored in the controller may be updated periodically, by adding new subtasks or by adding new attributes to the extraction templates of existing subtasks, so as to obtain the richest and most recent knowledge.
It should be noted that, in this embodiment of the present invention, receiving the knowledge base construction task periodically may mean receiving it at a preset time interval. The preset interval can be set as needed and is not limited by this embodiment.
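The storage behaviour described above (replace on each periodic build, query by task name) can be sketched as follows. The class `KnowledgeBaseStore` and its methods are hypothetical names, and an in-memory dictionary stands in for whatever storage the controller actually uses.

```python
class KnowledgeBaseStore:
    """In-memory stand-in for the controller's knowledge base storage."""

    def __init__(self):
        self._bases = {}  # task name -> list of merged records

    def store(self, task_name, merged_records):
        # Delete what the knowledge base already holds, then save the new data.
        self._bases.pop(task_name, None)
        self._bases[task_name] = list(merged_records)

    def query(self, task_name):
        # Return a copy so callers cannot alter the stored knowledge base.
        return list(self._bases.get(task_name, []))


store = KnowledgeBaseStore()
store.store("music", [{"id": "120125029", "name": "Little Apple"}])
# A later periodic build replaces the earlier contents instead of appending.
store.store("music", [{"id": "120125029", "name": "Little Apple",
                       "scene": "Square dance"}])
latest = store.query("music")
```

Replacing rather than appending keeps stale records from accumulating, at the cost of losing history; a real deployment might prefer versioned storage, which the source does not discuss.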
As can be seen from the above, an embodiment of the present invention provides a knowledge base construction method. The method receives a knowledge base construction task and queries the task configuration corresponding to the task name; the configuration contains at least two subtasks, each corresponding to one type of web page. The at least two subtasks are then sent to the crawler, which is triggered to execute them and traverses the different types of web pages to obtain at least two pieces of structured data. These are merged, and the merged structured data is stored in the knowledge base corresponding to the task name. The knowledge base is thus built by extracting knowledge from several types of web pages. Because different types of web pages contain knowledge with different attributes, merging the knowledge extracted from them greatly increases the variety of knowledge available and enriches the domain knowledge base. This avoids the problem of existing approaches, which extract content from only a single type of page (for example, the detail page), so that the knowledge obtained is insufficient and the resulting domain knowledge base is not rich enough.
It should be noted that the above process may be performed by the units of the controller shown in FIG. 2, and the details are not repeated here. In addition, the interface unit of the controller shown in FIG. 2 may be the communication unit of the controller. The task scheduling unit and the task management unit may each be a separately provided processor, or may be integrated into one of the controller's processors; they may also be stored as program code in the controller's memory and called by one of the controller's processors to perform the knowledge base construction functions described above. The task storage unit may be the memory of the controller. The processor described here may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention. Specifically, the present invention further provides a controller, preferably used to implement the above method.
FIG. 5 is a structural diagram of a controller 10 provided by an embodiment of the present invention, which is used to perform the above method. As shown in FIG. 5, the controller 10 may include a communication interface 1001, a processor 1002, a memory 1003, and at least one communication bus 1004 that connects these components and lets them communicate with one another.
The communication interface 1001 can be used for data communication with external network elements.
The processor 1002 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention, for example one or more digital signal processors (DSPs) or one or more field-programmable gate arrays (FPGAs).
The memory 1003 may be a volatile memory, such as random-access memory (RAM); a non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); or a combination of these types of memory. It stores the application programs, task configurations, and knowledge bases involved in the knowledge base construction of the present invention.
The communication bus 1004 may be divided into an address bus, a data bus, a control bus, and so on. It may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. For ease of illustration, FIG. 5 shows the bus as a single thick line, but this does not mean that there is only one bus or only one type of bus.
The communication unit 1001 is used to receive a knowledge base construction task; the knowledge base construction task contains a task name that identifies the knowledge base to be built.
The processor 1002 is used to query the task configuration corresponding to the task name received by the communication unit 1001. The task configuration contains at least two subtasks, and each subtask is configured with a target website, an extraction template, and a web page type.
The processor 1002 is also used to send the at least two subtasks to the crawler, obtain the at least two pieces of structured data that the crawler returns after executing the at least two subtasks, and merge the at least two pieces of structured data.
The memory 1003 is used to store the structured data merged by the processor 1002 in the knowledge base corresponding to the task name.
Each subtask instructs the crawler to extract structured data, according to the extraction template, from the pages of the target website that correspond to the web page type. The target website is the website from which structured data is to be extracted. The extraction template contains at least one attribute related to the knowledge in the knowledge base to be built. The web page type may be a detail page, an index navigation page, or another type of web page. To enrich the constructed knowledge base as much as possible, in this embodiment of the present invention each subtask has a different extraction template and a different web page type. In addition, when a task is configured, as many subtasks as possible should be configured, so that knowledge with many different attributes is extracted from more types of web pages.
Optionally, based on the web page types currently known, the at least two subtasks may include a first subtask and a second subtask, where the web page type of the first subtask is the detail page and the web page type of the second subtask is the index navigation page. It can be understood that if other types of web pages emerge as computer technology develops, a subtask can be configured for each such type, and structured data can be extracted from those pages to enrich the domain knowledge base.
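One possible in-memory shape for such a task configuration is sketched below. The class names `SubTask` and `TaskConfig`, the field names, the page type labels, and the example URL are all assumptions made for illustration; the source only states that each subtask is configured with a target website, an extraction template, and a web page type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubTask:
    target_website: str         # site to extract structured data from
    page_type: str              # e.g. "detail" or "index_navigation"
    extraction_template: tuple  # attribute names to extract


@dataclass
class TaskConfig:
    task_name: str
    subtasks: list

    def __post_init__(self):
        # The description requires at least two subtasks per configuration.
        if len(self.subtasks) < 2:
            raise ValueError("a task configuration needs at least two subtasks")
        # In the described embodiment every subtask has its own page type.
        page_types = [subtask.page_type for subtask in self.subtasks]
        if len(set(page_types)) != len(page_types):
            raise ValueError("each subtask should target a different page type")


music_config = TaskConfig(
    task_name="music",
    subtasks=[
        SubTask("https://music.example.com", "detail",
                ("id", "name", "singer", "album")),
        SubTask("https://music.example.com", "index_navigation",
                ("id", "scene")),
    ],
)
```

Validating the configuration when it is created means a misconfigured task fails early, before any subtask is sent to the crawler.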
Further, the communication unit 1001 may specifically be used to:
receive a knowledge base construction task that a user sends from a handheld terminal, or receive a knowledge base construction task that a user sends through the user interaction interface of the controller.
Further, in this embodiment of the present invention, so that the controller can conveniently look up the task configuration corresponding to a task name, the communication unit 1001 may also be used to:
receive a creation request before receiving the knowledge base construction task, where the creation request contains the task name and a task attribute; and store the correspondence between the task name and the task attribute.
Further, the processor 1002 is specifically used to:
send the at least two subtasks to the crawler one after another, or send the at least two subtasks to the crawler at the same time; this embodiment of the present invention does not limit which is used.
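The two dispatch options can be contrasted in a short sketch. Everything here is hypothetical: `fake_crawler` is a stub that returns canned data instead of crawling, and a thread pool is only one of several ways the subtasks could be sent at the same time.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_crawler(subtask):
    """Hypothetical crawler stub: returns canned data for each page type."""
    canned = {
        "detail": [{"id": "120125029", "name": "Little Apple"}],
        "index_navigation": [{"id": "120125029", "scene": "Square dance"}],
    }
    return canned[subtask["page_type"]]


def dispatch_in_turn(subtasks, crawler):
    # Send the subtasks one after another and collect each result.
    return [crawler(subtask) for subtask in subtasks]


def dispatch_simultaneously(subtasks, crawler):
    # Send all subtasks at once; map() yields results in input order.
    with ThreadPoolExecutor(max_workers=len(subtasks)) as pool:
        return list(pool.map(crawler, subtasks))


subtasks = [{"page_type": "detail"}, {"page_type": "index_navigation"}]
sequential_results = dispatch_in_turn(subtasks, fake_crawler)
concurrent_results = dispatch_simultaneously(subtasks, fake_crawler)
```

Both functions return one result list per subtask in the same order, so the later merge step does not need to know which dispatch mode was used.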
Further, when merging the at least two pieces of structured data, the processor 1002 may specifically be used to:
deduplicate at least two pieces of structured data about the same domain object and then combine them. For example, when a music knowledge base is built, at least two pieces of structured data may be obtained for each of a number of songs; the pieces of structured data for a given song can then be deduplicated and merged.
Because every domain object has unique identification information in the target website, in this embodiment of the present invention at least two pieces of structured data that carry the same identification information are deduplicated and combined.
Further, to make it convenient for users to query knowledge, the communication unit 1001 may also be used to:
receive a query request sent by a user, where the query request contains the task name.
The processor 1002 may also be used to query the knowledge base corresponding to the task name after the communication unit 1001 receives the query request, and to return the structured data in the knowledge base to the user.
Further, because domain knowledge is updated continually, and so that the knowledge in the constructed knowledge base is the latest available, the communication unit 1001 may specifically be used to:
receive the knowledge base construction task periodically.
The processor 1002 may specifically be used to:
delete the structured data already in the knowledge base and store the newly merged structured data in the knowledge base.
It can be understood that the task configuration stored in the controller may also be updated periodically, by adding new subtasks or by adding new attributes to the extraction templates of existing subtasks, so as to obtain the richest and most recent knowledge.
It should be noted that receiving the knowledge base construction task periodically, as described in this embodiment of the present invention, may mean receiving it at a preset time interval. The preset interval can be set as needed and is not limited by this embodiment.
As can be seen from the above, an embodiment of the present invention provides a controller. The controller receives a knowledge base construction task and queries the task configuration corresponding to the task name; the configuration contains at least two subtasks, each corresponding to one type of web page. The controller then sends the at least two subtasks to the crawler, which is triggered to execute them and traverses the different types of web pages to obtain at least two pieces of structured data. The controller merges these and stores the merged structured data in the knowledge base corresponding to the task name. The knowledge base is thus built by extracting knowledge from several types of web pages. Because different types of web pages contain knowledge with different attributes, merging the knowledge extracted from them greatly increases the variety of knowledge available and enriches the domain knowledge base. This avoids the problem of existing approaches, which extract content from only a single type of page (for example, the detail page), so that the knowledge obtained is insufficient and the resulting domain knowledge base is not rich enough.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in those embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (13)

  1. A knowledge base construction method, applied to a controller, comprising:
    receiving a knowledge base construction task, wherein the knowledge base construction task contains a task name that identifies the knowledge base to be built;
    querying a task configuration corresponding to the task name, wherein the task configuration contains at least two subtasks, each subtask is configured with a target website, an extraction template, and a web page type, and each subtask is used to instruct a crawler to extract structured data, according to the extraction template, from the pages of the target website that correspond to the web page type;
    sending the at least two subtasks to the crawler, and triggering the crawler to execute the at least two subtasks to obtain at least two pieces of structured data;
    receiving the at least two pieces of structured data returned by the crawler; and
    merging the at least two pieces of structured data, and storing the merged structured data in a knowledge base corresponding to the task name.
  2. The method according to claim 1, wherein the at least two subtasks comprise a first subtask and a second subtask; the web page type of the first subtask is a detail page; and the web page type of the second subtask is an index navigation page.
  3. The method according to claim 1 or 2, wherein before receiving the knowledge base construction task, the method further comprises:
    receiving a creation request, wherein the creation request contains the task name and a task attribute; and
    storing a correspondence between the task name and the task attribute.
  4. The method according to any one of claims 1 to 3, wherein sending the at least two subtasks to the crawler specifically comprises:
    sending the at least two subtasks to the crawler separately;
    or sending the at least two subtasks to the crawler at the same time.
  5. The method according to claim 1, wherein the method further comprises:
    receiving a query request sent by a user, wherein the query request contains the task name; and
    querying the knowledge base corresponding to the task name, and returning the structured data in the knowledge base to the user.
  6. The method according to claim 1, wherein receiving the knowledge base construction task specifically comprises:
    receiving the knowledge base construction task periodically;
    and storing the merged structured data in the knowledge base corresponding to the task name specifically comprises:
    deleting the structured data already in the knowledge base, and storing the newly merged structured data in the knowledge base.
  7. A controller, comprising:
    an interface unit, configured to receive a knowledge base construction task, wherein the knowledge base construction task contains a task name that identifies the knowledge base to be built;
    a task scheduling unit, configured to query a task configuration corresponding to the task name received by the interface unit, wherein the task configuration contains at least two subtasks, each subtask is configured with a target website, an extraction template, and a web page type, and each subtask is used to instruct a crawler to extract structured data, according to the extraction template, from the pages of the target website that correspond to the web page type;
    to send the at least two subtasks to the crawler, and trigger the crawler to execute the at least two subtasks to obtain at least two pieces of structured data; and
    to receive the at least two pieces of structured data returned by the crawler, and merge the at least two pieces of structured data; and
    a task storage unit, configured to store the structured data merged by the task scheduling unit in a knowledge base corresponding to the task name.
  8. The controller according to claim 7, wherein the at least two subtasks comprise a first subtask and a second subtask; the web page type of the first subtask is a detail page; and the web page type of the second subtask is an index navigation page.
  9. The controller according to claim 7 or 8, wherein the interface unit is further configured to:
    receive a creation request before the interface unit receives the knowledge base construction task, wherein the creation request contains the task name and a task attribute;
    and the controller further comprises:
    a task management unit, configured to store, in the task storage unit, a correspondence between the task name received by the interface unit and the task attribute.
  10. The controller according to any one of claims 7 to 9, wherein the task scheduling unit is specifically configured to:
    send the at least two subtasks to the crawler separately;
    or send the at least two subtasks to the crawler at the same time.
  11. The controller according to claim 7, wherein the interface unit is further configured to:
    receive a query request sent by a user, wherein the query request contains the task name;
    and the task scheduling unit is further configured to, after the interface unit receives the query request sent by the user, query the knowledge base corresponding to the task name and return the structured data in the knowledge base to the user.
  12. The controller according to claim 7, wherein the interface unit is specifically configured to:
    receive the knowledge base construction task periodically;
    and the task storage unit is specifically configured to:
    delete the structured data already in the knowledge base, and store the newly merged structured data in the knowledge base.
  13. A controller, comprising:
    a communication unit, configured to receive a knowledge base construction task, wherein the knowledge base construction task contains a task name that identifies the knowledge base to be built;
    a processor, configured to query a task configuration corresponding to the task name received by the communication unit, wherein the task configuration contains at least two subtasks, each subtask is configured with a target website, an extraction template, and a web page type, and each subtask is used to instruct a crawler to extract structured data, according to the extraction template, from the pages of the target website that correspond to the web page type;
    to send the at least two subtasks to the crawler, and trigger the crawler to execute the at least two subtasks to obtain at least two pieces of structured data; and
    to receive the at least two pieces of structured data returned by the crawler, and merge the at least two pieces of structured data; and
    a memory, configured to store the structured data merged by the processor in a knowledge base corresponding to the task name.
PCT/CN2016/103419 2015-12-17 2016-10-26 Method for constructing knowledge base, and controller WO2017101591A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510953365.0A CN105589945A (en) 2015-12-17 2015-12-17 Knowledge base construction method and controller
CN201510953365.0 2015-12-17

Publications (1)

Publication Number Publication Date
WO2017101591A1 true WO2017101591A1 (en) 2017-06-22

Family

ID=55929524

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/103419 WO2017101591A1 (en) 2015-12-17 2016-10-26 Method for constructing knowledge base, and controller

Country Status (2)

Country Link
CN (1) CN105589945A (en)
WO (1) WO2017101591A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109471927A (en) * 2018-10-30 2019-03-15 重庆邂智科技有限公司 A kind of knowledge base and its foundation, answering method and application apparatus
CN112860714A (en) * 2019-11-12 2021-05-28 斑马智行网络(香港)有限公司 Knowledge base, database, information updating method and device

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105589945A (en) * 2015-12-17 2016-05-18 华为技术有限公司 Knowledge base construction method and controller
CN107103543B (en) * 2016-02-23 2021-03-30 平安科技(深圳)有限公司 Protocol data processing method and system
CN107256226B (en) * 2017-04-28 2018-10-30 北京神州泰岳软件股份有限公司 A kind of construction method and device of knowledge base
CN107908637B (en) * 2017-09-26 2021-02-12 北京百度网讯科技有限公司 Entity updating method and system based on knowledge base
CN108595471B (en) * 2018-03-07 2022-08-02 中山大学 Knowledge acquisition method based on intelligent planning
US20200210855A1 (en) * 2018-12-28 2020-07-02 Robert Bosch Gmbh Domain knowledge injection into semi-crowdsourced unstructured data summarization for diagnosis and repair
CN111274012B (en) * 2020-01-16 2022-07-12 珠海格力电器股份有限公司 Service scheduling method, device, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101236563A (en) * 2008-02-01 2008-08-06 刘峰 Intelligent personalized service website constitution method
CN101853300A (en) * 2010-05-26 2010-10-06 中国科学技术大学 Method and system for identifying and evaluating video downloading service website
CN103425714A (en) * 2012-05-25 2013-12-04 北京搜狗信息服务有限公司 Query method and system
CN105589945A (en) * 2015-12-17 2016-05-18 华为技术有限公司 Knowledge base construction method and controller

Also Published As

Publication number Publication date
CN105589945A (en) 2016-05-18

Similar Documents

Publication Publication Date Title
WO2017101591A1 (en) Method for constructing knowledge base, and controller
US11526533B2 (en) Version history management
US20200073987A1 (en) Technologies for runtime selection of query execution engines
JP6416374B2 (en) Fast rendering of websites containing dynamic content and old content
US10565293B2 (en) Synchronizing DOM element references
US9390124B2 (en) Version control system using commit manifest database tables
CN110119393B (en) Code version management system and method
US20080201118A1 (en) Modeling a web page on top of HTML elements level by encapsulating the details of HTML elements in a component, building a web page, a website and website syndication on browser-based user interface
US10997360B2 (en) Page display method, device, and system, and page display assist method and device
US20180367593A1 (en) Systems, methods and computer program products for dynamic user profile enrichment and data integration
WO2017107826A1 (en) Service information pushing method and device
WO2016155669A1 (en) Data storage method and device
JP2017526041A (en) Batch optimized rendering and fetch architecture
US10372783B2 (en) Persisting the state of visual control elements in uniform resource locator (URL)-generated web pages
EP3022890B1 (en) Techniques to manage state information for a web service
US11449470B2 (en) Patching JSON documents that contain arrays undergoing concurrent modification
CA2838452A1 (en) Automated user interface object transformation and code generation
WO2015070674A1 (en) Method and system for manipulating data
WO2015010566A1 (en) Method for accurately searching for comprehensive information
US9754015B2 (en) Feature rich view of an entity subgraph
WO2019041441A1 (en) Updating device and method for list view and computer-readable storage medium
US20150379155A1 (en) Optimized browser render process
WO2015074477A1 (en) Path analysis method and apparatus
CN101184091A (en) Method and apparatus for ascertaining similar documents
US11200201B2 (en) Metadata storage method, device and server

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16874648

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16874648

Country of ref document: EP

Kind code of ref document: A1