WO2012091541A1

WO2012091541A1 - A semantic web constructor system and a method thereof

Info

Publication number: WO2012091541A1
Application number: PCT/MY2011/000153
Authority: WO
Inventors: Dickson Lukose; Nur Hana SAMSUDIN; Weng Onn Kow; John Foderaro; Chuan Woo SHENG
Original assignee: Mimos Berhad
Priority date: 2010-12-28
Filing date: 2011-06-24
Publication date: 2012-07-05
Also published as: MY176053A

Abstract

A semantic web constructor system (100) supported by a plurality of web crawlers (112) in a World Wide Web upon declaring a plurality of trustworthy websites is provided, characterized in that, the system (100) includes at least one web crawler controller (110) engagable to manage the plurality of web crawlers (112), a semantic web database (116) connectable to the plurality of web crawlers (112) and a plurality of data building editors (122, 124, 126) connectable to the at least one web crawler controller (110) wherein a semantic browser (120) is further connectable to the semantic web database (116) to receive at least one natural language query from at least one user.

Description

A SEMANTIC WEB CONSTRUCTOR SYSTEM AND A METHOD THEREOF

FIELD OF INVENTION

The present invention relates to a semantic web constructor system supported by a plurality of web crawlers in a World Wide Web, upon declaring a plurality of trustworthy websites in order to query a semantic browser for a plurality of websites that are page-ranked.

BACKGROUND OF INVENTION

Most contemporary web search engines function based on keyword searches. Keyword-based search engines usually return websites in which one or more of the keywords specified in the search occur. However, searching based on keywords tends to produce a large number of hits, which burdens the user with checking the relevance of each result. Users end up spending a large amount of time going through the resulting websites to identify a relevant document. This is a considerable drain on the user.

In addition to the problem of having a large volume of hits, the user also needs to internalize the relevant material in order to make it applicable and utilize it for the problem that is being solved.

With the current web browser technology, a user is not able to issue a natural language query, as a keyword-based search engine is not able to understand the context of the natural language query and is unable to search for relevant information effectively from the web. Furthermore, the web is made up of a large amount of highly distributed information that may come from questionable sources. Therefore, the user must be able to discern a reliable website from an unreliable one. Most documents available on the World Wide Web are not semantically enabled for computers to interpret and to process beyond simple keyword search. Human input is required to read the document and to interpret the information. This is a very slow and inefficient process, made painfully obvious by the current information explosion online.

U.S. 6,044,375 describes a method of automatically extracting metadata from a document through a neural network to generate metadata guesses including word guesses, compound guesses and document guesses, along with confidence factors associated with the guesses indicating the likelihood that each of the guesses is correct. However, the described invention does not look at the information on the website itself and does not extract data from the website itself.

Therefore, there is a need for a solution in order to search for relevant data, for users who are looking for specific and relevant data without having to spend too long manually reading through long lists of results.

SUMMARY OF INVENTION

Accordingly there is provided a semantic web constructor system supported by a plurality of web crawlers in a World Wide Web upon declaring a plurality of trustworthy websites, characterized in that, the system includes at least one web crawler controller engageable to manage the plurality of web crawlers, a semantic web database connectable to the plurality of web crawlers and a plurality of data building editors connectable to the at least one web crawler controller wherein a semantic browser is further connectable to the semantic web database to receive at least one natural language query from at least one user.

There is also provided a method of constructing a semantic web supported by a plurality of web crawlers in a World Wide Web upon declaring a plurality of trustworthy websites, characterized in that, the method includes the steps of crawling the web to select unprocessed websites, selecting websites based on trustworthiness calculated by comparing a trustworthiness numeric value to a predetermined threshold value, extracting a plurality of text from the selected websites, tokenizing the extracted plurality of text, applying a predetermined set of data transformation rules to the tokenized extracted plurality of text, converting the tokenized extracted plurality of text to metadata and storing the metadata in semantic web database.

There is further provided a method of querying a semantic web supported by a plurality of web crawlers in a World Wide Web upon declaring a plurality of trustworthy websites, characterized in that, the method includes the steps of receiving a query from a user, parsing the query into an internal representation format, searching the semantic web database using the internal representation format, ranking a plurality of websites based on trustworthiness and returning the ranked plurality of websites to the user.

The present invention consists of several novel features and a combination of parts hereinafter fully described and illustrated in the accompanying description and drawings, it being understood that various changes in the details may be made without departing from the scope of the invention or sacrificing any of the advantages of the present invention.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention will be fully understood from the detailed description given herein below and the accompanying drawings which are given by way of illustration only, and thus are not limitative of the present invention, wherein:

Figure 1 is a block diagram illustrating architecture of a preferred embodiment of a semantic web constructor system;

Figure 2 is a flowchart illustrating a preferred embodiment of the steps of constructing a semantic web supported by a plurality of web crawlers; and

Figure 3 is a diagram showing an example of a natural language query using the preferred embodiment of a semantic web constructor method and system.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

The present invention relates to a semantic web constructor system supported by a plurality of web crawlers in a World Wide Web, upon declaring a plurality of trustworthy websites in order to query a semantic browser for a plurality of websites that are page-ranked.

Hereinafter, this specification will describe the present invention according to the preferred embodiment of the present invention. However, it is to be understood that limiting the description to the preferred embodiment of the invention is merely to facilitate discussion of the present invention and it is envisioned that those skilled in the art may devise various modifications and equivalents without departing from the scope of the appended claims.

The following detailed description of the preferred embodiment will now be described in accordance with the attached drawings, either individually or in combination.

The present invention provides a semantic web constructor system (100) supported by a plurality of web crawlers (112) in a World Wide Web upon declaring a plurality of trustworthy websites as seen in Figure 1. The system (100) includes at least one web crawler controller (110) engageable to manage the plurality of web crawlers (112). A semantic web database (116) is further connectable to the plurality of web crawlers (112). A plurality of data building editors (122, 124, 126) is connectable to the at least one web crawler controller (110), wherein a semantic browser (120) is further connectable to the semantic web database (116) to receive at least one natural language query from at least one user. A trust engine (118) is connectable to the at least one web crawler controller (110).

The plurality of data building editors (122, 124, 126) further includes a website list editor (122), a rule editor (124) and a concept editor (126). However, it is to be appreciated by one skilled in the art that the plurality of data building editors may also include other editors of a similar nature in other embodiments of the system (100).

In order for the system (100) to be functional, the system (100) must first be provided with a plurality of relevant concepts, a predetermined set of transformation rules and an initial set of websites where relevant information associated with the relevant concepts is found. The concept editor (126) is used to specify a plurality of relevant concepts, properties and relationships in a subject area of interest. The concept editor (126) uses external resources such as WordNet to expand on a specified concept. The rule editor (124) is used for defining data transformation rules. The website list editor (122) is used to specify an initial set of websites where data related to the subject area of interest may be found.
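The inputs supplied through the three editors, and the expansion of a concept into similar concepts, might be sketched as follows. This is only an illustrative sketch: the `SeedData` structure, the `expand_concepts` function and the small `SYNONYMS` table (a stand-in for an external lexical resource such as WordNet) are assumptions and are not taken from the specification.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for an external lexical resource such as WordNet.
SYNONYMS = {
    "restaurant": {"eatery", "diner"},
    "cheap": {"inexpensive", "affordable"},
}


@dataclass
class SeedData:
    """Inputs supplied through the three editors (illustrative structure only)."""
    concepts: set = field(default_factory=set)    # from the concept editor (126)
    rules: list = field(default_factory=list)     # from the rule editor (124)
    websites: list = field(default_factory=list)  # from the website list editor (122)


def expand_concepts(concepts, lexicon=SYNONYMS):
    """Return the given concepts plus any similar concepts found in the lexicon."""
    expanded = set(concepts)
    for concept in concepts:
        expanded |= lexicon.get(concept, set())
    return expanded


seed = SeedData(concepts={"restaurant", "halal"}, websites=["http://example.com/"])
expanded = expand_concepts(seed.concepts)
```

A concept with no entry in the lexicon, such as "halal" here, is simply carried through unchanged.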

An example of a web crawler (112) is a spider as used in a search engine to retrieve information from the World Wide Web. Accordingly, an example of a web crawler controller (110) is a spider controller. The web crawler controller (110) as seen in Figure 1 filters websites to ensure that trustworthy websites are processed first by the plurality of web crawlers (112). Each web crawler (112) is assigned to a different website. Websites that are to be processed by the plurality of web crawlers (112) are delegated in a balanced manner by the web crawler controller (110). The web crawler controller (110) maintains a plurality of websites that are related to the subject area of interest. Initially, the websites are provided by the user. However, the plurality of websites increases as the web crawlers (112) encounter newly linked Uniform Resource Locators (URLs) on websites that are processed.

An intermediary database (114) is connectable to the plurality of web crawlers (112) and the semantic web database (116). An example of the intermediary database (114) is a merger database.

A method of constructing a semantic web supported by a plurality of web crawlers (112) in a World Wide Web upon declaring a plurality of trustworthy websites is described herein as seen in Figure 1. The method includes crawling the World Wide Web to select unprocessed websites using a plurality of web crawlers (112) such as spiders. An expansion of the initial plurality of relevant concepts is first performed, wherein each concept provided to the concept editor (126) is expanded to include a plurality of similar concepts. This step is carried out by means of an external lexical database such as WordNet. Upon declaring the initial set of websites, the trust engine (118) determines trustworthiness of each declared website. Websites are selected based on trustworthiness calculated by comparing a trustworthiness numeric value to a predetermined threshold value. For example, the trustworthiness value is a numeric value from 0 to 100. The trust engine (118) accepts websites with a trustworthiness value that is higher than the predetermined threshold value. A trusted list of websites is created as shown in Figure 1.
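The threshold test and the balanced delegation of websites described above might be sketched as follows. The specification does not state how a trustworthiness value is computed or how the balancing is done, so the scores below are hypothetical inputs and round-robin assignment is only one assumed way of delegating "in a balanced manner"; the function names are likewise illustrative.

```python
from collections import defaultdict


def select_trusted(scores, threshold):
    """Keep websites whose trust value (0 to 100) is higher than the threshold,
    ordered so that the most trustworthy websites are processed first."""
    trusted = [(url, score) for url, score in scores.items() if score > threshold]
    trusted.sort(key=lambda pair: pair[1], reverse=True)
    return [url for url, _ in trusted]


def delegate(urls, crawler_ids):
    """Assign each website to exactly one crawler, round-robin (assumed policy)."""
    assignment = defaultdict(list)
    for i, url in enumerate(urls):
        assignment[crawler_ids[i % len(crawler_ids)]].append(url)
    return dict(assignment)


# Hypothetical trust values for four declared websites.
scores = {"a.example": 90, "b.example": 40, "c.example": 75, "d.example": 60}
trusted = select_trusted(scores, threshold=50)
plan = delegate(trusted, ["crawler-1", "crawler-2"])
```

With a threshold of 50, the website scored 40 is rejected and the remaining three are shared between the two crawlers.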

A plurality of text is extracted from the selected trusted list of websites. It is to be understood that the plurality of text may also be extracted from documents found on the World Wide Web. Further, the extracted plurality of text is tokenized and a predetermined set of data transformation rules is applied to the tokenized extracted plurality of text. The tokenized extracted plurality of text is converted to metadata, such as Resource Description Framework (RDF) data, and the RDF data is stored in a semantic web database (116) as seen in Figure 2. Identification of new Uniform Resource Locators (URLs) on documents as found on the World Wide Web is carried out continuously. Upon encountering unprocessed URLs, the plurality of web crawlers (112) will pass the unprocessed URLs to the web crawler controller (110) for processing.
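The identification of unprocessed URLs on a crawled page might be sketched as follows, using only the Python standard library. The class and function names, the sample HTML and the `seen` set (standing in for the controller's record of known websites) are assumptions for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from the href attributes of anchor tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def unprocessed_links(html, base_url, seen):
    """Return newly found URLs, in discovery order, that are not already known."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    fresh = []
    for url in collector.links:
        if url not in seen and url not in fresh:
            fresh.append(url)
    return fresh


# Hypothetical page content and controller state.
page = ('<a href="/menu">Menu</a> <a href="http://example.com/">Home</a> '
        '<a href="/menu">Menu again</a>')
seen = {"http://example.com/"}
new_urls = unprocessed_links(page, "http://example.com/", seen)
```

In this sketch the crawler would hand `new_urls` back to the controller, which decides whether and when each URL is processed.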

Each web crawler (112) processes allocated web pages in a non-intrusive manner. Figure 2 shows a method performed by a web crawler (112) to process a web page. A document or web page is classified to determine relevance of each document. Sentence extraction is performed by identifying a targeted sentence from the document or web page and extracting the sentence. The extracted sentence is then tokenized in order to apply a set of predetermined data transformation rules to the tokenized extracted sentence. The tokenized sentence is then converted to metadata such as RDF data. The RDF data is then stored in a web crawler's local semantic web crawler database (201).
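The per-page steps above (classification, sentence extraction, tokenization, rule application and conversion to triples) might be sketched as follows. Everything here is a simplified assumption: relevance is judged by keyword overlap, RDF data is represented as plain (subject, predicate, object) tuples rather than through an RDF library, and the `servesCuisine` rule, the concept set and the sample text are invented for illustration.

```python
import re

CONCEPTS = {"halal", "thai", "restaurant"}  # hypothetical relevant concepts


def is_relevant(text, concepts=CONCEPTS):
    """Classify text as relevant if it mentions at least one concept."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return bool(words & concepts)


def extract_sentences(text, concepts=CONCEPTS):
    """Split text into sentences and keep only those mentioning a concept."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if is_relevant(s, concepts)]


def tokenize(sentence):
    """Split a sentence into alphabetic tokens."""
    return re.findall(r"[A-Za-z]+", sentence)


def rule_serves_cuisine(tokens):
    """Hypothetical rule: '<name> serves <cuisine> ...' yields one triple."""
    lowered = [t.lower() for t in tokens]
    if "serves" in lowered:
        i = lowered.index("serves")
        if i > 0 and i + 1 < len(tokens):
            return [(" ".join(tokens[:i]), "servesCuisine", tokens[i + 1])]
    return []


def process_page(text, rules):
    """Run the page through every step and return the crawler's local triples."""
    local_db = set()
    if not is_relevant(text):
        return local_db
    for sentence in extract_sentences(text):
        tokens = tokenize(sentence)
        for rule in rules:
            local_db.update(rule(tokens))
    return local_db


# Invented sample text, not taken from any real website.
page_text = "Welcome to our site. Bali Thai serves Thai food. Parking is free."
triples = process_page(page_text, [rule_serves_cuisine])
```

Only the middle sentence mentions a concept, so it is the only one tokenized and passed to the rule.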

In an event where there are no more URLs to be assigned to any web crawlers (112), an intermediary database (114) such as a merger database retrieves RDF data collected by each local semantic web crawler database (201) and merges the RDF data with the semantic web database (116).
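The merge step might be sketched as a set union over triples, assuming for illustration that each local database and the semantic web database are sets of (subject, predicate, object) tuples. The function name and the sample triples are hypothetical.

```python
def merge_local_databases(global_db, local_dbs):
    """Merge every crawler's local triples into the global set, in place.

    Returns the number of triples that were not already present.
    """
    before = len(global_db)
    for local in local_dbs:
        global_db.update(local)
    return len(global_db) - before


# Hypothetical contents of the global database and two local crawler databases.
global_db = {("siteA", "type", "Restaurant")}
local_1 = {("siteA", "type", "Restaurant"), ("siteA", "servesCuisine", "Thai")}
local_2 = {("siteB", "type", "Restaurant")}
added = merge_local_databases(global_db, [local_1, local_2])
```

Because the databases are modelled as sets, a triple collected by more than one crawler is stored only once after merging.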

Upon merging all RDF data into the semantic web database (116), the system (100) is ready to process natural language or structured natural language queries.

A method of querying a semantic web supported by a plurality of web crawlers (112) in a World Wide Web upon declaring a plurality of trustworthy websites is now described as seen in Figure 1. The method includes the steps of receiving a query from a user and parsing the query into an internal representation format. A semantic browser (120) is used to receive a query from the user and a query parser is used to parse queries entered by users. The internal representation format is then passed to a semantic search engine, which searches the semantic web database (116) using the internal representation format. The semantic web database (116) responds with a set of answers to the query by the user as well as references to a plurality of websites where the set of answers may be found. The plurality of websites is ranked based on trustworthiness by a page-ranker. The ranked plurality of websites is then returned to the user who issued the query.
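The querying steps (parse, search, rank, return) might be sketched as follows. The specification does not describe the internal representation format or the parser, so this sketch assumes a list of (predicate, value) constraints produced by a simple keyword lookup; the predicate names, website names and trust values are all hypothetical.

```python
# Hypothetical keyword-to-constraint table standing in for a real query parser.
KEYWORD_CONSTRAINTS = {
    "thai": ("servesCuisine", "Thai"),
    "halal": ("certification", "Halal"),
}


def parse_query(query):
    """Turn a query into an assumed internal form: a list of constraints."""
    words = query.lower().replace(",", " ").replace(".", " ").split()
    return [KEYWORD_CONSTRAINTS[w] for w in words if w in KEYWORD_CONSTRAINTS]


def search(db, constraints):
    """Return subjects for which every constraint is present as a triple."""
    subjects = {s for s, _, _ in db}
    return {s for s in subjects if all((s, p, o) in db for p, o in constraints)}


def rank_by_trust(sites, trust):
    """Order websites from most to least trustworthy."""
    return sorted(sites, key=lambda s: trust.get(s, 0), reverse=True)


# Hypothetical triples and trust values.
db = {
    ("site-a.example", "servesCuisine", "Thai"),
    ("site-a.example", "certification", "Halal"),
    ("site-b.example", "servesCuisine", "Thai"),
    ("site-c.example", "servesCuisine", "Thai"),
    ("site-c.example", "certification", "Halal"),
}
trust = {"site-a.example": 70, "site-b.example": 95, "site-c.example": 85}

constraints = parse_query("Find me a Thai restaurant that is Halal.")
results = rank_by_trust(search(db, constraints), trust)
```

The most trusted website in this data, `site-b.example`, is excluded because it fails the Halal constraint; trust only orders the websites that satisfy the query.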

An example of the described embodiment of the system (100) and method is seen in Figure 3. A typical website called "Bali Thai" is identified. The plurality of web crawlers (112) then process the website and construct RDF data as seen in Figure 3. A user may issue a query as follows:

QUERY: Find me a Thai restaurant that is Halal, not too expensive, no alcohol served, near Jurong Point Shopping Centre in Singapore.

The system (100) then responds with a reference to the website entitled "Bali Thai". It is to be appreciated that it is not possible to receive meaningful results such as those seen in this embodiment from a conventional search engine by issuing a natural language query of this nature. The meaningful results in this embodiment are specific websites that contain the information searched for by the user.

Each individual web crawler (112) processes data extraction and data transformation of websites locally by using concepts and rules. Further, local semantic web databases are created and then merged with global semantic web databases. This architecture allows the system (100) to be inherently scalable, which is a critical requirement for creating a semantic web database for the World Wide Web. Ontology-based concepts and rules are used to automatically extract data of interest from websites. It is to be understood that the usage of websites in this description includes web pages, documents, messages and other information in a text format. Formats of all concepts and rules used are compatible with World Wide Web Consortium (W3C) ontology standards so that they can be created or edited using commercially available ontology editing tools or a specialized editor.

The described method and system can be used to transform data of interest into RDF knowledge representation to enable downstream semantic search, knowledge discovery and process automation. Therefore, the described system and method are able to transform natural language queries into an appropriate internal semantic query. This produces results that directly answer the query rather than relying on users to check each search result for relevancy.

The described invention can be applied to, but is not restricted to, creating a knowledge database for any domain from any data source consisting of unstructured, semantically unintelligible documents, transforming said documents into a computer-understandable semantic database so that knowledge discovery and semantic information search can be performed efficiently and accurately.

Claims

A semantic web constructor system (100) supported by a plurality of web crawlers (112) in a World Wide Web upon declaring a plurality of trustworthy websites, characterized in that, the system (100) includes:

i. at least one web crawler controller (110) engageable to manage the plurality of web crawlers (112);

ii. a semantic web database (116) connectable to the plurality of web crawlers (112); and

iii. a plurality of data building editors (122, 124, 126) connectable to the at least one web crawler controller (110)

wherein a semantic browser (120) is further connectable to the semantic web database (116) to receive at least one natural language query from at least one user.

The system (100) as claimed in claim 1, wherein a trust engine (118) is connectable to the at least one web crawler controller (110).

The system (100) as claimed in claim 1, wherein the plurality of data building editors (122, 124, 126) further include a website list editor (122), a rule editor (124) and a concept editor (126).

The system (100) as claimed in claim 1, wherein the plurality of web crawlers (112) is a plurality of spiders.

The system (100) as claimed in claim 1, wherein the at least one web crawler controller (110) is at least one spider controller.

The system as claimed in claim 1, wherein an intermediary database (114) is connectable to the plurality of web crawlers (112) and the semantic web database (116).

A method of constructing a semantic web supported by a plurality of web crawlers (112) in a World Wide Web upon declaring a plurality of trustworthy websites, characterized in that, the method includes the steps of:

i. crawling the web to select unprocessed websites;

ii. selecting websites based on trustworthiness calculated by comparing a trustworthiness numeric value to a predetermined threshold value;

iii. extracting a plurality of text from the selected websites;

iv. tokenizing the extracted plurality of text;

v. applying a predetermined set of data transformation rules to the tokenized extracted plurality of text;

vi. converting the tokenized extracted plurality of text to metadata; and

vii. storing the metadata in semantic web database (116).

The method as claimed in claim 8, wherein the metadata used is Resource Description Framework (RDF) data.

A method of querying a semantic web supported by a plurality of web crawlers (112) in a World Wide Web upon declaring a plurality of trustworthy websites, characterized in that, the method includes the steps of:

i. receiving a query from a user;

ii. parsing the query into an internal representation format;

iii. searching the semantic web database (116) using the internal representation format;

iv. ranking a plurality of websites based on trustworthiness; and

v. returning the ranked plurality of websites to the user.