WO2012091541A1 - A semantic web constructor system and a method thereof - Google Patents

A semantic web constructor system and a method thereof

Info

Publication number
WO2012091541A1
Authority
WO
WIPO (PCT)
Prior art keywords
web
plurality
websites
semantic
system
Prior art date
Application number
PCT/MY2011/000153
Other languages
French (fr)
Inventor
John Foderaro
Weng Onn Kow
Dickson Lukose
Nur Hana SAMSUDIN
Chuan Woo SHENG
Original Assignee
Mimos Berhad
Priority date
Filing date
Publication date
Priority to MYPI2010006268
Application filed by Mimos Berhad
Publication of WO2012091541A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques

Abstract

A semantic web constructor system (100) supported by a plurality of web crawlers (112) in a World Wide Web upon declaring a plurality of trustworthy websites is provided, characterized in that, the system (100) includes at least one web crawler controller (110) engagable to manage the plurality of web crawlers (112), a semantic web database (116) connectable to the plurality of web crawlers (112) and a plurality of data building editors (122, 124, 126) connectable to the at least one web crawler controller (110) wherein a semantic browser (120) is further connectable to the semantic web database (116) to receive at least one natural language query from at least one user.

Description

A SEMANTIC WEB CONSTRUCTOR SYSTEM AND A METHOD THEREOF

FIELD OF INVENTION

The present invention relates to a semantic web constructor system supported by a plurality of web crawlers in a World Wide Web, upon declaring a plurality of trustworthy websites in order to query a semantic browser for a plurality of websites that are page-ranked.

BACKGROUND OF INVENTION

Most contemporary web search engines function based on keyword searches. Keyword-based search engines usually return websites in which one or more of the keywords specified in the search occur. However, searching based on keywords tends to produce a large number of hits, which burdens the user with ensuring the relevance of the results. Users end up spending a large amount of time going through the resulting websites to identify a relevant document. This becomes a huge drain on the user.

In addition to the problem of having a large volume of hits, the user also needs to internalize the relevant material in order to make it applicable and utilize it for the problem that is being solved.

With current web browser technology, a user is not able to issue a natural language query, because a keyword-based search engine cannot understand the context of such a query and therefore cannot search the web for relevant information effectively. Furthermore, the web is made up of a large amount of highly distributed information that may come from questionable sources. Therefore, the user must be able to discern a reliable website from an unreliable one.

Most documents available on the World Wide Web are not semantically enabled for computers to interpret and to process beyond simple keyword search. Human input is required to read the document and to interpret the information. This is a very slow and inefficient process, made painfully obvious by the current information explosion online.

U.S. 6,044,375 describes a method of automatically extracting metadata from a document through a neural network to generate metadata guesses including word guesses, compound guesses and document guesses, along with confidence factors associated with the guesses indicating the likelihood that each of the guesses is correct. However, the described invention does not look at the information on the website itself and does not extract data from the website itself.

Therefore, there is a need for a solution in order to search for relevant data, for users who are looking for specific and relevant data without having to spend too long manually reading through long lists of results.

SUMMARY OF INVENTION

Accordingly there is provided a semantic web constructor system supported by a plurality of web crawlers in a World Wide Web upon declaring a plurality of trustworthy websites, characterized in that, the system includes at least one web crawler controller engageable to manage the plurality of web crawlers, a semantic web database connectable to the plurality of web crawlers and a plurality of data building editors connectable to the at least one web crawler controller wherein a semantic browser is further connectable to the semantic web database to receive at least one natural language query from at least one user.

There is also provided a method of constructing a semantic web supported by a plurality of web crawlers in a World Wide Web upon declaring a plurality of trustworthy websites, characterized in that, the method includes the steps of crawling the web to select unprocessed websites, selecting websites based on trustworthiness calculated by comparing a trustworthiness numeric value to a predetermined threshold value, extracting a plurality of text from the selected websites, tokenizing the extracted plurality of text, applying a predetermined set of data transformation rules to the tokenized extracted plurality of text, converting the tokenized extracted plurality of text to metadata and storing the metadata in a semantic web database.

There is further provided a method of querying a semantic web supported by a plurality of web crawlers in a World Wide Web upon declaring a plurality of trustworthy websites, characterized in that, the method includes the steps of receiving a query from a user, parsing the query into an internal representation format, searching the semantic web database using the internal representation format, ranking a plurality of websites based on trustworthiness and returning the ranked plurality of websites to the user.

The present invention consists of several novel features and a combination of parts hereinafter fully described and illustrated in the accompanying description and drawings, it being understood that various changes in the details may be made without departing from the scope of the invention or sacrificing any of the advantages of the present invention.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention will be fully understood from the detailed description given herein below and the accompanying drawings which are given by way of illustration only, and thus are not limitative of the present invention, wherein:

Figure 1 is a block diagram illustrating architecture of a preferred embodiment of a semantic web constructor system;

Figure 2 is a flowchart illustrating a preferred embodiment of the steps of constructing a semantic web supported by a plurality of web crawlers; and

Figure 3 is a diagram showing an example of a natural language query using the preferred embodiment of a semantic web constructor method and system.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

The present invention relates to a semantic web constructor system supported by a plurality of web crawlers in a World Wide Web, upon declaring a plurality of trustworthy websites in order to query a semantic browser for a plurality of websites that are page-ranked.

Hereinafter, this specification will describe the present invention according to the preferred embodiment of the present invention. However, it is to be understood that limiting the description to the preferred embodiment of the invention is merely to facilitate discussion of the present invention and it is envisioned that those skilled in the art may devise various modifications and equivalents without departing from the scope of the appended claims.

The following detailed description of the preferred embodiment will now be described in accordance with the attached drawings, either individually or in combination.

The present invention provides a semantic web constructor system (100) supported by a plurality of web crawlers (112) in a World Wide Web upon declaring a plurality of trustworthy websites as seen in Figure 1. The system (100) includes at least one web crawler controller (110) engageable to manage the plurality of web crawlers (112). A semantic web database (116) is further connectable to the plurality of web crawlers (112). A plurality of data building editors (122, 124, 126) is connectable to the at least one web crawler controller (110) wherein a semantic browser (120) is further connectable to receive at least one natural language query from at least one user. A trust engine (118) is connectable to the at least one web crawler controller (110).

The plurality of data building editors (122, 124, 126) further includes a website list editor (122), a rule editor (124) and a concept editor (126). However, it is to be appreciated by one skilled in the art that the plurality of data building editors may also include other editors of a similar nature in other embodiments of the system (100).

In order for the system (100) to be functional, the system (100) must first be provided with a plurality of relevant concepts, a predetermined set of transformation rules and an initial set of websites where relevant information associated with the relevant concepts is found. The concept editor (126) is used to specify a plurality of relevant concepts, properties and relationships in a subject area of interest. The concept editor (126) uses external resources such as WordNet to expand on a specified concept. The rule editor (124) is used for defining data transformation rules. The website list editor (122) is used to specify an initial set of websites where data related to the subject area of interest may be found.
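The concept expansion described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the concept editor (126): the `SYNONYMS` table is a hypothetical in-memory stand-in for an external lexical resource such as WordNet, and the name `expand_concepts` is chosen for illustration only.

```python
# Hypothetical stand-in for an external lexical database such as WordNet.
SYNONYMS = {
    "restaurant": {"eatery", "diner"},
    "cheap": {"inexpensive", "affordable"},
}


def expand_concepts(concepts, lexicon=SYNONYMS):
    """Map each seed concept to itself plus any similar concepts in the lexicon."""
    expanded = {}
    for concept in concepts:
        key = concept.lower()
        # A concept with no lexicon entry expands only to itself.
        expanded[key] = {key} | set(lexicon.get(key, ()))
    return expanded


expanded = expand_concepts(["Restaurant", "halal"])
```

Keeping the seed concept in its own expansion set means later matching steps can treat seed and similar concepts uniformly.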

An example of a web crawler (112) is a spider as used in a search engine to retrieve information from the World Wide Web. Accordingly, an example of a web crawler controller (110) is a spider controller. The web crawler controller (110) as seen in Figure 1 filters websites to ensure that trustworthy websites are processed first by the plurality of web crawlers (112). Each web crawler (112) is assigned to a different website. Websites that are to be processed by the plurality of web crawlers (112) are delegated in a balanced manner by the web crawler controller (110). The web crawler controller (110) maintains a plurality of websites that are related to the subject area of interest. Initially, the websites are provided by the user. However, the plurality of websites increases as the web crawlers (112) encounter newly linked Uniform Resource Locators (URLs) on websites that are processed. An intermediary database (114) is connectable to the plurality of web crawlers (112) and the semantic web database (116). An example of the intermediary database (114) is a merger database.

A method of constructing a semantic web supported by a plurality of web crawlers (112) in a World Wide Web upon declaring a plurality of trustworthy websites is described herein as seen in Figure 1. The method includes crawling the World Wide Web to select unprocessed websites using a plurality of web crawlers (112) such as spiders. An expansion of the initial plurality of relevant concepts is first performed, wherein each concept provided to the concept editor (126) is expanded to include a plurality of similar concepts. This step is carried out by means of an external lexical database such as WordNet.

Upon declaring the initial set of websites, the trust engine (118) determines the trustworthiness of each declared website. Websites are selected based on trustworthiness calculated by comparing a trustworthiness numeric value to a predetermined threshold value. For example, the trustworthiness value is a numeric value from 0 to 100. The trust engine (118) accepts websites with a trustworthiness value that is higher than the predetermined threshold value. A trusted list of websites is created as shown in Figure 1.
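The trust filtering and balanced delegation described above can be sketched as follows. This is an illustration under stated assumptions: the description does not say how trustworthiness values are computed or how the controller balances work, so the scores below are invented sample data and round-robin assignment is one possible balancing strategy, not necessarily the one used by the web crawler controller (110).

```python
from itertools import cycle


def select_trusted(scores, threshold):
    """Keep websites whose score (0 to 100) is higher than the threshold,
    ordered so that the most trustworthy websites are processed first."""
    accepted = [url for url, score in scores.items() if score > threshold]
    return sorted(accepted, key=scores.get, reverse=True)


def assign_round_robin(urls, crawler_ids):
    """Spread URLs across crawlers so that each receives a near-equal share."""
    assignment = {cid: [] for cid in crawler_ids}
    for url, cid in zip(urls, cycle(crawler_ids)):
        assignment[cid].append(url)
    return assignment


# Hypothetical trust scores for an initial website list.
scores = {
    "http://c.example": 75,
    "http://b.example": 40,
    "http://a.example": 90,
    "http://d.example": 60,
}
trusted = select_trusted(scores, threshold=60)
plan = assign_round_robin(trusted, ["crawler-1", "crawler-2"])
```

Note that a website scoring exactly the threshold is rejected, matching the "higher than" wording of the description.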

A plurality of text is extracted from the selected trusted list of websites. It is to be understood that the plurality of text may also be extracted from documents found on the World Wide Web. Further, the extracted plurality of text is tokenized and a predetermined set of data transformation rules is applied to the tokenized extracted plurality of text. The tokenized extracted plurality of text is converted to metadata, such as Resource Description Framework (RDF) data, and the RDF data is stored in a semantic web database (116) as seen in Figure 2. Identification of new Uniform Resource Locators (URLs) on documents found on the World Wide Web is carried out continuously. Upon encountering unprocessed URLs, the plurality of web crawlers (112) will pass the unprocessed URLs to the web crawler controller (110) for processing.

Each web crawler (112) processes allocated web pages in a non-intrusive manner. Figure 2 shows a method performed by a web crawler (112) to process a web page. A document or web page is classified to determine the relevance of each document. Sentence extraction is performed by identifying a targeted sentence from the document or web page and extracting the sentence. The extracted sentence is then tokenized in order to apply a set of predetermined data transformation rules to the tokenized extracted sentence. The tokenized sentence is then converted to metadata such as RDF data. The RDF data is then stored in a web crawler's local semantic web crawler database (201).
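The per-page steps above (relevance check, sentence extraction, tokenization, rule application and conversion to metadata) can be sketched as follows. This is a simplified illustration, not the method of Figure 2 itself: the single regular-expression rule, the predicate name `servesCuisine`, the sample page text and the use of plain tuples in place of serialized RDF are all assumptions made for the example.

```python
import re

# Hypothetical transformation rule: "<subject> serves <object> food".
RULES = [
    (re.compile(r"^(?P<s>.+?) serves (?P<o>\w+) food$", re.IGNORECASE),
     "servesCuisine"),
]


def split_sentences(text):
    """Naive sentence extraction: split on terminal punctuation."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def tokenize(sentence):
    """Split a sentence into word tokens."""
    return re.findall(r"\w+", sentence)


def page_to_triples(text, concepts):
    """Turn relevant sentences into (subject, predicate, object) triples."""
    triples = []
    for sentence in split_sentences(text):
        tokens = tokenize(sentence)
        # Keep only sentences that mention a concept of interest.
        if not concepts & {t.lower() for t in tokens}:
            continue
        normalized = " ".join(tokens)
        for pattern, predicate in RULES:
            match = pattern.match(normalized)
            if match:
                triples.append((match["s"], predicate, match["o"]))
    return triples


# Invented sample page text.
page = "Bali Thai serves Thai food. Opening hours vary."
triples = page_to_triples(page, concepts={"thai", "halal"})
```

A production system would emit RDF through a dedicated library and use ontology-based rules; tuples are used here only to keep the sketch self-contained.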

In an event where there are no more URLs to be assigned to any web crawlers (112), an intermediary database (114) such as a merger database retrieves RDF data collected by each local semantic web crawler database (201) and merges the RDF data with the semantic web database (116).
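The merge step above can be sketched as a set union over crawler-local stores. This is a minimal illustration: representing each local semantic web crawler database (201) as a Python set of triples, and resolving duplicates by set membership, are assumptions made for the example; the sample triples are invented.

```python
def merge_local_stores(local_stores):
    """Union the triples of every crawler-local store into one global set."""
    merged = set()
    for store in local_stores:
        # Triples found by more than one crawler are stored only once.
        merged.update(store)
    return merged


local_a = {("Bali Thai", "servesCuisine", "Thai")}
local_b = {
    ("Bali Thai", "servesCuisine", "Thai"),
    ("Bali Thai", "locatedNear", "Jurong Point"),
}
global_store = merge_local_stores([local_a, local_b])
```

Because each crawler writes only to its own local store, no coordination is needed until this final merge.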

Upon merging all RDF data into the semantic web database (116), the system (100) is ready to process natural language or structured natural language queries.

A method of querying a semantic web supported by a plurality of web crawlers (112) in a World Wide Web upon declaring a plurality of trustworthy websites is now described as seen in Figure 1. The method includes the steps of receiving a query from a user and parsing the query into an internal representation format. A semantic browser (120) is used to receive a query from the user and a query parser is used to parse queries entered by users. The internal representation format is then passed to a semantic search engine, which uses it to search the semantic web database (116). The semantic web database (116) responds with a set of answers to the query by the user as well as references to a plurality of websites where the set of answers may be found. The plurality of websites is ranked based on trustworthiness by a page-ranker. The ranked plurality of websites is then returned to the user who issued the query.
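The querying steps above (parse, search, rank by trustworthiness) can be sketched as follows. This is an illustration under stated assumptions: the description does not define the internal representation format, so a list of (predicate, value) constraints is used here, and the phrase-matching parser, the `VOCABULARY` table, the sample triples and the trust values are all hypothetical.

```python
# Hypothetical mapping from query phrases to (predicate, value) constraints.
VOCABULARY = {
    "thai": ("servesCuisine", "Thai"),
    "halal": ("certified", "Halal"),
}


def parse_query(query, vocabulary=VOCABULARY):
    """Parse a query into an internal representation: a list of constraints."""
    text = query.lower()
    return [pair for phrase, pair in vocabulary.items() if phrase in text]


def search(store, constraints):
    """Return the subjects that satisfy every constraint in the triple store."""
    subjects = {s for s, _, _ in store}
    return {s for s in subjects
            if all((s, p, o) in store for p, o in constraints)}


def rank_by_trust(sites, trust):
    """Order matching websites from most to least trustworthy."""
    return sorted(sites, key=lambda s: trust.get(s, 0), reverse=True)


# Invented sample data; only "Bali Thai" appears in the description.
store = {
    ("Bali Thai", "servesCuisine", "Thai"),
    ("Bali Thai", "certified", "Halal"),
    ("Siam Corner", "servesCuisine", "Thai"),
    ("Siam Corner", "certified", "Halal"),
    ("Pad House", "servesCuisine", "Thai"),
}
trust = {"Bali Thai": 85, "Siam Corner": 70}

constraints = parse_query("Find me a Thai restaurant that is Halal")
results = rank_by_trust(search(store, constraints), trust)
```

Requiring every constraint to hold is what lets a query return direct answers rather than any page that merely mentions one of the keywords.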

An example of the described embodiment of the system (100) and method is seen in Figure 3. A typical website called "Bali Thai" is identified. The plurality of web crawlers (112) then process the website and construct RDF data as seen in Figure 3. A user may issue a query as follows:

QUERY: Find me a Thai restaurant that is Halal, not too expensive, no alcohol served, near Jurong Point Shopping Centre in Singapore.

The system (100) then responds with a reference to the website entitled "Bali Thai". It is to be appreciated that a conventional search engine cannot return meaningful results such as those seen in this embodiment in response to a natural language query of this nature. The meaningful results in this embodiment are specific websites that contain the information sought by the user.

Each individual web crawler (112) performs data extraction and data transformation of websites locally by using concepts and rules. Further, local semantic web databases are created and then merged with global semantic web databases. This architecture allows the system (100) to be inherently scalable, which is a critical requirement for creating a semantic web database for the World Wide Web.

Ontology-based concepts and rules are used to automatically extract data of interest from websites. It is to be understood that the usage of websites in this description includes web pages, documents, messages and other information in a text format. Formats of all concepts and rules used are compatible with World Wide Web Consortium (W3C) ontology standards so that they can be created or edited using commercially available ontology editing tools or a specialized editor.

The described method and system can be used to transform data of interest into an RDF knowledge representation to enable downstream semantic search, knowledge discovery and process automation. Therefore, the described system and method are able to transform natural language queries into an appropriate internal semantic query. This produces results that directly answer the query rather than relying on users to check each search result for relevancy.

The described invention can be applied to, but is not restricted to, creating a knowledge database for any domain from any data source consisting of unstructured, semantically unintelligible documents, transforming said documents into a computer-understandable semantic database so that knowledge discovery and semantic information search can be performed efficiently and accurately.

Claims

A semantic web constructor system (100) supported by a plurality of web crawlers (112) in a World Wide Web upon declaring a plurality of trustworthy websites, characterized in that, the system (100) includes:
i. at least one web crawler controller (110) engageable to manage the plurality of web crawlers (112);
ii. a semantic web database (116) connectable to the plurality of web crawlers (112); and
iii. a plurality of data building editors (122, 124, 126) connectable to the at least one web crawler controller (110)
wherein a semantic browser (120) is further connectable to the semantic web database (116) to receive at least one natural language query from at least one user.
The system (100) as claimed in claim 1, wherein a trust engine (118) is connectable to the at least one web crawler controller (110).
The system (100) as claimed in claim 1, wherein the plurality of data building editors (122, 124, 126) further include a website list editor (122), a rule editor (124) and a concept editor (126).
The system (100) as claimed in claim 1, wherein the plurality of web crawlers (112) is a plurality of spiders.
The system (100) as claimed in claim 1, wherein the at least one web crawler controller (110) is at least one spider controller.
The system as claimed in claim 1, wherein an intermediary database (114) is connectable to the plurality of web crawlers (112) and the semantic web database (116).
A method of constructing a semantic web supported by a plurality of web crawlers (112) in a World Wide Web upon declaring a plurality of trustworthy websites, characterized in that, the method includes the steps of:
i. crawling the web to select unprocessed websites;
ii. selecting websites based on trustworthiness calculated by comparing a trustworthiness numeric value to a predetermined threshold value;
iii. extracting a plurality of text from the selected websites;
iv. tokenizing the extracted plurality of text;
v. applying a predetermined set of data transformation rules to the tokenized extracted plurality of text;
vi. converting the tokenized extracted plurality of text to metadata; and
vii. storing the metadata in semantic web database (116).
The method as claimed in claim 8, wherein the metadata used is Resource Description Framework (RDF) data.
A method of querying a semantic web supported by a plurality of web crawlers (112) in a World Wide Web upon declaring a plurality of trustworthy websites, characterized in that, the method includes the steps of:
i. receiving a query from a user;
ii. parsing the query into an internal representation format;
iii. searching the semantic web database (116) using the internal representation format;
iv. ranking a plurality of websites based on trustworthiness; and
v. returning the ranked plurality of websites to the user.
PCT/MY2011/000153 2010-12-28 2011-06-24 A semantic web constructor system and a method thereof WO2012091541A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
MYPI2010006268 2010-12-28
MYPI2010006268 2010-12-28

Publications (1)

Publication Number Publication Date
WO2012091541A1 (en) 2012-07-05

Family

ID=46383351

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/MY2011/000153 WO2012091541A1 (en) 2010-12-28 2011-06-24 A semantic web constructor system and a method thereof

Country Status (1)

Country Link
WO (1) WO2012091541A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106156305A (en) * 2016-06-30 2016-11-23 北京奇虎科技有限公司 Method and device for displaying and pushing data

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5642502A (en) * 1994-12-06 1997-06-24 University Of Central Florida Method and system for searching for relevant documents from a text database collection, using statistical ranking, relevancy feedback and small pieces of text
US5694592A (en) * 1993-11-05 1997-12-02 University Of Central Florida Process for determination of text relevancy
WO2002031705A1 (en) * 2000-10-10 2002-04-18 Science Applications International Corporation Method and system for facilitating the refinement of data queries
US20050010553A1 (en) * 2000-10-30 2005-01-13 Microsoft Corporation Semi-automatic annotation of multimedia objects
US7117207B1 (en) * 2002-09-11 2006-10-03 George Mason Intellectual Properties, Inc. Personalizable semantic taxonomy-based search agent
US20070050338A1 (en) * 2005-08-29 2007-03-01 Strohm Alan C Mobile sitemaps
US7603350B1 (en) * 2006-05-09 2009-10-13 Google Inc. Search result ranking based on trust
US20110022598A1 (en) * 2009-07-24 2011-01-27 Yahoo! Inc. Mixing knowledge sources for improved entity extraction

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 11852653

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase in:

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 11852653

Country of ref document: EP

Kind code of ref document: A1