AU2007100279A4

AU2007100279A4 - Systems and methods of directionally guided, discriminate crawling of internet real estate listings

Info

Publication number: AU2007100279A4
Application number: AU2007100279A
Authority: AU
Inventors: Breez Brander
Original assignee: Individual
Current assignee: Individual
Priority date: 2007-04-08
Filing date: 2007-04-08
Publication date: 2007-05-10
Anticipated expiration: 2015-04-08

Description

SYSTEMS AND METHODS OF DIRECTIONALLY GUIDED, DISCRIMINATE CRAWLING OF INTERNET REAL ESTATE LISTINGS

This invention relates to systems and methods for Internet (WWW or web) based real estate listings searching, crawling, filtering, organizing and displaying. More particularly, the present invention relates to systems and methods for retrieving relevant real estate listings data from documents available publicly on the web, storing the retrieved data and displaying the data in a humanly readable form.

Search engines are used to explore the World Wide Web and build indices of available web pages. Search engines typically have three major elements: the spider (crawler), the index or database, and the front-end search engine software. The spider visits web pages to extract information from them to build an indexed database for the search engine. The spider searches for new web pages, as well as changes in web pages that have already been indexed by the search engine. The index or database serves as a storage space for the information found by the spider. The front-end search software component of the search engine allows users to look for web pages in the index or database containing information related to one or more search terms entered in a search query. The results of the search are displayed to the user.
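The index and front-end elements described above can be illustrated with a minimal sketch of an inverted index. The function names `build_index` and `search`, the whitespace tokenisation and the sample pages are assumptions made for illustration only; the application does not specify how the index is structured.

```python
from collections import defaultdict

def build_index(pages):
    """Map each lower-cased word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Return the URLs that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Copy the first posting set so the index itself is never mutated.
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results

# Hypothetical pages standing in for documents gathered by the spider.
PAGES = {
    "http://a.example/1": "house for sale",
    "http://a.example/2": "flat for rent",
}
INDEX = build_index(PAGES)
```

Here `search(INDEX, "for sale")` returns only the first URL, because both terms must be present on a page.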

Search engines are distinct from search directories/portals, and the real estate search engine (web crawler) described in this application is likewise different from real estate search portals/directories. Search directories/portals require individuals to manually submit information about a web site to the search directory. They are essentially just a large database. Human staff maintain the search directory/portal and classify the submitted web page information. A search directory/portal user is able to select only from the sites listed in the directory. While this approach typically produces high quality indices and allows for classification of web sites in a directory structure, the growth of the web makes the task of covering a large percentage of the web increasingly difficult for the editorial staff. As a result, searches performed using search directories often return too little useful information.

Each web page has a distinct address called its uniform resource locator (URL), which at least in part identifies the location or host computer of the web page. Many of the documents on the world wide web are written in standard document description languages (HTML, XML, XHTML, PHP, ASP, etc.). These languages allow an author of a document to create hypertext links to other documents. Hypertext links allow a reader of a web page to access other web pages by clicking on links to the other pages.

These links are typically highlighted in the original web page. A web page containing hypertext links to other web pages generally refers to those pages by their URL's. A URL may be referred to more generally as a data set address, which corresponds to a web page, or data set. Links in a web page may refer to web pages that are stored in the same or different host computers.
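As an illustration of how the URLs carried by hypertext links can be read out of an HTML document, the following minimal sketch uses the `html.parser` module of the Python standard library. The class name `LinkExtractor` and the helper `extract_links` are hypothetical and are not taken from the application.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Anchors without an href (e.g. named anchors) are ignored.
            if name == "href" and value:
                self.links.append(value)

def extract_links(html):
    """Return the link targets found in an HTML string, in document order."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links
```

The returned values may be relative (for example `/b`); resolving them against the address of the page they came from is a separate step.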

A web crawler is a program that automatically finds and downloads documents from host computers in the intranet or world wide web. A computer with a web crawler installed on it may also be referred to as a web crawler. When a web crawler is given a set of starting URL's, the web crawler downloads the corresponding documents and the documents linked from the initial target documents. The web crawler then extracts any URL's contained in all those downloaded documents. Before the web crawler downloads the documents associated with the newly discovered URL's, the web crawler needs to find out whether these documents have already been downloaded. If the documents associated with the newly discovered URL's have not been downloaded, the web crawler downloads the documents and extracts any URL's contained in them. This process repeats indefinitely or until a predetermined stop condition occurs.
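The general crawl loop described above can be sketched as follows. To keep the example self-contained, a small in-memory link graph stands in for the network; `LINK_GRAPH`, `fetch_links`, `crawl` and the `max_pages` stop condition are all assumptions made for illustration.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the URLs its page links to.
LINK_GRAPH = {
    "http://a.example/": ["http://a.example/1", "http://b.example/"],
    "http://a.example/1": ["http://a.example/", "http://a.example/2"],
    "http://a.example/2": [],
    "http://b.example/": ["http://a.example/1"],
}

def fetch_links(url):
    """Stand-in for downloading a document and extracting its URLs."""
    return LINK_GRAPH.get(url, [])

def crawl(start_urls, max_pages=100):
    """Breadth-first crawl that downloads each URL at most once."""
    seen = set(start_urls)
    queue = deque(start_urls)
    downloaded = []
    while queue and len(downloaded) < max_pages:   # stop condition
        url = queue.popleft()
        downloaded.append(url)                     # "download" the document
        for link in fetch_links(url):
            if link not in seen:                   # skip known documents
                seen.add(link)
                queue.append(link)
    return downloaded
```

The `seen` set is what answers the question of whether a newly discovered URL has already been downloaded or queued.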

Perpetual level hyperlink crawling (hyperlinks contain URLs) is the most common form of searching and crawling used by web crawlers. The nature of a web page's content is identified by scanning and filtering the title of the web page, the web page's text and the information contained in meta tags. Web crawlers generally pull out and index words that are believed to be significant. Words that appear near the top of a document and words that appear frequently throughout a document are more likely to be considered important.
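The weighting of words by position and frequency can be illustrated with a minimal scoring sketch. The function name `significant_words` and the particular weights (a bonus for the first few words, one point for every other occurrence) are assumptions chosen for illustration; real crawlers use their own, typically undisclosed, weighting.

```python
import re
from collections import Counter

def significant_words(text, top_n=3, lead_words=10, lead_bonus=2):
    """Rank words by frequency, counting early occurrences more heavily."""
    words = re.findall(r"[a-z]+", text.lower())
    scores = Counter()
    for position, word in enumerate(words):
        # Words near the top of the document earn the larger weight.
        scores[word] += lead_bonus if position < lead_words else 1
    # Sort by descending score, then alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:top_n]]
```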

There are several problems associated with perpetual level hyperlink crawling. This method cannot zero in on and target only a particular niche type of content on the Internet. It simply follows, indiscriminately, all hyperlinks found on all pages it finds and indexes the content. That is why known search engines and crawlers to date index all general content.

The only way of searching through the index of these search engines is then keyword searching. Keyword searching has a hard time distinguishing between words that are spelled the same way, but have different meanings. Thus, a keyword search can produce results that are irrelevant to the intention of a user's query.

Unlike perpetual level hyperlink crawling, the directional and discriminate real estate listings crawling that is the subject of this invention is concept-based crawling. It attempts to determine the exact nature of the crawled document (web page or other), first by directionally targeting URLs whose text content is already known. Where this method does not produce the anticipated web page text content, dynamic multilingual dictionary and thesaurus parameters follow up to filter the correct elements from the crawled document. Concept-based searching often involves "clustering," where the meanings of words are examined in relation to the words found nearby. The web crawler therefore follows only the hyperlinks that correspond and point to the searched-for subject matter, in this case real estate listings web pages. A concept-based search engine therefore generally returns a list of documents that exactly match the parameters of the user's search.
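A minimal sketch of discriminate link following is given below: a hyperlink is followed only if its URL or its anchor text contains a term from a subject-matter word list. The term list `LISTING_TERMS` and the functions `is_listing_link` and `select_links` are hypothetical; the application does not disclose the contents of its multilingual dictionary and thesaurus, and simple substring matching is used here in their place.

```python
# Hypothetical multilingual term list standing in for the dictionary/thesaurus.
LISTING_TERMS = {"real estate", "property", "for sale", "listing", "immobilien"}

def is_listing_link(url, anchor_text, terms=LISTING_TERMS):
    """Return True if the URL or anchor text mentions any subject term."""
    haystack = f"{url} {anchor_text}".lower()
    return any(term in haystack for term in terms)

def select_links(links):
    """Keep only the (url, anchor_text) pairs worth following."""
    return [url for url, text in links if is_listing_link(url, text)]
```

For example, a link to `http://x.example/about` with the anchor text "About us" is dropped, while a link whose anchor text reads "Houses for sale" is kept.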

Exact relevancy of search results is becoming increasingly critical to users as the volume of information available on the web grows. Users typically do not have time to sift through hundreds of documents or links to determine relevance and find what they are looking for. Some search engines use search term frequency as a method of determining whether a document is relevant. However, if the search term entered by the user in a search query is relatively common, or has multiple meanings, a search engine can produce search results that the user considers irrelevant to the intended search. Directional, discriminate crawling and exact parameter searching solve this long-standing problem.

Systems and methods for interactively crawling, retrieving, categorizing, searching and summarizing web documents are presented. Features of embodiments of the invention are described below accompanied by references to the crawler and search engine procedure figure.

The present invention is a web document retriever, processor, organiser and displayer (later called "search engine") that analyses the content of documents, such as web pages or text documents, downloaded from a computer network, such as the Internet, in response to an automated web crawl sequence to find internet based real estate listings.

The search engine allows a user to find desired information with complete accuracy the first time, with no successive search query refinements or additions.

Upon initialisation of the web crawl sequence, the search engine's web crawler software locates documents related to real estate, parses words and links in the documents into a word set, filters out unnecessary words, groups the documents into categories, provides labels for the categories, constructs summaries of the documents in each category, indexes the sets of data so processed, inserts the data into a database, and presents the categories and summaries to the user based upon the user's search query.
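The word-set, filtering and categorising steps of this sequence can be sketched as follows. The application refers to clustering; the sketch substitutes a much simpler keyword lookup for it, and the names `STOP_WORDS`, `CATEGORY_KEYWORDS`, `word_set` and `categorise`, together with the category labels, are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical list of unnecessary words to filter out.
STOP_WORDS = {"a", "an", "the", "in", "for", "with", "and", "of"}

# Hypothetical category labels and the words that select them.
CATEGORY_KEYWORDS = {
    "rental": {"rent", "lease"},
    "sale": {"sale", "buy"},
}

def word_set(text):
    """Parse a document into a set of lower-cased words, minus stop words."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS}

def categorise(documents):
    """Group document ids under the first category whose keywords they contain."""
    categories = defaultdict(list)
    for doc_id, text in documents.items():
        words = word_set(text)
        label = next(
            (name for name, keys in CATEGORY_KEYWORDS.items() if words & keys),
            "other",  # fallback label when no keyword matches
        )
        categories[label].append(doc_id)
    return dict(categories)
```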

The user may view a document by reading the indexed summary of the document, together with images and by selecting and clicking a link to the original online document.

MAIN WEB CRAWLER PROCEDURE

The web crawler system uses modules of thread crawlers to download and process documents. The web crawler module is given a set of initial internet entry URL's corresponding to real estate and begins crawling and parsing web page text and hyperlinks in the found documents. Various data structures and methodologies are used to keep track of which web page hyperlinks should be followed in the initial and all successive URLs and which corresponding documents (web pages) each web crawler in the web crawler system should download and process.

The web crawler module thread determines the initial URLs, from which information is to be downloaded and the crawl is to be directed further, typically by retrieving the preset entry URLs and their corresponding web pages with real estate content on them. The module thread then downloads the document and web page corresponding to the entry URLs, and processes the document. If the entry URL contains real estate listings, the crawler thread filters the web page by pre-set filters and extracts the relevant words and hyperlinks. It then processes and categorises the filtered data and inputs it into a database.
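A pre-set filter of the kind described above might, for a site whose markup is already known, be a regular expression over the page source. The sketch below assumes a hypothetical markup layout (`div class="listing"` containing `address` and `price` spans); the pattern, the field names and the function `extract_listings` are illustrative and are not taken from the application.

```python
import re

# Hypothetical pre-set filter for one known page layout.
LISTING_PATTERN = re.compile(
    r'<div class="listing">\s*'
    r'<span class="address">(?P<address>[^<]+)</span>\s*'
    r'<span class="price">\$(?P<price>[\d,]+)</span>'
)

def extract_listings(html):
    """Return one record per listing matched by the pre-set filter."""
    records = []
    for match in LISTING_PATTERN.finditer(html):
        records.append({
            "address": match.group("address").strip(),
            # Drop thousands separators before converting to an integer.
            "price": int(match.group("price").replace(",", "")),
        })
    return records
```

The resulting records are in a form that could then be categorised and written to a database; an empty result signals that the pre-set filter found nothing on the page.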

After the data on the initial page has been entered into the database successfully, the crawler thread then filters the same page for hyperlinks to similar real estate listings pages on the same website and processes those web pages in the same way as the initial one.

Hyperlink filters utilise unique dynamic dictionary and thesaurus procedures. If the crawler thread does not find any pre-set typical content on the initial crawled web page, it utilises the same dictionary and thesaurus procedures to filter the web page to determine whether the web page has merely changed in source code or really does not contain any real estate listings data. If this procedure determines that the page does contain real estate listings data and that the web page source code has changed, the crawler thread updates the filters to remember the new code, filters out the real estate listings words and hyperlinks (data) from the web page, processes and categorises the data and inputs it into the database. After the data on the initial page has been entered into the database successfully, the crawler thread then filters the same page for hyperlinks to similar real estate listings pages on the same website and processes those web pages in the same way as the initial one. If the crawler thread does not find any real estate listings data on the initial page after employing both procedures explained above, it simply filters the page for any hyperlinks whose wording indicates that they lead to a page on this website where real estate listings might be found.
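The two-stage decision described above can be sketched as a function with three outcomes: the pre-set filter matches, the page appears to hold listings under changed source code, or the page holds no listings. The marker `PRESET_MARKER`, the small `DICTIONARY` of concepts and synonyms, the `min_concepts` threshold and the outcome labels are all assumptions made for illustration; the updating of the filters is not shown.

```python
import re

# Hypothetical pre-set source-code filter for a known page layout.
PRESET_MARKER = re.compile(r'class="listing"')

# Hypothetical dictionary/thesaurus: concept -> words that express it.
DICTIONARY = {
    "bedroom": {"bedroom", "bedrooms", "bed"},
    "price": {"price", "asking"},
    "sale": {"sale", "sold"},
}

def classify_page(html, min_concepts=2):
    """Classify a page as 'preset_match', 'layout_changed' or 'no_listings'."""
    if PRESET_MARKER.search(html):
        return "preset_match"
    # Strip tags so that only the visible text is compared to the dictionary.
    text = re.sub(r"<[^>]+>", " ", html).lower()
    words = set(re.findall(r"[a-z]+", text))
    concepts = sum(1 for synonyms in DICTIONARY.values() if words & synonyms)
    return "layout_changed" if concepts >= min_concepts else "no_listings"
```

A page classified as `layout_changed` is the case in which the filters would be updated to remember the new source code before the data is extracted.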

Once the crawler thread has processed the entire initial website, it follows all outbound hyperlinks (hyperlinks to other websites on different domain names and/or Internet Protocol addresses) that it found on this website and flagged (remembered) during the crawl procedure explained above. The entire crawl procedure then starts anew on the next website, using the same procedure that was performed on the initial website, and so on until the crawler finds no more matching websites to crawl.
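Separating the hyperlinks that stay on the current website from the outbound hyperlinks to be remembered for later can be sketched with the `urllib.parse` module of the Python standard library. The function name `split_links` is hypothetical, and the comparison is by host name only, which is an assumption; the application also mentions Internet Protocol addresses.

```python
from urllib.parse import urljoin, urlsplit

def split_links(page_url, hrefs):
    """Resolve hrefs against page_url and split them into (internal, outbound)."""
    site = urlsplit(page_url).hostname
    internal, outbound = [], []
    for href in hrefs:
        absolute = urljoin(page_url, href)       # resolve relative links
        if urlsplit(absolute).hostname == site:
            internal.append(absolute)            # crawl now, same website
        else:
            outbound.append(absolute)            # flag for the next website
    return internal, outbound
```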

Claims

2. The method of claim 1, wherein the processing of documents and web pages to determine whether they contain real estate listings information and data includes these steps: determining the initial URLs, from which information is to be downloaded and the crawl is to be directed further, by retrieving the pre-set entry URLs and their corresponding real estate content web pages; downloading the documents and web pages corresponding to the entry URLs; determining whether the documents and web pages contain any real estate listings information and data by utilizing pre-set page source code filters and regular expressions; determining whether any real estate listings information and data was found in the previous step; if real estate listings data and information was found, following up with processing it and storing it in the database; if real estate listings data and information was not found, employing dynamic dictionary and thesaurus filtering procedures to determine whether the web pages and documents have merely changed in source code or whether they really do not contain any real estate listings data; updating the filters to remember the possible new code, processing the web pages' data and information and inputting it into the database; filtering the hyperlinks of the same web pages and documents to determine if they lead to more similar real estate listings web pages on the same website and/or to other websites by utilizing the dictionary and thesaurus filtering procedures; remembering the found matching hyperlinks leading to other websites that might contain real estate listings; following the hyperlinks leading to more potential real estate listings web pages and documents on the same website; processing the newly found web pages and documents on the same website the same way as the initial ones; processing all the web pages and documents on the same website with the above steps; following all found, matched and remembered outbound hyperlinks (hyperlinks to other websites on different domain names and/or Internet Protocol addresses); repeating the entire procedure outlined in all of the above steps from the downloading of the documents step onwards for the new found websites and repeating it for all possible websites until no more possible matching websites are found.
3. A system for interactive Internet real estate listings crawling and searching comprising these steps: retrieving Internet based documents and web pages with a plurality of web crawlers; parsing documents and web pages; processing the documents and web pages to determine whether they contain real estate listings information and data; processing the documents and web pages to determine whether they contain hyperlinks to other real estate listings websites and resources; clustering the retrieved sets of real estate listing data and information from documents and web pages into categories; labelling the categories from the clustering; summarizing the retrieved data and information; displaying the labelled categories and document summaries; parsing words of the retrieved data and information to create word sets before the clustering of retrieved data and information into categories; and filtering a set of predefined words from the word sets; inputting the processed real estate listings data and information into a database; displaying the labelled categories, and real estate listings web page and document summaries.
4. The system of claim 3, wherein the processing of documents and web pages to determine whether they contain real estate listings information and data includes these steps: determining the initial URLs, from which information is to be downloaded and the crawl is to be directed further, by retrieving the pre-set entry URLs and their corresponding real estate content web pages; downloading the documents and web pages corresponding to the entry URLs; determining whether the documents and web pages contain any real estate listings information and data by utilizing pre-set page source code filters and regular expressions; determining whether any real estate listings information and data was found in the previous step; if real estate listings data and information was found, following up with processing it and storing it in the database; if real estate listings data and information was not found, employing dynamic dictionary and thesaurus filtering procedures to determine whether the web pages and documents have merely changed in source code or whether they really do not contain any real estate listings data; updating the filters to remember the possible new code, processing the web pages' data and information and inputting it into the database; filtering the hyperlinks of the same web pages and documents to determine if they lead to more similar real estate listings web pages on the same website and/or to other websites by utilizing the dictionary and thesaurus filtering procedures; remembering the found matching hyperlinks leading to other websites that might contain real estate listings; following the hyperlinks leading to more potential real estate listings web pages and documents on the same website; processing the newly found web pages and documents on the same website the same way as the initial ones; processing all the web pages and documents on the same website with the above steps; following all found, matched and remembered outbound hyperlinks (hyperlinks to other websites on different domain names and/or Internet Protocol addresses); repeating the entire procedure outlined in all of the above steps from the downloading of the documents step onwards for the new found websites and repeating it for all possible websites until no more possible matching websites are found.

A system for displaying the real estate listings data and information obtained by the system of claim 3 corresponding to a user's search query.

Breez Brander 20 March 2007