GB2470563A - Populating a database - Google Patents


Publication number
GB2470563A
Authority
GB
United Kingdom
Prior art keywords
database
record
entity
search
url
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB0909019A
Other versions
GB0909019D0 (en
Inventor
John Robinson
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to GB0909019A
Publication of GB0909019D0
Publication of GB2470563A
Legal status: Withdrawn


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A search system comprising a database having a plurality of records, each record being provided for an entity, such as a company, and comprising a plurality of fields for information relating to the respective entity. The system further comprises a web crawler and a database manager for associating crawled web pages with respective records. A search engine is adapted to receive one or more search terms from a client, search at least one field and the associated web pages of each record in the database, determine a record as relevant based on a correspondence between a said search term and data in the at least one field and the associated web pages, and return to the client information on the entity for which the determined record is provided.

Description

SEARCH SYSTEM
Field of the Invention
The present invention relates to the field of on-line searching.
Background of the Invention
The use of Internet search engines is now well-established. Users of a personal computer need simply to navigate via a web browser to the search page of an Internet searching site and enter search terms in a search box provided for the purpose. Alternatively, the web browser may be provided with a toolbar, which has a search or query box for this purpose, thereby obviating the need to navigate to the search page. Typically the query is split between hundreds or even thousands of computers in a cluster and an indexed copy of the World Wide Web is searched via an index server. The copy of the World Wide Web is created using a crawler that crawls uniform resource locators (URLs) on the network and caches the located web pages. Once the index server compiles the results, a document server pulls the relevant documents or parts of documents from the copy of the World Wide Web and these are compiled in a page builder to display the results of the query. As a result, it is common to provide a hyperlink to a web page including the keyword(s) or to the associated home page, together with an excerpt of the relevant part of the page.
This provides users with a vast resource of easily accessible information.
However, the sheer scale of the World Wide Web has the disadvantage that it can be very difficult to locate precisely the data sought. This is particularly the case where a user is attempting to locate information about a business or businesses in a particular sector and/or location. Generally, such searches are likely to throw up a large number of spurious or only partially relevant hits, making it difficult to find the desired data.
Moreover, results tend to be ranked based on the number of times the pages are visited or the number of sites linking to those pages. However, this is often not a good reflection of the relevance of a business to the searcher.
Proprietary business databases accessible by the Internet are also known. Such databases commonly include records for a large number of businesses, each record including basic data about the respective business, such as the name, address, telephone and fax numbers of the business, optionally with the ability to view the business activity and/or business address on a map.
However, in view of the basic nature of the data, users are unable to make informed choices about the different businesses thrown up by a search and instead need to do a considerable amount of further research. It is noted that results of searches of such proprietary databases are often thrown up as results of an Internet search query of the World Wide Web, exacerbating the difficulties of such searches.
Summary of the Invention
According to the present invention, there is provided
Brief Description of the Drawings
Embodiments of the present invention will now be described by way of further example only and with reference to the accompanying drawings, in which:
Fig. 1 is a schematic drawing of a search system according to the present invention;
Fig. 2 is a schematic drawing of a computer system and a client computer that may be used in accordance with the present invention;
Fig. 3 is a schematic representation of a database;
Fig. 4 is a schematic representation of a database record according to the present invention;
Fig. 5 is a flow chart showing a method of database creation and maintenance according to the present invention;
Fig. 6 is an illustration of a web browser page adapted in accordance with the present invention;
Fig. 7 is a flow chart showing a search method according to the present invention;
Fig. 8 is an illustration of a web browser page showing a result in accordance with the present invention;
Fig. 9 is an illustration of a web browser page adapted in accordance with the present invention;
Fig. 10 is an illustration of a further web browser page adapted in accordance with the present invention;
Fig. 11 is an illustration of another web browser page adapted in accordance with the present invention; and
Fig. 12 is a flow chart showing a method performed by a toolbar according to the present invention.
Detailed Description
Fig. 1 shows one embodiment of a search system according to the present invention. As shown in Fig. 1, the system comprises a server 10 storing a database 12 and including a database manager 14. The system further comprises a search engine 16 for receiving queries from client computers 70 over the Internet 80 and searching the database 12 based on the queries. A page builder 18 builds pages including the results of the search, and a page server 20 serves the pages to the client computers 70.
The server 10 shown in Fig. 1 can be implemented using a computer system having, for example, the architecture shown in Fig. 2. In Fig. 2, the computer system 1800 may interface to external systems through a modem or network interface 1801 such as an analogue modem, ISDN modem, cable modem, token ring interface, or satellite transmission interface. As shown in Fig. 2, the computer system 1800 includes a processing unit 1806, which may be a conventional microprocessor, such as an Intel Pentium microprocessor, an Intel Core Duo microprocessor, or a Motorola Power PC microprocessor, which are known to one of ordinary skill in the computer art. System memory 1805 is coupled to the processing unit 1806 by a system bus 1804. System memory 1805 may be a DRAM, RAM, static RAM (SRAM) or any combination thereof. Bus 1804 couples processing unit 1806 to system memory 1805, to non-volatile storage 1808, to graphics subsystem 1803 and to input/output (I/O) controller 1807. Graphics subsystem 1803 controls a display device 1802, for example a cathode ray tube (CRT) or liquid crystal display, which may be part of the graphics subsystem 1803. The I/O devices may include one or more of a keyboard, disk drives, printers, a mouse, a touch screen and the like as known to one of ordinary skill in the computer art. A digital image input device 1810 may be a scanner or a digital camera, which is coupled to I/O controller 1807. The non-volatile storage 1808 may be a magnetic hard disk, an optical disk or another form of storage for large amounts of data. Some of this data is often written by a direct memory access process into the system memory 1805 during execution of the software in the computer system 1800.
In a preferred embodiment, the database 12 shown in Fig. 1 is stored in the non-volatile storage 1808 and other components shown in Fig. 1 are executed as software routines stored in the memory 1805 and performed by the processing unit 1806.
Alternatively, a cluster of such standard purpose computers may be used or a specifically adapted computer having a similar architecture but a large storage capacity and fast processing capability may be used.
The database 12 is a proprietary database and, as shown in Fig. 3, comprises a plurality of records A...XXXXX, each record storing information about a respective entity. It is preferred that the entities are people and businesses such as companies, partnerships and sole traders. However, it will be appreciated by those skilled in the art that this is non-limiting and records may also be provided for other types of entities, such as charities, business organisations, government departments and so on.
A plurality of fields is provided for each record. An example of a record is shown in Fig. 4. In this figure, the fields are categorised into four types: who, where, what and miscellaneous.
The "who" fields include a name field for the name of the respective entity and one or more name fields for storing each of the directors (which includes partners and executive officers in this specification), shareholders, proprietors, and other employees of the entity (if the entity is a legal person such as a company, partnership etc rather than a natural person). "Who" fields may also include one or more URL fields containing a URL for the entity or other fields including content from the entity's about us website page.
The "where" fields include a plurality of address fields, each address field being broken down into sub-fields for street number, street, town, county, state, postal code, telephone number, fax number, e-mail address and other information such as content from the entity's contact us website page. Optionally, the "where" fields may be considered as a subset of the "who" fields.
The "what" fields may comprise an activity field, a standard industrial classification (SIC) field, other classification descriptions and a proprietary business industry code (BIC) field. The entry for the activity field comprises a short description of the entity and its activities or a number of keywords describing its activities (for example: plumbing, building, joinery). Many governments provide a standard industry classification, which is usually a three or four digit code for classifying industries and the appropriate code or codes for the entity are stored in the SIC field. In a similar manner, a BIC can be allocated to each entity. Usually, the BIC provides a more detailed classification of businesses.
The "miscellaneous" fields are for additional information about the business.
They may include, for example, fields for business turnover; annual profit; time business has been trading; a feedback score; negative credit, such as court judgements against the business; business accounts; and internet URLs and/or hyperlinks. In the present embodiment, a miscellaneous field is also provided for the entity's website content and other web pages connected with the business.
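The four field categories of Fig. 4 can be pictured as a simple record type. The following is a minimal sketch only: the class names, attribute names and the choice of Python dataclasses are illustrative assumptions and are not taken from the specification, which does not prescribe any storage layout.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Address:
    # "Where" sub-fields named in the description; the attribute names are assumed.
    street_number: str = ""
    street: str = ""
    town: str = ""
    county: str = ""
    postal_code: str = ""
    telephone: str = ""
    email: str = ""


@dataclass
class EntityRecord:
    # "Who" fields
    name: str
    officers: List[str] = field(default_factory=list)
    urls: List[str] = field(default_factory=list)
    # "Where" fields
    addresses: List[Address] = field(default_factory=list)
    # "What" fields
    activity: str = ""
    sic_codes: List[str] = field(default_factory=list)
    bic_codes: List[str] = field(default_factory=list)
    # "Miscellaneous" fields, including cached web pages keyed by URL
    turnover: float = 0.0
    feedback_score: float = 0.0
    cached_pages: Dict[str, str] = field(default_factory=dict)


record = EntityRecord(
    name="Company A",
    urls=["www.companyA.com"],
    activity="plumbing, building, joinery",
)
record.addresses.append(Address(town="Manchester"))
```

Keeping the cached pages inside the record, rather than in a separate web index, mirrors the description's point that web content and conventional business data are searched together.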
Creation and maintenance of the database 12 is schematically shown in Fig. 5.
In step S1 the initial database is populated. The bulk of the database may be built in a known manner, including by automatically downloading data from already established databases and other sources. For example, the backbone of the database may be built using a government-owned and controlled database of all registered companies, illustrated as external database 50 in Fig. 1. In step S10, the database manager 14 checks to see whether updates from the external database are available. If yes, the process proceeds to step S20, in which updates to the external database 50 are automatically imported by the database manager 14 and the relevant fields of the affected records are updated accordingly. Those skilled in this art will recognise that many such external databases 50 could be used. The external databases 50 may make their content available to the server 10 for free, for a one-off licensing fee or for a fee for each item (or each batch of items) downloaded.
It should also be appreciated that the database 12 of the present invention need not import and store all the information in the external database 50, but the server 10 may instead provide a link to the external database 50 so the information in the external database 50 is presented to the user as though from the server 10. As an example, which will become clearer from the following description, if a user desires to access the annual accounts of a selected company, the server 10 may retrieve the annual accounts from the external database 50 only at the time of the request from the user and present those annual accounts as though directly served by the server 10. This mechanism has advantages in reduced data storage requirements but will increase the response time to the user.
If there are no updates available or after the database 12 has been updated, the method proceeds to step S30. An important aspect of the present invention is that each record is or can be associated with one or more web pages, or the content of one or more web pages, corresponding to the respective business or other entity. In the present embodiment, this association is made by using a web crawler 40 to crawl the World Wide Web 30, as shown in step S30. As the crawler 40 crawls the web 30, it interacts with the database manager 14 to compare crawled web pages with records in the database 12 in step S40.
In particular, if the URL of a crawled web page matches a URL stored in a URL field of a record, the web page is cached in a "miscellaneous" field for that record. Two URLs may be considered to match when their roots match. For example, if the URL stored in the URL field is www.companyA.com, all crawled pages including the string "www.companyA.com" in their URL are cached in an appropriate field for the record. In this way, web pages having URLs such as www.companyA.com/home, www.companyA.com/aboutus and so forth will be cached for the respective record.
Put another way, for each web page crawled, the database manager 14 will determine the URL of the web page and compare it with each URL stored in the URL field of each record. If the URL of the web page and the URL of a record match, the web page is cached in the miscellaneous web page cache field for that record. Preferably, if a web page having the same URL is already cached, it is replaced in order to keep the database 12 up to date.
It will be apparent to those skilled in the art that many algorithms for matching URLs in this or a similar manner are possible and all are included within the scope of the present invention.
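One possible form of the root-matching step is sketched below. It is an illustration under stated assumptions, not the claimed algorithm: records are plain dictionaries with hypothetical "urls" and "cached_pages" keys, and the roots are compared as lower-cased host names using Python's standard `urllib.parse.urlsplit`, which is slightly stricter than the substring test described above.

```python
from urllib.parse import urlsplit


def url_root(url: str) -> str:
    """Return the lower-cased host of a URL, tolerating a missing scheme."""
    if "//" not in url:
        url = "//" + url  # lets urlsplit treat the leading part as the host
    return (urlsplit(url).hostname or "").lower()


def cache_crawled_page(records, page_url, page_text):
    """Cache page_text under every record whose URL field shares its root."""
    root = url_root(page_url)
    if not root:
        return []
    matched = []
    for record in records:
        if any(url_root(stored) == root for stored in record["urls"]):
            # Overwriting an existing entry keeps the cached copy up to date.
            record["cached_pages"][page_url] = page_text
            matched.append(record["name"])
    return matched


records = [
    {"name": "Company A", "urls": ["www.companyA.com"], "cached_pages": {}},
    {"name": "Company B", "urls": ["www.companyB.com"], "cached_pages": {}},
]
cache_crawled_page(records, "http://www.companyA.com/aboutus", "About Company A")
```

After the call, only the record for Company A holds the crawled page; crawling the same URL again simply replaces the stored text.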
In addition or alternatively, the web crawler 40 or database manager 14 may be adapted to extract "who" and "where" data from crawled web pages (such as name, address and contact information, for example e-mail addresses and telephone and fax numbers). In one embodiment, the web crawler 40 or database manager 14 recognises the "about us" or "contact us" page of any web site that it visits and extracts the data from those pages. The database manager 14 compares this extracted data with data previously stored in the relevant fields of all the records in the database 12. If a match is found, the corresponding web pages are cached for that record. Optionally, incomplete who and where fields for the record may be completed using the extracted data. It would also be possible to update fields using the data.
For example, the database manager 14 may compare the "domain" part of any extracted e-mail address (the part following the @) with the domain parts of all e-mail addresses stored for all records. Typically, the database manager 14 will store the most common domain parts used by the general public and business, such as ...@yahoo.com or ...@google.com, in order to filter out false positives.
It will be apparent to those of skill in the art that many algorithms for matching addresses, fax and telephone numbers and e-mail addresses (as examples of "who" and "where" information) in this or a similar manner are possible and all are included within the scope of the present invention.
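As a hedged illustration of the e-mail domain comparison, the sketch below ignores a stored set of widely shared domains before matching. The function names, the dictionary layout with an "emails" key, and the contents of the common-domain set are assumptions made for the example.

```python
# Assumed list of widely shared mail domains that would cause false positives.
COMMON_DOMAINS = {"yahoo.com", "google.com"}


def email_domain(address: str) -> str:
    """Return the part of an e-mail address following the @, lower-cased."""
    return address.rsplit("@", 1)[-1].strip().lower()


def match_by_email_domain(records, extracted_emails):
    """Return names of records sharing a non-common e-mail domain with the page."""
    page_domains = {email_domain(e) for e in extracted_emails if "@" in e}
    page_domains -= COMMON_DOMAINS
    matches = []
    for record in records:
        record_domains = {email_domain(e) for e in record["emails"]}
        if page_domains & record_domains:
            matches.append(record["name"])
    return matches


records = [
    {"name": "Company A", "emails": ["info@companya.com"]},
    {"name": "Company B", "emails": ["bob@yahoo.com"]},
]
matched = match_by_email_domain(records, ["sales@companyA.com", "jane@yahoo.com"])
```

Here the shared yahoo.com domain does not link the page to Company B, while the company-specific domain links it to Company A.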
In addition to or instead of comparing extracted data with previously stored data, the database manager 14 may match the URL of the crawled web page with a URL stored in a URL field of a record in the same manner as described above.
In either case, if a match is found, "who" and "where" fields in the database may be completed and/or updated based on the extracted information.
In a further embodiment, if no match is found between extracted data and any of the records, the web crawler 40 or the database manager 14 may determine whether the website from which the contact details have been extracted is a business (or other organisational entity) website. To make this determination, the web crawler 40 or database manager 14 may be adapted to recognise whether a web page is an "about us" or "contact us" page and, if a web site is recognised as including one or both such pages, the web site is determined as being a business website. Since no match is found between the extracted data and any of the records, a new record is created using the information extracted from the web site. Naturally, the web pages for the new record are also cached.
In the foregoing manner, the database 12 includes in its record fields not only the data usually provided in a database of businesses, but also, as far as possible, a full cache of the website for each business in the database 12. Naturally, the web crawler 40 continues to crawl the web 30 to ensure that an up-to-date copy of the web pages is cached. The resultant database 12 allows significant enhancements for searches for entities compared with both existing World Wide Web searches and existing business database searches.
As shown in Fig. 1, the server 10 is connected to the Internet 80 and is able to receive search queries from client computers 70 and to serve web pages to those client computers 70 over the Internet. Such client computers 70 may comprise desktop computers, laptop computers, mobile phones, PDAs or any other computing device. Such computers may have the structure illustrated in Fig. 2 and described above.
As will be described below, the search queries may be entered in a query box in a web page that is created by page builder 18, served by page server 20 to the client computer 70 and displayed by an Internet browser application running on the client computer 70. Alternatively, search queries may be entered in query boxes included in a toolbar running on the client computer or included in multiple purpose unit (MPU) adverts and shown, for example, in an Internet browser window.
As an example, Fig. 6 shows a browser page 100 as a graphical user interface of a web browser. The browser page 100 includes an address bar 110 showing the URL of a web page and a display pane 130 in which the web page associated with that URL is displayed. Although not shown, the displayed web page could be a web page, which shows a query box and is built by page builder 18 and served by the page server 20 of server 10. The browser page 100 also includes a toolbar 120 according to the present invention, which is caused to be displayed in the web browser page 100 by means of an applet stored on the client computer 70. The toolbar 120 includes a query box 128 and a number of buttons 122, 124, 126, whose function will be described later.
The user types search terms into the search query box 128, which are transmitted to the server 10 and, as shown in Fig. 7, received by the search engine 16 in step S100. The search engine 16 searches all the records in the database 12 to locate the same or similar terms in one or more fields.
Although optional, in the present embodiment, in step S110 the search engine 16 interrogates a "synonyms" database 13, which may take several forms. In one arrangement, the synonyms database 13 may store synonyms of common keywords as a group, such as builder, architect, constructor, and craftsman. If any one of the keywords is entered as a search term, the synonyms database 13 returns all the other terms in the group to the search engine 16, which then proceeds to search the database 12 using the original search term and the search terms returned by the synonyms database 13.
In another arrangement, the synonyms database 13 stores each of the SIC and BIC codes in combination with industry sector descriptors for each code. When one of the keywords is entered as a search term, the synonyms database 13 returns all the SIC and/or BIC codes associated with that keyword to the search engine 16, which then proceeds to search the database 12 using the original search term and the SIC and BIC codes returned by the synonyms database 13.
Similarly, although optional, in the present embodiment, the search engine 16 interrogates a "locations" database 15. In one arrangement, the locations database 15 may store a tree of locations and regions, starting from regions at an initial branch level (e.g. North West, North East etc), moving through counties or states at a sub-branch level and ending at towns or villages at a leaf level. If any one of the locations is entered as a search term, the locations database 15 returns all the other locations existing at the same level and below and depending from the same branch. For example, if a town is entered as a search term, the locations database 15 may return all towns and villages in the same county to the search engine 16, which then proceeds to search the database 12 using the original search term and the search terms returned by the locations database 15. Alternatively, the locations database 15 may simply return the county in which the town is located to the search engine 16, which then proceeds to search the database 12 using the original town and the county returned by the locations database 15.
Various ways of implementing the synonyms database 13 and the locations database 15, or of implementing a similar functionality, will be apparent to those skilled in the art and all are included within the scope of the present invention.
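The query expansion performed with the synonyms database 13 and the locations database 15 might be sketched as follows. The in-memory structures, their sample contents and the function names are illustrative assumptions; the specification leaves the implementation open.

```python
# Assumed stand-in for the synonyms database 13: groups of interchangeable keywords.
SYNONYM_GROUPS = [{"builder", "architect", "constructor", "craftsman"}]

# Assumed stand-in for the locations database 15: region -> county -> towns.
LOCATION_TREE = {
    "North West": {
        "Greater Manchester": ["Manchester", "Salford"],
        "Lancashire": ["Preston"],
    },
}


def expand_synonyms(term):
    """Return the term plus every other keyword in any group containing it."""
    expanded = {term}
    for group in SYNONYM_GROUPS:
        if term in group:
            expanded |= group
    return expanded


def expand_location(term):
    """If the term is a town, return all towns depending from the same county."""
    for counties in LOCATION_TREE.values():
        for towns in counties.values():
            if term in towns:
                return set(towns)
    return {term}


def expand_query(terms):
    """Combine the original terms with the synonym and location expansions."""
    expanded = set()
    for term in terms:
        expanded |= expand_synonyms(term) | expand_location(term)
    return expanded


expanded_terms = expand_query(["builder", "Manchester"])
```

The alternative arrangement, in which only the enclosing county is returned, would replace `set(towns)` with the county name found at the sub-branch level.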
In step S120, the search engine 16 obtains the first record and in step S130 determines whether any of the search terms or additional synonyms, SIC and BIC codes and/or locations returned by the synonyms database 13 and the locations database 15 occur in any or selected fields of the record. If so, the record is marked as relevant in step S140. After that, or if the record is not determined as being relevant, the process moves to step S150, where it is determined whether any unsearched records remain. If unsearched records remain, the next record is obtained in step S175 and the process returns to step S130. Thus, the search engine 16 interrogates all or selected fields of all the records to see if the search terms or additional synonyms, SIC and BIC codes and/or locations returned by the synonyms database 13 and the locations database 15 occur in any of the fields. It should be noted that the searched fields include the text of web pages cached in the miscellaneous fields of records.
Accordingly, if a search term (or optionally a part of a search term) occurs in one of the fields, including in one of the cached web pages, the corresponding record is determined to be relevant.
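The record-by-record loop of Fig. 7 can be sketched as below. This is a simplified illustration: records are assumed to be dictionaries with hypothetical "fields" and "cached_pages" keys, and a case-insensitive substring test stands in for whatever matching the search engine 16 actually applies.

```python
def find_relevant(records, terms):
    """Mark a record relevant if any term occurs in a field or a cached page."""
    lowered = [term.lower() for term in terms]
    relevant = []
    for record in records:  # S120 / S175: obtain the first, then each next, record
        texts = list(record["fields"].values()) + list(record["cached_pages"].values())
        searchable = " ".join(texts).lower()
        if any(term in searchable for term in lowered):  # S130
            relevant.append(record["name"])  # S140
    return relevant  # S150: the loop ends when no unsearched records remain


records = [
    {"name": "Company A", "fields": {"activity": "plumber"}, "cached_pages": {}},
    {
        "name": "Company B",
        "fields": {"activity": "joinery"},
        "cached_pages": {"www.companyB.com/home": "Emergency plumber on call"},
    },
    {"name": "Company C", "fields": {"activity": "bakery"}, "cached_pages": {}},
]
relevant_names = find_relevant(records, ["plumber"])
```

Company B is found only through its cached web page, which is the case the description singles out.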
The records are then ranked in step S160. In step S170 all or at least the most relevant records are returned to the page builder 18 in the server 10. The page builder 18 compiles a list of the relevant records ranked in order of relevance and causes to be displayed for each record in the list a hyperlink and a short contextualised description of the record. Similar list displays are returned by existing Internet World Wide Web search programs, such as Google™, and existing Internet business database search programs, such as Yell™. In the present invention, the contextualised description is based on where in the record the search term or terms are located. Thus, if a search term is located in one of the cached web pages, a relevant extract of the web page including the search term may be used as part or all of the contextualised description. Of course, the page builder may include only a limited number of records from the list in each page, together with a hyperlink for a web page for a different set of records in the list, as well known in the art, thereby allowing users to move through the list.
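A contextualised description taken from a cached web page could be produced along the following lines. The function name and the fixed character window are assumptions for illustration; the specification does not say how the extract is chosen.

```python
def contextual_snippet(page_text, term, width=40):
    """Return the text surrounding the first occurrence of term, or "" if absent."""
    index = page_text.lower().find(term.lower())
    if index == -1:
        return ""
    start = max(0, index - width)
    end = min(len(page_text), index + len(term) + width)
    return page_text[start:end].strip()


snippet = contextual_snippet(
    "We are a family plumber based in Manchester.", "plumber", width=10
)
```

A page builder could fall back to the activity field when the term is found in a "what" field rather than in a cached page.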
In addition, the page builder 18 may include an advertisement from the ad server in the built web page in a known manner. For example, page builder 18 may always include an MPU in the web page, with the content of the MPU being dictated by the ad server 60.
Finally, the page server 20 serves the page, including part or the whole of the list of results, in a known manner to the client computer 70, which displays it in the display pane 130 of the browser window 100.
As previously noted, existing Internet search engines tend to rank results based on the number of times web pages are visited or the number of web sites linking to those pages. However, this is often not a good reflection of the relevance of a business to the searcher. By contrast, the present invention allows ranking to be based on a combination of the relevance of the content of the cached web pages and the content of the remaining portions of the database. This allows considerably more relevant results to be returned to the user.
For example, relevance may be scored on points. In this example, a relevant record may accumulate relevance points depending on the number of times a search term is found in one of its fields. However, the score allocated for each "hit" differs depending on the type of field in which the hit occurs. In particular, hits in "who", "what" and "where" fields score two points, whereas hits in the "miscellaneous" fields, including in the cached web pages, score one point. If the search term occurs in one of the cached web pages a large number of times, say more than five times, two points may be scored.
Assume the search term is "plumber". Moreover, assume that record A includes "plumber" in its activity field, the BIC and SIC code for plumbers in its SIC and BIC code fields, and the word plumber is included in the cached web pages 5 times; record B does not include "plumber" in its activity field, but includes the BIC and SIC code for plumbers in its SIC and BIC code fields, and the word plumber is included in the cached web pages 12 times; and record C does not include "plumber" in its activity field, does not include the BIC and SIC code for plumbers in its SIC and BIC code fields, and the word plumber is included in the cached web pages 25 times. In this case, record A scores 7 points (2 for activity field; 2 for BIC field; 2 for SIC field; and 1 for cached web pages); record B scores 6 points (0 for activity field; 2 for BIC field; 2 for SIC field; and 2 for cached web pages); and record C scores 2 points (0 for activity field; 0 for BIC field; 0 for SIC field; and 2 for cached web pages). Accordingly, records A, B and C will be displayed in that order in the present invention, reflecting the relevance of the entries to the plumbing profession.
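The worked example can be reproduced with a short sketch. The code strings "SIC-PLUMB" and "BIC-PLUMB", the dictionary layout and the function name are hypothetical placeholders; as in the example, each matching field counts once, and occurrences are totalled across a record's cached pages, which is an assumption.

```python
FIELD_HIT = 2        # hit in a "who", "what" or "where" field
PAGE_HIT = 1         # term present in the cached web pages
HEAVY_PAGE_HIT = 2   # term present more than HEAVY_THRESHOLD times
HEAVY_THRESHOLD = 5  # "more than five times"


def relevance_score(record, term, sic_code, bic_code):
    """Score one record for one search term using the points in the example."""
    score = 0
    if term in record["activity"].lower():
        score += FIELD_HIT
    if bic_code in record["bic"]:
        score += FIELD_HIT
    if sic_code in record["sic"]:
        score += FIELD_HIT
    occurrences = sum(page.lower().count(term) for page in record["cached_pages"])
    if occurrences > HEAVY_THRESHOLD:
        score += HEAVY_PAGE_HIT
    elif occurrences > 0:
        score += PAGE_HIT
    return score


SIC, BIC = "SIC-PLUMB", "BIC-PLUMB"  # placeholder codes, not real classifications
records = {
    "A": {"activity": "plumber", "sic": [SIC], "bic": [BIC], "cached_pages": ["plumber " * 5]},
    "B": {"activity": "heating engineer", "sic": [SIC], "bic": [BIC], "cached_pages": ["plumber " * 12]},
    "C": {"activity": "general trader", "sic": [], "bic": [], "cached_pages": ["plumber " * 25]},
}
scores = {name: relevance_score(rec, "plumber", SIC, BIC) for name, rec in records.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Under these assumptions the scores come out as 7, 6 and 2 for records A, B and C, so the ranking is A, B, C, as stated above.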
By contrast, an Internet search engine would be more likely to display results in the order of company C, company B and company A, purely on the strength of their respective web pages. However, the information in the "what" fields of the database 12 of the present invention in fact provides a more accurate indication of the business activities of the three companies. Accordingly, the present invention ranks the companies in an improved order of relevance.
The ability to score records based on the data included in the non-web based "where" fields in the present invention is also particularly helpful. For example, if one of the search terms is "Manchester", the present invention recognises that Manchester is a place and can use a scoring system to score records including Manchester in one or more of their address fields highly. Thus, businesses known to be based in Manchester or to have a branch in Manchester can be scored particularly highly, thereby improving the relevance of returned results.
By contrast, a web search alone will take account of the occurrence of the word Manchester in all web pages, which may skew the result away from businesses based in Manchester. Moreover, it will bring up a preponderance of business listing websites, each of which suffer the shortcomings discussed above.
It will be clear to those skilled in the art that there are many permutations of scoring systems that can be used in the present invention. The important point is that scoring is performed on a combination of the content of the cached web pages and the content in other fields of the records, which are web-independent.
Companies that have the same relevance score can then be ranked based on a further ratings system. Such a ratings system may take into account the profitability, time trading, feedback from other users, number of employees, court judgements, size of office premises etc stored in miscellaneous fields of the database 12. The particular ratings system used is not important so much as the fact that non-web based criteria are used to sort records that include cached web pages. This rating could be stored against the business, be the same irrespective of the search, and serve as an inherent indication of the relevance of that business.
Of course, the scoring system and the ratings system may be amalgamated into a single scoring system. In this case, the occurrence of search terms in the fields of the records is used to establish that a record may be relevant, and once this is established its relevance is determined based on the number of occurrences of the search terms in the different fields, combined with a weighting for each of those fields, combined with a further score based on the contents of additional fields providing miscellaneous information about the company that is the subject of the record. In addition, the scoring/ratings system may include a "paid for" element, allowing companies to boost their scoring by paying a fee to the provider of the server 10.
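Ranking by relevance with the ratings score as a tie-breaker might look like the following sketch. The "relevance" and "rating" keys and the sample values are assumptions; how the rating itself is derived from the miscellaneous fields is left open, as in the description.

```python
def rank_records(records):
    """Order by relevance score, using the search-independent rating to break ties."""
    return sorted(records, key=lambda r: (r["relevance"], r["rating"]), reverse=True)


records = [
    {"name": "Company X", "relevance": 6, "rating": 3.0},
    {"name": "Company Y", "relevance": 6, "rating": 4.5},
    {"name": "Company Z", "relevance": 7, "rating": 1.0},
]
ordered_names = [r["name"] for r in rank_records(records)]
```

An amalgamated system would instead combine the two values into one number, for example as a weighted sum, before sorting.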
In the manner described above, the order in which results are presented is based on the business profile of the entities rather than on the page ranks of their websites or the number of times a particular term occurs in a website. This provides much more relevant data for users. Similarly, if a user is searching for a particular business, the present invention is considerably more likely to turn up the correct business, with improved information, than a web search.
As noted above, it is preferred that the page server 20 returns a page to the client computer that provides a list of relevant entities including a hyperlink and a contextualised description. In some embodiments, the hyperlink includes the name of the entity and links back to the website of the entity, if known.
Where entities have their own website and the URL is known, the hyperlink in the results list may link to another page, for example as shown in Fig. 11, which is produced by the server 10 and includes additional information about the selected entity stored in the database 12. In addition, this page may provide the user with a number of options to obtain further information about the selected entity.
A preferred example of a results page is shown in Fig. 10. In this preferred example, the results list includes for each entity both a hyperlink 1010 to another page produced by the server 10 (such a page being shown in Fig. 11) and another hyperlink 1020 to the website of the entity, if known. In this example, a number of function buttons 1030 are provided for each result and these have the same function as the function buttons 122, 124, 126 shown in Fig. 6 and the function buttons 222, 224, 226, 228 shown in Fig. 8, as discussed in more detail below.
In addition, in the page shown in Fig. 10, which illustrates an embodiment of the invention, the ratings score 1040 based on the information stored in the miscellaneous fields of the corresponding record (eg profitability, time trading, feedback from other users, number of employees, court judgements, size of office premises etc) is kept separate from the relevance score calculated based on the number of times a search term appears in the fields and in which fields it occurs.
If the entity does not have a website, for a fee the server may store relevant data about the entity in one or more "display" fields in the record. Such data may include a brief description and the logo of the entity, for example. In this case, when the hyperlink is selected by the user, the page builder 18 creates a page based on the information (description and logo) in the record for that entity and serves it to the client computer 70 for display in the display pane 130 of the browser window 100. Accordingly, the present invention provides a means for small entities not having a website to produce a "mini-site" at low cost and effort.
An example of such a page produced by the server 10 and as displayed by the browser window 200 is shown in Fig. 8. In this example, the display pane 230 displays the company name and logo, a brief description of the company, a map showing the location of the company and selected comments about the company left by users. In addition, a number of function buttons 222, 224, 226 and 228 are provided. In practice, they are not limited in number and their function is clearly labelled or otherwise signed on the web page.
One example of a function attributed to a said function button is providing summary information such as turnover, time trading, feedback from other users of the website served by server 10, number of employees, profitability, court judgements, size of office premises, headquarters details and so on. Of course, some or all of this summary information could also be included in the web page shown in Fig. 8. Other examples of functions attributed to the function buttons 222, 224, 226 and 228 include calling up: full accounts for the entity; a list of other businesses at the same address; a list of businesses in the same organisation or otherwise related businesses; a list of other businesses owned by a director/owner of the business or by their spouse; the number of links between the selected business and another business; a list of competitors in the region shown on a map; the downloadable toolbar 120 shown in Fig. 6; an option to provide user feedback; and an option to view user feedback. Again, some of this information may be provided on the information page, such as selected user comments shown in Fig. 8. It is preferred that a separate function button is provided for each function, however the present invention encompasses other arrangements, such as the use of a function button to call up a menu of functions, which may be related, or sub-functions.
When the function button for calling up the number of links between the selected business and another business is selected, the page server 20 serves a page requesting the user to enter another business. The search engine 16 then interrogates the records in the database 12 to establish whether it is able to link the selected businesses by common directorships, ownerships, addresses and relationships. For example, if the user seeks to establish the relationship between company A and company B, the search engine may establish that one director sits on the board of both companies A and C; that company C shares premises with company D; and that company D is owned by the spouse of a director of company B. In this way, the user is able to establish the best way to approach another organisation. An exemplary relationship diagram is shown in Fig. 9. The user can move the cursor over any item in the relationship map to see the name of the business or CEO/director displayed.
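The link search described above can be illustrated, under stated assumptions, as a breadth-first search over a graph whose nodes are businesses, people and premises. The specification does not name an algorithm; breadth-first search is used here only because it finds a chain with the fewest links. All node names are hypothetical.

```python
from collections import deque

def shortest_link(graph, start, goal):
    """Return the shortest chain of linked entities from start to goal, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None

# Hypothetical links mirroring the example: shared director, shared premises, spouse ownership.
links = [
    ("Company A", "Director X"), ("Director X", "Company C"),
    ("Company C", "Premises P"), ("Premises P", "Company D"),
    ("Company D", "Owner Y"), ("Owner Y", "Director Z"),
    ("Director Z", "Company B"),
]
graph = {}
for a, b in links:
    graph.setdefault(a, set()).add(b)  # links are treated as undirected
    graph.setdefault(b, set()).add(a)

path = shortest_link(graph, "Company A", "Company B")
```

The number of links between the two businesses is then `len(path) - 1`, and the path itself could feed a relationship diagram of the kind shown in Fig. 9.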
When the function button for calling up the user feedback option is selected, the user is taken to another page allowing him to provide user feedback. This may include scoring the company against a predefined set of criteria and/or leaving commentary. The scored criteria may be used in the scoring system and both the results of the scored criteria and the commentary may be made available to a user by selecting the "view user feedback" function button.
With respect to Fig. 6, it is noted that function buttons 122, 124 and 126 are provided in the toolbar 120. Again, in practice they are not limited in number and their function is clearly labelled or otherwise signed on the web page. Moreover, they have the same functionality as the function buttons 222, 224, 226 and 228.
Generally, the function buttons 122, 124 and 126 are inactive and may be greyed out to indicate this. However, if the URL displayed in the address bar 110 matches a URL stored in a URL field of a record, some or all of the function buttons may become active, depending on the data stored in the other fields of that record. The function buttons are no longer displayed as greyed out if they are active. Moreover, the colour of the toolbar changes to indicate this. In addition or instead, the toolbar may flash or the client computer 70 may be caused to output a sound or display a pop-up window indicating this. Then, if one of the function buttons 122, 124, 126 is selected, the respective function is performed with respect to the company whose record includes a URL matching the URL currently shown in the web browser address bar 110.
An exemplary such method is shown in Fig. 12. At the start of the method, an applet is automatically run when the web browser application of a client computer is opened. In step S300, the applet causes the web browser to display the toolbar 120. Next, in step S310, the applet detects any URL existing in the address bar 110 and in step S320 sends it to the server 10, which checks for a correspondence between the URL in the address bar 110 and a URL stored for a record. In step S330, the results are received. If it is determined that there is a corresponding database record in step S340, the method proceeds to step S350, in which the applet causes the web browser to display the toolbar in a different colour and the function buttons as activated. The process then loops back to step S310. By contrast, if it is determined that there is not a corresponding database record in step S340, the method proceeds to step S360, in which the applet causes the web browser to display the toolbar in the original colour and the function buttons as deactivated. The process again then loops back to step S310.
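A single pass through steps S310 to S360 might be sketched as follows. The host-only comparison in `normalise`, the state dictionary and the sample URLs are assumptions for illustration; the specification only requires a correspondence between the address-bar URL and a stored URL, and in practice the lookup would be performed by the server 10 rather than locally.

```python
from urllib.parse import urlsplit

def normalise(url):
    # Assumption: two URLs "match" when their hosts agree, ignoring case and a leading "www.".
    parts = urlsplit(url if "//" in url else "//" + url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host

def toolbar_state(address_bar_url, known_hosts):
    """One pass of steps S310 to S360: look up the URL and choose the toolbar state."""
    if normalise(address_bar_url) in known_hosts:              # S320 to S340
        return {"colour": "highlight", "buttons_active": True}  # S350
    return {"colour": "default", "buttons_active": False}       # S360

# Stand-in for the URL fields of the records in the database 12.
known_hosts = {normalise("acme-plumbing.example")}

active = toolbar_state("http://www.acme-plumbing.example/contact", known_hosts)
inactive = toolbar_state("https://unknown.example/", known_hosts)
```

The applet would call such a function each time the address bar changes, which corresponds to the loop back to step S310.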
With respect to Fig. 10, it is noted that the function buttons 1030 are again not limited in number and their function is clearly labelled or otherwise signed on the web page. Moreover, they have the same functionality as the function buttons 222, 224, 226 and 228. In the example shown in Fig. 10, the function buttons are displayed only where the function attributed to them is available.
It is noted that by combining a business database with the World Wide Web, the present invention provides significantly enhanced functionality compared with either a search of the web alone or a business database alone. In particular, because web pages are cached for all or most businesses having a record on the database, the content of businesses' websites is used to help guide users to the most relevant websites. Accordingly, more relevant and more complete results are provided than by searches of existing business listing databases. Moreover, the information about businesses returned in a search is fuller than that returned by standard business listing websites and includes relevant extracts from their website as well as URLs/hyperlinks for their websites as a matter of course.
Such URLs/hyperlinks are not generally available from existing business listing databases but are automatically captured in some embodiments of the present invention.
Similarly, because web pages are associated with records in an existing database, the database also including details of businesses not having a web presence, the present invention provides considerably more focused and complete results than a search of the web alone. In particular, the present invention only turns up results of businesses or other entities, tightly focused on the activities of the business and their geographical location, as well as their web presence. Moreover, businesses can be ranked based on their business profile rather than their Internet profile alone. This improvement is possible due to the caching of web pages on the database of the present invention in association with known businesses.
Moreover, this facet of the present invention allows the provision of further functionality to users (provided by means of function buttons or otherwise) which is impossible from a search of the web alone.
It is therefore apparent that the present invention provides significant advantages over both existing Internet search engines and business listing sites provided on the Internet.
In the foregoing, a single centralised database 12 stored on a single server 10 has been described. However, it should be appreciated that a distributed database is envisaged in practice, the database being stored on a plurality of computers, none of which need be the server 10. It is further envisaged that all fields of a record need not be stored in the same place. For example, a record may be split into two or more corresponding and related records, which may even be saved in different places. As a non-limiting example, all the cached web pages may be stored in a first database at one location and all the fields of a record except the web cache fields may be stored in another database at another location together with an address or other identifier of the cached web pages associated with that record, by which means the databases are relationally linked. Of course, other fields in records may be similarly distributed.
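The relational link between a record and its separately stored cached pages can be pictured with two in-memory stores standing in for the two databases. The names `business_db`, `cache_db` and `cache_id`, and the sample data, are hypothetical.

```python
# First store: all fields of each record except the web cache, plus a cache identifier.
business_db = {
    101: {"name": "Acme Plumbing", "town": "Leeds", "cache_id": "c-0001"},
    102: {"name": "No-Web Joinery", "town": "York", "cache_id": None},  # no web presence
}

# Second store, possibly at another location: cached web pages keyed by identifier.
cache_db = {
    "c-0001": ["<html>Acme Plumbing: emergency repairs</html>"],
}

def load_full_record(record_id):
    # Copy the record, then follow the identifier to fetch any cached pages.
    record = dict(business_db[record_id])
    cache_id = record.get("cache_id")
    record["cached_pages"] = cache_db.get(cache_id, []) if cache_id else []
    return record

acme = load_full_record(101)
joinery = load_full_record(102)
```

A record without a web presence simply carries no identifier and yields an empty list of cached pages.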
As another example, where a function button attributed with the function of calling up a company's accounts is provided, either the company's accounts may be stored in a field of the record for that company or they may be stored in an external database 50 maintained by a different organisation to the owner of the server 10. Once the function button is selected, the server either retrieves the company's accounts from the database 12 or from the database 50 as appropriate.
In a similar manner, all the components on the server side of Fig. 1 may be distributed or combined as appropriate. Thus, the crawler 40 may be combined with the database manager 14 in one machine separate from the server 10 or included in the server 10. Similarly, any or all of the database manager 14, the databases 12, 13, 15, the search engine 16, the page builder 18 and the page server 20 may be provided entirely or partly outside the server 10, and the crawler 40 and ad server 60 may be included in the server 10. Of course, the ad server 60 is optional. Moreover, a different computing architecture may be used, so long as the functionality described above is implemented. For example, the system may use an index server, a document server and so forth.
In the foregoing description, the web crawler 40 crawls the entire World Wide Web 30. However, it may instead be adapted to crawl only selected portions of the web relating to the businesses already stored in the database 12. For example, the crawler 40 may be limited to crawling the websites of URLs already stored in the URL field of the records, or it may use an already established search engine to identify relevant portions of the web to crawl. For example, it may automatically enter keywords relating to business or other entity names and/or activities and crawl the relevant pages thrown up by the search engine.
In the foregoing, it is described how the search engine determines whether a search term occurs in the database 12. However, the search engine may also be adapted to establish whether character strings similar to the search term occur in fields in the database 12 and the results amended accordingly or alternative searches suggested. In this way, mis-spellings, alternative spellings, different participles and different conjugations of search terms are also provided. This functionality may be provided by a spelling server or in the synonyms database 13.
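As a minimal sketch of similar-string matching, the Python standard library function `difflib.get_close_matches` can suggest indexed terms that resemble a mis-typed search term. The vocabulary, the cutoff of 0.8 and the function name `suggest_terms` are assumptions; the specification leaves the matching technique open.

```python
import difflib

def suggest_terms(term, vocabulary, cutoff=0.8):
    # Return up to three indexed terms whose similarity ratio to the input meets the cutoff.
    return difflib.get_close_matches(term.lower(), vocabulary, n=3, cutoff=cutoff)

# Hypothetical set of terms occurring in fields of the database 12.
vocabulary = ["plumber", "plumbing", "joinery", "solicitor"]

suggestions = suggest_terms("plummer", vocabulary)
no_suggestions = suggest_terms("xyzzy", vocabulary)
```

The suggestions could either be searched automatically or offered to the user as alternative searches.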
In one embodiment of the present invention, web pages are not cached. Rather, only URLs are stored in the records of the database 12 and these are used to interrogate web pages each time a search is performed.
In another embodiment, the key words entered by the user are searched using a standard Internet search engine as well as the search engine of the present invention. The results are then cross-referenced. In this case, it is not mandatory to cache web pages. In this embodiment, it is preferred that the returned results would include results from the Internet search that cross-reference with results from the database search, as well as results from the database search that do not cross-reference with results from the Internet search, but not results from the Internet search that do not cross-reference with results from the database search. Internet results that cross-reference with database results would have a higher relevance than database only results.
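The preferred cross-referencing can be sketched as below, assuming that results are cross-referenced by URL and that database results arrive already ordered by relevance. The function name and sample data are hypothetical.

```python
def merge_results(db_results, web_urls):
    """db_results: list of (name, url or None) in database relevance order.

    Returns database results also found by the Internet search first,
    then database-only results. Internet-only results are discarded.
    """
    web = set(web_urls)
    both = [r for r in db_results if r[1] in web]
    db_only = [r for r in db_results if r[1] not in web]
    return both + db_only

db_results = [
    ("Acme Plumbing", "acme.example"),
    ("No-Web Joinery", None),          # record with no stored URL
    ("Beta Plumbing", "beta.example"),
]
web_urls = ["beta.example", "other.example"]

merged = merge_results(db_results, web_urls)
```

Here `other.example` never appears in the output because it has no corresponding database record, while the record without a URL is retained as a database-only result.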
In the present specification, mention is made of caching web pages. It is to be understood that this and like expressions encompass caching the whole web page and all or substantially all related code, caching only the text, caching the text and drawings, and so forth. Effectively what is meant is caching the relevant part of the web page.
The foregoing description has been given by way of example only and it will be appreciated by those skilled in the art that modifications may be made without departing from the broader spirit or scope of the invention as set forth in the claims. The specification and drawings are therefore to be regarded in an illustrative sense rather than a restrictive sense.

Claims (22)

  1. A search system comprising: storage storing a database, the database comprising a plurality of records, each record being provided for an entity and comprising a plurality of fields for information relating to the respective entity; a web crawler; a database manager for associating crawled web pages with respective records; and a search engine, wherein the search engine is adapted: to receive one or more search terms from a client; to search at least one field and the associated web pages of each record in the database; to determine a record as relevant based on a correspondence between a said search term and data in the at least one field and the associated web pages; and to return to the client information on the entity for which the determined record is provided.
  2. A search system according to claim 1, wherein said entities comprise businesses.
  3. A search system according to claim 1 or claim 2, wherein the database manager is arranged to associate crawled web pages of a web site with a record if at least one of a URL of a crawled web page and name information and address information provided in a crawled web page of the web site matches respective information stored in at least one of URL, name and address fields for said record.
  4. A search system according to any one of the preceding claims, wherein each record includes a field for at least one of: person name, organisation name, street, town, county, state, postal code, standard industry code, proprietary business industry code, activity description, telephone number, facsimile number, miscellaneous description.
  5. A search system according to claim 4, wherein at least one of associated web pages and text of associated web pages is stored in a field of a record.
  6. A search system according to any one of claims 1 to 4, further comprising a cache database to cache crawled web pages, wherein each record comprises a cache identification field, and the database manager is arranged to store a respective cache identification code in the cache identification field to associate each record with a crawled web page stored in the cache database.
  7. A search system according to any one of the preceding claims, wherein: each record includes fields for information concerning the name, location and the business activity of the entity and miscellaneous fields relating to the business entity; and the search engine determines relevance of a record based on the number of occasions a search term corresponds to data in the at least one field and the associated web pages.
  8. A search system according to claim 7, wherein correspondence of a search term with data in any one of a name, location and business activity field is determined to be more relevant than correspondence of the search term with data in any one of a miscellaneous description field and a crawled web page.
  9. A search system according to any one of the preceding claims, wherein the search engine determines a plurality of records as relevant and orders them based on a score, said score being determined from information stored in scoring fields included in the records.
  10. A search system according to claim 9, wherein the scoring fields include at least one of profitability, time trading, feedback, the number of employees, court judgements, and size of premises.
  11. A search system according to any one of the preceding claims, wherein the search engine is adapted to: receive from the client a user selection of a returned record together with a request for predetermined information; and return to the client the predetermined information concerning the entity for which the selected record is provided.
  12. A search system according to claim 11, wherein the predetermined information is stored in one or more fields of the record.
  13. A search system according to claim 11 or claim 12, wherein the predetermined information is stored in an associated record of another database.
  14. A search system according to any one of the preceding claims, wherein the system is adapted to receive a URL from a client and to return an indication that a record in the database includes a URL that matches the URL received from the client.
  15. A search system according to any one of the preceding claims, wherein the system is adapted to receive from a client a user selection of a returned record and, in response, to direct a browser application running on the client to one of a URL of a website for the entity for which the record is provided and a page providing information about the selected entity, said page corresponding to the record.
  16. A computer program product comprising a computer useable storage medium to store a computer readable program that, when executed on a computer, causes the computer to perform operations for searching a database, the operations comprising: accepting user input of one or more search terms; transmitting the one or more search terms to a system storing a database, the database comprising a plurality of records, each record being provided for an entity and comprising a plurality of fields for information relating to the respective entity and being associated with crawled web pages relating to the entity; receiving the results of a search of at least one field and the associated web pages of said records in the database, records being determined as relevant based on a relevance of data in the at least one field and the associated web pages to the at least one search term; and displaying the results.
  17. A computer program product comprising a computer useable storage medium to store a computer readable program that, when executed on a computer, causes the computer to perform operations, the operations comprising: displaying a toolbar in a graphical user interface of a web browser application; transmitting the URL in an address bar of the web browser application to a system storing a database, the database comprising a plurality of records, each record being provided for an entity and comprising a plurality of fields for information relating to the respective entity including at least one URL field storing a URL for the entity; receiving the results of a query of the URL field of said records in the database, a record being determined as relevant based on a match between the URL in the address bar and a URL in the URL field; and outputting an indication that information on the entity associated with the URL in the address bar is available.
  18. A computer program product according to claim 17, wherein the toolbar further comprises at least one icon that can be selected when the indication is output, selection of said at least one icon causing the computer to transmit a request to the system for respective, predetermined information on the entity from the database.
  19. A computer program product according to claim 18, one said icon causing the computer to transmit a request to the server for at least one of summary information; accounts; court judgements against the entity; organisational structure; companies an individual is associated with; companies associated with the home address of an individual; companies associated with the business address of the entity.
  20. A computer program product according to claim 17, wherein the toolbar further comprises a profile icon that can be selected when the indication is output, selection of said profile icon causing the computer to transmit a request to the server to serve a predetermined profile page providing information about the entity, said predetermined profile page corresponding to a respective record.
  21. A computer program product according to claim 17, wherein the toolbar further comprises a map icon that can be selected when the indication is output, selection of said map icon causing the computer to transmit a request to the server to serve a map showing the location of the entity and the location of competitor entities.
  22. A computer apparatus comprising: a database store, the database comprising a plurality of records, each record being provided for an entity and comprising a plurality of fields for information relating to the respective entity; a web crawler; and a database manager for associating crawled web pages with respective records.
GB0909019A 2009-05-26 2009-05-26 Populating a database Withdrawn GB2470563A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB0909019A GB2470563A (en) 2009-05-26 2009-05-26 Populating a database

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB0909019A GB2470563A (en) 2009-05-26 2009-05-26 Populating a database

Publications (2)

Publication Number Publication Date
GB0909019D0 GB0909019D0 (en) 2009-07-01
GB2470563A true GB2470563A (en) 2010-12-01

Family

ID=40862980

Family Applications (1)

Application Number Title Priority Date Filing Date
GB0909019A Withdrawn GB2470563A (en) 2009-05-26 2009-05-26 Populating a database

Country Status (1)

Country Link
GB (1) GB2470563A (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2002010989A2 (en) * 2000-07-31 2002-02-07 Eliyon Technologies Corporation Method for maintaining people and organization information
US20020032677A1 (en) * 2000-03-01 2002-03-14 Jeff Morgenthaler Methods for creating, editing, and updating searchable graphical database and databases of graphical images and information and displaying graphical images from a searchable graphical database or databases in a sequential or slide show format
US20030033299A1 (en) * 2000-01-20 2003-02-13 Neelakantan Sundaresan System and method for integrating off-line ratings of Businesses with search engines
US20030033274A1 (en) * 2001-08-13 2003-02-13 International Business Machines Corporation Hub for strategic intelligence
US20030046311A1 (en) * 2001-06-19 2003-03-06 Ryan Baidya Dynamic search engine and database
US20050149507A1 (en) * 2003-02-05 2005-07-07 Nye Timothy G. Systems and methods for identifying an internet resource address
US20060026114A1 (en) * 2004-07-28 2006-02-02 Ken Gregoire Data gathering and distribution system
WO2006094206A2 (en) * 2005-03-02 2006-09-08 Google Inc. Generating structured information
WO2007023498A2 (en) * 2005-08-24 2007-03-01 Spearcast Ltd. A system and a method for generating evaluative information about commercial service providers
US20070266024A1 (en) * 2006-05-11 2007-11-15 Yu Cao Facilitated Search Systems and Methods for Domains
US20080147631A1 (en) * 2006-12-14 2008-06-19 Dean Leffingwell Method and system for collecting and retrieving information from web sites

Also Published As

Publication number Publication date
GB0909019D0 (en) 2009-07-01

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)