GB2364403A

GB2364403A - An information search system and a text translation system

Info

Publication number: GB2364403A
Application number: GB9926759A
Authority: GB
Inventors: Everhard Nicholas Wrigley; Neil Bowers
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 1999-11-11
Filing date: 1999-11-11
Publication date: 2002-01-23
Also published as: GB9926759D0

Abstract

A system for searching for information units 1a-1c such as Web pages stored at a plurality of logical locations 2a-2c operates by storing metadata for an information unit in association with the information unit, 3a-3c. The metadata contains classification data for the information units. The metadata is read to form a database 6 of logical location identifications and corresponding classification data. When a classification query is received for the database, the identities of any logical locations matching the query are determined to allow the retrieval of information from the or each logical location. Further, a system for obtaining translations of text is disclosed in which text to be translated is sent to a translation database and used to look up any translations of the text in the database. If the text has no translations in the translation database, the text is added to the translation database for the translation to be added later. The translation system can be used in conjunction with the information search system.

Description

AN INFORMATION SEARCH SYSTEM AND A TEXT TRANSLATION SYSTEM

The present invention generally relates to a method and apparatus for searching for information stored at a plurality of logical locations using metadata for the information at each logical location. The present invention also generally relates to an automatic translation system for obtaining a translation of text using a database of previously translated text.

In an information processing system, information can be provided at a number of logical locations. Such logical locations may be geographically distinct. An example of such an information processing system is the world wide web. Web pages contain information which is stored at logical locations and are hosted at web sites, which are logical sites that need not be geographically separate, e.g. more than one web site can be hosted by a web server at a single location.

In such an information processing system, since information is contained at each logical location, each logical location must be searched. Searching for information on the world wide web can be undertaken using web search engines. However, such search engines are limited in that only keyword searching, e.g. by Alta Vista, or classification searching, e.g. by Yahoo, is provided.

Classification searching is achieved by manually classifying the information at the search engine. This central classification suffers from a disadvantage that a great deal of work is required to classify information on a large number of web pages. Also, the classification may not be accurate since the person carrying out the classification may not be familiar with the information. Keyword searching suffers from the disadvantage of not providing a user with any information on what might be available from web pages.

An object of a first aspect of the present invention is to overcome this deficiency in the prior art and to provide a more efficient and accurate information search system.

In accordance with this aspect of the present invention there is provided an information search system for searching for information stored in a plurality of information units at a respective plurality of logically distinct locations. Metadata for the information units to be searched is stored associated with each of a plurality of said logical locations. The metadata for a logical location contains classification data for the information at that logical location. The metadata is read and used to form a database containing information identifying the logical locations and corresponding classification data. When a classification query is received for the database, information identifying any logical locations which match the classification query is returned to allow the retrieval of information from any matched logical location.

Thus in accordance with this aspect of the present invention, the task of classifying the information is distributed to the person responsible for each logical location, i.e. the person responsible for the information unit. Thus the classification is delegated. Metadata is stored in addition to the information unit and thus a centralised database can be formed simply by collating the metadata for the logical locations. Each logical location can be provided at geographically distinct sites or the same geographical site.

The information units can be organised into groups of information. Each group is stored at a logically distinct site (although the information units in a group can be geographically dispersed). An example of such an organisation is provided by an embodiment implemented for the world wide web in which a logically distinct site comprises a web site comprised of a number of web pages, each web page comprising an information unit and being locatable by its logical location, i.e. URL.

The metadata for a logical location can be stored separately to the information unit either as a single set of metadata for a group of information units, e.g. a metadata file for a web site, or as a number of sets of metadata, one per information unit. In the latter case, the metadata can be combined with the information and stored together, e.g. it can be meta elements within the HTML for web pages. The former method has the advantage that the metadata can contain hierarchically arranged classification data for a plurality of information units, thus avoiding the duplicated storing of metadata.

The classification data can comprise any form of classification which will enhance the searching of the information across the logical locations. For example, classifications can include data on the subject of the information, data on the geographical origins of the information, and data on the language of the information.

In a further embodiment of the present invention, the information comprises hypertext. Each logical location holds hypertext for a hypertext document with zero or more hypertext links to other hypertext documents. The metadata thus contains classification data for each hypertext document and the database contains logical location and document identifications and corresponding classification data. In a specific embodiment, a root hypertext document at each logical location, e.g. the index page of a web site, is read and any hypertext links to hypertext documents are identified. Also corresponding metadata is read. The identified hypertext documents and corresponding metadata are then read to form database entries for the added hypertext documents. These steps are then repeated for all of the hypertext links. The extent of the hypertext documents which are entered in the database can be restricted by limiting the process to predetermined logical locations. Any hypertext links to hypertext documents out of the desired domain can be ignored. In this way, the scope of the search can be limited to the predetermined logical locations. This is particularly advantageous for a company or consortium of information providers who wish to restrict the search to designated web pages.

Because the information at the logical locations can change, the metadata may be updated. Also, metadata may be updated at the logical location purely for reorganisational purposes. Thus in an embodiment of the present invention, the reading of the metadata at the logical locations is repeated, for example periodically or after notification of changes to the metadata, in order to update the database.

The present invention is particularly suited to implementation over the world wide web wherein the information comprises web pages organised into web sites. The reading of the web pages and the formation of the database in such an embodiment is carried out by a web robot (a computer program) which visits (accesses) each web site to read the metadata for each web page as well as the information in order to form the database on a server. This database can thus be made available for searching either on the server on which the robot operates, or on another server. Further, multiple copies of the database may be made on multiple servers so as to spread the database load when there are a large number of queries and also to reduce network delay by providing a reduced network distance between clients and the server. When multiple copies of the database exist, the copy which is updated by the robot directly is the master copy and each of the other copies has to be "synchronised" with the master copy.

In an embodiment of the present invention the information at one or more logical locations contains text and at least some of the text is added to the database. In this way, in response to a query, the text in the database can be returned together with the logical location's identification, e.g. URL in a web search system, so as to provide more information on the results of the search. The text held in the database can comprise a precis of the text, keywords and/or the title appearing in the text.

Where information held at the logical locations can be held in any one of a number of languages, the query can include an indication of a preferred language for text to be returned as a result of the search. When text is returned from the database, if the language of the text does not match the preferred language, a translation of at least part of the text can be obtained so that it can be returned in response to the query. The translation can be obtained by a look-up operation using a translation database. If there is no translation for the text, the text can be added to the translation database for a translation to be added therefor at a later date.

A second aspect of the present invention is concerned with the problem of providing automated translations of text.

Unfortunately, although there has been a great deal of progress in automating translations in order to provide online translation, there is still a great deal of progress needed in order to provide accurate translations. Accurate translations can only be provided with human intervention. Thus, this aspect of the present invention is concerned with providing an accurate translation even if this only results in a partially translated text.

The second aspect of the present invention provides an automated translation system wherein text to be translated is sent to a translation database. Any translations of at least parts of the text are looked up in the database and retrieved. Thus, at least part of the text can be translated by reference to a database. This database can be maintained or updated manually or automatically offline without affecting the rate of translation of text.

In an embodiment of the present invention, when there is no translation for text in the database, the text is added to the database so that a translation can be added for the text at a later time (offline).

In an embodiment of the present invention, the translation database can contain translations of text in several languages. The text to be translated can thus be supplemented with an indication of at least one preferred language and in this way the database look-up operation can be controlled to output a translation in at least one preferred language. In a specific embodiment, the preference can be ranked and the look-up operation can be controlled to output the translation in the database which has the highest ranking in preference to other translations of lower ranking. Thus for example, the user can specify English as the highest ranked language with French as a second language and thus the database will return all English translations where available and French translations of the text when there is no English translation. When there is text, e.g. Japanese text, that has neither English nor French translations, the Japanese text will be returned. Thus the result will be a document which may be formed of passages of text in several languages. This is beneficial since it will at least provide a partial translation in the or each preferred language.
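The ranked look-up described above can be sketched as follows. This is a minimal illustration only, not the patented implementation: the in-memory dictionary standing in for the translation database, the `pending` list standing in for the queue of text awaiting offline translation, and the names `lookup_translation` and `translate_passages` are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for the translation database:
# source text -> {language code -> translation}
TranslationDB = Dict[str, Dict[str, str]]


def lookup_translation(db: TranslationDB, text: str, source_lang: str,
                       preferred: List[str], pending: List[str]) -> Tuple[str, str]:
    """Return (language, text) for the highest-ranked available translation.

    `preferred` is ordered best first. If no preferred language is
    available, the original text is returned unchanged and recorded in
    `pending` so that a translation can be added later.
    """
    translations = db.get(text, {})
    for lang in preferred:
        if lang == source_lang:
            return lang, text               # already in a preferred language
        if lang in translations:
            return lang, translations[lang]
    if text not in pending:
        pending.append(text)                # queue for offline translation
    return source_lang, text


def translate_passages(db: TranslationDB, passages: List[str], source_lang: str,
                       preferred: List[str], pending: List[str]) -> List[Tuple[str, str]]:
    """Translate each passage independently; the result may mix languages."""
    return [lookup_translation(db, p, source_lang, preferred, pending)
            for p in passages]
```

With English ranked above French, a passage with only a French translation comes back in French, and a passage with neither comes back in its original language and is queued for a translator.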

This aspect of the present invention can be implemented over the world wide web and can be used for obtaining a translation of text in web pages. When a web page is requested, the text can be extracted from it and translations can be obtained for the text. The web page can then be reconstructed using the translations and any untranslated text and the reconstructed web page can be returned in response to the request. A translation can only be provided of what is in the database. If text does not appear in the database, the translation of the document is not however delayed: a partial translation is simply returned. The translation database can then be updated at a later date and thus with use, the translation database will become more and more complete.

Embodiments of the present invention will now be described with reference to the accompanying drawings in which:

Figure 1 is a schematic diagram of a first embodiment of the present invention implemented over the world wide web;
Figure 2 is a schematic diagram of a slave server of Figure 1;
Figure 3 is a schematic diagram of the master server of Figure 1;
Figure 4 is a schematic diagram illustrating the relationship between the metadata and the web site;
Figure 5 is a flow diagram illustrating the method of generating and updating the database;
Figure 6 is a flow diagram illustrating the steps of step S19 of Figure 5 in more detail;
Figure 7 is a flow diagram illustrating the synchronisation process;
Figure 8 is a flow diagram illustrating the abstracting process;
Figure 9 is a diagram illustrating the parse tree generated in step S51 in Figure 8;
Figure 10 is a functional diagram of the operation of the slave server in response to a search query;
Figure 11 is a flow diagram of the method of providing at least partial translations of the text in the embodiment of Figure 1;
Figure 12 is a functional diagram of the operation of a slave server when accessed by a translator's machine;
Figure 13 is a flow diagram illustrating the method of updating the translation database;
Figure 14 is a schematic diagram of a second embodiment of the present invention for providing at least partial translations of text for an application; and
Figure 15 is a flow diagram illustrating the operation of the embodiment of Figure 14.

The first embodiment of the present invention will now be described with reference to Figures 1 to 13.

This embodiment of the present invention is implemented over the world wide web and enables the searching of web sites within a domain using classifications provided by metadata stored in association with each web site. The domain in this embodiment comprises a corporate domain such as a collection of web sites owned by or related to a company.

Figure 1 is a schematic diagram of the system.

Information to be searched for is provided as one or more web pages 1a, 1b and 1c at each web site on respective web servers 2a, 2b and 2c. Also a metadata file 3a, 3b and 3c is provided on each web server 2a, 2b and 2c. Each web server 2a, 2b and 2c is connected to the Internet (world wide web) 4. A master server 5 is provided with a master metadata database 6 and a master translation database 7. The master server 5 is connected over the internet 4 to allow connection to the web servers 2a, 2b and 2c. Public access to the master server 5 is not allowed.

Two slave servers 8a and 8b are provided for public access to data in respective metadata databases 9a and 9b.

The slave servers 8a and 8b are also provided with a respective translation database 10a and 10b.

Slave server 1 (8a) is connected to the master server 5 via a local area network (LAN). Also, a connection is provided by the internet 4. Slave server 2 (8b) is only connected to the master server 5 via the internet 4, as it is located remotely and out of range of a LAN connection. The slave servers 8a and 8b are provided with copies of the master metadata database 6 and the master translation database 7 which form the metadata databases 9a and 9b and the translation databases 10a and 10b. The data within the metadata databases 9a and 9b are available for public access over the internet 4 by a client machine 11 operating a web browser 12. Thus the client machine 11 is able to generate a search query for searching for data in the metadata database 9a or 9b, dependent upon the server to which the query is addressed.

A translator's machine 13 provided with a web browser 14 is able to access any of the servers 5, 8a or 8b via the internet 4 to allow restricted access to any one of the translation databases 7, 10a or 10b. In this way translations for text can be added to any of the translation databases 7, 10a or 10b. Once changes have been made to one of the translation databases 7, 10a or 10b, it is necessary to "resynchronise" the translation databases 7, 10a and 10b in order to ensure that they are consistent.

Figure 2 illustrates in more detail the architecture of a slave server. The slave server includes the translation database 10a and the metadata database 9a.

A database application programmer's interface (API) 21 is provided to interface to the metadata database 9a. A translation database API 22 is provided to interface to the translation database 10a. A web interface 20 is provided by web server software to receive hypertext transfer protocol (HTTP) requests. If the request is a search request, it is passed to the database API 21, whereas if it is a request for access to the translation database, the web interface 20 links to the translation database API 22. Also a LAN interface 23 is provided for connection to the master server 5 over a local area network. The LAN interface 23 provides a connection to the database API 21 and the translation database API 22 to allow for the synchronisation of the databases 9a and 10a to the master databases 6 and 7.

Figure 3 is a schematic diagram of the functional architecture of the master server 5.

The master server 5 is provided with the master metadata database 6 and the master translation database 7. A web interface 30 is provided to interface over the internet to the remote web servers 2a, 2b and 2c. A database API 31 is provided between the master metadata database 6 and the web interface 30 in order to receive queries resulting from the search requests and in order to generate and update the master metadata database 6. A translation database API 32 is provided between the master translation database 7 and the web interface 30 in order to allow the translator's machine to add to the translation database 7 via the web interface 30. The master server also includes the web robot 33 which traverses the internet searching web sites within a predetermined domain and sends classification data to the database API 31 in order to form the master metadata database 6. The web robot 33 also retrieves the text from the web page, which is passed to an abstractor 34 which generates an abstract for the web page. The output of the abstractor 34 is also passed on to the database API 31 in order to be added to the master metadata database 6.

A LAN interface 35 is provided to the database API 31 and the translation database API 32 to provide for the synchronisation of the master translation database 7 and the translation databases 10a and 10b and the master metadata database 6 and the metadata databases 9a and 9b.

Further, a report generator 36 is provided to provide reports for administrative purposes.

In this embodiment of the present invention the web robot 33 of the master server 5 comprises a program module for traversing the internet 4 looking for information on a prespecified list (domain) of web servers 2a, 2b and 2c in order to read the metadata files 3a, 3b and 3c and the web pages 1a, 1b and 1c. The metadata 3a, 3b and 3c is provided to contain classification information. It is provided locally on each server 2a, 2b and 2c specifically for the web pages 1a, 1b, 1c contained on the web sites hosted by the respective servers 2a, 2b and 2c. Each metadata file 3a, 3b or 3c contains classification information classifying the information content of the web pages on the respective web site. Each metadata file can thus hold a hierarchical classification structure. This is illustrated in Figure 4. The metadata file contains classification data which can be applicable to the whole site, i.e. all web pages. It also contains classification data which is applicable to a number of web pages forming a group. There can be classification data for several groups. It can further hold classification data specific to individual pages. It will also be noted that some web pages within the site may not be classified at all, e.g. page 4 in Figure 4.

The classifications used in this embodiment are:

1. Region - to indicate the geographical region in which the site is represented as located;
2. Country - to indicate the country in which the site is represented as located;
3. Language - to indicate the language of the web page;
4. Product - to indicate the products mentioned in the web page;
5. Product Category - to indicate the category the products mentioned in the web page fall into;
6. Page Type - to indicate the type of information content of the web page;
7. Company Type - to indicate the type of company represented as hosting the site.

There are various values that each category can take. Table 1 below indicates the values for the region category.

TABLE 1

    Region               Two Letter Region Code
    Europa and Africa    eu
    America              am
    Oceania              oc
    Asia                 as
    Japan                jp

For the country categories, the ISO two letter code identifier is used to indicate the country for the web site. Similarly, for a language the ISO two letter code identifier is used to indicate the language used for the web site or for individual web pages, e.g. 'en' for English language.

The product category simply gives the product name.

The values for the product category (prodcat) in an embodiment are given in Table 2 below:

TABLE 2

    prodcat value    Meaning
    printer          Printers (including BJ)
    camera           Cameras (including digital)
    copier           Photocopier
    fax              Facsimiles
    multi            Multi-function devices
    image            Image management systems
    office           Office equipment (typewriters, calculators, etc)
    medical          Medical equipment
    semiconductor    Semiconductor equipment
    computer         Computer equipment

The page type category value defines the category of a web page, i.e. the type of information provided. The values are given in Table 3 below:

TABLE 3

    pagetype value    Meaning
    prodinfo          Product information
    corpinfo          Corporate information
    jobs              Jobs available
    download          Software download pages
    feature           Special feature
    new               What's New?
    contact           Contact information

The company type category identifies the type of company. Typical values are given in Table 4 below:

TABLE 4

    cotype value    Meaning
    marketing       Marketing oriented
    sales           Sales oriented
    research        Research & Development

All of these categories are available for classification of the content of the web pages on the web site.

As illustrated in Figure 4 some information need only be provided once for a site. For example, region, country and company type are likely to be the same for all pages of a web site. Thus, the metadata file can specify categories as being generic to all web pages. For example, the metadata file can specify such by:

    site
        list of attributes

Also, attributes which are common to a group of web pages can be specified by:

    /group
        list of attributes

In the example above 'group' is a pattern. Any web page which matches the pattern is considered a member of the group. Any particular web page can be in zero or more groups.

Also, individual web pages can be given attributes by:

    /mypage.html
        list of attributes

A sample metadata file is given below:

    # sample metadata file

    # site configuration information
    site
        lang=ja

    # group configuration information
    /-e.html
        lang=en

In the above example, the result is as if all web pages have the language attribute set to Japanese except for any pages where the URL ends in -e.html, which have the language attribute set to English.
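A minimal parsing sketch for a file of this shape is given here for illustration. The exact file syntax is not fully recoverable from the text, so the sketch assumes the following: `#` starts a comment, a line containing `=` is an attribute, any other non-empty line is a condition header (`site` or a URL pattern), and lines consisting only of a brace are skipped in case the format delimits blocks that way. The function name `parse_metadata` is hypothetical.

```python
from typing import Dict, List, Tuple

Section = Tuple[str, Dict[str, str]]   # (condition, attributes)


def parse_metadata(text: str) -> List[Section]:
    """Parse a metadata file into an ordered list of (condition, attributes)."""
    sections: List[Section] = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()    # drop comments and whitespace
        if not line or line in ("{", "}"):     # tolerate optional brace lines
            continue
        if "=" in line:
            if not sections:
                raise ValueError("attribute found before any condition header")
            key, value = line.split("=", 1)
            sections[-1][1][key.strip()] = value.strip()
        else:
            sections.append((line, {}))        # 'site' or a URL pattern
    return sections


SAMPLE = """\
# sample metadata file
# site configuration information
site
    lang=ja
# group configuration information
/-e.html
    lang=en
"""
```

Keeping the sections in file order matters, since a later group or page condition is expected to refine the site-level values.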

Using this structure each web site administrator is able to fully categorise their web site to facilitate an easy and accurate centralised searching which can include their web site.

The metadata can also include further information which, although not used in the searching, can be used for other centralised administrative purposes. For example, the metadata file can indicate the email address of the web master, i.e. the site administrator. This can be used for the automatic generation and transmission of reports to the administrator as will be described hereinafter. Also, the metadata file can include an index parameter indicating the default index file name. All the web servers support a directory index file which is returned by the server when a directory URL is requested. Thus an HTTP request with the URL "http://www.cre.canon.co.uk/" will return the index.html page as if the request was for "http://www.cre.canon.co.uk/index.html".

Further, the metadata file can include an encoding attribute which is required for pages where the text content of the page is encoded using a non-ISO 8859 coding. This typically applies to Japanese web pages. Table 5 below gives possible encoding attribute values:

TABLE 5

    Encoding Attribute Value    Meaning
    jis                         JIS Encoding
    sjis                        Shift-JIS
    euc                         Extended Unix Code (EUC)

The operation of the web robot 33 in the master server 5 will now be described with reference to Figures 5 to 11.

Figure 5 is a flow diagram illustrating the steps carried out when the robot visits a site. When the database has been formed but is to be populated for the first time, the database entries will simply comprise the web site directory URLs.

The database structure comprises a number of rows indexed by the web page URL. Each database row indexed by URL has the following database entries:

1. Site identifier
2. Last time the page was updated

Extracted Information
3. Keyword
4. Abstract
5. Title

Categories
6. Region
7. Country
8. Language
9. Product Type
10. Product Category
11. Page Type
12. Company Type

Other Attributes
13. Encoding

In addition the database contains information about the web site:

1. Web Master
2. Index
3. Time the robot last visited

Thus it can be seen that in addition to the categories, each URL has additional information.

Keywords, abstract and title are provided to allow for conventional keyword searching of the database in addition to the category searching technique.
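The row structure and the two kinds of search it supports can be sketched as below. This is an illustrative model only: the class name `PageRow`, the field names, and the `search` function are assumptions, and an in-memory list stands in for whatever database the embodiment actually uses.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class PageRow:
    """One database row, indexed by the web page URL."""
    url: str
    site_id: str
    last_updated: float = 0.0                       # last time the page was updated
    keywords: List[str] = field(default_factory=list)
    abstract: str = ""
    title: str = ""
    # region, country, language, product, prodcat, pagetype, cotype
    categories: Dict[str, str] = field(default_factory=dict)
    encoding: Optional[str] = None


def search(rows: List[PageRow], classification: Dict[str, str],
           keyword: Optional[str] = None) -> List[str]:
    """Return URLs whose categories match every requested classification
    value, optionally narrowed by a keyword found in the extracted text."""
    hits = []
    for row in rows:
        if any(row.categories.get(k) != v for k, v in classification.items()):
            continue
        if keyword is not None:
            haystack = " ".join([row.title, row.abstract, *row.keywords]).lower()
            if keyword.lower() not in haystack:
                continue
        hits.append(row.url)
    return hits
```

A query such as `search(rows, {"prodcat": "printer", "lang": "en"})` would then return only the URLs classified as English-language printer pages.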

Referring to Figure 5, when the robot 33 visits a web site, in step S1 a URL list is generated for the site from the database. Thus, if the database has been completed previously and the robot is returning to update the database, all of the URLs previously visited by the robot will be listed.

In step S2 the first URL is selected from the list and in step S3 a page visit time for a database row corresponding to a URL is updated.

In step S4 an HTTP HEAD request is made for the web page. It is at this point that in step S5 a determination can be made as to whether the page no longer exists. If the page no longer exists, an error will be returned as a result of the request. If the page no longer exists, in step S7 the row corresponding to the URL is deleted from the database and the process proceeds to step S15 where it is determined whether there are any subsequent URLs in the list. If not, in step S17 reports are generated as will be described in more detail hereinafter and in step S18 the process terminates. If there are further URLs to be processed in the list, in step S16 the next URL on the list is selected and the process returns to step S3.

If in step S5 it is determined that the page still exists because there is a response to the request, in step S6 the time the page was last modified is extracted and compared with the last time the page was updated given in the database for the URL. If this indicates in step S8 that the page has not changed, in step S14 it is determined whether the metadata for the page has changed. If not, in step S15 it is determined whether there are any subsequent URLs in the list and if so the next URL in the list is selected in step S16 and the process returns to step S3. Otherwise in step S17 reports are generated as will be described in more detail hereinafter and the process terminates in step S18.

If in step S14 it is determined that the metadata file has changed, in step S19 the category fields for the corresponding row in the database are determined using the metadata file as will be described in more detail with reference to Figure 6. The category fields for the row corresponding to the current URL are then replaced in step S20 with the new category fields and the process returns to step S15.

If in step S8 it is determined that the page has changed, in step S9 the page content is retrieved. In step S10 the keywords, abstract, title and page links are extracted or generated. In step S11 it is then determined whether there are any new hypertext links and if so in step S12 whether the new links are within the current web site. If in step S11 or step S12 it is determined that there are no new links within the site, in step S19 the category fields for the corresponding row are determined using the metadata file and the process proceeds to step S20. If in step S12 it is determined that there are new links within the site, in step S13 the link URLs are added to the URL list and the process proceeds to step S19.

In this way for each site the robot is able to process the metadata for each web page in order to build up or update the database to provide data in each of the fields for each URL row.
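The per-site update loop of Figure 5 can be sketched roughly as follows. This is an illustration under stated assumptions, not the robot's actual code: the database is modelled as a dictionary from URL to row, and network access is injected through the hypothetical callables `head` (returns the last-modified time, or `None` if the page no longer exists), `fetch` (returns the extracted fields and the page's links) and `categorise` (step S19). Report generation (step S17) is omitted.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple
from urllib.parse import urlparse


def visit_site(db: Dict[str, dict],
               site_host: str,
               head: Callable[[str], Optional[float]],
               fetch: Callable[[str], Tuple[dict, List[str]]],
               categorise: Callable[[str], dict],
               metadata_changed: bool) -> None:
    """Update `db` (URL -> row) for one site, loosely following S1 to S20."""
    urls = [u for u in db if urlparse(u).netloc == site_host]   # S1
    seen = set(urls)
    i = 0
    while i < len(urls):                                        # S2, S15, S16
        url = urls[i]
        i += 1
        row = db.setdefault(url, {"last_updated": 0.0})
        row["visited"] = time.time()                            # S3
        modified = head(url)                                    # S4
        if modified is None:                                    # S5: page gone
            del db[url]                                         # S7
            continue
        if modified <= row["last_updated"]:                     # S6, S8: unchanged
            if metadata_changed:                                # S14
                row["categories"] = categorise(url)             # S19, S20
            continue
        info, links = fetch(url)                                # S9, S10
        row.update(info)
        row["last_updated"] = modified
        for link in links:                                      # S11
            same_site = urlparse(link).netloc == site_host      # S12
            if same_site and link not in seen:
                seen.add(link)
                urls.append(link)                               # S13
        row["categories"] = categorise(url)                     # S19, S20
```

Injecting the network calls keeps the control flow testable without a live web server; links pointing outside the site are simply never added to the URL list.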

Figure 6 is a flow diagram illustrating the steps of step S19 in more detail. The process starts in step S30. In step S31 it is determined whether there is a metadata file at the remote web site. If not, in step S32 it is determined whether there is a metadata file at the master server 5. If not, in step S37 all the category fields are nulled and the process terminates in step S38. If in step S31 or step S32 it is determined that there is a metadata file, in step S33 the metadata file is searched using the URL for the first condition in the metadata file which matches the URL. Considering for example the sample metadata file below:

    site
    {
        country=gb
        region=eu
        language=en
        company type=research
        index=index.html
        webmaster=webmaster@cre.canon.co.uk
    }

    /download/
    {
        pagetype=download
    }

    /printer/
    {
        pagetype=productinfo
        product=printer
    }

If the current URL is http://www.cre.canon.uk/printer/colorprinter.html, in step S34 a match is found at the site level and thus in step S36 the categories country, region, language and company type are set to the values given in the metadata file and the attributes "index" and "webmaster" are also set in step S36. In step S35 it is determined that there are more conditions given in the metadata file. Thus in step S34 it is determined whether the next condition /printer/ matches. Since it does, in step S36 the category fields "pagetype" and "product" are set to the values given in the metadata file and in step S35 it is determined whether there are any more conditions. Since there are no more conditions the process terminates in step S38.
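The condition-matching loop of Figure 6 can be sketched as below. The text does not define the pattern-matching rule precisely, so this sketch assumes that `site` matches every page and that any other condition, minus its leading `/`, must occur somewhere in the URL path. The names `condition_matches` and `resolve_fields` are hypothetical, and the metadata is passed in already parsed as an ordered list of (condition, attributes) pairs.

```python
from typing import Dict, List, Optional, Tuple
from urllib.parse import urlparse

Section = Tuple[str, Dict[str, str]]   # (condition, attributes)


def condition_matches(condition: str, url: str) -> bool:
    """Assumed rule: 'site' matches all pages; otherwise the condition
    (without its leading '/') must appear in the URL path."""
    if condition == "site":
        return True
    return condition.lstrip("/") in urlparse(url).path


def resolve_fields(sections: Optional[List[Section]], url: str) -> Dict[str, str]:
    """Apply every matching condition in file order (steps S33 to S36).

    Returns an empty dict when no metadata file is available (step S37).
    Later matches override earlier ones, so group or page values refine
    the site-level values.
    """
    fields: Dict[str, str] = {}
    if sections is None:
        return fields
    for condition, attributes in sections:      # S34, S35
        if condition_matches(condition, url):
            fields.update(attributes)           # S36
    return fields
```

For the sample file and URL above, the site-level attributes are set first, `/download/` is skipped, and `/printer/` then adds the page type and product.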

Reports can be generated by the master server 5 in step S17 at the end of the traversal of the web servers 2a, 2b and 2c by the web robot 33. The reports comprise a traversal log indicating, for example, how long the traversal took and which pages were traversed.

The master server 5 can also generate reports for many facets of the implementation of the system. For example, it can monitor the percentage of search requests which included a particular classification and it can indicate which categories are used rarely by the metadata files at the web sites.

The reports can be made available on the master server 5, accessible for instance by password access. Also, the reports can be sent to site webmasters automatically using the webmaster field data in the metadata database 6.

Since the robot only updates the master metadata database, in order for the public metadata databases 9a and 9b to benefit from the updates, they must be synchronised to the master metadata database.

Figure 7 is a flow diagram illustrating the steps which are taken periodically in order to synchronise the metadata databases. In step S40 the robot runs to update the metadata database on the master server. Then in step S41 the metadata database is copied to each slave server. At each slave server the previous metadata database is archived in step S42 for administrative purposes, to allow checking on previous states of the database.

In addition to completing the category fields and the attribute fields webmaster, index and encoding for each URL, the robot also operates in conjunction with an abstractor 34 to generate an abstract of the text content of the web pages. The robot is also able to extract the title of the web page for entry into the title field and keywords for entry in the keywords field.

The operation of the abstractor will now be described with reference to Figures 8 and 9.

In step S50 the HTML for the web page is input and in step S51 this is parsed. The parsing process generates the parse tree as illustrated in Figure 9. The parsing process operates simply by identifying tags within the HTML in order to generate the parse tree.

In step S52 segments or "chunks" of text in the HTML tree are located and then in step S53 the "chunks" are split into sentences by identifying "whitespace" (spaces, tabs or blank characters) following punctuation which normally finishes a sentence. In step S54 a score is assigned to the first sentence in each chunk, dependent upon its parent tags. Table 6 below gives the scores used in an embodiment.

TABLE 6

    HTML Element      Score
    Paragraph (P)     100
    Heading 1 (H1)    90
    Heading 2 (H2)    80
    Heading 3 (H3)    70
    Default           0

In step S55 all the sentences are sorted by score and in step S56 all of the text comprising the sentences is truncated at a predetermined size of N bytes. In this way only the significant text is kept as an abstract. In step S57 the sentences are then sorted in dependence upon their original order in the page and output in step S58 as an abstract of N bytes. This abstract can then be entered in the abstract field in the URL row.
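Steps S53 to S58 can be sketched as follows. This is a minimal sketch under assumptions that the description does not state: sentences other than the first in a chunk are given the default score of 0, and truncation keeps whole sentences rather than cutting at exactly N bytes. The names `TAG_SCORES`, `split_sentences` and `make_abstract` are hypothetical.

```python
import re

TAG_SCORES = {"p": 100, "h1": 90, "h2": 80, "h3": 70}  # Table 6; default is 0


def split_sentences(chunk):
    """Step S53: split after sentence-ending punctuation plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", chunk.strip()) if s]


def make_abstract(chunks, max_bytes):
    """chunks is a list of (parent_tag, text) pairs in page order."""
    scored = []  # entries are (score, original position, sentence)
    position = 0
    for tag, text in chunks:
        for i, sentence in enumerate(split_sentences(text)):
            # Step S54 (assumption: only the first sentence gets the tag score).
            score = TAG_SCORES.get(tag.lower(), 0) if i == 0 else 0
            scored.append((score, position, sentence))
            position += 1
    # Step S55: highest score first, ties kept in page order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    # Step S56: keep whole sentences until the byte budget would be exceeded.
    kept, used = [], 0
    for _, pos, sentence in scored:
        size = len(sentence.encode("utf-8")) + (1 if kept else 0)
        if used + size > max_bytes:
            break
        kept.append((pos, sentence))
        used += size
    kept.sort()  # step S57: restore the original page order
    return " ".join(sentence for _, sentence in kept)
```

Sorting twice, first by score and then by position, is what lets the abstract read in page order while still containing only the highest-scoring sentences.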

Thus, as described hereinabove, the web robot is able to periodically traverse the predetermined sites in order to read the metadata. Thus, if the web pages have changed or if the metadata file has changed (e.g. the site administrator has modified the classification of the information at the web site) then the database will be updated. Thus the use of the metadata file enables the compilation of a database for each URL which includes classification data which can be searched using classification queries, as will be described in more detail hereinafter.

The operation of this embodiment of the present invention in response to a classification query will now be described with reference to Figures 10 and 11.

Figure 10 is a functional diagram of a client 40 and slave server 42. The client 40 implements a web browser 41 which generates an HTTP request to a web server 43 implemented within the slave server 42. When the web browser 41 accesses the search web page, a user interface is provided allowing a user to select or enter one or more classifications to be searched. This interface takes the form of the display of the web page with selectable options and a search button. When a user selects the search button an HTTP request is generated which is routed to the slave server 42. The request comprises a URL prefix, e.g. http://www.cre.canon.co.uk/, to identify the slave server 42. The request also includes a suffix which comprises the classification query, e.g. search?il=ja&prodcat=printer. In this example the product category is selected by the user. The language can be chosen or can be automatically selected dependent upon the web page from which the request is made. For instance, the user may be using the English search page, and thus it is reasonable to assume the user wishes the search results to be returned for English web pages.

The web server passes the suffix of the HTTP request as CGI to a perl script 44 implemented within the slave server 42. The perl script 44 converts the CGI to an SQL database query which is passed to the metadata database 45. In response to the SQL query the metadata database returns metadata comprising the URL and text, e.g. the abstract and/or title. Thus, in the example given above, the SQL query will request any rows having the category field prodcat=printer and the web page will be generated in Japanese.
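The conversion of the request suffix into an SQL query can be sketched as below. The embodiment uses a perl script; this Python sketch only illustrates the idea. The table name `pages`, the column names and the whitelist `CATEGORY_FIELDS` are assumptions, and SQLite stands in for whatever database engine is actually used.

```python
import sqlite3
from urllib.parse import parse_qs, urlsplit

# Assumed category columns; anything else in the suffix (such as the
# interface-language parameter) is not treated as a category filter.
CATEGORY_FIELDS = {"prodcat", "pagetype", "country", "region", "language"}


def build_query(suffix):
    """Turn a suffix such as 'search?prodcat=printer' into SQL and arguments."""
    params = parse_qs(urlsplit(suffix).query)
    wanted = {k: v[0] for k, v in params.items() if k in CATEGORY_FIELDS}
    sql = "SELECT url, title, abstract FROM pages"
    if wanted:
        # Column names come only from the whitelist; values are bound as
        # parameters, so user input is never spliced into the SQL text.
        sql += " WHERE " + " AND ".join(f"{k} = ?" for k in sorted(wanted))
    return sql, [wanted[k] for k in sorted(wanted)]


def search(conn, suffix):
    """Run the classification query and return (url, title, abstract) rows."""
    sql, args = build_query(suffix)
    return conn.execute(sql, args).fetchall()
```

Restricting the filter to known category columns is a design choice of this sketch; it keeps arbitrary query parameters from becoming column names.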

The perl script 44 can then use the metadata returned to form HTML which can be passed to the web server 43. Where a large number of URLs are returned matching the SQL query, the perl script can group the URLs into groups of, for example, 10 to provide a search result interface of 10 "hits" at a time, as is conventional with search engines. The HTML formed is returned by the web server 43 to the web browser 41, which will generate a web page to be viewed at the client 40 to view the URLs and abstract and/or title. The HTML generated by the perl script 44 will provide hypertext links to the web pages identified by the URLs within the HTML, as conventionally provided in web search engines.
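Grouping the hits and forming the result HTML can be sketched as follows. The function name `render_page` and the list markup are illustrative assumptions, not the HTML actually produced by the perl script 44.

```python
from html import escape


def render_page(rows, page, page_size=10):
    """Render one group of hits; rows are (url, title, abstract) tuples."""
    start = page * page_size
    items = []
    for url, title, abstract in rows[start:start + page_size]:
        # Escape all database text so it cannot break the generated markup.
        items.append(
            f'<li><a href="{escape(url, quote=True)}">{escape(title)}</a> '
            f"{escape(abstract)}</li>"
        )
    return "<ul>\n" + "\n".join(items) + "\n</ul>"
```

With 25 matching rows, pages 0 and 1 each carry ten hits and page 2 carries the remaining five.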

The slave server 42 is also provided with a translation database 46. Since the metadata database can contain text in the keywords, abstract and title fields in many different languages from many different sites, the user of the client machine may not be able to understand the text in the results of the search. Therefore, the request from the client 40 to the slave server 42 can include an indication of a preferred language in which text is to be returned. This indication can either take the form of an identification of the origin of the request, e.g. a request may have come from an English client, or a request can include an additional parameter identifying one or more preferred languages. When the perl script 44 receives the CGI from the web server 43 including information identifying a preferred language, and when the text is returned with the metadata from the metadata database 45, if any of the text is not in the preferred language, the perl script can generate a POSIX standard "gettext" API command to the translation database to return any translated text in the preferred language. If text is present in the translation database 46, translated text is returned to the perl script 44. Where more than one preferred language is indicated, preferences can be ranked and the gettext command can be ranked so as to return higher ranked language translations before lower ranked language translations. The text sent to the translation database 46 for translation is in segments of the text, e.g. phrases or sentences. Thus if not all of the text can be translated into a preferred language, the perl script can still construct HTML for a web page which combines segments of translated and any untranslated text. This is then passed by the web server 43 to the web browser 41.
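The ranked, segment-by-segment lookup can be sketched as below. A plain dictionary stands in for the translation database 46 and for the "gettext"-style call; the names `TRANSLATIONS`, `translate_segment` and `translate_text`, and the sample entries, are hypothetical.

```python
# Hypothetical translation table: (source segment, language) -> translation.
TRANSLATIONS = {
    ("Farbdrucker.", "en"): "Colour printer.",
    ("Farbdrucker.", "ja"): "カラープリンター。",
}


def translate_segment(segment, preferred, table):
    """Return the translation in the highest ranked language that has one."""
    for lang in preferred:  # preferred is ordered from most to least wanted
        hit = table.get((segment, lang))
        if hit is not None:
            return hit
    return segment  # no translation available: keep the original text


def translate_text(segments, preferred, table):
    """Combine translated and untranslated segments into one text."""
    return " ".join(translate_segment(s, preferred, table) for s in segments)
```

Because the lookup works per segment, a text whose segments are only partly covered by the table still comes back partly translated rather than not at all.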

Figure 11 is a flow diagram illustrating the steps carried out in the operation of the process of Figure 10. In step S60 a client HTTP request is made to the web server identifying one or more preferred languages. In step S61 the perl script generates an SQL query to the metadata database from the suffix of the HTTP request. In step S62 metadata is returned from the metadata database and in step S63, for the text in the page (including static page text and metadata), the "gettext" API command is used to retrieve any translated text in the translation database. In step S64 the perl script then forms the HTML for the web page and in step S65 the web server responds to the client request with the formed HTML.

In this embodiment the HTML for the web page can be built up sequentially as text is retrieved from the metadata database and translated. Alternatively, all of the text can be retrieved and subsequently translated.

When the translation database 46 in Figure 10 does not contain a translation of the text into a preferred language, the perl script 44 can upload a copy of the text into the translation database. The translation database 46 holds the text with an empty translation field to allow a translation of the text to be entered in the database field at a later date. The fact that there are empty fields in the translation database indicates that translation work is required in order to complete the database. In this way the translation database 46 is continually added to and improved.
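The "look up, and queue if missing" behaviour can be sketched as follows. The schema (a `translations` table with a nullable `translated` column) and the function names are assumptions made for illustration, with SQLite standing in for the translation database 46.

```python
import sqlite3


def open_db():
    """Create an in-memory stand-in for the translation database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE translations ("
        "source TEXT, lang TEXT, translated TEXT, "
        "PRIMARY KEY (source, lang))"
    )
    return conn


def lookup_or_queue(conn, source, lang):
    """Return the translation, or None after queuing the text for translation."""
    row = conn.execute(
        "SELECT translated FROM translations WHERE source = ? AND lang = ?",
        (source, lang),
    ).fetchone()
    if row is None:
        # Store the text with an empty translation field for later completion.
        conn.execute(
            "INSERT INTO translations (source, lang, translated) "
            "VALUES (?, ?, NULL)",
            (source, lang),
        )
        return None
    return row[0]  # still None while the translation field is empty


def pending(conn):
    """List the (source, lang) pairs that still need a translation."""
    return conn.execute(
        "SELECT source, lang FROM translations "
        "WHERE translated IS NULL ORDER BY source, lang"
    ).fetchall()
```

The rows returned by `pending` correspond to the empty fields that indicate outstanding translation work.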

In addition to allowing the searching for web pages using the classification data, the system allows conventional keyword searching of the content of web pages. Thus a user can generate a combined search query which includes selected classifications and keywords. The entered keywords can be compared with the keyword, abstract and/or title fields in the metadata database.

The method of adding translations to the translation database will now be described with reference to Figures 12 and 13.

Figure 12 is a functional diagram illustrating the process of adding to the translation database. A translator's machine 50 implements a web browser 52 which interfaces over the internet with a web server 53 hosted on a server such as the slave server 51. The translator's machine 50 is able to input translated text as a suffix to an HTTP request. The web server 53 extracts the suffix and passes it as CGI to the translator's interface 54. The translator's interface is then able to enter the translated text into the translation database 55 as an SQL entry. When the translator's machine 50 requests text to be translated, this is output from the translation database to the translator's interface 54, and the translator's interface 54 generates HTML for a web page which is passed by the web server 53 to the web browser 52 and contains the text to be translated.

The operation of the system of Figure 12 will now be described with reference to the flow diagram of Figure 13.

In step S70 a translator accesses the translator's log-on page on the web server. In step S71 the web server returns HTML which is interpreted by the web browser 52 to generate a log-on interface. In step S72 the translator logs on by providing a username and password to identify themselves. In step S73 the web server receives the log-on information. In step S74 the text requiring translation is retrieved from the translation database by the translator's interface. The translator's interface then forms HTML for the web page in step S75 and in step S76 the web server sends the HTML to the translator's machine. In step S77 the translator inputs the translated text at the translator's machine and in step S78 the translated data is received by the web server 53. The translated data is passed to the translator's interface, which inputs the translated text into the translation database in step S79.

Thus, as is apparent, this feature of the present invention enables the translation database to be updated offline. Further, because of the web interface, the translation database can be accessed from any machine over the internet. Once a translation database has been updated, the other copies of the translation database must also be updated in a synchronising phase.

Although in this embodiment the metadata has been described as being provided in a metadata file for a site, in an alternative embodiment the metadata can be provided for each web page within META tags in the HTML for the web page.

The translation of text in this embodiment is not limited to the translation of text obtained from the database, i.e. the dynamic text. The text in the HTML for the page, i.e. the static text, can also be translated.

The preferred language(s) can either be positively selected by a user or assumed from the web search page providing the interface to the user. For example, if a user is viewing the English language version of the search page rather than the Japanese version, it is reasonable to assume that the user wishes to receive text in English. Another method of automatically selecting the preferred language is to use the referrer URL, i.e. the identification of where the search request was made from. This will identify a machine or server originating the search request, and can be used to select the preferred language if the language requirements of potential users of the server are known. Further, HTTP provides a mechanism whereby any given HTTP request can specify one or more preferred languages. Most web browsers allow a user to specify this as a personal preference.
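The HTTP mechanism referred to is the Accept-Language request header, in which each language tag may carry a quality value `q` between 0 and 1. A simplified sketch of turning such a header into a ranked list of preferred languages is given below; `parse_accept_language` is a hypothetical helper and does not handle every detail of the header grammar.

```python
def parse_accept_language(header):
    """Return language tags ordered from most to least preferred."""
    ranked = []
    for index, part in enumerate(header.split(",")):
        piece = part.strip()
        if not piece:
            continue
        tag, _, params = piece.partition(";")
        quality = 1.0  # a tag without a q parameter has quality 1
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0  # treat a malformed quality as "not acceptable"
        if quality > 0:
            # Sort by descending quality, then by position in the header.
            ranked.append((-quality, index, tag.strip().lower()))
    ranked.sort()
    return [tag for _, _, tag in ranked]
```

The resulting list can be used directly as the ranking of preferred languages when looking up translations.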

Also, the present invention is not limited to a system having geographically separate sites. The present invention is applicable to a single machine in which the logical locations provide separate information sources, e.g. separate databases in a single machine. The information which can be searched for by the present invention is not limited to web pages or even to text. The information can comprise any type of information, e.g. video, audio, images, or multimedia data, for which there is provided metadata classifying the information. The benefit of the present invention lies in the fact that the classification of the information is performed and maintained at the information source.

A second embodiment of the present invention will now be described with reference to Figures 14 and 15.

In this embodiment of the present invention an application 60 requiring text to be translated outputs the request for text to be translated to a translation API 61. This is transmitted to a translation database 62 and, if a translation is available, this is returned to the translation API 61, which in turn returns the translation to the application 60. A translator's interface 63 is provided to the translation database 62 to enable the translation database to be added to and updated. The application 60 is also able to send to the translation API 61 text which could not be translated. The translation API 61 then inputs this text into the translation database 62, whereupon the text can be accessed by the translator's interface 63 to allow the input of a translation for that text in the translation database 62.

Figure 15 is a flow diagram of the operation of the embodiment of Figure 14. In step S80 the application requests text to be translated and identifies one or more preferred languages. In step S81 the translation interface requests and receives any translations for the requested text in the translation database. In step S82 the translation API 61 then responds to the request from the application with any translated text which has been provided by the translation database 62.

This embodiment of the present invention is applicable to any computer implemented application which processes text. The application is able to request the translation of the text and, if necessary, receive only a partial translation. The separate database of translated segments of text is independent of the application and can be updated offline without affecting the operation of the application. This embodiment of the present invention can be implemented over a network, with the application 60 implemented on one computer and the translation API 61 and translation database 62 provided on another computer. The translation interface can then be provided over the network in a similar manner to that described with regard to the first embodiment. This allows a remote translator to access the translation database over the network, e.g. the internet, and perform on-line translation of the text in the database.

Alternatively, the arrangement of Figure 14 can be provided in a single computer. The application 60 and the translation interface 61 can be implemented as computer program modules within the same computer and the database can be provided, for example, on the hard disk of the computer. The translator's interface 63 can also be provided as a program module on the same computer to allow the adding to or updating of the translation database 62, either manually or by automatic translation, at a later date.

This embodiment of the present invention is particularly suited to the translation of text in small sections. The use of sections of text in the translation database 62 increases the likelihood that at least some of the text can be translated, because a query using a short section of text is more likely to find a match in the database than a query using a lengthy passage.

It will thus be apparent to a person skilled in the art that the present invention can be implemented either in hardware or in suitably programmed general purpose computers. When implemented in software, the elements of the invention can be provided by software modules located on several computers over a network, e.g. a local area network or a wide area network such as the internet, or the software modules can be implemented within a single machine.

Since the present invention can be implemented by software, the present invention can be embodied as a medium such as a floppy disk, CD ROM, magnetic tape or EPROM, carrying computer implementable instructions for controlling a computer to carry out the process. Further, since the software can be transmitted over a network between computers, the present invention can be embodied as a signal carrying computer implementable instructions for controlling a computer to implement the process. The invention thus encompasses any form of software carrier medium carrying software for controlling a processor to implement the method.

Claims

CLAIMS:

1. A method of searching for information stored in a plurality of information units at a respective plurality of logical locations, the method comprising:

storing metadata for the information units in association with the information units, said metadata containing classification data for information units; reading said metadata to form a database of logical location identifications and corresponding classification data; receiving a classification query for said database; and responding to the classification query to return the identities of any logical locations matching the classification query to allow the retrieval of information from the or each logical location.

2. A method according to claim 1, wherein a group of one or more information units are stored at each of one or more logically distinct sites, said metadata containing classification data for the or each information unit at each respective logical site.

3. A method according to claim 2, wherein said metadata is stored for each information unit in association with each respective information unit.

4. A method according to claim 2, wherein said metadata is stored as a unit for the or each logical site for the respective group of information units.

5. A method according to claim 4, wherein said metadata stored at the or each logical site contains hierarchically arranged classification data for a plurality of information units stored at the or each logical site.

6. A method according to any one of claims 2 to 5, wherein said database contains logical site and information unit identifications and corresponding classification data and in response to a classification query, the identities of any logical sites and information units are returned.

7. A method according to claim 1, wherein said metadata is stored within said information units.

8. A method according to any preceding claim, wherein said classification data includes data on at least one of the subject of the information, geographical origins of the information, and language of the information.

9. A method according to any preceding claim, wherein said information includes hypertext, each logical location holds hypertext for a hypertext document with zero or more hypertext links to other hypertext documents, said metadata contains classification data for the or each hypertext document, and said database contains logical locations and document identifications and corresponding classification data.

10. A method according to claim 9, wherein said reading step comprises:

(a) reading a root hypertext document at each logical location and corresponding metadata to form a database entry for the root hypertext document, (b) identifying any hypertext links to hypertext documents, (c) reading the identified hypertext documents and corresponding metadata to form database entries for the identified hypertext documents, and (d) repeating steps (b) and (c) for the identified hypertext documents until there are no identified hypertext documents without a database entry.

11. A method according to claim 10, wherein only the identified hypertext documents at predetermined logical locations are read.

12. A method according to any preceding claim, wherein if said metadata for an information unit cannot be found at the logical location, another logical location is queried for said metadata.

13. A method according to any preceding claim, wherein said metadata can change, and said reading step is repeated to update said database.

14. A method according to any one of claims 2 to 6, wherein said information units comprise text on web pages, said logical locations comprise web page URLs, and said reading step is carried out by a web robot which visits each web page to read the metadata for each web page and to form said database on a server.

15. A method according to claim 14, including the step of copying said database to at least one further server, wherein a said further server receives said classification query and responds to return the identities of any logical locations matching the classification query.

16. A method according to any preceding claim, wherein said information contains text; the reading step further comprising the step of reading and adding at least some of the text to said database for at least one logical location, and the text in said database corresponding to the matched logical locations is returned in response to the classification query.

17. A method according to claim 16, wherein said reading and adding step comprises the steps of generating a precis of said text, and adding said precis to said database for at least one logical location.

18. A method according to claim 16 or claim 17, wherein said reading and adding step comprises the steps of identifying keywords in said text, and adding said keywords to said database for at least one logical location.

19. A method according to any one of claims 16 to 18, wherein said classification query includes an indication of a preferred language, the method including the steps of determining the language of the text returned from said database in response to the classification query, and, if the text is not in the preferred language, obtaining a translation of at least a part of the text to return in response to the classification query.

20. A method according to claim 19, wherein said translation is obtained by looking up the text in a translation database.

21. A method according to claim 20, wherein any text for which there is no translation is added to said translation database for a translation to be added therefor.

22. Information source apparatus for use in the method of any preceding claim as at least one logical location for information units, the apparatus comprising:

storage means for storing information units and metadata for said information units, said metadata containing classification data for said information units; and means to allow the reading of said metadata to form a database of logical location identifications.

23. Database apparatus for use in the method of any one of claims 1 to 21, the apparatus comprising:

reading means for performing the reading step to read said metadata associated with the information units to form said database of logical location identifications and corresponding classification data; and database storage means for storing said database.

24. Information search apparatus for use in the method of any one of claims 1 to 21, the apparatus comprising:

receiving means for performing the receiving step of receiving a classification query for said database; and responding means for performing the responding step of responding to the classification query to return the identities of any logical locations matching the classification query to allow the retrieval of information from the or each logical location.

25. A system for searching for information stored in a plurality of information units at a respective plurality of logical locations, the system comprising: database forming means for reading metadata for an information unit, said metadata containing classification data for information units, and for forming a database of logical location identifications and corresponding classification data; database storage means for storing the formed database; receiving means for receiving a classification query for said database; and responding means for responding to the classification query to return the identities of any logical locations matching the classification query to allow the retrieval of information from the or each logical location.

26. A system according to claim 25, wherein a group of one or more information units are stored at each of one or more logically distinct sites, and said database forming means is adapted to read said metadata containing classification data for the or each group of information units at each respective logical site.

27. A system according to claim 26, wherein said database forming means is adapted to read said metadata for each information unit associated with each respective information unit.

28. A system according to claim 27, wherein said metadata stored at at least one logical site contains hierarchically arranged classification data for a plurality of groups of information held at the or each logical site.

29. A system according to any one of claims 26 to 28, wherein said database forming means is adapted to form said database to contain logical site and group identifications and corresponding classification data, and said responding means is adapted to respond to a classification query by returning the identities of any sites and groups matching the classification query.

30. A system according to claim 25, wherein said metadata is stored with said information.

31. A system according to any one of claims 25 to 30, wherein said information includes hypertext, each logical location holds hypertext for at least one hypertext document with one or more hypertext links to other hypertext documents, said metadata contains classification data for the or each hypertext document, and said database means stores said database containing logical location and document identifications and corresponding classification data.

32. A system according to claim 31, wherein said database forming means is adapted to:

(a) read a root hypertext document at each logical location and corresponding metadata to form a database entry for the root hypertext document, (b) identify any hypertext links to hypertext documents, (c) read the identified hypertext documents and corresponding metadata to form database entries for the identified hypertext documents, and (d) repeat steps (b) and (c) for the identified hypertext documents until there are no identified hypertext documents without a database entry.

33. A system according to claim 32, wherein said database forming means is adapted to read only the identified hypertext documents at predetermined logical locations.

34. A system according to any one of claims 25 to 33, wherein said metadata at at least one logical location can change, and said database forming means is adapted to repeatedly operate to update said database.

35. A system according to any one of claims 25 to 34, wherein if said database forming means cannot find said metadata at a said logical location, said database forming means is adapted to read said metadata from an alternative logical location.

36. A system according to any one of claims 25 to 35, wherein said information units comprise web pages, said logical locations comprise web page URLs, said database forming means includes a web robot for visiting each web page to read the metadata for each web page, and said database means is provided on a server.

37. A system according to claim 36, including at least one further server for receiving a copy of said database, for receiving said classification query, and for responding to return the identities of any logical locations matching the classification query.

38. A system according to any one of claims 25 to 37, wherein the information at at least one logical location contains text, said database forming means being adapted to read and add at least some of the text to said database for the or each logical location containing text, and said responding means is adapted to return the text in said database corresponding to the matched logical location in response to the classification query.

39. A system according to claim 38, wherein said database forming means is adapted to generate a precis of said text, and to add said precis to said database for the or each logical location containing text.

40. A system according to claim 38 or claim 39, wherein said database forming means is adapted to identify keywords in said text, and to add said keywords to said database for the or each logical location containing text.

41. A system according to any one of claims 38 to 40, wherein said receiving means is adapted to receive the classification query including an indication of a preferred language, the system including determining means for determining the language of the text returned in response to the classification query from said database, and translation means for obtaining a translation of at least a part of the text to return in response to the classification query if the text is not in the preferred language.

42. A system according to claim 41, wherein said translation means includes a translation database containing translations of text, and lookup means for looking up the text in said translation database to obtain the translation thereof.

43. A system according to claim 42, wherein said translation means is adapted to add any text for which there is no translation to said translation database for a translation to be added therefor.

44. A system according to claim 43, wherein said translation means includes a translator's interface to allow the input of translations for untranslated text in said translation database.

45. Instruction code for causing a computer to be configured as the apparatus of any one of claims 22 to 24.

46. A carrier medium carrying the instruction code according to claim 45.

47. An automated method of obtaining translations of text, the method comprising:

sending text to be translated to a translation database;
looking-up any translations of the text in said database;
retrieving any translation of the text; and
if the text has no translation in said translation database, adding the text to said translation database for a translation to be added.
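The steps of claim 47 can be sketched as follows. This is a minimal in-memory illustration, not the claimed implementation: the class name `TranslationDatabase`, its method names and the use of language codes as keys are assumptions introduced for the example.

```python
from typing import Dict, List, Optional


class TranslationDatabase:
    """Hypothetical in-memory stand-in for the translation database."""

    def __init__(self) -> None:
        # Maps source text -> {language code: translation}.
        self._entries: Dict[str, Dict[str, str]] = {}

    def look_up(self, text: str, language: str) -> Optional[str]:
        """Return any stored translation of the text.

        If the text is not yet in the database it is added with no
        translations, so that a translation can be supplied later.
        """
        translations = self._entries.setdefault(text, {})
        return translations.get(language)

    def untranslated(self, language: str) -> List[str]:
        """List stored texts that have no translation in the given language."""
        return [t for t, tr in self._entries.items() if language not in tr]

    def add_translation(self, text: str, language: str, translation: str) -> None:
        """Record a translation, for example one entered by a translator."""
        self._entries.setdefault(text, {})[language] = translation
```

In this sketch a first `look_up("Hello", "fr")` returns `None` and leaves "Hello" listed by `untranslated("fr")`; after `add_translation("Hello", "fr", "Bonjour")` the same look-up returns the stored translation.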

48. An automated method according to claim 47, including accessing said translation database to identify any text for which there is no translation, and receiving a translation for at least some of said text.

49. An automated method according to claim 47 or claim 48, wherein said translation database can contain translations for text in several languages, the method including sending an indication of at least one preferred language to said translation database, and controlling the look-up operation to look-up any translation of the text in at least one preferred language.

50. An automated method according to claim 49, wherein said indication includes a ranking of the preference of a plurality of preferred languages, and the look-up operation is controlled to look-up any translation of the text in the highest ranked language and, if no translation is found, to look-up any translation in the next ranked language until a translation is found or there are no preferred languages left in the ranking and no translation has been found, whereupon an indication that no translation was found is returned.
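The ranked look-up of claim 50 can be sketched as a loop over the preferred languages in order. The function name `look_up_ranked`, the dictionary representation of stored translations and the use of `None` as the "no translation found" indication are assumptions made for this illustration.

```python
from typing import Dict, Optional, Sequence, Tuple


def look_up_ranked(
    translations: Dict[str, str],
    ranked_languages: Sequence[str],
) -> Optional[Tuple[str, str]]:
    """Try each preferred language from highest to lowest rank.

    translations maps a language code to the stored translation of one text.
    Returns (language, translation) for the first ranked language that has a
    translation, or None when the ranking is exhausted without a match.
    """
    for language in ranked_languages:
        if language in translations:
            return language, translations[language]
    return None
```

For example, with stored translations in German and Spanish and a ranking of French, Spanish, German, the sketch returns the Spanish translation, since French has none and Spanish is ranked above German.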

51. An automated method according to claim 49 or claim 50, wherein said indication comprises an indication of the location of the origin of the text to be translated.

52. An automated method according to claim 49 or claim 50, wherein said indication comprises an identification of one or more preferred languages.

53. An automated method according to any one of claims 47 to 52, for obtaining a translation of text in web pages, the method including receiving a web page request and an indication of at least one preferred language for text in the web page, locating and retrieving the web page requested, extracting the text for translation from the web page, reconstructing said web page with any translations found of the text, and responding to the web page request with the reconstructed web page.

54. An automated method according to claim 53, including accessing said translation database using a web interface to identify any text for which there is no translation, and entering a translation for at least some of said text.
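The extraction and reconstruction steps of claim 53 can be sketched with Python's standard `html.parser` module. This is an illustration under stated assumptions: the class `PageTranslator`, the function `translate_page` and the `translate` callable (which returns a translation or `None`) are hypothetical names, and retrieval of the page and handling of the web page request are left out.

```python
from html import escape
from html.parser import HTMLParser
from typing import Callable, List, Optional


class PageTranslator(HTMLParser):
    """Rebuilds a page, replacing text nodes that have a translation."""

    def __init__(self, translate: Callable[[str], Optional[str]]) -> None:
        super().__init__(convert_charrefs=True)
        self._translate = translate
        self._out: List[str] = []
        self._raw_depth = 0  # greater than zero inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._raw_depth += 1
        self._out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self._out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._raw_depth:
            self._raw_depth -= 1
        self._out.append(f"</{tag}>")

    def handle_comment(self, data):
        self._out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self._out.append(f"<!{decl}>")

    def handle_data(self, data):
        text = data.strip()
        if self._raw_depth or not text:
            self._out.append(data)  # script, style or whitespace: keep as is
            return
        translated = self._translate(text)
        replacement = text if translated is None else translated
        # Preserve the whitespace that surrounded the original text.
        lead = data[: len(data) - len(data.lstrip())]
        trail = data[len(data.rstrip()):]
        self._out.append(lead + escape(replacement, quote=False) + trail)

    def result(self) -> str:
        return "".join(self._out)


def translate_page(html_text: str, translate: Callable[[str], Optional[str]]) -> str:
    """Extract text from the page and reconstruct it with any translations."""
    parser = PageTranslator(translate)
    parser.feed(html_text)
    parser.close()
    return parser.result()
```

Text for which `translate` returns `None` is left unchanged in the reconstructed page; in a fuller system the same callable could also record that text in the translation database for later translation.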

55. A translation server apparatus comprising:
receiving means for receiving a web page request;
sending means for sending text to be translated to a translation database;
web page retrieval means for retrieving the requested web page;
extraction means for extracting the text for translation from the retrieved web page;
translation retrieval means for retrieving any translations of the text from a translation database;
reconstruction means for reconstructing the web page to include any translations for the text of the web page; and
responding means for responding to the web page request with the reconstructed web page.

56. A translation server apparatus according to claim 55, wherein said receiving means is adapted to receive an indication of at least one preferred language for text in the web page; and said translation retrieval means is adapted to retrieve any translations of the text from said translation database in accordance with the or each preferred language.

57. A translation database server apparatus comprising:

storage means for storing a translation database of text and translations for said text;
receiving means for receiving an indication of text to be translated;
look-up means for looking up any translations for said text;
responding means for responding to the received indication with any translations of the text; and
translator interface means for adding translations for text to said translation database.

58. A translation database server apparatus according to claim 57, including database adding means for adding the text to said translation database for a translation to be added therefor if the text has no translation in said translation database.

59. A text translation system comprising:

means for receiving text to be translated;
translation database means containing text and translations for said text;
look-up means for looking up any translations of the received text in said translation database means;
returning means for returning any translation of the text; and
database adding means for adding the text to said translation database means for a translation to be added later if the text has no translation in said translation database means.

60. A text translation system according to claim 59, including translator interface means for allowing the input of translations for text having no translation into said translation database means.

61. A text translation system according to claim 59 or claim 60, wherein said translation database means is adapted to hold translations of text in several languages, said receiving means is adapted to receive an indication of at least one preferred language, and said look-up means is adapted to look-up any translations of the text in the at least one language.

62. A text translation system according to claim 61, wherein said indication includes a ranking of the preference of a plurality of preferred languages, said look-up means is adapted to look-up any translation of the text in the highest ranked language, and if no translation is found, to look-up any translation in the next ranked language until a translation is found or there are no preferred languages left in the ranking and no translation has been found, whereupon said returning means is adapted to return an indication that no translation was found.

63. A text translation system according to claim 61 or claim 62, wherein said receiving means is adapted to receive said indication comprising an indication of the location of the origin of the text to be translated.

64. A text translation system according to claim 61 or claim 62, wherein said receiving means is adapted to receive said indication comprising an identification of one or more preferred languages.

65. A text translation system according to any one of claims 59 to 64 for obtaining a translation of text in web pages, the system including web page request receiving means for receiving a web page request, retrieval means for locating and retrieving the requested web page, extracting means for extracting the text for translation from the web pages, reconstructing means for reconstructing said web page with any translations found for the text content, and responding means for responding to the web page request with the reconstructed web page.

66. A text translation system according to claim 65, including web interface means for accessing said translation database means to identify any text for which there is no translation, and for entering a translation for at least some of said text.

67. Instruction code for controlling a computer to carry out the method of any one of claims 1 to 21 or 47 to 54.

68. Instruction code for controlling a computer to be configured as the server of any one of claims 1 to 21 or 55 to 58.

69. A carrier medium carrying instruction code according to claim 67 or claim 68.