EP2321743A1 - Method and apparatus for generating standard document identifiers from content references - Google Patents
Method and apparatus for generating standard document identifiers from content references
- Publication number
- EP2321743A1 (EP08798856A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- publication
- parser
- web site
- information
- script
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/9566—URL specific, e.g. using aliases, detecting broken or misspelled links
Definitions
- This invention relates to digital rights display and methods and apparatus for determining reuse rights for content to which multiple licenses and subscriptions apply.
- Works, or "content," created by an author are generally subject to legal restrictions on reuse. For example, most content is protected by copyright.
- content users often obtain content reuse licenses.
- a content reuse license is actually a "bundle" of rights, including rights to present the content in different formats, rights to reproduce the content in different formats, rights to produce derivative works, etc. Thus, depending on a particular reuse, a specific license to that reuse may have to be obtained.
- One of the problems in determining which rights apply to a given publication is connecting the publication to one or more agreements that convey rights so that the correct agreement can be examined to determine what rights are available to an organization.
- One prior art method for performing this connection is to embed a special "tag" in the publication.
- When the publication is later opened, for example, for examination, the tag can be activated to direct the user to a specific location, such as a web site, where rights agreements are located. While this arrangement is effective, it requires each publication to contain the special tag. While this might be feasible for newly published publications, it would be prohibitive to re-publish older publications with the special tag. Thus, this system would not work with many existing publications.
- a publication that the worker is viewing may contain a reference to another web page, such as an abstract or a bibliographic page that contains the aforementioned information. Since each webpage may have a different and unique format, it is difficult to determine even where on a particular page to look for the information necessary to identify the rights that are available (called content "metadata").
- the domain name of the website in which a knowledge worker is working is sent to a rights advisor website.
- the rights advisor website uses the domain name to obtain a parser program that is specific to the domain.
- the parser program is then sent back to the browser on which the knowledge worker is viewing information and extracts content metadata from the website in which the knowledge worker is working.
- the extracted content metadata is returned to the rights advisor website and used to determine rights associated with the publications.
- the parser program extracts content metadata from the webpage displayed in the browser.
- the parser program navigates from the webpage that is being displayed in the browser to another webpage that contains content metadata and extracts the content metadata from that other page.
- content metadata such as a publication title
- content metadata is "normalized" to obtain a standard identifier that is, in turn, used to determine rights for the content.
- when a URL is associated with content, the rights advisor website attempts to locate a standard identifier for the content using that URL. At the same time, it attempts to obtain a parser program that is specific to the website domain and that can look for information associated with the content on the webpage.
- Figure 1 is a block schematic diagram illustrating in a high level form the basic architecture of the inventive rights location system.
- Figure 2 is a table array in a database for storing bookmarklet script information.
- Figure 3 is a typical screen display presented by a conventional search engine in a web browser.
- Figure 4 is a typical display of content located by the search engine when the keywords "nature methods" have been entered into the text box. This figure shows the hyperlinks to a rights advisor web page.
- Figure 5 is a flowchart showing the steps in an illustrative process for determining and resolving rights for a requested type of use.
- Figure 6 is a block schematic diagram illustrating the components of an agreement.
- Figure 7 is a block schematic diagram illustrating the components in a publication identifier location apparatus.
- Figures 8A and 8B, when placed together, form a flowchart showing the steps in an illustrative process for locating a publication identifier. This process is performed by the apparatus shown in Figure 7.
- Figure 9 is a table array in a database for storing URL parsers and metadata parsing scripts.
- Figure 10 is a screen shot of a journal web page with an International Standard Serial Number (ISSN).
- Figure 11 is the HTML code used to display the web page shown in Figure 10 in a conventional web browser illustrating how such a web page could be parsed to retrieve the identifier.
- FIG. 1 is a block schematic diagram illustrating one embodiment 100 constructed in accordance with the principles of the present invention.
- a customer can use a conventional search engine in a web browser 102 to search for content and to display that content in the display area 103.
- the web browser 102 has been modified by downloading a small executable program called a "bookmarklet" that causes the browser to interact with a "rights advisor" web server in accordance with the principles of the invention.
- a program might, for example, be a Javascript program, which is specific to a particular URL domain or to a set of URL domains.
- Figure 2 illustrates a set of database tables 200 for storing script text corresponding to various bookmarklet scripts.
- These tables include a URL base key table 204 which specifies URL map keys, each of which identifies a set of URLs to which a particular bookmarklet script applies.
- Each record of the latter table includes a URL map key identifier (URL_MAPKEY_ID), a name for the map key (URL_MAPKEY_BASE) and an identifier specifying a particular bookmarklet text that applies to that map key (BOOKMARKLET_SCRIPT_ID).
- the URL domains that are members of each domain map are specified in the URL Domain Table 206.
- the URL domain table 206 contains records, each of which includes a URL domain identifier (URL_DOMAIN_ID), a map key identifier specifying the domain map to which the domain belongs (URL_MAPKEY_ID), the domain name (URL_DOMAIN), a URL configuration file (URL_CONFIG_XML), a primary parser identifier (URL_PRIMARY_PARSER) and an identifier for an associated bookmarklet script (BOOKMARKLET_SCRIPT_ID). If a bookmarklet script is specified for a particular domain, it overrides any script specified for a domain map to which that domain belongs.
- The bookmarklet scripts are identified in the Bookmarklet Script Table 208.
- Each record of table 208 contains a script identifier (SCRIPT_ID), a key to select a particular script (SELECT_KEY), an indication whether the script is enabled, and various timing and retry parameters used in determining how the script is downloaded and executed.
- the actual script text is stored in the Versioned Text Table 210. This latter table contains records, each of which, in turn, includes a text identifier (VTEXT_ID), a text type (VTEXT_TYPE_ID), a revision number (REVISION_NUM) and the actual text (CONTENT).
- the text table stores both script text and URL configurations as indicated by the type field. If a script is modified, its record is not overwritten. Instead, a new record is inserted into the text table. This allows review of previous script versions and rollback, if necessary.
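The override rule and the append-only revision scheme described for these tables can be sketched as follows. This is a minimal in-memory illustration under stated assumptions, not the patent's implementation: the comments reuse column names from the text, while the class `VersionedTextStore`, the function `resolve_script_id` and the sample script strings are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class VersionedTextStore:
    """Append-only store modelled on the Versioned Text Table (hypothetical class)."""

    # Maps VTEXT_ID to its list of CONTENT values; list index + 1 is REVISION_NUM.
    revisions: Dict[int, List[str]] = field(default_factory=dict)

    def save(self, vtext_id: int, content: str) -> int:
        """Insert a new revision instead of overwriting; return its REVISION_NUM."""
        history = self.revisions.setdefault(vtext_id, [])
        history.append(content)
        return len(history)

    def latest(self, vtext_id: int) -> str:
        """Return the most recent revision of a text."""
        return self.revisions[vtext_id][-1]

    def get(self, vtext_id: int, revision_num: int) -> str:
        """Fetch an earlier revision, for example for review or rollback."""
        return self.revisions[vtext_id][revision_num - 1]


def resolve_script_id(domain_script_id: Optional[int],
                      map_script_id: Optional[int]) -> Optional[int]:
    """A script set on the domain overrides the one set on its domain map."""
    return domain_script_id if domain_script_id is not None else map_script_id


store = VersionedTextStore()
store.save(7, "/* script v1 */")
store.save(7, "/* script v2 */")
```

Because `save` only appends, `store.get(7, 1)` still returns the first version after the second one has been stored, which is what makes rollback possible.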
- FIG. 3 shows a typical screen display presented by such a search engine.
- the web browser 300 includes a search field 302 that, in turn, includes a text box 304 for receiving a search phrase and a command button 306 for initiating a search for publications whose text includes the search phrase.
- the web browser 300 has been modified to include a small executable program called a "bookmarklet" that causes the browser to interact with a "rights advisor" program in accordance with the principles of the invention.
- Figure 4 shows a typical display of content located by the search engine when the keywords "nature methods" have been entered into the text box 404 in the search field 402 of the browser 400 and the command button 406 has been selected.
- the search results are shown as a plurality of rows 408-418 in the list box 407.
- Each row includes information concerning an article located in the search.
- the search engine illustrated in Figure 4 displays information including the article title, the publisher and a standard identifying number associated with the publication that contains the article.
- Each article has associated with it a hyperlink generated by the bookmarklet that enables a user to locate and display rights associated with that article.
- row 408 includes a hyperlink 420 that enables a user to locate and display rights for the "Nature Methods" article displayed in that row.
- rows 410-418 have hyperlinks 422-430 for locating and viewing rights associated with the articles displayed in those rows.
- a URL associated with an article may not refer to the publisher or the containing publication.
- the user might be examining the full text of a document containing a URL that points to a different portion of the document or to a bibliography containing text information that identifies the publisher and publication.
- the full text page of the document may not contain any URLs.
- text identifying the publication and publisher or URLs pointing to the publication and publisher may be located on another web page, for example a bibliographic page or an initial document information page.
- the locations and format of the information are generally specific to a particular domain or web site and, in some cases, the URL of the website, which contains the domain, may be all that is available.
- an article can be identified by parsing a URL or by searching for article identification information embedded in other textual information on a web page. Further, the URL parsing mechanism and text search can be tailored on a per-domain basis so that different formats can be accommodated.
- when a hyperlink is selected, the bookmarklet 104 causes the web browser 102 to access a rights advisor web page 108 hosted by a server in a rights clearinghouse location. When the web page 108 is accessed, the bookmarklet generates a unique bookmarklet key which is used to identify the member and a "session" during which rights for the displayed article will be retrieved and displayed.
- the bookmarklet also sends any available information regarding the article to the rights advisor web page 108.
- the rights advisor web page 108 uses the article information to try to locate rights associated with the article.
- the web server that displays the rights advisor web page also searches the web site from which the article is displayed in order to attempt to locate additional information concerning the article.
- the process performed by the rights advisor web page 108 to locate and resolve rights is set forth in Figure 5.
- This process begins with step 500 and proceeds to step 502 where the rights advisor web page 108 receives article information, the organization member context and a desired type of use from the bookmarklet 104.
- Rights that are available for an organization are defined by agreements that are stored in the rights database 112.
- Rights database 112 is arranged as a plurality of tables where rights are stored in a table separate from the content identifiers. Such a database is described in detail in U.S. Patent No. 5,991,876, the content of which is incorporated in its entirety by reference.
- the rights database 112 contains information regarding agreements.
- An agreement is any construct under which an organization obtains or expresses rights related to secondary use of content.
- Such agreements could include a copyright license for an entire collection of publications obtained from a rights clearinghouse.
- An example of such an agreement is an annual copyright license obtained from the Copyright Clearance Center.
- Agreements may also be made directly with a publisher, such as the Pharmaceutical Documentation Ring agreement made with the publisher Elsevier.
- Another type of agreement could be made with other Reproduction Rights Organizations such as a contract with the Copyright Licensing Agency in the United Kingdom.
- Agreements can also be obtained from various content aggregators. Such an agreement might be a Factiva license.
- Agreements can also be implied by statutory law, for example, Swiss law allows Swiss companies to share content without royalties. Still other agreements may involve company policy.
- the rights advisor 108 accesses the rights database as indicated schematically by arrow 114 and retrieves all agreements that apply to the organization.
- the components of an agreement 600 as represented in the rights database 112 are shown in Figure 6. These components include boundaries 602, titles included 610, rights 620 and terms 621. Boundaries 602 specify the member context, or various constraints, an organization member must meet in order to be covered by the agreement and are defined by three variables: country, location and organization defined attributes.
- the country variable has values corresponding to global nationalities, such as United States or France.
- the location variable has values that correspond to various site locations of the organization, such as the Waltham site or the Wilmington site.
- the organization defined variable may have any values that determine, within that organization, whether the agreement applies to a member of that organization.
- for example, this variable may specify that a member of the organization must be part of the marketing department or part of the research and development department, etc., to be covered by the agreement.
- the country, location and organization defined variables may be assigned the value "any" which indicates that the agreement would apply to any member context which meets the other boundary variables.
- the organization defined variable may be assigned a value of "any." In this case the agreement would apply to any member who meets the country and location boundary variables.
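The boundary check described above, including the "any" wildcard, can be sketched as below. This is an illustrative sketch only: the class names `Boundaries` and `MemberContext`, the function `covers`, and the use of exact string comparison are assumptions that do not come from the text.

```python
from dataclasses import dataclass

ANY = "any"  # wildcard value described in the text


@dataclass(frozen=True)
class Boundaries:
    """The three boundary variables of an agreement (hypothetical class)."""
    country: str = ANY
    location: str = ANY
    org_defined: str = ANY  # for example a department name chosen by the organization


@dataclass(frozen=True)
class MemberContext:
    """The context of the member asking about rights (hypothetical class)."""
    country: str
    location: str
    org_defined: str


def covers(boundaries: Boundaries, member: MemberContext) -> bool:
    """Return True if the member context satisfies every boundary variable."""
    pairs = [
        (boundaries.country, member.country),
        (boundaries.location, member.location),
        (boundaries.org_defined, member.org_defined),
    ]
    # A boundary set to "any" matches every value; otherwise values must be equal.
    return all(required == ANY or required == actual for required, actual in pairs)


agreement_bounds = Boundaries(country="United States", location="Waltham")
member = MemberContext("United States", "Waltham", "marketing")
```

Here `agreement_bounds` leaves the organization defined variable at "any", so the marketing member at the Waltham site is covered, while a member in another country would not be.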
- An agreement 600 also includes a designation 610 of the publications or titles that it covers.
- the agreement 600 may apply to collections 612, which are any grouping of publications. For example, an agreement may apply to all the titles that are included in an EBSCO subscription package. This would be considered a "public" collection; the titles included are defined by the information provider and are standard for all purchasers of the package. Another alternative would be a "private" collection. For example, an organization may create an "à la carte" subscription from a provider like EBSCO.
- the agreement 600 may also apply to separate publications 616 in addition to, or as an alternative to, collections 612.
- the third component of an agreement is the rights 620 associated with the agreement. Each right is associated with a specific type of use.
- an illustrative set of predefined rights could include (1) emailing a copy of the publication to a member of the organization, (2) emailing a copy of the publication to a person who is not a member of the organization, (3) storing a copy of the publication on a local hard drive, (4) storing a copy of the publication on a shared network drive, (5) scan and then email a copy of the publication to a member of the organization, (6) scan and then email a copy of the publication to a person who is not a member of the organization, (7) photocopy publication and share with a member of the organization, (8) photocopy publication and share with a person who is not a member of the organization, (9) share a printed copy of the publication with a member of the organization, (
- Rights may be associated with each type of use.
- rights can be specified for the agreement 600 as indicated schematically by arrow 622, for a collection covered by the agreement as indicated schematically by arrow 624 or for individual publications within that collection as indicated schematically by arrow 626.
- Rights can also be assigned to separate publications that are covered individually by the agreement as indicated schematically by arrow 628.
- Terms 621 may also be associated with each agreement. Terms include rights holder terms, contract terms that cannot be expressed programmatically as a right, certain statutory laws (such as Swiss law allowing publication sharing with other Swiss employees), and company policies. Terms may be assigned at the publication, collection and agreement levels. In general, terms associated with rights are tagged as “Restrictive” or “Nonrestrictive”. The “Restrictive” tag indicates that the associated right (such as a right to photocopy a publication) is limited by the text of the terms (for example, a restrictive term might be "only internal distribution is allowed"). The “Nonrestrictive” tag indicates the terms do not limit the applicability of the right, perhaps because they extend the scope of the permitted activity (for example, nonrestrictive terms might include "There are no restrictions on the distribution of photocopies of this content").
- the rights advisor accesses a metadata database 122 as indicated schematically by arrow 120, and attempts to obtain a standard number for the publication containing the article for which the member has requested rights information in order to determine whether any of the retrieved agreements are applicable to that publication.
- the rights advisor tries to look up the publication using two separate methods that are performed in parallel.
- the rights advisor uses information that it receives from the member's browser to attempt to look up the publication. If this information includes the article title and recognized standard identifying numbers, such as an ISSN or an ISBN number for the publication, then a lookup of the publication may be possible using just this information.
- the rights advisor web page 108 attempts to map, or translate, the URL into a standard identifier, where such an identifier is available. Using this standard identifier, the rights advisor web page 108 can then access the metadata database 122 to obtain a standard number that identifies the publication. This standard number can be applied to the retrieved agreements for the organization to determine which agreements apply to the specified publication.
- URL mapping performed by the rights advisor relies on a variety of URL parsers, each of which uses a parsing algorithm, and a supporting database of URL formats 118.
- the rights advisor program 108 has a set of rules for determining which parsers are applicable to a particular URL and a set of parsers that are each able to separate a particular URL into web-site specific identifiers useful for the URL mapping task. Once these specific identifiers have been obtained, they are applied, as schematically indicated by arrow 116, to a database 118 of rules for translating the web-site specific identifiers into standard identifiers such as ISSN or ISBN identifiers.
- Apparatus 700 for obtaining a standard identifier from article information is illustrated in Figure 7 and the steps in the lookup process are illustrated in Figures 8A and 8B.
- the lookup process begins in step 800 and proceeds to step 802 where information 702 concerning the displayed article is received from the member web browser 102.
- in step 804, an attempt is made by the web server to look up the corresponding publication in a metadata table 734 (as schematically indicated by arrow 703) using information, such as a title or any standard numbers, present in the information received from the browser. If this attempt is successful, as determined in step 806, a standard identifier is returned as indicated schematically by arrow 736 and the process proceeds, via off-page connectors 822 and 828, to finish in step 848.
- in step 808, the domain name is saved by the web server using, for example, a store and forward dispatcher.
- This storage operation triggers two processes that operate in parallel and attempt to locate the standard identifier for the applicable publication.
- the first process is set forth in steps 810, 814, 818, 824, 830 and 834.
- the second process is illustrated by steps 812, 816, 820, 832, 836, 840 and 842.
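The idea of two lookup processes running in parallel can be sketched as follows. This is an assumption-laden illustration rather than the described system: the thread pool stands in for the store and forward dispatcher, both lookup functions are hypothetical stubs, and preferring the first process when both succeed is a choice made here, not something the text specifies.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional, Sequence

Lookup = Callable[[str], Optional[str]]


def run_lookups_in_parallel(url: str, lookups: Sequence[Lookup]) -> Optional[str]:
    """Run every lookup concurrently; return the first non-empty result in list order."""
    with ThreadPoolExecutor(max_workers=len(lookups)) as pool:
        futures = [pool.submit(lookup, url) for lookup in lookups]
        results = [future.result() for future in futures]
    return next((result for result in results if result), None)


# Hypothetical stand-ins for the two processes.
def url_parsing_lookup(url: str) -> Optional[str]:
    return None  # pretend the URL carried no usable identifier


def metadata_script_lookup(url: str) -> Optional[str]:
    return "0304-4203"  # pretend the page metadata yielded an ISSN


identifier = run_lookups_in_parallel(
    "http://journals.example.com/article/1",
    [url_parsing_lookup, metadata_script_lookup],
)
```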
- the stored domain name in the URL is matched against the set of parser rules to select rules that apply to that domain.
- the selected rules are then used to select and configure the parsers.
- Figure 9 shows an illustrative embodiment 900 for the parser rule set 704.
- the parser rule set is implemented as a set of relational database tables 902, 904, 906, 910 and 912.
- Each content provider is provided in the content provider table 902 with a record containing a unique identifier (CONTENT_PROVIDER_ID) and a name (CONTENT_PROVIDER_NAME).
- a content provider may be associated with one or more Internet domains via the Content Provider Domain table 904.
- Table 904 contains one or more records for each content provider and each record contains a domain identifier (CONTENT_PROVIDER_DOMAIN_ID), a domain name (CONTENT_PROVIDER_DOMAIN), a reference to the content provider (CONTENT_PROVIDER_ID) and a precedence level (PRECEDENCE_LEVEL) that indicates, if there are a plurality of domains, which domain should be examined first.
- the table also includes a URL segment map identifier (URL_SEGMENT_MAP_ID) and URL parser identifier (URL_PARSER_ID) for each domain.
- the URL segment map identifier identifies a record in the URL Segment Map table 912 which contains data indicating the structure of the URL, which can consist of three segments: URL_SEGMENT_1, URL_SEGMENT_2 and URL_SEGMENT_3.
- a standard publication number or a publication identifier may be directly associated with a domain name. If this is the case, these identifiers are stored in the URL segment map (STD_NO and PUB_ID).
- a URL parser is associated with the content provider domain via a reference to the URL Parser table 906.
- Table 906 includes a parser identifier (URL_PARSER_ID) and a parser name (URL_PARSER_NAME) and an indication whether the particular parser is enabled.
- a further table 910 (the URL Parser Param table) contains parameters that are used with a particular parser.
- a parser consists of the instructions for extracting from a URL the data fields necessary to use translation rules to determine a standard identifier.
- One such set of data fields includes three members: the key base, the journal key and the publication date.
- the key base specifies a context in which the derived identifier is meaningful; in other words, a particular publisher may give all of the publications on its web site unique, proprietary numbers, and use this numbering system in the URLs for the articles on its web site.
- the key base in this case can be any string that specifies the publisher's web site, such as 'PUB1'; the journal key is then the publisher's own proprietary identifier.
- Parsers such as parsers 706-708, are defined to extract data in particular formats. For instance, many publishers follow an informal convention in which the URL for an article contains the concatenation of a unique string identifying the publication with four numeric digits signifying the year and month of publication of the article. A variety of well-known parsing techniques can be used to locate this string and split it into the desired components. Once a parser is created to extract this concatenated string from a URL and split the string into its two useful components, the parser can be configured with parser rules, such as those set forth above, to perform the same task for URLs of any publisher that follows this convention.
- Any selected parsing technology must be able to implement at least the following capabilities: within a given string, locate a specified prefix string; extract characters following the prefix string until a specified suffix string is located; and split an extracted string into multiple substrings according to simple format specifications.
- Conventional UNIX- or Perl-like regular expressions are easily capable of performing these parsing and extraction tasks.
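The three capabilities listed above (locate a prefix, extract up to a suffix, split by a simple format) can be sketched with regular expressions. This is a minimal sketch under assumptions: the URL, the `/content/` prefix, the two-digit year plus two-digit month layout and all function names are hypothetical examples of one publisher convention, not formats taken from the text.

```python
import re
from typing import Dict, Optional


def extract_between(text: str, prefix: str, suffix: str) -> Optional[str]:
    """Locate `prefix`, then return the characters up to the next `suffix`."""
    pattern = re.escape(prefix) + r"(.*?)" + re.escape(suffix)
    match = re.search(pattern, text)
    return match.group(1) if match else None


def split_key_and_date(token: str) -> Optional[Dict[str, str]]:
    """Split '<journal key><YYMM>' into its components (assumed format)."""
    match = re.fullmatch(r"(?P<journal_key>[A-Za-z]+)(?P<yy>\d{2})(?P<mm>\d{2})", token)
    return match.groupdict() if match else None


def parse_url(url: str, key_base: str, prefix: str, suffix: str) -> Optional[Dict[str, str]]:
    """Extract the key base, journal key and publication date fields from a URL."""
    token = extract_between(url, prefix, suffix)
    fields = split_key_and_date(token) if token else None
    if fields is None:
        return None
    return {"key_base": key_base, **fields}


fields = parse_url(
    "http://pub1.example.com/content/jmeth0603/article.html",
    key_base="PUB1",
    prefix="/content/",
    suffix="/",
)
```

Keeping the prefix, suffix and key base as parameters mirrors the point made in the text: one parser can serve any publisher that follows the same convention, with only the configuration changing.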
- new parser rules and parsers can be added to support new URL formats.
- the process then proceeds, via off-page connectors 818 and 824 to step 830, where the extracted data field values are presented to the translation rule database 714 as indicated schematically by arrows 710 and 712.
- the translation database includes a plurality of entries, each entry constituting a translation rule that, in turn, includes at least three fields: the key base, the journal key and the standard identifier and may include other fields, such as date fields.
- the key base and journal keys are used as key fields. If the data field values presented to the translation rule database match these fields, the associated standard identifier is returned.
- because the journal key is internal data for a particular publisher, there is no guarantee that journal keys will be unique outside the context of a particular website or website subset.
- the key base provides a mechanism for ensuring that the journal keys can be mapped accurately to standard identifiers, such as an ISSN. If, in step 834, it is determined that such a standard identifier results from the database query, then the URL mapping process proceeds to step 836 where an attempt is made to look up the publication using the standard identifier. If the publication is found, as determined in step 840, the process finishes in step 848.
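A translation rule lookup keyed on the key base and journal key together can be sketched as below. The rule entries, including both ISSN-like values, are made up for illustration, and the dictionary stands in for the translation rule database; only the idea of a composite key comes from the text.

```python
from typing import Dict, Optional, Tuple

# Translation rules keyed by (key base, journal key). Every entry is hypothetical.
TRANSLATION_RULES: Dict[Tuple[str, str], str] = {
    ("PUB1", "jmeth"): "1111-1111",  # made-up standard identifier
    ("PUB2", "jmeth"): "2222-2222",  # same journal key, different publisher
}


def translate(key_base: str, journal_key: str) -> Optional[str]:
    """Return the standard identifier for a publisher-specific key, if a rule exists."""
    return TRANSLATION_RULES.get((key_base, journal_key))


first = translate("PUB1", "jmeth")
second = translate("PUB2", "jmeth")
```

The same journal key resolves to different identifiers under different key bases, which is why the journal key alone is not a safe lookup key.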
- the second process for obtaining a standard identifier for a publication begins in step 812. As previously mentioned, this process is initiated when a domain name is stored (in step 808) and proceeds in parallel with the aforementioned URL parsing process.
- a script called a metadata parser, which is specific to the domain, is retrieved from the URL mapping database 118. Illustratively, this script might be a JavaScript program.
- metadata parser scripts are stored in a metadata parser table 908.
- Each record in this table includes a parser identifier (METADATA_PARSER_ID), a domain key (SELECT_KEY), an indication whether the script is enabled, the script text (SCRIPT_TEXT) and several timing entries that control the timeout interval and retry policies for script execution.
- Each script can also be disabled for a particular customer by making an entry into the METADATA_CUST_DISABLED table 914.
- the URL of the web page is used as the domain key to access the table 908 and retrieve script text that is specific to that domain.
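Selecting a metadata parser script for a page, while honouring the enabled flag and per-customer disabling, can be sketched as follows. This is a sketch under assumptions: the comments reuse column names from the text, but the class, the function, the use of the host name as the domain key and the sample rows are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set, Tuple
from urllib.parse import urlsplit


@dataclass(frozen=True)
class MetadataParser:
    parser_id: int    # METADATA_PARSER_ID
    select_key: str   # SELECT_KEY (domain key)
    enabled: bool
    script_text: str  # SCRIPT_TEXT


def select_script(
    parsers: Iterable[MetadataParser],
    disabled_for_customer: Set[Tuple[int, int]],  # (customer id, parser id) pairs
    page_url: str,
    customer_id: int,
) -> Optional[MetadataParser]:
    """Return the enabled script for the page's domain unless this customer disabled it."""
    domain_key = urlsplit(page_url).hostname  # assumption: the key is the host name
    for parser in parsers:
        if parser.select_key != domain_key or not parser.enabled:
            continue
        if (customer_id, parser.parser_id) in disabled_for_customer:
            continue
        return parser
    return None


parsers = [MetadataParser(1, "journals.example.com", True, "/* parser script */")]
disabled = {(42, 1)}  # hypothetical: customer 42 has parser 1 switched off
```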
- the retrieved script text is downloaded to the member's browser and appended to the bookmarklet script already running in the browser using the timing and retry numbers stored with the script.
- the script text parses the HTML code of one or more web pages on the current web site and attempts to locate additional data concerning the desired publication, again using the timing and retry information stored with the script.
- such a script may parse the HTML code of the web page that the member is currently viewing.
- the script may navigate from one page to another web page and then parse the HTML code on the second web page.
- this might occur in situations where the member is viewing a full-text version of an article, but publication information for that article is available on a different web page that displays publication abstracts.
- the scripts are typically designed by a human operator who visits the web site, notes the location of the additional information and then writes the script to retrieve the information.
- An operator can generate scripts from "scratch" using the general workflow of the site for extracting content identifiers as the functional boundary, but in most cases, operators will begin with an existing metadata parser script.
- Scripts designed to process similar types of publications are typically similar in construction so that each publication type has a template script that can be modified for a specific domain.
- template scripts might be designed for trade journals, news articles, patents and press releases or other sites which have similar page layouts and types, or from which similar metadata will be extracted or which have similar site structures.
- an operator may decide to use an existing script if the content or documents to be parsed are a similar type and structure or contain the same kind of metadata, such as an author's name or the year in a copyright notice. For example, a copyright notice appears in most Web pages in the footer, at the bottom of the page. A typical format is "Copyright © 2006 copyright holder name." Because the format and the data captured are similar, a system user can modify an existing script to perform the same function on a different Web site or page set.
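As an illustration of why such a notice is easy to reuse across sites, the copyright format mentioned above can be captured with one pattern. This is a rough sketch: the pattern, the function name and the sample footer are assumptions, and the holder name is cut at the first period, which would truncate names such as "Example Inc.".

```python
import re
from typing import Optional, Tuple

# Matches "Copyright © 2006 holder name" and the common "(c)" variant.
COPYRIGHT_PATTERN = re.compile(
    r"Copyright\s*(?:©|\(c\))\s*(\d{4})\s+([^.<]+)", re.IGNORECASE
)


def parse_copyright(footer_text: str) -> Optional[Tuple[int, str]]:
    """Return (year, holder) from a footer notice, or None if no notice is found."""
    match = COPYRIGHT_PATTERN.search(footer_text)
    if not match:
        return None
    return int(match.group(1)), match.group(2).strip()


notice = parse_copyright("Copyright © 2006 Example Publishing Ltd. All rights reserved.")
```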
- a template or existing script can be selected by an operator based on the URL of the web site as schematically illustrated by arrow 719 in Figure 7. Then, the operator would typically log onto the web site with a temporary account, note the location of the relevant information and modify the template script to navigate to that location. This modification is performed by applying the existing script to a script editor 721 to generate a new script 723 which is then stored in the metadata script parser table 720.
- a template script for a trade journal could skip over the first one thousand bytes of the HTML code in order to avoid header information and then parse the remaining HTML script with parsers similar to those discussed above for URLs.
- the scripts are site specific.
- a web page made up of HTML code and other formatting elements will require a different parser than a page coded in PDF (Portable Document Format).
- a web page with a similarly-located standard identifier such as an ISBN (International Standard Book Number), an ISSN (International Standard Serial Number), or a DOI (Digital Object Identifier) may be used by many sites and services for a particular kind of work (e.g., books, journals or research articles).
- Figure 10 is a screen shot of an illustrative web page from a website of the publisher Elsevier which contains an ISSN (0304-4203) identifying the journal "Marine Chemistry" embedded in the web page text.
- Figure 11 shows the HTML code which causes the web page display shown in Figure 10 to be generated in a conventional web browser.
- the parser steps through the DOM (Document Object Model) to pick out the document identifier (ISSN, DOI, etc.).
- the DOM provides a hierarchical structure of the web page along with values, allowing a program to search and step through a fixed format to gather the information required.
- Consistency of format is important, but with the page information returned by the DOM, a program can, for example, query for the <span> tag where the class equals, for example, "journalinformation" and then use various conventional methods, such as regular expression matching, to extract the standard number from the block of returned text.
- the code at lines 1-35 defines various variables that are used in the following subroutines.
- the code at lines 36-84 defines a subroutine that determines whether the protocol of the web page under examination is http or https.
- the code at lines 85-118 defines a subroutine that searches a text string for the occurrence of another text string.
- the code at lines 119-204 uses the search function defined in lines 85-118 to sequentially search the web page html code for predetermined character patterns that indicate the presence of the publication standard identifier and document title.
- Lines 205-218 define a sub routine that stores the
- the code at lines 219-243 displays debugging information and resets the web page following the metadata storage operation.
- the code at lines 244-251 refreshes the user's display window and the code at lines 252-262 executes the parsing subroutines.
- the metadata is not contained on a Web page which is initially displayed, but only on a preceding page or on a page that requires user interaction with the Web site. Therefore, some metadata parser scripts can perform the functions necessary to navigate the client browser up or down in the browsing history before extracting metadata.
- the system user may begin generating a new metadata parser script from an existing one that contains this functionality.
- step 816 the script is executed in the member's browser and parses the HTML code to locate relevant publication information.
- This information can include, for example, the publication title and standard publication numbers, such as an ISSN or an ISBN number.
- the process then proceeds, via off-page connectors 820 and 826, to step 832, where any information extracted from the website is stored using the bookmarklet key as a retrieval key. Storage of the information is necessary at this point because the first process may still be proceeding.
- step 834 the first process determines that a publication identifier could not be located by parsing the publication URL or, in step 840, that an attempt to lookup the publication with the located identifier failed, then the process proceeds to step 838 where a determination is made whether any publication information has been stored in step 832 by accessing storage with the aforementioned bookmarklet key. If no publication information is located, the process proceeds to step 846 where a search web page is displayed that allows the member to perform a manual search for the publication information, and the process then finishes in step 848.
- step 838 it is determined that stored information obtained by the second process is located with the bookmarklet key, then an attempt is made in step 842 to lookup the publication using the stored information.
- the information may be pre-processed into a standard format in order to simplify the lookup process. For example, certain standard information may be removed, including HTML tags, spaces, foreign language characters and common articles. Then, the standard form is used to perform a lookup attempt. The lookup attempt itself may proceed in several stages. First, the lookup process tries to use any standard publication numbers and titles found to generate a "fingerprint" in order to look up the publication. If that attempt fails, the process uses just the standard number, looking at alternate ID numbers.
- the lookup process will use the title alone and look for a matching fingerprint.
- the publication identifier is returned in step 848.
- the aforementioned search web page is displayed that allows the member to perform a manual search for the standard publication identifier, and the process then finishes in step 848.
- step 506 once the standard publication identifier has been obtained using one of the methods described above, the process proceeds to step 508 where the rights advisor web server uses that identifier to determine all retrieved agreements that apply to the identified publication.
- step 510 a determination is made of all agreements that fit the member context. This determination is made by examining the boundaries of each agreement and then determining whether that agreement covers the member country and location and that the member meets any organization defined attributes.
- step 512 the best right for the type of use requested is determined. The process then finishes in step 514.
- the process of determining the best right as set forth in step 512 involves examining each agreement that applies to the publication and meets the member context in order to determine the most appropriate right for the specified type of use that is included in the agreement. In performing this examination, each agreement is examined from the "bottom up.” That is, more specific rights supersede more general rights. Thus, an agreement is first examined to determine whether a right for the type of use requested has been assigned directly to the specified publication, either by itself or to the publication as contained in a collection. If such a right is found it is the right used for that agreement. If no such right has been assigned to the publication, the agreement is next checked to determine whether a right for requested type of use has been assigned to a collection that includes the specified publication. If so, it is the right that is used for that publication. If no such right is found, then the agreement is checked to determine whether a right for the type of use has been assigned at the agreement level. If so, that right is used for the agreement.
- Ties among two or more rights can take several forms. For example, a tie between two or more rights without terms indicates that identical rights are available from two different agreements. Since the rights are identical and indistinguishable, one agreement is selected by a variety of techniques (for example, arbitrarily) and the rights and terms of that agreement are displayed. Alternatively, a tie between two or more rights with terms results in the display of all such rights together with the terms, so that the end user can make an informed judgment as to the permissibility of the requested activity. Another example is a tie between two or more rights with "Purchase" status. Such a tie results in the display of a list of the purchase information or capability for all such rights. In another embodiment, once a publication has been selected, the "best" rights which are available for various types of use are determined and presented to the member simultaneously.
- a software implementation of the above-described embodiment may comprise a series of computer instructions either fixed on a tangible medium, such as a computer readable media, for example, a diskette, a CD-ROM, a ROM, or a fixed disk, or transmittable to a computer system for storage thereon via a modem or other interface device over a transmission path.
- the transmission path either may be tangible lines, including but not limited to, optical or analog communications lines, or may be implemented with wireless techniques, including but not limited to microwave, infrared or other transmission techniques.
- the transmission path may also be the Internet.
- the series of computer instructions embodies all or part of the functionality previously described herein with respect to the invention.
- Such computer instructions can be written in a number of programming languages for use with many computer architectures or operating systems. Further, such instructions may be stored using any memory technology, present or future, including, but not limited to, semiconductor, magnetic, optical or other memory devices, or transmitted using any communications technology, present or future, including but not limited to optical, infrared, microwave, or other transmission technologies. It is contemplated that such a computer program product may be distributed as a removable medium with accompanying printed or electronic documentation, e.g., shrink wrapped software, pre-loaded with a computer system, e.g., on system ROM or fixed disk, or distributed from a server or electronic bulletin board over a network, e.g., the Internet or World Wide Web.
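The extraction and pre-processing steps described above (querying a `<span>` by class, pulling out a standard number with a regular expression, then normalizing a title into a "fingerprint") can be sketched as follows. This is a minimal illustration in Python rather than the browser-side script the text describes. The names `SpanTextCollector`, `extract_issn` and `fingerprint`, the sample class name default, and the fingerprint format are all assumptions made for the example, not details taken from the patent.

```python
import re
from html.parser import HTMLParser

# ISSN shape: four digits, a hyphen, three digits, then a digit or X.
ISSN_PATTERN = re.compile(r"\b(\d{4}-\d{3}[\dXx])\b")

# Assumed list of "common articles" to drop during normalization.
COMMON_ARTICLES = {"a", "an", "the"}


class SpanTextCollector(HTMLParser):
    """Collects the text found inside <span> elements of one class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0      # nesting depth inside a matching <span>
        self.chunks = []    # text fragments gathered so far

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth > 0 or self.target_class in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "span" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_issn(html, span_class="journalinformation"):
    """Return the first ISSN-shaped string inside the target span, or None."""
    collector = SpanTextCollector(span_class)
    collector.feed(html)
    match = ISSN_PATTERN.search(" ".join(collector.chunks))
    return match.group(1).upper() if match else None


def fingerprint(title, standard_number=None):
    """Normalize a title (and optional standard number) into a lookup key."""
    text = re.sub(r"<[^>]+>", " ", title)                  # drop HTML tags
    text = text.encode("ascii", "ignore").decode("ascii")  # drop non-ASCII characters
    words = [w for w in re.findall(r"[a-z0-9]+", text.lower())
             if w not in COMMON_ARTICLES]
    key = "".join(words)                                   # spaces removed
    if standard_number:
        key = standard_number.replace("-", "").upper() + ":" + key
    return key
```

A lookup could then try `fingerprint(title, issn)` first, fall back to the standard number alone, and finally to `fingerprint(title)`, mirroring the staged lookup described above.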
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Information Transfer Between Computers (AREA)
- Storage Device Security (AREA)
Abstract
The domain name of the website in which a knowledge worker is working is used to obtain a parser program that is specific to the domain. The parser program is then sent back to the browser on which the knowledge worker is viewing information and extracts content metadata from the website in which the knowledge worker is working. In one embodiment, the parser program extracts content metadata from the webpage displayed in the browser. In another embodiment, the parser program navigates from the webpage that is being displayed in the browser to another webpage that contains content metadata and extracts the content metadata from that other page. In still another embodiment, content metadata, such as a publication title, that is returned to the rights advisor website via the parser is 'normalized' to obtain a standard identifier that is, in turn, used to determine rights for the content.
Description
METHOD AND APPARATUS FOR GENERATING STANDARD DOCUMENT IDENTIFIERS FROM CONTENT REFERENCES
BACKGROUND
[0001] This invention relates to digital rights display and methods and apparatus for determining reuse rights for content to which multiple licenses and subscriptions apply. Works, or "content", created by an author are generally subject to legal restrictions on reuse. For example, most content is protected by copyright. In order to conform to copyright law, content users often obtain content reuse licenses. A content reuse license is actually a "bundle" of rights, including rights to present the content in different formats, rights to reproduce the content in different formats, rights to produce derivative works, etc. Thus, depending on a particular reuse, a specific license to that reuse may have to be obtained.
[0002] Many organizations use content for a variety of purposes, including research and knowledge work. These organizations obtain that content through many channels, including purchasing content directly from publishers and purchasing content via subscriptions from subscription resellers. Subscriptions generally include some reuse rights that are conveyed to the subscriber. A given subscription service will generally try to offer a standard set of rights across its subscriptions, but large customers will often negotiate with the service to purchase additional rights. Thus, reuse rights may vary from subscription to subscription and the reuse rights available for a particular subscription may vary even across publications within that subscription. In addition, the reuse rights conveyed in these subscriptions often overlap with other rights and licenses purchased from license clearinghouses, or from other sources.
[0003] Many knowledge workers attempt to determine which rights are available for particular content before using that content in order to avoid infringing legitimate rights of rightsholders. However, at present, determining what reuse rights an organization has for any given publication is a time-consuming, manual procedure, generally requiring a librarian or legal counsel to review, in advance of the use, all license agreements obtained from content providers and purchased from other sources which may pertain to the content and its reuse. The difficulty of this determination means that sometimes an organization will overspend to purchase rights for which it already has paid. Alternatively, knowledge workers may run the risk of infringing a reuse right for which they believe that the organization has a license, but which, in actuality, the organization does not.
[0004] One of the problems in determining which rights apply to a given publication is connecting the publication to one or more agreements that convey rights so that the correct agreement can be examined to determine what rights are available to an organization. One prior art method for performing this connection is to embed a special "tag" in the publication. When the publication is later opened, for example, for examination, the tag can be activated to direct the user to a specific location, such as a web site, where rights agreements are located. While this arrangement is effective, it requires each publication to contain the special tag. While this might be feasible for newly published publications, it would be prohibitive to re-publish older publications with the special tag. Thus, this system would not work with many existing publications.
[0005] Often a user trying to locate publication rights has only a publication universal resource locator or URL associated with a publication. The primary purpose of such a URL is to indicate where on a network, such as the Internet, a copy of the publication can be located. Thus, the URL typically does not directly identify the publication itself. However, many URLs contain information that is useful in identifying the publication. Unfortunately, there is no current standard URL configuration so that such useful information may be located in various places within the URL depending on the publisher or clearinghouse. Further, the useful information may be coded in various ways. Therefore, it may be difficult to extract the information from a particular URL.
[0006] In still other cases, even the URL is not available. For example, only basic information such as the publication title, author, the work in which the publication is contained and the year of publication or some combination of the aforementioned information may appear in the text of a webpage that the worker is viewing. Alternatively, a publication that the worker is viewing may contain a reference to another web page, such as an abstract or a bibliographic page that contains the aforementioned information. Since each webpage may have a different and unique format, it is difficult to determine even where on a particular page to look for the information necessary to identify the rights that are available (called content "metadata").
SUMMARY
[0007] In accordance with the principles of the invention, the domain name of the website in which a knowledge worker is working is sent to a rights advisor website. The rights advisor website uses the domain name to obtain a parser program that is specific to the domain. The parser program is then sent back to the browser on which the knowledge worker is viewing information and extracts content metadata from the website in which the knowledge worker is working. The extracted content metadata is returned to the rights advisor website and used to determine rights associated with the publications.
[0008] In one embodiment, the parser program extracts content metadata from the webpage displayed in the browser.
[0009] In another embodiment, the parser program navigates from the webpage that is being displayed in the browser to another webpage that contains content metadata and extracts the content metadata from that other page.
[0010] In still another embodiment, content metadata, such as a publication title, that is returned to the rights advisor website via the parser is "normalized" to obtain a standard identifier that is, in turn, used to determine rights for the content.
[0011] In yet another embodiment, when a URL is associated with content, the rights advisor website attempts to locate a standard identifier for the content using that URL simultaneously with attempts to obtain a parser program that is specific to the website domain and which can look for information associated with the content on the webpage.
BRIEF DESCRIPTION OF THE DRAWINGS
[0012] Figure 1 is a block schematic diagram illustrating in a high level form the basic architecture of the inventive rights location system.
[0013] Figure 2 is a table array in a database for storing bookmarklet script information.
[0014] Figure 3 is a typical screen display presented by a conventional search engine in a web browser.
[0015] Figure 4 is a typical display of content located by the search engine when the keywords "nature methods" have been entered into the text box. This figure shows the hyperlinks to a rights advisor web page.
[0016] Figure 5 is a flowchart showing the steps in an illustrative process for determining and resolving rights for a requested type of use.
[0017] Figure 6 is a block schematic diagram illustrating the components of an agreement.
[0018] Figure 7 is a block schematic diagram illustrating the components in a publication identifier location apparatus.
[0019] Figures 8A and 8B, when placed together, form a flowchart showing the steps in an illustrative process for locating a publication identifier which process is performed by the apparatus shown in Figure 7.
[0020] Figure 9 is a table array in a database for storing URL parsers and metadata parsing scripts.
[0021] Figure 10 is a screen shot of a journal web page with an International Standard Serial Number (ISSN) identifier embedded in the page.
[0022] Figure 11 is the HTML code used to display the web page shown in Figure 10 in a conventional web browser illustrating how such a web page could be parsed to retrieve the identifier.
DETAILED DESCRIPTION
[0023] Figure 1 is a block schematic diagram illustrating one embodiment 100 constructed in accordance with the principles of the present invention. In some cases, a customer can use a conventional search engine in a web browser 102 to search for content and to display that content in the display area 103. The web browser 102 has been modified by downloading a small executable program called a "bookmarklet" that causes the browser to interact with a "rights advisor" web server in accordance with the principles of the invention. Such a program might, for example, be a JavaScript program, which is specific to a particular URL domain or to a set of URL domains.
[0024] Figure 2 illustrates a set of database tables 200 for storing script text corresponding to various bookmarklet scripts. These tables include a URL base key table 204 which specifies URL map keys, each of which identifies a set of URLs to which a particular bookmarklet script applies. Each record of the latter table includes a URL map key identifier (URL_MAPKEY_ID), a name for the map key (URL_MAPKEY_BASE) and an identifier specifying a particular bookmarklet text that applies to that map key (BOOKMARKLET_SCRIPT_ID). The URL domains that are members of each domain map are specified in the URL Domain Table 206. The URL domain table 206 contains records, each of which includes a URL domain identifier (URL_DOMAIN_ID), a map key identifier specifying the domain map to which the domain belongs (URL_MAPKEY_ID), the domain name (URL_DOMAIN), a URL configuration file (URL_CONFIG_XML), a primary parser identifier (URL_PRIMARY_PARSER) and an identifier for an associated bookmarklet script (BOOKMARKLET_SCRIPT_ID). If a bookmarklet script is specified for a particular domain, it overrides any script specified for a domain map to which that domain belongs.
[0025] The bookmarklet scripts are identified in the Bookmarklet Script Table 208. Each record of table 208 contains a script identifier (SCRIPT_ID), a key to select a particular script (SELECT_KEY), an indication whether the script is enabled, and various timing and retry parameters used in determining how the script is downloaded and executed. The actual script text is stored in the Versioned Text Table 210. This latter table contains records, each of which, in turn, includes a text identifier (VTEXT_ID), a text type (VTEXT_TYPE_ID), a revision number (REVISION_NUM) and the actual text (CONTENT). The text table stores both script text and URL configurations as indicated by the type field. If a script is modified, its record is not overwritten. Instead, a new record is inserted into the text table. This allows review of previous script versions and rollback, if necessary.
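The two behaviors described for these tables, a domain-level script overriding the script of its domain map and script text being versioned rather than overwritten, can be sketched in a few lines. This is a hedged in-memory illustration only: the class `ScriptStore`, its method names and the dictionary layout are assumptions standing in for the database tables, and the domain names in the usage note are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ScriptStore:
    # map key id -> script id (stands in for the URL map key table)
    map_scripts: dict = field(default_factory=dict)
    # domain name -> (map key id, domain-level script id or None)
    domains: dict = field(default_factory=dict)
    # script id -> list of text revisions, oldest first (append-only)
    versions: dict = field(default_factory=dict)

    def resolve_script_id(self, domain):
        """Pick the script for a domain; a domain-level script wins."""
        entry = self.domains.get(domain)
        if entry is None:
            return None
        map_key_id, domain_script_id = entry
        if domain_script_id is not None:
            return domain_script_id
        return self.map_scripts.get(map_key_id)

    def save_revision(self, script_id, content):
        """Append a new revision instead of overwriting; return its number."""
        revisions = self.versions.setdefault(script_id, [])
        revisions.append(content)
        return len(revisions)

    def current_text(self, script_id):
        """Return the latest revision, or None if no text is stored."""
        revisions = self.versions.get(script_id)
        return revisions[-1] if revisions else None

    def revision_text(self, script_id, revision_num):
        """Return an earlier revision (1-based), e.g. for review or rollback."""
        return self.versions[script_id][revision_num - 1]
```

For example, registering `"journals.example.com"` with no domain-level script makes `resolve_script_id` fall back to the script of its map key, while a domain registered with its own script id gets that script regardless of the map.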
[0026] Figure 3 shows a typical screen display presented by such a search engine. The web browser 300 includes a search field 302 that, in turn, includes a text box 304 for receiving a search phrase and a command button 306 for initiating a search for publications whose text includes the search phrase. The web browser 300 has been modified to include a small executable program called a "bookmarklet" that causes the browser to interact with a "rights advisor" program in accordance with the principles of the invention.
[0027] Figure 4 shows a typical display of content located by the search engine when the keywords "nature methods" have been entered into the text box 404 in the search field 402 of the browser 400 and the command button 406 has been selected. The search results are shown as a plurality of rows 408-418 in the list box 407. Each row includes information concerning an article located in the search. The search engine illustrated in Figure 4 displays information including the article title, the publisher and a standard identifying number associated with the publication that contains the article. [0028] Each article has associated with it a hyperlink generated by the bookmarklet that enables a user to locate and display rights associated with that article. For example, row 408 includes a hyperlink 420 that enables a user to locate and display rights for the "Nature Methods" article displayed in that row. Similarly, rows 410-418 have hyperlinks 422-430 for locating and viewing rights associated with the articles displayed in those rows.
[0029] In other cases, a URL associated with an article may not refer to the publisher or the containing publication. For example, the user might be examining the full text of a document containing a URL that points to a different portion of the document or to a bibliography containing text information that identifies the publisher and publication. Alternatively, the full text page of the document may not contain any URLs. Instead, text identifying the publication and publisher or URLs pointing to the publication and publisher may be located on another web page, for example a bibliographic page or an initial document information page. The locations and format of the information are generally specific to a particular domain or web site and, in some cases, the URL of the website, which contains the domain, may be all that is available. However, in accordance with the principles of the present invention, an article can be identified by parsing a URL or by searching for article identification information embedded in other textual information on a web page. Further, the URL parsing mechanism and text search can be tailored on a per-domain basis so that different formats can be accommodated.
[0030] Returning to Figure 1, when a hyperlink is selected, the bookmarklet 104 causes the web browser 102 to access a rights advisor web page 108 hosted by a server in a rights clearinghouse location. When the web page 108 is accessed, the bookmarklet generates a unique bookmarklet key which is used to identify the member and a "session" during which rights for the displayed article will be retrieved and displayed. The bookmarklet also sends any available information regarding the article to the rights advisor web page 108. The rights advisor web page 108 uses the article information to try to locate rights associated with the article. In addition, in accordance with the inventive principles, the web server that displays the rights advisor web page also searches the web site from which the article is displayed in order to attempt to locate additional information concerning the article.
[0031] The process performed by the rights advisor web page 108 to locate and resolve rights is set forth in Figure 5. This process begins with step 500 and proceeds to step 502 where the rights advisor web page 108 receives article information, the organization member context and a desired type of use from the bookmarklet 104. Rights that are available for an organization are defined by agreements that are stored in the rights database 112. Rights database 112 is arranged as a plurality of tables where rights are stored in a table separate from the content identifiers. Such a database is described in detail in U.S. Patent No. 5,991,876, the content of which is incorporated in its entirety by reference. In particular, the rights database 112 contains information regarding agreements.
[0032] An agreement is any construct under which an organization obtains or expresses rights related to secondary use of content. Such agreements could include a copyright license for an entire collection of publications obtained from a rights clearinghouse. An example of such an agreement is an annual copyright license obtained from the Copyright Clearance Center. Agreements may also be made directly with a publisher, such as the Pharmaceutical Documentation Ring agreement made with the publisher Elsevier. Another type of agreement could be made with other Reproduction Rights Organizations such as a contract with the Copyright Licensing Agency in the United Kingdom. Agreements can also be obtained from various content aggregators. Such an agreement might be a Factiva license. Agreements can also be implied by statutory law; for example, Swiss law allows Swiss companies to share content without royalties. Still other agreements may involve company policy.
[0033] In step 504, the rights advisor 108 accesses the rights database as indicated schematically by arrow 114 and retrieves all agreements that apply to the organization. The components of an agreement 600 as represented in the rights database 112 are shown in Figure 6. These components include boundaries 602, titles included 610, rights 620 and terms 621. Boundaries 602 specify the member context, or various constraints, an organization member must meet in order to be covered by the agreement and are defined by three variables: country, location and organization defined attributes. The country variable has values corresponding to countries, such as United States or France. The location variable has values that correspond to various site locations of the organization, such as the Waltham site or the Wilmington site. The organization defined variable may have any values that determine, within that organization, whether the agreement applies to a member of that organization. For example, the variable may specify that a member of the organization must be part of the marketing department or part of the research and development department, etc. to be covered by the agreement. The country, location and organization defined variables may be assigned the value "any" which indicates that the agreement would apply to any member context which meets the other boundary variables. For example, the organization defined variable may be assigned a value of "any." In this case the agreement would apply to any member who meets the country and location boundary variables.
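The boundary test described in this paragraph reduces to three comparisons, each of which passes when the boundary is "any" or equals the member's value. The following is a minimal sketch of that check; the class names `Boundaries` and `MemberContext`, the field name `org_attribute` and the function `covers` are assumptions introduced for illustration, and a single organization defined attribute is assumed for simplicity.

```python
from dataclasses import dataclass

ANY = "any"  # wildcard value named in the text


@dataclass(frozen=True)
class Boundaries:
    country: str = ANY
    location: str = ANY
    org_attribute: str = ANY   # e.g. a department name


@dataclass(frozen=True)
class MemberContext:
    country: str
    location: str
    org_attribute: str


def _matches(boundary_value, member_value):
    """A boundary variable passes if it is 'any' or equals the member's value."""
    return boundary_value == ANY or boundary_value == member_value


def covers(boundaries, member):
    """True when the member context meets all three boundary variables."""
    return (_matches(boundaries.country, member.country)
            and _matches(boundaries.location, member.location)
            and _matches(boundaries.org_attribute, member.org_attribute))
```

With this sketch, `Boundaries("United States", "Waltham")` leaves the organization defined variable at "any", so it covers every member at that country and site whatever their department.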
[0034] An agreement 600 also includes a designation 610 of the publications or titles that it covers. The agreement 600 may apply to collections 612, which are any grouping of publications. For example, an agreement may apply to all the titles that are included in an EBSCO subscription package. This would be considered a "public" collection; the titles included are defined by the information provider and are standard for all purchasers of the package. Another alternative would be a "private" collection. For example, an organization may create an "à la carte" subscription from a provider like EBSCO. The agreement 600 may also apply to separate publications 616 in addition to, or as an alternative to, collections 612.
[0035] The third component of an agreement is the rights 620 associated with the agreement. Each right is associated with a specific type of use. In order to standardize agreements, a set of distinct rights are predefined. In the discussion below, a set of distinct types of use have been predefined for publications. However, the set of predefined rights could include more or fewer distinct rights as would be understood by those skilled in the art. For example, an illustrative set of predefined rights could include (1) emailing a copy of the publication to a member of the organization, (2) emailing a copy of the publication to a person who is not a member of the organization, (3) storing a copy of the publication on a local hard drive, (4) storing a copy of the publication on a shared network drive, (5) scan and then email a copy of the publication to a member of the organization, (6) scan and then email a copy of the publication to a person who is not a member of the organization, (7) photocopy publication and share with a member of the organization, (8) photocopy publication and share with a person who is not a member of the organization, (9) share a printed copy of the publication with a member of the organization, (10) share a printed copy of the publication with a person who is not a member of the organization, (11) share a copy of the publication using Lotus Notes™, (12) upload a copy of the publication to an Internet site, (13) post a copy of the publication for advertising purposes and (14) upload a copy of the publication to an electronic paper (soft billboard). Customers can define their own type of use, but these custom use types must map to one of the fourteen predefined use types.
[0036] Rights may be associated with each type of use. In addition, rights can be specified for the agreement 600 as indicated schematically by arrow 622, for a collection covered by the agreement as indicated schematically by arrow 624 or for individual publications within that collection as indicated schematically by arrow 626. Rights can also be assigned to separate publications that are covered individually by the agreement as indicated schematically by arrow 628.
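Because rights can sit at the publication, collection or agreement level, picking the right for one agreement amounts to checking those levels from most specific to most general, which is the "bottom up" examination described for step 512. The sketch below illustrates that order under stated assumptions: the `Agreement` layout, the function `resolve_right`, the use-type keys and the status strings other than "Purchase" are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Agreement:
    name: str
    # use type -> right assigned at the agreement level
    agreement_rights: dict = field(default_factory=dict)
    # collection name -> {use type -> right}
    collection_rights: dict = field(default_factory=dict)
    # publication id -> {use type -> right}
    publication_rights: dict = field(default_factory=dict)
    # collection name -> set of publication ids it contains
    collections: dict = field(default_factory=dict)


def resolve_right(agreement, publication_id, use_type):
    """Return the most specific right for a use type, or None if none applies."""
    # 1. A right assigned directly to the publication wins.
    right = agreement.publication_rights.get(publication_id, {}).get(use_type)
    if right is not None:
        return right
    # 2. Otherwise use a right assigned to a collection containing it.
    for collection, members in agreement.collections.items():
        if publication_id in members:
            right = agreement.collection_rights.get(collection, {}).get(use_type)
            if right is not None:
                return right
    # 3. Otherwise fall back to the agreement-level right.
    return agreement.agreement_rights.get(use_type)
```

Running this per agreement yields one candidate right per agreement for the requested use; choosing among those candidates, including tie handling, is a separate step not shown here.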
[0037] Terms 621 may also be associated with each agreement. Terms include rights holder terms, contract terms that cannot be expressed programmatically as a right, certain statutory laws, such as Swiss law allowing publication sharing with other Swiss employees, and company policies. Terms may be assigned at the publication, collection and agreement levels. In general, terms associated with rights are tagged as "Restrictive" or "Nonrestrictive". The "Restrictive" tag indicates that the associated right (such as a right to photocopy a publication) is limited by the text of the terms (for example, a restrictive term might be "only internal distribution is allowed"). The "Nonrestrictive" tag indicates the terms do not limit the applicability of the right, perhaps because they extend the scope of the permitted activity (for example, nonrestrictive terms might include "There are no restrictions on the distribution of photocopies of this content").
[0038] Returning to Figure 5, in step 506, the rights advisor accesses a metadata database 122 as indicated schematically by arrow 120, and attempts to obtain a standard number for the publication containing the article for which the member has requested rights information in order to determine whether any of the retrieved agreements are applicable to that publication. In accordance with the principles of the invention, the rights advisor tries to look up the publication using two separate methods that are performed in parallel.
[0039] In accordance with the first method, the rights advisor uses information that it receives from the member's browser to attempt to look up the publication. If this information includes the article title and recognized standard identifying numbers, such as an ISSN or an ISBN number for the publication, then a lookup of the publication may be possible using just this information. However, in some cases, only the URL of the article may be available. Article URLs are often arbitrary, and by themselves provide no consistent means to determine whether a given article belongs to a publication with a recognized standard identifier. Thus, the rights advisor web page 108 attempts to map, or translate, the URL into a standard identifier, where such an identifier is available. Using this standard identifier, the rights advisor web page 108 can then access the metadata database 122 to obtain a standard number that identifies the publication. This standard number can be applied to the retrieved agreements for the organization to determine which agreements apply to the specified publication.
[0040] URL mapping performed by the rights advisor relies on a variety of URL parsers, each of which uses a parsing algorithm, and a supporting database of URL formats 118. In particular, the rights advisor program 108 has a set of rules for determining which parsers are applicable to a particular URL and a set of parsers that are each able to separate a particular URL into web-site specific identifiers useful for the URL mapping task. Once these specific identifiers have been obtained, they are applied, as schematically indicated by arrow 116, to a database 118 of rules for translating the web-site specific identifiers into standard identifiers such as ISSN or ISBN identifiers. Once the standard identifiers have been obtained, they are applied, as indicated schematically by arrow 114, to a database 112 that is keyed by the standard identifiers for publications. This database 112 enumerates publication titles and the rights under which the publications can be used.
[0041] Apparatus 700 for obtaining a standard identifier from article information is illustrated in Figure 7 and the steps in the lookup process are illustrated in Figures 8A and 8B. The lookup process begins in step 800 and proceeds to step 802 where information 702 concerning the displayed article is received from the member web browser 102. In step 804, an attempt is made by the web server to look up the corresponding publication in a metadata table 734 (as schematically indicated by arrow 703) using information, such as a title or any standard numbers, present in the information received from the browser. If this attempt is successful, as determined in step 806, a standard identifier is returned as indicated schematically by arrow 736 and the process proceeds, via off-page connectors 822 and 828, to finish in step 848.
[0042] Alternatively, if the attempt is not successful, then the process proceeds to step 808 where the domain name is saved by the web server using, for example, a store and forward dispatcher. This storage operation triggers two processes that operate in parallel and attempt to locate the standard identifier for the applicable publication. The first process is set forth in steps 810, 814, 818, 824, 830 and 834. The second process is illustrated by steps 812, 816, 820, 832, 836, 840 and 842.
[0043] In step 810 of the first process, as indicated by arrow 703, the URL of the website that is being viewed in the member's browser is used to query a set of parser rules 704 to determine the most applicable URL parser, as well as the configuration settings that determine how the parsers will be used in the cases that the rules identify. In particular, the stored domain name in the URL is matched against the set of parser rules to select rules that apply to that domain. In turn, the selected rules are then used to select and configure the parsers.
[0044] Figure 9 shows an illustrative embodiment 900 for the parser rule set 704. In this embodiment, the parser rule set is implemented as a set of relational database tables 902, 904, 906, 910 and 912. Each content provider is provided in the content provider table 902 with a record containing a unique identifier (CONTENT_PROVIDER_ID) and a name (CONTENT_PROVIDER_NAME). A content provider may be associated with one or more Internet domains via the Content Provider Domain table 904. Table 904 contains one or more records for each content provider and each record contains a domain identifier (CONTENT_PROVIDER_DOMAIN_ID), a domain name (CONTENT_PROVIDER_DOMAIN), a reference to the content provider (CONTENT_PROVIDER_ID) and a precedence level (PRECEDENCE_LEVEL) that indicates, if there are a plurality of domains, which domain should be examined first. The table also includes a URL segment map identifier (URL_SEGMENT_MAP_ID) and a URL parser identifier (URL_PARSER_ID) for each domain. The URL segment map identifier identifies a record in the URL Segment Map table 912 which contains data indicating the structure of the URL, which can consist of three segments (URL_SEGMENT_1, URL_SEGMENT_2 and URL_SEGMENT_3). In some cases, a standard publication number or a publication identifier may be directly associated with a domain name. If this is the case, these identifiers are stored in the URL segment map (STD_NO and PUB_ID).
[0045] If publication identifiers cannot be directly associated with the domain name, then a URL parser is associated with the content provider domain via a reference to the URL Parser table 906. Table 906 includes a parser identifier (URL_PARSER_ID), a parser name (URL_PARSER_NAME) and an indication whether the particular parser is enabled. A further table 910 (the URL Parser Param table) contains parameters that are used with a particular parser.
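The selection of a parser rule from the tables of Figure 9 can be sketched as follows. This is only an illustrative sketch, assuming the records of tables 904 and 906 have been loaded into in-memory arrays. The sample rows, the `ENABLED` field name, the `selectParserRule` function, the suffix-matching of host names, and the convention that a lower PRECEDENCE_LEVEL is examined first are all assumptions made for illustration and are not taken from the tables themselves.

```javascript
// Hypothetical rows modeled on the Content Provider Domain table 904.
const contentProviderDomains = [
  { CONTENT_PROVIDER_DOMAIN: "example-publisher.com", CONTENT_PROVIDER_ID: 1,
    PRECEDENCE_LEVEL: 2, URL_PARSER_ID: 10 },
  { CONTENT_PROVIDER_DOMAIN: "journals.example-publisher.com", CONTENT_PROVIDER_ID: 1,
    PRECEDENCE_LEVEL: 1, URL_PARSER_ID: 11 },
];

// Hypothetical rows modeled on the URL Parser table 906.
const urlParsers = [
  { URL_PARSER_ID: 10, URL_PARSER_NAME: "generic", ENABLED: true },
  { URL_PARSER_ID: 11, URL_PARSER_NAME: "keyAndDate", ENABLED: true },
];

// Returns the first enabled parser whose domain record matches the URL's host,
// examining matching domain records in precedence order, or null if none applies.
function selectParserRule(url, domains, parsers) {
  const host = new URL(url).hostname.toLowerCase();
  const candidates = domains
    .filter((d) => host === d.CONTENT_PROVIDER_DOMAIN ||
                   host.endsWith("." + d.CONTENT_PROVIDER_DOMAIN))
    .sort((a, b) => a.PRECEDENCE_LEVEL - b.PRECEDENCE_LEVEL);
  for (const rule of candidates) {
    const parser = parsers.find((p) => p.URL_PARSER_ID === rule.URL_PARSER_ID);
    if (parser && parser.ENABLED) {
      return { rule: rule, parser: parser };
    }
  }
  return null;
}
```

In this sketch a host that matches several domain records is resolved by precedence alone; a relational implementation would express the same choice with an ordered query against table 904.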
[0046] Returning to Figure 7, after selecting a parser rule set based on the domain name in the URL, one of a set of parsers, of which parsers 706 and 708 are shown, identified in the selected rule is used, in step 814, to parse the URL and generate the data field values. A parser consists of the instructions for extracting from a URL the data fields necessary to use translation rules to determine a standard identifier. One such set of data fields includes three members: the key base, the journal key and the publication date. The key base specifies a context in which the derived identifier is meaningful; in other words, a particular publisher may give all of the publications on its web site unique, proprietary numbers, and use this numbering system in the URLs for the articles on its web site. The key base in this case can be any string that specifies the publisher's web site, such as 'PUB1'; the journal key is then the publisher's own proprietary identifier.
[0047] Parsers, such as parsers 706-708, are defined to extract data in particular formats. For instance, many publishers follow an informal convention in which the URL for an article contains the concatenation of a unique string identifying the publication with four numeric digits signifying the year and month of publication of the article. A variety of well-known parsing techniques can be used to locate this string and split it into the desired components. Once a parser is created to extract this concatenated string from a URL and split the string into its two useful components, the parser can be configured with parser rules, such as those set forth above, to perform the same task for URLs of any publisher that follows this convention. Any selected parsing technology must be able to implement at least the following capabilities: within a given string, locate a specified prefix string; extract characters following the prefix string until a specified suffix string is located; and split an extracted string into multiple substrings according to simple format specifications. Conventional UNIX- or Perl-like regular expressions are easily capable of performing these parsing and extraction tasks. In general, new parser rules and parsers can be added to support new URL formats. A more detailed discussion of parsers and their construction is contained in U.S. Patent Application Serial No. 11/733,423 filed on April 10, 2007 by C. Howard, J. Arbo and V. Shetty and entitled "Method and Apparatus for Converting a Document Universal Resource Locator to a Standard Document Identifier." The disclosure of this application is hereby included herein in its entirety by reference.
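The three parsing capabilities listed above (locate a prefix, extract up to a suffix, split by a simple format) can be sketched as follows. This is a minimal sketch, not the parser of the referenced application: the `parseArticleUrl` function, the shape of the `config` object, the example URL layout, and the assumption that the four digits are a two-digit year followed by a two-digit month are all hypothetical.

```javascript
// Extracts the text between config.prefix and config.suffix, then splits it
// into a journal key (letters) and four digits assumed here to be YYMM.
// Returns null when the URL does not follow the expected convention.
function parseArticleUrl(url, config) {
  const start = url.indexOf(config.prefix);
  if (start === -1) return null;
  const from = start + config.prefix.length;
  const end = url.indexOf(config.suffix, from);
  if (end === -1) return null;
  const extracted = url.slice(from, end);
  const parts = /^([A-Za-z]+)(\d{2})(\d{2})$/.exec(extracted);
  if (parts === null) return null;
  return {
    keyBase: config.keyBase,   // identifies the publisher's numbering context
    journalKey: parts[1],      // publisher's proprietary publication identifier
    year: parts[2],
    month: parts[3],
  };
}

// Hypothetical parser configuration for one publisher's URL convention.
const exampleConfig = { prefix: "/content/", suffix: "/", keyBase: "PUB1" };
const fields = parseArticleUrl(
  "http://www.example.com/content/marchem0703/article.html", exampleConfig);
```

Because the prefix, suffix and key base come from configuration, the same function could serve any publisher that follows the same convention, which is the reuse the paragraph above describes.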
[0048] The process then proceeds, via off-page connectors 818 and 824, to step 830, where the extracted data field values are presented to the translation rule database 714 as indicated schematically by arrows 710 and 712. The translation database includes a plurality of entries, each entry constituting a translation rule that, in turn, includes at least three fields: the key base, the journal key and the standard identifier, and may include other fields, such as date fields. The key base and journal keys are used as key fields. If the data field values presented to the translation rule database match these fields, the associated standard identifier is returned.
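A translation rule lookup keyed on the pair (key base, journal key) can be sketched as follows. The sketch assumes the rules are held in memory rather than in database 714; the function names and the sample rules (including the pairing of journal keys with the two ISSN values) are hypothetical and serve only to show why the key base is part of the key.

```javascript
// Builds an index keyed on the (keyBase, journalKey) pair.
function buildTranslationIndex(rules) {
  const index = new Map();
  for (const rule of rules) {
    index.set(JSON.stringify([rule.keyBase, rule.journalKey]), rule.standardId);
  }
  return index;
}

// Returns the standard identifier for the pair, or null when no rule matches.
function translate(index, keyBase, journalKey) {
  const id = index.get(JSON.stringify([keyBase, journalKey]));
  return id === undefined ? null : id;
}

// Hypothetical rules: two publishers reuse the same journal key "j12",
// so only the key base distinguishes them.
const translationIndex = buildTranslationIndex([
  { keyBase: "PUB1", journalKey: "j12", standardId: "0304-4203" },
  { keyBase: "PUB2", journalKey: "j12", standardId: "0895-4852" },
]);
```

Serializing the pair with `JSON.stringify` is just one way to form a composite key; a database implementation would use a two-column key instead.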
[0049] Since the journal key is internal data for a particular publisher, there is no guarantee that journal keys will be unique outside the context of a particular website or website subset. The key base provides a mechanism for ensuring that the journal keys can be mapped accurately to standard identifiers, such as an ISSN. If, in step 834, it is determined that such a standard identifier results from the database query, then the URL mapping process proceeds to step 836 where an attempt is made to look up the publication using the standard identifier. If the publication is found, as determined in step 840, the process finishes in step 848.
[0050] The second process for obtaining a standard identifier for a publication begins in step 812. As previously mentioned, this process is initiated when a domain name is stored (in step 808) and proceeds in parallel with the aforementioned URL parsing process. In step 812, a script, called a metadata parser, which is specific to the domain, is retrieved from the URL mapping database 118. Illustratively, this script might be written in JavaScript. As shown in Figure 9, metadata parser scripts are stored in a metadata parser table 908. Each record in this table includes a parser identifier (METADATA_PARSER_ID), a domain key (SELECT_KEY), an indication whether the script is enabled, the script text (SCRIPT_TEXT) and several timing entries that control the timeout interval and retry policies for script execution. Each script can also be disabled for a particular customer by making an entry into the METADATA_CUST_DISABLED table 914. The URL of the web page is used as the domain key to access the table 914 and retrieve script text that is specific to that domain.
[0051] The retrieved script text is downloaded to the member's browser and appended to the bookmarklet script already running in the browser using the timing and retry numbers stored with the script. When activated, the script text parses the HTML code of one or more web pages on the current web site and attempts to locate additional data concerning the desired publication, again using the timing and retry information stored with the script. For example, such a script may parse the HTML code of the web page that the member is currently viewing. Alternatively, the script may navigate from one page to another web page and then parse the HTML code on the second web page. Illustratively, this might occur in situations where the member is viewing a full-text version of an article, but publication information for that article is available on a different web page that displays publication abstracts.
[0052] The scripts are typically designed by a human operator who visits the web site, notes the location of the additional information and then writes the script to retrieve the information. An operator can generate scripts from "scratch" using the general workflow of the site for extracting content identifiers as the functional boundary, but in most cases, operators will begin with an existing metadata parser script. Scripts designed to process similar types of publications are typically similar in construction so that each publication type has a template script that can be modified for a specific domain. For example, template scripts might be designed for trade journals, news articles, patents and press releases or other sites which have similar page layouts and types, or from which similar metadata will be extracted or which have similar site structures. Alternatively, an operator may decide to use an existing script if the content or documents to be parsed are a similar type and structure or contain the same kind of metadata, such as an author's name or the year in a copyright notice. For example, a copyright notice appears in most Web pages in the footer, at the bottom of the page. A typical format is "Copyright © 2006 copyright holder name." Because the format and the data captured are similar, a system user can modify an existing script to perform the same function on a different Web site or page set.
[0053] A template or existing script can be selected by an operator based on the URL of the web site as schematically illustrated by arrow 719 in Figure 7. Then, the operator would typically log onto the web site with a temporary account, note the location of the relevant information and modify the template script to navigate to that location. This modification is performed by applying the existing script to a script editor 721 to generate a new script 723 which is then stored in the metadata script parser table 720. Illustratively, a template script for a trade journal could skip over the first one thousand bytes of the HTML code in order to avoid header information and then parse the remaining HTML code with parsers similar to those discussed above for URLs.
[0054] Generally, the scripts are site specific. For example, a web page made up of HTML code and other formatting elements will require a different parser than a page coded in PDF (Portable Document Format). However, a web page with a similarly-located standard identifier, such as an ISBN (International Standard Book Number), an ISSN (International Standard Serial Number), or a DOI (Digital Object Identifier) may be used by many sites and services for a particular kind of work (e.g., books, journals or research articles).
[0055] For example, Figure 10 is a screen shot of an illustrative web page from a website of the publisher Elsevier which contains an ISSN (0304-4203) identifying the journal "Marine Chemistry" embedded in the web page text. Figure 11 shows the HTML code which causes the web page display shown in Figure 10 to be generated in a conventional web browser. In order to parse the HTML code shown in Figure 11, the parser steps through the DOM (Document Object Model) to pick out the document identifier (ISSN, DOI, etc.). The DOM provides a hierarchical structure of the web page along with values, allowing a program to search and step through a fixed format to gather the information required. Consistency of format is important, but with the page information returned by the DOM, a program can, for example, query for the <span> tag whose class equals, for example, "journalinformation" and then use various conventional methods, such as regular expression matching, to extract the standard number from the block of returned text.
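The DOM query followed by regular expression matching can be sketched as follows. This is an illustrative sketch only: the `findIssn` function and the exact ISSN pattern are assumptions, and the class name "journalinformation" is the example class mentioned above rather than markup confirmed for any particular site. The function accepts any object that offers the standard `querySelectorAll` method, such as a browser's `document`.

```javascript
// Matches an ISSN-shaped token such as 0304-4203: four digits, a hyphen,
// three digits, and a final digit or X.
const ISSN_PATTERN = /\b(\d{4}-\d{3}[\dXx])\b/;

// Looks for an ISSN inside <span class="journalinformation"> elements.
// Returns the first match in upper case, or null when nothing is found.
function findIssn(doc) {
  const spans = doc.querySelectorAll("span.journalinformation");
  for (const span of spans) {
    const match = ISSN_PATTERN.exec(span.textContent || "");
    if (match !== null) {
      return match[1].toUpperCase();
    }
  }
  return null;
}
```

In a browser the call would be `findIssn(document)`; the pattern checks only the shape of the number and does not verify the ISSN check digit.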
[0056] The following is an example of a metadata parser script which is constructed in accordance with the principles of the invention. This script is written in the JavaScript language.
  1  //
  2  // METADATA PARSER SCRIPT for @DOMAINMASK@ PARSES CURRENT DOCUMENT
  3  //
  4
  5  var PARSERMASK="@PARSERMASK@";
  6  var PROTOCOL='http://';
  7
  8  //
  9  // PICKUP KEY, DOMAIN, and ORIGINALURL
 10  //
 11
 12  var KEY='@BOOKMARKLETKEY@';
 13  var DOMAIN='@HREFDOMAIN@';
 14  var ORIGINALURL='@ORIGINALURL@';
 15
 16  //
 17  // PICKUP ACTION will be either 'parse' or null
 18  //
 19
 20  var ACTION=@ACTION@;
 21
 22  //
 23  // FIELDS TO BE IDENTIFIED
 24  //
 25
 26  var MDATA_TITLE = "&title=";
 27  var MDATA_ISSN="&stdno=";
 28  var MDATA_RESET = "&reset=true";
 29  var MDATA_KEY = "&key=" + KEY;
 30
 31  var FINDARGUMENTS="FINDARGUMENTS: ";
 32  var PROTOARGUMENTS="PROTOCOL=" + PROTOCOL;
 33  var RESULTS="";
 34
 35
 36  //
 37  // verifyProtocol - verify if it is http or https
 38  //
 39
 40  function verifyProtocol () {
 41
 42      var regexInstance;
 43      var ignoreCase = "i";
 44      var regexString = "^(https://)";
 45      var searchString = document.location.href;
 46      var foundValue=null;
 47      var position=1;
 48
 49      if ( ignoreCase == "i" ) {
 50          regexInstance = new RegExp (regexString, ignoreCase);
 51      } else {
 52          regexInstance = new RegExp (regexString);
 53      }
 54
 55      //PROTOARGUMENTS += "verifyProtocol: ";
 56      //PROTOARGUMENTS += " Pattern=" + regexInstance.source;
 57      //PROTOARGUMENTS += ";Position=" + position;
 58      //PROTOARGUMENTS += ";IgnoreCase=" + regexInstance.ignoreCase;
 59      //PROTOARGUMENTS += ";SearchString=" + searchString + "\n";
 60
 61      var matchAttempt = regexInstance.exec ( searchString );
 62
 63      PROTOARGUMENTS = "PROTOCOL=";
 64
 65      if ( matchAttempt != null ) {
 66
 67          foundValue = matchAttempt [position];
 68
 69          if ( foundValue != null && foundValue.length == 0 ) {
 70              foundValue=null;
 71              PROTOARGUMENTS += PROTOCOL;
 72          } else {
 73              PROTOCOL = foundValue;
 74              PROTOARGUMENTS += " Reset to " + PROTOCOL;
 75          }
 76      } else {
 77
 78          PROTOARGUMENTS += PROTOCOL;
 79
 80      }
 81
 82      return PROTOCOL;
 83  }
 84
 85  //
 86  // findRegexValue - find issn and title
 87  //
 88
 89  function findRegexValue ( regexString, searchString, ignoreCase, position )
 90  {
 91
 92      FINDARGUMENTS += "\n";
 93      var foundValue=null;
 94      var regexInstance;
 95      if ( ignoreCase == "i" ) {
 96          regexInstance = new RegExp (regexString, ignoreCase);
 97      } else {
 98          regexInstance = new RegExp (regexString);
 99      }
100      FINDARGUMENTS += "\nPattern=" + regexInstance.source;
101      FINDARGUMENTS += "\nPosition=" + position;
102      FINDARGUMENTS += "\nIgnoreCase=" + regexInstance.ignoreCase;
103
104      var matchAttempt = regexInstance.exec ( searchString );
105
106      if ( matchAttempt != null ) {
107
108          foundValue = matchAttempt [position];
109
110          FINDARGUMENTS += "\nFound=" + foundValue;
111
112      }
113
114      return foundValue;
115
116  }
117
118
119  //
120  // findPubInformation - find issn and title
121  //
122
123  function findPubInformation ( html ) {
124
125      RESULTS = "\nResults Reported to Rightsphere: " + "\n";
126      var ignoreCaseTrueFlag="i";
127
128      var gotIssn = false;
129      var issn=null;
130
131      if ( !gotIssn ) {
132          var regexISSNOnline = "(?:<td.*>)(\\d[0-9,-]*\\d)" +
133              "(?:\\s*\\(Print\\)\\s*)(\\d[0-9,-]*\\d)(?:\\s*\\(Online\\)</td>)";
134          issn = findRegexValue (regexISSNOnline, html, ignoreCaseTrueFlag, 2);
135          if ( issn != null ) {
136              MDATA_ISSN += issn;
137              RESULTS += "\n\tOnline ISSN: " + issn;
138              gotIssn=true;
139          }
140      }
141
142      if ( !gotIssn ) {
143          var regexISBNOnline = "(?:<td.*>)(\\d{4}-\\d{3}[0-9,a-z,A-Z])" +
144              "(?:\\s*\\(Print\\)\\s*)(\\d{4}-\\d{3}[0-9,a-z,A-Z])" +
145              "(?:\\s*\\(Online\\)</td>)";
146          issn = findRegexValue (regexISBNOnline, html, ignoreCaseTrueFlag, 2);
147          if ( issn != null ) {
148              MDATA_ISSN += issn;
149              RESULTS += "\n\tOnline ISBN: " + issn;
150              gotIssn=true;
151          }
152      }
153
154      if ( !gotIssn ) {
155          var regexISSNPrint = "(?:<td.*>)(\\d[0-9,-]*\\d)" +
156              "(?:\\s*\\(Print\\)\\s*)(\\d[0-9,-]*\\d)(?:\\s*\\(Online\\)</td>)";
157          issn = findRegexValue (regexISSNPrint, html, ignoreCaseTrueFlag, 1);
158          if ( issn != null ) {
159              MDATA_ISSN += issn;
160              RESULTS += "\n\tPrint ISSN: " + issn;
161              gotIssn=true;
162          }
163      }
164
165      if ( !gotIssn ) {
166          var regexISBNPrint = "(?:<td.*>)(\\d{4}-\\d{3}[0-9,a-z,A-Z])" +
167              "(?:\\s*\\(Print\\)\\s*)(\\d{4}-\\d{3}[0-9,a-z,A-Z])" +
168              "(?:\\s*\\(Online\\)</td>)";
169          issn = findRegexValue (regexISBNPrint, html, ignoreCaseTrueFlag, 1);
170          if ( issn != null ) {
171              MDATA_ISSN += issn;
172              RESULTS += "\n\tPrint ISBN: " + issn;
173              gotIssn=true;
174          }
175      }
176
177      //<td class="labelName">ISSN</td><td class="labelValue">0895-4852</td>
178
179      if ( !gotIssn ) {
180          var regexISSNalone = "<td.*>ISSN</td>\\s*<td.*>\\s*" +
181              "(\\d{4}-?\\d{3}[0-9,a-z,A-Z])";
182          issn = findRegexValue (regexISSNalone, html, ignoreCaseTrueFlag, 1);
183          if ( issn != null ) {
184              MDATA_ISSN += issn;
185              RESULTS += "\n\tISSN: " + issn;
186              gotIssn=true;
187          }
188      }
189
190      var title=null;
191      var gotTitle=false;
192
193      if ( !gotTitle ) {
194          var titleREGEX = "<div " +
195              "class=\\\"?MPReader_Content_PrimitiveHeadingControlName\\\"?>\\s*([^<]*)";
196          title = findRegexValue (titleREGEX, html, ignoreCaseTrueFlag, 1);
197          if ( title != null ) {
198
199              MDATA_TITLE += encodeURIComponent (title);
200              RESULTS += "\n\tTitle: " + title;
201              gotTitle=true;
202          }
203      }
204
205      //
206      // Store ISSN Information to Rightsphere
207      //
208
209      var d = document.getElementsByTagName ("html")[0];
210      if (d!=null) {
211          var s=d.appendChild (document.createElement ('script'));
212          s.id='naples'+KEY;
213          s.language='javascript';
214          void (s.src='@webapp.baseURL@/dispatcher?type=ra&target=store' +
215              MDATA_RESET + MDATA_KEY + MDATA_ISSN + MDATA_TITLE);
216      }
217  }
218
219  //
220  // resetContentPage - display any debugging info and resets page
221  // following completion of data store
222  //
223
224  function resetContentPage () {
225
226      //
227      // Display results in a debugging alert
228      //
229
230      if ( ALLOWDEBUG == true ) {
231          if ( DEBUG != null ) {
232              var debugResults = "PARSERMASK="+PARSERMASK;
233              debugResults += "\nKEY="+KEY;
234              debugResults += "\n"+PROTOARGUMENTS;
235              debugResults += " DOMAIN="+DOMAIN;
236              debugResults += " ORIGINALURL="+ORIGINALURL;
237              debugResults += "\nACTION="+ACTION+"\n";
238              debugResults += "\n"+FINDARGUMENTS;
239              debugResults += "\n"+RESULTS;
240              alert (debugResults);
241          }
242      }
243
244      //
245      // refresh the user's window MUST BE LAST STATEMENT EXECUTED!!
246      //
247
248      refreshUserWindow ();
249
250  }
251
252  //
253  // Execute the findPubInformation procedure
254  //
255
256  if ( ACTION == 'parse' ) {
257
258      verifyProtocol ();
259      findPubInformation ( document.getElementsByTagName ("html")[0].innerHTML
260      );
261
262  }
[0057] In this example, the code at lines 1-35 defines various variables that are used in the following subroutines. The code at lines 36-84 defines a subroutine that determines whether the protocol of the web page under examination is http or https. The code at lines 85-118 defines a subroutine that searches a text string for a match to a regular expression pattern. The code at lines 119-204 uses the search function defined in lines 85-118 to sequentially search the web page HTML code for predetermined character patterns that indicate the presence of the publication standard identifier and document title. The code at lines 205-218 stores the metadata information retrieved from the web page HTML code to the bookmarklet data storage. The code at lines 219-243 displays debugging information and resets the web page following the metadata storage operation. The code at lines 244-251 refreshes the user's display window and the code at lines 252-262 executes the parsing subroutines.
[0058] In some cases the metadata is not contained on a Web page which is initially displayed, but only on a preceding page or on a page that requires user interaction with the Web site. Therefore, some metadata parser scripts can perform the functions necessary to navigate the client browser up or down in the browsing history before extracting metadata. When creating scripts for new sites that require controlling the browser location, the system user may begin generating a new metadata parser script from an existing one that contains this functionality.
[0059] In step 816, the script is executed in the member's browser and parses the HTML code to locate relevant publication information. This information can include, for example, the publication title and standard publication numbers, such as an ISSN or an ISBN number. The process then proceeds, via off-page connectors 820 and 826, to step 832, where any information extracted from the website is stored using the bookmarklet key as a retrieval key. Storage of the information is necessary at this point because the first process may still be proceeding.
[0060] If, in step 834, the first process determines that a publication identifier could not be located by parsing the publication URL or, in step 840, that an attempt to look up the publication with the located identifier failed, then the process proceeds to step 838 where a determination is made whether any publication information has been stored in step 832 by accessing storage with the aforementioned bookmarklet key. If no publication information is located, the process proceeds to step 846 where a search web page is displayed that allows the member to perform a manual search for the publication information, and the process then finishes in step 848.
[0061] Alternatively, if in step 838, it is determined that stored information obtained by the second process is located with the bookmarklet key, then an attempt is made in step 842 to look up the publication using the stored information. Before such a lookup attempt is made, the information may be pre-processed into a standard format in order to simplify the lookup process. For example, certain standard information may be removed, including HTML tags, spaces, foreign language characters and common articles. Then, the standard form is used to perform a lookup attempt. The lookup attempt itself may proceed in several stages. First, the lookup process tries to use any standard publication numbers and titles found to generate a "fingerprint" in order to look up the publication. If that attempt fails, the process uses just the standard number, looking at alternate ID numbers. If the publication is still not found, the lookup process will use the title alone and look for a matching fingerprint.
[0062] If the publication is found, as determined in step 844, then the publication identifier is returned in step 848. Alternatively, if the publication is not found, as determined in step 844, the aforementioned search web page is displayed that allows the member to perform a manual search for the standard publication identifier, and the process then finishes in step 848.
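The pre-processing and staged lookup described for step 842 can be sketched as follows. This is a sketch under stated assumptions: the fingerprint format (standard number, a vertical bar, then the normalized title), the short English article list, the `db` object with its four `Map` indexes, and all function names are hypothetical and are not the format used by the metadata database.

```javascript
const ARTICLES = new Set(["a", "an", "the"]);

// Removes HTML tags, punctuation, non-ASCII characters, common articles and spaces.
function normalizeTitle(raw) {
  return raw
    .replace(/<[^>]*>/g, " ")
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, "")
    .split(/\s+/)
    .filter((word) => word !== "" && !ARTICLES.has(word))
    .join("");
}

// Keeps only digits and the X check character of a standard number.
function normalizeStdNo(raw) {
  return raw.replace(/[^0-9Xx]/g, "").toUpperCase();
}

// Stage 1: number plus title; stage 2: number alone, including alternate
// IDs; stage 3: title alone. Returns a publication ID or null.
function stagedLookup(db, info) {
  const title = info.title ? normalizeTitle(info.title) : null;
  const stdNo = info.stdNo ? normalizeStdNo(info.stdNo) : null;
  if (stdNo && title) {
    const hit = db.byFingerprint.get(stdNo + "|" + title);
    if (hit) return hit;
  }
  if (stdNo) {
    const hit = db.byStdNo.get(stdNo) || db.byAlternateId.get(stdNo);
    if (hit) return hit;
  }
  if (title) {
    const hit = db.byTitle.get(title);
    if (hit) return hit;
  }
  return null;
}
```

When every stage misses, the `null` result corresponds to the case in which the manual search web page is displayed.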
[0063] Returning to Figure 5, in step 506, once the standard publication identifier has been obtained using one of the methods described above, the process proceeds to step 508 where the rights advisor web server uses that identifier to determine all retrieved agreements that apply to the identified publication. Next, in step 510, a determination is made of all agreements that fit the member context. This determination is made by examining the boundaries of each agreement and then determining whether that agreement covers the member country and location and that the member meets any organization defined attributes.
[0064] In step 512 the best right for the type of use requested is determined. The process then finishes in step 514.
[0065] The process of determining the best right as set forth in step 512 involves examining each agreement that applies to the publication and meets the member context in order to determine the most appropriate right for the specified type of use that is included in the agreement. In performing this examination, each agreement is examined from the "bottom up." That is, more specific rights supersede more general rights. Thus, an agreement is first examined to determine whether a right for the type of use requested has been assigned directly to the specified publication, either by itself or to the publication as contained in a collection. If such a right is found, it is the right used for that agreement. If no such right has been assigned to the publication, the agreement is next checked to determine whether a right for the requested type of use has been assigned to a collection that includes the specified publication. If so, it is the right that is used for that publication. If no such right is found, then the agreement is checked to determine whether a right for the type of use has been assigned at the agreement level. If so, that right is used for the agreement.
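The "bottom up" examination of a single agreement can be sketched as follows. The shape of the `agreement` object, the use-type names, the right values and the `resolveRight` function are hypothetical; in particular, `publicationRights` is assumed here to hold rights assigned to a publication whether it is covered individually or as a member of a collection.

```javascript
// Returns the most specific right for useType, checking the publication
// level first, then the collection level, then the agreement level.
function resolveRight(agreement, publicationId, useType) {
  const pubRights = (agreement.publicationRights || {})[publicationId];
  if (pubRights && pubRights[useType]) {
    return { level: "publication", right: pubRights[useType] };
  }
  for (const collection of agreement.collections || []) {
    if (collection.publicationIds.includes(publicationId) &&
        collection.rights[useType]) {
      return { level: "collection", right: collection.rights[useType] };
    }
  }
  if (agreement.rights && agreement.rights[useType]) {
    return { level: "agreement", right: agreement.rights[useType] };
  }
  return null;
}

// Hypothetical agreement used only to exercise the three levels.
const sampleAgreement = {
  rights: { emailInternal: "GRANTED" },
  collections: [
    { publicationIds: ["0304-4203"], rights: { photocopyInternal: "GRANTED" } },
  ],
  publicationRights: { "0304-4203": { emailInternal: "PURCHASE" } },
};
```

With this sample data, a request for `emailInternal` on publication "0304-4203" resolves at the publication level even though the agreement-level right is more permissive, which is the superseding behavior described above.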
[0066] Then, the most applicable rights from all agreements are collected and ordered. In particular, rights are placed into a specific best to worst order based on the type of right and whether any terms are associated with the right. For purposes of resolution, rights with terms tagged as "Nonrestrictive" are treated as rights without terms - that is, at the highest level of applicability. The order of rights from best applicability to worst applicability is (1) right to use granted with no associated terms, (2) right to use granted with associated restrictive terms, (3) rights available for purchase under a pre-authorized contract, (4) rights available for purchase, but rights holder must be contacted with more information, (5) rights available for purchase, but must be special ordered, (6) contact librarian to determine rights and (7) no rights available. If a right cannot be determined it is treated as (6) above.
[0067] After the available rights have been collected and ordered, a determination is made whether the ordering yields one "clear winner." That is, one agreement includes a right that is more applicable than rights included in all other agreements. If so, this "clear winner" is used to determine the rights and terms for the requested type of use. These rights and terms are then displayed to the member in the rights advisor web page.
[0068] If no "clear winner" exists, then a "tie" exists between two or more agreements. Ties among two or more rights can take several forms. For example, a tie between two or more rights without terms indicates that identical rights are available from two different agreements. Since the rights are identical and indistinguishable, one agreement is selected by a variety of techniques (for example, arbitrarily) and the rights and terms of that agreement are displayed.
[0069] Alternatively, a tie between two or more rights with terms results in the display of all such rights together with the terms, so that the end user can make an informed judgment as to the permissibility of the requested activity. Another example is a tie between two or more rights with "Purchase" status. Such a tie results in the display of a list of the purchase information or capability for all such rights. In another embodiment, once a publication has been selected, the "best" rights which are available for various types of use are determined and presented to the member simultaneously.
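The ordering of rights and the handling of a "clear winner" or a tie can be sketched as follows. The status names, the `termsTag` field, the candidate object shape and both function names are hypothetical. The sketch also assumes that ties at the "contact librarian" and "no rights" levels are displayed in full, which the description does not specify.

```javascript
// Applicability ranks in best-to-worst order; a lower number is better.
const RANK = {
  GRANTED: 1,                  // granted with no terms or nonrestrictive terms
  GRANTED_RESTRICTED: 2,       // granted with restrictive terms
  PURCHASE_PREAUTHORIZED: 3,
  PURCHASE_CONTACT_HOLDER: 4,
  PURCHASE_SPECIAL_ORDER: 5,
  CONTACT_LIBRARIAN: 6,
  NO_RIGHTS: 7,
};

// A right that cannot be determined is ranked as "contact librarian".
function rankOf(right) {
  if (!right || !Object.prototype.hasOwnProperty.call(RANK, right.status)) {
    return RANK.CONTACT_LIBRARIAN;
  }
  if (right.status === "GRANTED" && right.termsTag === "Restrictive") {
    return RANK.GRANTED_RESTRICTED;
  }
  return RANK[right.status];
}

// Returns the candidates to display: a single clear winner, one arbitrarily
// chosen entry for a tie between rights without terms, or all tied entries.
function rightsToDisplay(candidates) {
  const ranked = candidates
    .map((c) => ({ agreementId: c.agreementId, right: c.right, rank: rankOf(c.right) }))
    .sort((a, b) => a.rank - b.rank);
  if (ranked.length === 0) return [];
  const tied = ranked.filter((c) => c.rank === ranked[0].rank);
  if (tied.length === 1 || tied[0].rank === RANK.GRANTED) {
    return [tied[0]];
  }
  return tied;
}
```

Each candidate passed to `rightsToDisplay` is expected to be the single most applicable right already resolved for one agreement.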
[0070] Once the rights have been displayed on the rights advisor web page, the process finishes in step 514.
[0071] A software implementation of the above-described embodiment may comprise a series of computer instructions either fixed on a tangible medium, such as a computer-readable medium, for example, a diskette, a CD-ROM, a ROM, or a fixed disk, or transmittable to a computer system for storage thereon via a modem or other interface device over a transmission path. The transmission path either may be tangible lines, including but not limited to, optical or analog communications lines, or may be implemented with wireless techniques, including but not limited to microwave, infrared or other transmission techniques. The transmission path may also be the Internet. The series of computer instructions embodies all or part of the functionality previously described herein with respect to the invention. Those skilled in the art will appreciate that such computer instructions can be written in a number of programming languages for use with many computer architectures or operating systems. Further, such instructions may be stored using any memory technology, present or future, including, but not limited to, semiconductor, magnetic, optical or other memory devices, or transmitted using any communications technology, present or future, including but not limited to optical, infrared, microwave, or other transmission technologies. It is contemplated that such a computer program product may be distributed as a removable medium with accompanying printed or electronic documentation, e.g., shrink wrapped software, pre-loaded with a computer system, e.g., on system ROM or fixed disk, or distributed from a server or electronic bulletin board over a network, e.g., the Internet or World Wide Web.
[0072] Although an exemplary embodiment of the invention has been disclosed, it will be apparent to those skilled in the art that various changes and modifications can be made which will achieve some of the advantages of the invention without departing from the spirit and scope of the invention. For example, it will be obvious to those reasonably skilled in the art that, in other implementations, process operations different from those shown may be performed. Other aspects, such as the specific process flow and the order of the illustrated steps, as well as other modifications to the inventive concept are intended to be covered by the appended claims.
[0073] What is claimed is:
Claims
1. A method for locating a standard publication identifier that identifies a publication containing an article that is referenced in a web page of a web site, the website being displayed in a browser and the method comprising: (a) creating a plurality of parser scripts, each parser script being adapted to extract publication information from at least one web page of a web site by parsing the HTML code of that web page; (b) selecting one of the parser scripts based on a domain of the web site; (c) downloading the selected parser script to the browser and executing the downloaded script in the browser; and (d) using information extracted from the web site to access a database containing standard publication identifiers indexed by publication information to determine the standard publication identifier.
2. The method of claim 1 wherein step (c) comprises using the parser script to navigate from a web page being displayed in the browser to another web page in the web site and extracting publication information by parsing the HTML code of that other web page.
3. The method of claim 1 wherein step (d) comprises processing the information extracted from the web site into a predetermined format and using the processed information to access the database.
4. The method of claim 3 wherein step (d) comprises using at least two different combinations of the processed information to access the database.
5. The method of claim 1 further comprising: (e) creating a plurality of parser rules, each parser rule being adapted to extract publication information for a particular form of universal resource locator; (f) selecting one of the parser rules based on the domain of the web site and parsing the universal resource locator for the web site with the selected rule to extract publication information; and (g) using the publication information extracted in step (f) to obtain the standard publication identifier.
6. The method of claim 5 wherein steps (e)-(f) are performed in parallel with steps (b) and (c).
7. The method of claim 5 further comprising: (h) creating a plurality of translation rules, each translation rule having publication information as an input and a standard publication identifier as an output; and (i) applying the publication information extracted in step (f) as an input to one of the plurality of translation rules and using the output of that translation rule as the standard publication identifier.
8. The method of claim 1 further comprising displaying a publication search web page in the browser if the standard publication identifier cannot be determined in step (d).
9. The method of claim 1 wherein step (a) comprises selecting a template script based on the domain of the web site, visiting the web site and modifying the template script to obtain the publication information.
10. The method of claim 1 wherein the publication information comprises at least one of the publication title and a publication identification number.
11. Apparatus for locating a standard publication identifier that identifies a publication containing an article that is referenced in a web page of a web site, the website being displayed in a browser and the apparatus comprising: means for creating a plurality of parser scripts, each parser script being adapted to extract publication information from at least one web page of a web site by parsing the HTML code of that web page; means for selecting one of the parser scripts based on a domain of the web site; means for downloading the selected parser script to the browser and executing the downloaded script in the browser; and means for using information extracted from the web site to access a database containing standard publication identifiers indexed by publication information to determine the standard publication identifier.
12. The apparatus of claim 11 wherein the means for executing the downloaded script in the browser comprises means for using the parser script to navigate from a web page being displayed in the browser to another web page in the web site and means for extracting publication information by parsing the HTML code of that other web page.
13. The apparatus of claim 11 wherein the means for using information extracted from the web site to access a database comprises means for processing the information extracted from the web site into a predetermined format and means for using the processed information to access the database.
14. The apparatus of claim 13 wherein the means for using information extracted from the web site to access a database comprises means for using at least two different combinations of the processed information to access the database.
15. The apparatus of claim 11 further comprising: means for creating a plurality of parser rules, each parser rule being adapted to extract publication information for a particular form of universal resource locator; means for selecting one of the parser rules based on the domain of the web site; means for parsing the universal resource locator for the web site with the selected rule to extract publication information; and means for using the publication information extracted by the means for parsing the universal resource locator to obtain the standard publication identifier.
16. The apparatus of claim 15 wherein the means for selecting one of the parser rules based on the domain of the web site and the means for parsing the universal resource locator for the web site with the selected rule to extract publication information operate in parallel with the means for selecting one of the parser scripts based on a domain of the web site and the means for downloading the selected parser script to the browser and executing the downloaded script in the browser.
17. The apparatus of claim 15 further comprising: means for creating a plurality of translation rules, each translation rule having publication information as an input and a standard publication identifier as an output; and means for applying the publication information extracted by the means for parsing the universal resource locator for the web site with the selected rule as an input to one of the plurality of translation rules and for using the output of that translation rule as the standard publication identifier.
18. The apparatus of claim 11 further comprising means for displaying a publication search web page in the browser if the standard publication identifier cannot be determined by the means for using information extracted from the web site to access a database to determine the standard publication identifier.
19. The apparatus of claim 11 wherein the means for creating a plurality of parser scripts comprises means for selecting a template script based on the domain of the web site and a script editor for modifying the template script to obtain the publication information.
20. The apparatus of claim 11 wherein the publication information comprises at least one of the publication title and a publication identification number.
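The method of claims 1, 3, 4, 5 and 8 can be illustrated with a minimal sketch. It is not the claimed implementation: the claims have the parser script downloaded to and executed in the browser, whereas this sketch runs every step in one Python process. The domain `journals.example.org`, the meta-tag names, the URL pattern, the ISSN value and the identifier `STD-0001` are all made up for illustration, and the in-memory dictionary stands in for the database of standard publication identifiers.

```python
import re
from typing import Callable, Dict, Optional, Tuple
from urllib.parse import urlparse

PubInfo = Dict[str, str]


def parse_example_journal(html: str) -> PubInfo:
    """Hypothetical parser script: reads title and ISSN from assumed meta tags."""
    info: PubInfo = {}
    title = re.search(r'<meta name="journal_title" content="([^"]*)"', html)
    issn = re.search(r'<meta name="issn" content="([^"]*)"', html)
    if title:
        info["title"] = title.group(1)
    if issn:
        info["issn"] = issn.group(1)
    return info


# Steps (a)-(b): parser scripts, selected by the domain of the web site.
PARSER_SCRIPTS: Dict[str, Callable[[str], PubInfo]] = {
    "journals.example.org": parse_example_journal,
}

# Claim 5: parser rules for a particular form of URL, also selected by domain.
URL_RULES = {
    "journals.example.org": re.compile(r"/issn/(\d{4}-?\d{3}[\dXx])(?:/|$)"),
}

# Step (d): stand-in database of identifiers indexed by publication information.
IDENTIFIER_DB: Dict[Tuple[str, str], str] = {
    ("issn", "12345678"): "STD-0001",
    ("title", "journal of examples"): "STD-0001",
}


def info_from_url(url: str) -> PubInfo:
    """Apply the URL parser rule for the site's domain, if one exists."""
    parsed = urlparse(url)
    rule = URL_RULES.get(parsed.hostname or "")
    match = rule.search(parsed.path) if rule else None
    return {"issn": match.group(1)} if match else {}


def normalize(info: PubInfo) -> PubInfo:
    """Claim 3: put the extracted information into a predetermined format."""
    out: PubInfo = {}
    if "issn" in info:
        out["issn"] = re.sub(r"[^0-9Xx]", "", info["issn"]).upper()
    if "title" in info:
        out["title"] = " ".join(info["title"].lower().split())
    return out


def locate_identifier(url: str, html: str) -> Optional[str]:
    """Return the standard identifier, or None if it cannot be determined."""
    script = PARSER_SCRIPTS.get(urlparse(url).hostname or "")
    raw = info_from_url(url)
    if script is not None:
        raw.update(script(html))  # assumption: page-derived values take precedence
    info = normalize(raw)
    # Claim 4: try more than one combination of the processed information.
    for key in ("issn", "title"):
        value = info.get(key)
        if value and (key, value) in IDENTIFIER_DB:
            return IDENTIFIER_DB[(key, value)]
    return None  # claim 8: the caller would display a publication search page
```

A caller would pass the URL and HTML of the displayed page to `locate_identifier` and fall back to a publication search page when it returns `None`.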
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/US2008/074579 WO2010024811A1 (en) | 2008-08-28 | 2008-08-28 | Method and apparatus for generating standard document identifiers from content references |
Publications (1)
Publication Number | Publication Date |
---|---|
EP2321743A1 (en) | 2011-05-18 |
Family
ID=40451190
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP08798856A Withdrawn EP2321743A1 (en) | 2008-08-28 | 2008-08-28 | Method and apparatus for generating standard document identifiers from content references |
Country Status (5)
Country | Link |
---|---|
EP (1) | EP2321743A1 (en) |
JP (1) | JP5438112B2 (en) |
AU (1) | AU2008360993B2 (en) |
CA (1) | CA2735215C (en) |
WO (1) | WO2010024811A1 (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP5991143B2 (en) * | 2012-10-31 | 2016-09-14 | 株式会社リコー | Information processing apparatus, system, and information registration method |
JP6075011B2 (en) * | 2012-10-31 | 2017-02-08 | 株式会社リコー | Information processing apparatus, system, and information providing method |
JP6304408B2 (en) * | 2017-01-12 | 2018-04-04 | 株式会社リコー | Information processing apparatus, information providing method, and program |
JP6451888B2 (en) * | 2018-03-08 | 2019-01-16 | 株式会社リコー | Information processing apparatus, system, and program |
CN111046629B (en) * | 2019-12-16 | 2022-03-01 | 北大方正集团有限公司 | Outline display method, device and equipment |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP3543726B2 (en) * | 2000-04-17 | 2004-07-21 | 株式会社日立製作所 | Knowledge search service method and apparatus for supporting search of books and the like |
EP1422633A3 (en) * | 2002-11-25 | 2007-11-28 | Internet Disclosure Co., Ltd. | Document authoring system and authoring management program |
JP2004348241A (en) * | 2003-05-20 | 2004-12-09 | Hitachi Ltd | Information providing method, server, and program |
JP2007183833A (en) * | 2006-01-06 | 2007-07-19 | Kazutoshi Tsuda | Information processing system |
JP2007328510A (en) * | 2006-06-07 | 2007-12-20 | Ricoh Co Ltd | Content conversion device, content display device, content browsing device, content conversion method, content browsing method and program |
- 2008
- 2008-08-28 CA CA2735215A patent/CA2735215C/en active Active
- 2008-08-28 WO PCT/US2008/074579 patent/WO2010024811A1/en active Application Filing
- 2008-08-28 EP EP08798856A patent/EP2321743A1/en not_active Withdrawn
- 2008-08-28 JP JP2011524951A patent/JP5438112B2/en active Active
- 2008-08-28 AU AU2008360993A patent/AU2008360993B2/en active Active
Non-Patent Citations (2)
Title |
---|
None * |
See also references of WO2010024811A1 * |
Also Published As
Publication number | Publication date |
---|---|
JP5438112B2 (en) | 2014-03-12 |
CA2735215A1 (en) | 2010-03-04 |
AU2008360993B2 (en) | 2015-07-09 |
JP2012501490A (en) | 2012-01-19 |
AU2008360993A1 (en) | 2010-03-04 |
WO2010024811A1 (en) | 2010-03-04 |
CA2735215C (en) | 2017-09-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CA2590523C (en) | Method and apparatus for converting a document universal resource locator to a standard document identifier | |
KR101389969B1 (en) | Message Catalogs for Remote Modules | |
CA2735215C (en) | Method and apparatus for generating standard document identifiers from content references | |
EP2035950A1 (en) | A method of managing web services using integrated document | |
CA2689087C (en) | Method and apparatus for obtaining content license rights via a document link resolver | |
RU2632149C2 (en) | System, method and constant machine-readable medium for validation of web pages | |
US8201242B2 (en) | Method and apparatus for verifying content reuse rights and resolving rights in the presence of multiple licenses | |
CA2248413A1 (en) | Apparatus and method for retrieving data from a network site | |
US20090019011A1 (en) | Processing Digitally Hosted Volumes | |
US8131752B2 (en) | Breaking documents | |
JP5712496B2 (en) | Annotation restoration method, annotation assignment method, annotation restoration program, and annotation restoration apparatus | |
US20130036350A1 (en) | Modular tool for constructing a link to a rights program from article information | |
Albertsen | The paradigma web harvesting environment | |
KR101079802B1 (en) | System and Method for Searching Website, Devices for Searching Website and Recording Medium | |
WO2014027237A1 (en) | Systems and methods for web localization | |
Hawkins | E-serial titles that disappear | |
Edgar | Sitemaps | |
Alonso et al. | Disclosing Private Information from Metadata, hidden info and lost data | |
Wang | Internationalization of Faculty Websites Using XML. | |
Vassilakis et al. | A heuristics-based approach to reverse engineering of electronic services | |
GB2405497A (en) | Search engine |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase | Free format text: ORIGINAL CODE: 0009012 |
 | 17P | Request for examination filed | Effective date: 20110225 |
 | AK | Designated contracting states | Kind code of ref document: A1 Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MT NL NO PL PT RO SE SI SK TR |
 | AX | Request for extension of the european patent | Extension state: AL BA MK RS |
 | DAX | Request for extension of the european patent (deleted) | |
 | 17Q | First examination report despatched | Effective date: 20140211 |
 | STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN |
 | 18D | Application deemed to be withdrawn | Effective date: 20180731 |