WO2001035281A1 - Content engine - Google Patents

Content engine

Info

Publication number
WO2001035281A1
Authority
WO
WIPO (PCT)
Prior art keywords
content
electronic content
electronic
filter elements
category
Prior art date
Application number
PCT/US2000/031016
Other languages
English (en)
Inventor
Alan S. Ellman
Brian C. Mcguinty
James P. Vinett
Original Assignee
Screamingmedia Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Screamingmedia Inc.
Priority to AU14842/01A
Publication of WO2001035281A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes

Definitions

  • the present invention relates to the filtering, categorization, and delivery of electronic content to any user of the internet, and more particularly, to an automated process whereby content can be read and understood in relation to a client defined filter for a content topic.
  • a system and method of processing electronic content involves storing filter elements associated with a content category and categorizing electronic content in the content category based on whether any filter elements, associated with the content category, appear in a content body of the electronic content.
  • the filter elements may be a word, a phrase and/or a citation and may be part of a Boolean filter which expresses a relationship between the filter elements.
  • the system and method receives the electronic content or a plurality of electronic content in a data stream from a content provider, across a network (e.g., Internet), and categorizes each electronic content in any one of a plurality of content categories having associated therewith corresponding filter elements.
  • the system and method retrieves configuration information defining attributes of a data stream in which the electronic content is received and normalizes the electronic content according to the configuration information.
  • the normalization of the received content may involve sub-parsing the electronic content to obtain header, trailer and payload sections according to the configuration information and performing interpolations within a payload of a discrete electronic content unit including the electronic content.
  • interpolations may involve stripping out contact information, expanding at least one of unique and common abbreviations, tagging at least one of paragraphs or tables, and converting control characters outside an ASCII range to human readable text.
  • the normalization of the electronic content is preferably performed by a
  • system and method indexes the received
  • the system and method hashes at least the body content of the electronic content, and creates a searchable vector of nodes.
  • the vector of nodes may be searched to determine the appearance of any filter elements associated with the content category in the content body of the electronic content as well as to determine a relationship between appearing filter elements in the content body of the electronic content.
  • the system and method categorizes electronic content by determining a first value for the electronic content based on whether any filter elements appear in the content body of the electronic content, determining a second value for the electronic content depending on whether any references associated with the content category are cited in the content body of the electronic content, and comparing a threshold value associated with the content category to a third value based on the first and second values to determine whether to assign the electronic content to the content category.
  • the first and second values may be weighted differently in determining whether to categorize electronic content in a content category.
  • the filter elements, such as the Boolean filter elements and citation entries, may be weighted differently in determining whether to categorize electronic content in a content category.
  • a content category may be associated with a client.
  • the system and method provides the electronic content to the client if the electronic content is categorized in that content category.
  • the electronic content may be delivered to a content management system of the client, across a network.
  • the system and method may maintain the electronic content for the client, and deliver the electronic content directly to a user requesting the electronic content from the client across a network.
  • the system and method provides an address corresponding to a location of the electronic content to the client to enable access to the electronic content.
  • Fig. 1 is a system overview of an electronic content distribution system;
  • Fig. 2 is a schematic block diagram of the central server of Fig. 1 including a content engine;
  • Fig. 3 is a flowchart illustrating a process by which the central server of Fig. 1, in combination with the content engine, categorizes electronic content;
  • Fig. 4 is a flowchart illustrating a process by which the central server of Fig. 1, in combination with the content engine, normalizes electronic content; and
  • Fig. 5 is a flowchart illustrating a process by which the central server of Fig. 1, in combination with the content engine, analyzes a content body as well as traditional index fields of the electronic content to categorize the electronic content.
  • electronic content distribution system 100 includes a central server 110, a plurality of content servers 120, a plurality of client servers 130 and a user computer 140, all of which are connected across network backbone 105.
  • Network backbone 105 may include an internet backbone, an intranet backbone or any other conventional network backbone or a combination thereof.
  • Content server 120 may be a conventional server which includes conventional computer hardware and functionality.
  • Content server 120 may be associated with a content provider, such as a publisher (e.g., a magazine publisher, book publisher, etc.), a news agency, or any distributor or provider of electronic content.
  • Electronic content may correspond to any publications (e.g., a news or magazine article), reports, technical papers and so forth.
  • Electronic content may include a content body including text and/or images with associated meta-data as well as traditional index fields generally provided in a header or trailer section of the electronic content. These traditional index fields are typically determined and inserted by human editors.
  • Client server 130 may also be a conventional server which includes conventional computer hardware and functionality and a content management system 135 for managing the storage of and the access to electronic content, for example, associated with a client operated web site accessible to a user of user computer 140.
  • a client may operate a web site, via client server 130, which provides access to electronic content and which is accessible by the user of user computer 140 through the use of a browser program 145, over the internet.
  • Client server 130 may be associated with any operator of a web site, for example, a business (e.g., etailer), an individual, and so forth.
  • Central server 110 may be a conventional server which includes
  • Central server 110 may be operated by or associated with a vendor which provides electronic content to clients according to their needs, e.g., according to the type of content or content category desired or defined by a client.
  • Central server 110 is configured to receive electronic content 125 from any of a plurality of content providers 120 across network backbone 105.
  • Central server 110 categorizes electronic content 125 in a content category based on whether any filter elements, associated with the content category, appear in a content body of the electronic content through the use of content engine 115.
  • Central server 110 then provides the electronic content 125 to a client associated with the content category, e.g., to deliver the electronic content to a client server 130 of the client or to maintain the electronic content for the client.
  • electronic content may be categorized to a level of granularity to satisfy the expanding needs of content clients while minimizing or eliminating reliance on human editors or predefined categorizations in the categorization process.
  • Fig. 2 is a schematic block diagram illustrating the components of central server 110 of Fig. 1.
  • Conventional computer components are included, such as a processor 200, user input devices 205 (e.g., keyboard, mouse, etc.) for receiving user inputs, and a network interface 210 for interconnection to content servers 120 and client servers 130.
  • Storage device 230 stores content engine 115, persistent object store 240, client configuration files 245 and citation library 250.
  • Processor 200, in combination with content engine 115, is configured to categorize electronic content 125 in a content category based on whether any filter elements, associated with the content category, appear in a content body of the electronic content.
  • operational examples of content engine 115 are described in detail below with reference to Figs. 3-5.
  • Persistent object store 240 is a dynamic object-oriented database which maintains a plurality of stores of information, such as a store of electronic content vended or provided to clients, a store of client information associated with distribution of electronic content to clients, and so forth.
  • Each client configuration file 245 is a data file that defines a client content category or topic. This definition is preferably constructed from a series of filter elements, such as words and phrases, joined by Boolean operators, following the fundamental principles of mathematical order of operations.
  • each client configuration file 245 also includes the subsection(s) of persistent object store 240 which content engine 115 is to examine in the performance of the filtering, categorization and distribution of electronic content to clients, executable programs that are invoked in the event of a filter match by content engine 115, the name of a disk cached index file to which content engine 115 should output filter match results, the syntax of the disk cached index file output, result sort criteria, persistent store search criteria, persistent store maintenance information, topical information, and a filter threshold.
  • Client configuration files 245 may generally contain any other information for a plurality of clients associated with the provision and distribution of electronic content to the clients.
  • Citation library 250 maintains citation entries or lists for a plurality of content categories.
  • Citation library 250 is preferably a dynamic data store that can be manipulated through human intervention or in an automated fashion by the electronic content units passing through content engine 115.
  • although the databases in Fig. 2, such as citation library 250, are shown as being separate from persistent object store 240, these databases may be indexed and maintained in persistent object store 240.
  • the databases maintained in storage device 230 may also be distributed across a plurality of storage devices situated at different locations.
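As an illustration of a client filter built from words and phrases joined by Boolean operators, the following minimal Python sketch evaluates such a filter against a content body. The filter syntax (upper-case AND, OR and NOT, parentheses, double-quoted phrases), the precedence of NOT over AND over OR, and the names tokenize and evaluate are assumptions made for this example; the specification does not prescribe a concrete syntax.

```python
import re


def tokenize(expr):
    """Split a filter into quoted phrases, parentheses, and bare words/operators."""
    return re.findall(r'"[^"]+"|\(|\)|[^\s()]+', expr)


def evaluate(expr, text):
    """Evaluate a Boolean filter against text (assumed precedence: NOT > AND > OR).

    Malformed filters (e.g. unbalanced parentheses) are not handled in this sketch.
    """
    tokens = tokenize(expr)
    body = " ".join(text.lower().split())  # lower-case and collapse whitespace
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def parse_or():
        value = parse_and()
        while peek() == "OR":
            take()
            rhs = parse_and()  # always parse the right side to consume its tokens
            value = value or rhs
        return value

    def parse_and():
        value = parse_not()
        while peek() == "AND":
            take()
            rhs = parse_not()
            value = value and rhs
        return value

    def parse_not():
        if peek() == "NOT":
            take()
            return not parse_not()
        return parse_atom()

    def parse_atom():
        tok = take()
        if tok == "(":
            value = parse_or()
            take()  # consume the closing ")"
            return value
        term = tok.strip('"').lower()
        # A filter element matches when it appears as a whole word or phrase.
        return re.search(r"\b" + re.escape(term) + r"\b", body) is not None

    return parse_or()
```

For example, evaluate('merger AND (bank OR "credit union") AND NOT sports', body) is true for a body that mentions a bank merger and becomes false once the word "sports" also appears in the body.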
  • Fig. 3 is a flowchart illustrating a process 300 by which central server 110, in combination with content engine 115, categorizes electronic content in one embodiment.
  • central server 110 receives electronic content having a content body from any one of a plurality of content servers 120 associated with a content provider.
  • the electronic content may be provided to central server 110 in a data stream from any one of the plurality of content servers 120, across network backbone 105.
  • the electronic content, particularly the content body, may include text and/or images with associated meta-data as well as header and trailer sections including traditional index fields.
  • central server 110 normalizes the electronic content.
  • central server 110 indexes the normalized electronic content into persistent object store 240. This allows the electronic content to be read and examined by content engine 115 in its entirety, e.g., traditional index fields as well as the electronic content payload or body section.
  • central server 110 hashes the electronic content, including the content body of the electronic content, and creates a vector of searchable match nodes. Each element in the vector preferably has a subsequent node chain that points at the next word in the electronic content unit.
  • central server 110 categorizes the electronic content in a content category based on whether filter elements associated with a content category appear in the content body and traditional index fields.
  • These filter elements may be a word, phrase, citation entries or any information or characteristic which may be identified in a content body of electronic content for the purposes of categorizing the electronic content in a content category.
  • content engine 115 can iterate through the vector of match nodes to identify whether any filter elements associated with a content category appear in the content body and traditional index fields. That is, content engine 115 examines the vector of match nodes to find or identify any filter matches.
  • Content engine 115 can also determine a relationship between those appearing filter elements or filter matches in the content body as well as in the traditional index fields of the electronic content.
  • Content engine 115 may then determine whether the electronic content belongs in the content category based on the filter matches and/or the relationship between the matched filter elements appearing in the content body as well as in the traditional index fields.
  • central server 110 writes, to a disk cached index file for the content category, the information needed to vend or provide the electronic content to a client associated with that content category.
  • the location of the disk cached index file may be specified in client configuration files 245 of Fig. 2.
  • central server 110 provides the electronic content to the client, for example, to content management system 135 of client server 130 of the client, associated with the content category.
  • central server 110 may deliver to client server 130, via network backbone 105, the electronic content in a variety of formats, such as in HTML, ASCII and so forth, preferably in a format desired by the client.
  • This information may be maintained in a client configuration file 245 of Fig. 2 associated with the client.
  • central server 110 may provide the electronic content to the client by maintaining the electronic content locally and delivering the electronic content to a user via a hyperlink on a web site provided or operated by a client server 130 of the client. This provides a simple method of content delivery since client server 130 does not need to hold or manage the electronic content. Central server 110 simply needs to provide the client with data related to a location for accessing the electronic content which may then be incorporated onto the client's web site.
  • While the above describes the normalization, the categorization and the distribution of electronic content related to one content category, central server 110 may perform the above operations to normalize and categorize any one of a plurality of electronic content in any one of a plurality of content categories and to provide them to a plurality of clients.
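The hashing step described for process 300 can be pictured with the following minimal Python sketch. It is an assumption-laden illustration, not the disclosed implementation: the "vector of match nodes" is modeled as a hash table from each word to its positions, and the "node chain that points at the next word" is modeled simply as position + 1 in the token list. The names build_index, find_phrase and min_distance are hypothetical.

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z0-9']+")


def build_index(text):
    """Return (words, index): the token list and a word -> positions hash table."""
    words = WORD.findall(text.lower())
    index = defaultdict(list)
    for position, word in enumerate(words):
        index[word].append(position)
    return words, index


def find_phrase(words, index, phrase):
    """Return the start positions at which a word or phrase filter element appears."""
    terms = WORD.findall(phrase.lower())
    if not terms:
        return []
    hits = []
    for start in index.get(terms[0], []):
        # Follow the chain: start + 1 is the next word in the content, and so on.
        if words[start:start + len(terms)] == terms:
            hits.append(start)
    return hits


def min_distance(words, index, element_a, element_b):
    """Smallest gap, in words, between two filter elements; None if either is absent."""
    a = find_phrase(words, index, element_a)
    b = find_phrase(words, index, element_b)
    if not a or not b:
        return None
    return min(abs(i - j) for i in a for j in b)
```

Looking up only the first word of a phrase in the hash table and then walking forward keeps phrase matching cheap, and the distance between match positions is one possible way to express the "relationship" between appearing filter elements.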
  • Fig. 4 is a flowchart illustrating a process 400 by which central server 110, in combination with content engine 115, normalizes electronic content received in a data stream from any one of a plurality of content providers via their content server 120.
  • central server 110 reads local configuration files defining unique features of a data stream to be parsed.
  • the configuration files are maintained at a location accessible by central server 110, for example, persistent object store 240.
  • the local configuration files define the layout of a header section of a discrete electronic content unit in the data stream, unique aspects of the payload of a discrete electronic content unit in the data stream, a trailer of a discrete electronic content unit in the data stream, and unique interpolations that are to take place in the body of a discrete electronic content unit in the data stream.
  • central server 110 isolates beginning and end points of atomic
  • central server 110 sub-parses a header section of a discrete electronic content unit to yield the traditional elements of electronic content categorization, e.g., CATCODE, SELCODE, and so forth.
  • central server 110 similarly sub-parses a trailer section of the atomic electronic content units to yield other traditional elements of electronic content categorization, e.g., CATCODE, SELCODE, and so forth.
  • central server 110 performs unique interpolations within the payload of a discrete electronic content unit as specified in the configuration files.
  • Unique interpolations are functions that are to be performed within the payload of a discrete electronic content unit. These functions may include stripping out contact information, expanding unique and common abbreviations, tagging paragraphs and tables and converting unique control characters outside the ASCII range to human readable text.
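The normalization of process 400 can be sketched as follows in Python. Everything concrete here is an assumption chosen for illustration: the layout of STREAM_CONFIG, the "--BODY--" and "--END--" delimiters, the "KEY: value" field format, the "Contact:" line convention, and the function names parse_fields and normalize are all hypothetical, since the specification leaves the configuration format open.

```python
import re

# Hypothetical per-stream configuration; every key and value is illustrative only.
STREAM_CONFIG = {
    "header_end": "\n--BODY--\n",      # marks the end of the header section
    "trailer_start": "\n--END--\n",    # marks the start of the trailer section
    "contact_pattern": r"(?im)^contact:.*$\n?",
    "abbreviations": {"Corp.": "Corporation", "Q3": "third quarter"},
    # Characters outside the ASCII range mapped to human readable text.
    "char_map": {"\u2019": "'", "\u201c": '"', "\u201d": '"', "\u00a0": " "},
}


def parse_fields(section):
    """Sub-parse 'KEY: value' lines (e.g. CATCODE, SELCODE) into a dict."""
    fields = {}
    for line in section.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def normalize(unit, config):
    """Split one content unit into header, payload and trailer, then clean the payload."""
    header, _, rest = unit.partition(config["header_end"])
    payload, _, trailer = rest.partition(config["trailer_start"])

    payload = re.sub(config["contact_pattern"], "", payload)  # strip contact info
    for short, full in config["abbreviations"].items():       # expand abbreviations
        payload = payload.replace(short, full)
    for char, replacement in config["char_map"].items():      # convert characters
        payload = payload.replace(char, replacement)

    # "Tag" paragraphs by returning them as separate list items.
    paragraphs = [p.strip() for p in payload.split("\n\n") if p.strip()]
    return {
        "header": parse_fields(header),
        "trailer": parse_fields(trailer),
        "paragraphs": paragraphs,
    }
```

Keeping the delimiters, patterns and replacement tables in a per-stream configuration, rather than in the code, mirrors the idea that each content provider's data stream has its own unique features.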
  • Fig. 5 is a flowchart illustrating a process 500 by which central server 110, in combination with content engine 115, determines whether electronic content is to be categorized in a content category based on an analysis of a content body as well as traditional index fields of the electronic content.
  • central server 110 examines the vector of match nodes of hashed electronic content against a filter, e.g., a Boolean filter, associated with a content category.
  • central server 110 determines a first score or value for the electronic content based on the filter matches and/or the relationship between the matched filter elements appearing in the content body as well as in the traditional index fields.
  • the first score may be based on the number of filter matches and/or the proximity of the matching filter elements in the content body as well as in the traditional index fields.
  • the first score is preferably a number between zero (0) and one (1).
  • central server 110 examines the vector of match nodes of hashed electronic content against citation entries of references associated with the content category. That is, central server 110 checks whether any of the citation entries, e.g., references, have been referred to, referenced in or cited in the content body as well as the traditional index fields of the electronic content. These citation entries are maintained in citation library 250.
  • central server 110 determines a second score for the electronic content based on any appearances of references to a citation entry associated with the content category.
  • different citation entry matches may have different weights associated therewith.
  • the second score is also preferably a number between zero (0) and one (1).
  • central server 110 determines a final score based on the first and second scores.
  • the final score may be the first score multiplied by the second score.
  • the final score is preferably a number between zero (0) and one (1).
  • these first and second scores may also be weighted differently in the determination of the final score. These weights may be preset according to the client or determined after an initial or preliminary examination through the hashed content of the electronic content based on the appearance or non-appearance of filter elements of the Boolean filter or the citation entries associated with the content category. For example, a greater weight may be given to the score in which more filter matches occurred in the initial examination.
  • central server 110 determines whether the final score is less than a threshold score for the content category.
  • This threshold score may be maintained in the client configuration files 245 of a client associated with the content category and is also preferably a number between zero (0) and one (1). If the final score is not less than the threshold score, then central server 110 assigns the electronic content to the content category.
  • the electronic content may thereafter be provided to the client associated with the category.
  • central server 110 may perform the above operations to categorize any one of a plurality of electronic content in any one of a plurality of content categories associated with a plurality of clients.
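The scoring of process 500 can be summarized in a short Python sketch. The product rule and the "not less than the threshold" test follow the description above; the weighted-average form used when weights are supplied is only one possible reading of "weighted differently" and is an assumption, as are the function names combine_scores and belongs_in_category.

```python
def combine_scores(first, second, weights=None):
    """Combine the filter score and the citation score into a final score in [0, 1].

    Without weights the final score is the product of the two scores.
    With weights (w1, w2) a weighted average is used; this form is an assumption.
    """
    for score in (first, second):
        if not 0.0 <= score <= 1.0:
            raise ValueError("scores are expected to lie between 0 and 1")
    if weights is None:
        return first * second
    w1, w2 = weights
    return (w1 * first + w2 * second) / (w1 + w2)


def belongs_in_category(first, second, threshold, weights=None):
    """Assign the content when the final score is not less than the threshold."""
    return not combine_scores(first, second, weights) < threshold
```

For instance, a first score of 0.8 and a second score of 0.5 give a final score of 0.4 under the product rule, so the content is assigned to a category whose threshold is 0.4 but not to one whose threshold is 0.5.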

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention concerns a system and a method for processing electronic content (100), in which the method consists of categorizing said electronic content (125) in a content category (125) according to whether filter elements (115), associated with the content category (125), appear in the body of the electronic content (125).
PCT/US2000/031016 1999-11-10 2000-11-09 Content engine WO2001035281A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU14842/01A AU1484201A (en) 1999-11-10 2000-11-09 Content engine

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US43800499A 1999-11-10 1999-11-10
US09/438,004 1999-11-10

Publications (1)

Publication Number Publication Date
WO2001035281A1 (fr) 2001-05-17

Family

ID=23738828

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2000/031016 WO2001035281A1 (fr) Content engine 1999-11-10 2000-11-09

Country Status (2)

Country Link
AU (1) AU1484201A (fr)
WO (1) WO2001035281A1 (fr)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5867799A (en) * 1996-04-04 1999-02-02 Lang; Andrew K. Information system and method for filtering a massive flow of information entities to meet user information classification needs
US5905862A (en) * 1996-09-04 1999-05-18 Intel Corporation Automatic web site registration with multiple search engines
US5913215A (en) * 1996-04-09 1999-06-15 Seymour I. Rubinstein Browse by prompted keyword phrases with an improved method for obtaining an initial document set
US5973696A (en) * 1996-08-08 1999-10-26 Agranat Systems, Inc. Embedded web server

Also Published As

Publication number Publication date
AU1484201A (en) 2001-06-06

Similar Documents

Publication Publication Date Title
US6012053A (en) Computer system with user-controlled relevance ranking of search results
US8099423B2 (en) Hierarchical metadata generator for retrieval systems
US7949660B2 (en) Method and apparatus for searching and resource discovery in a distributed enterprise system
US6182066B1 (en) Category processing of query topics and electronic document content topics
US6334132B1 (en) Method and apparatus for creating a customized summary of text by selection of sub-sections thereof ranked by comparison to target data items
US7562076B2 (en) Systems and methods for search query processing using trend analysis
US8014997B2 (en) Method of search content enhancement
JP3755134B2 (ja) Computer-based relevant text retrieval system and method
US7039625B2 (en) International information search and delivery system providing search results personalized to a particular natural language
US6236991B1 (en) Method and system for providing access for categorized information from online internet and intranet sources
US6826576B2 (en) Very-large-scale automatic categorizer for web content
JP4274689B2 (ja) Method and system for selecting data sets
US7092938B2 (en) Universal search management over one or more networks
US20050065774A1 (en) Method of self enhancement of search results through analysis of system logs
US20050108200A1 (en) Category based, extensible and interactive system for document retrieval
US20020065857A1 (en) System and method for analysis and clustering of documents for search engine
US20100228741A1 (en) Methods and systems for searching and associating information resources such as web pages
Crabtree et al. Improving web clustering by cluster selection
US20040015485A1 (en) Method and apparatus for improved internet searching
CN111737607A (zh) Data processing method, apparatus, electronic device and storage medium
US20040205051A1 (en) Dynamic comparison of search systems in a controlled environment
JP2002157270A (ja) Interest article distribution system and interest article distribution method
WO2001035281A1 (fr) Content engine
KR102351264B1 (ko) Method and system for providing user-customized new book information
US9773056B1 (en) Object location and processing

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase