WO2007121171A2 - Systems and methods for ranking terms found in a data product - Google Patents

Systems and methods for ranking terms found in a data product

Info

Publication number
WO2007121171A2
WO2007121171A2 PCT/US2007/066314 US2007066314W WO2007121171A2 WO 2007121171 A2 WO2007121171 A2 WO 2007121171A2 US 2007066314 W US2007066314 W US 2007066314W WO 2007121171 A2 WO2007121171 A2 WO 2007121171A2
Authority
WO
WIPO (PCT)
Prior art keywords
weight value
terms
term
data product
list
Prior art date
Application number
PCT/US2007/066314
Other languages
English (en)
Other versions
WO2007121171A3 (fr)
Inventor
Robert M. Brinson, Jr.
Bryan Glenn Donaldson
Nicholas Levi Middleton
Robert Leon Ii Bass
Harry H. Blakeslee
Original Assignee
Intelliscience Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Intelliscience Corporation
Publication of WO2007121171A2
Publication of WO2007121171A3

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/332: Query formulation
    • G06F16/3325: Reformulation based on results of preceding query
    • G06F16/3326: Reformulation based on results of preceding query using relevance feedback from the user, e.g. relevance feedback on documents, documents sets, document terms or passages

Definitions

  • Classification solutions use classifications for words that a developer puts in place prior to searching. For example, "bass" would fit in the classifications of: type of fish, style of guitar, type of stringed instrument, an artist, brand of shoes, and brand of alcoholic beverage.
  • Classification does not support "concept" searching; it relies on the classification being appropriate and relevant to each and every searcher's word. It is improbable that any classification system will ever reach a saturation point of classifying all words for all searchers.
  • Tagging solutions are in essence another variation of the classification system. Rather than relying on the engineer, tagging lets web page developers or owners classify their pages with keywords and meta-tags. A sporting goods store and the manufacturers of certain ales, shoes and guitars might all place the word "bass" in their keywords or meta-tags. Tagging does not support "concept" searching. Tagging solutions rely on the appropriateness, integrity and domain knowledge of web page developers or owners. It has become rather common on the web for pages to have keywords and meta-tags that have nothing to do with the content or purpose of the site. In these cases, the tags have been placed solely to drive traffic to the site. Tagging solutions are one of the contributing factors to the high number of search sessions that fail to deliver the desired page or file.
  • the preferred embodiment provides methods and systems for determining the significance of a term in a plurality of data products.
  • the data products are stored on a single computer, at one or more locations over a computer-based network, or on the world wide web.
  • An example method determines the type of the data product.
  • the data product is assigned a weight value based on a list of predetermined variables and variables dynamically created through the search, processing and concept association processes.
  • a processor calculates a weight value for each term inside the data product.
  • the weight value equals the weight value assigned to the data product added to the weight value of the term calculated based on a list of predetermined variables.
  • the list of terms and calculated weight values are stored for each term.
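The weight combination described above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function names and the sample numbers are assumptions, and the source does not specify how the per-term weight itself is computed at this point.

```python
def combined_term_weight(product_weight: float, term_weight: float) -> float:
    """Stored weight = weight assigned to the data product + the term's own weight."""
    return product_weight + term_weight


def weigh_terms(product_weight: float, term_weights: dict) -> dict:
    """Apply the combination to every term found in one data product.

    `term_weights` maps each term to the weight calculated from the
    predetermined variables (hypothetical input for this sketch).
    """
    return {
        term: combined_term_weight(product_weight, weight)
        for term, weight in term_weights.items()
    }
```

For example, a data product with weight 2 containing "bass" (term weight 3) would store "bass" with a combined weight of 5.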
  • FIGURE 1 shows an example system for ranking terms found in a data product
  • FIGURE 2 shows an example method formed in accordance with an embodiment of the present invention
  • FIGURE 3 shows an example for assigning a weight value to a term
  • FIGURE 4 shows an example for determining a weight value of a data product type
  • FIGURE 5 shows an example method for determining a weight value of a term in a data product containing text
  • FIGURE 6 shows an example of including user specifications
  • FIGURE 7 shows one embodiment of scanning data products and storing weight values
  • FIGURE 8 shows an example table that stores terms, weight values, and the data product location
  • FIGURE 9 shows an example of how a list of weighted terms is used by a search query.
  • FIGURE 1 shows an example system 100 for ranking terms found in a data product.
  • the system 100 includes a computer 101 in communication with a plurality of other computers 103.
  • the computer 101 is connected with a plurality of computers 103, a server 104, a data storage center 106, and/or a network 108, such as an intranet or the Internet.
  • a bank of servers, a wireless device, a cellular phone and/or another data entry device can be used in the place of the computer 101.
  • a database stores terms and a plurality of weight values. The database is stored at the data storage center 106 or locally at the computer 101.
  • an application program run by the server 104 or computer 101 creates initial database tables.
  • the tables store terms found in each of a plurality of the data products, their respective weight values, as well as the relationships between each table, and data product locations.
  • a term includes a word, a phrase and/or a concept.
  • a term's weight value is defined as a number assigned to a word, such that in a computation the word's effect on the computation reflects its importance.
  • the application program monitors the data products for changes and updates the database tables when a change has occurred or a new data product has been made available.
  • calculating a weight value of terms found in a data product is executed on a single computer 101.
  • a search for a data product is executed on a computer 101 connected to a plurality of computers 103, a server 104, a data storage center 106, and/or a network 108, such as an intranet or the Internet. Search over the Internet allows a user to search and rank a plurality of Internet pages.
  • the data products could be of any format containing text, including but not limited to a flat text file, a word processing document, a spreadsheet, a database, a web page, a business rule, a federation of information silos.
  • FIGURE 2 shows a method 200 formed in accordance with an embodiment of the present invention.
  • a data store in the form of a database
  • the database is set up with tables that allow for the storage of terms, their respective weight values, as well as relationships between tables, and the location of the data product where the term originated.
  • the method 200, using the hardware described in FIGURE 1, gathers terms with their respective weight values from a data product, described in more detail below in FIGURE 3.
  • the data product is updated; described in more detail in FIGURE 7.
  • FIGURE 3 further describes the process described at block 220 of FIGURE 2.
  • the type of data product to be analyzed is determined by analyzing the properties of each data product.
  • a weight value is assigned to the document based on the file type and predefined user criteria, further described in FIGURE 6.
  • the method further determines a rank by considering characteristics of the data product as a whole, such as misspellings or grammatical errors contained therein, length and/or type of data product, and/or the uniqueness or organization of the text. This process is further defined in FIGURE 4.
  • a weight value for each term is calculated.
  • the method parses a data product in order to retrieve terms from each data product in accordance with a first embodiment. After a data product type has been identified, the method parses each term therein, and a parsed list of terms for each data product is stored. Each term starts with its weight value equal to the weight value of the data product in which it was found. The method of determining a weight value of each term is further described below at FIGURE 5. At block 340, the method stores the list of terms along with their respective weight values in the database.
  • FIGURE 4 further describes the method described at block 310 of FIGURE 3.
  • the method determines if the data product is a text file. If it is a text file, then the weight value of the terms is determined by a set of numerous criteria and methodologies in the form of an algorithm.
  • the criteria and methodologies used are adjustable to rank/weight (hereinafter "rank") higher, lower, require or exclude in order to refine and filter searches to find the desired information and/or exclude undesired information, documents or pages.
  • These algorithms use characteristics of terms comprised of cues, attributes, formatting, criteria, features and interactions of terms, concepts and objects as their basis for the algorithmic function. There are additional characteristics that may be used in alternate embodiments that are not included on this list. In some cases, this basis is the existence or lack of existence of the characteristic, the frequency of the characteristic, the interaction of the characteristic, etc.
  • any combination or none of the characteristics below can be dynamically set to rank higher, lower, require, exclude or to not be used in the ranking.
  • the presence of any of the following adds a weight value, e.g. one, to the term.
  • Caps All, Small: A variable ranking can be applied, such as Caps ranks higher unless a % or more of the document is Caps, then Caps is not used for ranking or ranks lower; if a specific language that does not have case or uses pictographs, then Caps ranking is not used;
  • a variable ranking can be applied, such as Underlined ranks higher unless a % or more of the document is Underlined then Underlined is not used for ranking or ranks lower;
  • Terms, concepts or objects are ranked based on Frequency in the File: A variable ranking can be applied, such as Frequency > n but < m rank higher, Frequency > m rank lower, or Frequency > n rank higher unless Frequency is % or more of the file, then Frequency rank lower or exclude;
  • a variable ranking can be applied, such as Successive Repetition 2, 3 or 4, rank higher; Successive Repetition > 4 rank lower or exclude;
  • Terms, concepts or objects are In a Specific Language;
  • Terms, concepts or objects are ranked based on Frequency in "similar queries";
  • Terms, concepts or objects are On or From a Specific Device Type of Origination or Current Location;
  • This ranking characteristic can be implemented to rank lower or exclude "all" files, sites or pages that do not have visible terms, concepts or objects that are listed in the Keywords or Meta Tags;
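The "variable ranking" rules for frequency and capitalization listed above could be expressed as small threshold functions. The sketch below is illustrative only: the function names, the default thresholds `n`, `m`, `max_share` and `max_caps_share`, and the choice of +1, -1 and 0 as adjustments are all assumptions, since the source leaves the actual values configurable.

```python
def frequency_adjustment(count, total_terms, n=2, m=10, max_share=0.2):
    """Return +1 (rank higher), -1 (rank lower) or 0 (no change) for a term.

    Assumed reading of the rule: n < count < m ranks higher, count > m ranks
    lower, and a term making up max_share or more of the file ranks lower.
    """
    if total_terms and count / total_terms >= max_share:
        return -1
    if count > m:
        return -1
    if n < count < m:
        return 1
    return 0


def caps_adjustment(term, caps_share, max_caps_share=0.5):
    """Caps ranks higher unless max_caps_share or more of the document is in caps."""
    if caps_share >= max_caps_share:
        return 0  # capitalization is not used for ranking in this document
    return 1 if term.isupper() else 0
```

Keeping each characteristic in its own function makes it straightforward to switch a characteristic off, or flip its sign, as the text describes.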
  • the data product is analyzed to determine if it is a database.
  • Weight values are assigned to terms in a database in a manner similar to that discussed above for text files.
  • the terms present within a particular database may also be afforded rank values based on their individual levels of significance, relative to other topics within the same or other databases.
  • the weight value of terms within a database may be affected by, but is not limited to, the presence of the term within the database rows and/or columns and the use of a particular term within certain database objects. In one exemplary embodiment, a term may be considered more significant if it appears in, e.g., a "trouble ticket" table as opposed to, e.g., a "location" table.
  • the presence of embedded documents with the database or use of the topic with the embedded document and the applicability and/or usefulness of a particular topic to differing users or departments of an organization affects the weight value.
  • the data product is analyzed to determine if it is a business rule.
  • a business rule contains documentation that describes how a business generally operates. It may contain user specifications for determining weight value of terms, formatting guidelines, company best practices, naming conventions, etc. These terms are given a high value as they may have a great effect on how a business operates and how it identifies significant terms.
  • the data product is analyzed to determine if it is a federation of information silos.
  • a federation of information silos allows for the aggregation of information across separate data products. This may offer the ability to rank topics based simply on their existence or nonexistence within the same or other related or unrelated stores, or the topic's existence or nonexistence within a particular store may positively or negatively affect its rank value. For example, a topic may be increased in rank if it is found in a user's desk reference information store and a topically related digital library information store.
  • the data product is analyzed to determine if the data product is a readable data product. If so, then it is assigned an initial weight value of zero, in one embodiment, and the terms are analyzed based on block 410. If it is not a readable data product, then the weight is returned as null and it is a data product that will not appear in the results.
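The type checks of FIGURE 4 amount to a dispatch on data product type with a fallback for readable and unreadable products. The following sketch assumes a hypothetical `TYPE_WEIGHTS` table; the numbers in it are invented for illustration. The source only states that a generic readable product starts at zero, that an unreadable product returns null, and that business rule terms are given a high value.

```python
from typing import Optional

# Hypothetical base weights per recognised type (not taken from the source).
TYPE_WEIGHTS = {
    "text": 1,
    "database": 1,
    "business_rule": 3,  # business rules are described as high value
    "federation": 1,
}


def data_product_weight(product_type: str, readable: bool) -> Optional[int]:
    """Return the initial weight for a data product, or None if it is unreadable."""
    if product_type in TYPE_WEIGHTS:
        return TYPE_WEIGHTS[product_type]
    if readable:
        return 0  # readable but unrecognised: initial weight of zero
    return None   # unreadable: excluded from results
```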
  • FIGURE 5 shows an exemplary embodiment of the method described at block 330 of FIGURE 3.
  • a user enters their specifications, as further described below in FIGURE 6.
  • a term is selected from the generated parsed list of terms.
  • if the term occurs more than once in the list, its weight value is incremented and the additional occurrence of the term is deleted from the list.
  • the term is tested to determine whether the word is a sentence construction word. If the term is a sentence construction word, then the term is removed and excluded from the parsed list (see block 525).
  • Sentence construction words are those used commonly in written text to build sentences, but have very little content information. They include words such as "and", "the", "this", "of". Because they are common, the algorithm for determining significance of a term might incorrectly assign a high significance to these words that carry very little meaning. A configurable list of sentence construction words is maintained and no term is added to the term storage or weighted for a data product that is found in this list. Any query terms which match a sentence construction word are ignored, and if all the terms in a query are sentence construction words, the query is rejected.
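The handling of sentence construction words, both when indexing and when querying, can be sketched as follows. The word list contains only the four examples named in the text; the function names and the use of `ValueError` to signal a rejected query are assumptions made for this illustration.

```python
# Configurable list; only the four examples given in the text are included here.
SENTENCE_CONSTRUCTION_WORDS = {"and", "the", "this", "of"}


def filter_terms(terms, stop_words=SENTENCE_CONSTRUCTION_WORDS):
    """Drop sentence construction words so they are never stored or weighted."""
    return [t for t in terms if t.lower() not in stop_words]


def prepare_query(query, stop_words=SENTENCE_CONSTRUCTION_WORDS):
    """Ignore sentence construction words in a query; reject it if nothing remains."""
    kept = filter_terms(query.split(), stop_words)
    if not kept:
        raise ValueError("query contains only sentence construction words")
    return kept
```

With this sketch, the query "the bass of this guitar" is reduced to the terms "bass" and "guitar", while the query "the of" is rejected.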
  • a term's weight value is incremented if the term is in all caps (see block 530).
  • a term's weight value is incremented if the term is in sentence case (see block 535).
  • Sentence case is defined as a term that is all lower case, or is just capitalized because the term follows a period, i.e. is the start of a new sentence.
  • a term's weight value is incremented if the term is in the name of the data product containing the term (see block 540).
  • a term's weight value is incremented if the term is in the file location of the data product (see block 545).
  • a term's weight value is incremented if the term has any special formatting (see block 550).
  • special formatting includes italics, underline, a larger font than most of the other text in the data product, quotation marks and/or strikethrough. Additional factors can be used to generate or adjust weights of terms, depending upon the data product format and application needs.
  • a term's weight value is incremented based on the term's proximity to a query term found in the data product (see FIGURE 6).
  • a term's weight value is increased or decreased if the term is found within specified sections of the data product.
  • One embodiment would adjust the term's weight based on a dictionary of terms suitable to the data product and application system. After a term has been analyzed, the final weight is then assigned to the term (block 560).
  • the parsed list is checked to determine if there are any additional terms to be analyzed. If so, the method returns to block 550 to enable the next term to be analyzed. If there are not any additional terms to be analyzed, then the weighted parsed list is returned to block 330 in FIGURE 3.
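The per-term loop of FIGURE 5 can be approximated for plain text as below. This is a sketch under several assumptions: tokens are whitespace-separated words, each rule adds exactly one to the weight, matching against the file name and location is a simple substring test, and special formatting, proximity and section rules are left out because plain tokens carry no such information. The function name and parameters are hypothetical.

```python
import os


def weigh_parsed_terms(tokens, product_weight, product_path, stop_words=frozenset()):
    """Return {term: weight} for one data product, following blocks 525 to 545."""
    name = os.path.basename(product_path).lower()
    folder = os.path.dirname(product_path).lower()
    weights = {}
    starts_sentence = True
    for token in tokens:
        word = token.strip(".,;:!?\"'()")
        # Remember whether this token opens a sentence, then update the flag.
        first_in_sentence, starts_sentence = starts_sentence, token.endswith(".")
        key = word.lower()
        if not key or key in stop_words:
            continue                      # sentence construction word (block 525)
        if key in weights:
            weights[key] += 1             # repeated occurrence: increment, do not re-add
            continue
        weight = product_weight           # each term starts at the data product weight
        if word.isupper():
            weight += 1                   # all caps (block 530)
        elif word.islower() or (first_in_sentence and word == word.capitalize()):
            weight += 1                   # sentence case (block 535)
        if key in name:
            weight += 1                   # term in the data product name (block 540)
        if key in folder:
            weight += 1                   # term in the file location (block 545)
        weights[key] = weight
    return weights
```

Deduplicating on the lower-cased term keeps the list short while still letting repeated occurrences raise the weight, which matches the "increment and delete the additional occurrence" step described above.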
  • terms are determined to be insignificant by ranking all of the terms in a data product and then finding the value where terms begin a sequence (of configurable length) with the same value. It can be assumed that a sequence of terms with the same value reflects terms that are not particularly descriptive of the contents of the data product. All terms with weight values above the weight value of the terms with the first repeated value will be flagged as significant terms, so long as they are not sentence construction words.
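The cutoff rule for insignificant terms can be sketched as a scan over the weights in descending order. The function name and the default `run_length` are assumptions, as is the fallback of treating every term as significant when no run of equal values is found; the source does not cover that case. Sentence construction words are assumed to have been removed already.

```python
def significant_terms(weights, run_length=3):
    """Return the terms ranked above the first run of `run_length` equal weights."""
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    values = [weight for _, weight in ranked]
    cutoff = None
    for i in range(len(values) - run_length + 1):
        # The list is sorted, so equal endpoints mean the whole window is equal.
        if values[i] == values[i + run_length - 1]:
            cutoff = values[i]
            break
    if cutoff is None:
        return [term for term, _ in ranked]  # assumed fallback: no run found
    return [term for term, weight in ranked if weight > cutoff]
```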
  • FIGURE 6 shows one embodiment of entering user specifications as shown at block 505 in FIGURE 5.
  • a user is given the capability to alter criteria used to determine weight value.
  • giving a user the capability to add, subtract or mitigate the effects of any, some or specific ranking criteria or methodologies may afford another opportunity to meld the user's ideas of exactly what should be considered significant with the machine-calculable significance.
  • a user may add additional weight to at block 640; a user may decide whether a criterion or methodology has a positive or negative effect on the ranking of the topic(s).
  • the user may apply a customizable filter(s) to automatically increase or decrease the ranks of topics applicable to a particular market, industry or genre.
  • one topic may have a different meaning or connotation to the government or military than it does in the healthcare field. If the user is searching for the topic within the military genre, the user may manually or the filter may automatically increase the rank of topics found on a .MIL or .GOV domain. At block 660, the user may also be given the capacity to manually alter the weight value of any topic within an information store. In this instance, the user may remove the topic from consideration, add a topic which does not qualify for consideration or modify the weight value of a topic in some other fashion.
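The user-driven adjustments of FIGURE 6 can be illustrated with a domain filter plus manual overrides. Everything named here is hypothetical: the function `apply_user_filters`, the `boost` amount, and the convention that an override of `None` removes a topic from consideration. Only the .MIL and .GOV example comes from the text.

```python
from urllib.parse import urlparse


def apply_user_filters(term_weights, location,
                       boosted_suffixes=(".mil", ".gov"), boost=1, overrides=None):
    """Return a copy of term_weights adjusted by a domain filter and manual overrides."""
    host = urlparse(location).hostname or ""
    adjusted = dict(term_weights)
    if host.endswith(tuple(boosted_suffixes)):
        # Filter: raise the rank of topics found on the selected domains.
        adjusted = {term: weight + boost for term, weight in adjusted.items()}
    for term, value in (overrides or {}).items():
        if value is None:
            adjusted.pop(term, None)   # user removes the topic from consideration
        else:
            adjusted[term] = value     # user adds a topic or sets its weight
    return adjusted
```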
  • FIGURE 7 shows one embodiment of scanning data products and storing weight values.
  • it is determined whether the content in the data product changes frequently. If it does then at block 720, determining a weight value may be performed as a result of a user query. If the data product is not frequently changing then at block 730, when a change is detected by an indexing system the method will determine the weight values of the terms at that time. At block 740 the results are stored.
  • the method and system rank topics extracted from a data product using a semantic search engine.
  • a search engine attempts to derive the syntactical, grammatical and/or semantic meanings found within a user's search query, for example, by using a combination of punctuation scrutiny, statistical, probabilistic and cognitive analyses, chronological analysis and text styling analysis to garner machine understanding of human language.
  • FIGURE 8 shows an example table that stores terms, weight values, and the data product location.
  • the term is stored.
  • the term's weighted value is stored.
  • the term's location is stored.
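A table like the one in FIGURE 8 could be kept in any relational store. The sketch below uses Python's built-in `sqlite3` module; the table name `term_weights`, the column names and the composite primary key are assumptions chosen for the example, not details from the source.

```python
import sqlite3


def create_term_store(conn):
    """Create a table holding the term, its weight value and the data product location."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS term_weights (
               term     TEXT NOT NULL,
               weight   REAL NOT NULL,
               location TEXT NOT NULL,
               PRIMARY KEY (term, location)
           )"""
    )


def store_terms(conn, location, weights):
    """Insert or refresh the weighted terms found in one data product."""
    conn.executemany(
        "INSERT OR REPLACE INTO term_weights (term, weight, location) VALUES (?, ?, ?)",
        [(term, weight, location) for term, weight in weights.items()],
    )
    conn.commit()
```

Using `INSERT OR REPLACE` keyed on term and location means that rescanning a changed data product simply overwrites its earlier weights.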
  • FIGURE 9 shows an example of how a list of weighted terms is used.
  • a search tool using a search string sends a search query.
  • the data store 920 is queried for related terms.
  • the weight values are received and indexed for display to a user.
  • the user is presented with indexed terms based on their rank.
  • a user is presented with a list of files containing the ranked terms.
  • the user is presented with the files with the terms chosen from the ranked terms.
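The query flow of FIGURE 9 can be sketched over an in-memory list of stored rows. The row layout `(term, weight, location)` and the function name `search` are assumptions for this example; a real data store would be queried instead of a Python list.

```python
def search(rows, query_terms):
    """Return (ranked matching rows, files containing them in rank order).

    `rows` is an iterable of (term, weight, location) tuples.
    """
    wanted = {term.lower() for term in query_terms}
    matches = [row for row in rows if row[0].lower() in wanted]
    matches.sort(key=lambda row: row[1], reverse=True)  # highest weight first
    files = []
    for _, _, location in matches:
        if location not in files:  # keep each file once, in rank order
            files.append(location)
    return matches, files
```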

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a method for determining the significance of a term in a plurality of data products. The data products are stored on a single computer, at one or more locations over a computer network, or on the World Wide Web. The method determines the type of the data product. The data product is assigned a weight value based on a list of predetermined variables. A processor calculates a weight value for each term in the data product. The weight value equals the sum of the weight value assigned to the data product and the weight value of the term calculated based on a list of predetermined variables. The list of terms and the calculated weight values are stored for each term.
PCT/US2007/066314 2006-04-10 2007-04-10 Systems and methods for ranking terms found in a data product WO2007121171A2 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US74457006P 2006-04-10 2006-04-10
US60/744,570 2006-04-10

Publications (2)

Publication Number Publication Date
WO2007121171A2 (fr) 2007-10-25
WO2007121171A3 (fr) 2008-09-12

Family

ID=38610329

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2007/066314 WO2007121171A2 (fr) 2006-04-10 2007-04-10 Systems and methods for ranking terms found in a data product

Country Status (1)

Country Link
WO (1) WO2007121171A2 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110807680A (zh) * 2018-08-06 2020-02-18 Alibaba Group Holding Limited Data object information processing method and apparatus, and electronic device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6012053A (en) * 1997-06-23 2000-01-04 Lycos, Inc. Computer system with user-controlled relevance ranking of search results
US6026388A (en) * 1995-08-16 2000-02-15 Textwise, Llc User interface and other enhancements for natural language information retrieval system and method
US7321892B2 (en) * 2005-08-11 2008-01-22 Amazon Technologies, Inc. Identifying alternative spellings of search strings by analyzing self-corrective searching behaviors of users

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
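CN110807680A (zh) * 2018-08-06 2020-02-18 Alibaba Group Holding Limited Data object information processing method and apparatus, and electronic device
CN110807680B (zh) * 2018-08-06 2023-05-02 Alibaba Group Holding Limited Data object information processing method and apparatus, and electronic device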

Also Published As

Publication number Publication date
WO2007121171A3 (fr) 2008-09-12

Similar Documents

Publication Publication Date Title
US20070175674A1 (en) Systems and methods for ranking terms found in a data product
US9864808B2 (en) Knowledge-based entity detection and disambiguation
US20170235841A1 (en) Enterprise search method and system
US7668825B2 (en) Search system and method
Yi et al. Linking folksonomy to Library of Congress subject headings: an exploratory study
US7783644B1 (en) Query-independent entity importance in books
Kipp Complementary or Discrete Contexts in Online Indexing: A Comparison of User, Creator, and Intermediary Keywords.
KR101098703B1 (ko) System and method for identifying related queries for languages with multiple writing systems
US20110106807A1 (en) Systems and methods for information integration through context-based entity disambiguation
US20110161309A1 (en) Method Of Sorting The Result Set Of A Search Engine
US20070250501A1 (en) Search result delivery engine
US7765209B1 (en) Indexing and retrieval of blogs
US20090254540A1 (en) Method and apparatus for automated tag generation for digital content
US20080033982A1 (en) System and method for determining concepts in a content item using context
US20060161543A1 (en) Systems and methods for providing search results based on linguistic analysis
NZ542223A (en) Method and system for enhanced data searching by parsing data into syntactic units
Armentano et al. NLP-based faceted search: Experience in the development of a science and technology search engine
JP5718405B2 (ja) Utterance selection device, method, and program, and dialogue device and method
Shatnawi et al. Verification hadith correctness in islamic web pages using information retrieval techniques
Kaptein et al. How different are language models and word clouds?
US20070239735A1 (en) Systems and methods for predicting if a query is a name
US20060184523A1 (en) Search methods and associated systems
JP5251099B2 (ja) Term co-occurrence degree extraction device, term co-occurrence degree extraction method, and term co-occurrence degree extraction program
WO2007121171A2 (fr) Systems and methods for ranking terms found in a data product
US20080033953A1 (en) Method to search transactional web pages

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 07760386

Country of ref document: EP

Kind code of ref document: A2

NENP Non-entry into the national phase in:

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 07760386

Country of ref document: EP

Kind code of ref document: A2