CN1940920A - Phrase indexing method - Google Patents

Info

Publication number
CN1940920A
Authority
CN
China
Prior art keywords
phrase
keyword
index
indexing
document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN 200510105277
Other languages
Chinese (zh)
Inventor
孙斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN 200510105277
Publication of CN1940920A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A phrase indexing method uses the complete phrase as an index unit and also uses the keywords that make up the phrase as index units. Keywords nearer the end of the phrase receive greater weight, and among index units that end in the same keyword, the longer index unit receives greater weight.

Description

Phrase indexing method
Technical field
The present invention relates to information retrieval, and in particular to a method of building a document index from phrases that comprise several keywords or, more generally, from keyword strings.
Background art
At present, document retrieval systems based on computers or computer networks generally use keywords to index and retrieve documents. The system extracts the keywords from each indexed document and thereby obtains, for each keyword, the list of documents in which it occurs; this list forms the index of the indexed document collection. The index data structure most widely used by document retrieval systems is the inverted index: a set of keywords serves as the entries, each entry records the list of documents that contain the keyword, and the entry may also record details such as the keyword's frequency, positions, and formatting in each document. In information retrieval, "keyword" generally denotes a term used for document indexing and retrieval. A term can be an ordinary word or phrase, or another type of character string (for example, a character bigram or word bigram). Unless otherwise stated, "keyword" as used in this invention follows this general meaning.
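The inverted index described above can be sketched in a few lines of Python. This is a minimal illustration, not the structure of any particular system: the function name `build_inverted_index`, the sample documents, and the choice to store only positions (from which frequency follows) are all assumptions made for the example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each keyword to {doc_id: [positions]} for already tokenized documents."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for position, token in enumerate(tokens):
            # Record every position; the frequency is the length of the list.
            index[token].setdefault(doc_id, []).append(position)
    return index

# Hypothetical sample collection (tokens are assumed to be extracted already).
docs = {
    "d1": ["phrase", "index", "method"],
    "d2": ["keyword", "index", "index"],
}
index = build_inverted_index(docs)
print(index["index"])             # {'d1': [1], 'd2': [1, 2]}
print(len(index["index"]["d2"]))  # 2
```

Each entry lists the documents containing the keyword, and the position lists carry the per-document frequency and location details mentioned in the text.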
Once the index has been built, retrieval consists of looking up the keywords of a query in the document index. A query is generally a single keyword or a combination of keywords (for example, a logical expression). If a keyword kw_i of the query occurs in the index, the index yields all documents that contain kw_i; applying the appropriate set operations (intersection, union, difference, etc.) then gives the set of candidate relevant documents. The system uses certain criteria (for example, keyword frequency and position) to determine the relevance of each candidate document to the query, and returns the more relevant candidates to the user as search results.
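The retrieval process above can be sketched as follows, assuming a conjunctive (AND) query and a very simple relevance criterion. The postings data, the function name `search_all`, and the use of summed keyword frequency as the score are illustrative assumptions; the text only says that criteria such as frequency and position are used.

```python
# Hypothetical postings: keyword -> {doc_id: frequency of the keyword in that document}.
postings = {
    "phrase": {"d1": 2, "d3": 1},
    "index": {"d1": 1, "d2": 3, "d3": 1},
}

def search_all(postings, query_keywords):
    """AND query: intersect the document sets, then rank by summed frequency."""
    doc_sets = [set(postings.get(kw, {})) for kw in query_keywords]
    candidates = set.intersection(*doc_sets) if doc_sets else set()
    scored = [(sum(postings[kw][doc] for kw in query_keywords), doc)
              for doc in candidates]
    # Highest score first; ties are broken by document id for a stable order.
    return [doc for score, doc in sorted(scored, key=lambda s: (-s[0], s[1]))]

print(search_all(postings, ["phrase", "index"]))  # ['d1', 'd3']
print(search_all(postings, ["index"]))            # ['d2', 'd1', 'd3']
```

A union or difference query would replace `set.intersection` with the corresponding set operation.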
Indexing and retrieving with keywords is effective for relatively short query strings. For longer queries, however, neither performance nor retrieval quality is easy to optimize. To improve efficiency and reduce ambiguity and errors, the main existing approach is to use keywords or phrases that are as long as possible, but this often causes relevant documents to be missed. For example, an existing method can recognize the string "Beijing University College of Information Science and Technology Department of Computer Science" (components given in their original Chinese order) as an organization name and use the whole phrase as a single index unit, but the phrase then cannot be matched by the query string "Beijing University Department of Computer Science". Conversely, if the index terms are very fine-grained, for example if every basic word of the phrase is used as an index unit, the system's storage space and query processing time increase greatly, and the document also matches the query "Beijing University" with high relevance. In fact "Beijing University" is only a modifier here, and its relevance weight should be lower than that of "Department of Computer Science" and similar terms.
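The two failure modes can be seen in a toy comparison. The sketch below splits the example phrase into three components; the English component names are an assumed rendering of the original Chinese phrase, and the index layout (index unit mapped to per-document weight) is hypothetical.

```python
# Assumed English rendering of the three components, in phrase order.
components = [
    "Beijing University",
    "College of Information Science and Technology",
    "Department of Computer Science",
]

# Strategy 1: the whole phrase is the only index unit.
coarse_index = {" ".join(components): {"doc1": 1.0}}

# Strategy 2: every component is an index unit, all with the same weight.
fine_index = {component: {"doc1": 1.0} for component in components}

query = "Beijing University Department of Computer Science"

# The shorter query finds nothing under whole-phrase indexing.
print(coarse_index.get(query))  # None

# Under uniform fine-grained indexing the modifier scores as high as the head.
print(fine_index["Beijing University"] == fine_index["Department of Computer Science"])  # True
```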
Summary of the invention
The object of the invention is to provide a method for indexing phrases, or other kinds of long word strings, that overcomes the above shortcomings of existing methods.
To achieve this object, the invention adopts the following technical scheme:
A phrase indexing method, characterized in that the whole phrase is used as an index unit, and the keywords that make up the phrase are also used as index units, with keywords nearer the end of the phrase given greater weight.
This scheme preserves retrieval quality for short queries while reducing the system's storage space and query processing time, and it avoids the adverse effect that short index units have on relevance.
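One way to read this scheme is as a ranking over sub-phrases. The sketch below enumerates index units highest-ranked first; treating every contiguous run of components as an index unit is an assumption inferred from the worked example in the embodiment, and the function name `ranked_index_units` is hypothetical.

```python
def ranked_index_units(components):
    """Return every contiguous run of components, highest-ranked first.

    Ranking rule as read from the text: units that end later in the phrase
    rank higher; among units with the same final component, longer units
    rank higher.
    """
    units = []
    for end in range(len(components), 0, -1):  # later final component first
        for start in range(end):               # longer unit first
            units.append(tuple(components[start:end]))
    return units

for unit in ranked_index_units(["A", "B", "C"]):
    print(" ".join(unit))
# A B C
# B C
# C
# A B
# B
# A
```

The whole phrase always comes first and the leading modifier alone comes last, which is the ordering the scheme asks for; the numeric weights attached to each rank are a separate choice.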
Embodiment
The technical scheme is described further below with reference to an embodiment.
In this embodiment the system handles various long phrases. For example, automatic organization-name recognition can identify "Beijing University College of Information Science and Technology Department of Computer Science" as an organization name; the organization name can also be entered manually into an organization-name dictionary. Other types of phrases (personal names, place names, group or organization names, product names, brand names, etc.) can be handled in the same way. The basic method is as follows: when indexing a document that contains a long phrase, the phrase's head words are also used as index terms for the document, and the more important a head word is, the higher its weight.
According to this embodiment, the phrase "Beijing University College of Information Science and Technology Department of Computer Science" yields the following index units, each listed with its weight:
Beijing University College of Information Science and Technology Department of Computer Science    1.0
College of Information Science and Technology Department of Computer Science    0.5
Department of Computer Science    0.2
Beijing University College of Information Science and Technology    0.1
College of Information Science and Technology    0.05
Beijing University    0.01
Here, among index units that end in the same head word, the longer index unit has the greater weight.
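The weight table above can be checked mechanically against the two ordering rules. In the sketch below the weights are the ones listed in the embodiment; the English component names, the tuple representation, and the helper names `end_position` and `obeys_rules` are assumptions, and extending "later keywords weigh more" to whole index units by comparing their final components is one reading of the text.

```python
A = "Beijing University"
B = "College of Information Science and Technology"
C = "Department of Computer Science"
phrase = (A, B, C)

# Index units (as component tuples) with the weights listed in the embodiment.
table = {
    (A, B, C): 1.0,
    (B, C): 0.5,
    (C,): 0.2,
    (A, B): 0.1,
    (B,): 0.05,
    (A,): 0.01,
}

def end_position(unit, phrase):
    """Position in the phrase of the unit's final component."""
    return phrase.index(unit[-1])

def obeys_rules(table, phrase):
    """Check both ordering rules for every pair of index units."""
    for u in table:
        for v in table:
            eu, ev = end_position(u, phrase), end_position(v, phrase)
            # Rule 1: a unit that ends later outweighs one that ends earlier.
            if eu > ev and not table[u] > table[v]:
                return False
            # Rule 2: same final component, so the longer unit must weigh more.
            if eu == ev and len(u) > len(v) and not table[u] > table[v]:
                return False
    return True

print(obeys_rules(table, phrase))  # True
```

The text gives no formula for the numeric weights, so the check covers only their relative order.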
The method rests on a property of Chinese: in Chinese phrases, modifiers usually come first and the head word usually comes last. The keywords nearer the end of a phrase can therefore be treated as more relevant to the phrase as a whole, so that queries more closely related to the phrase are matched with higher relevance.
This embodiment can be applied directly to any document retrieval system that uses an inverted index. Those skilled in the art will also recognize that the scope of the invention is not limited to systems of this kind.
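As a rough sketch of how the weighted index units could sit in an inverted index, the example below stores one weight per index unit and document and ranks exact-match lookups by that weight. The function names `add_phrase` and `lookup`, the English phrase rendering, and the second document (whose whole phrase is just "Beijing University", weighted 1.0 as a whole phrase) are hypothetical.

```python
from collections import defaultdict

def add_phrase(index, doc_id, weighted_units):
    """Add a phrase's index units to the index, keeping the highest weight per document."""
    for unit, weight in weighted_units.items():
        previous = index[unit].get(doc_id, 0.0)
        index[unit][doc_id] = max(previous, weight)

def lookup(index, query):
    """Return (doc_id, weight) pairs for an exact index-unit match, best first."""
    hits = index.get(query, {})
    return sorted(hits.items(), key=lambda item: (-item[1], item[0]))

index = defaultdict(dict)

# doc1 contains the long organization name; weights follow the embodiment's table.
add_phrase(index, "doc1", {
    "Beijing University College of Information Science and Technology Department of Computer Science": 1.0,
    "College of Information Science and Technology Department of Computer Science": 0.5,
    "Department of Computer Science": 0.2,
    "Beijing University College of Information Science and Technology": 0.1,
    "College of Information Science and Technology": 0.05,
    "Beijing University": 0.01,
})
# doc2 (hypothetical) contains "Beijing University" as a complete phrase.
add_phrase(index, "doc2", {"Beijing University": 1.0})

print(lookup(index, "Beijing University"))              # [('doc2', 1.0), ('doc1', 0.01)]
print(lookup(index, "Department of Computer Science"))  # [('doc1', 0.2)]
```

The document in which "Beijing University" is only a modifier still matches the short query, but with a low weight, while the head word "Department of Computer Science" retrieves it with a higher one.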

Claims (2)

1. A phrase indexing method, characterized in that:
A. the whole phrase is used as an index unit;
B. the keywords that make up the phrase are also used as index units, with keywords nearer the end of the phrase given greater weight.
2. The phrase indexing method according to claim 1, characterized in that, among index units that end in the same head word, the longer index unit has the greater weight.
CN 200510105277 2005-09-30 2005-09-30 Phrase indexing method Pending CN1940920A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 200510105277 CN1940920A (en) 2005-09-30 2005-09-30 Phrase indexing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 200510105277 CN1940920A (en) 2005-09-30 2005-09-30 Phrase indexing method

Publications (1)

Publication Number Publication Date
CN1940920A (en) 2007-04-04

Family

ID=37959110

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 200510105277 Pending CN1940920A (en) 2005-09-30 2005-09-30 Phrase indexing method

Country Status (1)

Country Link
CN (1) CN1940920A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101930435B (en) * 2009-10-27 2013-03-20 深圳市北科瑞声科技有限公司 Method and system for retrieving organization names
CN103034407A (en) * 2012-12-07 2013-04-10 东莞宇龙通信科技有限公司 Terminal and method for inputting useful expressions quickly
CN103034407B (en) * 2012-12-07 2016-08-03 东莞宇龙通信科技有限公司 Terminal and the method rapidly inputting common phrases

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication