CN1940920A - Phrase indexing method - Google Patents
- Publication number
- CN1940920A
- Authority
- CN
- China
- Prior art keywords
- phrase
- keyword
- index
- indexing
- document
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A phrase indexing method uses the complete phrase as an indexing unit and also uses the keywords that make up the phrase as indexing units. Keywords nearer the end of the phrase receive greater weight, and among indexing units that end with the same keyword, the longer unit receives the greater weight.
Description
Technical field
The present invention relates to information retrieval technology, and in particular to a method of building a search index for documents using phrases that contain multiple keywords or, more generally, keyword strings.
Background technology
At present, document retrieval systems based on computers or computer networks usually use keywords to index and retrieve documents. For each document to be indexed, the system extracts the keywords it contains and thereby obtains, for each keyword, the list of documents in which it occurs; this yields the index of the document collection. The index data structure most widely used by document retrieval systems is the inverted index: the keywords serve as entries, each entry records the list of documents that contain the keyword, and it may also record details such as the keyword's frequency, positions, and formatting within each document. In the field of information retrieval, "keyword" is a general name for a term used in document indexing and retrieval. A term may be an ordinary word or phrase, or a character string of another type (for example a character bigram or word bigram). Unless otherwise stated, "keyword" in the present invention follows this general meaning.
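The inverted index described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `build_inverted_index` and `frequency`, the document ids, and the sample keyword lists are hypothetical, and keyword extraction is assumed to have happened already.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each keyword to {doc_id: [positions]}.

    `docs` maps a document id to its already-extracted keyword list;
    keyword extraction itself is outside the scope of this sketch.
    """
    index = defaultdict(dict)
    for doc_id, keywords in docs.items():
        for position, keyword in enumerate(keywords):
            index[keyword].setdefault(doc_id, []).append(position)
    return index


def frequency(index, keyword, doc_id):
    """Number of times `keyword` occurs in `doc_id` (0 if absent)."""
    return len(index.get(keyword, {}).get(doc_id, []))


# Hypothetical sample collection.
docs = {
    "d1": ["phrase", "index", "phrase"],
    "d2": ["keyword", "index"],
}
index = build_inverted_index(docs)
```

Storing positions rather than a bare count keeps both the frequency and the position information that the text mentions; formatting information is omitted here.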
Once the index has been built, the retrieval system processes a query by looking up the query's keywords in the document index. A query is generally a single keyword or a combination of several keywords (for example a logical expression). If a keyword kw_i in the query occurs in the index, all documents containing kw_i can be obtained from the index, and suitable set operations (intersection, union, difference, etc.) then yield the set of candidate relevant documents. The system applies certain criteria (for example keyword frequency and position) to determine the relevance of each candidate document to the query, and returns the more relevant candidates to the user as search results.
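A minimal sketch of this lookup-then-rank procedure follows. The names `postings` and `search`, the sample index, and the ranking rule (total keyword frequency) are assumptions made for illustration; the text only says that "certain criteria" such as frequency and position are used.

```python
def postings(index, keyword):
    """Return the set of documents that contain `keyword`."""
    return set(index.get(keyword, {}))


def search(index, all_of=(), none_of=()):
    """AND the `all_of` keywords, then drop documents with any `none_of` keyword."""
    if not all_of:
        return []
    result = postings(index, all_of[0])
    for kw in all_of[1:]:
        result &= postings(index, kw)   # intersection
    for kw in none_of:
        result -= postings(index, kw)   # set difference

    # Stand-in relevance criterion: total frequency of the required keywords.
    def score(doc_id):
        return sum(index[kw][doc_id] for kw in all_of)

    # Highest score first; ties broken by document id for a stable order.
    return sorted(result, key=lambda d: (-score(d), d))


# Hypothetical index: keyword -> {doc_id: frequency}.
index = {
    "phrase": {"d1": 3, "d2": 1, "d3": 1},
    "index": {"d1": 1, "d2": 2},
    "keyword": {"d2": 1},
}
```

Union is left out to keep the sketch short; it would use `|=` in the same way that intersection uses `&=`.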
Indexing and retrieval by keyword work well for relatively short query strings. For long queries, however, both performance and effectiveness are hard to optimize. To improve efficiency and reduce ambiguity and errors, the main existing approach is to use keywords or phrases that are as long as possible, but this often causes relevant documents to be missed. For example, for the string "Beijing University School of Information Science and Technology Department of Computer Science", an existing method can treat the whole phrase, an organization name, as one indexing unit, but the phrase then cannot match the query string "Beijing University Department of Computer Science". Conversely, if the index terms are very fine-grained, for example if every basic word of the phrase is used as an indexing unit, the storage space and query processing time of the system increase greatly, and the phrase also matches the query "Beijing University" with high relevance. In fact "Beijing University" is only a modifier here, and its relevance weight should be lower than that of "Department of Computer Science" and similar terms.
Summary of the invention
The object of the present invention is to provide a method for indexing phrases, or other types of long word strings, that overcomes the above shortcomings of existing methods.
To achieve this object, the present invention adopts the following technical solution:
A phrase indexing method, characterized in that the whole phrase is used as an indexing unit, and the keywords that make up the phrase are also used as indexing units, with keywords nearer the end of the phrase receiving greater weight.
This technical solution reduces the system's storage space and query processing time while preserving retrieval effectiveness for short queries, and it avoids the adverse effect of short indexing units on relevance.
Embodiment
The technical solution is further described below with reference to an embodiment.
In this embodiment, the system handles various long phrases. For example, automatic organization-name recognition can identify "Beijing University School of Information Science and Technology Department of Computer Science" as an organization name; the name can also be entered manually into an organization-name dictionary. Other types of phrases (personal names, place names, group or organization names, product names, brand names, etc.) can be handled in the same way. The basic method is: when a document containing a long phrase is indexed, the phrase's head words are also used as index terms for that document, and the more important a head word is, the higher its weight.
In this embodiment, the phrase "Beijing University School of Information Science and Technology Department of Computer Science" yields the following indexing units, each listed with its weight:
Indexing unit | Weight |
---|---|
Beijing University School of Information Science and Technology Department of Computer Science | 1.0 |
School of Information Science and Technology Department of Computer Science | 0.5 |
Department of Computer Science | 0.2 |
Beijing University School of Information Science and Technology | 0.1 |
School of Information Science and Technology | 0.05 |
Beijing University | 0.01 |
Among indexing units that end with the same head word, the longer unit has the greater weight.
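The six units in the example are exactly the contiguous sub-phrases of a three-part organization name, ordered first by how late they end and then by length. The sketch below reproduces that ordering. It rests on assumptions: the function name `indexing_units` is hypothetical, the phrase is assumed to be already segmented into its three components (rendered here in English), and the weights are copied from the example rather than computed, since the source gives no formula for them.

```python
def indexing_units(components):
    """List contiguous sub-phrases, most important first.

    Units that end on a later component come first; among units with
    the same last component, longer units come first.
    """
    n = len(components)
    units = []
    for end in range(n, 0, -1):      # later-ending units first
        for start in range(end):     # longer units first within a group
            units.append(tuple(components[start:end]))
    return units


# Assumed segmentation of the example organization name.
components = [
    "Beijing University",
    "School of Information Science and Technology",
    "Department of Computer Science",
]

# Weights copied from the example, in the same order as the units.
example_weights = [1.0, 0.5, 0.2, 0.1, 0.05, 0.01]
weighted_units = dict(zip(indexing_units(components), example_weights))

for unit, weight in weighted_units.items():
    print(" ".join(unit), weight)
```

A real system would need some rule for assigning weights to phrases of other lengths; any such rule would be an extension beyond what the example states.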
This method rests on a property of Chinese: in a Chinese phrase, modifiers usually come first and the head word usually comes last. The keywords nearer the end of a phrase can therefore be treated as more relevant to the phrase as a whole, so that queries more closely related to the phrase are matched with higher relevance.
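The effect of weighting later keywords more heavily can be illustrated with a small scoring sketch. The scoring rule is an assumption: the source does not say how a query is matched against the weighted units, so `phrase_score` (a hypothetical name) simply sums the weights of the indexing units that appear as contiguous runs of the query. The weights are those of the embodiment's example, with the phrase components rendered in English.

```python
A = "Beijing University"
B = "School of Information Science and Technology"
C = "Department of Computer Science"

# Indexing units and weights from the embodiment's example.
UNIT_WEIGHTS = {
    (A, B, C): 1.0,
    (B, C): 0.5,
    (C,): 0.2,
    (A, B): 0.1,
    (B,): 0.05,
    (A,): 0.01,
}


def phrase_score(query_components, unit_weights):
    """Sum the weights of indexing units that occur as contiguous runs of the query.

    This matching rule is an illustrative assumption, not taken from the source.
    """
    n = len(query_components)
    runs = {
        tuple(query_components[i:j])
        for i in range(n)
        for j in range(i + 1, n + 1)
    }
    return sum(unit_weights.get(run, 0.0) for run in runs)


head_only = phrase_score([C], UNIT_WEIGHTS)        # query on the head word
modifier_only = phrase_score([A], UNIT_WEIGHTS)    # query on the modifier
short_query = phrase_score([A, C], UNIT_WEIGHTS)   # modifier plus head word
```

Under this rule a query on the head word alone scores well above a query on the modifier alone, and the shortened name that skips the middle component still matches, which is the behavior the method aims for.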
This embodiment can be applied directly to any document retrieval system that uses an inverted index. Those skilled in the art will also understand that the scope of application of the invention is not limited to systems of this kind.
Claims (2)
1. A phrase indexing method, characterized in that:
A. the whole phrase is used as an indexing unit;
B. the keywords that make up the phrase are also used as indexing units, with keywords nearer the end of the phrase having greater weight.
2. The phrase indexing method according to claim 1, characterized in that, among indexing units that end with the same head word, the longer indexing unit has the greater weight.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 200510105277 CN1940920A (en) | 2005-09-30 | 2005-09-30 | Phrase indexing method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 200510105277 CN1940920A (en) | 2005-09-30 | 2005-09-30 | Phrase indexing method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN1940920A (en) | 2007-04-04 |
Family
ID=37959110
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 200510105277 (Pending) CN1940920A (en) | 2005-09-30 | 2005-09-30 | Phrase indexing method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN1940920A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101930435B (en) * | 2009-10-27 | 2013-03-20 | 深圳市北科瑞声科技有限公司 | Method and system for retrieving organization names |
CN103034407A (en) * | 2012-12-07 | 2013-04-10 | 东莞宇龙通信科技有限公司 | Terminal and method for inputting useful expressions quickly |
CN103034407B (en) * | 2012-12-07 | 2016-08-03 | 东莞宇龙通信科技有限公司 | Terminal and method for rapidly inputting common phrases |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication |