CN102110123B - Method for establishing inverted index - Google Patents
- Publication number
- CN102110123B (application CN200910260705.6A)
- Authority
- CN
- China
- Prior art keywords
- word
- inverted index
- document
- data
- certain type
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a method for establishing an inverted index. The inverted index comprises an extraction result list, wherein the extraction result list comprises a file number and an extraction result record corresponding to the file number; and the extraction result record comprises a type information item, content information item and a position information item. The method comprises the following steps: carrying out word segmentation operation on a file represented by a character string format; extracting one word from the word segmentation operation result; judging whether the extracted word belongs to data of a certain type; if yes, performing the next step, and otherwise finishing operation after a general inverted index list is built for the extracted word; respectively filling the content of the extracted word, the position of the word in the file and a detection method adopted for judging whether the word belongs to the data of a certain type into the content information item, position information item and type information item in the extraction result record; establishing the extraction result list; and then, establishing the general inverted index list for the extracted word.
Description
Technical field
The present invention relates to the field of information retrieval, and in particular to a method for establishing an inverted index.
Background art
With the development of computers and the Internet, more and more human knowledge is stored in digital form. Retrieving the knowledge people want, quickly and accurately, from massive amounts of digital text has become an urgent need. In 1945, Vannevar Bush's paper "As We May Think" first proposed the idea of designing machines that automatically search large-scale stored data. It is regarded as a seminal work of modern information retrieval. From the 1950s onward, researchers began working to realize this vision step by step. By the mid-1950s, research on using computers to retrieve text data had produced some results. The most representative is Luhn's work at IBM (see Reference 1: H. P. Luhn, "A statistical approach to mechanized encoding and searching of literary information", IBM Journal of Research and Development, vol. 1 (4), pp. 309–317, 1957). He proposed using words to build an index for documents and retrieving by the degree of match between the keywords used in the query and the words in the documents. This method is the prototype of today's conventional inverted index technique.
An inverted index, also often called a reverse index, postings file, or inverted file, is a commonly used indexing method. In full-text search it stores a mapping from a word to its storage locations in a document or a group of documents. It is the most commonly used data structure in document retrieval systems. In one known prior-art implementation, an inverted index can be viewed as an array of linked lists: the head of each list holds a keyword, and the subsequent nodes hold the document numbers of all documents containing that keyword, along with some other information. This information can be the frequency of the word in the document, or the positions of the word in the document. At retrieval time, the keyword in each list head can be used directly to find the documents containing those keywords, without running a keyword-based search over every document one by one, which improves retrieval efficiency. Most well-known search engine companies, such as Google, use inverted indexes to implement information retrieval.
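The structure described above can be sketched roughly as follows. This is a minimal illustration only: it assumes a Python dictionary in place of the linked-list array, and the names `build_index` and `lookup` are hypothetical, not taken from the patent.

```python
from collections import defaultdict

def build_index(docs):
    """Map each keyword to {document number: [positions of the keyword]}.

    `docs` maps a document number to the list of words in that document.
    The frequency of a word in a document is the length of its position list.
    """
    index = defaultdict(dict)
    for doc_id, words in docs.items():
        for pos, word in enumerate(words):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def lookup(index, keyword):
    """Return the sorted document numbers whose posting list holds keyword."""
    return sorted(index.get(keyword, {}))

# Toy data, for illustration only.
docs = {1: ["inverted", "index", "search"], 2: ["search", "engine", "search"]}
index = build_index(docs)
```

A lookup touches only the posting list of the queried keyword, which is why no document has to be scanned at query time.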
In the prior art, establishing an inverted index usually comprises the following steps:
Step 1) Document parsing. Different document storage formats are converted into a unified character string form. There are many document formats today, such as PDF, HTML, TXT, and DOC; the task of the parsing step is to read the document file and convert it to a unified string format.
Step 2) Keyword extraction. This step mainly performs operations including Chinese word segmentation, stop-word removal, case conversion, and tense reduction.
Step 3) Establishing and storing the inverted index. Each keyword, its document number, and its position of occurrence are added to the inverted index data structure described above, and that data structure is stored in a persistent device such as a database or a file.
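The keyword extraction of step 2) might look like the sketch below. It is a simplification under stated assumptions: splitting on non-word characters stands in for a real Chinese word segmenter, tense reduction is omitted, and the stop-word list and the name `extract_keywords` are illustrative, not from the patent.

```python
import re

# Illustrative stop-word list (an assumption, not from the patent).
STOP_WORDS = {"the", "a", "of", "and"}

def extract_keywords(text):
    """Split text into tokens, fold case, and drop stop words.

    Returns (position, keyword) pairs. Positions count every token,
    including the stop words that are dropped, so they still point
    at the original place of the keyword in the document.
    """
    tokens = [t for t in re.split(r"\W+", text) if t]
    return [(pos, tok.lower()) for pos, tok in enumerate(tokens)
            if tok.lower() not in STOP_WORDS]
```

The (position, keyword) pairs are what step 3) would then add, together with the document number, to the inverted index.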
The prior-art inverted index provides a fast way to find documents by word, but its matching is exact: only documents that contain the search term can be found, which is often insufficient. For example, text search applications in business and government departments often have a requirement of this kind: given a person's name, find not only all documents that contain the name but also related information such as that person's telephone number and e-mail address. Obviously, entering the word "telephone number" into a search engine finds only the documents that contain the word "telephone number"; it cannot find documents that contain a numeric telephone number but never the word "telephone number" itself.
Although those skilled in the art have recognized this defect of inverted index technology, the solutions proposed so far are usually very inefficient. A typical prior-art solution is to find all documents that contain the name and then have an information extraction system parse the full text of those documents to extract the required telephone numbers, e-mail addresses, and so on. The biggest problem with this approach is that every search must run information extraction again over the documents found. When the number of documents is huge and searches are frequent, the time overhead is clearly unacceptable.
Summary of the invention
The object of the present invention is to overcome the prior art's inability to look up data of a certain type directly through an inverted index, and thus to provide a new method for establishing an inverted index.
To achieve this object, the invention provides a method for establishing an inverted index. The inverted index comprises an extraction result table; the extraction result table comprises document numbers and the extraction result records corresponding to each document number; and each extraction result record comprises a type information item, a content information item, and a position information item. The method comprises:
Step 1) Performing word segmentation on a document represented in string format, and taking one word from the segmentation result;
Step 2) Judging whether the word taken belongs to data of a certain type; if it does, performing the next step; otherwise, performing step 4);
Step 3) Filling the content of the extracted word, its position in the containing document, and the detection method used to judge whether the word belongs to data of a certain type into the content, position, and type information items of the extraction result record, respectively; creating the extraction result table; then performing the next step;
Step 4) Establishing the general inverted index table for the word taken.
In the above technical solution, in step 2), a regular expression is used to detect whether the word taken belongs to data of a certain type.
In the above technical solution, the data of a certain type comprises one of a mobile phone number, a fixed-line telephone number, an identity card number, and an e-mail address.
In the above technical solution, in step 2), a named entity recognition method is used to detect whether the word taken belongs to data of a certain type; the named entity recognition method comprises one of a rule-based method, a statistics-based method, and a dictionary-based method.
In the above technical solution, the data of a certain type comprises one of a personal name, a company name, and an address.
The invention also provides a method for searching using the established inverted index, comprising:
Step 1) Searching the general inverted index table with a keyword to obtain the document numbers of the documents containing that keyword;
Step 2) According to those document numbers, finding the extraction results of the relevant documents in the extraction result table and displaying them.
The advantage of the invention is:
An inverted index created by the method of the invention supports looking up typed data, avoiding the time overhead that the prior art incurs when looking up such data.
Brief description of the drawings
Fig. 1 is a flowchart of the method for establishing an inverted index according to the invention;
Fig. 2 is a schematic diagram of the extraction result table used in the invention;
Fig. 3 is a flowchart of the method for searching using an inverted index created by the invention.
Embodiments
The invention is described below with reference to the drawings and specific embodiments.
In the present invention, in addition to extracting keywords from documents and building an inverted index for them, relevant information can also be extracted from the documents and stored as needed. When searching, the user can then reach the extracted information directly by keyword, without parsing the original documents again, which improves time efficiency at search time. Taking communication information as an example, the process of establishing an inverted index that includes communication information is described below.
As in the prior art, establishing the inverted index begins with parsing the documents: different storage formats are converted into a unified string form, for example from any of PDF, HTML, TXT, or DOC format. The conversion in this step is the same as in the prior art and is not described again here.
After the document has been converted to the unified string format, keywords are extracted from it. Unlike in the prior art, the concept of a keyword in the present invention is broader. Besides the specific character data common in the prior art (such as a definite sequence of Chinese characters or letters), a keyword in the invention can also be data of a certain type, such as a fixed-line telephone number, mobile phone number, e-mail address, or identity card number. Such data share a type but differ in content, so they cannot be extracted with prior-art text matching, and some special technical means are needed.
In general, data of the same type share some common features. For example, mobile phone numbers all consist of digits and have the same number of digits, and e-mail addresses all contain the @ character. In this embodiment, therefore, certain special characters can be set for a preliminary screening, and the detailed extraction is then carried out with regular expressions that describe the rule. Accordingly, with reference to Fig. 1, after obtaining a document in string format, the invention first performs word segmentation on the document and takes one word from the segmentation result. It then judges whether the word contains a special character. If it does, the regular expression corresponding to that special character is used for matching, and a successful match is extracted. If the word contains no special character, or the regular expression does not match, a keyword is extracted according to the prior-art keyword extraction method. The process is illustrated below with mobile phone numbers. Because the digit combinations of different users' mobile phone numbers differ, existing keyword extraction methods can hardly find all data of the mobile phone number type in a document unless the specific numbers are known in advance. This embodiment uses a regular expression to extract data of the mobile phone number type. For example, the regular expression for mobile phone numbers in mainland China is: (15[13567890]\d{8}|13[13567890]\d{8}). During keyword extraction, after word segmentation, each word taken from the segmentation result is checked for digits; if it contains digits, the above regular expression is matched against the word, and a successful match is extracted.
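The two-stage check for mobile phone numbers can be sketched as below. The pattern is the one given in the text; requiring the whole word to match (`fullmatch`) and the function name `extract_mobile` are assumptions made for this illustration.

```python
import re

# Pattern for mainland China mobile phone numbers, as given in the text.
MOBILE_RE = re.compile(r"(15[13567890]\d{8}|13[13567890]\d{8})")

def extract_mobile(word):
    """Return the word if it is a mobile number under the pattern, else None."""
    # Preliminary screening: skip the regular expression for words without digits.
    if not any(ch.isdigit() for ch in word):
        return None
    # Detailed check: the whole word must match (an assumption of this sketch).
    match = MOBILE_RE.fullmatch(word)
    return match.group(0) if match else None
```

The digit screening is cheap compared with the regular expression, so most ordinary words never reach the matching stage.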
The keyword extraction operations have been described above using the extraction of mobile phone numbers from a document as an example. In practice, the same method can extract several types of data, including fixed-line telephone numbers, identity card numbers, and e-mail addresses. When extracting these types, the way the type is recognized may change somewhat (for example, the special characters used can differ), and the regular expressions used differ as well. Table 1 below gives the regular expressions for fixed-line telephone numbers, mobile phone numbers, e-mail addresses, and identity card numbers. Those skilled in the art will understand that other types of data can also be extracted as needed, each with its own regular expression.
No. | Type | Regular expression |
---|---|---|
1 | Mobile phone number | (15[13567890]\d{8}|13[13567890]\d{8}) |
2 | Fixed-line telephone number | (\d{3}-\d{8}|\d{4}-\d{7}) |
3 | Identity card number | (\d{15}|\d{18}) |
4 | E-mail address | (\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*) |
Table 1
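The patterns of Table 1 could be collected into a lookup like the one below. This is only a sketch: the type labels, the name `classify`, and the use of `fullmatch` are assumptions, and the e-mail pattern is written with a single backslash before every `w`.

```python
import re

# Patterns from Table 1; the dictionary keys are illustrative labels.
PATTERNS = {
    "mobile": r"(15[13567890]\d{8}|13[13567890]\d{8})",
    "landline": r"(\d{3}-\d{8}|\d{4}-\d{7})",
    "id_card": r"(\d{15}|\d{18})",
    "email": r"(\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*)",
}
COMPILED = {name: re.compile(pattern) for name, pattern in PATTERNS.items()}

def classify(word):
    """Return the label of the first pattern matching the whole word, or None."""
    for name, regex in COMPILED.items():
        if regex.fullmatch(word):
            return name
    return None
```

Supporting a further data type then amounts to adding one more entry to `PATTERNS`.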
After keyword extraction is complete, the inverted index is established. As noted above, some words in a document must be extracted with regular expressions; the extracted data usually belong to a certain data type, but their contents differ from one another. Therefore, for keywords consisting of typed data, establishing the inverted index uses not only the conventional inverted index data structure but also a data structure called the extraction result table. As shown in Fig. 2, the extraction result table stores the extraction results of multiple documents. Each document can have several extraction results, so one document number corresponds to several extraction result records. Each record holds three items, "type", "content", and "position": "type" indicates which regular expression identified the "content", "content" stores the extracted result, and "position" stores where the "content" occurs in the document.
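One possible in-memory shape for the extraction result table is sketched below. The class name `ExtractionRecord`, the string labels, and the use of a word offset as the position are assumptions; the patent specifies only the three information items per record.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class ExtractionRecord:
    type: str      # which detector (e.g. which regular expression) matched
    content: str   # the extracted text itself
    position: int  # where the content occurs in the document

# Extraction result table: document number -> list of extraction records.
extraction_table = defaultdict(list)

# Toy records for document number 7, for illustration only.
extraction_table[7].append(ExtractionRecord("mobile", "13912345678", 42))
extraction_table[7].append(ExtractionRecord("email", "a@b.cn", 57))
```

Keying the table by document number is what lets a search go from a posting list straight to the stored extraction results.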
With continued reference to Fig. 1, the process of establishing the inverted index is described in combination with the extraction result table above. As mentioned, keywords comprise words extracted with regular expressions and words extracted by the conventional method. For a word extracted with a regular expression, it is determined which regular expression extracted it, which fixes the type of the word; the type, content, and position of the word are then filled, in turn, into the extraction result table entry for the document containing the word; finally, following the prior art, the word and the document number of its document are added to the conventional inverted index. For a word extracted by the conventional method, the word and the document number of its document are added directly to the conventional inverted index, following the prior art.
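The build procedure just described might be sketched as follows. Only the mobile phone pattern is registered to keep the example short; `index_document`, the tuple layout `(type, content, position)`, and the use of sets for posting lists are assumptions of this sketch.

```python
import re
from collections import defaultdict

# Detectors keyed by type label; only one pattern is registered here.
DETECTORS = {"mobile": re.compile(r"(15[13567890]\d{8}|13[13567890]\d{8})")}

def index_document(doc_id, words, inverted, extraction):
    """Add one segmented document to both tables."""
    for pos, word in enumerate(words):
        # Typed word: record (type, content, position) for this document.
        for type_name, regex in DETECTORS.items():
            if regex.fullmatch(word):
                extraction[doc_id].append((type_name, word, pos))
                break
        # Every word, typed or not, also goes into the general inverted index.
        inverted[word].add(doc_id)

inverted = defaultdict(set)      # keyword -> document numbers
extraction = defaultdict(list)   # document number -> extraction records
index_document(1, ["Zhang", "phone", "13912345678"], inverted, extraction)
index_document(2, ["Zhang", "Beijing"], inverted, extraction)
```

Because the extraction happens once at indexing time, its cost is not paid again on each search.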
Once the inverted index has been established, it can be used for searching. As shown in Fig. 3, a traditional inverted index lookup with a keyword yields the document numbers of a batch of relevant documents. To go further and obtain data of a certain type from those documents, a traditional search system would generally fetch the original documents by these document numbers, sort them by some rule, and present file names and document summaries. In the present invention, by contrast, all extraction results of the relevant documents can be found quickly in the extraction result table from the document numbers obtained. These extraction results can be displayed along two dimensions, by type and by document. For example, when searching for the keyword "Beijing", besides the titles and summaries of all documents containing the word "Beijing", the results can also list, in the details of each document, all mobile phone numbers, e-mail addresses, fixed-line telephone numbers, and identity card numbers that occur in that document. Alternatively, the mobile phone numbers, e-mail addresses, fixed-line telephone numbers, and identity card numbers that occur in all documents containing the word "Beijing" can be listed separately by type.
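The search and the two display dimensions can be sketched as below. The function name `search`, the tuple layout `(type, content, position)`, and the toy data are assumptions made for illustration.

```python
from collections import defaultdict

def search(keyword, inverted, extraction):
    """Look up a keyword, then gather stored extraction results.

    Returns the results grouped by document and grouped by type.
    """
    doc_ids = sorted(inverted.get(keyword, ()))
    by_doc = {doc_id: extraction.get(doc_id, []) for doc_id in doc_ids}
    by_type = defaultdict(list)
    for doc_id in doc_ids:
        for type_name, content, _position in extraction.get(doc_id, []):
            by_type[type_name].append(content)
    return by_doc, dict(by_type)

# Toy tables, for illustration only.
inverted = {"Beijing": {1, 3}}
extraction = {
    1: [("mobile", "13912345678", 5)],
    3: [("email", "li@example.com", 2), ("mobile", "15012345678", 9)],
}
by_doc, by_type = search("Beijing", inverted, extraction)
```

No original document is opened during the search; both views are assembled from the two tables alone.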
As the above description shows, when documents are found by keyword and related information is then retrieved quickly from them, the information that can be retrieved quickly is limited to what was stored in the extraction result table when the index was created. In the embodiment above, regular expressions extract fixed-line telephone numbers, mobile phone numbers, e-mail addresses, and identity card numbers and store them in the extraction result table; information about home addresses, for instance, therefore cannot be obtained at retrieval time by looking up the extraction result table directly by document number.
The embodiment above uses regular expressions to extract data of a certain type, but other embodiments can use other prior-art methods, for example named entity recognition to extract information such as personal names, company names, and addresses. Named entity recognition methods include rule-based methods, statistics-based methods, and dictionary-based methods. In the present invention, a rule-based or dictionary-based method is preferred. When information extraction is carried out by named entity recognition, the type item in the extraction result table records which named entity recognition method identified the "content".
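A toy illustration of dictionary-based and rule-based detection follows. Real named entity recognition is far more involved; the surname dictionary, the suffix rule, and the name `detect_entity` are all hypothetical and serve only to show what the type item might record.

```python
# Toy dictionary and rule (assumptions, not from the patent).
SURNAMES = {"Zhang", "Wang", "Li"}
COMPANY_SUFFIXES = ("Ltd", "Inc", "Corp")

def detect_entity(word):
    """Return (entity type, detection method) if recognized, else None.

    The detection method is what the type item of an extraction
    result record would store in this sketch.
    """
    if word in SURNAMES:                    # dictionary-based method
        return ("name", "dictionary")
    if word.endswith(COMPANY_SUFFIXES):     # rule-based method
        return ("company", "rule")
    return None
```

A detector of this kind would take the place of the regular expression check when the document is indexed.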
An inverted index created by the method of the invention supports looking up typed data, avoiding the time overhead that the prior art incurs when looking up such data.
Finally, it should be noted that the above embodiments only illustrate the technical solution of the invention and do not limit it. Although the invention has been described in detail with reference to the embodiments, those of ordinary skill in the art should understand that modifications or equivalent replacements of the technical solution that do not depart from its spirit and scope shall all fall within the scope of the claims of the invention.
Claims (6)
1. A method for establishing an inverted index, wherein the inverted index comprises an extraction result table, the extraction result table comprises document numbers and the extraction result records corresponding to each document number, and each extraction result record comprises a type information item, a content information item, and a position information item; the method comprises:
Step 1) Performing word segmentation on a document represented in string format, and taking one word from the segmentation result;
Step 2) Judging whether the word taken belongs to data of a certain type; if it does, performing the next step; otherwise, performing step 4);
Step 3) Filling the content of the extracted word, its position in the containing document, and the detection method used to judge whether the word belongs to data of a certain type into the content, position, and type information items of the extraction result record, respectively; creating the extraction result table; then performing the next step;
Step 4) Using the word taken and the document number of the document containing that word to establish the general inverted index table for the word taken.
2. The method for establishing an inverted index according to claim 1, characterized in that, in step 2), a regular expression is used to detect whether the word taken belongs to data of a certain type.
3. The method for establishing an inverted index according to claim 2, characterized in that the data of a certain type comprises one of a mobile phone number, a fixed-line telephone number, an identity card number, and an e-mail address.
4. The method for establishing an inverted index according to claim 1, characterized in that, in step 2), a named entity recognition method is used to detect whether the word taken belongs to data of a certain type; wherein the named entity recognition method comprises one of a rule-based method, a statistics-based method, and a dictionary-based method.
5. The method for establishing an inverted index according to claim 4, characterized in that the data of a certain type comprises one of a personal name, a company name, and an address.
6. A method for searching using an inverted index established according to any one of claims 1-5, comprising:
Step 1) Searching the general inverted index table with a keyword to obtain the document numbers of the documents containing that keyword;
Step 2) According to those document numbers, finding the extraction results of the relevant documents in the extraction result table and displaying them.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN200910260705.6A CN102110123B (en) | 2009-12-29 | 2009-12-29 | Method for establishing inverted index |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN200910260705.6A CN102110123B (en) | 2009-12-29 | 2009-12-29 | Method for establishing inverted index |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102110123A (en) | 2011-06-29 |
CN102110123B (en) | 2014-02-05 |
Family
ID=44174285
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN200910260705.6A (CN102110123B, Expired - Fee Related) | Method for establishing inverted index | 2009-12-29 | 2009-12-29 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102110123B (en) |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103198079B (en) * | 2012-01-06 | 2016-04-20 | 北大方正集团有限公司 | The implementation method of relevant search and device |
CN104504070B (en) * | 2014-12-22 | 2019-06-04 | 北京奇虎科技有限公司 | A kind of method and apparatus of search |
CN104715068B (en) * | 2015-03-31 | 2017-04-12 | 北京奇元科技有限公司 | Method and device for generating document indexes and searching method and device |
CN104750852B (en) * | 2015-04-14 | 2018-03-09 | 海量云图(北京)数据技术有限公司 | The discovery of Chinese address data and sorting technique |
CN104731976B (en) * | 2015-04-14 | 2018-03-30 | 海量云图(北京)数据技术有限公司 | The discovery of private data and sorting technique in tables of data |
CN104731978B (en) * | 2015-04-14 | 2018-03-09 | 海量云图(北京)数据技术有限公司 | The discovery of Chinese Name data and sorting technique |
CN104731977B (en) * | 2015-04-14 | 2018-01-05 | 海量云图(北京)数据技术有限公司 | The discovery of telephone number data and sorting technique |
CN110019638A (en) * | 2017-07-17 | 2019-07-16 | 南京烽火软件科技有限公司 | A kind of indexing means based on the separation of cold and hot word |
CN108363701B (en) * | 2018-04-13 | 2022-06-28 | 达而观信息科技(上海)有限公司 | Named entity identification method and system |
CN109992603B (en) * | 2019-04-04 | 2020-10-09 | 北京金堤科技有限公司 | Data searching method and device, electronic equipment and computer readable medium |
CN111522905A (en) * | 2020-04-15 | 2020-08-11 | 武汉灯塔之光科技有限公司 | Document searching method and device based on database |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1503163A (en) * | 2002-11-22 | 2004-06-09 | | International information search and delivery system providing search results personalized to a particular natural language |
WO2007041120A1 (en) * | 2005-09-29 | 2007-04-12 | Microsoft Corporation | Click distance determination |
CN101192237A (en) * | 2006-11-30 | 2008-06-04 | 国际商业机器公司 | Method and system for inquiring multiple information |
Also Published As
Publication number | Publication date |
---|---|
CN102110123A (en) | 2011-06-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN102110123B (en) | Method for establishing inverted index | |
CN109992645B (en) | Data management system and method based on text data | |
CN109446344B (en) | Intelligent analysis report automatic generation system based on big data | |
CN101539904B (en) | Automatic indexing method of quotations | |
CN110162522B (en) | Distributed data search system and method | |
CN102737021B (en) | Search engine and realization method thereof | |
CN101794307A (en) | Vehicle navigation POI (Point of Interest) search engine based on internetwork word segmentation idea | |
CN107844493B (en) | File association method and system | |
CN111400323A (en) | Data retrieval method, system, device and storage medium | |
CN113407785B (en) | Data processing method and system based on distributed storage system | |
CN105404677A (en) | Tree structure based retrieval method | |
CN105824956A (en) | Inverted index model based on link list structure and construction method of inverted index model | |
CN107291951B (en) | Data processing method, device, storage medium and processor | |
CN111291547B (en) | Template generation method, device, equipment and medium | |
CN103064847A (en) | Indexing equipment, indexing method, search device, search method and search system | |
CN111159984A (en) | Supplementary reading system with intelligence study note function | |
US10614102B2 (en) | Method and system for creating entity records using existing data sources | |
CN105426490A (en) | Tree structure based indexing method | |
Bartoli et al. | Semisupervised wrapper choice and generation for print-oriented documents | |
CN114218347A (en) | Method for quickly searching index of multiple file contents | |
CN112214494B (en) | Retrieval method and device | |
CN113742291A (en) | File saving method and device and computer storage medium | |
CN116578666B (en) | Segment sentence position inverted index structure design and limited operation full text retrieval method thereof | |
US8630984B1 (en) | System and method for data extraction from email files | |
CN111966816B (en) | Intelligent association method and system for official documents |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20140205 Termination date: 20151229 |
EXPY | Termination of patent right or utility model | ||