CN102110123B - Method for establishing inverted index - Google Patents

Info

Publication number
CN102110123B
Authority
CN
China
Prior art keywords
word
inverted index
document
data
certain type
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN200910260705.6A
Other languages
Chinese (zh)
Other versions
CN102110123A (en)
Inventor
黄九鸣
周斌
贾焰
邹鹏
吴泉源
杨树强
韩伟红
李爱平
梁政
单大甫
蒋子海
崔凯
韩毅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN200910260705.6A
Publication of CN102110123A
Application granted
Publication of CN102110123B
Expired - Fee Related
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method for establishing an inverted index. The inverted index comprises an extraction result list, which comprises file numbers and the extraction result records corresponding to each file number; each extraction result record comprises a type information item, a content information item, and a position information item. The method comprises the following steps: performing word segmentation on a file represented in character string format; taking one word from the word segmentation result; judging whether the extracted word belongs to data of a certain type; if so, proceeding to the next step, and otherwise building a general inverted index list for the extracted word and finishing; filling the content of the extracted word, the position of the word in the file, and the detection method used to judge whether the word belongs to data of a certain type into the content, position, and type information items of the extraction result record, respectively; establishing the extraction result list; and then establishing the general inverted index list for the extracted word.

Description

Method for establishing an inverted index
Technical field
The present invention relates to the field of information retrieval, and in particular to a method for establishing an inverted index.
Background art
With the development of computers and the Internet, more and more human knowledge is stored in digital form. Quickly and accurately retrieving the knowledge people want from massive amounts of digital text has become an urgent need. In 1945, Vannevar Bush's paper "As We May Think" first proposed the idea of designing a machine that automatically searches large-scale stored data. It is regarded as the seminal work of modern information retrieval. From the 1950s onward, researchers began working step by step to realize this vision. By the mid-1950s, research on using computers to retrieve text data had produced some results. The most representative is Luhn's work at IBM (see Reference 1: H. P. Luhn, "A statistical approach to mechanized encoding and searching of literary information", IBM Journal of Research and Development, vol. 1(4), pp. 309-317, 1957). He proposed building an index over documents using words, and retrieving by the degree of match between the keywords of a query and the words in the documents. This method is the prototype of today's conventional inverted index technique.
An inverted index, also often called a reverse index, postings file, or inverted file, is a common indexing method used in full-text search to store a mapping from a word to its storage locations in a document or a set of documents. It is the most commonly used data structure in document retrieval systems. In one known prior-art implementation, an inverted index can be viewed as an array of linked lists: the head of each linked list contains a keyword, and the subsequent nodes contain the document numbers of all documents that contain this keyword, along with some other information, such as the frequency of the word in the document or the positions of the word in the document. At retrieval time, the keyword in each list head can be used directly to find the documents that contain the keyword, without running a keyword search over every document one by one, which improves retrieval efficiency. Most well-known search engine companies, such as Google, use inverted indexes to implement information retrieval.
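The data structure described above can be illustrated with a minimal Python sketch. It is not the patent's implementation: a dictionary of postings lists stands in for the array of linked lists, whitespace splitting stands in for real tokenization, and the name `build_inverted_index` is hypothetical.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each keyword to a postings list of (doc_id, frequency, positions)."""
    positions = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        # Whitespace splitting is an assumption made for illustration only.
        for pos, word in enumerate(text.lower().split()):
            positions[word][doc_id].append(pos)
    return {
        word: [(doc_id, len(p), p) for doc_id, p in sorted(by_doc.items())]
        for word, by_doc in positions.items()
    }

docs = {
    1: "inverted index maps words to documents",
    2: "an index speeds up full text search",
}
index = build_inverted_index(docs)
```

Looking up `index["index"]` returns the postings for both documents directly, so no document has to be scanned at query time.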
In the prior art, establishing an inverted index typically comprises the following steps:
Step 1) Document parsing. Documents in different storage formats are converted into a unified character string form. There are many document formats today, such as PDF, HTML, TXT, and DOC; the task of the document parsing step is to read the document file and convert it into a unified string format.
Step 2) Keyword extraction. This step mainly performs operations such as Chinese word segmentation, stop-word removal, case conversion, and tense reduction.
Step 3) Establishing and storing the inverted index. The keyword, the document number, and the positions at which the keyword occurs are added to the inverted index data structure described above, and the data structure is stored in a persistent device such as a database or a file.
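Steps 2 and 3 above can be sketched as follows, under simplifying assumptions: the stop-word list is illustrative, whitespace splitting replaces Chinese word segmentation, tense reduction is omitted, and JSON serialization stands in for storage in a database or file. The helper names are hypothetical.

```python
import json

STOP_WORDS = {"the", "a", "an", "of", "to", "is"}  # illustrative list only

def extract_keywords(text):
    """Step 2: tokenize, lowercase, drop stop words; keep token positions."""
    tokens = text.lower().split()  # stand-in for a real word segmenter
    return [(pos, tok) for pos, tok in enumerate(tokens) if tok not in STOP_WORDS]

def add_document(index, doc_id, text):
    """Step 3: record [doc_id, position] under each keyword."""
    for pos, word in extract_keywords(text):
        index.setdefault(word, []).append([doc_id, pos])

index = {}
add_document(index, 1, "The index of the archive")
add_document(index, 2, "An archive is a store")
# The serialized form could be written to a file or a database column.
serialized = json.dumps(index, sort_keys=True)
```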
The prior-art inverted index provides a fast way to find documents by word, but its matching is exact: only documents that contain the search term can be found, which is often insufficient. For example, in text search applications in business and government departments, there is often a requirement like this: given a person's name, find not only all documents containing the name but also information related to that person, such as telephone numbers and e-mail addresses. Obviously, entering the words "telephone number" into a search engine finds only documents that contain the words "telephone number"; it cannot find documents that contain only a numeric telephone number without the words "telephone number".
Although those skilled in the art have recognized this defect of existing inverted index techniques, the solutions proposed so far are usually very inefficient. A typical prior-art solution is as follows: after finding all documents that contain the name, an information extraction system parses the full text of the retrieved documents and extracts the required telephone numbers, e-mail addresses, and so on. The biggest problem with this approach is that every search must run information extraction over the retrieved documents again; when the number of documents is huge and searches are frequent, the time overhead is clearly unacceptable.
Summary of the invention
The object of the present invention is to overcome the defect that the prior art cannot directly look up data of a certain type through an inverted index, and thereby to provide a new method for creating an inverted index.
To achieve this object, the invention provides a method for establishing an inverted index. The inverted index comprises an extraction result table; the extraction result table comprises document numbers and the extraction result records corresponding to each document number; and each extraction result record comprises type, content, and position information items. The method comprises:
Step 1) performing word segmentation on a document represented in string format, and taking one word from the word segmentation result;
Step 2) judging whether the word taken out belongs to data of a certain type; if so, performing the next step; otherwise, performing step 4);
Step 3) filling the content of the extracted word, its position in the document, and the detection method used to judge whether the word belongs to data of a certain type into the content, position, and type information items of the extraction result record, respectively; creating the extraction result table; and then performing the next step;
Step 4) establishing the general inverted index table for the word taken out.
In the above technical solution, in step 2), a regular expression is used to detect whether the word taken out belongs to data of a certain type.
In the above technical solution, the data of a certain type comprise one of a mobile phone number, a fixed telephone number, an identity card number, and an e-mail address.
In the above technical solution, in step 2), a named entity recognition method is used to detect whether the word taken out belongs to data of a certain type; the named entity recognition method comprises one of a rule-based method, a statistics-based method, and a dictionary-based method.
In the above technical solution, the data of a certain type comprise one of a personal name, a company name, and an address.
The invention also provides a method for searching using the established inverted index, comprising:
Step 1) searching the general inverted index table with a keyword to obtain the document numbers of the documents containing the keyword;
Step 2) according to the document numbers, finding the extraction results of the relevant documents in the extraction result table and displaying them.
The advantage of the invention is as follows:
The inverted index created by the method of the invention supports looking up typed data, avoiding the time overhead that the prior art incurs when looking up typed data.
Brief description of the drawings
Fig. 1 is a flowchart of the method for establishing an inverted index according to the invention;
Fig. 2 is a schematic diagram of the extraction result table used in the invention;
Fig. 3 is a flowchart of the method for searching with an inverted index created by the invention.
Detailed description of the embodiments
The invention is described below with reference to the drawings and specific embodiments.
In the invention, in addition to extracting keywords from documents and building an inverted index for the keywords, relevant information can also be extracted from the documents as needed and stored. When searching, the user can then reach the extracted information directly by keyword, without parsing the original documents again, which improves search-time efficiency. The process of establishing an inverted index that includes communication information is described below, taking communication information as an example.
As in the prior art, establishing the inverted index begins with parsing the documents and converting the different storage formats into a unified character string form; for example, any of PDF, HTML, TXT, or DOC is converted into a unified string format. The conversion in this step is the same as in the prior art and is not described again here.
After the document has been converted into the unified string format, keywords are extracted from it. Unlike the notion of a keyword in the prior art, the notion of a keyword in the invention is broader. Besides the specific character data common in the prior art (such as several definite Chinese characters or letters), a keyword in the invention may also be data of a certain type, such as a fixed telephone number, mobile phone number, e-mail address, or identity card number. Data items that differ in content but share a type cannot be extracted with the prior-art text-matching method, so special technical means are needed.
In general, data of the same type share some common features. For example, mobile phone numbers all consist of digits and have the same number of digits, and e-mail addresses all contain the @ character. In this embodiment, therefore, some special characters can be designated for a preliminary screening, and the detailed extraction is then carried out with regular expressions that can describe such rules. Accordingly, with reference to Fig. 1, after obtaining a document in string format, the invention first performs word segmentation on the document and takes one word from the segmentation result, then judges whether the word contains a special character. If it does, the regular expression corresponding to that special character is used for matching, and a successful match is extracted. If the word contains no special character, or the regular expression does not match, the keyword is extracted according to the prior-art keyword extraction method.
The process is illustrated below with mobile phone numbers. Because the digit combinations of different users' mobile phone numbers differ, it is difficult to find all data of the mobile phone number type in a document with existing keyword extraction methods unless the specific numbers are known in advance. In this embodiment, a regular expression is used to extract data of the mobile phone number type. For example, the regular expression for mobile phone numbers in mainland China is: (15[13567890]\d{8}|13[13567890]\d{8}). During keyword extraction, after word segmentation, each word taken from the segmentation result is checked for digits; if it contains digits, the regular expression above is matched against the word, and a successful match is extracted.
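The two-stage check for mobile phone numbers can be sketched in Python as below. The regular expression is the one given in the text; requiring the whole word to match (`fullmatch`) and the function name `extract_mobile` are assumptions of this sketch, since the text only speaks of a "matching operation".

```python
import re

# Pattern from the description: mainland China mobile numbers.
MOBILE_RE = re.compile(r"(15[13567890]\d{8}|13[13567890]\d{8})")

def extract_mobile(word):
    """Return the word if it is a mobile number, otherwise None."""
    # Preliminary screening: a word without any digit cannot match.
    if not any(ch.isdigit() for ch in word):
        return None
    # Detailed check: the whole word must match (an assumption of this sketch).
    m = MOBILE_RE.fullmatch(word)
    return m.group(0) if m else None
```

A word such as `Beijing` is rejected by the digit check alone, so the regular expression only runs on the comparatively few words that contain digits.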
The keyword extraction operations were described above using the extraction of mobile phone numbers from a document as an example. In practice, the same method can extract multiple types of data, including fixed telephone numbers, identity card numbers, and e-mail addresses; only the way each type is recognized may change somewhat (for example, the special characters used may differ), and the regular expressions used also differ. Table 1 below gives the regular expressions corresponding to fixed telephone numbers, mobile phone numbers, e-mail addresses, and identity card numbers. Those skilled in the art will understand that other types of data can also be extracted as needed, each with its own regular expression.
No.  Type                    Regular expression
1    Mobile phone number     (15[13567890]\d{8}|13[13567890]\d{8})
2    Fixed telephone number  (\d{3}-\d{8}|\d{4}-\d{7})
3    Identity card number    (\d{15}|\d{18})
4    E-mail address          (\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*)
Table 1
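The four patterns of Table 1 can be combined into one type detector, sketched below. The choice of trigger characters (`@` for e-mail, `-` for fixed telephone numbers, all digits for the other two), the rule order, the type names, and the use of `fullmatch` are assumptions made for illustration; the text only says that the special characters differ per type.

```python
import re

# (type name, preliminary trigger, pattern from Table 1)
RULES = [
    ("email", lambda w: "@" in w,
     re.compile(r"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*")),
    ("fixed_phone", lambda w: "-" in w,
     re.compile(r"\d{3}-\d{8}|\d{4}-\d{7}")),
    ("mobile", str.isdigit,
     re.compile(r"15[13567890]\d{8}|13[13567890]\d{8}")),
    ("id_card", str.isdigit,
     re.compile(r"\d{15}|\d{18}")),
]

def detect_type(word):
    """Return the name of the first rule whose trigger and pattern both accept the word."""
    for name, trigger, pattern in RULES:
        if trigger(word) and pattern.fullmatch(word):
            return name
    return None  # not typed data: handle as an ordinary keyword
```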
After keyword extraction is complete, the inverted index is built. As mentioned above, some words in a document must be extracted with regular expressions; the extracted data usually belong to some data type, and the data items differ from one another in content. Therefore, for keywords formed from typed data, building the inverted index uses, in addition to the conventional inverted index data structure, a data structure called the extraction result table. As shown in Fig. 2, the extraction result table stores the extraction results of many documents. Each document may have multiple extraction results, so one document number corresponds to multiple extraction result records, and each record holds three pieces of information: "type", "content", and "position". "Type" indicates which regular expression identified the "content"; "content" stores the extracted result; and "position" stores where the "content" occurs in the document.
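One possible in-memory layout of the extraction result table is sketched below. The class name `ExtractionRecord`, the type labels, the sample values, and the use of a token offset as the position are all assumptions; Fig. 2 is not reproduced in the text, so the actual layout may differ.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class ExtractionRecord:
    type: str      # which detector (e.g. which regular expression) identified the content
    content: str   # the extracted text itself
    position: int  # where the content occurs in the document (token offset assumed)

# document number -> list of extraction result records
extraction_table = defaultdict(list)
extraction_table[7].append(ExtractionRecord("mobile", "13912345678", 4))
extraction_table[7].append(ExtractionRecord("email", "li@example.com", 9))
extraction_table[8].append(ExtractionRecord("fixed_phone", "010-12345678", 2))
```

Keying the table by document number is what lets a later search go from a document number straight to that document's records.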
Continuing with Fig. 1 and the extraction result table described above, the process of establishing the inverted index is as follows. As noted earlier, keywords include words extracted with regular expressions and words extracted by conventional methods. For a word extracted with a regular expression, determine which regular expression extracted it and thereby the type of the word; then, according to the document containing the word, fill the word's type, content, and position information in turn into that document's entry in the extraction result table; finally, as in the prior art, add the extracted word and the document number of its document to the conventional inverted index. For a word extracted by a conventional method, add the word and the document number of its document directly to the conventional inverted index as in the prior art.
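The build procedure can be sketched as a single pass over the words of a document. This is a simplified illustration rather than the patented implementation: only two of the Table 1 patterns are included, whitespace splitting stands in for word segmentation, and the names `DETECTORS` and `index_document` are hypothetical.

```python
import re
from collections import defaultdict

DETECTORS = {  # type name -> pattern (two rows of Table 1, for brevity)
    "mobile": re.compile(r"15[13567890]\d{8}|13[13567890]\d{8}"),
    "fixed_phone": re.compile(r"\d{3}-\d{8}|\d{4}-\d{7}"),
}

def index_document(doc_id, text, inverted, extracted):
    for pos, word in enumerate(text.split()):         # step 1: segmentation stand-in
        for type_name, pattern in DETECTORS.items():  # step 2: is the word typed data?
            if pattern.fullmatch(word):
                # step 3: store type, content, position under the document number
                extracted[doc_id].append((type_name, word, pos))
                break
        inverted[word].add(doc_id)                    # step 4: index every word

inverted = defaultdict(set)    # conventional inverted index: word -> document numbers
extracted = defaultdict(list)  # extraction result table: document number -> records
index_document(1, "Zhang Beijing 13912345678", inverted, extracted)
index_document(2, "Beijing office 010-12345678", inverted, extracted)
```

Note that a typed word is written to both structures, while an ordinary word goes only to the conventional index, matching the two branches described above.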
Once the inverted index has been established, it can be used for searching. As shown in Fig. 3, a conventional inverted index lookup with the keyword yields the document numbers of a set of relevant documents. At this point, a traditional search system generally looks up the original documents by these document numbers and, after obtaining the file names and document summaries, presents them sorted according to some rule. In the invention, when data of a certain type in the documents is also wanted, all extraction results of the relevant documents can be found quickly in the extraction result table from the document numbers obtained. These extraction results can be presented along two dimensions, type and document. For example, for the search keyword "Beijing", besides the titles and summaries of all documents containing the word "Beijing", the results can also list, in the details of each document, all mobile phone numbers, e-mail addresses, fixed telephone numbers, and identity card numbers that occur in that document. Alternatively, the mobile phone numbers, e-mail addresses, fixed telephone numbers, and identity card numbers that occur in all documents containing the word "Beijing" can be listed separately by type.
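The two-step search and the two presentation dimensions can be sketched as follows. The sample index contents, the type labels, and the function name `search` are hypothetical; ranking and document summaries are left out.

```python
from collections import defaultdict

# Hypothetical, already-built structures.
inverted = {"Beijing": {1, 2}, "Shanghai": {3}}
extracted = {
    1: [("mobile", "13912345678", 2)],
    2: [("fixed_phone", "010-12345678", 2), ("email", "a@example.com", 5)],
    3: [("mobile", "13512345678", 0)],
}

def search(keyword):
    # Step 1: conventional inverted index lookup gives document numbers.
    doc_ids = sorted(inverted.get(keyword, ()))
    # Step 2: fetch extraction results by document number, without reparsing.
    by_document = {d: extracted.get(d, []) for d in doc_ids}
    by_type = defaultdict(list)
    for d in doc_ids:
        for type_name, content, _pos in extracted.get(d, []):
            by_type[type_name].append(content)
    return by_document, dict(by_type)

by_document, by_type = search("Beijing")
```

`by_document` supports the per-document detail view, and `by_type` supports the listing grouped by type across all matching documents.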
As the above description of the search procedure shows, when documents are found by keyword and related information is quickly retrieved from them, the information that can ultimately be retrieved quickly is limited to the information saved in the extraction result table during index creation. In the embodiment above, for example, regular expressions extract fixed telephone numbers, mobile phone numbers, e-mail addresses, and identity card numbers and save them in the extraction result table; consequently, information about home addresses cannot be obtained at retrieval time directly by looking up the extraction result table by document number.
In the above embodiment, regular expressions are used to extract data of a certain type, but in other embodiments other prior-art methods can be used instead, for example named entity recognition to extract information such as personal names, company names, and addresses. Named entity recognition methods include rule-based methods, statistics-based methods, dictionary-based methods, and so on. In the invention, rule-based or dictionary-based methods are preferred. When information is extracted by named entity recognition, the type item in the extraction result table records which named entity recognition method identified the "content".
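As an illustration of the dictionary-based variant, the sketch below looks up token n-grams in a small gazetteer using greedy longest match. The gazetteer entries, the matching strategy, and the `dictionary:` prefix in the type item are assumptions of this sketch; the text does not specify any particular named entity recognition algorithm.

```python
GAZETTEER = {  # hypothetical dictionary: entry -> entity type
    "Zhang Wei": "person_name",
    "Acme Ltd": "company_name",
}

def dictionary_ner(tokens, max_len=3):
    """Greedy longest-match lookup of token n-grams in the gazetteer."""
    records, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in GAZETTEER:
                # The type item names the detection method that found the content.
                records.append(("dictionary:" + GAZETTEER[phrase], phrase, i))
                i += n
                break
        else:
            i += 1  # no entry starts at this token
    return records

records = dictionary_ner("Zhang Wei joined Acme Ltd in Beijing".split())
```

The resulting records have the same type, content, and position shape as the regular-expression records, so they could be stored in the same extraction result table.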
The inverted index created by the method of the invention supports looking up typed data, avoiding the time overhead that the prior art incurs when looking up typed data.
Finally, it should be noted that the above embodiments are intended only to illustrate, not to limit, the technical solution of the invention. Although the invention has been described in detail with reference to the embodiments, those of ordinary skill in the art should understand that modifications or equivalent substitutions of the technical solution that do not depart from its spirit and scope shall all be covered by the claims of the invention.

Claims (6)

1. A method for establishing an inverted index, wherein the inverted index comprises an extraction result table, the extraction result table comprises document numbers and the extraction result records corresponding to each document number, and each extraction result record comprises type, content, and position information items; the method comprising:
Step 1) performing word segmentation on a document represented in string format, and taking one word from the word segmentation result;
Step 2) judging whether the word taken out belongs to data of a certain type; if so, performing the next step; otherwise, performing step 4);
Step 3) filling the content of the extracted word, its position in the document, and the detection method used to judge whether the word belongs to data of a certain type into the content, position, and type information items of the extraction result record, respectively; creating the extraction result table; and then performing the next step;
Step 4) establishing the general inverted index table for the word taken out, using the word and the document number of the document containing the word.
2. The method for establishing an inverted index according to claim 1, characterized in that, in step 2), a regular expression is used to detect whether the word taken out belongs to data of a certain type.
3. The method for establishing an inverted index according to claim 2, characterized in that the data of a certain type comprise one of a mobile phone number, a fixed telephone number, an identity card number, and an e-mail address.
4. The method for establishing an inverted index according to claim 1, characterized in that, in step 2), a named entity recognition method is used to detect whether the word taken out belongs to data of a certain type; wherein the named entity recognition method comprises one of a rule-based method, a statistics-based method, and a dictionary-based method.
5. The method for establishing an inverted index according to claim 4, characterized in that the data of a certain type comprise one of a personal name, a company name, and an address.
6. A method for searching using the inverted index established according to any one of claims 1 to 5, comprising:
Step 1) searching the general inverted index table with a keyword to obtain the document numbers of the documents containing the keyword;
Step 2) according to the document numbers, finding the extraction results of the relevant documents in the extraction result table and displaying them.
CN200910260705.6A 2009-12-29 2009-12-29 Method for establishing inverted index Expired - Fee Related CN102110123B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200910260705.6A CN102110123B (en) 2009-12-29 2009-12-29 Method for establishing inverted index

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN200910260705.6A CN102110123B (en) 2009-12-29 2009-12-29 Method for establishing inverted index

Publications (2)

Publication Number Publication Date
CN102110123A CN102110123A (en) 2011-06-29
CN102110123B true CN102110123B (en) 2014-02-05

Family

ID=44174285

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200910260705.6A Expired - Fee Related CN102110123B (en) 2009-12-29 2009-12-29 Method for establishing inverted index

Country Status (1)

Country Link
CN (1) CN102110123B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103198079B (en) * 2012-01-06 2016-04-20 北大方正集团有限公司 The implementation method of relevant search and device
CN104504070B (en) * 2014-12-22 2019-06-04 北京奇虎科技有限公司 A kind of method and apparatus of search
CN104715068B (en) * 2015-03-31 2017-04-12 北京奇元科技有限公司 Method and device for generating document indexes and searching method and device
CN104750852B (en) * 2015-04-14 2018-03-09 海量云图(北京)数据技术有限公司 The discovery of Chinese address data and sorting technique
CN104731976B (en) * 2015-04-14 2018-03-30 海量云图(北京)数据技术有限公司 The discovery of private data and sorting technique in tables of data
CN104731978B (en) * 2015-04-14 2018-03-09 海量云图(北京)数据技术有限公司 The discovery of Chinese Name data and sorting technique
CN104731977B (en) * 2015-04-14 2018-01-05 海量云图(北京)数据技术有限公司 The discovery of telephone number data and sorting technique
CN110019638A (en) * 2017-07-17 2019-07-16 南京烽火软件科技有限公司 A kind of indexing means based on the separation of cold and hot word
CN108363701B (en) * 2018-04-13 2022-06-28 达而观信息科技(上海)有限公司 Named entity identification method and system
CN109992603B (en) * 2019-04-04 2020-10-09 北京金堤科技有限公司 Data searching method and device, electronic equipment and computer readable medium
CN111522905A (en) * 2020-04-15 2020-08-11 武汉灯塔之光科技有限公司 Document searching method and device based on database

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1503163A (en) * 2002-11-22 2004-06-09 �Ҵ���˾ International information search and deivery system providing search results personalized to a particular natural language
WO2007041120A1 (en) * 2005-09-29 2007-04-12 Microsoft Corporation Click distance determination
CN101192237A (en) * 2006-11-30 2008-06-04 国际商业机器公司 Method and system for inquiring multiple information

Also Published As

Publication number Publication date
CN102110123A (en) 2011-06-29

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20140205

Termination date: 20151229

EXPY Termination of patent right or utility model