CN102110123B - Method for establishing inverted index

Info

Publication number: CN102110123B
Application number: CN200910260705.6A
Authority: CN
Inventors: 黄九鸣; 周斌; 贾焰; 邹鹏; 吴泉源; 杨树强; 韩伟红; 李爱平; 梁政; 单大甫; 蒋子海; 崔凯; 韩毅
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2009-12-29
Filing date: 2009-12-29
Publication date: 2014-02-05
Anticipated expiration: 2029-12-29
Also published as: CN102110123A

Abstract

The invention provides a method for establishing an inverted index. The inverted index comprises an extraction result table; the extraction result table comprises document numbers and the extraction result records corresponding to each document number; and each extraction result record comprises a type information item, a content information item, and a position information item. The method comprises the following steps: performing word segmentation on a document represented in string format; taking one word from the segmentation result; judging whether the word belongs to data of a certain type; if so, performing the next step, and otherwise building the general inverted index table for the word and finishing; filling the content of the word, its position in the document, and the detection method used to judge whether the word belongs to data of a certain type into the content, position, and type information items of the extraction result record, respectively; creating the extraction result table; and then building the general inverted index table for the word.

Description

Method for establishing an inverted index

Technical field

The present invention relates to the field of information retrieval, and in particular to a method for establishing an inverted index.

Background art

With the development of computers and the Internet, more and more human knowledge is stored in digital form. Quickly and accurately retrieving the knowledge people want from massive volumes of digital text has become an urgent need. In 1945, Vannevar Bush's paper "As We May Think" first proposed the idea of designing a machine that automatically searches large-scale stored data. It is regarded as the seminal work of modern information retrieval. From the 1950s onward, researchers worked to realize this vision step by step. By the mid-1950s, research on using computers to retrieve text data had produced some results. The most representative is Luhn's work at IBM (see reference 1: H. P. Luhn, "A statistical approach to mechanized encoding and searching of literary information", IBM Journal of Research and Development, vol. 1(4), pp. 309–317, 1957). He proposed building an index over documents using words, and retrieving by the degree of match between the query keywords and the words in the documents. This method is the prototype of today's conventional inverted index technique.

An inverted index, also called a reverse index, postings file, or inverted file, is a common indexing method. In full-text search it stores a mapping from a word to its storage locations in a document or a group of documents. It is the most commonly used data structure in document retrieval systems. In one known prior-art implementation, an inverted index can be viewed as an array of linked lists: the head of each list holds a keyword, and the subsequent nodes hold the document numbers of all documents containing that keyword, together with other information. This information may be the frequency of the word in the document, or the positions of the word in the document. At retrieval time, the keywords in the list heads can be used directly to find the documents that contain them, without running a keyword search over every document one by one, which improves retrieval efficiency. Most well-known search engine companies, such as Google, use inverted indexes for information retrieval.
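The structure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a dictionary of dictionaries stands in for the array of linked lists, whitespace splitting stands in for real word segmentation, and the name build_index is hypothetical.

```python
from collections import defaultdict


def build_index(docs):
    """Map each keyword to {document number: [positions of the keyword]}."""
    index = defaultdict(dict)
    for doc_no, text in docs.items():
        # Assumption: whitespace splitting replaces real word segmentation.
        for position, word in enumerate(text.split()):
            index[word].setdefault(doc_no, []).append(position)
    return index


docs = {
    1: "inverted index maps words to documents",
    2: "an index speeds up retrieval",
}
index = build_index(docs)
postings = index.get("index", {})   # documents and positions for "index"
frequency_in_doc_1 = len(postings.get(1, []))
```

Storing the position list per document covers both kinds of extra information mentioned above, since the frequency of a word in a document is the length of its position list.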

In the prior art, establishing an inverted index usually comprises the following steps:

Step 1), document parsing. Different document storage formats are converted into a unified character string format. There are many document formats today, such as PDF, HTML, TXT, and DOC; the task of the document parsing step is to read the document file and convert it into a unified string format.

Step 2), keyword extraction. This step mainly performs operations such as Chinese word segmentation, stop-word removal, case conversion, and tense reduction.

Step 3), building and storing the inverted index. The keyword, the document number, and the positions at which the keyword appears are added to the inverted index data structure described above, and the data structure is stored on a persistent medium such as a database or a file.
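Steps 2) and 3) above can be sketched as follows. The sketch assumes the documents have already been parsed into strings (step 1). The stop-word list, the whitespace tokenizer, and the JSON file used for persistence are illustrative assumptions, and Chinese word segmentation and tense reduction are omitted.

```python
import json
import os
import tempfile

STOP_WORDS = {"the", "a", "an", "of", "to"}   # illustrative stop-word list


def extract_keywords(text):
    """Step 2: split on whitespace, lowercase, drop stop words, keep positions."""
    keywords = []
    for position, token in enumerate(text.split()):
        word = token.lower()
        if word not in STOP_WORDS:
            keywords.append((word, position))
    return keywords


def build_index(docs):
    """Step 3 (build): keyword -> {document number: [positions]}."""
    index = {}
    for doc_no, text in docs.items():
        for word, position in extract_keywords(text):
            index.setdefault(word, {}).setdefault(doc_no, []).append(position)
    return index


def store_index(index, path):
    """Step 3 (store): persist the index, here as a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(index, f)


# Document numbers are strings here because JSON object keys are strings.
docs = {"1": "The Index of a Document", "2": "An index to the archive"}
index = build_index(docs)
path = os.path.join(tempfile.mkdtemp(), "index.json")
store_index(index, path)
with open(path, encoding="utf-8") as f:
    loaded = json.load(f)
```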

The prior-art inverted index provides a fast way to find documents by word, but its matching is exact: only documents that contain the search term can be found, which is often not enough. For example, text search applications in business and government departments frequently have needs of this kind: given a person's name, the user wants not only all documents containing the name, but also information associated with that person, such as telephone numbers and e-mail addresses. Clearly, entering the term "telephone number" in a search engine finds only documents that contain the words "telephone number"; it cannot find documents that contain a telephone number written in digits but never contain the words "telephone number".

Although those skilled in the art have recognized this defect of existing inverted index techniques, the solutions proposed so far are usually very inefficient. A typical prior-art solution is to find all documents containing the name and then have an information extraction system parse the full text of each retrieved document to extract the required telephone numbers, e-mail addresses, and so on. The biggest problem with this approach is that every search must run information extraction again over the retrieved documents; when the number of documents is large and searches are frequent, the time overhead is clearly unacceptable.

Summary of the invention

The object of the present invention is to overcome the defect that the prior art cannot directly look up data of a given type through an inverted index, and thereby to provide a new method for establishing an inverted index.

To achieve this object, the invention provides a method for establishing an inverted index. The inverted index comprises an extraction result table; the extraction result table comprises document numbers and the extraction result records corresponding to each document number; and each extraction result record comprises a type information item, a content information item, and a position information item. The method comprises:

Step 1), performing word segmentation on a document represented in string format, and taking one word from the segmentation result;

Step 2), judging whether the word taken out belongs to data of a certain type; if so, performing the next step; otherwise, performing step 4);

Step 3), filling the content of the extracted word, its position in the document, and the detection method used to judge whether the word belongs to data of a certain type into the content, position, and type information items of an extraction result record, respectively, creating the extraction result table, and then performing the next step;

Step 4), establishing the general inverted index table for the word taken out.
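The control flow of steps 1) to 4) can be sketched as below. This is a minimal illustration, not the patented implementation: the function name index_document, the dictionary-based tables, and the pluggable detect callable are all assumptions, and the toy detector exists only to make the example run.

```python
def index_document(doc_no, words, detect, inverted, extraction_table):
    """Apply steps 1) to 4) to one segmented document.

    words  -- result of word segmentation (segmentation itself is not shown)
    detect -- hypothetical callable: returns a label for the detection method
              that recognised the word, or None for an ordinary word
    """
    for position, word in enumerate(words):           # step 1: take one word
        method = detect(word)                          # step 2: typed data?
        if method is not None:                         # step 3: fill one record
            extraction_table.setdefault(doc_no, []).append(
                {"type": method, "content": word, "position": position})
        inverted.setdefault(word, set()).add(doc_no)   # step 4: general index


def toy_detect(word):
    # Stand-in detector for the demo only: flags all-digit words.
    return "all-digits rule" if word.isdigit() else None


inverted, extraction_table = {}, {}
index_document(7, ["call", "13912345678", "today"], toy_detect,
               inverted, extraction_table)
```

Every word reaches step 4) whether or not it was recognised as typed data, which matches the "otherwise, performing step 4)" branch of step 2).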

In the above technical solution, in step 2), a regular expression is used to detect whether the word taken out belongs to data of a certain type.

In the above technical solution, the data of a certain type comprises one of a mobile phone number, a fixed-line telephone number, an identity card number, and an e-mail address.

In the above technical solution, in step 2), a named entity recognition method is used to detect whether the word taken out belongs to data of a certain type; wherein the named entity recognition method comprises one of a rule-based method, a statistics-based method, and a dictionary-based method.

In the above technical solution, the data of a certain type comprises one of a person name, a company name, and an address.

The present invention also provides a method for searching using the inverted index so established, comprising:

Step 1), searching the general inverted index table with a keyword to obtain the document numbers of the documents containing the keyword;

Step 2), according to the document numbers, finding the extraction results of the relevant documents in the extraction result table and displaying them.
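The two search steps amount to two dictionary lookups, as the following sketch suggests. The table layouts, the sample data, and the function name search are assumptions made for illustration.

```python
def search(keyword, inverted, extraction_table):
    """Return {document number: extraction records} for one keyword."""
    # Step 1: look the keyword up in the general inverted index table.
    doc_nos = sorted(inverted.get(keyword, ()))
    # Step 2: fetch the stored extraction results for each document number.
    return {doc_no: extraction_table.get(doc_no, []) for doc_no in doc_nos}


inverted = {"Beijing": {3, 8}, "Shanghai": {8}}
extraction_table = {
    3: [{"type": "mobile", "content": "13912345678", "position": 14}],
}
results = search("Beijing", inverted, extraction_table)
```

No document text is read at query time; the cost of extraction was paid once, when the index was built.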

The advantage of the present invention is:

The inverted index created by the method of the present invention supports lookup of typed data, avoiding the time overhead that the prior art incurs when searching for typed data.

Brief description of the drawings

Fig. 1 is a flowchart of the method for establishing an inverted index according to the present invention;

Fig. 2 is a schematic diagram of the extraction result table used in the present invention;

Fig. 3 is a flowchart of the search method using the inverted index created by the present invention.

Detailed description

The present invention is described below with reference to the drawings and specific embodiments.

In the present invention, in addition to extracting keywords from documents and building an inverted index for them, relevant information can be extracted from the documents as needed and stored. When searching, the user can then reach the extracted information directly through a keyword, without parsing the original document again, which improves search-time efficiency. Taking contact information as an example, the process of building an inverted index that includes contact information is described below.

As in the prior art, building the inverted index begins with parsing the documents, converting different storage formats into a unified string format; for example, any of PDF, HTML, TXT, or DOC is converted into a unified string format. The conversion in this step is the same as in the prior art and is not described again here.

After the document has been converted into a unified string format, keywords are extracted from it. The concept of a keyword is broader in the present invention than in the prior art. Besides the specific character data common in the prior art (such as a fixed sequence of Chinese characters or letters), a keyword in the present invention can also be data of a certain type, such as a fixed-line telephone number, a mobile phone number, an e-mail address, or an identity card number. Such data share a type but differ in content, so they cannot be extracted by prior-art text matching, and special technical means are needed.

In general, data of the same type share some common features. For example, mobile phone numbers all consist of digits and have the same number of digits, and e-mail addresses all contain the @ character. In this embodiment, therefore, some special characters can be defined for a preliminary screening, and the detailed extraction is then performed by regular expressions that describe the rules. Accordingly, with reference to Fig. 1, after obtaining a document in string format, the present invention first performs word segmentation on the document and takes one word from the segmentation result. It then judges whether the word contains a special character. If it does, the regular expression corresponding to that special character is used for matching, and a successful match is extracted. If the word contains no special character, or the regular expression does not match, the keyword is extracted according to the prior-art keyword extraction method.

The process is illustrated below with mobile phone numbers. Because the digit combinations of different users' mobile phone numbers differ, it is difficult for existing keyword extraction methods to find all data of the mobile-phone-number type in a document unless the specific numbers are known in advance. In this embodiment, a regular expression is used to extract data of the mobile-phone-number type. For example, a regular expression for mobile phone numbers in mainland China is: (15[13567890]\d{8}|13[13567890]\d{8}). During keyword extraction, after word segmentation, each word taken from the segmentation result is checked for digits; if it contains digits, the above regular expression is matched against the word, and a successful match is extracted.
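The mobile-number case can be sketched with Python's re module, writing the expression with \d for a digit as in Table 1. Two points are assumptions rather than statements from the text: the whole word is required to match (fullmatch), so that longer digit strings are not reported as mobile numbers, and the function name extract_mobile is hypothetical.

```python
import re

# Mobile-number expression from this embodiment, with \d standing for a digit.
MOBILE = re.compile(r"(15[13567890]\d{8}|13[13567890]\d{8})")


def extract_mobile(word):
    """Return the word if it is a mobile number, otherwise None."""
    # Preliminary screening: skip the regular expression if there is no digit.
    if not any(ch.isdigit() for ch in word):
        return None
    # Assumption: the whole word must match, not just a substring of it.
    match = MOBILE.fullmatch(word)
    return match.group(0) if match else None


words = ["contact", "13912345678", "13212345678", "139123456789"]
found = [w for w in words if extract_mobile(w) is not None]
```

The third word fails because 2 is not in the character class [13567890], and the fourth fails because it has twelve digits.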

The keyword extraction operation has been described above using the extraction of mobile-phone-number data from a document as an example. In practice, the same method can extract several types of data, including fixed-line telephone numbers, identity card numbers, and e-mail addresses. When extracting these types, the way the type is recognized may change somewhat (for example, the special characters used may differ), and the regular expression may also differ. Table 1 below gives the regular expressions for fixed-line telephone numbers, mobile phone numbers, e-mail addresses, and identity card numbers. Those skilled in the art will understand that other types of data can also be extracted as needed, each with its own regular expression.

No.	Type	Regular expression
1	Mobile phone number	(15[13567890]\d{8}|13[13567890]\d{8})
2	Fixed-line telephone number	(\d{3}-\d{8}|\d{4}-\d{7})
3	Identity card number	(\d{15}|\d{18})
4	E-mail address	(\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*)

Table 1
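The four expressions of Table 1 can be combined into one table-driven classifier, sketched below. Several details are assumptions: the type labels, the choice of screening characters per type, whole-word matching, and the e-mail pattern being written in its commonly used form with * quantifiers after each optional group.

```python
import re

DIGITS = "0123456789"

# (type label, screening characters, regular expression); labels are illustrative.
PATTERNS = [
    ("mobile", DIGITS, r"15[13567890]\d{8}|13[13567890]\d{8}"),
    ("landline", "-", r"\d{3}-\d{8}|\d{4}-\d{7}"),
    ("id_card", DIGITS, r"\d{15}|\d{18}"),
    ("email", "@", r"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*"),
]
COMPILED = [(name, set(chars), re.compile(rx)) for name, chars, rx in PATTERNS]


def classify(word):
    """Return the type label of the first matching pattern, or None."""
    for name, chars, pattern in COMPILED:
        if chars.isdisjoint(word):      # preliminary screening by special character
            continue
        if pattern.fullmatch(word):     # detailed check by regular expression
            return name
    return None


samples = ["13912345678", "010-12345678", "110101199001011234",
           "alice@example.com", "Beijing"]
labels = [classify(w) for w in samples]
```

A word that passes no screening, or matches no expression, returns None and would be handled as an ordinary keyword.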

After keyword extraction is complete, the inverted index is built. As noted above, some words in a document are extracted by regular expressions; the extracted data usually belong to a particular data type, while their contents differ from one another. For keywords consisting of such typed data, building the inverted index therefore uses, in addition to the conventional inverted index data structure, a data structure called the extraction result table. As shown in Fig. 2, the extraction result table stores the extraction results of many documents. Each document may have several extraction results, so one document number corresponds to several extraction result records, and each record holds three items: "type", "content", and "position". "Type" indicates which regular expression identified the "content"; "content" stores the extracted result; and "position" stores where the "content" appears in the document.
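One possible in-memory shape for the extraction result table is sketched below. The class name ExtractionRecord, the use of a word offset for "position", and the sample values are assumptions; the three fields follow the record layout described above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractionRecord:
    type: str       # which detector (for example, which regular expression) matched
    content: str    # the extracted text itself
    position: int   # where the content occurs; assumed here to be a word offset


# Document number -> list of extraction result records for that document.
extraction_table = defaultdict(list)
extraction_table[12].append(ExtractionRecord("mobile", "13912345678", 5))
extraction_table[12].append(ExtractionRecord("email", "alice@example.com", 9))
extraction_table[15].append(ExtractionRecord("landline", "010-12345678", 2))

records_for_12 = extraction_table[12]
```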

Continuing with Fig. 1 and the extraction result table described above, the process of building the inverted index is as follows. As mentioned earlier, keywords include words extracted by regular expressions and words extracted by conventional methods. For a word extracted by a regular expression, the regular expression that extracted it is identified, which determines the type of the word; the type, content, and position of the word are then filled into the extraction result table under the document in which the word occurs. Finally, the extracted word and the document number of its document are added to the conventional inverted index according to the prior art. For a word extracted by a conventional method, the word and the document number of its document are added directly to the conventional inverted index according to the prior art.

Once the inverted index has been built, it can be used for search. As shown in Fig. 3, a conventional inverted index lookup by keyword yields the document numbers of a batch of relevant documents. If data of a certain type in those documents are also wanted, a conventional search system generally looks up the original documents by these document numbers, sorts them by some rule, and presents file names and document summaries. In the present invention, by contrast, all extraction results of the relevant documents can be found quickly in the extraction result table by the document numbers obtained. These results can be presented along two dimensions, type and document. For example, for the search keyword "Beijing", the results show, in addition to the titles and summaries of all documents containing the word "Beijing", all mobile phone numbers, e-mail addresses, fixed-line telephone numbers, and identity card numbers that occur in each document, listed in that document's details. Alternatively, the mobile phone numbers, e-mail addresses, fixed-line telephone numbers, and identity card numbers occurring in all documents that contain the word "Beijing" can be listed separately by type.
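The two presentation dimensions, by document and by type, can be derived from the same records, as in this sketch. The function name group_results, the type labels, and the sample data are assumptions for illustration.

```python
from collections import defaultdict


def group_results(doc_nos, extraction_table):
    """Arrange extraction results by document and by type."""
    by_document = {}
    by_type = defaultdict(list)
    for doc_no in doc_nos:
        records = extraction_table.get(doc_no, [])
        # Per-document view: everything extracted from this document.
        by_document[doc_no] = [(r["type"], r["content"]) for r in records]
        # Per-type view: the same contents, collected across documents.
        for r in records:
            by_type[r["type"]].append((doc_no, r["content"]))
    return by_document, dict(by_type)


extraction_table = {
    3: [{"type": "mobile", "content": "13912345678", "position": 14},
        {"type": "email", "content": "alice@example.com", "position": 20}],
    8: [{"type": "mobile", "content": "15112345678", "position": 2}],
}
# Suppose the keyword lookup for "Beijing" returned documents 3 and 8.
by_document, by_type = group_results([3, 8], extraction_table)
```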

As the above description of the search process shows, when documents are found by keyword and related information is then retrieved quickly from them, the information that can be retrieved quickly is limited to what was stored in the extraction result table during index creation. In the above embodiment, for instance, regular expressions extract fixed-line telephone numbers, mobile phone numbers, e-mail addresses, and identity card numbers and store them in the extraction result table; information about home addresses therefore cannot be obtained during retrieval by looking up the extraction result table directly by document number.

In the above embodiment, regular expressions are used to extract data of a certain type, but other embodiments may use other prior-art methods, such as using named entity recognition to extract information such as person names, company names, and addresses. Named entity recognition methods include rule-based methods, statistics-based methods, dictionary-based methods, and so on. In the present invention, rule-based or dictionary-based methods are preferred. Of course, when information extraction is performed by named entity recognition, the type item in the extraction result table records which named entity recognition method identified the "content".
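A dictionary-based detector, the simplest of the named entity recognition methods listed, might look like the sketch below. The dictionary entries, the method label, and the extra "category" field are assumptions added for illustration; the text itself only requires that the type item record which recognition method was used.

```python
# Hypothetical dictionary mapping known entities to a category.
ENTITY_DICTIONARY = {
    "Beijing": "address",
    "Acme": "company name",
}
METHOD = "dictionary-based NER"   # value stored in the type item


def detect_entity(word, position):
    """Return an extraction record if the word is in the dictionary, else None."""
    category = ENTITY_DICTIONARY.get(word)
    if category is None:
        return None
    return {
        "type": METHOD,          # which recognition method identified the content
        "category": category,    # assumption: not part of the three-item record
        "content": word,
        "position": position,
    }


words = ["Acme", "opened", "an", "office", "in", "Beijing"]
records = [r for i, w in enumerate(words)
           if (r := detect_entity(w, i)) is not None]
```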

Finally, it should be noted that the above embodiments only illustrate the technical solution of the present invention and do not limit it. Although the present invention has been described in detail with reference to the embodiments, those of ordinary skill in the art should understand that modifications or equivalent replacements of the technical solution that do not depart from its spirit and scope shall all fall within the scope of the claims of the present invention.

Claims

1. A method for establishing an inverted index, wherein the inverted index comprises an extraction result table, the extraction result table comprises document numbers and the extraction result records corresponding to each document number, and each extraction result record comprises a type information item, a content information item, and a position information item; the method comprising:

Step 4), establishing the general inverted index table for the word taken out, using the word and the document number of the document containing it.

2. The method for establishing an inverted index according to claim 1, characterized in that, in step 2), a regular expression is used to detect whether the word taken out belongs to data of a certain type.

3. The method for establishing an inverted index according to claim 2, characterized in that the data of a certain type comprises one of a mobile phone number, a fixed-line telephone number, an identity card number, and an e-mail address.

4. The method for establishing an inverted index according to claim 1, characterized in that, in step 2), a named entity recognition method is used to detect whether the word taken out belongs to data of a certain type; wherein the named entity recognition method comprises one of a rule-based method, a statistics-based method, and a dictionary-based method.

5. The method for establishing an inverted index according to claim 4, characterized in that the data of a certain type comprises one of a person name, a company name, and an address.

6. A method for searching using an inverted index established according to any one of claims 1-5, comprising: