CN102637204A

CN102637204A - Method for querying texts based on mutual index structure

Info

Publication number: CN102637204A
Application number: CN2012100717782A
Authority: CN
Inventors: 吴明晖; 金苍宏; 应晶; 陈天洲; 刘源清; 朱凡微
Original assignee: Zhejiang University City College ZUCC
Current assignee: Zhejiang University City College ZUCC
Priority date: 2012-03-16
Filing date: 2012-03-16
Publication date: 2012-08-15
Anticipated expiration: 2032-03-16
Also published as: CN102637204B

Abstract

The invention discloses a method for querying texts based on a mutual index structure. The method comprises the following steps: segmenting the text files to be indexed to obtain an array of text blocks, each containing a fixed number of words; constructing an inverted index structure; calculating the feature value of every text block and saving it in a feature index file; inserting at the end of each text block a pointer address length defined by a variable-length algorithm and a pointer value obtained according to that length, the pointer value being the address in the feature index file of the feature value corresponding to the text block; finding the feature value of a text block according to the feature value index of that text block in the text file; and reading the feature value of the text block and comparing it with the feature value of the queried words to determine whether the queried words are contained in the text block. The method can accelerate phrase matching, reduce input/output (I/O) operations, lower computational complexity, and improve query efficiency and match accuracy.

Description

A text query method based on a mutual index structure

Technical field

The invention belongs to the technical field of information retrieval and relates in particular to a text query method based on a mutual index structure.

Background art

In the era of information explosion, users confronted with massive amounts of data need effective ways to search for, extract, and integrate information. Search engines, as an important tool of information retrieval, are increasingly the principal means by which people obtain information. A search engine can be divided into components such as the crawler, data processing, indexing, and matching and ranking. The index is the core of the engine's data model: the structure of the index, its size, and its update efficiency all directly affect the quality of the search engine. Classified by construction principle, commonly used index structures include the forward index, the inverted index, the bitmap index, and the signature index.

For the single-word queries common in search engines, the inverted index, the bitmap index, and the signature index all provide reasonable support. The bitmap index, however, requires a large amount of space and is unsuitable for retrieval over massive data. The signature index supports word queries reasonably well but suffers from inefficient index updates and from false matches. The inverted index is widely used because of its simple structure, efficient updates, and ease of extension. For queries that require a fixed word order, such as phrase queries, however, the inverted index offers poor support. Because it is based on basic AND/OR operations, it can determine whether the individual words match independently, but it cannot guarantee the correct order among those words. To remedy this defect, the correct word order can be verified by computing over the position information of the words in the index file. This approach has two shortcomings. First, it requires a large number of computations: for a phrase of N words, verifying the word order requires on the order of O((N*K)²) comparisons, where K is the average frequency of the words in the phrase; in practice word frequencies are high, so the computational cost is high. Second, index files are usually huge and scattered across different disk blocks. Obtaining the positions of the words in the phrase often requires many reads of the index file, these reads require many disk I/O operations, and the scattered storage increases seek time, so performance is low. As phrases grow longer and more complex, the performance of the inverted index for fixed-order phrase matching degrades further.
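To make the cost of this conventional position check concrete, the following minimal Python sketch verifies word order from per-word position lists. The function name `phrase_starts` and the sample positions are illustrative assumptions, not taken from the source, and the comparison is deliberately naive.

```python
def phrase_starts(position_lists):
    """Return the start positions at which the query words occur consecutively.

    position_lists[i] holds every position of the i-th query word in one
    document (a hypothetical in-memory stand-in for the index file).
    """
    if not position_lists:
        return []
    candidates = list(position_lists[0])
    for offset, positions in enumerate(position_lists[1:], start=1):
        # Naive check: every surviving candidate is compared against
        # every recorded position of the next word.
        candidates = [c for c in candidates
                      if any(p == c + offset for p in positions)]
    return candidates


# Illustrative positions for a three-word phrase.
sample = {"text": [3, 17, 40], "query": [4, 25, 41], "method": [5, 42, 60]}
print(phrase_starts([sample["text"], sample["query"], sample["method"]]))  # [3, 40]
```

Every additional word, and every additional occurrence of a word, adds comparisons and position-list reads, which is the overhead the background section describes.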

Therefore, in view of the above defects in the prior art, it is necessary to study and provide a scheme that overcomes them, avoiding the large number of file reads and the resulting low I/O performance.

Summary of the invention

To solve the above problems, the object of the present invention is to provide a text query method based on a mutual index structure that accelerates phrase matching, reduces I/O operations, and improves query efficiency.

To achieve the above object, the technical scheme of the present invention is as follows:

A text query method based on a mutual index structure, in which a fixed-length feature value is set for each text block containing a fixed number of words, the feature values are stored in a feature index file, and the feature index file and the text blocks reference each other, specifically comprising the following steps:

Segmenting the text to be indexed to obtain an array of text blocks, each containing a fixed number of words, and storing the text block array sequentially in a text file;

Building an inverted index structure comprising the word, the word frequency, the numbers of the texts in which the word occurs, and the position information of the word's occurrences in the text;

Calculating the feature value of each text block and storing the feature values in the feature index file;

At the end of each text block in the text file, inserting two values: a pointer address length given by a variable-length algorithm, and a pointer value obtained according to that pointer address length, the pointer value being the address in the feature index file of the feature value corresponding to the text block;

Finding the text block feature value according to the feature value index of the specific text block in the text file;

Reading the text block feature value and comparing it with the feature value of the query word to judge whether the query word is contained in the text block:

If the feature value of the query word is contained in the text block feature value, the query matches, and the information of the text block containing the word is read through the text block reference in the feature index file;

If the feature value of the query word is not contained in the text block feature value, the query does not match.

Compared with the inverted index structure adopted in the prior art, the present invention improves the inverted index structure: the inverted index first locates the candidate matching text blocks, and the mutual index structure then performs exact matching, which accelerates phrase matching, reduces I/O operations, and improves query efficiency.
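The source does not define how a query feature value is judged to be "contained" in a text block feature value. One possible reading, borrowed from signature-file indexing, is bitwise inclusion: every bit set in the query feature must also be set in the block feature. The sketch below illustrates only that assumed reading; the function name `feature_contains` and the sample byte values are hypothetical.

```python
def feature_contains(block_feature: bytes, query_feature: bytes) -> bool:
    """Assumed test: True if every bit set in query_feature is set in block_feature."""
    if len(block_feature) != len(query_feature):
        raise ValueError("feature values must have the same fixed length")
    return all((b & q) == q for b, q in zip(block_feature, query_feature))


# Two-byte toy features; the method itself uses fixed-length feature values.
block = bytes([0b10110110, 0b00001111])
hit = bytes([0b00100100, 0b00000011])
miss = bytes([0b01000000, 0b00000001])
print(feature_contains(block, hit))   # True
print(feature_contains(block, miss))  # False
```

Under this reading a positive result can still be a false match, so a matching block would be confirmed by reading the block itself through the text block reference.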

Description of drawings

Fig. 1 is a flowchart of a text query method based on a mutual index structure according to an embodiment of the present invention;

Fig. 2 is a flowchart of step S10 of the text query method based on a mutual index structure according to an embodiment of the present invention;

Fig. 3 is a flowchart of step S30 of the text query method based on a mutual index structure according to an embodiment of the present invention;

Fig. 4 is a flowchart of step S50 of the text query method based on a mutual index structure according to an embodiment of the present invention.

Embodiment

To make the object, technical scheme, and advantages of the present invention clearer, the present invention is further elaborated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it.

On the contrary, the present invention covers any alternatives, modifications, equivalent methods, and schemes made within the spirit and scope of the invention as defined by the claims. Further, to give the public a better understanding of the present invention, some specific details are described in detail below. A person skilled in the art can fully understand the present invention even without the description of these details.

A text query method based on a mutual index structure, in which a fixed-length feature value is set for each text block containing a fixed number of words, the feature values are stored in a feature index file, and the feature index file and the text blocks reference each other, specifically comprises the following steps:

S10: segmenting the text to be indexed to obtain an array of text blocks, each containing a fixed number of words, and storing the text block array sequentially in a text file;

S20: building an inverted index structure comprising the word, the word frequency, the numbers of the texts in which the word occurs, and the position information of the word's occurrences in the text;

The storage structure of the inverted index is as follows: {word, word frequency, numbers of the texts in which the word occurs, position information of the word's occurrences in the text}, where:

The word is the word to be indexed, generally after stop words have been removed, and each word has been stemmed;

The word frequency is the number of times the word occurs in the index; this value is of integer type;

The text numbers are the document numbers of the documents containing the word; there are generally several document numbers, stored as an array;

The position information of the word's occurrences in the text is the start position of the text block containing the word. Note that in an existing inverted index the position field stores the word's position within the full text, whereas the present invention stores in this field the start position of the text block containing the word. For example, suppose the word "China" occurs three times in the file with text number 246, at positions 32, 237, and 245, which belong to text blocks 4, 25, and 25 respectively. In an existing inverted index the entry is {China, 3, 246, 32, 237, 245}; in the present embodiment the entry is {China, 3, address of text block 4, address of text block 25, address of text block 25}.

S30: calculating the feature value of each text block and storing the feature values in the feature index file;

S40: at the end of each text block in the text file, inserting two values: a pointer address length given by a variable-length algorithm, and a pointer value obtained according to that pointer address length, the pointer value being the address in the feature index file of the feature value corresponding to the text block;

S50: finding the text block feature value according to the feature value index of the specific text block in the text file;

S60: reading the text block feature value and comparing it with the feature value of the query word to judge whether the query word is contained in the text block:

If the feature value of the query word is contained in the text block feature value, the query matches, and the information of the text block containing the word is read through the text block reference in the feature index file;

If the feature value of the query word is not contained in the text block feature value, the query does not match.
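The inverted-index entry of step S20 can be sketched as a small Python record that stores block start addresses in the position field instead of word positions. The class name `Posting`, the helper `add_occurrence`, and the concrete block addresses are assumptions made for illustration; only the field layout and the "China" example come from the description.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Posting:
    word: str
    frequency: int = 0                                         # occurrences in the index
    text_numbers: List[int] = field(default_factory=list)      # documents containing the word
    block_addresses: List[int] = field(default_factory=list)   # start address of each containing block


def add_occurrence(index: Dict[str, Posting], word: str,
                   text_number: int, block_address: int) -> None:
    """Record one occurrence of `word`, keeping the block address, not the word position."""
    posting = index.setdefault(word, Posting(word))
    posting.frequency += 1
    if text_number not in posting.text_numbers:
        posting.text_numbers.append(text_number)
    posting.block_addresses.append(block_address)


# Hypothetical start addresses of text blocks 4 and 25 inside text file 246.
BLOCK_ADDRESS = {4: 0x01A4, 25: 0x0D20}

index: Dict[str, Posting] = {}
for block_number in (4, 25, 25):
    add_occurrence(index, "China", 246, BLOCK_ADDRESS[block_number])

entry = index["China"]
print(entry.frequency, entry.text_numbers, entry.block_addresses)
```

Two occurrences in the same block yield the same address twice, mirroring the repeated "address of text block 25" in the example entry.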

Specifically, S10 further comprises the following steps:

S101: removing the stop words contained in the text and stemming the text;

S102: given a fixed word count, dividing the file into text blocks by that word count, so that each text block contains the same number of words;

To reduce the false-match rate of the text block feature values, a suitable word count must be chosen. The fixed word count ensures that a text query can jump directly to a text block at any distance.

S103: generating file names with a sequentially increasing function, assigning a special file suffix, and storing the fixed-word-count text block array sequentially in the text file.

This step ensures that the segmented text block array is consistent with the original text content in word order, and thereby ensures word-order consistency among the text block feature values.
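Steps S101 to S103 can be sketched as follows. The stop-word list, the toy stemmer, the `.blk` suffix, and all function names are assumptions for illustration; the source specifies none of them.

```python
import re

# Illustrative stop-word list; the source does not specify one.
STOP_WORDS = frozenset({"a", "an", "and", "in", "of", "on", "the", "to"})


def stem(word):
    """Toy stemmer (assumption): drop one trailing 's'. A real system would use a proper stemmer."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def split_into_blocks(text, words_per_block):
    """S101 and S102: remove stop words, stem, then cut into blocks of words_per_block words."""
    words = [stem(w) for w in re.findall(r"[a-z0-9]+", text.lower())
             if w not in STOP_WORDS]
    # The final block may be shorter; the source does not say how a remainder is handled.
    return [words[i:i + words_per_block]
            for i in range(0, len(words), words_per_block)]


def block_file_name(sequence_number, suffix=".blk"):
    """S103: sequentially increasing file name with a special suffix (".blk" is hypothetical)."""
    return f"{sequence_number:08d}{suffix}"


blocks = split_into_blocks("The cats sat on the mat and the dogs ran to the park", 3)
print(blocks)              # [['cat', 'sat', 'mat'], ['dog', 'ran', 'park']]
print(block_file_name(3))  # 00000003.blk
```

Because the blocks are produced by slicing the word list in order, the word order of the original text is preserved across blocks.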

Specifically, S30 further comprises the following steps:

S301: selecting three fixed-length signature algorithms, MD4, MD5, and RipeMD128, each of which produces a text block feature value 16 bytes (128 bits) long;

The advantage of these three algorithms is that their output length is fixed and they discriminate well between inputs.

S302: applying an AND operation, through a conjunction function, to the feature values obtained from the three algorithms, yielding a single feature value 16 bytes long;

S303: storing the feature values sequentially in a binary feature index file, with a fixed-length pointer address pointing to the text block placed at the end of each feature value.
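A minimal Python sketch of S301 to S303 follows. MD4 and RipeMD128 (RIPEMD-128) are not reliably available in Python's `hashlib`, so this sketch substitutes two other 16-byte digests (BLAKE2b and SHAKE-128) purely as stand-ins; only MD5 matches the source. The 8-byte pointer width and the names `block_feature`, `pack_record`, and `unpack_record` are likewise assumptions.

```python
import hashlib
import struct

FEATURE_LEN = 16   # bytes; MD4, MD5 and RIPEMD-128 all produce 128-bit digests
RECORD = struct.Struct(">16sQ")  # feature value + assumed 8-byte pointer to the text block


def block_feature(block_text: str) -> bytes:
    """S301 and S302: compute three 16-byte digests and AND them byte by byte."""
    data = block_text.encode("utf-8")
    digests = [
        hashlib.md5(data).digest(),                                # named in the source
        hashlib.blake2b(data, digest_size=FEATURE_LEN).digest(),   # stand-in for MD4
        hashlib.shake_128(data).digest(FEATURE_LEN),               # stand-in for RIPEMD-128
    ]
    return bytes(a & b & c for a, b, c in zip(*digests))


def pack_record(feature: bytes, block_address: int) -> bytes:
    """S303: one fixed-length record of the binary feature index file."""
    return RECORD.pack(feature, block_address)


def unpack_record(record: bytes):
    """Inverse of pack_record: returns (feature, block_address)."""
    return RECORD.unpack(record)


feature = block_feature("cat sat mat")
record = pack_record(feature, 420)
print(len(feature), len(record))  # 16 24
```

Since every record has the same size, the record of the k-th text block starts at byte k * 24 of the index file under these assumptions.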

Specifically, S50 further comprises the following steps:

S501: using the inverted index structure, opening the text file according to the text number, reading the string length of the containing text block according to the position information of the word's occurrence, skipping the text block content, and moving the pointer to the end of the text block;

S502: opening the feature index file of the same name according to the text number in which the word occurs;

S503: reading the address pointer value of the text block feature value according to the variable-length integer at the end of the text block, and reading the 16-byte text block feature value at that pointer address.

To read the feature value of an adjacent text block, it suffices to move forward or backward by [16 + text block index length] bytes to locate the address of that block's feature value.
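The lookup in S501 to S503 can be sketched with in-memory files. The on-disk layout used here is an assumption, since the source does not fix it: a 2-byte content length, the block content, a 1-byte pointer address length, then the pointer value in that many bytes. The function names are hypothetical.

```python
import io


def write_block(out, content: bytes, feature_address: int) -> int:
    """Append one text block with its trailing pointer; return the block's start address."""
    start = out.tell()
    # Shortest byte width that can hold the address (the "variable-length" part).
    ptr_len = max(1, (feature_address.bit_length() + 7) // 8)
    out.write(len(content).to_bytes(2, "big"))           # assumed 2-byte content length
    out.write(content)
    out.write(bytes([ptr_len]))                          # pointer address length
    out.write(feature_address.to_bytes(ptr_len, "big"))  # pointer value
    return start


def read_feature(text_file, feature_file, block_address: int,
                 feature_len: int = 16) -> bytes:
    """S501 to S503: skip the block content, follow the pointer, read the feature value."""
    text_file.seek(block_address)
    content_len = int.from_bytes(text_file.read(2), "big")
    text_file.seek(content_len, io.SEEK_CUR)             # skip the block content unread
    ptr_len = text_file.read(1)[0]
    feature_address = int.from_bytes(text_file.read(ptr_len), "big")
    feature_file.seek(feature_address)
    return feature_file.read(feature_len)


# Feature index file with two 24-byte records (16-byte feature + assumed 8-byte pointer).
feature_file = io.BytesIO(b"\x01" * 16 + bytes(8) + b"\x02" * 16 + bytes(8))
text_file = io.BytesIO()
first = write_block(text_file, b"cat sat mat", 0)
second = write_block(text_file, b"dog ran park", 24)
print(read_feature(text_file, feature_file, second) == b"\x02" * 16)  # True
```

The block content is skipped with a seek rather than read, which is where the I/O saving described in this section comes from.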

The mutual index structure of this embodiment does not need to read the text content in advance during matching; instead it reads the fixed-length text feature values from the index file. The feature value is much smaller than the text block. Taking English as an example, with 2 bytes per character and an average word length of 7 characters, a text block of 10 words occupies 2*7*10 = 140 bytes in total, whereas the feature value occupies only 16 bytes, a space saving of about 90%. Moreover, the feature value has a fixed length: to read the feature of the N-th preceding or the M-th following text block, it suffices to move backward by N*[16 + text block index length] bytes or forward by M*[16 + text block index length] bytes to locate it directly. Compared with traditional text matching, which must read sequentially from the file header, the present invention accelerates matching, reduces I/O operations, and improves query efficiency.
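The arithmetic above can be checked with a few lines of Python. The constants restate the figures in the paragraph; the function name `neighbour_feature_address` and the 8-byte text block index length used in the example are assumptions.

```python
BYTES_PER_CHAR = 2     # per English character, as stated above
AVG_WORD_LEN = 7       # average characters per English word
WORDS_PER_BLOCK = 10   # example block size
FEATURE_LEN = 16       # bytes per feature value

block_bytes = BYTES_PER_CHAR * AVG_WORD_LEN * WORDS_PER_BLOCK
saving = 1 - FEATURE_LEN / block_bytes
print(block_bytes)       # 140
print(round(saving, 3))  # 0.886


def neighbour_feature_address(current_address, steps, index_len):
    """Address of the feature value `steps` records away (negative = preceding blocks)."""
    return current_address + steps * (FEATURE_LEN + index_len)


# With a hypothetical 8-byte text block index length, each record is 24 bytes.
print(neighbour_feature_address(240, -2, 8))  # 192
print(neighbour_feature_address(240, 3, 8))   # 312
```

The saving of roughly 88.6% is the "about 90%" quoted in the text.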

The above is merely a preferred embodiment of the present invention and is not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims

1. A text query method based on a mutual index structure, characterized in that a fixed-length feature value is set for each text block containing a fixed number of words, said feature values are stored in a feature index file, and said feature index file and said text blocks reference each other, specifically comprising the following steps:

2. The text query method based on a mutual index structure according to claim 1, characterized in that said segmenting of the text to be indexed, obtaining an array of text blocks each containing a fixed number of words, and storing it in a text file further comprises the following steps:

Removing the stop words contained in said text and stemming said text;

Given a fixed word count, dividing the file into text blocks by said fixed word count, so that each text block contains the same number of words;

Generating file names with a sequentially increasing function, assigning a special file suffix, and storing said fixed-word-count text block array sequentially in the text file.

3. The text query method based on a mutual index structure according to claim 1, characterized in that said calculating of the feature value of each text block and storing of said feature values in the feature index file further comprises the following steps:

Selecting three fixed-length signature algorithms, MD4, MD5, and RipeMD128, each of which produces a text block feature value 16 bytes long;

Applying an AND operation, through a conjunction function, to the feature values obtained from said three algorithms, yielding a single feature value 16 bytes long;

Storing said feature values sequentially in a binary feature index file, with a fixed-length pointer address pointing to the text block placed at the end of each said feature value.

4. The text query method based on a mutual index structure according to claim 1, characterized in that said finding of the text block feature value according to the feature value index of the specific text block in the text file further comprises the following steps:

Using the inverted index structure, opening the text file according to the text number, reading the string length of the containing text block according to the position information of the word's occurrence, skipping the text block content, and moving the pointer to the end of the text block;

Opening the feature index file of the same name according to the text number in which the word occurs;

Reading the address pointer value of the text block feature value according to the variable-length integer at the end of the text block, and reading the 16-byte text block feature value at the pointer address.