CN102637204A - Method for querying texts based on mutual index structure - Google Patents

Publication number
CN102637204A
Authority
CN
China
Prior art keywords
text
text block
feature value
word
index
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2012100717782A
Other languages
Chinese (zh)
Other versions
CN102637204B (en)
Inventor
吴明晖
金苍宏
应晶
陈天洲
刘源清
朱凡微
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University City College ZUCC
Original Assignee
Zhejiang University City College ZUCC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University City College ZUCC
Priority to CN201210071778.2A
Publication of CN102637204A
Application granted
Publication of CN102637204B
Legal status: Expired - Fee Related
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for querying texts based on a mutual index structure. The method comprises the following steps: segmenting the text files to be indexed to obtain an array of text blocks each containing a fixed number of words; constructing an inverted index structure; calculating the feature value of every text block and saving it in a feature index file; inserting at the end of each text block a pointer address length given by a variable-length algorithm and a pointer value obtained according to that length, the pointer value being the address in the feature index file of the feature value corresponding to the text block; finding the feature value of the text block according to the feature value index of the particular text block in the text file; and reading the feature value of the text block and comparing it with the feature value of the query words to determine whether the query words are contained in the text block. The method speeds up phrase matching, reduces input/output (I/O) operations, lowers computational complexity, and improves query efficiency and matching accuracy.

Description

A text query method based on a mutual index structure
Technical field
The invention belongs to the technical field of information retrieval and in particular relates to a text query method based on a mutual index structure.
Background art
In the era of information explosion, users facing massive amounts of data need effective ways to search for, extract, and integrate information. Search engines, an important tool in information retrieval, are increasingly the main means by which people obtain information. A search engine can be divided into components such as the crawler, data processing, indexing, and matching and ranking. The index is the core of the engine's data model: its structure, its size, and its update efficiency all directly affect the quality of the search engine. Classified by construction principle, the commonly used index structures include the forward index, the inverted index, the bitmap index, and the signature index.
For single-word queries, which are common in search engines, inverted, bitmap, and signature indexes all provide good support. A bitmap index, however, needs a large amount of space and is not suited to retrieval over massive data. A signature index supports word queries well but is inefficient to update and also suffers from false matches. The inverted index is widely used because its structure is simple, it is efficient to update, and it is easy to extend. For queries that must preserve a fixed word order, such as phrase queries, the inverted index offers poor support. Because it is built on basic AND/OR operations, it can determine whether each word matches independently, but it cannot guarantee that the words appear in the correct order. To remedy this, the correct word order can be verified by computing over the position information of the words in the indexed file. This approach has two shortcomings. First, it requires a large number of computations: for a phrase of N words, verifying the word order requires comparison operations with time complexity O((N*K)^2), where K is the average frequency of occurrence of the words in the phrase; in practice word frequencies are high, so the computational cost is high. Second, index files are usually huge and scattered across different disk blocks. Obtaining the positions of the words often requires many index-file reads, these reads require a large amount of disk I/O, and the scattered storage increases seek time, so performance is low. As phrases become longer and more complex, the inverted index performs ever worse on fixed-word-order phrase matching.
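To make this cost concrete, the following minimal Python sketch shows conventional phrase matching over a positional inverted index. The index layout and the function name `phrase_match` are illustrative assumptions and do not come from the patent; the point is only that every candidate start position has to be checked against the position lists of all the other words.

```python
from typing import Dict, List

# Hypothetical positional index: word -> {doc_id: sorted word positions}.
PositionalIndex = Dict[str, Dict[int, List[int]]]

def phrase_match(index: PositionalIndex, phrase: List[str]) -> Dict[int, List[int]]:
    """Return {doc_id: start positions} where the words occur consecutively."""
    if not phrase or any(w not in index for w in phrase):
        return {}
    # Documents that contain every word (the plain AND query).
    docs = set(index[phrase[0]])
    for word in phrase[1:]:
        docs &= set(index[word])
    hits: Dict[int, List[int]] = {}
    for doc in sorted(docs):
        starts = []
        for start in index[phrase[0]][doc]:
            # Word i of the phrase must sit exactly i positions after the first.
            if all(start + i in index[w][doc] for i, w in enumerate(phrase)):
                starts.append(start)
        if starts:
            hits[doc] = starts
    return hits

index = {
    "text": {1: [0, 5], 2: [3]},
    "query": {1: [1, 9], 2: [0]},
    "method": {1: [2], 2: [4]},
}
print(phrase_match(index, ["text", "query", "method"]))  # {1: [0]}
```

Both documents pass the AND query, but only document 1 contains the three words in the required order, and establishing that requires scanning the position lists.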
Therefore, in view of the above defects in the prior art, it is necessary to devise a scheme that overcomes them and avoids the large number of file reads and the resulting low I/O performance.
Summary of the invention
To solve the above problems, the object of the present invention is to provide a text query method based on a mutual index structure that speeds up phrase matching, reduces I/O operations, and improves query efficiency.
To achieve this object, the technical solution of the present invention is as follows:
A text query method based on a mutual index structure, in which a fixed-length feature value is assigned to each text block containing a fixed number of words, the feature values are stored in a feature index file, and the feature index file and the text blocks reference each other. The method comprises the following steps:
Segmenting the text file to be indexed to obtain an array of text blocks each containing a fixed number of words, and storing the text block array in order in a text file;
Building an inverted index structure, the inverted index structure comprising the word, the word frequency, the numbers of the texts in which the word occurs, and the position information of the word's occurrences in the text;
Calculating the feature value of each text block and storing the feature value in the feature index file;
At the end of each text block in the text file, inserting two numerical values, namely a pointer address length given by a variable-length algorithm and a pointer value obtained according to the pointer address length, the pointer value being the address in the feature index file of the feature value corresponding to the text block;
Finding the text block feature value according to the feature value index of the particular text block in the text file;
Reading the text block feature value and comparing it with the feature value of the query word to determine whether the query word is contained in the text block:
If the query word's feature value is contained in the text block's feature value, the query matches, and the information of the text block containing the word is read through the text block reference in the feature index file;
If the query word's feature value is not contained in the text block's feature value, the query does not match.
Compared with the inverted index structure used in the prior art, the present invention improves the inverted index structure: candidate matching text blocks are first located through the improved inverted index and then matched exactly through the mutual index structure, which speeds up phrase matching, reduces I/O operations, and improves query efficiency.
Description of drawings
Fig. 1 is a flowchart of a text query method based on a mutual index structure according to an embodiment of the invention;
Fig. 2 is a flowchart of step S10 of the text query method based on a mutual index structure according to an embodiment of the invention;
Fig. 3 is a flowchart of step S30 of the text query method based on a mutual index structure according to an embodiment of the invention;
Fig. 4 is a flowchart of step S50 of the text query method based on a mutual index structure according to an embodiment of the invention.
Embodiment
To make the object, technical solution, and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here only illustrate the invention and do not limit it.
On the contrary, the invention covers any alternatives, modifications, equivalent methods, and schemes made within the spirit and scope of the invention as defined by the claims. Further, to give the public a better understanding of the invention, some specific details are described at length below. A person skilled in the art can fully understand the invention even without these details.
A text query method based on a mutual index structure assigns a fixed-length feature value to each text block containing a fixed number of words, stores the feature values in a feature index file, and has the feature index file and the text blocks reference each other. The method comprises the following steps:
S10: segment the text file to be indexed to obtain an array of text blocks each containing a fixed number of words, and store the text block array in order in a text file;
S20: build an inverted index structure comprising the word, the word frequency, the numbers of the texts in which the word occurs, and the position information of the word's occurrences in the text;
The storage structure of the inverted index is {word, word frequency, numbers of the texts in which the word occurs, position information of the word in the text}, where:
The word is the word to be indexed; stop words are generally removed, and every word has been stemmed;
The word frequency is the number of times the word occurs in the index; it is an integer;
The text numbers are the document numbers of the documents containing the word; there are generally several, stored as an array;
The position information is the start position of the text block containing the word. Note that in an existing inverted index the position field stores the word's position in the full text, whereas the present invention stores in this field the start position of the text block containing the word. For example, suppose the word "China" occurs in the file with text number 246 at three positions, 32, 237, and 245, which belong to text blocks 4, 25, and 25 respectively. The existing inverted index entry is {China, 3, 246, 32, 237, 245}; in the present embodiment the entry is {China, 3, address of text block 4, address of text block 25, address of text block 25}.
S30: calculate the feature value of each text block and store it in the feature index file;
S40: at the end of each text block in the text file, insert two numerical values: a pointer address length given by a variable-length algorithm and a pointer value obtained according to that length, the pointer value being the address in the feature index file of the feature value corresponding to the text block;
S50: find the text block feature value according to the feature value index of the particular text block in the text file;
S60: read the text block feature value and compare it with the feature value of the query word to determine whether the query word is contained in the text block:
If the query word's feature value is contained in the text block's feature value, the query matches, and the information of the text block containing the word is read through the text block reference in the feature index file;
If the query word's feature value is not contained in the text block's feature value, the query does not match.
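The inverted index entry of step S20 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the names `Posting` and `to_block_posting`, the 0-based word positions, and the byte offsets in the usage lines are all hypothetical and are not taken from the patent. The sketch only shows the one change the embodiment describes, namely storing the start address of the containing text block instead of the word's position in the full text.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Posting:
    """One inverted-index entry: {word, frequency, text numbers, positions}."""
    word: str
    frequency: int
    doc_ids: List[int]
    positions: List[int]  # here: start addresses of the containing text blocks

def to_block_posting(word: str, doc_id: int, word_positions: List[int],
                     words_per_block: int, block_addresses: List[int]) -> Posting:
    """Replace each word position with the start address of its text block.

    Assumes 0-based word positions and that block_addresses[i] is the start
    offset of block i in the text file (both are illustrative assumptions).
    """
    starts = [block_addresses[p // words_per_block] for p in word_positions]
    return Posting(word, len(word_positions), [doc_id], starts)

# Three blocks of 10 words stored at byte offsets 0, 150 and 290 (made-up values).
posting = to_block_posting("china", 246, [3, 21, 27], 10, [0, 150, 290])
print(posting.positions)  # [0, 290, 290]
```

Two occurrences that fall in the same block map to the same address, which mirrors the repeated "address of text block 25" in the example entry.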
Specifically, S10 further comprises the following steps:
S101: remove the stop words contained in the text and stem the text;
S102: given a fixed word count, split the file into text blocks by that count, so that every text block contains the same number of words;
To reduce the false-positive rate of the text block feature values, a suitable word count must be chosen. A fixed word count also guarantees that a text query can jump directly to a text block at any distance.
S103: generate file names with a sequentially increasing function, assign a special file suffix, and store the fixed-word-count text block array in order in the text file.
This step keeps the word order of the segmented text block array consistent with the original text, and thereby keeps the word order consistent across the text block feature values.
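Steps S101 and S102 can be sketched in Python as follows. The stop-word list and the `stem` function are toy placeholders introduced here for illustration (a real system would use a proper stemmer), and the function name `segment` is likewise an assumption. Writing the blocks to sequentially named files (S103) is omitted.

```python
from typing import List

STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "or", "is", "to"}  # toy list

def stem(word: str) -> str:
    """Placeholder stemmer: strips a few common suffixes (not a real algorithm)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def segment(text: str, words_per_block: int) -> List[List[str]]:
    """S101-S102: drop stop words, stem, then cut into fixed-size blocks.

    Word order is preserved; the final block may be shorter than the rest.
    """
    words = [stem(w) for w in text.lower().split() if w not in STOP_WORDS]
    return [words[i:i + words_per_block]
            for i in range(0, len(words), words_per_block)]

blocks = segment("the index of the search engine is matching queries in order", 3)
print(blocks)  # [['index', 'search', 'engine'], ['match', 'queri', 'order']]
```

Because every block except possibly the last holds the same number of words, the block that contains a given word position can be found by integer division rather than by scanning.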
Specifically, S30 further comprises the following steps:
S301: select three fixed-length signature algorithms, MD4, MD5, and RIPEMD-128, each producing a text block feature value 16 bytes (128 bits) long;
The advantage of these three algorithms is that their output length is fixed and they discriminate well between inputs.
S302: combine the feature values obtained from the three algorithms with an AND operation through a conjunction function, obtaining a single feature value 16 bytes long;
S303: store the feature values in order in a binary feature index file, with a fixed-length pointer address pointing to the text block at the end of each feature value.
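Steps S301 and S302 can be sketched in Python as follows, with two loudly stated assumptions. First, MD4 and RIPEMD-128 are not reliably available in Python's `hashlib`, so this sketch substitutes truncated SHA-1 and SHA-256 digests as stand-ins; only MD5 is used as named in the text. Second, the source does not spell out how a query word's feature value is tested for being "contained in" a block's feature value, so `is_contained` shows just one possible reading, a bitwise subset test. The function names are hypothetical.

```python
import hashlib

FEATURE_LEN = 16  # bytes, i.e. 128 bits

def digests(data: bytes) -> list:
    """Three fixed-length 16-byte digests.

    Stand-ins: the source names MD4, MD5 and RIPEMD-128; only MD5 is used
    as named here, the other two are truncated SHA-1 and SHA-256 (assumption).
    """
    return [
        hashlib.md5(data).digest(),
        hashlib.sha1(data).digest()[:FEATURE_LEN],
        hashlib.sha256(data).digest()[:FEATURE_LEN],
    ]

def feature_value(block_words: list) -> bytes:
    """S301-S302: hash the block three ways and AND the results bytewise."""
    data = " ".join(block_words).encode("utf-8")
    d1, d2, d3 = digests(data)
    return bytes(a & b & c for a, b, c in zip(d1, d2, d3))

def is_contained(query: bytes, block: bytes) -> bool:
    """One possible reading of the comparison: every bit set in `query`
    is also set in `block` (an assumption, not stated in the source)."""
    return all((q & b) == q for q, b in zip(query, block))

fv = feature_value(["index", "search", "engine"])
print(len(fv))  # 16
```

Because the result is the AND of three digests, every bit set in the combined value is also set in each individual digest, so the combined value is sparser than any single digest.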
Specifically, S50 further comprises the following steps:
S501: using the inverted index structure, open the text file according to the text number, read the string length of the containing text block according to the position information of the word's occurrence, skip the text block content, and move the file pointer to the end of the text block;
S502: according to the number of the text in which the word occurs, open the feature index file of the same name;
S503: according to the variable-length integer at the end of the text block, read the address pointer value of the text block feature value, and read the 16-byte text block feature value at that pointer address.
If the feature value of an adjacent text block is needed, moving forward or backward by [16 + text block index length] bytes reaches the address of that block's feature value.
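The lookup in S503 and the neighbor access described above can be sketched in Python with in-memory files. The byte layout is an assumption made for illustration: the trailer after each text block is taken to be a one-byte pointer length followed by a big-endian pointer value, each feature index record is taken to be a 16-byte feature value (reading the source's "16" as a 128-bit digest) followed by a 4-byte back pointer, and all function names are hypothetical.

```python
import io

FEATURE_LEN = 16          # bytes per feature value (assumed: 128-bit digest)
BACK_PTR_LEN = 4          # assumed fixed-length pointer back to the text block
RECORD_LEN = FEATURE_LEN + BACK_PTR_LEN

def encode_trailer(address: int) -> bytes:
    """S40: pointer address length (1 byte) followed by the pointer value."""
    n = max(1, (address.bit_length() + 7) // 8)
    return bytes([n]) + address.to_bytes(n, "big")

def read_trailer(text_file, block_end: int) -> int:
    """S503: at the end of a block, read the length, then the pointer value."""
    text_file.seek(block_end)
    n = text_file.read(1)[0]
    return int.from_bytes(text_file.read(n), "big")

def read_feature(index_file, address: int, neighbor: int = 0) -> bytes:
    """Read the feature at `address`, or the one `neighbor` records away."""
    index_file.seek(address + neighbor * RECORD_LEN)
    return index_file.read(FEATURE_LEN)

# Build a tiny feature index with three records and one text block.
features = [bytes([i]) * FEATURE_LEN for i in (1, 2, 3)]
back_ptr = (0).to_bytes(BACK_PTR_LEN, "big")
index_file = io.BytesIO(b"".join(f + back_ptr for f in features))
content = b"index search engine"
text_file = io.BytesIO(content + encode_trailer(RECORD_LEN))  # points at record 1

address = read_trailer(text_file, len(content))
print(address)                                               # 20
print(read_feature(index_file, address) == features[1])      # True
print(read_feature(index_file, address, -1) == features[0])  # True
```

Since every record has the same length, the previous or next block's feature value is reached with a single seek and no scan of the text file.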
With the mutual index structure of this embodiment, matching does not need to read the text content in advance; instead it reads the fixed-length text feature values from the index file. A feature value is much smaller than a text block. Counting 2 bytes per English character and an average English word length of 7 characters, a text block of 10 words occupies 2*7*10 = 140 bytes, while its feature value occupies only 16 bytes, a space saving of about 90%. Moreover, the feature value has a fixed length: to read the features of the N blocks before or the M blocks after the current one, it suffices to move back N*[16 + text block index length] bytes or forward M*[16 + text block index length] bytes to locate them directly. Compared with traditional text matching, which must read sequentially from the file header, the invention speeds up matching, reduces I/O operations, and improves query efficiency.
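The figures in this paragraph can be checked with a short calculation. It reads the unit as bytes (an assumption based on the 128-bit output of the named digest algorithms) and writes $L$ for the text block index length, a symbol introduced here only for convenience:

```latex
\begin{align*}
\text{block size} &= 2 \times 7 \times 10 = 140 \text{ bytes}, \\
\text{saving} &= 1 - \frac{16}{140} \approx 0.886 \quad (\text{about } 90\%), \\
\text{offset of the } N\text{-th neighbor} &= N \times (16 + L) \text{ bytes}.
\end{align*}
```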
The above is merely a preferred embodiment of the present invention and is not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within its scope of protection.

Claims (4)

1. A text query method based on a mutual index structure, characterized in that a fixed-length feature value is assigned to each text block containing a fixed number of words, the feature value is stored in a feature index file, and the feature index file and the text block reference each other, the method comprising the following steps:
Segmenting the text file to be indexed to obtain an array of text blocks each containing a fixed number of words, and storing the text block array in order in a text file;
Building an inverted index structure, the inverted index structure comprising the word, the word frequency, the numbers of the texts in which the word occurs, and the position information of the word's occurrences in the text;
Calculating the feature value of each text block and storing the feature value in the feature index file;
At the end of each text block in the text file, inserting two numerical values, namely a pointer address length given by a variable-length algorithm and a pointer value obtained according to the pointer address length, the pointer value being the address in the feature index file of the feature value corresponding to the text block;
Finding the text block feature value according to the feature value index of the particular text block in the text file;
Reading the text block feature value and comparing the text block feature value with the feature value of the query word to determine whether the query word is contained in the text block:
If the query word's feature value is contained in the text block's feature value, the query matches, and the information of the text block containing the word is read through the text block reference in the feature index file;
If the query word's feature value is not contained in the text block's feature value, the query does not match.
2. The text query method based on a mutual index structure according to claim 1, characterized in that segmenting the text file to be indexed to obtain an array of text blocks each containing a fixed number of words, and storing the array in a text file, further comprises the following steps:
Removing the stop words contained in the text and stemming the text;
Given a fixed word count, splitting the file into text blocks by the fixed word count, so that every text block contains the same number of words;
Generating file names with a sequentially increasing function, assigning a special file suffix, and storing the fixed-word-count text block array in order in the text file.
3. The text query method based on a mutual index structure according to claim 1, characterized in that calculating the feature value of each text block and storing the feature value in the feature index file further comprises the following steps:
Selecting three fixed-length signature algorithms, MD4, MD5, and RIPEMD-128, each producing a text block feature value 16 bytes (128 bits) long;
Combining the feature values obtained from the three algorithms with an AND operation through a conjunction function, obtaining a single feature value 16 bytes long;
Storing the feature values in order in a binary feature index file, with a fixed-length pointer address pointing to the text block at the end of each feature value.
4. The text query method based on a mutual index structure according to claim 1, characterized in that finding the text block feature value according to the feature value index of the particular text block in the text file further comprises the following steps:
Using the inverted index structure, opening the text file according to the text number, reading the string length of the containing text block according to the position information of the word's occurrence, skipping the text block content, and moving the file pointer to the end of the text block;
According to the number of the text in which the word occurs, opening the feature index file of the same name;
According to the variable-length integer at the end of the text block, reading the address pointer value of the text block feature value, and reading the 16-byte text block feature value at that pointer address.
CN201210071778.2A 2012-03-16 2012-03-16 Method for querying texts based on mutual index structure Expired - Fee Related CN102637204B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210071778.2A CN102637204B (en) 2012-03-16 2012-03-16 Method for querying texts based on mutual index structure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210071778.2A CN102637204B (en) 2012-03-16 2012-03-16 Method for querying texts based on mutual index structure

Publications (2)

Publication Number Publication Date
CN102637204A true CN102637204A (en) 2012-08-15
CN102637204B CN102637204B (en) 2014-04-16

Family

ID=46621598

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210071778.2A Expired - Fee Related CN102637204B (en) 2012-03-16 2012-03-16 Method for querying texts based on mutual index structure

Country Status (1)

Country Link
CN (1) CN102637204B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105141583A (en) * 2015-07-28 2015-12-09 中国电子科技集团公司第三十六研究所 Character string matching method and system
CN105589894A (en) * 2014-11-13 2016-05-18 腾讯数码(深圳)有限公司 Document index establishing method and device as well as document retrieving method and device
CN106250362A (en) * 2015-06-05 2016-12-21 富士通株式会社 Text segmentation device and text segmenting method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040243569A1 (en) * 1996-08-09 2004-12-02 Overture Services, Inc. Technique for ranking records of a database
CN101673307A (en) * 2009-10-21 2010-03-17 中国农业大学 Space data index method and system
CN102063446A (en) * 2009-11-13 2011-05-18 中国移动通信集团四川有限公司 Method for creating inverted index and inverted indexing device

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105589894A (en) * 2014-11-13 2016-05-18 腾讯数码(深圳)有限公司 Document index establishing method and device as well as document retrieving method and device
CN105589894B (en) * 2014-11-13 2020-05-29 腾讯数码(深圳)有限公司 Document index establishing method and device and document retrieval method and device
CN106250362A (en) * 2015-06-05 2016-12-21 富士通株式会社 Text segmentation device and text segmenting method
CN105141583A (en) * 2015-07-28 2015-12-09 中国电子科技集团公司第三十六研究所 Character string matching method and system
CN105141583B (en) * 2015-07-28 2019-02-15 中国电子科技集团公司第三十六研究所 A kind of character string matching method and system

Also Published As

Publication number Publication date
CN102637204B (en) 2014-04-16

Similar Documents

Publication Publication Date Title
US9195738B2 (en) Tokenization platform
CN101706807B (en) Method for automatically acquiring new words from Chinese webpages
CN101794307A (en) Vehicle navigation POI (Point of Interest) search engine based on internetwork word segmentation idea
Wang et al. Vchunkjoin: An efficient algorithm for edit similarity joins
CN103123650B (en) A kind of XML data storehouse full-text index method mapped based on integer
US10192028B2 (en) Data analysis device and method therefor
CN104199965A (en) Semantic information retrieval method
CN103914483B (en) File memory method, device and file reading, device
CN102567409A (en) Method and device for providing retrieval associated word
CN103345496A (en) Multimedia information searching method and system
CN101894135A (en) Method for compressing and storing GPS data based on route clustering
CN102637204B (en) Method for querying texts based on mutual index structure
CN105279281A (en) Internet-of-things data access method
JPWO2014174599A1 (en) Computer, recording medium and data retrieval method
JP4237813B2 (en) Structured document management system
CN111782663A (en) Aggregation index structure and aggregation index method for improving aggregation query efficiency
CN108647243B (en) Industrial big data storage method based on time series
CN101405727B (en) Management of statistical views in a database system
US9436715B2 (en) Data management apparatus and data management method
CN105005627A (en) Shortest path key node query method based on Spark distributed system
CN115794861A (en) Offline data query multiplexing method based on feature abstract and application thereof
KR20100105080A (en) Query processing method and apparatus based on n-gram
CN113420219A (en) Method and device for correcting query information, electronic equipment and readable storage medium
CN105095276B (en) Method and device for mining maximum repetitive sequence
KR101679011B1 (en) Method and Apparatus for moving data in DBMS

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20140416
