CN103176978B - Method and device for determining the position information of a search term in a document - Google Patents

Method and device for determining the position information of a search term in a document

Info

Publication number
CN103176978B
Authority
CN
China
Prior art keywords
document
memory location
lexical item
positional information
tentatively
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201110430651.0A
Other languages
Chinese (zh)
Other versions
CN103176978A (en)
Inventor
童征宇
徐剑波
闫进兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Fangzheng Apapi Technology Co Ltd
New Founder Holdings Development Co ltd
Original Assignee
Peking University Founder Group Co Ltd
Beijing Founder Apabi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University Founder Group Co Ltd, Beijing Founder Apabi Technology Co Ltd filed Critical Peking University Founder Group Co Ltd
Priority to CN201110430651.0A priority Critical patent/CN103176978B/en
Publication of CN103176978A publication Critical patent/CN103176978A/en
Application granted granted Critical
Publication of CN103176978B publication Critical patent/CN103176978B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and device for determining the position information of a search term in a document. For each lexical item obtained by dividing the search term, the method determines the storage location of the position information of that lexical item in each preliminary hit document and, according to the determined storage location, reads the position information of the lexical item in the preliminary hit document, where a preliminary hit document is a document that contains every lexical item obtained by dividing the search term. This technical solution avoids reading the position information of a lexical item in documents that are not preliminary hits, which reduces the amount of information read, improves the efficiency of determining the position information of the search term in a document, and thereby improves retrieval efficiency.

Description

Method and device for determining the position information of a search term in a document
Technical field
The present invention relates to the technical field of information retrieval, and in particular to a method and device for determining the position information of a search term in a document.
Background art
Text retrieval systems are among the most widely used retrieval systems today. Such a system relies mainly on a pre-built inverted index file to determine the documents that match the search term submitted by a user terminal; a matching document is generally one that contains every search term submitted by the user terminal.
At present, a text retrieval system builds an inverted index file as follows: an indexing program scans each lexical item in a document and creates an index entry for each lexical item, the index entry marking the position information at which the corresponding lexical item occurs in the document; the inverted index file is then created from the index entries built for the lexical items in the document. Once the inverted index file has been built, the text retrieval system, when retrieving, first reads the inverted index file to determine the set of documents that contain the lexical item submitted by the user (the documents in this set may be held as a document list) and the position information at which the lexical item occurs in each document (this position information may be held as a list), and then returns the retrieval hits. In general, the search term submitted by a user terminal may be a phrase or a short sentence. Therefore, when a text retrieval system is used for retrieval, the phrase or short sentence corresponding to the search term in the retrieval request is generally divided into multiple lexical items, which are looked up in the index file. The documents that contain all of the resulting lexical items are determined to be preliminary hit documents. The position information at which the search term occurs in each preliminary hit document is then read, and the documents whose position information satisfies a set positional relationship are determined to be the final hit documents and returned to the user terminal.
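The indexing and preliminary-hit steps described above can be sketched as follows. This is a minimal in-memory illustration, not the patent's file format: the names `build_index` and `preliminary_hits` and the sample documents are assumptions made for the example.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lexical item to {doc_id: [positions at which it occurs]}."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for pos, token in enumerate(tokens):
            index[token].setdefault(doc_id, []).append(pos)
    return index

def preliminary_hits(index, items):
    """Return the documents that contain every lexical item of the search term."""
    doc_sets = [set(index.get(item, {})) for item in items]
    return sorted(set.intersection(*doc_sets)) if doc_sets else []

# Hypothetical, already tokenized documents.
docs = {
    1: ["digital", "information", "processing"],
    2: ["information", "retrieval"],
    3: ["fast", "digital", "signal", "processing", "of", "information"],
}
index = build_index(docs)
hits = preliminary_hits(index, ["digital", "information"])
```

Only the documents in `hits` need their position lists read for the positional-relationship calculation; the rest of this description is about how to reach those lists cheaply.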
Specifically, to determine the documents whose position information satisfies the set positional relationship, the positional relationship is calculated from the position information of the search term in each document once that position information has been determined. This calculation requires reading the position information at which each lexical item of the search term occurs in each preliminary hit document. The following description takes as an example reading the position information of the lexical items of the search term in the preliminary hit documents of Table 1:
Table 1:
Term         Numeral   Information   Process   Accelerate   's     Method
Document 1   100       50            60        0            1000   20
Document 2   40        20            400       20           1200   0
Document 3   0         90            100       80           3200   400
Document 4   200       100           300       120          2000   100
Document 5   210       130           0         140          2300   140
Document 6   310       0             320       150          2300   140
Document 7   50        410           210       150          3000   140
In Table 1, the search term "method for accelerating numeral information processing" is divided into the lexical items "numeral", "information", "processing", "acceleration", "'s" and "method" before retrieval, which yields 7 documents that each contain at least one of these lexical items. Of these, document 4 and document 7 contain all of the lexical items obtained by dividing the search term, so document 4 and document 7 are the preliminary hit documents. Once the preliminary hit documents have been determined, the position information at which each lexical item occurs in document 4 and document 7 must be read in order to calculate the positional relationship of the lexical items in those documents. Taking the reading of the position information of the lexical item "numeral" in document 4 as an example, Fig. 1 shows a schematic flow of this reading process, which mainly comprises the following steps:
Step 101: determine that the position number of the storage location of the position information of "numeral" in document 4 is 3.
In step 101, taking the information in Table 1 as an example, the position number can be determined to be 3. Specifically, Table 1 shows that the documents containing "numeral" are document 1, document 2, document 4, document 5, document 6 and document 7. When the position information of "numeral" in documents is saved, the position information for each document is generally saved in turn, document by document. Taking the document order listed in Table 1 as an example, the storage location of the position information of "numeral" in document 4 comes after the storage locations of the position information of "numeral" in document 1 and document 2.
Step 102: according to the obtained position number 3, first read the position information of "numeral" in document 1 and document 2.
Step 103: read the position information of "numeral" in document 4.
This completes the flow of reading the position information of "numeral" in document 4.
In the flow shown in Fig. 1, the position information of "numeral" in document 4 can be read only after the position information of "numeral" in document 1 and document 2 has been read. Likewise, before the position information of "numeral" in document 7 can be read, the position information of "numeral" in document 1, document 2 and documents 4 to 6 must be read first.
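The cost of the Fig. 1 flow can be illustrated with a small sketch. The document order follows Table 1 (document 3 does not contain "numeral"), while the position values and the function name `read_sequential` are made up for the example.

```python
# Hypothetical position lists for the lexical item "numeral", saved back to
# back in document order. Document 3 is absent because it lacks the item.
postings = [
    (1, [4, 17]),
    (2, [9]),
    (4, [2, 30, 41]),
    (5, [8]),
    (6, [12, 19]),
    (7, [5]),
]

def read_sequential(postings, target_doc):
    """Scan from the first record; return (positions, number of records read)."""
    records_read = 0
    for doc_id, positions in postings:
        records_read += 1
        if doc_id == target_doc:
            return positions, records_read
    return None, records_read

positions_doc4, reads_doc4 = read_sequential(postings, 4)
positions_doc7, reads_doc7 = read_sequential(postings, 7)
```

Under these assumptions, reaching document 4 takes three record reads (its position number) and reaching document 7 takes six, although only two of the records are wanted.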
From the information recorded in Table 1 and the flow of Fig. 1, it can be seen that full-text retrieval must read a very large amount of information; correspondingly, the information to be read also occupies a large amount of space on the storage medium. At present, to reduce the storage space that the index file occupies on the storage medium and to improve the storage efficiency of the medium, the index file is generally stored in compressed form. During compression, information that would be saved in a set amount of storage space is saved in less space; for example, a lexical item stored in 8 bytes may be compressed into 4 bytes, or into even fewer bytes. As a result, the storage location of a lexical item can no longer be determined from the byte length used to store each lexical item and the number of lexical items.
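Why compression defeats offset arithmetic can be shown with a short sketch. The source does not name a compression scheme, so the 7-bit variable-length integer encoding below is an assumption chosen only because it is a common way to shrink index entries.

```python
def encode_varint(value):
    """Encode a non-negative integer in 7-bit groups, lowest group first."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

values = [5, 300, 70000]  # hypothetical entries
encoded = [encode_varint(v) for v in values]
lengths = [len(e) for e in encoded]
# The offset of entry i is the sum of all earlier lengths; it is no longer
# i * fixed_width, so it cannot be computed from the entry count alone.
offsets = [sum(lengths[:i]) for i in range(len(lengths))]
```

Because each entry may occupy a different number of bytes, a reader that knows only "entry number 3" has to decode the entries before it to find where it starts.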
To address this problem, the prior art proposes reading position information with the help of stepping data items. This method adds stepping data items to the index data, and the value of a stepping data item can be set as required. Taking the reading of the position information of "numeral" in preliminary hit document 7 as an example: in Table 1, the documents containing "numeral" that precede document 7 are document 1, document 2 and documents 4 to 6. With the value of the stepping data item set to 3, only the position information of "numeral" in document 5 and document 6 needs to be read before the position information of "numeral" in document 7; it is not necessary to start at the storage location of the position information of "numeral" in document 1 and read in turn the position information of "numeral" in document 1, document 2 and documents 4 to 6.
From the above process of full-text retrieval on the lexical items obtained by dividing the search term, it can be seen that even when stepping data items are used to assist in reading the position information of a lexical item in a preliminary hit document, reading must still start from the nearest stepping data item and proceed in turn through the position information of the lexical item in every document that contains the lexical item and is stored before the preliminary hit document; only then can the position information of the lexical item in the preliminary hit document be located and read. Existing text retrieval systems therefore read a large amount of unneeded position information while retrieving a search term.
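The effect of a stepping data item can be sketched as follows, assuming that a stepping value of 3 means a skip point is recorded at every third record. That interpretation and the function name `records_before_target` are assumptions; the document order is taken from Table 1.

```python
# Documents containing "numeral", in the order their records are saved.
doc_ids = [1, 2, 4, 5, 6, 7]
STEP = 3  # value of the stepping data item in the example above

def records_before_target(doc_ids, target_doc, step):
    """Documents whose records are still read before the target is reached."""
    target_index = doc_ids.index(target_doc)
    start = (target_index // step) * step  # nearest skip point at or before target
    return doc_ids[start:target_index]

before_doc7 = records_before_target(doc_ids, 7, STEP)
before_doc4 = records_before_target(doc_ids, 4, STEP)
```

With these assumptions the reader still passes through documents 5 and 6 to reach document 7, and the skip point does not help at all for document 4, which matches the residual cost described above.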
In summary, when retrieving, an existing text retrieval system must read in turn the position information of a lexical item in documents that contain at least one of the lexical items obtained by dividing the search term before it can read the position information of the lexical item in a preliminary hit document. Reading this redundant or invalid information makes it inefficient to determine the position information of the search term in each preliminary hit document, which in turn reduces retrieval efficiency.
Summary of the invention
In view of this, embodiments of the present invention provide a method and device for determining the position information of a search term in a document; this technical solution can improve retrieval efficiency.
The embodiments of the present invention are realized through the following technical solutions:
According to one aspect of the embodiments of the present invention, a method for determining the position information of a search term in a document is provided, comprising:
for each lexical item obtained by dividing the search term, performing respectively:
determining the storage location of the position information of the lexical item in each preliminary hit document and, according to the determined storage location, reading the position information of the lexical item in the preliminary hit document, where the preliminary hit document contains every lexical item obtained by dividing the search term.
According to another aspect of the embodiments of the present invention, a device for determining the position information of a search term in a document is also provided, comprising:
a search term division unit, configured to divide the search term into multiple lexical items;
a position information reading unit, configured to perform, for each lexical item obtained by the search term division unit dividing the search term: determining the storage location of the position information of the lexical item in each preliminary hit document and, according to the determined storage location, reading the position information of the lexical item in the preliminary hit document, where the preliminary hit document contains every lexical item obtained by dividing the search term.
With at least one of the above technical solutions provided by the embodiments of the present invention, during retrieval the storage location of the position information of each lexical item obtained by dividing the search term is first determined for the preliminary hit documents, where a preliminary hit document contains every lexical item obtained by dividing the search term; the position information of the lexical item in the preliminary hit document is then read according to the determined storage location. Compared with the prior art, this solution determines the storage location of the position information of a lexical item in a preliminary hit document directly and reads the position information according to that storage location, whereas the prior art must read in turn the position information of the lexical item in documents that contain at least one of the lexical items obtained by dividing the search term. The solution therefore avoids reading the position information of lexical items in documents that are not preliminary hits, which reduces the amount of information read, improves the efficiency of determining the position information of the search term in a document, and thereby improves retrieval efficiency.
Other features and advantages of the present invention will be set forth in the description that follows and will in part become apparent from the description or be understood by practicing the invention. The objects and other advantages of the invention can be realized and obtained through the structures particularly pointed out in the written description, the claims and the accompanying drawings.
Brief description of the drawings
The accompanying drawings provide a further understanding of the present invention and form part of the description. Together with the embodiments of the present invention they serve to explain the invention and do not limit it. In the drawings:
Fig. 1 is a schematic flow of reading the position information of the lexical item "numeral" in document 4, as provided by the prior art;
Fig. 2 is a schematic flow of determining the position information of a lexical item in a preliminary hit document, provided by Embodiment 1 of the present invention;
Fig. 3 is a schematic flow of reading the position information of the lexical item in a preliminary hit document, provided by Embodiment 1 of the present invention;
Fig. 4 is a schematic flow of another way of reading the position information of the lexical item in a preliminary hit document, provided by Embodiment 1 of the present invention;
Fig. 5 is a schematic flow, provided by Embodiment 1 of the present invention, of determining the first starting storage locations that respectively correspond to the items of position information of the lexical item in a preliminary hit document as saved;
Fig. 6 is a schematic flow of determining the difference, provided by Embodiment 1 of the present invention;
Fig. 7 is a schematic flow of determining the position information of the lexical item "numeral" in document 4, provided by Embodiment 2 of the present invention;
Fig. 8 is a schematic structural diagram of a device for determining the position information of a search term in a document, provided by Embodiment 3 of the present invention.
Detailed description of the invention
To provide an implementation that improves the efficiency of determining the position information of a search term in each preliminary hit document, embodiments of the present invention provide a method and device for determining the position information of a search term in a document. Preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood that the preferred embodiments described here serve only to describe and explain the present invention and do not limit it. Where no conflict arises, the embodiments in this application and the features within them may be combined with one another.
In a text retrieval system, a search term may be a single lexical item, or a phrase or short sentence comprising multiple lexical items. When retrieving, the phrase or short sentence corresponding to the search term in the retrieval request is generally divided into multiple lexical items, which are looked up in the index file; the documents that contain every lexical item obtained by the division are determined to be preliminary hit documents; and the position information of each lexical item in the preliminary hit documents is then read in order to determine the final hit documents. The preferred method of determining the position information, in a document, of the lexical items obtained by dividing a search term is described below.
Embodiment 1
Embodiment 1 of the present invention provides a method for determining the position information of a search term in a document. The method can be applied in a text retrieval system; implementing it there can solve the prior-art problem that retrieval efficiency suffers because determining the position information of the search term in each preliminary hit document is inefficient.
Fig. 2 shows a schematic flow, provided by Embodiment 1 of the present invention, of determining the position information of a search term in a document. Embodiment 1 describes how the position information, in a preliminary hit document, of the lexical items obtained by dividing the search term is determined, taking one of those lexical items as a concrete example. As shown in Fig. 2, the process of determining the position information of the lexical item in a preliminary hit document mainly comprises the following steps:
Step 201: determine the storage location of the position information of the lexical item in the preliminary hit document.
In step 201, in practical applications, the storage space that holds the items of position information of a lexical item in documents may be called the position list file, and the storage location of each item of position information in the position list file may be called PrxPosition.
Step 202: according to the determined storage location, read the position information of the lexical item in the preliminary hit document.
In step 202, in a concrete implementation based on the example of step 201, the position information of the lexical item in the preliminary hit document can be read according to PrxPosition.
This completes the flow of determining the position information of the lexical item in a document.
In the flow corresponding to Fig. 2, during retrieval the position information of a lexical item in a preliminary hit document can be read according to the determined storage location of that position information. Compared with the prior art, this technical solution determines the storage location of the position information of the lexical item in the preliminary hit document directly and reads the position information according to that storage location, whereas the prior art must read in turn the position information of the lexical item in documents that contain at least one of the lexical items obtained by dividing the search term. The solution therefore avoids reading the position information of lexical items in documents that are not preliminary hits, which reduces the amount of information read, improves the efficiency of determining the position information of the search term in a document, and thereby improves retrieval efficiency.
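A minimal sketch of steps 201 and 202 follows. It assumes positions are stored as fixed 4-byte little-endian integers and that a per-document table (called `prx_position` here, after the PrxPosition name above) records where each document's positions begin and how many there are. The table layout and both function names are assumptions, not the patent's file format.

```python
import struct

# Hypothetical position lists for one lexical item.
postings = {1: [4, 17], 2: [9], 4: [2, 30, 41], 7: [5]}

def write_position_file(postings):
    """Pack positions as 4-byte ints and record (offset, count) per document."""
    data = bytearray()
    prx_position = {}
    for doc_id in sorted(postings):
        prx_position[doc_id] = (len(data), len(postings[doc_id]))
        for pos in postings[doc_id]:
            data += struct.pack("<I", pos)
    return bytes(data), prx_position

def read_positions(data, prx_position, doc_id):
    """Step 201: look up the storage location. Step 202: read from it directly."""
    offset, count = prx_position[doc_id]
    return list(struct.unpack_from(f"<{count}I", data, offset))

data, prx_position = write_position_file(postings)
hit_docs = [4, 7]  # preliminary hit documents
result = {doc: read_positions(data, prx_position, doc) for doc in hit_docs}
```

In this sketch the records for documents 1 and 2 are never touched when documents 4 and 7 are read, which is the saving the paragraph above describes.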
In the flow corresponding to Fig. 2, when the position information of the lexical item in the preliminary hit documents is read according to the determined storage locations, it can be read in a preset manner. For example, following a predetermined order of the preliminary hit documents, the storage location of the position information of the lexical item in the first preliminary hit document is determined and the position information in that document is read according to it; the next preliminary hit document is then selected and the position information of the lexical item in it is read, and so on until the position information of the lexical item in all preliminary hit documents has been read. Alternatively, after the storage locations of the position information of the lexical item in all preliminary hit documents have been determined, the position information in each preliminary hit document is read in turn directly according to the determined storage locations.
For step 201 of the flow corresponding to Fig. 2, which determines the storage location of the position information of the lexical item in a preliminary hit document, this application provides a preferred implementation, together with a corresponding preferred implementation of reading the position information of the lexical item in the preliminary hit document. Specifically, as shown in Fig. 3, the process of reading the position information of the lexical item in a preliminary hit document mainly comprises the following steps:
Step 301: determine the first starting storage locations that respectively correspond to the items of position information of the lexical item in the preliminary hit document as saved.
In step 301, a first starting storage location may be the start of the storage space that holds the corresponding item of position information. For example, if the position information of the lexical item is stored in 4 bytes, the first starting storage location may be the location of the first 2 of those 4 bytes; if it is stored in 2 bytes, the first starting storage location is the location of those 2 bytes. This can be set flexibly according to the circumstances of the application; the above is only an example.
Step 302: according to the determined first starting storage locations that respectively correspond to the items of position information as saved, read each item of position information of the lexical item in the preliminary hit document.
In step 302, if the first starting storage location is the location of the first 2 of the 4 bytes, then after the first 2 bytes have been read the remaining 2 bytes are read, and the information read is combined and taken as the position information of the lexical item in the preliminary hit document. If the first starting storage location corresponds to the storage location of the 2 bytes that hold the position information of the lexical item, those 2 bytes are read directly and taken as the position information of the lexical item in the preliminary hit document.
This completes the flow of reading the position information of the lexical item in a preliminary hit document.
Step 301 of the flow corresponding to Fig. 3 is a preferred implementation of determining the storage location of the position information of the lexical item in a preliminary hit document: it determines the first starting storage location that corresponds to each item of position information of the lexical item in the preliminary hit document as saved. In practice, the first starting storage location can also be determined per preliminary hit document. That is, for each preliminary hit document, only the starting storage location of the first item of position information of the lexical item saved for that document is determined; step 302 then reads the position information of the lexical item in each preliminary hit document in a corresponding manner. Specifically, as shown in Fig. 4, reading the position information of the lexical item in a preliminary hit document mainly comprises the following steps:
Step 401: for each preliminary hit document, determine the storage location of the first item of position information of the lexical item saved for that document, and take it as the first starting storage location.
In step 401, the first starting storage location can also be understood as the first starting storage location of Fig. 3; it is simply one case of it. As a practical example, suppose the lexical item "retrieval" occurs 4 times in preliminary hit document A. When the 4 items of position information of these occurrences in document A are saved in turn, the storage location of the first of the 4 items is determined to be the first starting storage location, and the storage locations of the other 3 items follow it in order (the flow corresponding to Fig. 3 would additionally determine a first starting storage location for each of these 3 items). Further, this determination can be made for each preliminary hit document, giving for each one the first starting storage location of the first item of position information of the lexical item saved for that document.
Step 402: starting from the determined first starting storage location, read in turn the position information of the lexical item in the preliminary hit document.
In step 402, the preliminary hit document is the one that corresponds to the determined first starting storage location. Following the example of step 401, reading can start at the first starting storage location of the 4 items of position information of "retrieval" in preliminary hit document A and proceed in turn through the 1st, 2nd, 3rd and 4th items of position information. Further, once step 401 has determined the first starting storage location for each preliminary hit document, step 402 can be performed for each of them to read the position information of the lexical item in every preliminary hit document.
This completes the flow of reading the position information of the lexical item in a preliminary hit document.
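The difference between the Fig. 3 and Fig. 4 flows can be sketched with the "retrieval" example. The sketch assumes fixed 4-byte little-endian entries held in an in-memory file; the position values, offsets and function names are illustrative assumptions only.

```python
import io
import struct

ENTRY_SIZE = 4  # assumed fixed width of one saved position, in bytes

# Hypothetical position list file: three entries for an earlier document,
# then the four positions of "retrieval" in preliminary hit document A.
earlier_entries = [1, 2, 3]
positions_in_doc_a = [3, 18, 42, 77]
prx_file = io.BytesIO(struct.pack("<7I", *earlier_entries, *positions_in_doc_a))

def read_entry(f, start):
    """Read one fixed-width position that starts at byte offset `start`."""
    f.seek(start)
    return struct.unpack("<I", f.read(ENTRY_SIZE))[0]

def read_per_position(f, starts):
    """Fig. 3 style: one first starting storage location per position."""
    return [read_entry(f, start) for start in starts]

def read_from_first(f, first_start, count):
    """Fig. 4 style: seek once to the first position, then read in turn."""
    f.seek(first_start)
    return [struct.unpack("<I", f.read(ENTRY_SIZE))[0] for _ in range(count)]

first_start_doc_a = len(earlier_entries) * ENTRY_SIZE  # byte offset 12
per_position_starts = [first_start_doc_a + i * ENTRY_SIZE for i in range(4)]

fig3_result = read_per_position(prx_file, per_position_starts)
fig4_result = read_from_first(prx_file, first_start_doc_a, 4)
```

Both variants return the same four positions; the Fig. 4 style needs only one stored starting location per preliminary hit document, at the price of reading that document's entries in order.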
According to above-mentioned Fig. 3 and flow process corresponding to Fig. 4, the technical program provide directly read lexical item at the beginning ofStep is hit the mode of the positional information in document, according to the first initial bank bit of the positional information of determiningPut and can directly read this lexical item in the positional information of tentatively hitting in document, or be only to determine this wordIn the time that each positional information of tentatively hitting in document is saved, corresponding each this word that tentatively hits document storingItem is the first initial memory location in each first memory location of tentatively hitting the each positional information in document,Then start to read successively this lexical item from this first initial memory location at the position letter tentatively hitting documentBreath. Corresponding above-mentioned the first initial memory location, the technical program also provides determines the first initial memory locationPreferred embodiment, particularly, Fig. 5 shows and a kind ofly determines that this lexical item is tentatively hitting each in documentWhen positional information is saved, the first initial memory location corresponding to difference, as shown in Figure 5, determines that this lexical item existsTentatively hit each positional information in document process of the first corresponding initial memory location respectively while being saved,Mainly comprise the following steps:
Step 501: determine the saved second initial storage location corresponding to the lexical item.
In step 501, the second initial storage location is the initial storage location at which the position information of the lexical item in all documents containing the lexical item is saved.
Step 502: determine the difference between the second initial storage location and the initial storage location at which the first piece of position information of the lexical item in each preliminary hit document is saved.
In step 502, the difference is the difference between the second initial storage location and the first initial storage location described for Fig. 4 above. In practice, the difference between the second initial storage location and the initial storage location of each piece of position information of the lexical item in the preliminary hit document may be saved instead. Here, the second initial storage location may be called PrxPointer and the difference PrxValue.
Step 503: according to the determined second initial storage location and the difference, determine the first initial storage location corresponding to each piece of position information of the lexical item in the preliminary hit document.
In step 503, the first initial storage location described for Fig. 4 can first be determined from the second initial storage location and the difference: the sum of PrxValue and PrxPointer is taken as the first initial storage location. After that, the first initial storage locations corresponding to the other pieces of position information of the lexical item in the preliminary hit document can be determined in turn according to the byte length used to save each piece of position information, or according to the configured stepping data item. In addition, corresponding to the example in step 502, if the saved differences are those between the second initial storage location and the initial storage location of each piece of position information of the lexical item in the preliminary hit document, the first initial storage location of each piece of position information can be determined directly. In a specific application, the preferred embodiments above may be followed, or other arrangements may be made as the case requires; they are not enumerated one by one here.
At this point, the flow of determining the first initial storage location corresponding to each piece of saved position information of the lexical item in the preliminary hit document ends.
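The arithmetic in step 503 can be illustrated with a minimal sketch. It assumes that every piece of position information is saved with the same fixed byte length; the 4-byte width, the function name, and the sample offsets are hypothetical and do not come from the source.

```python
# Hypothetical fixed width of one saved piece of position information, in bytes.
# The source only says the byte length is known; 4 is an assumption.
POSITION_BYTES = 4


def first_initial_locations(prx_pointer, prx_value, count):
    """Return the storage location of each of `count` pieces of position information.

    prx_pointer: second initial storage location (PrxPointer)
    prx_value:   saved difference for the preliminary hit document (PrxValue)
    """
    start = prx_pointer + prx_value  # step 503: PrxPointer + PrxValue
    # Later pieces follow at fixed-width steps from the first one.
    return [start + i * POSITION_BYTES for i in range(count)]


print(first_initial_locations(1000, 48, 4))  # [1048, 1052, 1056, 1060]
```

With variable-length entries or stepping data items, the later locations would have to be derived differently; the sketch covers only the fixed-length case.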
Step 502 of the flow corresponding to Fig. 5 determines the difference between the second initial storage location and the initial storage location at which the first piece of position information of the lexical item in each preliminary hit document is saved. The technical solution provides a preferred embodiment for determining this difference. Specifically, as shown in Fig. 6, the process of determining the difference mainly comprises the following steps:
Step 601: determine the saved third initial storage location corresponding to the lexical item.
In step 601, the third initial storage location is the initial storage location of the storage locations used to save the differences corresponding to the documents containing the lexical item. In practice, the storage space in which these differences are saved may be called the position-list index file. From this position-list index file, a set algorithm can be used to determine where a given piece of position information is stored in the position-list file, which holds the position information. The first storage location at which a difference is saved in the position-list index file, which is built for the position information of one lexical item in each document, may be called the third initial storage location DpPointer.
Step 602: according to the order of the preliminary hit document among the documents containing the lexical item, and the storage space used to save the difference corresponding to each document, determine the total storage space used to save the differences corresponding to the documents that precede the preliminary hit document.
In step 602, the order of the preliminary hit document among the documents containing the lexical item is determined before the position-list index file is built. The order corresponding to each document may be called its document identification; in practice, documents may also be identified in other ways to distinguish them, which are not enumerated here. In the position-list index file, the differences can be saved against the document identifications of the documents containing the lexical item, with the byte length for saving a difference set to a fixed value. The total storage space of the differences corresponding to the documents preceding the preliminary hit document can then be determined from the document identification of the preliminary hit document and the fixed byte length. For example, if the byte length for saving a difference is 4 bytes and the document identification of the preliminary hit document is DOC4, the documents preceding it are those corresponding to DOC1, DOC2, and DOC3, and the storage space holding the differences of these three documents is the storage space corresponding to 12 bytes.
Step 603: determine the initial storage location of the difference as the sum of the determined third initial storage location and the total storage space.
In step 603, following the examples in steps 601 and 602, the initial storage location of the difference can be determined as the storage space of the next saved difference after the storage space pointed to by the sum of DpPointer and 12 bytes. In practice, after the sum of DpPointer and the storage space corresponding to 12 bytes is determined, the pointer to the initial storage location of the difference for DOC3 can automatically be advanced to the initial storage location of the difference for DOC4. Further, the technical solution can also use the above algorithm to point the pointer directly at the initial storage location of the difference for DOC4, for example by directly determining DpPointer plus the storage space corresponding to 16 bytes as the initial storage location of the difference saved for DOC4 and positioning the pointer directly there. The above are only preferred embodiments of the technical solution and can be configured flexibly in practice.
Step 604: read the difference from the determined initial storage location.
At this point, the flow of determining the difference between the second initial storage location and the initial storage location at which the first piece of position information of the lexical item in each preliminary hit document is saved ends.
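The lookup in steps 601 to 604 can be sketched as follows, using the 4-byte example of step 602 in which three documents precede the preliminary hit document. The little-endian unsigned encoding, the function name, and the sample index contents are assumptions made for illustration only; the source does not specify an on-disk encoding.

```python
import struct

DIFF_BYTES = 4  # fixed byte length of one saved difference (example in step 602)


def read_prx_value(index_file, dp_pointer, docs_before):
    """Read the difference (PrxValue) saved for one preliminary hit document.

    index_file:  bytes of the position-list index file
    dp_pointer:  third initial storage location (DpPointer)
    docs_before: number of documents preceding the preliminary hit document
    """
    # Steps 602-603: skip the total storage space of the preceding differences.
    offset = dp_pointer + docs_before * DIFF_BYTES
    # Step 604: read the difference; "<I" (little-endian uint32) is an assumption.
    (prx_value,) = struct.unpack_from("<I", index_file, offset)
    return prx_value


# Hypothetical index data: 8 bytes of unrelated content, then four differences.
index_file = bytes(8) + struct.pack("<4I", 0, 40, 96, 160)
print(read_prx_value(index_file, dp_pointer=8, docs_before=3))  # 160
```

Because every difference has the same byte length, the lookup is a single multiplication and one read, with no scan over the entries of the preceding documents.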
After the method provided by the technical solution has determined the position information, in the preliminary hit documents, of each lexical item contained in the search term, the full-text retrieval system can apply a set algorithm to the position information read for each lexical item in order to determine the final hit documents. For example, a relevance score can be calculated from the number of times each lexical item occurs in a preliminary hit document and the positions at which it occurs, and the score used to decide whether the preliminary hit document containing the lexical items is a final hit document. Alternatively, the position information of the lexical items in a preliminary hit document can be used to check whether the positional relationship among the lexical items satisfies a set relationship; if it does, the document is determined to be a final hit document. Specifically, taking "method for accelerating digital information processing" as the search term, if the position information in a preliminary hit document shows that the lexical items "digital", "information", "processing", "acceleration", "" and "method" occur adjacent to one another in that order, the preliminary hit document can be determined to be a final hit document. The above are only preferred embodiments provided by the technical solution. In practice, the position information of each lexical item determined by the method can be applied flexibly as the case requires, which is not described further here.
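The adjacency check described above can be sketched as follows. The sketch assumes that the position information of each lexical item is available as a list of integer word positions; the function name and the sample data are hypothetical.

```python
def is_final_hit(position_lists):
    """Return True if the lexical items occur adjacent to one another in order.

    position_lists[k] holds the positions of the k-th lexical item of the
    search term within one preliminary hit document.
    """
    if not position_lists:
        return False
    later = [set(positions) for positions in position_lists[1:]]
    # Try every occurrence of the first lexical item as the start of the phrase;
    # the k-th following item must sit exactly k + 1 positions after the start.
    return any(
        all(start + k + 1 in candidates for k, candidates in enumerate(later))
        for start in position_lists[0]
    )


# Three lexical items; positions 3, 4, 5 form one adjacent run.
print(is_final_hit([[3, 17], [4, 30], [5, 9]]))  # True
print(is_final_hit([[3], [5]]))                  # False
```

A relevance-score variant would combine occurrence counts with these positions instead of returning a yes/no answer; it is not shown here because the source does not give a scoring formula.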
Embodiment 2
Embodiment 2 provides an application scenario of the method for determining the position information of a search term in a document.
In the application scenario provided by Embodiment 2, the full-text retrieval system searching for "method for accelerating digital information processing" is taken as an example. Specifically, the full-text retrieval system divides "method for accelerating digital information processing" into the lexical items "digital", "information", "processing", "acceleration", "" and "method" and then performs the search, obtaining the 7 documents in Table 1 above. Each of these 7 documents contains at least one of the lexical items. Document 4 and document 7 contain all the lexical items obtained by dividing the search term, so document 4 and document 7 are the preliminary hit documents.
The technical solution is described by taking the process of reading the position information of the lexical item "digital" in document 4 as an example. Specifically, as shown in Fig. 7, the process of determining the position information of the lexical item "digital" in document 4 mainly comprises the following steps:
Step 701: read the document identification DOC3 of preliminary hit document 4, and the initial storage location DpPointer of the storage space in which the differences are saved in the position-list index file corresponding to "digital".
In step 701, the document identification of preliminary hit document 4 is set to DOC3. The position-list index file corresponds to the storage space in which the differences are saved, and the initial storage location DpPointer is the initial storage location of the storage space in which the differences are saved against the document identifications of the documents containing "digital". For example, when the correspondence between document identifications and differences is saved as an array, the differences are saved in turn with the document identification as the subscript, and the storage location of the difference saved for document identification DOC1 is the initial storage location DpPointer.
Step 702: according to the set byte length of 4 bytes for saving a difference, determine that the initial storage location of the difference saved for document identification DOC4 is the storage location at DpPointer plus 16 bytes.
Step 703: read the difference saved for DOC4 from the determined initial storage location.
In step 703, the difference that is read may be named PrxValue.
Step 704: determine the storage location given by the sum of PrxPointer, the start position of the position-list file, and PrxValue as the initial storage location of the storage space holding the 200 pieces of position information of "digital" in document 4.
In step 704, if the full-text retrieval system is also configured with stepping data items to assist in reading the position information of each lexical item in each document, then correspondingly: if the determined initial storage location lies within the first stepping interval of the position-list file corresponding to "digital", the storage location given by the sum of PrxPointer, the initial storage location of the position-list file, and PrxValue can be determined as the initial storage location of the storage space holding each piece of position information of "digital" in document 4. Otherwise, the storage location given by the sum of SkipPrxPointer, the start position corresponding to the relevant stepping interval (a storage location determined from PrxPointer, the stepping data item, and the storage space corresponding to the stepping interval), and PrxValue can be determined as the initial storage location of each piece of position information of "digital" in document 4.
Step 705: read each piece of position information of "digital" in document 4 in turn from the determined initial storage location.
At this point, the flow of determining the position information of the lexical item "digital" in document 4 ends.
Before step 701 of the flow corresponding to Fig. 7, that is, before reading the document identification DOC3 of preliminary hit document 4 and the initial storage location DpPointer of the storage space in which the differences are saved in the position-list index file corresponding to "digital", the preliminary hit documents can in practice be determined from the documents listed in Table 1 above. Once a document is judged to be a preliminary hit document, the flow corresponding to Fig. 7 is executed on it directly. Specifically, following the document identifications of the documents given in Table 1, the lexical items contained in each document are read in turn. For example, the lexical items contained in the first document in Table 1 are read; this document does not contain all the lexical items, so the lexical items contained in the second document are read, and so on until the fourth document is read. That document contains all the lexical items, so the flow corresponding to Fig. 7 can be executed on it.
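The scan for preliminary hit documents described above can be sketched as follows. The document identifications and their lexical items are hypothetical stand-ins rather than the contents of Table 1, and plain English words are used as lexical items for readability.

```python
def preliminary_hits(doc_items, query_items):
    """Return, in document order, the documents containing every lexical item.

    doc_items:   mapping from document identification to the set of lexical
                 items that the document contains (hypothetical layout)
    query_items: set of lexical items obtained by dividing the search term
    """
    # A document is a preliminary hit only if it contains all lexical items.
    return [doc for doc, items in doc_items.items() if query_items <= items]


docs = {
    "DOC1": {"digital", "method"},
    "DOC2": {"information"},
    "DOC3": {"digital", "information", "method"},
}
print(preliminary_hits(docs, {"digital", "information", "method"}))  # ['DOC3']
```

In the flow described above, the steps of Fig. 7 would be executed on each document as soon as it is found to be a preliminary hit, rather than after the full list has been collected.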
Embodiment 3
Embodiment 3 provides a device for determining the position information of a search term in a document. The device can be applied in a full-text retrieval system. By implementing the method in the full-text retrieval system, it can solve the prior-art problem that retrieval efficiency suffers because determining the position information of a search term in each preliminary hit document is inefficient.
Specifically, Fig. 8 shows a schematic structural diagram of the device for determining the position information of a search term in a document provided by Embodiment 3 of the present invention. As shown in Fig. 8, the device comprises:
a search term division unit 801 and a position information reading unit 802, wherein:
the search term division unit 801 is configured to divide the search term into multiple lexical items;
the position information reading unit 802 is configured to perform the following for each lexical item obtained by the search term division unit 801 dividing the search term: determine the storage location of the position information of the lexical item in each preliminary hit document, and read the position information of the lexical item in the preliminary hit document according to the determined storage location, wherein a preliminary hit document contains every lexical item obtained by dividing the search term.
In a preferred embodiment provided by Embodiment 3 of the present invention, the position information reading unit 802 of the device corresponding to Fig. 8 is specifically configured to:
according to a predetermined order of the preliminary hit documents, determine the storage location of the position information of the lexical item in the first preliminary hit document, and read the position information of the lexical item in the first preliminary hit document according to the determined storage location; then select the next preliminary hit document and read the position information of the lexical item in it, until the position information of the lexical item in all preliminary hit documents has been read; or, determine the storage locations of the position information of the lexical item in all preliminary hit documents, and read the position information of the lexical item in each preliminary hit document in turn according to the determined storage locations.
In a preferred embodiment provided by Embodiment 3 of the present invention, the position information reading unit 802 of the device corresponding to Fig. 8 is specifically configured to:
determine the first initial storage location corresponding to each piece of saved position information of the lexical item in the preliminary hit document; and read each piece of position information of the lexical item in the preliminary hit document according to the determined first initial storage locations.
In a preferred embodiment provided by Embodiment 3 of the present invention, the position information reading unit 802 of the device corresponding to Fig. 8 is specifically configured to:
determine the saved second initial storage location corresponding to the lexical item, the second initial storage location being the initial storage location at which the position information of the lexical item in all documents containing the lexical item is saved; determine the difference between the second initial storage location and the initial storage location at which the first piece of position information of the lexical item in each preliminary hit document is saved; and, according to the determined second initial storage location and the difference, determine the first initial storage location corresponding to each piece of saved position information of the lexical item in the preliminary hit document.
In a preferred embodiment provided by Embodiment 3 of the present invention, the position information reading unit 802 of the device corresponding to Fig. 8 is specifically configured to:
determine the saved third initial storage location corresponding to the lexical item, the third initial storage location being the initial storage location of the storage locations used to save the differences corresponding to the documents containing the lexical item; according to the order of the preliminary hit document among the documents containing the lexical item, and the storage space used to save the difference corresponding to each document, determine the total storage space used to save the differences corresponding to the documents that precede the preliminary hit document; determine the initial storage location of the difference as the sum of the determined third initial storage location and the total storage space; and read the difference from the determined initial storage location.
It should be understood that the units of the above device are only a logical division according to the functions the device implements; in practice, the units may be combined or split. The functions implemented by the device of this embodiment correspond one-to-one to the flow of the method for determining the position information of a search term in a document provided by the embodiments above. The more detailed processing flow implemented by the device has been described in detail in the method embodiments and is not repeated here.
In addition, the device for determining the position information of a search term in a document in Embodiment 3 also has functional modules capable of implementing the schemes of Embodiment 1 and Embodiment 2, which are not described again here.
Although preferred embodiments of the present application have been described, those skilled in the art can make further changes and modifications to these embodiments once they grasp the basic inventive concept. The appended claims are therefore intended to be interpreted as covering the preferred embodiments and all changes and modifications that fall within the scope of the present application.
Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope. Thus, if these changes and modifications fall within the scope of the claims of the present invention and their technical equivalents, the present invention is intended to cover them as well.

Claims (4)

1. A method for determining the position information of a search term in a document, characterized by comprising:
for each lexical item obtained by dividing the search term, performing respectively:
determining the saved second initial storage location corresponding to the lexical item, the second initial storage location being the initial storage location at which the position information of the lexical item in all documents containing the lexical item is saved; and
determining the difference between the second initial storage location and the initial storage location at which the first piece of position information of the lexical item in each preliminary hit document is saved;
according to the determined second initial storage location and the difference, determining the first initial storage location corresponding to each piece of saved position information of the lexical item in the preliminary hit document;
according to the determined first initial storage location corresponding to each piece of saved position information, reading each piece of position information of the lexical item in the preliminary hit document; wherein the preliminary hit document contains every lexical item obtained by dividing the search term.
2. The method of claim 1, characterized in that determining the difference between the second initial storage location and the initial storage location at which the first piece of position information of the lexical item in each preliminary hit document is saved comprises:
determining the saved third initial storage location corresponding to the lexical item, the third initial storage location being the initial storage location of the storage locations used to save the differences corresponding to the documents containing the lexical item; and
according to the order of the preliminary hit document among the documents containing the lexical item, and the storage space used to save the difference corresponding to each document, determining the total storage space used to save the differences corresponding to the documents that precede the preliminary hit document;
determining the sum of the determined third initial storage location and the total storage space as the initial storage location of the difference, and reading the difference from the determined initial storage location.
3. A device for determining the position information of a search term in a document, characterized by comprising:
a search term division unit, configured to divide the search term into multiple lexical items;
a position information reading unit, configured to perform the following for each lexical item obtained by the search term division unit dividing the search term: determine the saved second initial storage location corresponding to the lexical item, the second initial storage location being the initial storage location at which the position information of the lexical item in all documents containing the lexical item is saved; determine the difference between the second initial storage location and the initial storage location at which the first piece of position information of the lexical item in each preliminary hit document is saved; according to the determined second initial storage location and the difference, determine the first initial storage location corresponding to each piece of saved position information of the lexical item in the preliminary hit document; and, according to the determined first initial storage location corresponding to each piece of saved position information, read each piece of position information of the lexical item in the preliminary hit document; wherein the preliminary hit document contains every lexical item obtained by dividing the search term.
4. The device of claim 3, characterized in that the position information reading unit is specifically configured to:
determine the saved third initial storage location corresponding to the lexical item, the third initial storage location being the initial storage location of the storage locations used to save the differences corresponding to the documents containing the lexical item; and
according to the order of the preliminary hit document among the documents containing the lexical item, and the storage space used to save the difference corresponding to each document, determine the total storage space used to save the differences corresponding to the documents that precede the preliminary hit document;
determine the sum of the determined third initial storage location and the total storage space as the initial storage location of the difference, and read the difference from the determined initial storage location.
CN201110430651.0A 2011-12-20 2011-12-20 Method and the device of the positional information of a kind of deterministic retrieval word in document Expired - Fee Related CN103176978B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110430651.0A CN103176978B (en) 2011-12-20 2011-12-20 Method and the device of the positional information of a kind of deterministic retrieval word in document

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110430651.0A CN103176978B (en) 2011-12-20 2011-12-20 Method and the device of the positional information of a kind of deterministic retrieval word in document

Publications (2)

Publication Number Publication Date
CN103176978A CN103176978A (en) 2013-06-26
CN103176978B true CN103176978B (en) 2016-05-04

Family

ID=48636860

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110430651.0A Expired - Fee Related CN103176978B (en) 2011-12-20 2011-12-20 Method and the device of the positional information of a kind of deterministic retrieval word in document

Country Status (1)

Country Link
CN (1) CN103176978B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6510425B1 (en) * 1998-02-25 2003-01-21 Hitachi, Ltd. Document search method for registering documents, generating a structure index with elements having position of occurrence in documents represented by meta-nodes
CN101131704A (en) * 2006-08-23 2008-02-27 国际商业机器公司 Device and method for positional representation of content
CN102020447A (en) * 2009-09-18 2011-04-20 上海诗溢建材科技有限公司 Nano crystal paint material

Also Published As

Publication number Publication date
CN103176978A (en) 2013-06-26

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220624

Address after: 3007, Hengqin international financial center building, No. 58, Huajin street, Hengqin new area, Zhuhai, Guangdong 519031

Patentee after: New founder holdings development Co.,Ltd.

Patentee after: Beijing Fangzheng apapi Technology Co., Ltd.

Address before: 100871, Beijing, Haidian District Cheng Fu Road 298, founder building, 9 floor

Patentee before: PEKING UNIVERSITY FOUNDER GROUP Co.,Ltd.

Patentee before: Beijing Fangzheng apapi Technology Co., Ltd.

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20160504