CN104636384B - A kind of method and device handling document - Google Patents

A kind of method and device handling document

Info

Publication number
CN104636384B
Authority
CN
China
Prior art keywords
document
word
internal
code
leaf node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310567401.0A
Other languages
Chinese (zh)
Other versions
CN104636384A (en)
Inventor
施腾飞
王中飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201310567401.0A priority Critical patent/CN104636384B/en
Publication of CN104636384A publication Critical patent/CN104636384A/en
Application granted granted Critical
Publication of CN104636384B publication Critical patent/CN104636384B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/93 Document management systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention discloses a method and device for processing a document, relates to the field of information search technology, and can improve the real-time performance of new-document storage. In an embodiment of the present invention, extraction processing is performed on the words in a document to obtain in-line arrangement information, which includes a document number and the number of each word in the document; an internal document number is allocated to the document; the internal document number is added under the number of each word in the in-line arrangement information, and the correspondence between the number of each word and the internal document number is saved in a database. The present invention is suitable for use when a new document is put into storage and saved.

Description

A kind of method and device handling document
Technical field
The present invention relates to the field of information search technology, and in particular to a method and device for processing a document.
Background art
Users generally place high demands on the timeliness of web pages when retrieving them. In the prior art, new documents are generally processed in one of the following ways: after new documents have accumulated to a certain extent, they are merged with the old documents and the index is rebuilt over all documents; alternatively, when new documents are put into storage, the new index is appended to the end of the old index once it has accumulated to a certain document size.
However, when new documents are processed using the prior art, they must first be accumulated: an index is built only after new documents have accumulated to a certain extent, or the new index is appended to the old index only after it has accumulated to a certain document size. As a result, new documents cannot be retrieved by users in time, and real-time performance is poor.
Summary of the invention
Embodiments of the present invention provide a method and device for processing a document, which can improve the real-time performance of new-document storage.
In a first aspect, an embodiment of the present invention provides a method for processing a document, comprising:
performing extraction processing on the words in a document to obtain in-line arrangement information, the in-line arrangement information including a document number and the number of each word in the document;
allocating an internal document number to the document;
adding the internal document number under the number of each word in the in-line arrangement information, and saving the correspondence between the number of each word and the internal document number in a database.
In a second aspect, an embodiment of the present invention provides a device for processing a document, comprising:
an extracting unit, configured to perform extraction processing on the words in a document to obtain in-line arrangement information, the in-line arrangement information including a document number and the number of each word in the document;
an allocation unit, configured to allocate an internal document number to the document;
an adding unit, configured to add the internal document number under the number of each word in the in-line arrangement information;
a storage unit, configured to save the correspondence between the number of each word and the internal document number in a database.
The embodiment of the present invention provides a kind of method and device for handling document, by carrying out at extraction to the word in document Reason obtains in-line arrangement information, and the in-line arrangement information includes the number of each word in document code and the document;It is described Document distributes internal document number;The corresponding internal document of number addition of each word in the in-line arrangement information is compiled Number, and the corresponding relationship that the number of each word and the internal document are numbered is saved in the database.With it is existing It when the new document of skill cardia, needs to accumulate new document, just establishes and index after the accumulation to a certain extent of new document Or will be appended to behind old index after new index accumulation to certain document size, cause new document that cannot be examined in time by user Rope arrives, and real-time is poor to be compared, and the embodiment of the present invention can execute storage to single document and save, and enters so as to improve document The real-time in library.
Brief description of the drawings
In order to explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a method for processing a document according to an embodiment of the present invention;
Fig. 2 is a flowchart of another method for processing a document according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of a B+ tree according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of another B+ tree according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of yet another B+ tree according to an embodiment of the present invention;
Fig. 6 is a flowchart of yet another method for processing a document according to an embodiment of the present invention;
Fig. 7 is a block diagram of a device for processing a document according to an embodiment of the present invention;
Fig. 8 is a block diagram of another device for processing a document according to an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Embodiment 1
An embodiment of the present invention provides a method for processing a document; the method may be executed by a server. As shown in Fig. 1, the method includes:
Step 101, extraction processing is performed on the words in a document to obtain in-line arrangement information.
The in-line arrangement information, that is, the forward-index record of the document, includes the document number and the number of each word in the document. For example, the in-line arrangement information may be docid:wordid1, wordid2, wordid3 ... wordidN, where docid is the document number of the document, wordid1 is the number of the first word in the document, and N is the quantity of words in the document.
Optionally, this step includes: downloading a new web page offline to obtain the corresponding document;
obtaining the document number according to the network address of the document;
performing extraction processing on the words in the document to obtain each word;
obtaining the number of each word according to each word;
obtaining the in-line arrangement information according to the document number and the number of each word.
Step 102, an internal document number is allocated to the document.
Optionally, before a document is put into storage and saved, it is first forwarded: according to a preset document forwarding rule, the document is forwarded to one machine, and that machine performs the storage operation. The machine is one of multiple machines contained in the server. The machines can be numbered sequentially, and after the in-line arrangement information of documents is received, different documents are forwarded to different machines for parallel processing, which speeds up document storage.
The internal document number is the document number used inside a single machine; internal document numbers on different machines may be identical. Each machine stores a global variable initialized to 0. Each time a new document is received, the current value of the global variable is allocated as the internal document number under which the document is stored, and the global variable is then incremented by 1. For example, if the current global variable is 00130, then when a new document is received, 00130 is allocated to it and the global variable becomes 00131; when another new document is received, 00131 is allocated to it and the global variable becomes 00132, and so on.
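The allocation rule above can be sketched as a small per-machine counter. This is only a minimal illustration: the class name `InternalIdAllocator` and its methods are hypothetical and do not come from the patent, and a real server would also need to persist the counter and guard it against concurrent writers.

```python
class InternalIdAllocator:
    """Per-machine counter handing out increasing internal document numbers.

    Hypothetical helper for illustration; the patent only describes a
    global variable that starts at 0 and grows by 1 per stored document.
    """

    def __init__(self, start=0):
        self._next = start  # plays the role of the machine's global variable

    def allocate(self):
        inner_docid = self._next  # the current value goes to the new document
        self._next += 1           # then the global variable is incremented
        return inner_docid


# Mirrors the example in the text: the counter currently stands at 130.
alloc = InternalIdAllocator(start=130)
first = alloc.allocate()
second = alloc.allocate()
```

Because every allocated number is larger than all earlier ones on the same machine, the data later appended to each B+ tree arrives already sorted.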
In the embodiment of the present invention, an internal document number is allocated to the document before it is put into storage, so that the data inserted into the B+ tree is monotonically ordered. B+ tree insertion therefore becomes a sequential write, and the leaf nodes of the B+ tree never need to be reordered or split. This reduces the complexity of the operation and greatly increases the speed of storage.
Step 103, the internal document number is added under the number of each word in the in-line arrangement information, and the correspondence between the number of each word and the internal document number is saved in the database.
The number of each word corresponds to the inverted index of one B+ tree. The inverted index is used to locate a set of internal document numbers from a word number, and every document in that set contains the word corresponding to the word number. For example, an inverted index may be wordid1:00019, 00108 ... 00130, where wordid1 is the number of a word and 00019, 00108 ... 00130 form the set of internal document numbers. Each internal document number corresponds to one document, and each of these documents contains wordid1.
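As a rough picture of step 103, the sketch below appends an internal document number to the posting set of every word number in a document. It is an assumption-laden simplification: a plain Python list stands in for each word's B+ tree, and the function name `add_document` is hypothetical.

```python
from collections import defaultdict

# word number -> internal document numbers in arrival order.
# A list stands in for the per-word B+ tree described in the patent.
inverted_index = defaultdict(list)


def add_document(index, word_ids, inner_docid):
    """Append inner_docid to the posting set of every word in the document."""
    for word_id in word_ids:
        index[word_id].append(inner_docid)


# Reproduces the example "wordid1: 00019, 00108 ... 00130".
add_document(inverted_index, ["wordid1", "wordid2"], 19)
add_document(inverted_index, ["wordid1"], 108)
add_document(inverted_index, ["wordid1", "wordid3"], 130)
```

Since internal document numbers only grow, each posting list stays sorted without any explicit sort step.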
When a user searches for information, the user inputs a query string. A number is assigned to each word in the query string, the set of internal document numbers can be located quickly from each word number, and the retrieved documents are obtained from those sets. Optionally, the query string received from the user is preprocessed to obtain the number of each word it contains; according to the number of each word, the internal document numbers that are common to the sets corresponding to all the word numbers are obtained and taken as target internal document numbers; according to the mapping from internal document number to document number, the document number corresponding to each target internal document number is determined, and the document corresponding to that document number is output.
The embodiment of the present invention provides a method for processing a document. Extraction processing is performed on the words in a document to obtain in-line arrangement information, which includes a document number and the number of each word in the document; an internal document number is allocated to the document; the internal document number is added under the number of each word in the in-line arrangement information, and the correspondence between the number of each word and the internal document number is saved in the database. A single document can thus be put into storage and saved on its own, which improves the real-time performance of document storage.
The embodiment of the present invention provides a kind of method for handling document, and the executing subject of this method can be server, this method Two processes, a process for document storage, a process for user's real-time retrieval, as shown in Fig. 2, this article can be divided into The process of shelves storage specifically includes:
Step 201, new web page is downloaded offline, obtains corresponding document.
When webpage when website has update, or when being added to new webpage, the webpage can be downloaded, it is corresponding to obtain webpage Document, so that document is further processed.
Step 202, the document number is obtained according to the network address of the document.
Optionally, the Message-Digest Algorithm 5 (MD5) value of the document is calculated from its network address, and that MD5 value is used as the document number docid. Optionally, the network address of the document may be a uniform resource locator (URL). MD5 is a hash algorithm commonly used to check the integrity of files, for example to detect that a program has been tampered with or has had a Trojan added to it. Applying MD5 to a given input always yields the same fixed-length value, so each document corresponds to one MD5 value.
Step 203, extraction processing is carried out to the word in the document, obtains each word.
When extraction processing is performed on the words in the document, invalid words may be removed, only one copy of each repeated word may be kept, and so on.
Step 204, the number of each word is obtained according to each word.
Optionally, the MD5 value of each word is calculated and used as the number of that word. Each word corresponds to one MD5 value.
Step 205, the in-line arrangement information is obtained according to the document number and the number of each word.
The in-line arrangement information includes the document number and the number of each word in the document. For example, the in-line arrangement information may be docid:wordid1, wordid2, wordid3 ... wordidN, where docid is the document number of the document, wordid1 is the number of the first word in the document, and N is the quantity of words in the document.
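Steps 202 to 205 can be sketched with Python's standard `hashlib` module. The helper names `md5_hex` and `build_forward_info` are hypothetical, the example URL is a placeholder, and the word list is assumed to be already segmented; the patent does not specify how invalid words are recognized, so the sketch only drops empty strings and duplicates.

```python
import hashlib


def md5_hex(text):
    """Return the MD5 digest of a string as 32 hexadecimal characters."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def build_forward_info(url, words):
    """Build (docid, [wordid1, wordid2, ...]) for one document.

    docid is the MD5 of the URL; each word number is the MD5 of the word.
    Empty strings are skipped and repeated words are kept only once.
    """
    docid = md5_hex(url)
    seen = set()
    word_ids = []
    for word in words:
        if word and word not in seen:
            seen.add(word)
            word_ids.append(md5_hex(word))
    return docid, word_ids


docid, word_ids = build_forward_info(
    "http://example.com/page", ["search", "index", "search", ""]
)
```

The resulting pair corresponds to one record of the form docid:wordid1, wordid2 ... wordidN.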
Step 206, different documents are forwarded to different machines according to the document forwarding rule for further processing.
Optionally, before a document is put into storage and saved, it is first forwarded: according to a preset document forwarding rule, the document is forwarded to one machine, and that machine performs the storage operation. The machine is one of the machines in the server, and the server may contain multiple machines. The machines can be numbered sequentially, and after the in-line arrangement information of documents is received, different documents are forwarded to different machines for parallel processing, which speeds up document storage. For example, the server contains 10 machines numbered 0 to 9, and a modulo-10 calculation is performed for each document; if the result is 6, the document is forwarded to the machine numbered 6. If there is currently only one document, it is numbered 0; the documents received after it are numbered 1 and onward up to 9, after which numbering starts again from 0. This keeps the workload of the machines roughly equal, so that no machine is too busy or too idle, and also ensures that documents are processed in time.
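One reading of the forwarding rule above is a round-robin assignment in which a running document count is reduced modulo the number of machines. The sketch below follows that reading; the class name `RoundRobinForwarder` is hypothetical, and the patent leaves open exactly which value is taken modulo 10.

```python
class RoundRobinForwarder:
    """Pick a target machine by taking a running document count modulo
    the number of machines (an assumed interpretation of the rule)."""

    def __init__(self, machine_count=10):
        self.machine_count = machine_count
        self._received = 0  # documents seen so far

    def next_machine(self):
        machine = self._received % self.machine_count
        self._received += 1
        return machine


forwarder = RoundRobinForwarder(machine_count=10)
targets = [forwarder.next_machine() for _ in range(12)]
```

With 10 machines, the twelve documents go to machines 0 through 9 and then wrap around to 0 and 1, so each machine receives a similar share of the work.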
Step 207, an internal document number is allocated to the document.
The internal document number inner_docid is obtained from a global variable initialized to 0. When a document needs to be saved in the database, the current value of the global variable is allocated as the internal document number of the document, and the global variable is then incremented by 1.
Because each machine reassigns an internal document number to every document number, the data inserted into the B+ tree when documents are stored is monotonically ordered. B+ tree insertion becomes a sequential write, and the leaf nodes of the B+ tree do not need to be reordered or split.
Step 208, the mapping from the internal document number to the document number is established and saved.
The mapping inner_docid -> docid is established and saved: a contiguous block of memory is allocated to store the docid corresponding to each inner_docid. Because inner_docid values are ordered and allocated starting from 0, the corresponding docid can be located directly by using a pointer with inner_docid as the offset.
docid=f(inner_docid)=inner_docid_2_dcoid[inner_docid]
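The array lookup in the formula above can be illustrated with a Python list whose index is the inner_docid. This is a sketch under assumptions: the list stands in for the contiguous memory block, the function names are hypothetical, and the docid strings are placeholders rather than real MD5 values.

```python
# Index = inner_docid, value = docid. A Python list stands in for the
# contiguous memory block addressed by pointer + offset in the patent.
inner_to_docid = []


def record_mapping(table, inner_docid, docid):
    """Store docid at position inner_docid; numbers must arrive in order."""
    if inner_docid != len(table):
        raise ValueError("inner_docid must be allocated sequentially from 0")
    table.append(docid)


def lookup_docid(table, inner_docid):
    """Constant-time lookup: inner_docid is used directly as the offset."""
    return table[inner_docid]


record_mapping(inner_to_docid, 0, "docid-a")
record_mapping(inner_to_docid, 1, "docid-b")
found = lookup_docid(inner_to_docid, 1)
```

No hashing or searching is needed in this direction because the internal numbers are dense and start at 0.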
Step 209, the mapping from the document number to the internal document number is established and saved.
The mapping docid -> inner_docid is established and saved; a traditional hash linked list can be used to store the inner_docid corresponding to each docid.
Because the document number is represented by an MD5 value, identical documents have identical MD5 values. When the docid -> inner_docid mapping is established, it is first determined whether the previously stored docid -> inner_docid mappings contain a document number that duplicates the current one. If a duplicate exists, the inner_docid already stored in the hash linked list for that docid is updated to the newly allocated inner_docid, and the value of inner_docid_2_dcoid[inner_docid] for the old inner_docid is set to 0 to mark that the docid no longer exists at that position.
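The duplicate handling described above amounts to a tombstone scheme, sketched below. The class `DocIdMaps` and its methods are hypothetical; a Python dict stands in for the hash linked list, and the internal number is derived from the table length instead of a separate global variable.

```python
class DocIdMaps:
    """Keeps both mappings and tombstones the old slot on a duplicate docid."""

    def __init__(self):
        self.inner_to_docid = []  # index = inner_docid, 0 marks a deleted slot
        self.docid_to_inner = {}  # dict standing in for the hash linked list

    def register(self, docid):
        new_inner = len(self.inner_to_docid)
        old_inner = self.docid_to_inner.get(docid)
        if old_inner is not None:
            # Duplicate document: mark the old slot deleted, touch no B+ tree.
            self.inner_to_docid[old_inner] = 0
        self.inner_to_docid.append(docid)
        self.docid_to_inner[docid] = new_inner
        return new_inner

    def resolve(self, inner_docid):
        """Return the docid for inner_docid, or None if it was tombstoned."""
        value = self.inner_to_docid[inner_docid]
        return None if value == 0 else value


maps = DocIdMaps()
a = maps.register("docid-a")
b = maps.register("docid-b")
c = maps.register("docid-a")  # same document stored again
```

At query time, any internal document number that resolves to `None` can be skipped, so stale postings left in the B+ trees never surface in results.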
Step 210, the number of a first word in the in-line arrangement information is obtained in turn, the first word being any one of the words.
Optionally, for example, if the in-line arrangement information is docid:wordid1, wordid2, wordid3 ... wordidN, wordid1 is obtained first and inserted into its corresponding B+ tree, then wordid2 is obtained and inserted into its corresponding B+ tree, and each word is taken in turn until all words in the current document have been inserted into their corresponding B+ trees.
Step 211, it is determined whether the leaf node of the B+ tree corresponding to the number of the first word is full.
Optionally, before this step it may also be determined whether a B+ tree corresponding to the number of the first word exists. If it exists, step 211 is executed; if it does not exist, a block of space is allocated for the number of the first word and the B+ tree corresponding to that number is established.
A B+ tree includes a root node and leaf nodes. In the schematic diagram shown in Fig. 3, the B+ tree includes one root node and multiple leaf nodes, and each leaf node contains the number of the leaf node and an internal document number. For example, the first leaf node has number 1 and internal document number 000012, the second leaf node has number 2 and internal document number 000019, and so on.
Step 212, when the leaf node of the B+ tree is not full, the internal document number is added directly in the current leaf node.
As shown in Fig. 3, there are 10 leaf nodes, of which the first 9 already hold internal document numbers; in this case, the internal document number can be added directly in the 10th leaf node.
Step 213, when the leaf node of the B+ tree is full, it is determined whether the current layer of leaf nodes is full.
In the schematic diagram shown in Fig. 3, once all 10 leaf nodes of the B+ tree are full, the internal document number cannot be added directly to an existing leaf node; it is then determined whether the layer of leaf nodes under the root node is full.
Step 214, when the current layer of leaf nodes is not full, a new leaf node is added, and the internal document number is added in the new leaf node.
In the schematic diagram shown in Fig. 4, the root node of the B+ tree is [1,20], so it can have 20 leaf nodes. A new leaf node can therefore be added directly under this root node, for example an 11th leaf node. Once the new leaf node exists, the internal document number can be added to it, for example internal document number 000130.
Step 215, when the current layer of leaf nodes is full, a layer is added to the B+ tree, a new intermediate node and a new leaf node are added, and the internal document number is added in the new leaf node; the new intermediate node connects the current layer of leaf nodes and the new leaf node.
In the schematic diagram shown in Fig. 5, the root node of the original B+ tree is [1,20] and all current leaf nodes are full. When a new leaf node needs to be added, a layer must be added to the B+ tree. As shown in Fig. 5, a new intermediate node and a new leaf node are added. The second-layer intermediate node [1,20] is the original B+ tree: the first address pointed to by the root node of the original B+ tree becomes the first child node of the current root node. A new intermediate node and leaf node are then added; that is, [21,40] is added to the current root node, the new intermediate node [21,40] is added at the second layer, and leaf nodes for [21,40] are added. The internal document number can then be added in the newly added leaf node; for example, the leaf node is 21 and the internal document number is 000138.
During dynamic layering, the first node of the newly created level points to the address that the original root node pointed to, and new data then continues to be inserted. Because the process only adds nodes, no locking operation is needed.
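The append-only growth described in steps 211 to 215 can be illustrated with a deliberately simplified tree. This sketch is not a full B+ tree: it has no separator keys or leaf links, nested Python lists stand in for nodes, and the class name `AppendOnlyTree` and the `fanout` parameter are hypothetical. It only shows that, with strictly increasing keys, insertion adds nodes on the right edge and never splits or reorders existing ones.

```python
class AppendOnlyTree:
    """Simplified append-only tree for strictly increasing keys."""

    def __init__(self, fanout=4):
        self.fanout = fanout
        self.root = []       # a leaf holds keys; an inner node holds children
        self.height = 1      # height 1 means the root itself is a leaf
        self.count = 0
        self.last_key = None

    def append(self, key):
        if self.last_key is not None and key <= self.last_key:
            raise ValueError("keys must be strictly increasing")
        if self.count == self.fanout ** self.height:
            # Every node is full: add a layer. The old root becomes the
            # first child of the new root, as in the Fig. 5 description.
            self.root = [self.root]
            self.height += 1
        node, offset = self.root, self.count
        for level in range(self.height, 1, -1):
            child_capacity = self.fanout ** (level - 1)
            index, offset = divmod(offset, child_capacity)
            if index == len(node):
                node.append([])  # new node on the right edge only
            node = node[index]
        node.append(key)         # sequential write into the last leaf
        self.count += 1
        self.last_key = key


tree = AppendOnlyTree(fanout=2)
for inner_docid in [12, 19, 100, 130, 138]:
    tree.append(inner_docid)
```

With a fan-out of 2, the fifth insertion finds every node full, wraps the old root in a new one, and starts a fresh right-hand branch; existing nodes are never modified, which is why readers need no lock.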
It should be noted that frequently changing web pages are updated often, so documents entering the database may be duplicates. To keep duplicate documents out of search results, a delete-then-insert operation would normally have to be performed on the B+ tree. In the present invention, when a duplicate document is put into storage, the inner_docid originally allocated to it is replaced by the newly allocated inner_docid, and the value stored for the original inner_docid is set to 0; that is, the original inner_docid is marked as deleted and no actual B+ tree deletion is performed, which avoids the relatively complex B+ tree deletion operation. Because the proportion of duplicate documents among web pages is small, the space lost in this way can be ignored.
Step 216, the correspondence between the number of each word and the internal document number is saved in the database.
Optionally, the B+ tree corresponding to the number of each word is saved. This step is executed after steps 212 and 214 are completed.
As shown in Fig. 6, the process of real-time retrieval by users specifically includes:
Step 601, the query string input by the user is received.
Optionally, the query string sent by the user is received through the Common Gateway Interface (CGI).
Step 602, error correction and word segmentation are performed on the query string to obtain each word included in the query string.
Error correction, synonym processing, word segmentation and the like are performed on the query string; for example, "I am Chinese" is segmented into "I", "am" and "Chinese".
Step 603, the MD5 value of each word is calculated and used as the number of that word.
Optionally, each word corresponds to one MD5 value.
Step 604, the internal document numbers corresponding to the number of each word are queried according to that number, and the inverted index of the B+ tree corresponding to the number of each word is obtained.
The number of each word corresponds to the inverted index of one B+ tree. The inverted index is used to locate a set of internal document numbers from a word number, and every document in that set contains the word corresponding to the word number. For example, the inverted index of the corresponding B+ tree can be obtained from wordid1; it includes 100 internal document numbers, and the document corresponding to each of them contains wordid1.
It should be noted that the B+ tree may be saved in the database or in memory; the present invention does not limit where the B+ tree is saved.
Step 605, the inverted index with the shortest length is obtained according to the index length of each inverted index.
Optionally, the index length of an inverted index is the quantity of internal document numbers it contains, and the shortest inverted index is the set containing the fewest internal document numbers. For example, the number of the word "I" corresponds to 3000 internal document numbers, the number of the word "am" corresponds to 5000, and the number of the word "Chinese" corresponds to 30. The number of "Chinese" can then be used as the candidate word, and its internal document numbers are checked against those corresponding to the numbers of the other words, so that the documents the user needs are determined quickly.
Of course, in the embodiment of the present invention the internal document numbers corresponding to the number of any word may be chosen as the candidate and checked against those of the other words to determine the documents the user needs. Preferably, the shortest inverted index is selected, which saves document retrieval time.
Step 606, the target internal document numbers in the shortest inverted index are obtained in turn.
Step 607, it is determined whether the target internal document number is present in every inverted index other than the shortest inverted index.
Optionally, when the target internal document number is absent from at least one of the inverted indexes other than the shortest inverted index, the next target internal document number is obtained, that is, step 606 is executed.
Each internal document number of the shortest inverted index is taken out of the B+ tree leaf nodes in turn and used as a keyword for a B+ tree search under the numbers of the other words. If the keyword is present in the internal document numbers corresponding to the numbers of all the other words, the keyword is output; if it is absent from the internal document numbers corresponding to the number of at least one word, the keyword is not output. This reduces the workload of the equipment and increases retrieval speed.
For example, the internal document numbers corresponding to the number of "Chinese" are 000012, 000019, 000100, 000130 and so on. The first internal document number, 000012, is used as the keyword, and the internal document numbers corresponding to the numbers of the other words are queried for it. If 000012 is not contained in all of them, for example it is present under the number of "I" but absent under the number of "am", the next keyword, 000019, is selected, and it is again determined whether this internal document number is present under the numbers of the other words.
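Steps 605 to 608 describe intersecting the posting sets by driving from the shortest one. The sketch below illustrates this under assumptions: sorted Python lists with binary search from the standard `bisect` module stand in for the B+ tree lookups, the function names are hypothetical, and the posting values are made-up sample data.

```python
from bisect import bisect_left


def contains(sorted_ids, target):
    """Binary search standing in for a B+ tree key lookup."""
    i = bisect_left(sorted_ids, target)
    return i < len(sorted_ids) and sorted_ids[i] == target


def intersect(postings):
    """Return the internal document numbers present in every posting list.

    The shortest list supplies the candidates; every other list is only
    probed, so the work is bounded by the size of the shortest list.
    """
    if not postings:
        return []
    ordered = sorted(postings, key=len)
    shortest, others = ordered[0], ordered[1:]
    return [doc for doc in shortest
            if all(contains(other, doc) for other in others)]


postings_i = [12, 19, 50, 100, 130, 200]   # sample data for "I"
postings_am = [19, 60, 130, 300, 400]      # sample data for "am"
postings_chinese = [12, 19, 100, 130]      # sample data for "Chinese"
targets = intersect([postings_i, postings_am, postings_chinese])
```

Here 12 and 100 are rejected because they are missing from the second list, while 19 and 130 appear in all three and become the target internal document numbers.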
Step 608, when the target internal document number is present in every inverted index other than the shortest inverted index, the target internal document number is obtained.
Optionally, when 000012 is present in the internal document numbers corresponding to the numbers of all the other words, the document corresponding to 000012 contains all of the words "I", "am" and "Chinese". Such a document covers more of the user's query information, so precision is higher.
Step 609, the document number corresponding to the target internal document number is determined according to the mapping from internal document number to document number, and the document corresponding to that document number is output.
Optionally, the mapping from internal document number to document number is stored before the document is put into storage, so that at retrieval time the document number corresponding to the target internal document number can be looked up from this mapping and provided to the user.
The embodiment of the present invention provides a kind of method for handling document, may be implemented dynamically to chase after using single document as granularity Information is indexed, and in document storage, the data of only additional operation, newly-increased index do not interfere with the number of old index According to, therefore it is capable of providing the read operation of no lock, in addition, due to document distribution internal document number, so that being inserted into B+ tree Data are dull orderly, to improve the speed of B+ tree insertion data.
The embodiment of the present invention provides a device for processing documents, which may be a server. As shown in Fig. 7, the device includes an extracting unit 701, an allocation unit 702, an adding unit 703, and a storage unit 704.
The extracting unit 701 is configured to extract the words in a document to obtain forward index information, the forward index information including the document number and the number of each word in the document;
The allocation unit 702 is configured to assign an internal document number to the document;
The adding unit 703 is configured to add the corresponding internal document number to the number of each word in the forward index information;
The storage unit 704 is configured to save the correspondence between the number of each word and the internal document number in a database.
Further optionally, as shown in Fig. 7, the extracting unit 701 includes a download module 7011, a number obtaining module 7012, an extraction module 7013, and a forward index information obtaining module 7014.
The download module 7011 is configured to download a new web page offline to obtain the corresponding document;
The number obtaining module 7012 is configured to obtain the document number according to the web address of the document;
The extraction module 7013 is configured to extract the words in the document to obtain each word;
The number obtaining module 7012 is further configured to obtain the number of each word according to that word;
The forward index information obtaining module 7014 is configured to obtain the forward index information according to the document number and the number of each word.
Further optionally, as shown in Fig. 8, the number obtaining module 7012 is configured to:
calculate the Message-Digest Algorithm 5 (MD5) value of the document from the web address of the document, and use the MD5 value of the document as the document number;
The number obtaining module 7012 is further configured to calculate the MD5 value of each word from that word, and to use the MD5 value of each word as the number of that word.
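The numbering scheme above can be sketched with Python's standard `hashlib`. This is an illustrative assumption about the details: the text does not say how strings are encoded or how the digest is represented, so UTF-8 and a hexadecimal digest are chosen here, and the function names and example URL are hypothetical.

```python
import hashlib


def md5_number(text):
    """MD5 of a string as a hex digest (UTF-8 encoding is an assumption)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def build_forward_info(url, words):
    """Build a forward index record: one document number plus its word numbers."""
    unique_words = list(dict.fromkeys(words))  # drop repeats, keep first-seen order
    return {
        "document_number": md5_number(url),
        "word_numbers": [md5_number(w) for w in unique_words],
    }


# Hypothetical URL and already-segmented words.
info = build_forward_info("http://example.com/page", ["I", "am", "Chinese", "I"])
print(info["document_number"], len(info["word_numbers"]))
```

The same function must be used for words at indexing time and at query time, otherwise the word numbers will not match.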
Optionally, the internal document number is obtained from a global variable that is initialized to 0. When a document needs to be saved to the database, the current value of the global variable is assigned to the document as its internal document number, and the global variable is then increased by 1.
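A counter of this kind can be sketched as below. The class name is hypothetical, and the lock is an added assumption for the case where several writers allocate numbers concurrently; the text only describes a global variable that starts at 0 and grows by 1.

```python
import threading


class InternalNumberAllocator:
    """Hands out 0, 1, 2, ... as internal document numbers."""

    def __init__(self):
        self._next = 0                  # the "global variable", initialized to 0
        self._lock = threading.Lock()   # assumption: guards concurrent writers

    def allocate(self):
        with self._lock:
            number = self._next         # current value becomes the new number
            self._next += 1             # then the counter increases by 1
            return number


allocator = InternalNumberAllocator()
first_three = [allocator.allocate() for _ in range(3)]
print(first_three)
```

Because the numbers only ever increase, every posting list receives its entries in sorted order, which is what makes append-style insertion possible.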
Further optionally, the device further includes an establishing unit 705, which operates after the allocation unit 702 assigns the internal document number to the document.
The establishing unit 705 is configured to establish the mapping from the internal document number to the document number;
The storage unit 704 is further configured to save the mapping from the internal document number to the document number. The storage unit 704 is configured to store the mapping from the document number to the internal document number in contiguous memory space;
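One way to keep such a mapping in contiguous memory is sketched below. It assumes, beyond what the text states, that document numbers are 16-byte MD5 digests and that the internal document number doubles as the slot index, so a lookup is a single offset calculation. The class and method names are hypothetical.

```python
import hashlib

DIGEST_SIZE = 16  # bytes in a raw MD5 digest


class InternalToDocumentMap:
    """Fixed-width records packed back to back in one bytearray."""

    def __init__(self):
        self._buf = bytearray()

    def append(self, document_number):
        """Store a 16-byte document number; return its internal document number."""
        if len(document_number) != DIGEST_SIZE:
            raise ValueError("document number must be a 16-byte MD5 digest")
        internal = len(self._buf) // DIGEST_SIZE
        self._buf += document_number
        return internal

    def lookup(self, internal):
        """Return the document number stored for an internal document number."""
        if not 0 <= internal < len(self._buf) // DIGEST_SIZE:
            raise IndexError("unknown internal document number")
        start = internal * DIGEST_SIZE
        return bytes(self._buf[start:start + DIGEST_SIZE])


mapping = InternalToDocumentMap()
doc_a = hashlib.md5(b"http://example.com/a").digest()  # hypothetical URLs
doc_b = hashlib.md5(b"http://example.com/b").digest()
id_a = mapping.append(doc_a)
id_b = mapping.append(doc_b)
print(id_a, id_b, mapping.lookup(id_b) == doc_b)
```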
Optionally, the number of each word corresponds to an inverted index organized as a B+ tree. The inverted index is used to locate a set of internal document numbers from one word number, and every document corresponding to that set contains the word corresponding to the word number.
Further optionally, as shown in Fig. 8, the adding unit 703 includes an obtaining module 7031 and an insertion module 7032.
The obtaining module 7031 is configured to successively obtain the number of a first word in the forward index information, the first word being any one of the words;
The insertion module 7032 is configured to insert the internal document number into the B+ tree corresponding to the number of the first word.
Further, the insertion module 7032 includes:
a judging submodule, configured to judge whether the leaf node of the B+ tree corresponding to the number of the first word is full;
an adding submodule, configured to add the internal document number directly to the current leaf node when the leaf node of the B+ tree is not full;
The judging submodule is further configured to judge whether the current layer of leaf nodes is full when the leaf node of the B+ tree is full;
The adding submodule is further configured to add a new leaf node when the current layer of leaf nodes is not full, and to add the internal document number to the new leaf node;
The adding submodule is further configured to, when the current layer of leaf nodes is full, add one layer to the B+ tree, add a new intermediate node and a new leaf node, and add the internal document number to the new leaf node, the new intermediate node connecting the current layer of leaf nodes and the new leaf node.
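The insertion cases above can be illustrated with a simplified append-only B+ tree. This sketch is one possible reading of the description, not the patented data structure: it assumes keys arrive in strictly increasing order (as internal document numbers do), never splits or rebalances nodes, and uses a hypothetical `capacity` for both leaf keys and inner-node children.

```python
from bisect import bisect_right


class Node:
    def __init__(self, leaf):
        self.leaf = leaf
        self.keys = []      # leaf: stored numbers; inner: smallest key under each child
        self.children = []  # used by inner nodes only


class AppendOnlyBPlusTree:
    """Simplified B+ tree that only accepts strictly increasing keys."""

    def __init__(self, capacity=4):
        self.capacity = capacity        # hypothetical per-node limit
        self.root = Node(leaf=True)
        self.right_path = [self.root]   # nodes from the root down to the rightmost leaf

    def append(self, key):
        leaf = self.right_path[-1]
        if leaf.keys and key <= leaf.keys[-1]:
            raise ValueError("keys must be strictly increasing")
        if len(leaf.keys) < self.capacity:
            leaf.keys.append(key)       # leaf not full: add directly
        else:
            self._add_new_leaf(key)

    def _add_new_leaf(self, key):
        new_leaf = Node(leaf=True)
        new_leaf.keys.append(key)
        new_path = [new_leaf]
        level = len(self.right_path) - 2  # parent of the full leaf, or -1 if none
        # Walk up while the ancestors on the right edge are full as well,
        # wrapping the new leaf in fresh single-child inner nodes.
        while level >= 0 and len(self.right_path[level].children) >= self.capacity:
            parent = Node(leaf=False)
            parent.keys.append(key)
            parent.children.append(new_path[0])
            new_path.insert(0, parent)
            level -= 1
        if level >= 0:
            # Room found: hang the new subtree under this ancestor.
            anchor = self.right_path[level]
            anchor.keys.append(key)
            anchor.children.append(new_path[0])
            self.right_path = self.right_path[:level + 1] + new_path
        else:
            # Everything on the right edge is full: grow the tree by one layer.
            new_root = Node(leaf=False)
            new_root.keys = [self.root.keys[0], key]
            new_root.children = [self.root, new_path[0]]
            self.root = new_root
            self.right_path = [new_root] + new_path

    def contains(self, key):
        node = self.root
        while not node.leaf:
            i = bisect_right(node.keys, key) - 1
            if i < 0:
                return False
            node = node.children[i]
        i = bisect_right(node.keys, key) - 1
        return i >= 0 and node.keys[i] == key


tree = AppendOnlyBPlusTree(capacity=3)
for internal_number in range(0, 100, 2):
    tree.append(internal_number)
print(tree.contains(42), tree.contains(43))
```

Since existing nodes are never rewritten, only extended on the right edge, an insertion touches at most one node per level. This is the property the description relies on when it says monotonic internal document numbers speed up insertion.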
Further optionally, as shown in Fig. 8, the device further includes a preprocessing unit 706, an acquiring unit 707, a determination unit 708, and an output unit 709.
After the storage unit 704 saves the correspondence between the number of each word and the internal document number in the database, the preprocessing unit 706 is configured to preprocess the query string received from the user to obtain the number of each word contained in the query string;
The acquiring unit 707 is configured to obtain, according to the number of each word, the internal document numbers that are common to the internal document numbers corresponding to the numbers of all the words, and to take those common internal document numbers as target internal document numbers;
The determination unit 708 is configured to determine the document number corresponding to the target internal document number according to the mapping from internal document numbers to document numbers;
The output unit 709 is configured to output the document corresponding to the document number.
Further optionally, as shown in Fig. 8, the preprocessing unit 706 includes a receiving module 7061, a preprocessing module 7062, and a computing module 7063.
The receiving module 7061 is configured to receive the query string entered by the user;
The preprocessing module 7062 is configured to perform error correction and word segmentation on the query string to obtain each word contained in the query string;
The computing module 7063 is configured to calculate the MD5 value of each word from that word and to use the MD5 value of each word as the number of that word.
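The query-side processing can be sketched as below, with heavy simplifications that are assumptions rather than part of the text: error correction is omitted, whitespace splitting stands in for word segmentation (which would not work for unsegmented Chinese text), and the function name is hypothetical.

```python
import hashlib


def preprocess_query(query):
    """Turn a query string into the word numbers used to look up inverted indexes."""
    words = query.split()                      # stand-in for real word segmentation
    unique_words = list(dict.fromkeys(words))  # drop repeats, keep first-seen order
    return [hashlib.md5(w.encode("utf-8")).hexdigest() for w in unique_words]


word_numbers = preprocess_query("I am  Chinese I")
print(len(word_numbers))
```

The resulting word numbers are only useful if they are computed exactly as they were when the documents were indexed.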
Optionally, the number of each word corresponds to an inverted index organized as a B+ tree. The inverted index is used to locate a set of internal document numbers from one word number, and every document corresponding to that set contains the word corresponding to the word number.
Further optionally, as shown in Fig. 8, the acquiring unit 707 includes a query module 7071, an inverted index obtaining module 7072, and a target internal document number obtaining module 7073.
The query module 7071 is configured to query, according to the number of each word, the internal document numbers corresponding to the number of that word, and to obtain the B+ tree inverted index corresponding to the number of each word;
The inverted index obtaining module 7072 is configured to obtain the shortest inverted index according to the index length of each inverted index;
The target internal document number obtaining module 7073 is configured to successively obtain a target internal document number from the shortest inverted index;
The target internal document number obtaining module 7073 is further configured to obtain the next target internal document number when the target internal document number is absent from at least one of the inverted indexes other than the shortest inverted index;
The target internal document number obtaining module 7073 is further configured to obtain the target internal document number when it is present in all of the inverted indexes other than the shortest inverted index.
It should be noted that, for the devices shown in Fig. 7 or Fig. 8, the specific implementation of each module and the information exchange between modules are based on the same inventive concept as the method embodiments of the present invention. Refer to the method embodiments for details, which are not repeated here.
The embodiment of the present invention provides a device for processing documents. The extracting unit extracts the words in a document to obtain forward index information, which includes the document number and the number of each word in the document; the allocation unit assigns an internal document number to the document; the adding unit adds the corresponding internal document number to the number of each word in the forward index information; and the storage unit saves the correspondence between the number of each word and the internal document number in the database. A single document can therefore be stored on its own, which improves the timeliness of document storage.
It should be noted that the device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.
From the above description of the embodiments, those skilled in the art can clearly understand that the present invention can be implemented by software plus the necessary general-purpose hardware, and of course also by dedicated hardware, including application-specific integrated circuits, dedicated CPUs, dedicated memory, dedicated components, and the like, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a readable storage medium, such as a computer floppy disk, USB flash drive, removable hard disk, read-only memory (ROM), random access memory (RAM), magnetic disk, or optical disc, and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present invention.
The embodiments in this specification are described in a progressive manner. For identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the device and system embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, refer to the description of the method embodiments.
The above is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited to it. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be subject to the scope of protection of the claims.

Claims (22)

1. A method for processing a document, characterized by comprising:
extracting the words in a document to obtain forward index information, the forward index information comprising a document number and the number of each word in the document, wherein the number of each word corresponds to an inverted index organized as a B+ tree, the inverted index is used to locate a set of internal document numbers from one word number, and every document corresponding to the set of internal document numbers contains the word corresponding to the word number;
assigning an internal document number to the document;
successively obtaining the number of a first word in the forward index information, the first word being any one of the words;
inserting the internal document number into a leaf node of the B+ tree corresponding to the number of the first word;
saving the correspondence between the number of each word and the internal document number in a database.
2. The method according to claim 1, characterized in that extracting the words in the document to obtain the forward index information comprises:
downloading a new web page offline to obtain the corresponding document;
obtaining the document number according to the web address of the document;
extracting the words in the document to obtain each word;
obtaining the number of each word according to that word;
obtaining the forward index information according to the document number and the number of each word.
3. The method according to claim 2, characterized in that obtaining the document number according to the web address of the document comprises:
calculating the Message-Digest Algorithm 5 (MD5) value of the document from the web address of the document, and using the MD5 value of the document as the document number;
and obtaining the number of each word according to that word comprises:
calculating the MD5 value of each word from that word, and using the MD5 value of each word as the number of that word.
4. The method according to claim 1, characterized in that the internal document number is obtained from a global variable initialized to 0; when a document needs to be saved to the database, the current value of the global variable is assigned as the internal document number of the document, and the global variable is then increased by 1.
5. The method according to claim 4, characterized in that, after assigning the internal document number to the document, the method further comprises:
establishing and saving the mapping from the internal document number to the document number.
6. The method according to claim 5, characterized in that saving the mapping from the document number to the internal document number comprises:
storing the mapping from the document number to the internal document number in contiguous memory space.
7. The method according to claim 1, characterized in that inserting the internal document number into the leaf node of the B+ tree corresponding to the number of the first word comprises:
judging whether the leaf node of the B+ tree corresponding to the number of the first word is full;
when the leaf node of the B+ tree is not full, adding the internal document number directly to the current leaf node;
when the leaf node of the B+ tree is full, judging whether the current layer of leaf nodes is full;
when the current layer of leaf nodes is not full, adding a new leaf node, and adding the internal document number to the new leaf node;
when the current layer of leaf nodes is full, adding one layer to the B+ tree, adding a new intermediate node and a new leaf node, and adding the internal document number to the new leaf node, the new intermediate node connecting the current layer of leaf nodes and the new leaf node.
8. The method according to claim 1, characterized in that, after saving the correspondence between the number of each word and the internal document number in the database, the method further comprises:
preprocessing the query string received from the user to obtain the number of each word contained in the query string;
obtaining, according to the number of each word, the internal document numbers that are common to the internal document numbers corresponding to the numbers of all the words, and taking those common internal document numbers as target internal document numbers;
determining, according to the mapping from internal document numbers to document numbers, the document number corresponding to the target internal document number, and outputting the document corresponding to the document number.
9. The method according to claim 8, characterized in that preprocessing the query string received from the user to obtain the number of each word contained in the query string comprises:
receiving the query string entered by the user;
performing error correction and word segmentation on the query string to obtain each word contained in the query string;
calculating the MD5 value of each word from that word, and using the MD5 value of each word as the number of that word.
10. The method according to claim 8, characterized in that
the number of each word corresponds to an inverted index organized as a B+ tree, the inverted index is used to locate a set of internal document numbers from one word number, and every document corresponding to the set of internal document numbers contains the word corresponding to the word number.
11. The method according to claim 10, characterized in that obtaining, according to the number of each word, the internal document numbers that are common to the internal document numbers corresponding to the numbers of all the words comprises:
querying, according to the number of each word, the internal document numbers corresponding to the number of that word, and obtaining the B+ tree inverted index corresponding to the number of each word;
obtaining the shortest inverted index according to the index length of each inverted index;
successively obtaining a target internal document number from the shortest inverted index;
when the target internal document number is absent from at least one of the inverted indexes other than the shortest inverted index, obtaining the next target internal document number;
when the target internal document number is present in all of the inverted indexes other than the shortest inverted index, obtaining the target internal document number.
12. A device for processing a document, characterized by comprising:
an extracting unit, configured to extract the words in a document to obtain forward index information, the forward index information comprising a document number and the number of each word in the document, wherein the number of each word corresponds to an inverted index organized as a B+ tree, the inverted index is used to locate a set of internal document numbers from one word number, and every document corresponding to the set of internal document numbers contains the word corresponding to the word number;
an allocation unit, configured to assign an internal document number to the document;
an adding unit, configured to add the corresponding internal document number to the number of each word in the forward index information;
a storage unit, configured to save the correspondence between the number of each word and the internal document number in a database;
the adding unit comprising:
an obtaining module, configured to successively obtain the number of a first word in the forward index information, the first word being any one of the words;
an insertion module, configured to insert the internal document number into a leaf node of the B+ tree corresponding to the number of the first word.
13. The device according to claim 12, characterized in that the extracting unit comprises:
a download module, configured to download a new web page offline to obtain the corresponding document;
a number obtaining module, configured to obtain the document number according to the web address of the document;
an extraction module, configured to extract the words in the document to obtain each word;
the number obtaining module being further configured to obtain the number of each word according to that word;
a forward index information obtaining module, configured to obtain the forward index information according to the document number and the number of each word.
14. The device according to claim 13, characterized in that the number obtaining module is configured to:
calculate the Message-Digest Algorithm 5 (MD5) value of the document from the web address of the document, and use the MD5 value of the document as the document number;
the number obtaining module being further configured to calculate the MD5 value of each word from that word, and to use the MD5 value of each word as the number of that word.
15. The device according to claim 12, characterized in that the internal document number is obtained from a global variable initialized to 0; when a document needs to be saved to the database, the current value of the global variable is assigned as the internal document number of the document, and the global variable is then increased by 1.
16. The device according to claim 15, characterized in that the device further comprises:
an establishing unit, configured to establish the mapping from the internal document number to the document number;
the storage unit being further configured to save the mapping from the internal document number to the document number.
17. The device according to claim 16, characterized in that the storage unit is configured to:
store the mapping from the document number to the internal document number in contiguous memory space.
18. The device according to claim 12, characterized in that the insertion module comprises:
a judging submodule, configured to judge whether the leaf node of the B+ tree corresponding to the number of the first word is full;
an adding submodule, configured to add the internal document number directly to the current leaf node when the leaf node of the B+ tree is not full;
the judging submodule being further configured to judge whether the current layer of leaf nodes is full when the leaf node of the B+ tree is full;
the adding submodule being further configured to add a new leaf node when the current layer of leaf nodes is not full, and to add the internal document number to the new leaf node;
the adding submodule being further configured to, when the current layer of leaf nodes is full, add one layer to the B+ tree, add a new intermediate node and a new leaf node, and add the internal document number to the new leaf node, the new intermediate node connecting the current layer of leaf nodes and the new leaf node.
19. The device according to claim 12, characterized in that the device further comprises:
a preprocessing unit, configured to preprocess the query string received from the user to obtain the number of each word contained in the query string;
an acquiring unit, configured to obtain, according to the number of each word, the internal document numbers that are common to the internal document numbers corresponding to the numbers of all the words, and to take those common internal document numbers as target internal document numbers;
a determination unit, configured to determine the document number corresponding to the target internal document number according to the mapping from internal document numbers to document numbers;
an output unit, configured to output the document corresponding to the document number.
20. The device according to claim 19, characterized in that the preprocessing unit comprises:
a receiving module, configured to receive the query string entered by the user;
a preprocessing module, configured to perform error correction and word segmentation on the query string to obtain each word contained in the query string;
a computing module, configured to calculate the MD5 value of each word from that word, and to use the MD5 value of each word as the number of that word.
21. The device according to claim 19, characterized in that
the number of each word corresponds to an inverted index organized as a B+ tree, the inverted index is used to locate a set of internal document numbers from one word number, and every document corresponding to the set of internal document numbers contains the word corresponding to the word number.
22. The device according to claim 21, characterized in that the acquiring unit comprises:
a query module, configured to query, according to the number of each word, the internal document numbers corresponding to the number of that word, and to obtain the B+ tree inverted index corresponding to the number of each word;
an inverted index obtaining module, configured to obtain the shortest inverted index according to the index length of each inverted index;
a target internal document number obtaining module, configured to successively obtain a target internal document number from the shortest inverted index;
the target internal document number obtaining module being further configured to obtain the next target internal document number when the target internal document number is absent from at least one of the inverted indexes other than the shortest inverted index;
the target internal document number obtaining module being further configured to obtain the target internal document number when it is present in all of the inverted indexes other than the shortest inverted index.
CN201310567401.0A 2013-11-13 2013-11-13 A kind of method and device handling document Active CN104636384B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310567401.0A CN104636384B (en) 2013-11-13 2013-11-13 A kind of method and device handling document

Publications (2)

Publication Number Publication Date
CN104636384A CN104636384A (en) 2015-05-20
CN104636384B true CN104636384B (en) 2019-07-16

Family

ID=53215147

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310567401.0A Active CN104636384B (en) 2013-11-13 2013-11-13 A kind of method and device handling document

Country Status (1)

Country Link
CN (1) CN104636384B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105956203B (en) * 2016-06-30 2019-03-08 湖州亿联信息技术有限公司 A kind of information storage means, information query method, search engine device
US10540443B2 (en) * 2016-12-20 2020-01-21 RELX Inc. Systems and methods for determining references in patent claims
CN112395829B (en) * 2019-08-01 2024-03-19 珠海金山办公软件有限公司 Method and device for adding Chinese numbers to documents and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1916905A (en) * 2006-09-04 2007-02-21 北京航空航天大学 Method for carrying out retrieval hint based on inverted list
CN102033954A (en) * 2010-12-24 2011-04-27 东北大学 Full text retrieval inquiry index method for extensible markup language document in relational database
CN102722526A (en) * 2012-05-16 2012-10-10 成都信息工程学院 Part-of-speech classification statistics-based duplicate webpage and approximate webpage identification method
CN102955812A (en) * 2011-08-29 2013-03-06 阿里巴巴集团控股有限公司 Method and device for building index database as well as method and device for querying

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
NZ518744A (en) * 2002-05-03 2004-08-27 Hyperbolex Ltd Electronic document indexing using word use nodes, node objects and link objects

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
An inverted index implemented with a B+ tree; Li Wen et al.; Computer Knowledge and Technology (《电脑知识与技术》); 2011-03-15; Vol. 7, No. 8; pp. 1720-1722

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant