CN105808607A - Generation method and device of document index - Google Patents

Generation method and device of document index

Info

Publication number
CN105808607A
Authority
CN
China
Prior art keywords
participle
document
query word
anchor text
weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410854769.XA
Other languages
Chinese (zh)
Inventor
董毅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd and Qizhi Software Beijing Co Ltd
Priority to CN201410854769.XA
Publication of CN105808607A

Abstract

Embodiments of the invention provide a method and device for generating a document index. The method comprises: obtaining one or more query-word anchor texts corresponding to a document; setting a total query-word anchor text weight for each query-word anchor text; configuring, according to that total weight, a feature weight for each segmented word contained in the query-word anchor text; determining, on the basis of the feature weight, the segmented-word weight of the segmented word relative to the document; and generating the document index according to the segmented word and its segmented-word weight relative to the document. The embodiments raise the probability of displaying search result items relevant to the user's search and improve search accuracy. This reduces repeated searching, such as paging through search results or re-entering search keywords, which makes operation more convenient, reduces resource consumption on the search engine and the local system, reduces bandwidth consumption, and improves search efficiency.

Description

Method and device for generating a document index
Technical field
The present invention relates to the field of search technology, and in particular to a method for generating a document index and a device for generating a document index.
Background
With the rapid development of networks, the amount of information on them has grown sharply. To find the information they need within this mass of information, users generally rely on search engines.
A search engine is a system that automatically gathers information from the Internet, organizes it, and makes it available for users to query. Information on the Internet is vast, varied, and unordered: each piece of information is like an island in a wide sea, and hyperlinks are the bridges that crisscross between those islands. The search engine draws a clear map of this information for users to consult at any time.
A search engine generally builds a document index in advance, such as an inverted index. Each entry in such an index table contains an attribute value and the addresses of all records that have that attribute value. The indexed objects are the words in a document or document collection, and the index stores the positions of those words within a document or group of documents; it is one of the most commonly used indexing mechanisms for documents and document collections. Because the attribute value is not determined from the record, but rather the position of the record is determined from the attribute value, the structure is called an inverted index.
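The inverted index described above can be illustrated with a minimal Python sketch. The function name `build_inverted_index`, the sample documents, and the whitespace tokenization are assumptions made for illustration only; they are not taken from the patent.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Whitespace splitting stands in for real word segmentation.
        for word in text.split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

# Hypothetical sample data.
docs = {1: "search engine index", 2: "inverted index", 3: "search results"}
inverted = build_inverted_index(docs)
print(inverted["index"])   # [1, 2]
print(inverted["search"])  # [1, 3]
```

The lookup goes from a word (the attribute value) to the records that contain it, which is the inversion the paragraph above refers to.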
The document index in a search engine is usually a database index table. Results obtained by searching against such a table are often not what the user needs, so accuracy is low. When users fail to find the required information, they typically page through the search results, re-enter search keywords, or search again in similar ways. This is cumbersome, consumes considerable search-engine and local-system resources, consumes considerable bandwidth, and makes searching inefficient.
Summary of the invention
In view of the above problems, the present invention is proposed to provide a method for generating a document index, and a corresponding device for generating a document index, that overcome the above problems or at least partially solve them.
According to one aspect of the present invention, a method for generating a document index is provided, comprising:
obtaining one or more query-word anchor texts corresponding to a document;
setting a total query-word anchor text weight for each query-word anchor text;
configuring, according to the total query-word anchor text weight, a feature weight for each segmented word contained in the query-word anchor text;
determining, based on the feature weight, the segmented-word weight of the segmented word relative to the document; and
generating a document index according to the segmented word and its segmented-word weight relative to the document.
Optionally, the method further comprises:
extracting segmented words from crawled documents.
Optionally, the segmented words include unigrams, and the step of extracting segmented words from crawled documents comprises:
performing word segmentation on the crawled documents to obtain unigrams.
Optionally, the segmented words further include bigrams, and the step of extracting segmented words from crawled documents further comprises:
combining each pair of adjacent unigrams to obtain bigrams.
Optionally, the step of configuring, according to the total query-word anchor text weight, a feature weight for each segmented word contained in the query-word anchor text comprises:
when the query-word anchor text contains one segmented word, assigning the total query-word anchor text weight to that segmented word.
Optionally, the step of configuring, according to the total query-word anchor text weight, a feature weight for each segmented word contained in the query-word anchor text comprises:
when the query-word anchor text contains multiple segmented words, distributing the total query-word anchor text weight evenly among the segmented words.
Optionally, the step of determining, based on the feature weight, the segmented-word weight of the segmented word relative to the document comprises:
calculating the sum of the feature weights of identical segmented words within the document to obtain the segmented-word weight of the segmented word relative to the document.
Optionally, the document has number information, and the step of generating a document index using the segmented word and its segmented-word weight relative to the document comprises:
setting the segmented word as a key in one or more index tables;
setting the number information of the document, the segmented-word weight, and the position information of the segmented word in the document as the value corresponding to the key, thereby obtaining one or more document indexes.
Optionally, the step of generating a document index using the segmented word and its segmented-word weight relative to the document further comprises:
merging the one or more document indexes.
Optionally, the method further comprises:
storing the document index in a database.
Optionally, the step of storing the document index in a database comprises:
storing target number information and target segmented-word weights separately from target position information, in different files;
wherein the target number information is number information whose access frequency exceeds a preset first frequency threshold, the target segmented-word weight is a segmented-word weight whose access frequency exceeds a preset second frequency threshold, and the target position information is position information whose access frequency is lower than a preset third frequency threshold.
Optionally, the step of storing the document index in a database comprises:
combining one or more documents into one or more data blocks;
in each data block, separately compressing at least one of the number information, the segmented-word weights, and the position information belonging to that data block.
According to another aspect of the present invention, a device for generating a document index is provided, comprising:
an acquisition module, adapted to obtain one or more query-word anchor texts corresponding to a document;
a setting module, adapted to set a total query-word anchor text weight for each query-word anchor text;
a configuration module, adapted to configure, according to the total query-word anchor text weight, a feature weight for each segmented word contained in the query-word anchor text;
a determination module, adapted to determine, based on the feature weight, the segmented-word weight of the segmented word relative to the document; and
a generation module, adapted to generate a document index according to the segmented word and its segmented-word weight relative to the document.
Optionally, the device further comprises:
an extraction module, adapted to extract segmented words from crawled documents.
Optionally, the segmented words include unigrams, and the extraction module is further adapted to:
perform word segmentation on the crawled documents to obtain unigrams.
Optionally, the segmented words further include bigrams, and the extraction module is further adapted to:
combine each pair of adjacent unigrams to obtain bigrams.
Optionally, the configuration module is further adapted to:
when the query-word anchor text contains one segmented word, assign the total query-word anchor text weight to that segmented word.
Optionally, the configuration module is further adapted to:
when the query-word anchor text contains multiple segmented words, distribute the total query-word anchor text weight evenly among the segmented words.
Optionally, the determination module is further adapted to:
calculate the sum of the feature weights of identical segmented words within the document to obtain the segmented-word weight of the segmented word relative to the document.
Optionally, the document has number information, and the generation module is further adapted to:
set the segmented word as a key in one or more index tables;
set the number information of the document, the segmented-word weight, and the position information of the segmented word in the document as the value corresponding to the key, thereby obtaining one or more document indexes.
Optionally, the generation module is further adapted to:
merge the one or more document indexes.
Optionally, the device further comprises:
a storage module, adapted to store the document index in a database.
Optionally, the storage module is further adapted to:
store target number information and target segmented-word weights separately from target position information, in different files;
wherein the target number information is number information whose access frequency exceeds a preset first frequency threshold, the target segmented-word weight is a segmented-word weight whose access frequency exceeds a preset second frequency threshold, and the target position information is position information whose access frequency is lower than a preset third frequency threshold.
Optionally, the storage module is further adapted to:
combine one or more documents into one or more data blocks;
in each data block, separately compress at least one of the number information, the segmented-word weights, and the position information belonging to that data block.
In the embodiments of the present invention, a total query-word anchor text weight is set for the one or more query-word anchor texts corresponding to a document; a feature weight is configured, according to that total weight, for each segmented word contained in the query-word anchor text; the segmented-word weight of the segmented word relative to the document is determined based on the feature weight; and a document index is generated according to the segmented word and its weight relative to the document. Because the document index records the weight of each segmented word relative to the document, search result items can be ranked and displayed by segmented-word weight when other users subsequently search. This raises the probability of displaying search result items relevant to the user's search and improves search accuracy, which in turn reduces the need to page through search results or re-enter search keywords, makes operation simpler, reduces resource consumption on the search engine and the local system, reduces bandwidth consumption, and improves search efficiency.
The embodiments of the present invention store the target number information and target segmented-word weights separately from the target position information, in different files, and combine one or more documents into one or more data blocks. In each data block, at least one of the number information, segmented-word weights, and position information belonging to the data block is compressed separately. This saves storage space on the one hand and, on the other, preserves retrieval performance and search efficiency.
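The storage layout described above can be sketched in Python as follows. The patent does not name a compression technique or a file format, so the gap (delta) encoding of DocIDs, the use of `zlib`, the JSON serialization, and the names `pack_block` and `unpack_hot` are all assumptions chosen for illustration. The "hot" part stands in for the frequently accessed number information and weights, and the "cold" part for the less frequently accessed position information.

```python
import json
import zlib

def pack_block(postings):
    """postings: list of (doc_id, weight, positions), sorted by doc_id.

    Returns two independently compressed byte strings that could be
    written to different files: (hot, cold).
    """
    doc_ids = [p[0] for p in postings]
    # Assumed technique: store gaps between consecutive DocIDs.
    deltas = []
    previous = 0
    for doc_id in doc_ids:
        deltas.append(doc_id - previous)
        previous = doc_id
    hot_payload = {"deltas": deltas, "weights": [p[1] for p in postings]}
    hot = zlib.compress(json.dumps(hot_payload).encode("utf-8"))
    cold = zlib.compress(json.dumps([p[2] for p in postings]).encode("utf-8"))
    return hot, cold

def unpack_hot(hot):
    """Recover (doc_id, weight) pairs without touching position data."""
    data = json.loads(zlib.decompress(hot))
    doc_ids, current = [], 0
    for delta in data["deltas"]:
        current += delta
        doc_ids.append(current)
    return list(zip(doc_ids, data["weights"]))

# Hypothetical block of postings for one segmented word.
postings = [(100, 80.0, [0, 2]), (101, 40.0, [1]), (105, 10.0, [0])]
hot, cold = pack_block(postings)
print(unpack_hot(hot))  # [(100, 80.0), (101, 40.0), (105, 10.0)]
```

Keeping positions in a separate compressed part means that a ranking pass which only needs DocIDs and weights never has to decompress position data.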
The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the content of the specification, and so that the above and other objects, features, and advantages of the invention become more apparent, specific embodiments of the invention are set out below.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be considered limiting of the invention. Throughout the drawings, the same reference numerals denote the same parts. In the drawings:
Fig. 1 shows a flowchart of the steps of Embodiment 1 of a method for generating a document index according to an embodiment of the present invention;
Fig. 2 shows a flowchart of the steps of Embodiment 2 of a method for generating a document index according to an embodiment of the present invention; and
Fig. 3 shows a structural block diagram of an embodiment of a device for generating a document index according to an embodiment of the present invention.
Detailed description
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and its scope can be fully conveyed to those skilled in the art.
Referring to Fig. 1, a flowchart of the steps of Embodiment 1 of a method for generating a document index according to an embodiment of the present invention is shown. The method may specifically include the following steps:
Step 101: obtain one or more query-word anchor texts corresponding to a document.
It should be noted that a query-word anchor text (query anchor) is defined relative to a particular document: it may be the query (for example, a search keyword) that led to the document being triggered.
For example, a user enters the query "Tian'anmen Square" and the search engine finds multiple search result items related to "Tian'anmen Square", including search result item A (summary information of document A), search result item B (summary information of document B), and search result item C (summary information of document C). If the user triggers search result items A and B, for example by clicking or tapping, and is taken to the pages corresponding to search result items A and B, then the query "Tian'anmen Square" can be called a query-word anchor text of document A and of document B. Conversely, because search result item C was not triggered, the query "Tian'anmen Square" cannot be a query-word anchor text of document C.
Further, the search engine may collect statistics on the query-word anchor texts of a large number of users and select the one or more most frequently used query-word anchor texts for generating the document index.
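The selection of the most frequently used query-word anchor texts in Step 101 can be sketched as follows. The click-log format (a list of `(query, doc_id)` pairs, one per triggered result), the function name `top_query_anchors`, and the sample data are assumptions for illustration; the patent does not specify how click data is recorded.

```python
from collections import Counter, defaultdict

def top_query_anchors(click_log, k=2):
    """Return, per document, the k queries that most often led to a click."""
    counts = defaultdict(Counter)
    for query, doc_id in click_log:
        counts[doc_id][query] += 1
    return {
        doc_id: [query for query, _ in counter.most_common(k)]
        for doc_id, counter in counts.items()
    }

# Hypothetical click log: each pair means "this query led to a click on this document".
click_log = [
    ("Tian'anmen Square", "A"),
    ("Tian'anmen Square", "B"),
    ("Tian'anmen Square", "A"),
    ("Tian'anmen Square", "A"),
    ("Beijing landmarks", "A"),
    ("Beijing landmarks", "A"),
    ("Beijing map", "A"),
]
anchors = top_query_anchors(click_log, k=2)
print(anchors["A"])  # ["Tian'anmen Square", 'Beijing landmarks']
print(anchors["B"])  # ["Tian'anmen Square"]
```

Document C never appears in the log because it was never triggered, so it receives no query-word anchor text, matching the example above.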
Step 102: set a total query-word anchor text weight for each query-word anchor text.
In practical applications, a total query-word anchor text weight may be assigned to every query-word anchor text corresponding to each document. Those skilled in the art may obtain this total weight through offline calculation according to the actual situation.
For example, if film star Xiao Ming and singer Xiao Hong have recently married, query-word anchor texts about their marriage may be given a higher total query-word anchor text weight; some time after the marriage, query-word anchor texts about it may be given a lower total query-word anchor text weight.
In general, a query-word anchor text that is highly relevant to the document has a high total query-word anchor text weight, and a query-word anchor text with low relevance to the document has a low one.
Step 103: configure, according to the total query-word anchor text weight, a feature weight for each segmented word contained in the query-word anchor text.
It should be noted that the segmented word may be one that belongs to the document corresponding to the query-word anchor text.
In an optional embodiment of the present invention, Step 103 may include the following sub-step:
Sub-step S11: when the query-word anchor text contains one segmented word, assign the total query-word anchor text weight to that segmented word.
In the embodiments of the present invention, if a query-word anchor text contains one segmented word, the total query-word anchor text weight of that anchor text may be assigned to the segmented word as its feature weight.
For example, if the total query-word anchor text weight of the query-word anchor text "Tian'anmen Square" is 80, the segmented word "Tian'anmen Square" may be assigned a feature weight of 80.
In an optional embodiment of the present invention, Step 103 may include the following sub-step:
Sub-step S12: when the query-word anchor text contains multiple segmented words, distribute the total query-word anchor text weight evenly among the segmented words.
In the embodiments of the present invention, if a query-word anchor text contains multiple (that is, at least two) segmented words, the total query-word anchor text weight of that anchor text may be divided evenly among the segmented words to obtain the feature weight of each.
For example, if the total query-word anchor text weight of the query-word anchor text "Tian'anmen Square" is 80, the segmented words "Tian'anmen" and "Square" may each be assigned a feature weight of 40.
Of course, the above weight configuration methods are only examples. When implementing the embodiments of the present invention, those skilled in the art may use other weight configuration methods according to actual needs; the embodiments of the present invention place no limitation on this.
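Sub-steps S11 and S12 can be sketched together in a few lines of Python. The function name `assign_feature_weights` and the handling of an empty input are assumptions made for illustration; the even split itself follows the description above.

```python
def assign_feature_weights(segmented_words, total_weight):
    """Split an anchor text's total weight evenly over its segmented words.

    With one segmented word (sub-step S11) the word receives the whole
    weight; with several (sub-step S12) each receives an equal share.
    """
    if not segmented_words:
        return {}
    share = total_weight / len(segmented_words)
    weights = {}
    for word in segmented_words:
        # A word that appears twice in the anchor text collects two shares.
        weights[word] = weights.get(word, 0.0) + share
    return weights

print(assign_feature_weights(["Tian'anmen Square"], 80))
# {"Tian'anmen Square": 80.0}
print(assign_feature_weights(["Tian'anmen", "Square"], 80))
# {"Tian'anmen": 40.0, 'Square': 40.0}
```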
Step 104: determine, based on the feature weight, the segmented-word weight of the segmented word relative to the document.
In a specific implementation, each segmented word may have one segmented-word weight for each document.
In an optional embodiment of the present invention, Step 104 may include the following sub-step:
Sub-step S21: calculate the sum of the feature weights of identical segmented words within the document to obtain the segmented-word weight of the segmented word relative to the document.
In the embodiments of the present invention, within the same document, the feature weights that the same segmented word obtains at its different positions are accumulated, which yields the segmented-word weight of that segmented word for the document.
The higher the segmented-word weight of a segmented word for a document, the stronger the association between the segmented word and the document; conversely, the lower the segmented-word weight, the weaker the association.
For example, if the feature weight of the segmented word "Tian'anmen Square" is 80 and it occurs 30 times in a document, the segmented-word weight of "Tian'anmen Square" relative to that document is 2400; if the feature weight of "Tian'anmen" is 40 and it occurs 50 times in a document, the segmented-word weight of "Tian'anmen" relative to that document is 2000.
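The accumulation in sub-step S21 can be sketched as a running sum per segmented word. The input format (one `(segmented_word, feature_weight)` pair per occurrence in a single document) and the function name `segmented_word_weights` are assumptions for illustration; the numbers reproduce the example above.

```python
from collections import defaultdict

def segmented_word_weights(occurrences):
    """Sum feature weights over all occurrences of each segmented word.

    occurrences: iterable of (segmented_word, feature_weight), one entry
    per occurrence of the word in a single document.
    """
    totals = defaultdict(float)
    for word, feature_weight in occurrences:
        totals[word] += feature_weight
    return dict(totals)

# 30 occurrences at feature weight 80, and 50 occurrences at feature weight 40.
occurrences = [("Tian'anmen Square", 80)] * 30 + [("Tian'anmen", 40)] * 50
weights = segmented_word_weights(occurrences)
print(weights["Tian'anmen Square"])  # 2400.0
print(weights["Tian'anmen"])         # 2000.0
```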
Step 105: generate a document index according to the segmented word and its segmented-word weight relative to the document.
In a specific implementation, the document index may include an inverted index, a forward index, and so on, and may consist of two parts: an index table and a main file.
The index table may be a table indicating the correspondence between logical records and physical records. Each entry in the index table is called an index entry. Index entries are arranged in order of key (or logical record number).
In an optional embodiment of the present invention, the document may have number information, in which case Step 105 may include the following sub-steps:
Sub-step S31: in one or more index tables, set the segmented word as a key.
Sub-step S32: set the number information of the document, the segmented-word weight, and the position information of the segmented word in the document as the value corresponding to the key, thereby obtaining one or more document indexes.
In the embodiments of the present invention, the input data may be documents whose number information (DocID) is consecutive.
The output data may be the inverted index corresponding to these documents.
Specifically, the segmented word may be set as the key for retrieval. The segmented words are sorted, and the content corresponding to each segmented word is the number information (DocID, in order) of the documents containing it, together with the segmented word's weight in each of those documents, its number of occurrences, the position information of its occurrences, and so on.
When generating the inverted index, a hash table may be used as the index table. After segmented words are extracted from the forward index, each segmented word is used as the key, and the number information (DocID) of the current document, the segmented-word weight, the position information, and so on are updated into the hash table as the value corresponding to that key. Once this is complete, the contents of the hash table are sorted and output, yielding the inverted index.
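The hash-table construction described above can be sketched as follows. The shapes of the inputs (`forward_index` as a mapping from DocID to the ordered list of segmented words, `weights` as a mapping from `(DocID, word)` to segmented-word weight) and the function name `build_weighted_index` are assumptions for illustration, not the patent's data format.

```python
from collections import defaultdict

def build_weighted_index(forward_index, weights):
    """Build an inverted index keyed by segmented word.

    Each value is a list of (DocID, segmented-word weight, positions),
    ordered by DocID; keys are emitted in sorted order.
    """
    table = defaultdict(dict)  # hash table: word -> {DocID: posting}
    for doc_id, words in forward_index.items():
        for position, word in enumerate(words):
            posting = table[word].setdefault(
                doc_id,
                {"weight": weights.get((doc_id, word), 0.0), "positions": []},
            )
            posting["positions"].append(position)
    # Sort the hash table contents before output.
    return {
        word: [(doc_id, p["weight"], p["positions"])
               for doc_id, p in sorted(postings.items())]
        for word, postings in sorted(table.items())
    }

# Hypothetical forward index and precomputed segmented-word weights.
forward_index = {1: ["Tian'anmen", "Square", "Tian'anmen"], 2: ["Square"]}
weights = {(1, "Tian'anmen"): 80.0, (1, "Square"): 40.0, (2, "Square"): 10.0}
index = build_weighted_index(forward_index, weights)
print(index["Square"])      # [(1, 40.0, [1]), (2, 10.0, [0])]
print(index["Tian'anmen"])  # [(1, 80.0, [0, 2])]
```

The number of occurrences mentioned in the description can be read off as the length of each positions list.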
In an optional embodiment of the present invention, Step 105 may also include the following sub-step:
Sub-step S33: merge the one or more document indexes.
In practice, the index is very large: a single database holds tens of millions of documents, and a hash table of that size generally cannot fit in memory.
In the embodiments of the present invention, all web pages of a database may be divided into several sets with consecutive number information (DocID), each small enough to fit in memory. An inverted index is generated separately for each set, and these small inverted indexes are then merged together to obtain one complete inverted index.
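The merge in sub-step S33 can be sketched as a k-way merge of posting lists. The partial-index format (`{word: [(doc_id, weight), ...]}` with each list sorted by DocID), the function name `merge_indexes`, and the use of `heapq.merge` are assumptions for illustration.

```python
import heapq

def merge_indexes(partial_indexes):
    """Merge several small inverted indexes into one.

    Each partial index maps a segmented word to a posting list sorted by
    DocID. Because the document sets have disjoint DocIDs, merging the
    lists by DocID yields one sorted posting list per word.
    """
    grouped = {}
    for partial in partial_indexes:
        for word, postings in partial.items():
            grouped.setdefault(word, []).append(postings)
    return {
        word: list(heapq.merge(*lists, key=lambda posting: posting[0]))
        for word, lists in sorted(grouped.items())
    }

# Hypothetical partial indexes built from two DocID ranges.
part_1 = {"square": [(1, 40.0), (2, 10.0)]}
part_2 = {"square": [(3, 5.0)], "gate": [(4, 7.0)]}
merged = merge_indexes([part_1, part_2])
print(merged["square"])  # [(1, 40.0), (2, 10.0), (3, 5.0)]
print(merged["gate"])    # [(4, 7.0)]
```

This sketch merges in memory for brevity; at the scale described above, the partial indexes would be streamed from disk, which the same merge pattern supports because `heapq.merge` consumes its inputs lazily.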
In the embodiments of the present invention, a total query-word anchor text weight is set for the one or more query-word anchor texts corresponding to a document; a feature weight is configured, according to that total weight, for each segmented word contained in the query-word anchor text; the segmented-word weight of the segmented word relative to the document is determined based on the feature weight; and a document index is generated according to the segmented word and its weight relative to the document. Because the document index records the weight of each segmented word relative to the document, search result items can be ranked and displayed by segmented-word weight when other users subsequently search. This raises the probability of displaying search result items relevant to the user's search and improves search accuracy, which in turn reduces the need to page through search results or re-enter search keywords, makes operation simpler, reduces resource consumption on the search engine and the local system, reduces bandwidth consumption, and improves search efficiency.
Referring to Fig. 2, a flowchart of the steps of Embodiment 2 of a method for generating a document index according to an embodiment of the present invention is shown. The method may specifically include the following steps:
Step 201: extract segmented words from crawled documents.
In practical applications, a search engine can automatically crawl a large number of documents from the network using a web crawler.
A web crawler is also called a web spider (WebSpider). A web spider finds web pages through the link addresses in web pages: starting from some page of a website (usually the home page), it reads the page content, finds the other link addresses in the page, follows those addresses to the next pages, and continues in this way until all pages of the website have been crawled. If the whole Internet is regarded as one website, a web spider can use this principle to crawl all web pages on the Internet.
Current web crawlers can be divided into general-purpose crawlers and focused crawlers. A general-purpose crawler is based on the idea of breadth-first search (BFS): starting from the URLs (Uniform Resource Locators) of one or several initial pages, it obtains the URLs on those pages and, while crawling, continually extracts new URLs from the current page and puts them into a queue until some stop condition of the system is met. A focused crawler is a program that automatically downloads web pages for targeted crawling of relevant page resources. According to a predefined crawl target, it selectively visits web pages and related links on the World Wide Web to obtain the required information. Unlike a general-purpose crawler, a focused crawler does not pursue broad coverage; instead it aims to crawl pages related to a particular topic, preparing data resources for topic-oriented user queries.
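The breadth-first strategy of a general-purpose crawler can be sketched without any network access by passing in a function that returns the links found on a page. The names `crawl_bfs`, `get_links`, and `max_pages`, and the in-memory link graph, are assumptions for illustration; a real crawler would fetch and parse pages instead.

```python
from collections import deque

def crawl_bfs(seed_urls, get_links, max_pages=100):
    """Visit pages breadth-first, starting from the seed URLs.

    get_links(url) returns the URLs found on that page. The stop
    condition here is simply a cap on the number of visited pages.
    """
    queue = deque(seed_urls)
    seen = set(seed_urls)
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:  # enqueue each URL only once
                seen.add(link)
                queue.append(link)
    return visited

# Hypothetical link graph standing in for a small website.
LINKS = {"home": ["a", "b"], "a": ["c"], "b": ["c", "home"], "c": []}
order = crawl_bfs(["home"], lambda url: LINKS.get(url, []))
print(order)  # ['home', 'a', 'b', 'c']
```

A focused crawler could reuse the same loop and filter or prioritize the links by topic relevance before enqueuing them.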
Documents crawled by the crawler can be saved in a database to form a large pool of search resources. In the embodiments of the present invention, segmented words can then be extracted from the crawled documents in the database.
Websites of different natures and categories usually arrange the content of their web page documents differently. However, the basic content of a typical web page includes the title, header, footer, body content, functional areas, navigation area, advertising areas, and so on.
In the embodiments of the present invention, specified fields can be segmented according to configuration and built into a document index for fast access and retrieval. These fields generally include the title, body content (content), site, anchor text, and similar fields.
In an optional embodiment of the present invention, the segmented words may include unigrams (uni-Gram), in which case Step 201 may include the following sub-step:
Sub-step S31: perform word segmentation on the crawled documents to obtain unigrams.
In the embodiments of the present invention, the N-Gram model assumes that the probability of occurrence of the current word depends only on the N-1 words before it; in other words, it predicts the probability of the current word from the preceding N-1 words (a Markov chain).
Commonly used N-Gram models include the uni-Gram (N=1, unigram) and the bi-Gram (N=2, bigram).
The basic words obtained by word segmentation can serve as unigrams. For example, performing word segmentation on the text "People's Republic of China" yields the four unigrams "China", "people", "republicanism", and "state".
Some commonly used word segmentation methods are described below:
1, based on the segmenting method of string matching: the entry referred in the Chinese character string being analysed to according to certain strategy and a preset machine dictionary mates, if finding certain character string in dictionary, then the match is successful (identifying a word).
2, the segmenting method of feature based scanning or mark cutting: refer to and preferentially identify in character string to be analyzed and be syncopated as some words with obvious characteristic, using these words as breakpoint, less string can be divided into enter mechanical Chinese word segmentation more former character string, thus reducing the error rate of coupling;Or participle and part-of-speech tagging are combined, utilizes abundant grammatical category information that participle decision-making is offered help, and in turn word segmentation result tested again in annotation process, adjust, thus improving the accuracy rate of cutting.
3, based on the segmenting method understood: refer to the understanding by making computer mould personification distich, reach to identify the effect of word.Its basic thought is exactly carry out syntax, semantic analysis while participle, utilizes syntactic information and semantic information to process Ambiguity.It generally includes three parts: participle subsystem, syntactic-semantic subsystem, master control part.Under the coordination of master control part, participle subsystem can obtain the syntax and semantic information about word, sentence etc. and segmentation ambiguity is judged, namely it simulates people's understanding process to sentence.This segmenting method needs to use substantial amounts of linguistry and information.
4, the segmenting method of Corpus--based Method: refer to, owing to the frequency of word co-occurrence adjacent with word or probability can reflect into the credibility of word preferably in Chinese information, so the frequency of each combinatorics on words of co-occurrence adjacent in language material can be added up, calculate their information that appears alternatively, and calculate the adjacent co-occurrence probabilities of two Chinese characters X, Y.The information of appearing alternatively can embody the tightness degree of marriage relation between Chinese character.When tightness degree is higher than some threshold value, just it is believed that this word group is likely to constitute a word.Word group frequency in language material is added up by this method, it is not necessary to cutting dictionary.
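As an illustration of the string-matching approach in item 1, the following is a minimal sketch of forward maximum matching, one common matching strategy. The text does not say which strategy is used, so the strategy, the function name `forward_max_match`, the `max_len` parameter, and the toy Latin-letter dictionary are all assumptions made for the example.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedily take the longest dictionary entry starting at each position.

    Hypothetical helper: the matching strategy and max_len are assumptions,
    not details taken from the patent text.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the dictionary matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the scan can advance.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


toy_dictionary = {"ab", "abc", "de"}
segments = forward_max_match("abcdef", toy_dictionary)
```

Here "abc" is preferred over "ab" because longer entries are tried first, and the unmatched "f" falls back to a single-character token.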
In an optional embodiment of the present invention, the segmented words may also include bigrams (bi-Gram), in which case step 201 may further include the following sub-step:
Sub-step S32, combine adjacent unigrams in pairs to obtain bigrams.
In embodiments of the present invention, for a guide word combination obtained by word segmentation, two adjacent unigrams (uni-Gram) in the guide word combination may be combined to obtain a bigram (bi-Gram).
For example, word segmentation of the text "People's Republic of China" yields four unigrams (uni-Gram): "China", "people", "republicanism", and "state". Combining each unigram with its right-hand neighbor yields three bigrams (bi-Gram): "China people", "people republicanism", and "republicanism state" (that is, "republic").
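The pairwise combination of adjacent unigrams can be sketched as follows. The function name `adjacent_bigrams` is hypothetical, and the tokens are the English glosses used in the example above rather than the original Chinese words.

```python
def adjacent_bigrams(unigrams):
    """Pair each unigram with its right-hand neighbour, preserving order."""
    return [(left, right) for left, right in zip(unigrams, unigrams[1:])]


unigrams = ["China", "people", "republicanism", "state"]
bigrams = adjacent_bigrams(unigrams)
```

Four unigrams produce three bigrams, since only neighbouring pairs are combined, and a single unigram produces none.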
In addition to extracting bigrams (bi-Gram) from guide word combinations, bigrams (bi-Gram) may also be generated from adjacent non-guide word combinations; the embodiments of the present invention place no limitation on this.
Of course, the above extraction methods are only examples. When implementing the embodiments of the present invention, other extraction methods may be set according to the actual situation, for example trigrams (tri-Gram, N=3, triples), and those skilled in the art may adopt other extraction methods according to actual needs; the embodiments of the present invention place no limitation on this either.
Step 202, obtain one or more query-word anchor texts corresponding to a document;
Step 203, set a total weight for each query-word anchor text;
Step 204, assign feature weights to the segmented words contained in the query-word anchor text according to its total weight;
Step 205, determine the weight of each segmented word relative to the document based on the feature weights;
Step 206, generate a document index from the segmented words and their weights relative to the document;
Step 207, store the document index in a database.
In embodiments of the present invention, once the data of the document index has been generated, the data may be organized and written to a database on disk.
In an optional embodiment of the present invention, step 207 may include the following sub-steps:
Sub-step S41, store the target number information and the target segmented-word weights in files separate from the target position information;
Here, the target number information may be number information whose access frequency exceeds a preset first frequency threshold, the target segmented-word weights may be segmented-word weights whose access frequency exceeds a preset second frequency threshold, and the target position information may be position information whose access frequency is below a preset third frequency threshold.
Sub-step S42, combine one or more documents into one or more data blocks;
Sub-step S43, within each data block, compress at least one of the number information, segmented-word weights, and position information belonging to that data block.
The data in a document index may include number information (DocID), segmented-word weights, position information, and so on.
Factors to consider when storing the index may include:
a. Saving space: the relevant data may be compressed;
b. Retrieval performance: when a segmented word is retrieved, the amount of I/O read should be as small as possible, so frequently accessed data may be stored together, frequently and infrequently accessed data may be stored separately, and compressed data should decompress as quickly as possible.
In embodiments of the present invention, the index may be stored as follows:
1. The frequently accessed number information (DocID) and segmented-word weights are stored in files separate from the infrequently accessed position information;
2. Data is stored in blocks: every n documents (Doc), where n is a positive integer, form one data block (block);
3. Within each data block (block), data such as the number information (DocID), segmented-word weights, and position information is compressed, for example with the pForDelta algorithm.
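Block storage and per-block compression (items 2 and 3) can be sketched as below. This is not pForDelta: for brevity the example swaps in plain delta encoding followed by variable-byte (varint) encoding, which shares the idea of storing small gaps between sorted DocIDs instead of the DocIDs themselves. All function names are hypothetical.

```python
def split_into_blocks(doc_ids, n):
    """Group sorted DocIDs into blocks of at most n documents each."""
    return [doc_ids[i:i + n] for i in range(0, len(doc_ids), n)]


def encode_varint(value):
    """Encode a non-negative integer as little-endian base-128 bytes."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def compress_block(doc_ids):
    """Delta-encode the sorted DocIDs of one block, then varint-encode each gap."""
    out = bytearray()
    previous = 0
    for doc_id in doc_ids:
        out += encode_varint(doc_id - previous)
        previous = doc_id
    return bytes(out)


def decompress_block(data):
    """Invert compress_block and return the original DocID list."""
    doc_ids, current, shift, previous = [], 0, 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7  # continuation byte: keep accumulating this gap
        else:
            previous += current  # gap complete: add it to the running DocID
            doc_ids.append(previous)
            current, shift = 0, 0
    return doc_ids


blocks = split_into_blocks([3, 130, 131, 1000, 1001], 2)
compressed = [compress_block(block) for block in blocks]
restored = [decompress_block(data) for data in compressed]
```

Because each block is compressed on its own, a reader can decompress only the blocks it needs, which is in line with the retrieval-performance factor listed above.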
The embodiments of the present invention store the target number information and the target segmented-word weights in files separate from the target position information, combine one or more documents into one or more data blocks, and, within each data block, compress at least one of the number information, segmented-word weights, and position information belonging to that block. This saves storage space on the one hand and, on the other, preserves retrieval performance and therefore search efficiency.
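The separation of frequently and infrequently accessed data might look like the sketch below. It assumes each index entry is a `(doc_id, weight, positions)` tuple, a layout chosen for illustration only, and it returns two parallel lists that a caller could write to two different files. The function name `split_postings` is hypothetical.

```python
def split_postings(postings):
    """Split entries into a frequently read part and a rarely read part.

    postings: list of (doc_id, weight, positions) tuples for one segmented word.
    Returns (hot, cold); entry i of each list describes the same document.
    """
    hot = [(doc_id, weight) for doc_id, weight, _ in postings]
    cold = [list(positions) for _, _, positions in postings]
    return hot, cold


hot, cold = split_postings([(1, 0.5, [4]), (2, 1.0, [0, 7])])
```

A query that only needs DocIDs and weights can then read the first file and skip the position data entirely.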
For simplicity of description, the method embodiments are expressed as a series of combined actions, but those skilled in the art should know that the embodiments of the present invention are not limited by the described order of actions, because according to the embodiments of the present invention some steps may be performed in other orders or simultaneously. Furthermore, those skilled in the art should also know that the embodiments described in this specification are preferred embodiments, and the actions involved are not necessarily required by the embodiments of the present invention.
Referring to Fig. 3, a structural block diagram of an embodiment of a document index generation device according to one embodiment of the present invention is shown. The device may specifically include the following modules:
An acquisition module 301, adapted to obtain one or more query-word anchor texts corresponding to a document;
A setting module 302, adapted to set a total weight for each query-word anchor text;
A configuration module 303, adapted to assign feature weights to the segmented words contained in the query-word anchor text according to its total weight;
A determination module 304, adapted to determine the weight of each segmented word relative to the document based on the feature weights;
A generation module 305, adapted to generate a document index from the segmented words and their weights relative to the document.
In an optional embodiment of the present invention, the device may further include the following module:
An extraction module, adapted to extract segmented words from a crawled document.
In an optional embodiment of the present invention, the segmented words may include unigrams, and the extraction module may be further adapted to:
Perform word segmentation on the crawled document to obtain unigrams.
In an optional embodiment of the present invention, the segmented words may also include bigrams, and the extraction module may be further adapted to:
Combine adjacent unigrams in pairs to obtain bigrams.
In an optional embodiment of the present invention, the configuration module 303 may be further adapted to:
When the query-word anchor text contains one segmented word, assign the total weight of the query-word anchor text to that segmented word.
In an optional embodiment of the present invention, the configuration module 303 may be further adapted to:
When the query-word anchor text contains multiple segmented words, distribute the total weight of the query-word anchor text evenly among them.
In an optional embodiment of the present invention, the determination module 304 may be further adapted to:
Calculate the sum of the feature weights of identical segmented words within the document to obtain the weight of that segmented word relative to the document.
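The weighting rules handled by the configuration and determination modules can be sketched as follows: the total weight of one anchor text is shared equally among its segmented words ("participles"), and the shares of identical words are then summed over all anchor texts of a document. The function names and the `(tokens, total_weight)` input layout are assumptions for illustration, and how the total weight itself is set is outside this sketch.

```python
from collections import defaultdict


def feature_weights(tokens, total_weight):
    """Split one anchor text's total weight evenly across its tokens."""
    if not tokens:
        return []
    share = total_weight / len(tokens)  # a single token receives the full weight
    return [(token, share) for token in tokens]


def token_weights(anchor_texts):
    """Sum the feature weights of identical tokens over a document's anchor texts.

    anchor_texts: iterable of (tokens, total_weight) pairs (hypothetical layout).
    """
    weights = defaultdict(float)
    for tokens, total_weight in anchor_texts:
        for token, share in feature_weights(tokens, total_weight):
            weights[token] += share
    return dict(weights)


doc_weights = token_weights([(["a"], 2.0), (["a", "b"], 4.0)])
```

In this example the word "a" receives 2.0 from the first anchor text and 2.0 from the second, while "b" receives 2.0 from the second only.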
In an optional embodiment of the present invention, the document may have number information, and the generation module 305 may be further adapted to:
Set the segmented word as a key in one or more index tables;
Set the number information of the document, the segmented-word weight, and the position information of the segmented word in the document as the value corresponding to the key, thereby obtaining one or more document indexes.
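A minimal sketch of such an index table for a single document is shown below, with the segmented word as the key and a list of `(doc_id, weight, positions)` entries as the value. The tuple layout, the function name `build_index`, and the default weight of 0.0 for words without an anchor-text weight are assumptions.

```python
def build_index(doc_id, token_positions, token_weights):
    """Build a table mapping each token to [(doc_id, weight, positions)].

    token_positions: dict of token -> list of positions in the document.
    token_weights: dict of token -> weight relative to the document.
    """
    index = {}
    for token, positions in token_positions.items():
        weight = token_weights.get(token, 0.0)  # assumed default when no weight is known
        index[token] = [(doc_id, weight, sorted(positions))]
    return index


index = build_index(7, {"a": [3, 1], "b": [2]}, {"a": 1.5})
```

Keying the table by word means a query word can be looked up directly to find the documents, weights, and positions associated with it.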
In an optional embodiment of the present invention, the generation module 305 may be further adapted to:
Merge the one or more document indexes.
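Merging might be sketched as below, assuming each document index is a dictionary from a segmented word to a list of `(doc_id, weight, positions)` entries. The text does not say how merged entries are ordered; sorting by DocID is an assumption made here because sorted DocIDs are convenient for later compression. The function name `merge_indexes` is hypothetical.

```python
def merge_indexes(indexes):
    """Merge per-document index tables into one table, ordering entries by DocID."""
    merged = {}
    for index in indexes:
        for token, postings in index.items():
            # Copy entries into a fresh list so the input tables are not modified.
            merged.setdefault(token, []).extend(postings)
    for postings in merged.values():
        postings.sort(key=lambda posting: posting[0])
    return merged


merged = merge_indexes([
    {"a": [(2, 1.0, [0])]},
    {"a": [(1, 0.5, [4])], "b": [(1, 2.0, [1])]},
])
```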
In an optional embodiment of the present invention, the device may further include the following module:
A storage module, adapted to store the document index in a database.
In an optional embodiment of the present invention, the storage module may be further adapted to:
Store the target number information and the target segmented-word weights in files separate from the target position information;
The target number information is number information whose access frequency exceeds a preset first frequency threshold, the target segmented-word weights are segmented-word weights whose access frequency exceeds a preset second frequency threshold, and the target position information is position information whose access frequency is below a preset third frequency threshold.
In an optional embodiment of the present invention, the storage module may be further adapted to:
Combine one or more documents into one or more data blocks;
Within each data block, compress at least one of the number information, segmented-word weights, and position information belonging to that data block.
Because the device embodiment is substantially similar to the method embodiment, it is described relatively briefly; for related details, refer to the description of the method embodiment.
The algorithms and displays provided herein are not inherently related to any particular computer, virtual system, or other device. Various general-purpose systems may also be used with the teachings herein. From the description above, the structure required to construct such a system is apparent. Moreover, the present invention is not directed to any particular programming language. It should be understood that the content of the invention described herein may be implemented in a variety of programming languages, and that the description above of a specific language is given in order to disclose the best mode of the invention.
Numerous specific details are set forth in the description provided herein. It should be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques are not shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid understanding of one or more of the various inventive aspects, the features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments of the invention. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, the inventive aspects lie in fewer than all features of a single embodiment disclosed above. The claims following the detailed description are therefore expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will appreciate that the modules of the device in an embodiment may be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units, or components of an embodiment may be combined into one module, unit, or component, and may furthermore be divided into multiple sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.
Furthermore, those skilled in the art will understand that although some embodiments described herein include certain features that are included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination of the two. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the document index generation device according to embodiments of the invention. The invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the invention may be stored on a computer-readable medium or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art may design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices may be embodied by one and the same item of hardware. The use of the words first, second, and third does not indicate any order. These words may be interpreted as names.

Claims (10)

1. A method for generating a document index, including:
Obtaining one or more query-word anchor texts corresponding to a document;
Setting a total weight for each query-word anchor text;
Assigning feature weights to the segmented words contained in the query-word anchor text according to its total weight;
Determining the weight of each segmented word relative to the document based on the feature weights;
Generating a document index from the segmented words and their weights relative to the document.
2. The method of claim 1, characterized by further including:
Extracting segmented words from a crawled document.
3. The method of any one of claims 1-2, characterized in that the segmented words include unigrams, and the step of extracting segmented words from the crawled document includes:
Performing word segmentation on the crawled document to obtain unigrams.
4. The method of any one of claims 1-3, characterized in that the segmented words also include bigrams, and the step of extracting segmented words from the crawled document further includes:
Combining adjacent unigrams in pairs to obtain bigrams.
5. The method of claim 1, 2, 3, or 4, characterized in that the step of assigning feature weights to the segmented words contained in the query-word anchor text according to its total weight includes:
When the query-word anchor text contains one segmented word, assigning the total weight of the query-word anchor text to that segmented word.
6. A device for generating a document index, including:
An acquisition module, adapted to obtain one or more query-word anchor texts corresponding to a document;
A setting module, adapted to set a total weight for each query-word anchor text;
A configuration module, adapted to assign feature weights to the segmented words contained in the query-word anchor text according to its total weight;
A determination module, adapted to determine the weight of each segmented word relative to the document based on the feature weights;
A generation module, adapted to generate a document index from the segmented words and their weights relative to the document.
7. The device of claim 6, characterized by further including:
An extraction module, adapted to extract segmented words from a crawled document.
8. The device of any one of claims 6-7, characterized in that the segmented words include unigrams, and the extraction module is further adapted to:
Perform word segmentation on the crawled document to obtain unigrams.
9. The device of any one of claims 6-8, characterized in that the segmented words also include bigrams, and the extraction module is further adapted to:
Combine adjacent unigrams in pairs to obtain bigrams.
10. The device of any one of claims 6-9, characterized in that the configuration module is further adapted to:
When the query-word anchor text contains one segmented word, assign the total weight of the query-word anchor text to that segmented word.
CN201410854769.XA 2014-12-31 2014-12-31 Generation method and device of document index Pending CN105808607A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410854769.XA CN105808607A (en) 2014-12-31 2014-12-31 Generation method and device of document index

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410854769.XA CN105808607A (en) 2014-12-31 2014-12-31 Generation method and device of document index

Publications (1)

Publication Number Publication Date
CN105808607A true CN105808607A (en) 2016-07-27

Family

ID=56465259

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410854769.XA Pending CN105808607A (en) 2014-12-31 2014-12-31 Generation method and device of document index

Country Status (1)

Country Link
CN (1) CN105808607A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108416026A (en) * 2018-03-09 2018-08-17 腾讯科技(深圳)有限公司 Index generation method, content search method, device and equipment
CN112883294A (en) * 2019-11-29 2021-06-01 北京搜狗科技发展有限公司 Data processing method, device and medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1916905A (en) * 2006-09-04 2007-02-21 北京航空航天大学 Method for carrying out retrieval hint based on inverted list
US20070271268A1 (en) * 2004-01-26 2007-11-22 International Business Machines Corporation Architecture for an indexer
US7308643B1 (en) * 2003-07-03 2007-12-11 Google Inc. Anchor tag indexing in a web crawler system
CN103294681A (en) * 2012-02-23 2013-09-11 北京百度网讯科技有限公司 Method and device for generating search result
CN103593460A (en) * 2013-11-25 2014-02-19 方正国际软件有限公司 Data hierarchical storage system and data hierarchical storage method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160727
